[ 
https://issues.apache.org/jira/browse/LUCENE-8610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721664#comment-16721664
 ] 

Michael Gibney edited comment on LUCENE-8610 at 12/15/18 6:57 AM:
------------------------------------------------------------------

[^LUCENE-8610.patch] adds a test to illustrate the issue, and a minor change to 
delay call to {{invertState.setAttributeSource(...)}} until after 
{{stream.incrementToken()}} first returns {{true}}.

It occurs to me that this might raise other issues about the way stream 
Attributes are cached in {{invertState}}, but in any case this patch fixes the 
only real-world use case I'm currently aware of (manifesting for 
{{solr.PreAnalyzedField$PreAnalyzedTokenizer}} instances for empty fields 
(i.e., no tokens) [*EDIT: PreAnalyzedTokenizer instances are _not_ reused at 
index time, so this is a bad example*]), and I it doesn't seem likely to be 
otherwise problematic.

Briefly, the other issues around caching that this raises:
 1. Is there official guidance from the TokenStream API regarding whether it's 
acceptable to lazily instantiate token Attributes?
 2. If ok to lazily instantiate, how about lazily instantiating _after_ the 
initial call to {{incrementToken()}} returns true? (I'd guess not?)


was (Author: mgibney):
[^LUCENE-8610.patch] adds a test to illustrate the issue, and a minor change to 
delay call to {{invertState.setAttributeSource(...)}} until after 
{{stream.incrementToken()}} first returns {{true}}. 

It occurs to me that this might raise other issues about the way stream 
Attributes are cached in {{invertState}}, but in any case this patch fixes the 
only real-world use case I'm currently aware of (manifesting for 
{{solr.PreAnalyzedField$PreAnalyzedTokenizer}} instances for empty fields 
(i.e., no tokens), and I it doesn't seem likely to be otherwise problematic.

Briefly, the other issues around caching that this raises:
1. Is there official guidance from the TokenStream API regarding whether it's 
acceptable to lazily instantiate token Attributes?
2. If ok to lazily instantiate, how about lazily instantiating _after_ the 
initial call to {{incrementToken()}} returns true? (I'd guess not?)
3. {{solr.PreAnalyzedField$PreAnalyzedTokenizer}} uses 
{{removeAllAttributes()}}, so caching is going to be a problem there under any 
circumstances, right? I'm going to dig into this and possibly submit a separate 
Solr issue for that one ...

> NPE in TermsHashPerField.add() for TokenStreams with lazily instantiated 
> token Attributes
> -----------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8610
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8610
>             Project: Lucene - Core
>          Issue Type: Wish
>          Components: core/index
>    Affects Versions: 7.4, master (8.0)
>            Reporter: Michael Gibney
>            Priority: Minor
>         Attachments: LUCENE-8610.patch
>
>
> {{DefaultIndexingChain.invert(...)}} calls 
> {{invertState.setAttributeSource(stream)}} before {{stream.incrementToken()}} 
> is called.
> For instances of {{stream}} that lazily instantiate token attributes (e.g., 
> as {{solr.PreAnalyzedField$PreAnalyzedTokenizer}} does upon the first call to 
> {{incrementToken()}} that returns {{true}}), this can result in caching a 
> {{null}} value in {{invertState.termAttribute}} for a given {{stream}} 
> instance.
> Subsequent calls that reuse the same {{stream}} instance (reusing 
> {{TokenStreamComponents}}) for field values with at least 1 token will call 
> {{termHashPerField.start(...)}} which sets {{termsHashPerField.termAtt}} from 
> the {{null}} value cached in the {{FieldInvertState.termAttribute}}. An NPE 
> would be thrown when {{termsHashPerField.add()}} reasonably but incorrectly 
> assumes a non-null value for {{termAtt}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to