[ https://issues.apache.org/jira/browse/LUCENE-8610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721664#comment-16721664 ]
Michael Gibney edited comment on LUCENE-8610 at 12/15/18 6:57 AM: ------------------------------------------------------------------ [^LUCENE-8610.patch] adds a test to illustrate the issue, and a minor change to delay call to {{invertState.setAttributeSource(...)}} until after {{stream.incrementToken()}} first returns {{true}}. It occurs to me that this might raise other issues about the way stream Attributes are cached in {{invertState}}, but in any case this patch fixes the only real-world use case I'm currently aware of (manifesting for {{solr.PreAnalyzedField$PreAnalyzedTokenizer}} instances for empty fields (i.e., no tokens) [*EDIT: PreAnalyzedTokenizer instances are _not_ reused at index time, so this is a bad example*]), and I it doesn't seem likely to be otherwise problematic. Briefly, the other issues around caching that this raises: 1. Is there official guidance from the TokenStream API regarding whether it's acceptable to lazily instantiate token Attributes? 2. If ok to lazily instantiate, how about lazily instantiating _after_ the initial call to {{incrementToken()}} returns true? (I'd guess not?) was (Author: mgibney): [^LUCENE-8610.patch] adds a test to illustrate the issue, and a minor change to delay call to {{invertState.setAttributeSource(...)}} until after {{stream.incrementToken()}} first returns {{true}}. It occurs to me that this might raise other issues about the way stream Attributes are cached in {{invertState}}, but in any case this patch fixes the only real-world use case I'm currently aware of (manifesting for {{solr.PreAnalyzedField$PreAnalyzedTokenizer}} instances for empty fields (i.e., no tokens), and I it doesn't seem likely to be otherwise problematic. Briefly, the other issues around caching that this raises: 1. Is there official guidance from the TokenStream API regarding whether it's acceptable to lazily instantiate token Attributes? 2. If ok to lazily instantiate, how about lazily instantiating _after_ the initial call to {{incrementToken()}} returns true? (I'd guess not?) 3. {{solr.PreAnalyzedField$PreAnalyzedTokenizer}} uses {{removeAllAttributes()}}, so caching is going to be a problem there under any circumstances, right? I'm going to dig into this and possibly submit a separate Solr issue for that one ... > NPE in TermsHashPerField.add() for TokenStreams with lazily instantiated > token Attributes > ----------------------------------------------------------------------------------------- > > Key: LUCENE-8610 > URL: https://issues.apache.org/jira/browse/LUCENE-8610 > Project: Lucene - Core > Issue Type: Wish > Components: core/index > Affects Versions: 7.4, master (8.0) > Reporter: Michael Gibney > Priority: Minor > Attachments: LUCENE-8610.patch > > > {{DefaultIndexingChain.invert(...)}} calls > {{invertState.setAttributeSource(stream)}} before {{stream.incrementToken()}} > is called. > For instances of {{stream}} that lazily instantiate token attributes (e.g., > as {{solr.PreAnalyzedField$PreAnalyzedTokenizer}} does upon the first call to > {{incrementToken()}} that returns {{true}}), this can result in caching a > {{null}} value in {{invertState.termAttribute}} for a given {{stream}} > instance. > Subsequent calls that reuse the same {{stream}} instance (reusing > {{TokenStreamComponents}}) for field values with at least 1 token will call > {{termHashPerField.start(...)}} which sets {{termsHashPerField.termAtt}} from > the {{null}} value cached in the {{FieldInvertState.termAttribute}}. An NPE > would be thrown when {{termsHashPerField.add()}} reasonably but incorrectly > assumes a non-null value for {{termAtt}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org