[ 
https://issues.apache.org/jira/browse/LUCENE-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491312#comment-16491312
 ] 

David Smiley commented on LUCENE-8332:
--------------------------------------

See GitHub PR.  comments welcome!

* Made it a TokenFilter.
* Move some package accessible constants from CompletionAnalyzer to public ones 
on the moved TokenStream class.
* Renamed SEP_LABEL => SEP_CHAR and made it of type char, not an int. This 
seems more natural/clear to me and it removed some now needless casting in 
tests. I'm tempted to make this char configurable but I suppose it's value 
doesn't really matter.
* Changed position increment handling of this filter to be stacked instead of 
consecutive. I think this just makes sense for what's going on. This affected 
some test assertions but otherwise it shouldn't matter for its practical uses. 
Changing this also revealed a bug...
* clearAttributes() was being called before toAutomaton() consumed the input 
but this left the attribute state dirty; a test failed on me relating to offset 
ordering for this reason. I moved toAutomaton to occur in reset() and now 
that's a non-issue. This is also a slight optimization since clearAttributes() 
need not be called if finiteStrings.next() returns false (thus 
clearAttributes() was being called one too many times per stream).
* added captureState and restoreState for end(). Though I do wonder if we 
actually need to anything in end(); based on my reading of 
TokenStreamToAutomaton end of stream calculations, the automaton contains the 
final token state already; so perhaps end() ought to do nothing? When I tried 
that, the state machine of the tests complained I didn't propagate end() -- but 
this didn't account for the fact that end() was in fact called earlier via 
reset() since the tokenStream is consumed there. _Any way_, at least what's in 
this patch is no worse than what was before, I think.
* in reset(), removed needless hasAttribute() guard around getAttribute()
* There are nocommits surrounding about the configuration aspects of the filter 
being public so that 
org.apache.lucene.search.suggest.document.ContextSuggestField#wrapTokenStream 
can access it. IMO this seems quirky... it'd be nice to remove the need to 
access it. But I suppose just adding some getters to the filter is innocent 
enough.
* The setPayload feature of this filter is debatable. Arguably this ought to be 
done by a standalone TokenFilter or be done in some internal way to the NRT 
suggester. I marked the setter lucene.internal.
* added testWithStopword which found a bug...
* Bug: TokenStreamToAutomaton does not handle trailing stop words correctly 
when preservePositionIncrements==false

> New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)
> -------------------------------------------------------------------
>
>                 Key: LUCENE-8332
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8332
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: David Smiley
>            Assignee: David Smiley
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Lets move and rename the CompletionTokenStream in the suggest module into the 
> analysis module renamed as ConcatenateGraphTokenStream. See comments in 
> LUCENE-8323 leading to this idea. Such a TokenStream (or TokenFilter?) has 
> several uses:
>  * for the suggest module
>  * by the SolrTextTagger for NER/ERD use cases – SOLR-12376
>  * for doing complete match search efficiently
> It will need a factory – a TokenFilterFactory, even though we don't have a 
> TokenFilter based subclass of TokenStream.
> It appears there is no back-compat concern in it suddenly disappearing from 
> the suggest module as it's marked experimental and it only seems to be public 
> now perhaps due to some technicality (it has package level constructors).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to