[
https://issues.apache.org/jira/browse/LUCENE-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16491312#comment-16491312
]
David Smiley commented on LUCENE-8332:
--------------------------------------
See GitHub PR. comments welcome!
* Made it a TokenFilter.
* Move some package accessible constants from CompletionAnalyzer to public ones
on the moved TokenStream class.
* Renamed SEP_LABEL => SEP_CHAR and made it of type char, not an int. This
seems more natural/clear to me and it removed some now needless casting in
tests. I'm tempted to make this char configurable but I suppose it's value
doesn't really matter.
* Changed position increment handling of this filter to be stacked instead of
consecutive. I think this just makes sense for what's going on. This affected
some test assertions but otherwise it shouldn't matter for its practical uses.
Changing this also revealed a bug...
* clearAttributes() was being called before toAutomaton() consumed the input
but this left the attribute state dirty; a test failed on me relating to offset
ordering for this reason. I moved toAutomaton to occur in reset() and now
that's a non-issue. This is also a slight optimization since clearAttributes()
need not be called if finiteStrings.next() returns false (thus
clearAttributes() was being called one too many times per stream).
* added captureState and restoreState for end(). Though I do wonder if we
actually need to anything in end(); based on my reading of
TokenStreamToAutomaton end of stream calculations, the automaton contains the
final token state already; so perhaps end() ought to do nothing? When I tried
that, the state machine of the tests complained I didn't propagate end() -- but
this didn't account for the fact that end() was in fact called earlier via
reset() since the tokenStream is consumed there. _Any way_, at least what's in
this patch is no worse than what was before, I think.
* in reset(), removed needless hasAttribute() guard around getAttribute()
* There are nocommits surrounding about the configuration aspects of the filter
being public so that
org.apache.lucene.search.suggest.document.ContextSuggestField#wrapTokenStream
can access it. IMO this seems quirky... it'd be nice to remove the need to
access it. But I suppose just adding some getters to the filter is innocent
enough.
* The setPayload feature of this filter is debatable. Arguably this ought to be
done by a standalone TokenFilter or be done in some internal way to the NRT
suggester. I marked the setter lucene.internal.
* added testWithStopword which found a bug...
* Bug: TokenStreamToAutomaton does not handle trailing stop words correctly
when preservePositionIncrements==false
> New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)
> -------------------------------------------------------------------
>
> Key: LUCENE-8332
> URL: https://issues.apache.org/jira/browse/LUCENE-8332
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: David Smiley
> Assignee: David Smiley
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Lets move and rename the CompletionTokenStream in the suggest module into the
> analysis module renamed as ConcatenateGraphTokenStream. See comments in
> LUCENE-8323 leading to this idea. Such a TokenStream (or TokenFilter?) has
> several uses:
> * for the suggest module
> * by the SolrTextTagger for NER/ERD use cases – SOLR-12376
> * for doing complete match search efficiently
> It will need a factory – a TokenFilterFactory, even though we don't have a
> TokenFilter based subclass of TokenStream.
> It appears there is no back-compat concern in it suddenly disappearing from
> the suggest module as it's marked experimental and it only seems to be public
> now perhaps due to some technicality (it has package level constructors).
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]