[
https://issues.apache.org/jira/browse/LUCENE-8323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16489956#comment-16489956
]
Jim Ferenczi commented on LUCENE-8323:
--------------------------------------
{quote}
I think it should be made a TokenFilter so that it can be used easily with,
say, CustomAnalyzer.
{quote}
It's a TokenStream because it consumes the entire input in the first call to
incrementToken (which invokes input.reset(), input.end(), input.close()). That's
fine, though, because TokenFilterFactory returns a TokenStream, so you can already
use it in a CustomAnalyzer or in an AnalyzerWrapper like the CompletionAnalyzer
does. Also note that there is already a test based on BaseTokenStreamTestCase:
CompletionTokenStreamTest.
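To illustrate the "consume everything on the first call" pattern described above, here is a minimal sketch in plain Java. It deliberately does not use Lucene's classes: SimpleTokenSource, ListTokenSource and EagerConcatSource are hypothetical stand-ins invented for this example, and joining the tokens with a separator is a simplification of what the real stream produces.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Simplified stand-in for a token source; this is NOT Lucene's TokenStream API.
interface SimpleTokenSource {
    void reset();
    String next();   // returns null when the source is exhausted
    void end();
    void close();
}

// Hypothetical list-backed source that records its lifecycle calls.
class ListTokenSource implements SimpleTokenSource {
    private final List<String> tokens;
    private Iterator<String> it;
    final List<String> calls = new ArrayList<>();

    ListTokenSource(List<String> tokens) { this.tokens = tokens; }

    public void reset() { calls.add("reset"); it = tokens.iterator(); }
    public String next() { return it.hasNext() ? it.next() : null; }
    public void end() { calls.add("end"); }
    public void close() { calls.add("close"); }
}

// Drains the wrapped source on the first call to next() and drives the
// source's whole lifecycle (reset, end, close) itself.
class EagerConcatSource {
    private final SimpleTokenSource input;
    private final String separator;
    private boolean done = false;

    EagerConcatSource(SimpleTokenSource input, String separator) {
        this.input = input;
        this.separator = separator;
    }

    String next() {
        if (done) {
            return null; // the single result was already emitted
        }
        done = true;
        StringBuilder sb = new StringBuilder();
        boolean first = true;
        input.reset();
        try {
            for (String t = input.next(); t != null; t = input.next()) {
                if (!first) {
                    sb.append(separator);
                }
                sb.append(t);
                first = false;
            }
            input.end();
        } finally {
            input.close();
        }
        return sb.toString();
    }
}
```

Because the wrapper calls reset, end and close on its input by itself, it does not fit a filter model in which the caller's lifecycle calls are simply forwarded to the wrapped stream, which is the point made in the reply above.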
{quote}
What name? CompletionGraphTokenFilter maybe but the word "Completion" is tied
too much to its original use-case. Maybe ConcatenateGraphTokenFilter or
shorter ConcatGraphTokenFilter? FiniteStringsGraphTokenFilter is another idea
though its name seems very non-obvious to all but internal Lucene devs.
{quote}
+1 to ConcatenateGraphTokenStream
{quote}
It appears TokenStreamToAutomaton (used by CompletionTokenStream) and
GraphTokenStreamFiniteStrings.build is a duplicated algorithm... they could be
reused maybe. But I didn't look closely to see.
{quote}
I guess they could share some logic but that's a different issue ;)
> New ConcatenateFilter, a TokenFilter to concat/join tokens
> ----------------------------------------------------------
>
> Key: LUCENE-8323
> URL: https://issues.apache.org/jira/browse/LUCENE-8323
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: David Smiley
> Assignee: David Smiley
> Priority: Major
> Attachments: LUCENE-8323.patch
>
>
> Here I introduce the ConcatenateFilter (with Factory) to concatenate/join
> tokens with a provided separator to produce one final token. It's similar to
> FingerprintFilter but doesn't deduplicate or sort. It's useful for doing
> exact-ish search on short text (think names or titles) with simple analysis.
> At this task, it's faster than a PhraseQuery equivalent, and solves the issue
> of matching completely and not a portion of the tokens. It's also useful for
> using Lucene to hold a dictionary of short names/phrases for
> entity-extraction (aka text tagging). The OpenSextant SolrTextTagger uses it
> for this purpose, which is where I'm taking it from.
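
The difference the description draws between plain concatenation and a fingerprint-style join can be sketched in plain Java as follows. This is only an illustration under stated assumptions: ConcatVsFingerprint and its two methods are hypothetical names, the code works on a list of strings rather than on Lucene token streams, and the fingerprint variant simply uses natural String ordering.

```java
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of Lucene or of the patch.
class ConcatVsFingerprint {

    // Joins tokens in their original order, keeping duplicates.
    static String concatenate(List<String> tokens, String separator) {
        return String.join(separator, tokens);
    }

    // Sorts and deduplicates before joining (fingerprint-style behaviour).
    static String fingerprint(List<String> tokens, String separator) {
        return String.join(separator, new TreeSet<>(tokens));
    }
}
```

For the tokens "the", "cat", "the", "hat" with a space separator, concatenate keeps order and duplicates ("the cat the hat"), while fingerprint yields "cat hat the". Indexing the concatenated form as a single term is what lets a whole short name or title be matched with one term lookup rather than a positional phrase match.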