[
https://issues.apache.org/jira/browse/LUCENE-8323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16489618#comment-16489618
]
David Smiley commented on LUCENE-8323:
--------------------------------------
Thanks for the review Adrien.
bq. Did you use those license headers on purpose for new files? They don't look
like the usual ones that we use.
It's deliberate; this comes from another project, remember (see my 1st
comment).
I agree with all your other points but it may be moot now based on my
discovery of CompletionTokenStream...
bq. You can also check the CompletionTokenStream in the suggest package. It
does exactly what you want and it's already a TokenStream so maybe it can be
renamed and moved to the analysis module ?
Wow thanks Jim; this is exactly what I'm looking for!
+1 to move/rename CompletionTokenStream for broader use.
I think it should be made a TokenFilter so that it can be used easily with,
say, CustomAnalyzer. I did this as a quick hack and it's mostly okay. I had
to debug some TokenStream lifecycle issues, though that wasn't so much
because it's a Filter as because it's now getting tested in a more hardened
way thanks to BaseTokenStreamTestCase.
What name? CompletionGraphTokenFilter maybe, but the word "Completion" is tied
too much to its original use-case. Maybe ConcatenateGraphTokenFilter or
shorter ConcatGraphTokenFilter? FiniteStringsGraphTokenFilter is another idea,
though its name seems very non-obvious to all but internal Lucene devs.
I think we should add "@see" references between GraphTokenStreamFiniteStrings
and CompletionTokenStream as these do very similar things. It appears that
TokenStreamToAutomaton (used by CompletionTokenStream) and
GraphTokenStreamFiniteStrings.build duplicate the same algorithm... maybe one
could reuse the other. But I didn't look closely to see.
I just did a quick hack experiment of using CompletionTokenStream in place of
ConcatenateFilter with the SolrTextTagger tests and it basically works. I
mentioned some lifecycle stuff above I debugged. I needed to make the
separator customizable (e.g. to be a space). One weird thing is that the first
position increment of CompletionTokenStream is 0, which IndexWriter is unhappy
with, so I set it to 1. Interestingly, BaseTokenStreamTestCase didn't complain
about this, yet real-world use complained right away. Maybe
BaseTokenStreamTestCase needs to explicitly test this?
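To make the position-increment point concrete, here is a rough plain-Java
model of the kind of check an indexer applies. It is only a sketch under
assumptions: the class and method names are hypothetical, the error messages
are paraphrased, and it does not reproduce IndexWriter's actual code. It
assumes the running position starts at -1, so a first increment of 0 leaves
the first token at an invalid position.

```java
import java.util.Arrays;

/** Hypothetical, simplified model; NOT IndexWriter's real implementation. */
final class PositionCheckSketch {

    /** Converts position increments to absolute positions, rejecting invalid streams. */
    static int[] positions(int[] posIncs) {
        int[] out = new int[posIncs.length];
        int position = -1; // assumption: the position before the first token is -1
        for (int i = 0; i < posIncs.length; i++) {
            if (posIncs[i] < 0) {
                throw new IllegalArgumentException("position increment must be >= 0");
            }
            position += posIncs[i];
            if (position < 0) {
                // Only reachable when the very first increment is 0.
                throw new IllegalArgumentException("first position increment must be > 0");
            }
            out[i] = position;
        }
        return out;
    }

    public static void main(String[] args) {
        // A first increment of 1 is fine; later increments of 0 stack tokens.
        System.out.println(Arrays.toString(positions(new int[] {1, 0, 1})));
        try {
            positions(new int[] {0}); // first increment of 0, as described above
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A similar assertion on the first token's increment is roughly what an explicit
BaseTokenStreamTestCase check would need to do.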
I'll throw up a patch once I get confirmation on a name.
> New ConcatenateFilter, a TokenFilter to concat/join tokens
> ----------------------------------------------------------
>
> Key: LUCENE-8323
> URL: https://issues.apache.org/jira/browse/LUCENE-8323
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: David Smiley
> Assignee: David Smiley
> Priority: Major
> Attachments: LUCENE-8323.patch
>
>
> Here I introduce the ConcatenateFilter (with Factory) to concatenate/join
> tokens with a provided separator to produce one final token. It's similar to
> FingerprintFilter but doesn't deduplicate or sort. It's useful for doing
> exact-ish search on short text (think names or titles) with simple analysis.
> At this task, it's faster than a PhraseQuery equivalent, and solves the issue
> of matching completely and not a portion of the tokens. It's also useful for
> using Lucene to hold a dictionary of short names/phrases for
> entity-extraction (aka text tagging). The OpenSextant SolrTextTagger uses it
> for this purpose, which is where I'm taking it from.
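
The behaviour described above can be sketched in plain Java, without any
Lucene classes. This is an illustration only, not the code in the attached
patch; ConcatSketch and its methods are hypothetical names. It shows the two
properties the description calls out: tokens are joined in their original
order with no deduplication or sorting, and a match requires the whole token
sequence rather than a portion of it.

```java
import java.util.List;

/** Hypothetical illustration; NOT the ConcatenateFilter from the patch. */
final class ConcatSketch {

    /** Joins already-analyzed tokens, in order and without deduplication, into one token. */
    static String concat(List<String> tokens, char separator) {
        return String.join(String.valueOf(separator), tokens);
    }

    /** Exact-ish match: the whole token sequence must be equal, not just a portion of it. */
    static boolean matches(List<String> indexedTokens, List<String> queryTokens, char separator) {
        return concat(indexedTokens, separator).equals(concat(queryTokens, separator));
    }

    public static void main(String[] args) {
        List<String> title = List.of("new", "york", "new", "york");
        System.out.println(concat(title, ' '));                        // repeated tokens are kept
        System.out.println(matches(title, List.of("new", "york"), ' ')); // partial match fails
        System.out.println(matches(title, title, ' '));                // full match succeeds
    }
}
```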