[jira] [Commented] (LUCENE-8323) New ConcatenateFilter, a TokenFilter to concat/join tokens

David Smiley (JIRA) Thu, 24 May 2018 12:14:31 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-8323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16489618#comment-16489618 ]


David Smiley commented on LUCENE-8323:
--------------------------------------

Thanks for the review Adrien.

bq. Did you use those license headers on purpose for new files? They don't look 
like the usual ones that we use.

It's deliberate; remember, this comes from another project (see my first 
comment).

I agree with all your other points, but they may be moot now based on my 
discovery of CompletionTokenStream...

bq. You can also check the CompletionTokenStream in the suggest package. It 
does exactly what you want and it's already a TokenStream so maybe it can be 
renamed and moved to the analysis module ?

Wow thanks Jim; this is exactly what I'm looking for!

+1 to move/rename CompletionTokenStream for broader use.

I think it should be made a TokenFilter so that it can be used easily with, 
say, CustomAnalyzer.  I did this as a quick hack and it's mostly okay.  I had 
to debug some tokenstream lifecycle issues, though that wasn't so much 
because it's a Filter as because it's now getting tested in a more hardened 
way thanks to BaseTokenStreamTestCase.

What name?  CompletionGraphTokenFilter maybe, but the word "Completion" is tied 
too much to its original use-case.  Maybe ConcatenateGraphTokenFilter or the 
shorter ConcatGraphTokenFilter?  FiniteStringsGraphTokenFilter is another idea, 
though its name seems very non-obvious to all but internal Lucene devs.

I think we should add "@see" references between GraphTokenStreamFiniteStrings 
and CompletionTokenStream, as they do very similar things.  It appears that 
TokenStreamToAutomaton (used by CompletionTokenStream) and 
GraphTokenStreamFiniteStrings.build duplicate the same algorithm... maybe one 
could reuse the other.  But I didn't look closely to see.
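
To illustrate what "finite strings" of a token graph means here, below is a 
minimal plain-Java sketch.  It is not TokenStreamToAutomaton or 
GraphTokenStreamFiniteStrings, and it uses no Lucene classes; the names 
GraphConcatSketch, Arc and finiteStrings are hypothetical.  It assumes each 
token is an arc between two position nodes and that every arc moves forward 
(to > from), so the walk terminates.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model only; none of these types exist in Lucene.
final class GraphConcatSketch {
    /** One token modeled as an arc between two position nodes (assumed: to > from). */
    static final class Arc {
        final String term;
        final int from;
        final int to;

        Arc(String term, int from, int to) {
            this.term = term;
            this.from = from;
            this.to = to;
        }
    }

    /** Returns one joined string per path from node 0 to endNode. */
    static List<String> finiteStrings(List<Arc> arcs, int endNode, String separator) {
        List<String> out = new ArrayList<>();
        walk(arcs, 0, endNode, separator, "", out);
        return out;
    }

    // Depth-first walk; arcs are followed in the order they appear in the list.
    private static void walk(List<Arc> arcs, int node, int endNode,
                             String separator, String prefix, List<String> out) {
        if (node == endNode) {
            out.add(prefix);
            return;
        }
        for (Arc arc : arcs) {
            if (arc.from == node) {
                String next = prefix.isEmpty() ? arc.term : prefix + separator + arc.term;
                walk(arcs, arc.to, endNode, separator, next, out);
            }
        }
    }
}
```

For a graph where "wifi" spans two positions alongside "wi" and "fi", followed 
by "network", this yields one concatenated string per path through the graph.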

I just did a quick hack experiment of using CompletionTokenStream in place of 
ConcatenateFilter with the SolrTextTagger tests and it basically works.  I 
debugged the lifecycle issues mentioned above.  I needed to make the 
separator customizable (e.g. to be a space).  One weird thing is that the first 
position increment of CompletionTokenStream is 0, which IndexWriter is unhappy 
with, so I set it to 1.  Interestingly, BaseTokenStreamTestCase didn't complain 
about this, yet real-world use complained right away.  Maybe 
BaseTokenStreamTestCase needs to explicitly test this?
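
A rough plain-Java model of why a first position increment of 0 is a problem, 
and of the workaround of forcing it to 1.  This is only a sketch: it assumes 
the indexer accumulates increments starting from position -1 and rejects a 
negative result, which is my understanding rather than something verified 
here.  The names PositionIncrementSketch, positions and fixFirstIncrement are 
hypothetical and no Lucene classes are used.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model only; this is not IndexWriter or BaseTokenStreamTestCase.
final class PositionIncrementSketch {
    /** Returns a copy whose first increment is at least 1; later increments are unchanged. */
    static List<Integer> fixFirstIncrement(List<Integer> increments) {
        List<Integer> fixed = new ArrayList<>(increments);
        if (!fixed.isEmpty() && fixed.get(0) < 1) {
            fixed.set(0, 1);
        }
        return fixed;
    }

    /** Turns increments into absolute positions, starting from -1 (an assumption). */
    static List<Integer> positions(List<Integer> increments) {
        List<Integer> result = new ArrayList<>();
        int position = -1;
        for (int increment : increments) {
            position += increment;
            if (position < 0) {
                // Models the indexer rejecting a first increment of 0.
                throw new IllegalArgumentException("first position increment must be > 0");
            }
            result.add(position);
        }
        return result;
    }
}
```

A test harness could run the same kind of check on the first token of every 
stream it consumes, which is the sort of explicit test suggested above.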

I'll throw up a patch once I get confirmation on a name.

> New ConcatenateFilter, a TokenFilter to concat/join tokens
> ----------------------------------------------------------
>
>                 Key: LUCENE-8323
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8323
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: David Smiley
>            Assignee: David Smiley
>            Priority: Major
>         Attachments: LUCENE-8323.patch
>
>
> Here I introduce the ConcatenateFilter (with Factory) to concatenate/join 
> tokens with a provided separator to produce one final token.  It's similar to 
> FingerprintFilter but doesn't deduplicate or sort.  It's useful for doing 
> exact-ish search on short text (think names or titles) with simple analysis.  
> At this task, it's faster than a PhraseQuery equivalent, and solves the issue 
> of matching completely and not a portion of the tokens.  It's also useful for 
> using Lucene to hold a dictionary of short names/phrases for 
> entity-extraction (aka text tagging).  The OpenSextant SolrTextTagger uses it 
> for this purpose, which is where I'm taking it from.
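
The behavior described in the issue summary can be modeled in a few lines of 
plain Java.  This is a sketch of the idea only, not the attached patch and not 
the Lucene TokenFilter API; ConcatenateSketch, concatenate and matches are 
hypothetical names.

```java
import java.util.List;

// Plain-Java model of the concatenation behavior; no Lucene classes are used.
final class ConcatenateSketch {
    /** Joins tokens in their original order; duplicates are kept and nothing is sorted. */
    static String concatenate(List<String> tokens, String separator) {
        return String.join(separator, tokens);
    }

    /** Exact-ish match: the whole joined value must be equal, not just a portion of the tokens. */
    static boolean matches(List<String> indexed, List<String> query, String separator) {
        return concatenate(indexed, separator).equals(concatenate(query, separator));
    }
}
```

Because the joined value is compared as a single term, a query for "new york" 
does not match an indexed "new york city" in this model, which is the 
"matching completely" property mentioned in the description.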



