[jira] [Commented] (LUCENE-8323) New ConcatenateFilter, a TokenFilter to concat/join tokens

Jim Ferenczi (JIRA) Thu, 24 May 2018 16:00:44 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-8323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16489956#comment-16489956 ]


Jim Ferenczi commented on LUCENE-8323:
--------------------------------------

{quote}
I think it should be made a TokenFilter so that it can be used easily with, 
say, CustomAnalyzer.
{quote}

It's a TokenStream because it consumes the entire input in the first call to 
incrementToken (which invokes input.reset(), input.end(), and input.close()). 
That's fine, though, because TokenFilterFactory returns a TokenStream, so you 
can already use it in a CustomAnalyzer or in an AnalyzerWrapper, as the 
CompletionAnalyzer does. Also note that there is already a test built on 
BaseTokenStreamTestCase: CompletionTokenStreamTest. 
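
To illustrate the "consume everything on the first call" behavior described 
above, here is a minimal sketch in plain Java. ListTokenSource and 
EagerConcatSource are hypothetical stand-ins written for this example; they 
are not Lucene classes and do not reproduce the attribute-based TokenStream 
API or the patch itself. The separator parameter is likewise an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an upstream token source. NOT Lucene's TokenStream API.
final class ListTokenSource {
    private final List<String> tokens;
    private int pos = -1;
    final List<String> calls = new ArrayList<>(); // records lifecycle calls in order

    ListTokenSource(List<String> tokens) { this.tokens = tokens; }

    void reset() { pos = -1; calls.add("reset"); }
    boolean next() { return ++pos < tokens.size(); }
    String term() { return tokens.get(pos); }
    void end() { calls.add("end"); }
    void close() { calls.add("close"); }
}

// Hypothetical stream that drains its input during the first next() call.
final class EagerConcatSource {
    private final ListTokenSource input;
    private final String separator;
    private boolean consumed = false;
    private String term = null;

    EagerConcatSource(ListTokenSource input, String separator) {
        this.input = input;
        this.separator = separator;
    }

    boolean next() {
        if (consumed) {
            return false; // the single token was already produced
        }
        consumed = true;
        // The whole input lifecycle runs here, which is why this cannot
        // behave like an ordinary token-at-a-time filter.
        input.reset();
        List<String> parts = new ArrayList<>();
        while (input.next()) {
            parts.add(input.term());
        }
        input.end();
        input.close();
        if (parts.isEmpty()) {
            return false; // nothing to emit for empty input
        }
        term = String.join(separator, parts);
        return true;
    }

    String term() { return term; }
}
```

With the tokens "new", "york", "city" and a single-space separator, the first 
next() call yields one token, "new york city", and the second returns false.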

{quote}
What name? CompletionGraphTokenFilter maybe, but the word "Completion" is tied 
too much to its original use case. Maybe ConcatenateGraphTokenFilter or the 
shorter ConcatGraphTokenFilter? FiniteStringsGraphTokenFilter is another idea, 
though its name seems very non-obvious to all but internal Lucene devs.
{quote}

+1 to ConcatenateGraphTokenStream

{quote}
It appears TokenStreamToAutomaton (used by CompletionTokenStream) and 
GraphTokenStreamFiniteStrings.build are duplicated algorithms... maybe one 
could reuse the other. But I didn't look closely to see.
{quote}

I guess they could share some logic but that's a different issue ;) 

> New ConcatenateFilter, a TokenFilter to concat/join tokens
> ----------------------------------------------------------
>
>                 Key: LUCENE-8323
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8323
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: David Smiley
>            Assignee: David Smiley
>            Priority: Major
>         Attachments: LUCENE-8323.patch
>
>
> Here I introduce the ConcatenateFilter (with Factory) to concatenate/join 
> tokens with a provided separator to produce one final token.  It's similar to 
> FingerprintFilter but doesn't deduplicate or sort.  It's useful for doing 
> exact-ish search on short text (think names or titles) with simple analysis.  
> At this task, it's faster than an equivalent PhraseQuery, and it solves the 
> issue of matching all of the tokens rather than only a portion of them.  It's also useful for 
> using Lucene to hold a dictionary of short names/phrases for 
> entity-extraction (aka text tagging).  The OpenSextant SolrTextTagger uses it 
> for this purpose, which is where I'm taking it from.
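
As a rough illustration of the difference described in the issue text, the 
sketch below contrasts plain concatenation (order and duplicates preserved) 
with a fingerprint-style join (sorted and deduplicated). It uses plain Java 
collections only; the class and method names are hypothetical and this is not 
the code from the attached patch or from FingerprintFilter.

```java
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of Lucene.
final class ConcatVsFingerprint {

    // Join tokens in their original order, keeping duplicates.
    static String concatenate(List<String> tokens, String separator) {
        return String.join(separator, tokens);
    }

    // Fingerprint-style join: a TreeSet sorts the tokens and drops duplicates.
    static String fingerprint(List<String> tokens, String separator) {
        return String.join(separator, new TreeSet<>(tokens));
    }
}
```

For the tokens "the", "big", "the", "apple" with a space separator, 
concatenate gives "the big the apple" while fingerprint gives "apple big the". 
Only the first form can serve as a single term for exact whole-value matching 
that respects word order.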



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

