[jira] [Commented] (LUCENE-8323) New ConcatenateFilter, a TokenFilter to concat/join tokens

Adrien Grand (JIRA) Thu, 24 May 2018 08:05:19 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-8323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16489174#comment-16489174
 ]


Adrien Grand commented on LUCENE-8323:
--------------------------------------

+1 I like that you documented explicitly that the behavior is currently 
undefined for stacked tokens and gaps.

Some minor comments:
 - Did you use those license headers on purpose for new files? They don't look 
like the usual ones that we use.
 - Maybe use the start offset of the first token rather than 0 as a start 
offset.
 - Make "buf" final.
 - Nit: can you add some text to the ctor javadocs? It looks a bit weird when 
there are only parameter descriptions.
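
To make the offset suggestion concrete, here is a rough sketch in plain Java. It is not taken from the patch: the Token class and the concatenate method are hypothetical stand-ins for the filter's term and offset attributes, and "buf" is only named after the variable mentioned above. The idea is that the joined token starts where the first input token starts and ends where the last one ends.

```java
import java.util.List;

public class ConcatOffsetsSketch {

    /** Hypothetical stand-in for a token: term text plus character offsets. */
    public static final class Token {
        public final String term;
        public final int startOffset;
        public final int endOffset;

        public Token(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Joins all tokens into one, keeping the first start offset and the last end offset. */
    public static Token concatenate(List<Token> tokens, char separator) {
        final StringBuilder buf = new StringBuilder(); // final: never reassigned
        int startOffset = 0; // only used as a fallback when there are no tokens
        int endOffset = 0;
        boolean first = true;
        for (Token t : tokens) {
            if (first) {
                startOffset = t.startOffset; // start of the first token, not a hard-coded 0
                first = false;
            } else {
                buf.append(separator);
            }
            buf.append(t.term);
            endOffset = t.endOffset; // ends up as the end of the last token
        }
        return new Token(buf.toString(), startOffset, endOffset);
    }

    public static void main(String[] args) {
        // Input text "  new york": the first token starts at offset 2, not 0.
        List<Token> tokens = List.of(new Token("new", 2, 5), new Token("york", 6, 10));
        Token joined = concatenate(tokens, ' ');
        System.out.println(joined.term + " [" + joined.startOffset + "," + joined.endOffset + ")");
    }
}
```

With leading whitespace in the input, a hard-coded 0 would make the offsets point at text that is not part of any token, which is why the first token's start offset seems like the safer choice.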

> New ConcatenateFilter, a TokenFilter to concat/join tokens
> ----------------------------------------------------------
>
>                 Key: LUCENE-8323
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8323
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: David Smiley
>            Assignee: David Smiley
>            Priority: Major
>         Attachments: LUCENE-8323.patch
>
>
> Here I introduce the ConcatenateFilter (with Factory) to concatenate/join 
> tokens with a provided separator to produce one final token.  It's similar to 
> FingerprintFilter but doesn't deduplicate or sort.  It's useful for doing 
> exact-ish search on short text (think names or titles) with simple analysis.  
> At this task, it's faster than a PhraseQuery equivalent, and solves the issue 
> of matching completely and not a portion of the tokens.  It's also useful for 
> using Lucene to hold a dictionary of short names/phrases for 
> entity-extraction (aka text tagging).  The OpenSextant SolrTextTagger uses it 
> for this purpose, which is where I'm taking it from.
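
As a rough illustration of the difference described above, the following plain-Java sketch works on strings only and does not use the Lucene API. The class and method names are hypothetical. It contrasts joining tokens in their original order with a fingerprint-style join that deduplicates and sorts first.

```java
import java.util.List;
import java.util.TreeSet;

public class ConcatVsFingerprintSketch {

    /** Joins tokens in their original order, keeping duplicates. */
    public static String concatenate(List<String> tokens, String separator) {
        return String.join(separator, tokens);
    }

    /** Fingerprint-style join: a TreeSet drops duplicates and sorts the terms. */
    public static String fingerprintLike(List<String> tokens, String separator) {
        return String.join(separator, new TreeSet<>(tokens));
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("the", "cat", "the", "hat");
        System.out.println(concatenate(tokens, " "));     // the cat the hat
        System.out.println(fingerprintLike(tokens, " ")); // cat hat the
    }
}
```

Because order and duplicates are preserved, the concatenated form can only match input that yields the same token sequence, which is what makes it suitable for exact-ish matching of names or titles.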



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

