[ https://issues.apache.org/jira/browse/LUCENE-8323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16489174#comment-16489174 ]
Adrien Grand commented on LUCENE-8323:
--------------------------------------
+1 I like that you documented explicitly that the behavior is currently
undefined for stacked tokens and gaps.
Some minor comments:
- Did you use those license headers on purpose for new files? They don't look
like the usual ones that we use.
- Maybe use the start offset of the first token rather than 0 as a start
offset.
- Make "buf" final.
- Nit: can you add some text to the ctor javadocs? It looks a bit weird when
there are only parameter descriptions.
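
To make the offset and "buf" suggestions concrete, here is a minimal sketch in plain Java. It is not the code from the patch and does not use the Lucene API: Tok and ConcatSketch are hypothetical stand-ins, assumed only for illustration. The joined token takes the start offset of the first token (rather than 0) and the end offset of the last one, and the buffer is declared final.

```java
import java.util.List;

// Hypothetical stand-in for a token and its character offsets (not a Lucene class).
final class Tok {
    final String term;
    final int startOffset;
    final int endOffset;

    Tok(String term, int startOffset, int endOffset) {
        this.term = term;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }
}

// Hypothetical helper that joins tokens into a single token.
final class ConcatSketch {
    // Returns one token whose term is all input terms joined by the separator.
    // The result spans from the first token's start offset to the last token's
    // end offset. Returns null when there are no input tokens.
    static Tok concat(List<Tok> tokens, char separator) {
        if (tokens.isEmpty()) {
            return null;
        }
        final StringBuilder buf = new StringBuilder();
        boolean first = true;
        for (Tok t : tokens) {
            if (!first) {
                buf.append(separator);
            }
            buf.append(t.term);
            first = false;
        }
        int start = tokens.get(0).startOffset;               // not hard-coded to 0
        int end = tokens.get(tokens.size() - 1).endOffset;
        return new Tok(buf.toString(), start, end);
    }
}
```

With input text that has leading whitespace, such as "  John Smith", the first token starts at offset 2, so the joined token would also start at 2 instead of 0.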
> New ConcatenateFilter, a TokenFilter to concat/join tokens
> ----------------------------------------------------------
>
> Key: LUCENE-8323
> URL: https://issues.apache.org/jira/browse/LUCENE-8323
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: David Smiley
> Assignee: David Smiley
> Priority: Major
> Attachments: LUCENE-8323.patch
>
>
> Here I introduce the ConcatenateFilter (with Factory) to concatenate/join
> tokens with a provided separator to produce one final token. It's similar to
> FingerprintFilter but doesn't deduplicate or sort. It's useful for doing
> exact-ish search on short text (think names or titles) with simple analysis.
> At this task, it's faster than a PhraseQuery equivalent, and solves the issue
> of matching completely and not a portion of the tokens. It's also useful for
> using Lucene to hold a dictionary of short names/phrases for
> entity-extraction (aka text tagging). The OpenSextant SolrTextTagger uses it
> for this purpose, which is where I'm taking it from.
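
The difference from FingerprintFilter described above can be sketched in plain Java without any Lucene classes. JoinVsFingerprint and its methods are hypothetical names, and the sort uses natural String order, which is an assumption rather than a statement about how FingerprintFilter orders tokens.

```java
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper contrasting the two joining behaviors.
final class JoinVsFingerprint {
    // Concatenate-style: keep every token, in its original order.
    static String concatenate(List<String> tokens, String separator) {
        return String.join(separator, tokens);
    }

    // Fingerprint-style: drop duplicates and sort before joining.
    // TreeSet gives both properties using natural String ordering.
    static String fingerprint(List<String> tokens, String separator) {
        return String.join(separator, new TreeSet<>(tokens));
    }
}
```

For the tokens "the", "cat", "the", "hat" and a single-space separator, concatenate yields "the cat the hat" while fingerprint yields "cat hat the". Because the concatenated form keeps order and repeats, a lookup on the single joined term only matches text whose whole token sequence is the same, which is the exact-ish matching the description refers to.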