[ https://issues.apache.org/jira/browse/SOLR-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15120268#comment-15120268 ]
Shawn Heisey commented on SOLR-8606:
------------------------------------
Assuming I understand this correctly -- that the duplicate terms are at the
same *term* position -- this is something we should probably fix. The term
position is a different piece of information than the start/end offsets, which
refer to character positions in the unprocessed string supplied to analysis.
TL;DR details:
In terms of *matching*, this behavior isn't a bug. It won't cause incorrect
matches or prevent correct matches, as long as the duplicate terms are at the
same term position.
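To make that concrete, here is a minimal sketch against the Lucene analysis
API (Lucene 5.x class locations and the setReader() style are assumed; the
flag set mirrors the reporter's configuration). It prints every emitted token
with its accumulated term position and its character offsets, so you can see
directly whether the duplicate "abcDef" lands at the same position as the
preserved original:

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class WdfDuplicateDemo {
  public static void main(String[] args) throws Exception {
    // The reporter's configuration: split on case changes, keep the
    // original token, and also emit the catenation of the word parts.
    int flags = WordDelimiterFilter.GENERATE_WORD_PARTS
              | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE
              | WordDelimiterFilter.PRESERVE_ORIGINAL
              | WordDelimiterFilter.CATENATE_WORDS;

    WhitespaceTokenizer source = new WhitespaceTokenizer();
    source.setReader(new StringReader("abcDef"));
    TokenStream ts = new WordDelimiterFilter(source, flags, null);

    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    PositionIncrementAttribute posIncr =
        ts.addAttribute(PositionIncrementAttribute.class);
    OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);

    ts.reset();
    int pos = -1;  // running term position, accumulated from increments
    while (ts.incrementToken()) {
      pos += posIncr.getPositionIncrement();
      System.out.printf("term=%s pos=%d offsets=[%d,%d]%n",
          term, pos, offset.startOffset(), offset.endOffset());
    }
    ts.end();
    ts.close();
  }
}
{code}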
There are, however, potential ramifications for *relevance*: a duplicated
token inflates that term's frequency in the field, which skews tf-based
scoring.
RemoveDuplicatesTokenFilterFactory exists, but from what I understand, this
filter only takes effect when the duplicate tokens are next to each other in
the stream, within a single run of tokens at the same position ... and when
WDF creates duplicates, they are frequently NOT consecutive tokens.
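For reference, wiring in the deduplication is just a matter of appending the
Lucene class behind that factory to the chain (a fragment, reusing the source
and flags from the sketch above):

{code:java}
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.miscellaneous.RemoveDuplicatesTokenFilter;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;

// RemoveDuplicatesTokenFilter only drops a token whose text it has
// already seen at the SAME position (posInc == 0); it forgets all seen
// terms as soon as the position advances, so WDF duplicates that are
// not emitted within one same-position run will survive.
TokenStream deduped = new RemoveDuplicatesTokenFilter(
    new WordDelimiterFilter(source, flags, null));
{code}

Whether that helps here therefore depends entirely on where WDF injects the
catenated token relative to the preserved original.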
This issue is not filed in the correct project -- the class in question is a
Lucene class. I am waiting for confirmation on that before I move it.
> Duplicate tokens using WordDelimiterFilter for a specific configuration
> -----------------------------------------------------------------------
>
> Key: SOLR-8606
> URL: https://issues.apache.org/jira/browse/SOLR-8606
> Project: Solr
> Issue Type: Bug
> Reporter: Lespiau
> Priority: Minor
> Attachments: SOLR-8686-TEST.patch
>
>
> When using both the options PRESERVE_ORIGINAL | SPLIT_ON_CASE_CHANGE and
> CATENATE_ALL | CATENATE_WORDS with the WordDelimiterFilter, we get duplicate
> tokens on strings containing only case changes.
> When using the SPLIT_ON_CASE_CHANGE option, "abcDef" is split into "abc" and
> "Def".
> When PRESERVE_ORIGINAL is also set, we keep "abcDef".
> However, when one also uses CATENATE_ALL or CATENATE_WORDS, the filter adds
> another token built from the concatenation of the split words, giving
> "abcDef" again.
> I'm not 100% certain that token filters should never produce duplicate tokens
> (same word, same start and end positions). Can someone confirm this is a bug?
> I supply a patch with a test exposing the incorrect behavior.
> I'm willing to work on a fix over the following days.
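For anyone who wants to reproduce this without applying the attached patch, a
rough equivalent of such a test can be sketched by reusing the analysis chain
from the demo snippet above: collect each (term, position, startOffset,
endOffset) tuple and fail on a repeat (java.util.Set/HashSet imports assumed;
the attached patch remains the authoritative test):

{code:java}
// Reusing ts, term, posIncr, and offset from the demo snippet above.
Set<String> seen = new HashSet<>();
ts.reset();
int pos = -1;
while (ts.incrementToken()) {
  pos += posIncr.getPositionIncrement();
  String key = term + "@" + pos
             + "[" + offset.startOffset() + "," + offset.endOffset() + "]";
  if (!seen.add(key)) {
    throw new AssertionError("duplicate token: " + key);
  }
}
ts.end();
ts.close();
{code}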