[ 
https://issues.apache.org/jira/browse/SOLR-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15120268#comment-15120268
 ] 

Shawn Heisey commented on SOLR-8606:
------------------------------------

Assuming I understand this correctly, i.e. that the duplicate terms are at the 
same *term* position, this is something we should probably fix.  The term 
position is a different piece of information from the start/end offsets, which 
refer to character positions in the unprocessed string supplied to analysis.

TL;DR details:

In terms of *matching*, this behavior isn't a bug.  It won't cause incorrect 
matches or prevent correct matches, as long as the duplicate terms are at the 
same term position.

There are however potential ramifications for *relevance*.

RemoveDuplicatesTokenFilterFactory exists, but as I understand it, this 
filter only takes effect when the duplicate tokens are next to each other 
in the stream ... and when WDF creates duplicates, they frequently are NOT 
consecutive tokens.
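To illustrate why that matters, here is a minimal sketch (not the Lucene 
implementation, just the same idea) of position-aware deduplication: a token is 
dropped only if the same text was already seen since the last token whose 
position increment advanced the stream. Duplicates separated by a position 
advance survive. The Token record and method names here are hypothetical:

```java
import java.util.*;

public class DedupSketch {
    // Hypothetical token for illustration: surface text plus position increment.
    record Token(String text, int posInc) {}

    // Drop a token only if its text was already seen at the current position,
    // i.e. since the last token with a position increment > 0.
    static List<Token> removeDuplicates(List<Token> in) {
        List<Token> out = new ArrayList<>();
        Set<String> seenAtPosition = new HashSet<>();
        for (Token t : in) {
            if (t.posInc() > 0) {
                seenAtPosition.clear();   // the stream moved to a new position
            }
            if (seenAtPosition.add(t.text())) {
                out.add(t);               // first occurrence of this text here
            }
            // else: duplicate at the same position, dropped
        }
        return out;
    }

    public static void main(String[] args) {
        // Adjacent duplicates at the same position: the second one is removed.
        System.out.println(removeDuplicates(List.of(
            new Token("abcDef", 1), new Token("abcDef", 0))));

        // Duplicates separated by a position advance: both survive, because
        // the seen-set is cleared when the position increments.
        System.out.println(removeDuplicates(List.of(
            new Token("abc", 1), new Token("Def", 1), new Token("abcDef", 0))));
    }
}
```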

This issue is not filed in the correct project -- the class in question is a 
Lucene class.  I am waiting for confirmation on that before I move it.

> Duplicate tokens using WordDelimiterFilter for a specific configuration
> -----------------------------------------------------------------------
>
>                 Key: SOLR-8606
>                 URL: https://issues.apache.org/jira/browse/SOLR-8606
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Lespiau
>            Priority: Minor
>         Attachments: SOLR-8686-TEST.patch
>
>
> When using both the options PRESERVE_ORIGINAL|SPLIT_ON_CASE_CHANGE and 
> CONCATENATE_ALL|CATENATE_WORDS with the WordDelimiterFilter, we get 
> duplicate tokens on strings containing only case changes.
> When using the SPLIT_ON_CASE_CHANGE option, "abcDef" is split into "abc", 
> "Def".
> With PRESERVE_ORIGINAL, we keep "abcDef".
> However, when one uses CONCATENATE_ALL or CATENATE_WORDS, it also adds 
> another token built from the concatenation of the split words, giving 
> "abcDef" again.
> I'm not 100% certain that token filters should not produce duplicate tokens 
> (same word, same start and end positions). Can someone confirm this is a bug?
> I supply a patch with a test exposing the incorrect behavior.
> I'm willing to work over the following days to fix that.
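The duplication described above can be reproduced in miniature. This is a 
hypothetical sketch of the arithmetic, not the real WordDelimiterFilter: 
preserve the original token, split it on lower-to-upper case changes, then 
catenate the parts back together; when the only delimiters are case changes, 
the catenated token equals the preserved original:

```java
import java.util.*;

public class WdfDuplicateSketch {
    // Split on lower->upper case changes: "abcDef" -> ["abc", "Def"].
    static List<String> splitOnCaseChange(String s) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < s.length(); i++) {
            if (Character.isLowerCase(s.charAt(i - 1))
                    && Character.isUpperCase(s.charAt(i))) {
                parts.add(s.substring(start, i));
                start = i;
            }
        }
        parts.add(s.substring(start));
        return parts;
    }

    // Emulate the three options together on a single input token.
    static List<String> emit(String input) {
        List<String> tokens = new ArrayList<>();
        tokens.add(input);                       // PRESERVE_ORIGINAL
        List<String> parts = splitOnCaseChange(input);
        tokens.addAll(parts);                    // SPLIT_ON_CASE_CHANGE
        tokens.add(String.join("", parts));      // CATENATE_WORDS: duplicates
                                                 // the original when only case
                                                 // changes were present
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(emit("abcDef")); // [abcDef, abc, Def, abcDef]
    }
}
```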



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
