[ 
https://issues.apache.org/jira/browse/SOLR-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15122672#comment-15122672
 ] 

Shawn Heisey commented on SOLR-8606:
------------------------------------

When I read the description of the RemoveDuplicates filter on the Solr wiki, it 
seems to say that duplicates will only be removed when they are consecutive 
tokens in the stream.  If that's indeed how it works, it would not remove 
duplicates in this case, because the duplicates are separated by another token.
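
To make that reading concrete, here is a minimal sketch of a filter that drops a token only when it repeats the immediately preceding token at the same position. It models the interpretation described above, not Lucene's actual implementation. The function name `dedupe_consecutive`, the `(term, position_increment)` tuple layout, and the token order in the sample streams are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical token model: (term text, position increment).
# A position increment of 0 means "same position as the previous token".
Token = Tuple[str, int]


def dedupe_consecutive(tokens: List[Token]) -> List[Token]:
    """Drop a token only if it repeats the token kept immediately before it
    and sits at the same position (position increment 0)."""
    out: List[Token] = []
    for term, pos_inc in tokens:
        if out and pos_inc == 0 and out[-1][0] == term:
            continue  # adjacent duplicate at the same position: skip it
        out.append((term, pos_inc))
    return out


# Assumed token order for "abcDef": the two "abcDef" tokens are
# separated by "abc", so nothing is removed under this reading.
separated = [("abcDef", 1), ("abc", 0), ("abcDef", 0), ("Def", 1)]

# If the duplicates were adjacent, the second one would be dropped.
adjacent = [("abcDef", 1), ("abcDef", 0), ("abc", 0), ("Def", 1)]

print(dedupe_consecutive(separated))
print(dedupe_consecutive(adjacent))
```

Under this model the first stream comes back unchanged, which is the situation described above: the duplicate survives because another token sits between the two copies.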

> Duplicate tokens using WordDelimiterFilter for a specific configuration
> -----------------------------------------------------------------------
>
>                 Key: SOLR-8606
>                 URL: https://issues.apache.org/jira/browse/SOLR-8606
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Lespiau
>            Priority: Minor
>         Attachments: SOLR-8686-TEST.patch, wdf-analysis.png
>
>
> When using the options PRESERVE_ORIGINAL|SPLIT_ON_CASE_CHANGE together with 
> CATENATE_ALL|CATENATE_WORDS in the WordDelimiterFilter, we get 
> duplicate tokens on strings containing only case changes.
> When using the SPLIT_ON_CASE_CHANGE option, "abcDef" is split into "abc", 
> "Def".
> With PRESERVE_ORIGINAL, we also keep "abcDef".
> However, when one uses CATENATE_ALL or CATENATE_WORDS, it also adds 
> another token built from the concatenation of the split words, giving 
> "abcDef" again.
> I'm not 100% certain that token filters should never produce duplicate tokens 
> (same word, same start and end positions). Can someone confirm this is a bug?
> I am supplying a patch with a test exposing the incorrect behavior.
> I'm willing to work on a fix over the coming days.
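
The behavior described in the issue can be modeled with a short sketch. This is a simplified stand-in and not the WordDelimiterFilter code: the function `word_delimiter_sketch`, its boolean parameters, the `(term, start, end)` tuple layout, and the order in which tokens are emitted are all assumptions chosen to mirror the description.

```python
import re
from typing import List, Tuple

# Hypothetical token model: (term text, start offset, end offset).
Token = Tuple[str, int, int]


def word_delimiter_sketch(text: str, preserve_original: bool,
                          catenate: bool) -> List[Token]:
    """Simplified model: split on lower-to-upper case changes, optionally
    keep the original token, optionally add the concatenation of the parts."""
    tokens: List[Token] = []
    if preserve_original:
        tokens.append((text, 0, len(text)))

    # Split at every lower-to-upper transition, tracking offsets.
    parts: List[Token] = []
    start = 0
    for match in re.finditer(r"(?<=[a-z])(?=[A-Z])", text):
        parts.append((text[start:match.start()], start, match.start()))
        start = match.start()
    parts.append((text[start:], start, len(text)))

    if len(parts) > 1:
        tokens.extend(parts)
        if catenate:
            # The concatenation of the parts equals the original string when
            # only case changes were split on, so this repeats the original.
            tokens.append(("".join(p[0] for p in parts), 0, len(text)))
    elif not preserve_original:
        tokens.extend(parts)
    return tokens


tokens = word_delimiter_sketch("abcDef", preserve_original=True, catenate=True)
print(tokens)
```

With both options enabled, the model emits "abcDef" twice with identical offsets, once as the preserved original and once as the concatenation, which is the duplicate the reporter asks about.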


