[ 
https://issues.apache.org/jira/browse/SOLR-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15123929#comment-15123929
 ] 

Lespiau commented on SOLR-8606:
-------------------------------

I have looked at the source code and tested it: the RemoveDuplicates filter 
**will** filter the duplicate terms.

Indeed, it removes duplicates in the set of tokens sharing the same position.

However, it does so by comparing every token with the other tokens at the same 
position. For performance reasons, I will still correct the 
WordDelimiterFilter, so that it is not necessary to call RemoveDuplicates 
afterwards (that way only one iteration is needed, and not 2).
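
To illustrate what "removing duplicates among tokens sharing the same 
position" means, here is a minimal Python sketch. It is a simplified model 
and not the Lucene code: the Token type, the pos_inc field and the sample 
stream are assumptions made for the example (a position increment of 0 means 
"same position as the previous token").

```python
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    """Hypothetical, simplified token: a term plus its position increment."""
    term: str
    pos_inc: int  # 0 means the token shares the previous token's position


def remove_duplicates(tokens: Iterable[Token]) -> Iterator[Token]:
    """Drop a token when the same term was already seen at the same position."""
    seen = set()
    for tok in tokens:
        if tok.pos_inc > 0:
            # A new position starts, so forget the terms of the previous one.
            seen.clear()
        if tok.term in seen:
            continue  # duplicate at this position: skip it
        seen.add(tok.term)
        yield tok


# Assumed sample stream: "abcDef" appears twice at the first position.
stream = [Token("abcDef", 1), Token("abc", 0), Token("abcDef", 0), Token("Def", 1)]
filtered = list(remove_duplicates(stream))
print([t.term for t in filtered])
```

Even in this simplified form it is a second pass over the token stream, which 
is the extra cost I would like to avoid by not emitting the duplicate in the 
first place.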

> Duplicate tokens using WordDelimiterFilter for a specific configuration
> -----------------------------------------------------------------------
>
>                 Key: SOLR-8606
>                 URL: https://issues.apache.org/jira/browse/SOLR-8606
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Lespiau
>            Priority: Minor
>         Attachments: SOLR-8686-TEST.patch, wdf-analysis.png
>
>
> When using both the options PRESERVE_ORIGINAL|SPLIT_ON_CASE_CHANGE and 
> CATENATE_ALL|CATENATE_WORDS with the WordDelimiterFilter, we get 
> duplicate tokens on strings containing only case changes.
> When using the SPLIT_ON_CASE_CHANGE option, "abcDef" is split into "abc", 
> "Def".
> With PRESERVE_ORIGINAL, we keep "abcDef".
> However, when one uses CATENATE_ALL or CATENATE_WORDS, it also adds 
> another token built from the concatenation of the split words, giving 
> "abcDef" again.
> I'm not 100% certain that token filters should not produce duplicate tokens 
> (same word, same start and end positions). Can someone confirm this is a bug?
> I supply a patch with a test exposing the incorrect behavior.
> I'm willing to work on a fix in the coming days.
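
The scenario in the report can be reproduced with a small stand-alone model. 
The Python sketch below is only an approximation under stated assumptions: 
the function names and boolean parameters are hypothetical stand-ins for the 
filter options, only lower-to-upper case changes are handled, and the order 
of the returned terms is not meant to match what Lucene emits.

```python
import re
from collections import Counter


def split_on_case_change(text: str) -> list:
    """Split at each lower-to-upper transition, e.g. "abcDef" -> ["abc", "Def"]."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text).split()


def word_delimiter_terms(text: str,
                         preserve_original: bool = True,
                         split_case: bool = True,
                         catenate_all: bool = True) -> list:
    """Hypothetical model of the option combination described in the report."""
    terms = []
    if preserve_original:
        terms.append(text)  # keep the original token
    parts = split_on_case_change(text) if split_case else [text]
    if len(parts) > 1:
        terms.extend(parts)  # the sub-words
        if catenate_all:
            # The sub-words glued back together: identical to the original
            # when the input contains nothing but case changes.
            terms.append("".join(parts))
    elif not preserve_original:
        terms.append(text)  # nothing to split, pass the token through
    return terms


terms = word_delimiter_terms("abcDef")
duplicates = [term for term, count in Counter(terms).items() if count > 1]
print(terms, duplicates)
```

In this model the duplicate disappears as soon as either the preserve option 
or the catenate option is switched off, which matches the combination named 
in the report.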



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
