[ 
https://issues.apache.org/jira/browse/LUCENE-7622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807274#comment-15807274
 ] 

Uwe Schindler commented on LUCENE-7622:
---------------------------------------

I agree that by default TokenStreams should not produce duplicate tokens, but 
there are use cases (boosting) where you might want to do this. E.g., if you 
want to raise the boost of a term in a document (e.g., if it's inside an <em> 
HTML tag and should have emphasis), you can duplicate the token to increase its 
frequency (with the same position). The alternative would be payloads and a 
payload query, but duplicating the token is cheap to do.
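
To make the boosting trick concrete, here is a minimal sketch in plain Python. It is only a model of a token stream, not Lucene API: the Token tuple, the emphasized flag, and the helper names are all hypothetical. A copy emitted with a position increment of 0 lands on the same position as the original, so the term frequency goes up while positions stay unchanged.

```python
from collections import Counter
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class Token(NamedTuple):
    """Hypothetical stand-in for a token; not a Lucene class."""
    term: str
    pos_inc: int              # position increment relative to the previous token
    emphasized: bool = False  # assumed flag, e.g. set when the text was inside <em>


def duplicate_emphasized(tokens: Iterable[Token], copies: int = 1) -> Iterator[Token]:
    """Emit each emphasized token `copies` extra times at the same position."""
    for tok in tokens:
        yield tok
        if tok.emphasized:
            for _ in range(copies):
                # pos_inc=0 stacks the copy on top of the original token
                yield Token(tok.term, 0, emphasized=True)


def term_freqs(tokens: Iterable[Token]) -> Counter:
    """Count how often each term occurs in the stream."""
    return Counter(t.term for t in tokens)


def positions(tokens: Iterable[Token]) -> List[Tuple[str, int]]:
    """Resolve position increments into absolute positions."""
    pos = -1
    out = []
    for t in tokens:
        pos += t.pos_inc
        out.append((t.term, pos))
    return out


doc = [Token("lucene", 1), Token("is", 1), Token("fast", 1, emphasized=True)]
boosted = list(duplicate_emphasized(doc))
print(term_freqs(boosted))
print(positions(boosted))
```

In this model the emphasized term is counted twice, yet phrase and span matching would still see it at a single position.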

Also: if you use ASCIIFoldingFilter or stemming and add the folded/stemmed 
terms together with the original ones to the index, those terms with no 
folding/stemming applied would get duplicated. But if you don't duplicate 
them, the statistics would be wrong. I agree that for this case it would be 
better to have a separate field, but some people like to have both in the 
same field.
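
The folding case can be sketched the same way. The following is a plain-Python model, not the behaviour of ASCIIFoldingFilter itself: fold() is a hypothetical accent-stripping helper, and the dedupe switch is an assumption used only to compare the two choices. It prints the term counts you get when an unchanged term is emitted twice versus once.

```python
import unicodedata
from collections import Counter
from typing import List, Tuple


def fold(term: str) -> str:
    """Strip combining accents; a rough stand-in for ASCII folding."""
    decomposed = unicodedata.normalize("NFKD", term)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def index_with_folding(terms: List[str], dedupe: bool) -> List[Tuple[str, int]]:
    """Return (term, position) pairs holding the original and folded forms.

    With dedupe=False a term that folding leaves unchanged is emitted twice
    at the same position; with dedupe=True it is emitted once.
    """
    out = []
    for pos, term in enumerate(terms):
        out.append((term, pos))
        folded = fold(term)
        if folded != term or not dedupe:
            out.append((folded, pos))
    return out


doc = ["caf\u00e9", "cafe"]
for dedupe in (False, True):
    counts = Counter(term for term, _ in index_with_folding(doc, dedupe))
    print("dedupe =", dedupe, dict(counts))
```

The count for the unaccented term differs between the two modes, which is the kind of shift in statistics the same-field setup has to live with.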

> Should BaseTokenStreamTestCase catch analyzers that create duplicate tokens?
> ----------------------------------------------------------------------------
>
>                 Key: LUCENE-7622
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7622
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-7622.patch
>
>
> The change to BTSTC is quite simple, to catch any case where the same term 
> text spans from the same position with the same position length. Such 
> duplicate tokens are silly to add to the index, or to search at search time.
> Yet, this change produced many failures, and I looked briefly at them, and 
> they are cases that I think are actually OK, e.g. 
> {{PatternCaptureGroupTokenFilter}} capturing (..)(..) on the string {{ktkt}} 
> will create a duplicate token.
> Other cases looked more dubious, e.g. {{WordDelimiterFilter}}.
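
The check described in the issue can be modelled in a few lines. This is a hedged sketch in plain Python, not the actual BaseTokenStreamTestCase patch: tokens are hypothetical (term, position, position_length) tuples, and capture_group_tokens() assumes that every capture group is stacked on the same position, which only approximates PatternCaptureGroupTokenFilter.

```python
import re
from typing import List, Tuple

TokenKey = Tuple[str, int, int]  # (term, position, position_length)


def capture_group_tokens(pattern: str, text: str) -> List[TokenKey]:
    """Emit every capture group of the first match at position 0, length 1."""
    m = re.search(pattern, text)
    return [(g, 0, 1) for g in m.groups()] if m else []


def find_duplicates(tokens: List[TokenKey]) -> List[TokenKey]:
    """Return tokens repeating an earlier (term, position, length) triple."""
    seen = set()
    dups = []
    for tok in tokens:
        if tok in seen:
            dups.append(tok)
        seen.add(tok)
    return dups


tokens = capture_group_tokens(r"(..)(..)", "ktkt")
print(find_duplicates(tokens))
```

Both groups capture the same two characters here, so the second token is flagged even though the filter is arguably doing what it was configured to do.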




