[ 
https://issues.apache.org/jira/browse/LUCENE-7468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15530085#comment-15530085
 ] 

Erick Erickson commented on LUCENE-7468:
----------------------------------------

I think it's still a good point that emitting the same token twice isn't the 
desired behavior here. Having to add the RemoveDuplicatesTokenFilter seems 
unnecessarily trappy, although it's a fine work-around if you don't want to 
wait for a new release.

FWIW

> ASCIIFoldingFilter should not emit duplicated tokens when preserve original 
> is on
> ---------------------------------------------------------------------------------
>
>                 Key: LUCENE-7468
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7468
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 4.7, 5.x, trunk, 6.x
>            Reporter: David Causse
>         Attachments: LUCENE-7468.patch
>
>
> The ASCIIFoldingFilter seems to make the bold assumption that any token that 
> contains a char outside the ASCII range will be folded.
> The problem is that when preserve original is true we capture and restore the 
> state even if the token remains unmodified.
> This doubles the term frequency for such words and probably uses extra 
> space when positions/offsets are stored in the postings.
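
To make the reported behavior concrete, here is a rough, Lucene-free sketch in plain Java. It is not the actual ASCIIFoldingFilter code or the attached patch: the class FoldingSketch, its methods, and the fold() helper (which only strips combining marks via java.text.Normalizer) are hypothetical stand-ins, assumed purely for illustration. It contrasts "emit the original whenever the token contains a non-ASCII char" with "emit the original only when folding actually changed the token".

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

public class FoldingSketch {

    // Hypothetical stand-in for ASCII folding: decompose (NFD), then drop combining marks.
    // The real filter uses its own, much larger mapping table.
    static String fold(String token) {
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Behavior described in the report: a token with any non-ASCII char is assumed
    // to fold, so the original is emitted again even when folding changed nothing.
    static List<String> naivePreserveOriginal(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            boolean hasNonAscii = token.chars().anyMatch(c -> c >= 0x80);
            out.add(fold(token));
            if (hasNonAscii) {
                out.add(token); // duplicate if fold(token).equals(token)
            }
        }
        return out;
    }

    // Behavior the issue asks for: keep the original only if it differs from the folded form.
    static List<String> checkedPreserveOriginal(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String folded = fold(token);
            out.add(folded);
            if (!folded.equals(token)) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "caf\u00e9" folds to "cafe"; "\u6771\u4eac" is non-ASCII but has no folded form here.
        List<String> tokens = List.of("caf\u00e9", "\u6771\u4eac", "plain");
        System.out.println(naivePreserveOriginal(tokens).size());   // 5 tokens, one duplicated
        System.out.println(checkedPreserveOriginal(tokens).size()); // 4 tokens, no duplicate
    }
}
```

In this sketch the unfoldable non-ASCII token comes out twice from the naive variant, which is the doubled term frequency the report describes; the checked variant avoids it without a separate de-duplication step.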



