[ https://issues.apache.org/jira/browse/LUCENE-7468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15551380#comment-15551380 ]
ASF subversion and git services commented on LUCENE-7468:
---------------------------------------------------------
Commit 739c0a7bf2c911e25ed40fb6717d9aed641a0a2f in lucene-solr's branch
refs/heads/branch_6x from [~jpountz]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=739c0a7 ]
LUCENE-7468: ASCIIFoldingFilter should not emit duplicated tokens when preserve
original is on.
> ASCIIFoldingFilter should not emit duplicated tokens when preserve original
> is on
> ---------------------------------------------------------------------------------
>
> Key: LUCENE-7468
> URL: https://issues.apache.org/jira/browse/LUCENE-7468
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Affects Versions: 4.7, 5.x, trunk, 6.x
> Reporter: David Causse
> Attachments: LUCENE-7468.patch
>
>
> The ASCIIFoldingFilter seems to assume that any token containing a char
> outside the ASCII range will be folded.
> The problem is that when preserve original is true, it captures and restores
> the state even if the token remains unmodified, so the same token is emitted
> twice.
> This doubles the term frequency for such words and probably uses extra space
> when positions/offsets are stored in the postings.
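
The behaviour described in the issue can be illustrated with a minimal, Lucene-free sketch in Java. It is not the committed patch: the class FoldingSketch and its methods fold and emit are hypothetical names, and the folding step uses java.text.Normalizer as a stand-in for Lucene's own mapping table. It only shows the idea of emitting the original term when folding actually changed it.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

/** Illustrative stand-in for the filter logic; not Lucene's ASCIIFoldingFilter. */
final class FoldingSketch {

    // Hypothetical folding step: decompose accented characters (NFD) and drop
    // the combining marks. Lucene uses its own character mapping instead.
    static String fold(String term) {
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Returns the terms emitted for a single input token.
    static List<String> emit(String term, boolean preserveOriginal) {
        List<String> out = new ArrayList<>();
        String folded = fold(term);
        out.add(folded);
        // Emit the original only when folding actually changed the term;
        // otherwise the same term would be emitted twice.
        if (preserveOriginal && !folded.equals(term)) {
            out.add(term);
        }
        return out;
    }

    public static void main(String[] args) {
        // "caf\u00e9" contains a foldable accented char (e with acute accent).
        System.out.println(emit("caf\u00e9", true));
        // "\u65e5\u672c" is non-ASCII, but this fold step leaves it unchanged.
        System.out.println(emit("\u65e5\u672c", true));
    }
}
```

In this sketch a non-ASCII token that the fold step leaves unchanged is emitted once, while a token that was really folded is emitted in both its folded and original forms.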
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)