David Causse created LUCENE-7468:
------------------------------------
Summary: ASCIIFoldingFilter should not emit duplicated tokens when
preserve original is on
Key: LUCENE-7468
URL: https://issues.apache.org/jira/browse/LUCENE-7468
Project: Lucene - Core
Issue Type: Bug
Components: modules/analysis
Affects Versions: 4.7, 5.x, trunk, 6.x
Reporter: David Causse
The ASCIIFoldingFilter seems to make the bold assumption that any token that
contains a char outside the ASCII range will be folded.
The problem is that when preserve original is true we capture and restore the
state even when the token remains unmodified, for example when its non-ASCII
chars have no ASCII equivalent.
This doubles the term frequency for such words and probably wastes space
when positions/offsets are stored in the postings.
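
To illustrate the report, here is a minimal standalone sketch. It does not use
Lucene: the class FoldingSketch, its toy fold() table, and the emitNaive() and
emitChecked() helpers are all hypothetical names made up for this example.
emitNaive() models the behaviour described above as I understand it, and
emitChecked() shows one possible guard (compare the folded form with the
original before emitting it again), not necessarily the fix that should go in.

```java
import java.util.ArrayList;
import java.util.List;

public class FoldingSketch {

    // Toy folding table (assumption): only a few accented letters are mapped.
    // Any other char, including unmapped non-ASCII chars, is copied unchanged.
    static String fold(String term) {
        StringBuilder sb = new StringBuilder(term.length());
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            switch (c) {
                case '\u00e8': case '\u00e9': case '\u00ea': sb.append('e'); break;
                case '\u00e0': case '\u00e2': sb.append('a'); break;
                case '\u00fc': sb.append('u'); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // True if the term has at least one char outside the ASCII range.
    static boolean hasNonAscii(String term) {
        for (int i = 0; i < term.length(); i++) {
            if (term.charAt(i) >= 0x80) {
                return true;
            }
        }
        return false;
    }

    // Models the reported behaviour: any non-ASCII char is taken as proof
    // that folding changed the term, so the original is always emitted too.
    static List<String> emitNaive(String term, boolean preserveOriginal) {
        List<String> out = new ArrayList<>();
        if (!hasNonAscii(term)) {
            out.add(term);
            return out;
        }
        out.add(fold(term));
        if (preserveOriginal) {
            out.add(term);
        }
        return out;
    }

    // Possible guard: emit the original only if folding actually changed it.
    static List<String> emitChecked(String term, boolean preserveOriginal) {
        List<String> out = new ArrayList<>();
        String folded = fold(term);
        out.add(folded);
        if (preserveOriginal && !folded.equals(term)) {
            out.add(term);
        }
        return out;
    }

    public static void main(String[] args) {
        String accented = "caf\u00e9";      // folds to "cafe"
        String unmapped = "\u6771\u4eac";   // no ASCII mapping in the toy table
        System.out.println(emitNaive(accented, true).size());   // two distinct tokens
        System.out.println(emitNaive(unmapped, true).size());   // same token twice
        System.out.println(emitChecked(unmapped, true).size()); // single token
    }
}
```

With the naive version the unmapped term comes out twice, which is what would
double its term frequency; with the guard it comes out once, while the accented
term still yields both the folded and the original form.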