[jira] [Updated] (LUCENE-7468) ASCIIFoldingFilter should not emit duplicated tokens when preserve original is on

David Causse (JIRA) Wed, 28 Sep 2016 06:20:45 -0700

     [ https://issues.apache.org/jira/browse/LUCENE-7468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


David Causse updated LUCENE-7468:
---------------------------------
    Description: 
The ASCIIFoldingFilter seems to make the bold assumption that any token that 
contains a char outside the ASCII range will be folded.
The problem is that when preserve original is true, we capture and restore the 
state even if the token remains unmodified.
This doubles the term frequency for such words and probably wastes space when 
positions/offsets are stored in the postings.

  was:
The ASCIIFoldingFilter seems to make the bold assumption that any tokens that 
contain a char outside the ASCII range will be folded.
The problem is that when preserve original is true we capture and restore the 
state even the token remains unmodified.
This causes term frequencies to double for such words and probably extra space 
used when positions/offsets are stored in the postings.


> ASCIIFoldingFilter should not emit duplicated tokens when preserve original 
> is on
> ---------------------------------------------------------------------------------
>
>                 Key: LUCENE-7468
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7468
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 4.7, 5.x, trunk, 6.x
>            Reporter: David Causse
>
> The ASCIIFoldingFilter seems to make the bold assumption that any token that 
> contains a char outside the ASCII range will be folded.
> The problem is that when preserve original is true, we capture and restore the 
> state even if the token remains unmodified.
> This doubles the term frequency for such words and probably wastes space when 
> positions/offsets are stored in the postings.
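
The duplication described above can be illustrated with a small,
self-contained Java sketch. It does not use Lucene: fold(), emitReported() and
emitProposed() are hypothetical names, and fold() is only a stand-in (NFD
decomposition plus removal of combining marks) for the real folding table. The
sketch also assumes the folded term is emitted first and the restored original
second.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Simplified model of the reported behaviour; NOT the Lucene implementation.
class FoldingSketch {

    // Hypothetical stand-in for ASCII folding: decompose, then drop combining marks.
    static String fold(String term) {
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // True if the term contains at least one char outside the ASCII range.
    static boolean hasNonAscii(String term) {
        for (int i = 0; i < term.length(); i++) {
            if (term.charAt(i) >= 0x80) {
                return true;
            }
        }
        return false;
    }

    // Behaviour as described in the report (preserve original on):
    // any non-ASCII char triggers "preserve", whether or not folding changed the term.
    static List<String> emitReported(String term) {
        List<String> out = new ArrayList<>();
        if (!hasNonAscii(term)) {
            out.add(term);
            return out;
        }
        out.add(fold(term)); // folded (possibly unchanged) term
        out.add(term);       // original restored unconditionally
        return out;
    }

    // One possible remedy: preserve the original only if folding changed the term.
    static List<String> emitProposed(String term) {
        List<String> out = new ArrayList<>();
        String folded = fold(term);
        out.add(folded);
        if (!folded.equals(term)) {
            out.add(term);
        }
        return out;
    }

    public static void main(String[] args) {
        // "cafe", "café" (foldable), and two CJK chars (not foldable by this stand-in)
        String[] terms = {"cafe", "caf\u00e9", "\u65e5\u672c"};
        for (String term : terms) {
            System.out.println(term
                    + " reported=" + emitReported(term)
                    + " proposed=" + emitProposed(term));
        }
    }
}
```

For a term such as \u65e5\u672c, which contains non-ASCII chars but is left
unchanged by fold(), emitReported() returns the same term twice while
emitProposed() returns it once. Comparing the folded output with the original
before preserving it is only one way to avoid the duplicate; the actual fix
may differ.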



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
