[
https://issues.apache.org/jira/browse/LUCENE-7004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jean-Baptiste Lespiau updated LUCENE-7004:
------------------------------------------
Attachment: TEST-LUCENE-7004.PATCH
FIX-LUCENE-7004.PATCH
Here are the patches:
*TEST-LUCENE-7004.PATCH*: Adds more tests to prevent regressions. I was not
confident enough in the existing tests around the functionality I wanted to
modify.
*FIX-LUCENE-7004.PATCH*: Fixes the generation of the duplicate tokens, and changes
the few affected tests accordingly.
It also uses a new class for debugging introduced by LUCENE-7003.
Feel free to comment.
I don't know which patch format you require (just the diff, or also the
commit information). Just tell me if I need to change something.
I suspect that "!shouldGenerateParts(concatenation.type)" in the condition of
the flushConcatenation function in WDTK is semantically incorrect, but it
does not seem to produce any errors because it is dead code (i.e. always
false). Removing it does not change which tests pass. But that's another
matter.
PS: Thanks to whoever moved this into LUCENE.
> Duplicate tokens using WordDelimiterFilter for a specific configuration
> -----------------------------------------------------------------------
>
> Key: LUCENE-7004
> URL: https://issues.apache.org/jira/browse/LUCENE-7004
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Jean-Baptiste Lespiau
> Priority: Minor
> Attachments: FIX-LUCENE-7004.PATCH, TEST-LUCENE-7004.PATCH,
> wdf-analysis.png
>
>
> When using both the options PRESERVE_ORIGINAL|SPLIT_ON_CASE_CHANGE and
> CATENATE_ALL|CATENATE_WORDS with the WordDelimiterFilter, we get
> duplicate tokens on strings containing only case changes.
> When using the SPLIT_ON_CASE_CHANGE option, "abcDef" is split into "abc",
> "Def".
> With PRESERVE_ORIGINAL, we also keep "abcDef".
> However, when one uses CATENATE_ALL or CATENATE_WORDS, the filter also adds
> another token built from the concatenation of the split words, giving
> "abcDef" again.
> I'm not 100% certain that token filters should not produce duplicate tokens
> (same word, same start and end positions). Can someone confirm this is a bug?
> I supply a patch with a test exposing the incorrect behavior.
> I'm willing to work on fixing this in the following days.
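
To make the quoted description concrete, here is a minimal standalone sketch (plain Java, not actual Lucene code; the class and method names are hypothetical) that mimics what SPLIT_ON_CASE_CHANGE, PRESERVE_ORIGINAL, and CATENATE_ALL together do to "abcDef":

```java
import java.util.ArrayList;
import java.util.List;

public class DuplicateTokenDemo {
    // Split a token at lower-to-upper case changes (mimics SPLIT_ON_CASE_CHANGE).
    static List<String> splitOnCaseChange(String s) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < s.length(); i++) {
            if (Character.isLowerCase(s.charAt(i - 1))
                    && Character.isUpperCase(s.charAt(i))) {
                parts.add(s.substring(start, i));
                start = i;
            }
        }
        parts.add(s.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        String input = "abcDef";
        List<String> tokens = new ArrayList<>(splitOnCaseChange(input)); // "abc", "Def"
        tokens.add(input);                                  // PRESERVE_ORIGINAL keeps "abcDef"
        tokens.add(String.join("", splitOnCaseChange(input))); // CATENATE_ALL re-adds "abcDef"
        System.out.println(tokens); // the catenated token duplicates the preserved original
    }
}
```

Since the whole input is a single run of word parts separated only by a case change, the catenation of all parts is byte-for-byte identical to the preserved original, hence the duplicate.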
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]