[
https://issues.apache.org/jira/browse/LUCENE-7004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jean-Baptiste Lespiau updated LUCENE-7004:
------------------------------------------
Attachment: TEST-LUCENE-7004.PATCH
FIX-LUCENE-7004.PATCH
Here are the patches:
*TEST-LUCENE-7004.PATCH*: Adds more tests to prevent regressions. I was not
confident enough in the existing tests around the functionality I wanted to
modify.
*FIX-LUCENE-7004.PATCH*: Fixes the generation of the duplicate tokens, and changes
the few affected tests accordingly.
It also uses a new class for debugging introduced by LUCENE-7003.
Feel free to comment.
I don't know which patch format you require (just the diff, or also the
commit information). Just tell me if I need to change something.
I suspect that "!shouldGenerateParts(concatenation.type)" in the condition of
the flushConcatenation function in WDTK is semantically incorrect, but it
does not seem to produce any errors because it is dead code (i.e. always
false). Removing it does not change which tests pass. But that's another
matter.
PS: Thanks to whoever moved this into LUCENE.
> Duplicate tokens using WordDelimiterFilter for a specific configuration
> -----------------------------------------------------------------------
>
> Key: LUCENE-7004
> URL: https://issues.apache.org/jira/browse/LUCENE-7004
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Jean-Baptiste Lespiau
> Priority: Minor
> Attachments: FIX-LUCENE-7004.PATCH, TEST-LUCENE-7004.PATCH,
> wdf-analysis.png
>
>
> When using both the options PRESERVE_ORIGINAL|SPLIT_ON_CASE_CHANGE and
> CATENATE_ALL|CATENATE_WORDS with the WordDelimiterFilter, we get
> duplicate tokens on strings containing only case changes.
> When using the SPLIT_ON_CASE_CHANGE option, "abcDef" is split into "abc",
> "Def".
> With PRESERVE_ORIGINAL, we also keep "abcDef".
> However, when one uses CATENATE_ALL or CATENATE_WORDS, the filter also adds
> another token built from the concatenation of the split words, giving
> "abcDef" again.
> I'm not 100% certain that token filters should not produce duplicate tokens
> (same word, same start and end positions). Can someone confirm this is a bug?
> I supply a patch with a test exposing the incorrect behavior.
> I'm willing to work on fixing this in the following days.
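
To make the quoted description concrete, here is a minimal standalone sketch (plain Java, not actual Lucene code; the class and method names are hypothetical) that mimics what SPLIT_ON_CASE_CHANGE, PRESERVE_ORIGINAL, and CATENATE_ALL together do to "abcDef":

```java
import java.util.ArrayList;
import java.util.List;

public class DuplicateTokenDemo {
    // Split a token at lower-to-upper case changes (mimics SPLIT_ON_CASE_CHANGE).
    static List<String> splitOnCaseChange(String s) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < s.length(); i++) {
            if (Character.isLowerCase(s.charAt(i - 1))
                    && Character.isUpperCase(s.charAt(i))) {
                parts.add(s.substring(start, i));
                start = i;
            }
        }
        parts.add(s.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        String input = "abcDef";
        List<String> tokens = new ArrayList<>(splitOnCaseChange(input)); // "abc", "Def"
        tokens.add(input);                                  // PRESERVE_ORIGINAL keeps "abcDef"
        tokens.add(String.join("", splitOnCaseChange(input))); // CATENATE_ALL re-adds "abcDef"
        System.out.println(tokens); // the catenated token duplicates the preserved original
    }
}
```

Since the whole input is a single run of word parts separated only by a case change, the catenation of all parts is byte-for-byte identical to the preserved original, hence the duplicate.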
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]