[
https://issues.apache.org/jira/browse/LUCENE-5269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792429#comment-13792429
]
Robert Muir commented on LUCENE-5269:
-------------------------------------
{quote}
This is so crazy! Why did we never hit this combination before?
{quote}
This combination is especially good at finding the bug. Here's why:
{code}
Tokenizer tokenizer = new EdgeNGramTokenizer(TEST_VERSION_CURRENT, reader, 2, 94);
TokenStream stream = new ShingleFilter(tokenizer, 5);
stream = new NGramTokenFilter(TEST_VERSION_CURRENT, stream, 55, 83);
{code}
The edge n-gram tokenizer has min=2 and max=94, so it's basically brute-forcing
every token size. Then the shingle filter makes tons of tokens with
positionIncrement=0. That makes it easy for the (previously buggy)
NGramTokenFilter, which had the wrong length filter, to misclassify tokens with
its logic that expects codepoints and emit an initial token with posinc=0:
{code}
if ((curPos + curGramSize) <= curCodePointCount) {
...
posIncAtt.setPositionIncrement(curPosInc);
{code}
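To illustrate the kind of mismatch described above, here is a minimal plain-Java sketch with no Lucene dependency. It assumes the disagreement was between a length check counting UTF-16 chars and gram logic counting codepoints, which the comment implies but does not spell out. The class and helper names are hypothetical and are not part of Lucene.

```java
class CodePointLengthDemo {
    // Hypothetical stand-in for a length check that counts UTF-16 chars.
    public static boolean passesCharLengthCheck(String term, int minGram) {
        return term.length() >= minGram;
    }

    // Hypothetical stand-in for a length check that counts Unicode codepoints,
    // matching logic like (curPos + curGramSize) <= curCodePointCount.
    public static boolean passesCodePointCheck(String term, int minGram) {
        return term.codePointCount(0, term.length()) >= minGram;
    }

    public static void main(String[] args) {
        // Two supplementary characters (U+1D11E twice); each one is stored
        // as a surrogate pair, so the term has 4 chars but only 2 codepoints.
        String term = "\uD834\uDD1E\uD834\uDD1E";
        int minGram = 3;

        System.out.println(term.length());                         // 4
        System.out.println(term.codePointCount(0, term.length())); // 2

        // The char-based check lets the term through...
        System.out.println(passesCharLengthCheck(term, minGram));  // true
        // ...but a codepoint-based consumer finds it too short for any gram.
        System.out.println(passesCodePointCheck(term, minGram));   // false
    }
}
```

A term like this slips past a char-based check and still produces no gram of the minimum size, so the two stages disagree about whether the token counts. Random test input that contains supplementary characters, combined with every possible prefix length, makes such terms likely to show up.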
> TestRandomChains failure
> ------------------------
>
> Key: LUCENE-5269
> URL: https://issues.apache.org/jira/browse/LUCENE-5269
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Robert Muir
> Fix For: 4.5.1, 4.6, 5.0
>
> Attachments: LUCENE-5269.patch, LUCENE-5269.patch, LUCENE-5269.patch,
> LUCENE-5269_test.patch, LUCENE-5269_test.patch, LUCENE-5269_test.patch
>
>
> One of EdgeNGramTokenizer, ShingleFilter, NGramTokenFilter is buggy, or
> possibly only the combination of them conspiring together.