[
https://issues.apache.org/jira/browse/LUCENE-8092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292560#comment-16292560
]
Alan Woodward commented on LUCENE-8092:
---------------------------------------
{code}
SF:ShingleFilter@2d06f345 term=ᇻᆑᅌ,bytes=[e1 87 bb e1 86 91 e1 85 8c],startOffset=0,endOffset=3,positionIncrement=1,positionLength=1,type=<HANGUL>,termFrequency=1
CJ:CJKBigramFilter@20d01b2b term=ᇻᆑ,bytes=[e1 87 bb e1 86 91],startOffset=0,endOffset=2,positionIncrement=1,positionLength=1,type=<DOUBLE>,termFrequency=1
CJ:CJKBigramFilter@20d01b2b term=ᆑᅌ,bytes=[e1 86 91 e1 85 8c],startOffset=1,endOffset=3,positionIncrement=1,positionLength=1,type=<DOUBLE>,termFrequency=1
SF:ShingleFilter@2d06f345 term=ᇻᆑᅌ IacUTe,bytes=[e1 87 bb e1 86 91 e1 85 8c 20 49 61 63 55 54 65],startOffset=0,endOffset=14,positionIncrement=0,positionLength=2,type=shingle,termFrequency=1
{code}
This happens because both ShingleFilter and CJKBigramFilter emit multiple
tokens with adjusted offsets. The CJKBigramFilter splits the first unigram
emitted by ShingleFilter into two tokens, the second of which has an increased
start offset (1). The ShingleFilter then emits a shingle combining the first
two terms, with its start offset set back to 0, so the offsets go backwards.

I'm not sure of the best way to fix this. Maybe both ShingleFilter and
CJKBigramFilter need to cache their inputs and check that the underlying
TokenStream has moved on before they emit bigrams? [~thetaphi] what do you
think?
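
To make the offset problem concrete, here is a minimal standalone sketch that replays the start offsets from the log above and looks for the first one that goes backwards. It does not use Lucene at all: the OffsetOrderCheck class, the Tok holder and the firstBackwardsOffset helper are hypothetical names made up for this illustration, and the sketch assumes the shingle token passes through CJKBigramFilter unchanged, which the log does not show.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Lucene.
final class OffsetOrderCheck {

    /** Minimal stand-in for a token: the term text plus its offsets. */
    static final class Tok {
        final String term;
        final int startOffset;
        final int endOffset;

        Tok(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /**
     * Returns the index of the first token whose startOffset is smaller
     * than that of the token before it, or -1 if offsets never go backwards.
     */
    static int firstBackwardsOffset(List<Tok> tokens) {
        int lastStart = 0;
        for (int i = 0; i < tokens.size(); i++) {
            int start = tokens.get(i).startOffset;
            if (start < lastStart) {
                return i;
            }
            lastStart = start;
        }
        return -1;
    }

    /**
     * The token sequence assumed to come out of the chain, using the
     * offsets from the log: two CJK bigrams, then the shingle.
     */
    static List<Tok> tokensFromLog() {
        List<Tok> tokens = new ArrayList<>();
        tokens.add(new Tok("\u11FB\u1191", 0, 2));              // first bigram
        tokens.add(new Tok("\u1191\u114C", 1, 3));              // second bigram
        tokens.add(new Tok("\u11FB\u1191\u114C IacUTe", 0, 14)); // shingle
        return tokens;
    }

    public static void main(String[] args) {
        List<Tok> tokens = tokensFromLog();
        int bad = firstBackwardsOffset(tokens);
        if (bad < 0) {
            System.out.println("start offsets are in order");
        } else {
            System.out.println("startOffset goes backwards at token " + bad
                + ": " + tokens.get(bad - 1).startOffset
                + " -> " + tokens.get(bad).startOffset);
        }
    }
}
```

The sketch only models the ordering of start offsets; position increments and position lengths are ignored.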
> TestRandomChains failure
> ------------------------
>
> Key: LUCENE-8092
> URL: https://issues.apache.org/jira/browse/LUCENE-8092
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Alan Woodward
>
> https://builds.apache.org/job/Lucene-Solr-NightlyTests-7.2/1/
> ant test -Dtestcase=TestRandomChains -Dtests.method=testRandomChains
> -Dtests.seed=C006DAD2E1FC77AF -Dtests.multiplier=2 -Dtests.nightly=true
> -Dtests.slow=true
> -Dtests.linedocsfile=/Users/romseygeek/projects/lucene-test-data/enwiki.random.lines.txt
> -Dtests.locale=tr -Dtests.timezone=Europe/Simferopol -Dtests.asserts=true
> -Dtests.file.encoding=UTF-8
> Reproduces locally on 7.2