[ 
https://issues.apache.org/jira/browse/LUCENE-8092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292560#comment-16292560
 ] 

Alan Woodward commented on LUCENE-8092:
---------------------------------------

{code}
SF:ShingleFilter@2d06f345 term=ᇻᆑᅌ,bytes=[e1 87 bb e1 86 91 e1 85 8c],startOffset=0,endOffset=3,positionIncrement=1,positionLength=1,type=<HANGUL>,termFrequency=1
CJ:CJKBigramFilter@20d01b2b term=ᇻᆑ,bytes=[e1 87 bb e1 86 91],startOffset=0,endOffset=2,positionIncrement=1,positionLength=1,type=<DOUBLE>,termFrequency=1
CJ:CJKBigramFilter@20d01b2b term=ᆑᅌ,bytes=[e1 86 91 e1 85 8c],startOffset=1,endOffset=3,positionIncrement=1,positionLength=1,type=<DOUBLE>,termFrequency=1
SF:ShingleFilter@2d06f345 term=ᇻᆑᅌ IacUTe,bytes=[e1 87 bb e1 86 91 e1 85 8c 20 49 61 63 55 54 65],startOffset=0,endOffset=14,positionIncrement=0,positionLength=2,type=shingle,termFrequency=1
{code}

This happens because both ShingleFilter and CJKBigramFilter emit multiple
tokens with adjusted offsets.  The CJKBigramFilter splits the first unigram
emitted by ShingleFilter into two tokens, the second of which has an
increased start offset (1).  The ShingleFilter then emits a shingle combining
the first two terms, with its start offset set back to 0, so the offsets go
backwards.
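
To make the offset problem concrete, here is a minimal sketch in plain Java (no Lucene dependency) that replays the start offsets from the log above and finds the first token whose start offset is lower than its predecessor's. The `Tok` class, `firstBackwardsOffset` and `tokensFromLog` are hypothetical names for illustration only, and the sketch assumes the shingle passes through CJKBigramFilter unchanged, so the final stream is the two bigrams followed by the shingle.

```java
import java.util.Arrays;
import java.util.List;

public class OffsetOrderCheck {

    // Hypothetical stand-in for a token's offsets; this is not a Lucene class.
    static final class Tok {
        final String source;
        final int startOffset;
        final int endOffset;

        Tok(String source, int startOffset, int endOffset) {
            this.source = source;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // Returns the index of the first token whose startOffset is lower than
    // the previous token's startOffset, or -1 if offsets never go backwards.
    static int firstBackwardsOffset(List<Tok> tokens) {
        int lastStart = 0;
        for (int i = 0; i < tokens.size(); i++) {
            int start = tokens.get(i).startOffset;
            if (start < lastStart) {
                return i;
            }
            lastStart = start;
        }
        return -1;
    }

    // Offsets copied from the log, in the order assumed to leave the chain:
    // two bigrams from CJKBigramFilter, then the shingle from ShingleFilter.
    static List<Tok> tokensFromLog() {
        return Arrays.asList(
            new Tok("CJKBigramFilter", 0, 2),
            new Tok("CJKBigramFilter", 1, 3),
            new Tok("ShingleFilter", 0, 14));
    }

    public static void main(String[] args) {
        int bad = firstBackwardsOffset(tokensFromLog());
        System.out.println("first backwards offset at token index: " + bad);
    }
}
```

Under that assumption the check flags the third token (index 2), the shingle, whose start offset of 0 follows a bigram starting at 1.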

I'm not sure of the best way to fix this.  Maybe both ShingleFilter and 
CJKBigramFilter need to cache their inputs and check that the underlying 
TokenStream has moved on before they emit bigrams?  [~thetaphi] what do you 
think?



> TestRandomChains failure
> ------------------------
>
>                 Key: LUCENE-8092
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8092
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Alan Woodward
>
> https://builds.apache.org/job/Lucene-Solr-NightlyTests-7.2/1/
> ant test  -Dtestcase=TestRandomChains -Dtests.method=testRandomChains 
> -Dtests.seed=C006DAD2E1FC77AF -Dtests.multiplier=2 -Dtests.nightly=true 
> -Dtests.slow=true 
> -Dtests.linedocsfile=/Users/romseygeek/projects/lucene-test-data/enwiki.random.lines.txt
>  -Dtests.locale=tr -Dtests.timezone=Europe/Simferopol -Dtests.asserts=true 
> -Dtests.file.encoding=UTF-8
> Reproduces locally on 7.2



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
