[ 
https://issues.apache.org/jira/browse/LUCENE-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16690125#comment-16690125
 ] 

Mike Sokolov commented on LUCENE-8517:
--------------------------------------

Well I can at least add a bit more color. I added some debugging support to 
ValidatingTokenFilter to capture the tokens seen so they can be dumped out 
after. This is the complete history at the moment the exception was thrown, 
showing for each stage, which tokens it emitted so far. The numbers in square 
brackets are offsets, and the number following "+" is position increment. The 
issue here is in stage 3 (the conditional ShingleFilter), the overlapping 
tokens have different offsets. The ShingleFilter has shingle size=4 in this 
case.

stage 0: best<[0-4] +1> effort<[5-11] +1>
 stage 1: best<[0-4] +1> effort<[5-11] +1> ef<[5-11] +0> effort<[5-11] +0>
 stage 2: best<[0-4] +1> effort<[5-11] +1> ef<[5-11] +0> effort<[5-11] +0>
 stage 3: best<[0-4] +1> effort<[5-11] +0>

Anyway, the problem seems to arise in a conditional FixedShingleFilter that 
follows a conditional MockGraphTokenFilter. I see a comment in TestRandomChains 
to the effect that we don't test conditionals with ShingleFilter due to it not 
handling input graphs properly (LUCENE-4170). I wonder if the same logic ought 
to be applied to FIxedShingleFilter.

Also, I could post a patch with the debug output if it seems useful. It was at 
least somewhat helpful in getting a picture of this for me.

> TestRandomChains.testRandomChainsWithLargeStrings failure
> ---------------------------------------------------------
>
>                 Key: LUCENE-8517
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8517
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>            Reporter: Steve Rowe
>            Priority: Major
>
> From 
> [https://jenkins.thetaphi.de/job/Lucene-Solr-7.x-Linux/2828/consoleText], 
> reproduces for me on Java8:
> {noformat}
> Checking out Revision 216f10026b86627750e133fe24ce6a750c470695 
> (refs/remotes/origin/branch_7x)
> [...]
> [java-info] java version "10.0.1"
> [java-info] OpenJDK Runtime Environment (10.0.1+10, Oracle Corporation)
> [java-info] OpenJDK 64-Bit Server VM (10.0.1+10, Oracle Corporation)
> [java-info] Test args: [-XX:-UseCompressedOops -XX:+UseConcMarkSweepGC]
> [...]
>    [junit4] Suite: org.apache.lucene.analysis.core.TestRandomChains
>    [junit4]   2> Exception from random analyzer: 
>    [junit4]   2> charfilters=
>    [junit4]   2>   
> org.apache.lucene.analysis.charfilter.MappingCharFilter(org.apache.lucene.analysis.charfilter.NormalizeCharMap@3ef95503,
>  java.io.StringReader@70dde633)
>    [junit4]   2>   
> org.apache.lucene.analysis.fa.PersianCharFilter(org.apache.lucene.analysis.charfilter.MappingCharFilter@12423b20)
>    [junit4]   2> tokenizer=
>    [junit4]   2>   org.apache.lucene.analysis.th.ThaiTokenizer()
>    [junit4]   2> filters=
>    [junit4]   2>   
> org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter(ValidatingTokenFilter@7914bba7
>  
> term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,
>  org.apache.lucene.analysis.compound.hyphenation.HyphenationTree@abd7bca)
>    [junit4]   2>   
> Conditional:org.apache.lucene.analysis.MockGraphTokenFilter(java.util.Random@56348091,
>  OneTimeWrapper@aa1c073 
> term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1)
>    [junit4]   2>   
> Conditional:org.apache.lucene.analysis.shingle.FixedShingleFilter(OneTimeWrapper@4cf58fce
>  
> term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,
>  4, <NUM>, <SOUTHEAST_ASIAN>)
>    [junit4]   2>   
> org.apache.lucene.analysis.pt.PortugueseLightStemFilter(ValidatingTokenFilter@3a915324
>  
> term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1,keyword=false)
>    [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestRandomChains 
> -Dtests.method=testRandomChainsWithLargeStrings -Dtests.seed=92344C536D4E00F4 
> -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=en-ZW 
> -Dtests.timezone=Atlantic/Faroe -Dtests.asserts=true 
> -Dtests.file.encoding=US-ASCII
>    [junit4] ERROR   0.46s J2 | 
> TestRandomChains.testRandomChainsWithLargeStrings <<<
>    [junit4]    > Throwable #1: java.lang.IllegalStateException: stage 3: 
> inconsistent startOffset at pos=0: 0 vs 5; token=effort
>    [junit4]    >      at 
> __randomizedtesting.SeedInfo.seed([92344C536D4E00F4:F86FF34234002007]:0)
>    [junit4]    >      at 
> org.apache.lucene.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:109)
>    [junit4]    >      at 
> org.apache.lucene.analysis.pt.PortugueseLightStemFilter.incrementToken(PortugueseLightStemFilter.java:48)
>    [junit4]    >      at 
> org.apache.lucene.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:68)
>    [junit4]    >      at 
> org.apache.lucene.analysis.BaseTokenStreamTestCase.checkResetException(BaseTokenStreamTestCase.java:441)
>    [junit4]    >      at 
> org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:546)
>    [junit4]    >      at 
> org.apache.lucene.analysis.core.TestRandomChains.testRandomChainsWithLargeStrings(TestRandomChains.java:897)
>    [junit4]    >      at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>    [junit4]    >      at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>    [junit4]    >      at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>    [junit4]    >      at 
> java.base/java.lang.reflect.Method.invoke(Method.java:564)
>    [junit4]    >      at java.base/java.lang.Thread.run(Thread.java:844)
>    [junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): 
> {dummy=TestBloomFilteredLucenePostings(BloomFilteringPostingsFormat(Lucene50(blocksize=128)))},
>  docValues:{}, maxPointsInLeafNode=214, maxMBSortInHeap=5.729405811878087, 
> sim=RandomSimilarity(queryNorm=true): {}, locale=en-ZW, 
> timezone=Atlantic/Faroe
>    [junit4]   2> NOTE: Linux 4.15.0-32-generic amd64/Oracle Corporation 
> 10.0.1 (64-bit)/cpus=8,threads=1,free=266844648,total=518979584
>    [junit4]   2> NOTE: All tests run in this JVM: [TestOptionalCondition, 
> TestSerbianNormalizationRegularFilter, TestCommonGramsFilterFactory, 
> TestDoubleEscape, TestDictionaryCompoundWordTokenFilterFactory, 
> TestNorwegianMinimalStemFilter, TestCzechStemmer, 
> TestTurkishLowerCaseFilterFactory, TestAnalyzers, 
> TestScandinavianFoldingFilterFactory, TestReversePathHierarchyTokenizer, 
> TestSimplePatternTokenizer, TestGalicianMinimalStemFilter, MinHashFilterTest, 
> TestPortugueseStemFilterFactory, TestPersianCharFilter, 
> TestPerFieldAnalyzerWrapper, TestRandomChains]
>    [junit4] Completed [73/291 (1!)] on J2 in 3.12s, 2 tests, 1 error <<< 
> FAILURES!
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to