[ https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15883363#comment-15883363 ]
Steve Rowe edited comment on LUCENE-7708 at 2/24/17 7:17 PM:
-------------------------------------------------------------

I'm beasting 1000 iterations of TestRandomChains with the patch, and run 110 found the following reproducing seed - maybe it's ShingleFilter's fault? (I didn't investigate further):

{noformat}
[junit4] Suite: org.apache.lucene.analysis.core.TestRandomChains
[junit4] 2> TEST FAIL: useCharFilter=false text='\ufac4\u0552H \ua954\ua944 \ud0d2\uaddd\ub6cb\uc388\uc344\uca88\ud224\uc462\uaf42 g '
[junit4] 2> Exception from random analyzer:
[junit4] 2> charfilters=
[junit4] 2> org.apache.lucene.analysis.charfilter.HTMLStripCharFilter(java.io.StringReader@3fb9d00e, [<HOST>, <HANGUL>, <IDEOGRAPHIC>, <SOUTHEAST_ASIAN>])
[junit4] 2> tokenizer=
[junit4] 2> org.apache.lucene.analysis.standard.StandardTokenizer(org.apache.lucene.util.AttributeFactory$1@c893af9b)
[junit4] 2> filters=
[junit4] 2> org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter(ValidatingTokenFilter@7e1e9fe2 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,payload=null,keyword=false)
[junit4] 2> org.apache.lucene.analysis.cjk.CJKBigramFilter(ValidatingTokenFilter@12c3fb1b term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,payload=null,keyword=false)
[junit4] 2> org.apache.lucene.analysis.shingle.ShingleFilter(ValidatingTokenFilter@31c463b5 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,payload=null,keyword=false, 49)
[junit4] 2> org.apache.lucene.analysis.in.IndicNormalizationFilter(ValidatingTokenFilter@3f72787 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,payload=null,keyword=false)
[junit4] 2> offsetsAreCorrect=false
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestRandomChains -Dtests.method=testRandomChains -Dtests.seed=E532502212098AC7 -Dtests.slow=true -Dtests.locale=ko-KR -Dtests.timezone=Atlantic/Jan_Mayen -Dtests.asserts=true -Dtests.file.encoding=UTF-8
[junit4] ERROR 0.76s | TestRandomChains.testRandomChains <<<
[junit4] > Throwable #1: java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset; got startOffset=10,endOffset=9
[junit4] > at __randomizedtesting.SeedInfo.seed([E532502212098AC7:D8D37943551B9707]:0)
[junit4] > at org.apache.lucene.analysis.tokenattributes.PackedTokenAttributeImpl.setOffset(PackedTokenAttributeImpl.java:110)
[junit4] > at org.apache.lucene.analysis.shingle.ShingleFilter.incrementToken(ShingleFilter.java:345)
[junit4] > at org.apache.lucene.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:67)
[junit4] > at org.apache.lucene.analysis.in.IndicNormalizationFilter.incrementToken(IndicNormalizationFilter.java:40)
[junit4] > at org.apache.lucene.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:67)
[junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:731)
[junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
[junit4] > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:540)
[junit4] > at org.apache.lucene.analysis.core.TestRandomChains.testRandomChains(TestRandomChains.java:853)
[junit4] > at java.lang.Thread.run(Thread.java:745)
[junit4] OK 1.64s | TestRandomChains.testRandomChainsWithLargeStrings
[junit4] 2> NOTE: test params are: codec=Asserting(Lucene70): {dummy=PostingsFormat(name=LuceneVarGapFixedInterval)}, docValues:{}, maxPointsInLeafNode=542, maxMBSortInHeap=7.773738401752009, sim=RandomSimilarity(queryNorm=false): {}, locale=ko-KR, timezone=Atlantic/Jan_Mayen
[junit4] 2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation 1.8.0_77 (64-bit)/cpus=16,threads=1,free=400845920,total=514850816
[junit4] 2> NOTE: All tests run in this JVM: [TestRandomChains]
[junit4] Completed [1/1 (1!)] in 6.03s, 2 tests, 1 error <<< FAILURES!
{noformat}

> Track PositionLengthAttribute abuse
> -----------------------------------
>
> Key: LUCENE-7708
> URL: https://issues.apache.org/jira/browse/LUCENE-7708
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/queryparser, modules/analysis
> Reporter: Jim Ferenczi
> Attachments: LUCENE-7708.patch, LUCENE-7708.patch
>
>
> Some token filters use the position length attribute of the token stream to
> encode the number of terms they put in a single token.
> This breaks query parsing because it creates a disconnected graph.
> I've tracked down the abusive cases to 2 candidates:
> * ShingleFilter, which sets the position length attribute to the length of the
> shingle.
> * CJKBigramFilter, which always sets the position length attribute to 2.
> I don't think these filters should set the position length at all, so the best
> fix would be to remove the attribute from these token filters, but this could
> break BWC.
> Still, this is a serious bug, since shingles and CJK bigrams now produce
> invalid queries.
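
To make the "disconnected graph" in the issue description concrete, here is a minimal sketch in plain Java with no Lucene dependency. Everything in it is hypothetical and for illustration only: the Tok class, the bigrams helper and the isConnected check are not Lucene APIs. It assumes the usual reading of the attributes, where a token at position p with position length n is an arc from node p to node p + n, and it assumes (as the issue states) that each bigram is emitted with position increment 1 and position length 2.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PositionLengthSketch {

    /** Hypothetical stand-in for a token's term and position attributes; not a Lucene class. */
    static final class Tok {
        final String term;
        final int posInc;
        final int posLen;

        Tok(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** Emits overlapping bigrams, each with position increment 1 and the given position length. */
    static List<Tok> bigrams(String[] chars, int posLen) {
        List<Tok> out = new ArrayList<>();
        for (int i = 0; i + 1 < chars.length; i++) {
            out.add(new Tok(chars[i] + chars[i + 1], 1, posLen));
        }
        return out;
    }

    /** Returns the set of nodes reachable from origin by following arcs src[i] -> dst[i]. */
    static Set<Integer> reach(int[] src, int[] dst, int origin) {
        Set<Integer> seen = new HashSet<>();
        seen.add(origin);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int i = 0; i < src.length; i++) {
                if (seen.contains(src[i]) && seen.add(dst[i])) {
                    changed = true;
                }
            }
        }
        return seen;
    }

    /** True if every token arc lies on some path from the first node to the last node. */
    static boolean isConnected(List<Tok> tokens) {
        int n = tokens.size();
        int[] from = new int[n];
        int[] to = new int[n];
        int pos = -1;
        int start = Integer.MAX_VALUE;
        int end = -1;
        for (int i = 0; i < n; i++) {
            Tok t = tokens.get(i);
            pos += t.posInc;            // position of this token
            from[i] = pos;              // arc starts at the token's position
            to[i] = pos + t.posLen;     // and spans posLen positions
            start = Math.min(start, from[i]);
            end = Math.max(end, to[i]);
        }
        Set<Integer> forward = reach(from, to, start);   // nodes reachable from the start
        Set<Integer> backward = reach(to, from, end);    // nodes that can reach the end
        for (int i = 0; i < n; i++) {
            if (!forward.contains(from[i]) || !backward.contains(to[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] chars = {"A", "B", "C", "D"};
        // Arcs 0->2, 1->3, 2->4: "BC" starts at node 1, which nothing leads to.
        System.out.println(isConnected(bigrams(chars, 2))); // false
        // Arcs 0->1, 1->2, 2->3: a simple chain.
        System.out.println(isConnected(bigrams(chars, 1))); // true
    }
}
```

With position length 2 the middle bigram is unreachable from the start of the stream, which is the kind of graph a graph-aware query parser cannot turn into a sensible query; with position length 1 the same tokens form a plain chain.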