[
https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15883363#comment-15883363
]
Steve Rowe edited comment on LUCENE-7708 at 2/24/17 7:36 PM:
-------------------------------------------------------------
I'm beasting 1000 iterations of TestRandomChains with the patch, and run 110
found the following reproducing seed - maybe it's ShingleFilter's fault? (I
didn't investigate further):
*edit*: this seed fails on unpatched master, so the patch on this issue isn't
to blame. I'll create a different issue.
{noformat}
[junit4] Suite: org.apache.lucene.analysis.core.TestRandomChains
[junit4] 2> TEST FAIL: useCharFilter=false text='\ufac4\u0552H
\ua954\ua944 \ud0d2\uaddd\ub6cb\uc388\uc344\uca88\ud224\uc462\uaf42 g '
[junit4] 2> Exception from random analyzer:
[junit4] 2> charfilters=
[junit4] 2>
org.apache.lucene.analysis.charfilter.HTMLStripCharFilter(java.io.StringReader@3fb9d00e,
[<HOST>, <HANGUL>, <IDEOGRAPHIC>, <SOUTHEAST_ASIAN>])
[junit4] 2> tokenizer=
[junit4] 2>
org.apache.lucene.analysis.standard.StandardTokenizer(org.apache.lucene.util.AttributeFactory$1@c893af9b)
[junit4] 2> filters=
[junit4] 2>
org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter(ValidatingTokenFilter@7e1e9fe2
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,payload=null,keyword=false)
[junit4] 2>
org.apache.lucene.analysis.cjk.CJKBigramFilter(ValidatingTokenFilter@12c3fb1b
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,payload=null,keyword=false)
[junit4] 2>
org.apache.lucene.analysis.shingle.ShingleFilter(ValidatingTokenFilter@31c463b5
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,payload=null,keyword=false,
49)
[junit4] 2>
org.apache.lucene.analysis.in.IndicNormalizationFilter(ValidatingTokenFilter@3f72787
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,payload=null,keyword=false)
[junit4] 2> offsetsAreCorrect=false
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestRandomChains
-Dtests.method=testRandomChains -Dtests.seed=E532502212098AC7 -Dtests.slow=true
-Dtests.locale=ko-KR -Dtests.timezone=Atlantic/Jan_Mayen -Dtests.asserts=true
-Dtests.file.encoding=UTF-8
[junit4] ERROR 0.76s | TestRandomChains.testRandomChains <<<
[junit4] > Throwable #1: java.lang.IllegalArgumentException: startOffset
must be non-negative, and endOffset must be >= startOffset; got
startOffset=10,endOffset=9
[junit4] > at
__randomizedtesting.SeedInfo.seed([E532502212098AC7:D8D37943551B9707]:0)
[junit4] > at
org.apache.lucene.analysis.tokenattributes.PackedTokenAttributeImpl.setOffset(PackedTokenAttributeImpl.java:110)
[junit4] > at
org.apache.lucene.analysis.shingle.ShingleFilter.incrementToken(ShingleFilter.java:345)
[junit4] > at
org.apache.lucene.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:67)
[junit4] > at
org.apache.lucene.analysis.in.IndicNormalizationFilter.incrementToken(IndicNormalizationFilter.java:40)
[junit4] > at
org.apache.lucene.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:67)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:731)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
[junit4] > at
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:540)
[junit4] > at
org.apache.lucene.analysis.core.TestRandomChains.testRandomChains(TestRandomChains.java:853)
[junit4] > at java.lang.Thread.run(Thread.java:745)
[junit4] OK 1.64s | TestRandomChains.testRandomChainsWithLargeStrings
[junit4] 2> NOTE: test params are: codec=Asserting(Lucene70):
{dummy=PostingsFormat(name=LuceneVarGapFixedInterval)}, docValues:{},
maxPointsInLeafNode=542, maxMBSortInHeap=7.773738401752009,
sim=RandomSimilarity(queryNorm=false): {}, locale=ko-KR,
timezone=Atlantic/Jan_Mayen
[junit4] 2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation
1.8.0_77 (64-bit)/cpus=16,threads=1,free=400845920,total=514850816
[junit4] 2> NOTE: All tests run in this JVM: [TestRandomChains]
[junit4] Completed [1/1 (1!)] in 6.03s, 2 tests, 1 error <<< FAILURES!
{noformat}
> Track PositionLengthAttribute abuse
> -----------------------------------
>
> Key: LUCENE-7708
> URL: https://issues.apache.org/jira/browse/LUCENE-7708
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/queryparser, modules/analysis
> Reporter: Jim Ferenczi
> Attachments: LUCENE-7708.patch, LUCENE-7708.patch
>
>
> Some token filters use the position length attribute of the token stream to
> encode the number of terms they put in a single token.
> This breaks query parsing because it creates a disconnected graph.
> I've tracked down the abusive cases to 2 candidates:
> * ShingleFilter, which sets the position length attribute to the length of the
> shingle.
> * CJKBigramFilter, which always sets the position length attribute to 2.
> I don't think these filters should set the position length at all, so the best
> fix would be to remove the attribute from these token filters, but this could
> break BWC.
> Still, this is a serious bug, since shingles and CJK bigrams now produce
> invalid queries.
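
To illustrate why a position length that counts terms can disconnect the token graph, here is a minimal standalone sketch. It does not use Lucene; the class PositionGraphSketch, the Tok type, and the two sample streams are hypothetical and only model positionIncrement and positionLength as plain integers. Each token is treated as an arc from its start position to start + positionLength, and a token is reported when its start position cannot be reached from the first token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PositionGraphSketch {

    /** Hypothetical stand-in for the position attributes of one token. */
    static final class Tok {
        final String term;
        final int posInc; // models positionIncrement
        final int posLen; // models positionLength

        Tok(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /**
     * Returns the terms of tokens whose start position is not reachable
     * from the first token by following token arcs (start -> start + posLen).
     * Assumes posInc >= 0 and posLen >= 1, so a single in-order pass suffices.
     */
    static List<String> unreachableTerms(List<Tok> tokens) {
        List<String> unreachable = new ArrayList<>();
        Set<Integer> reachable = new HashSet<>();
        int pos = -1;
        for (Tok t : tokens) {
            pos += t.posInc;
            if (reachable.isEmpty()) {
                reachable.add(pos); // the first token defines the start node
            }
            if (reachable.contains(pos)) {
                reachable.add(pos + t.posLen);
            } else {
                unreachable.add(t.term);
            }
        }
        return unreachable;
    }

    public static void main(String[] args) {
        // Hypothetical stream: two bigram-like tokens, each claiming positionLength=2.
        List<Tok> termCountAsLength = Arrays.asList(
            new Tok("a b", 1, 2), new Tok("b c", 1, 2));
        // Same tokens with positionLength=1.
        List<Tok> lengthOne = Arrays.asList(
            new Tok("a b", 1, 1), new Tok("b c", 1, 1));

        System.out.println(unreachableTerms(termCountAsLength)); // prints [b c]
        System.out.println(unreachableTerms(lengthOne));         // prints []
    }
}
```

In the first stream, "a b" spans positions 0 to 2 while "b c" starts at position 1, which no arc leads to. That is the kind of disconnected graph described above. With a position length of 1, each token ends where the next one starts.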