[jira] [Comment Edited] (LUCENE-7708) Track PositionLengthAttribute abuse

Steve Rowe (JIRA) Fri, 24 Feb 2017 11:37:25 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15883363#comment-15883363
 ]


Steve Rowe edited comment on LUCENE-7708 at 2/24/17 7:36 PM:
-------------------------------------------------------------

I'm beasting 1000 iterations of TestRandomChains with the patch, and run 110 
found the following reproducing seed - maybe it's ShingleFilter's fault?  (I 
didn't investigate further):

*edit*: this seed fails on unpatched master, so the patch on this issue isn't 
to blame.  I'll create a different issue.

{noformat}
  [junit4] Suite: org.apache.lucene.analysis.core.TestRandomChains
   [junit4]   2> TEST FAIL: useCharFilter=false text='\ufac4\u0552H 
\ua954\ua944 \ud0d2\uaddd\ub6cb\uc388\uc344\uca88\ud224\uc462\uaf42 g '
   [junit4]   2> Exception from random analyzer: 
   [junit4]   2> charfilters=
   [junit4]   2>   
org.apache.lucene.analysis.charfilter.HTMLStripCharFilter(java.io.StringReader@3fb9d00e,
 [<HOST>, <HANGUL>, <IDEOGRAPHIC>, <SOUTHEAST_ASIAN>])
   [junit4]   2> tokenizer=
   [junit4]   2>   
org.apache.lucene.analysis.standard.StandardTokenizer(org.apache.lucene.util.AttributeFactory$1@c893af9b)
   [junit4]   2> filters=
   [junit4]   2>   
org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter(ValidatingTokenFilter@7e1e9fe2
 
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,payload=null,keyword=false)
   [junit4]   2>   
org.apache.lucene.analysis.cjk.CJKBigramFilter(ValidatingTokenFilter@12c3fb1b 
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,payload=null,keyword=false)
   [junit4]   2>   
org.apache.lucene.analysis.shingle.ShingleFilter(ValidatingTokenFilter@31c463b5 
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,payload=null,keyword=false,
 49)
   [junit4]   2>   
org.apache.lucene.analysis.in.IndicNormalizationFilter(ValidatingTokenFilter@3f72787
 
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,payload=null,keyword=false)
   [junit4]   2> offsetsAreCorrect=false
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestRandomChains 
-Dtests.method=testRandomChains -Dtests.seed=E532502212098AC7 -Dtests.slow=true 
-Dtests.locale=ko-KR -Dtests.timezone=Atlantic/Jan_Mayen -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8
   [junit4] ERROR   0.76s | TestRandomChains.testRandomChains <<<
   [junit4]    > Throwable #1: java.lang.IllegalArgumentException: startOffset 
must be non-negative, and endOffset must be >= startOffset; got 
startOffset=10,endOffset=9
   [junit4]    >        at 
__randomizedtesting.SeedInfo.seed([E532502212098AC7:D8D37943551B9707]:0)
   [junit4]    >        at 
org.apache.lucene.analysis.tokenattributes.PackedTokenAttributeImpl.setOffset(PackedTokenAttributeImpl.java:110)
   [junit4]    >        at 
org.apache.lucene.analysis.shingle.ShingleFilter.incrementToken(ShingleFilter.java:345)
   [junit4]    >        at 
org.apache.lucene.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:67)
   [junit4]    >        at 
org.apache.lucene.analysis.in.IndicNormalizationFilter.incrementToken(IndicNormalizationFilter.java:40)
   [junit4]    >        at 
org.apache.lucene.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:67)
   [junit4]    >        at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:731)
   [junit4]    >        at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
   [junit4]    >        at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:540)
   [junit4]    >        at 
org.apache.lucene.analysis.core.TestRandomChains.testRandomChains(TestRandomChains.java:853)
   [junit4]    >        at java.lang.Thread.run(Thread.java:745)
   [junit4] OK      1.64s | TestRandomChains.testRandomChainsWithLargeStrings
   [junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): 
{dummy=PostingsFormat(name=LuceneVarGapFixedInterval)}, docValues:{}, 
maxPointsInLeafNode=542, maxMBSortInHeap=7.773738401752009, 
sim=RandomSimilarity(queryNorm=false): {}, locale=ko-KR, 
timezone=Atlantic/Jan_Mayen
   [junit4]   2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation 
1.8.0_77 (64-bit)/cpus=16,threads=1,free=400845920,total=514850816
   [junit4]   2> NOTE: All tests run in this JVM: [TestRandomChains]
   [junit4] Completed [1/1 (1!)] in 6.03s, 2 tests, 1 error <<< FAILURES!
{noformat}


was (Author: steve_rowe):
I'm beasting 1000 iterations of TestRandomChains with the patch, and run 110 
found the following reproducing seed - maybe it's ShingleFilter's fault?  (I 
didn't investigate further):

{noformat}
  [junit4] Suite: org.apache.lucene.analysis.core.TestRandomChains
   [junit4]   2> TEST FAIL: useCharFilter=false text='\ufac4\u0552H 
\ua954\ua944 \ud0d2\uaddd\ub6cb\uc388\uc344\uca88\ud224\uc462\uaf42 g '
   [junit4]   2> Exception from random analyzer: 
   [junit4]   2> charfilters=
   [junit4]   2>   
org.apache.lucene.analysis.charfilter.HTMLStripCharFilter(java.io.StringReader@3fb9d00e,
 [<HOST>, <HANGUL>, <IDEOGRAPHIC>, <SOUTHEAST_ASIAN>])
   [junit4]   2> tokenizer=
   [junit4]   2>   
org.apache.lucene.analysis.standard.StandardTokenizer(org.apache.lucene.util.AttributeFactory$1@c893af9b)
   [junit4]   2> filters=
   [junit4]   2>   
org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter(ValidatingTokenFilter@7e1e9fe2
 
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,payload=null,keyword=false)
   [junit4]   2>   
org.apache.lucene.analysis.cjk.CJKBigramFilter(ValidatingTokenFilter@12c3fb1b 
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,payload=null,keyword=false)
   [junit4]   2>   
org.apache.lucene.analysis.shingle.ShingleFilter(ValidatingTokenFilter@31c463b5 
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,payload=null,keyword=false,
 49)
   [junit4]   2>   
org.apache.lucene.analysis.in.IndicNormalizationFilter(ValidatingTokenFilter@3f72787
 
term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,payload=null,keyword=false)
   [junit4]   2> offsetsAreCorrect=false
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestRandomChains 
-Dtests.method=testRandomChains -Dtests.seed=E532502212098AC7 -Dtests.slow=true 
-Dtests.locale=ko-KR -Dtests.timezone=Atlantic/Jan_Mayen -Dtests.asserts=true 
-Dtests.file.encoding=UTF-8
   [junit4] ERROR   0.76s | TestRandomChains.testRandomChains <<<
   [junit4]    > Throwable #1: java.lang.IllegalArgumentException: startOffset 
must be non-negative, and endOffset must be >= startOffset; got 
startOffset=10,endOffset=9
   [junit4]    >        at 
__randomizedtesting.SeedInfo.seed([E532502212098AC7:D8D37943551B9707]:0)
   [junit4]    >        at 
org.apache.lucene.analysis.tokenattributes.PackedTokenAttributeImpl.setOffset(PackedTokenAttributeImpl.java:110)
   [junit4]    >        at 
org.apache.lucene.analysis.shingle.ShingleFilter.incrementToken(ShingleFilter.java:345)
   [junit4]    >        at 
org.apache.lucene.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:67)
   [junit4]    >        at 
org.apache.lucene.analysis.in.IndicNormalizationFilter.incrementToken(IndicNormalizationFilter.java:40)
   [junit4]    >        at 
org.apache.lucene.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:67)
   [junit4]    >        at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:731)
   [junit4]    >        at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
   [junit4]    >        at 
org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:540)
   [junit4]    >        at 
org.apache.lucene.analysis.core.TestRandomChains.testRandomChains(TestRandomChains.java:853)
   [junit4]    >        at java.lang.Thread.run(Thread.java:745)
   [junit4] OK      1.64s | TestRandomChains.testRandomChainsWithLargeStrings
   [junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): 
{dummy=PostingsFormat(name=LuceneVarGapFixedInterval)}, docValues:{}, 
maxPointsInLeafNode=542, maxMBSortInHeap=7.773738401752009, 
sim=RandomSimilarity(queryNorm=false): {}, locale=ko-KR, 
timezone=Atlantic/Jan_Mayen
   [junit4]   2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation 
1.8.0_77 (64-bit)/cpus=16,threads=1,free=400845920,total=514850816
   [junit4]   2> NOTE: All tests run in this JVM: [TestRandomChains]
   [junit4] Completed [1/1 (1!)] in 6.03s, 2 tests, 1 error <<< FAILURES!
{noformat}

> Track PositionLengthAttribute abuse
> -----------------------------------
>
>                 Key: LUCENE-7708
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7708
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/queryparser, modules/analysis
>            Reporter: Jim Ferenczi
>         Attachments: LUCENE-7708.patch, LUCENE-7708.patch
>
>
> Some token filters uses the position length attribute of the token stream to 
> encode the number of terms they put in a single token. 
> This breaks the query parsing because it creates disconnected graph. 
> I've tracked down the abusive case to 2 candidates:
> * ShingleFilter which sets the position length attribute to the length of the 
> shingle.
> * CJKBigramFilter which always sets the position length attribute to 2.
> I don't think these filters should set the position length at all so the best 
> would be to remove the attribute from these token filters but this could 
> break BWC.
> Though this is a serious bug since shingles and cjk bigram now produce 
> invalid queries.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Comment Edited] (LUCENE-7708) Track PositionLengthAttribute abuse

Reply via email to