[jira] [Commented] (LUCENE-7708) Track PositionLengthAttribute abuse
[ https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930152#comment-15930152 ]

Steve Rowe commented on LUCENE-7708:

bq. Looks like 6.5.0 isn't a valid version yet. Easy enough to add, but if I do so, would I be doing the right thing?

I see Jim already set the version to 6.5, but FYI [~elyograg], historically people have excluded the trailing ".0" in minor release labels here on JIRA.

> Track PositionLengthAttribute abuse
> ---
>
> Key: LUCENE-7708
> URL: https://issues.apache.org/jira/browse/LUCENE-7708
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/queryparser, modules/analysis
> Reporter: Jim Ferenczi
> Fix For: master (7.0), 6.5
>
> Attachments: LUCENE-7708.patch, LUCENE-7708.patch
>
>
> Some token filters use the position length attribute of the token stream to
> encode the number of terms they put in a single token.
> This breaks query parsing because it creates a disconnected graph.
> I've tracked down the abusive cases to 2 candidates:
> * ShingleFilter, which sets the position length attribute to the length of the
> shingle.
> * CJKBigramFilter, which always sets the position length attribute to 2.
> I don't think these filters should set the position length at all, so the best
> fix would be to remove the attribute from these token filters, but this could
> break BWC.
> Still, this is a serious bug, since shingles and CJK bigrams now produce
> invalid queries.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-7708) Track PositionLengthAttribute abuse
[ https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928885#comment-15928885 ]

Shawn Heisey commented on LUCENE-7708:

Looks like 6.5.0 isn't a valid version yet. Easy enough to add, but if I do so, would I be doing the right thing?
[jira] [Commented] (LUCENE-7708) Track PositionLengthAttribute abuse
[ https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928511#comment-15928511 ]

Jim Ferenczi commented on LUCENE-7708:

Thanks [~dsmiley]. I updated the status.
[jira] [Commented] (LUCENE-7708) Track PositionLengthAttribute abuse
[ https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928311#comment-15928311 ]

David Smiley commented on LUCENE-7708:

[~jim.ferenczi] what we do after committing/all-done is "Resolve" the issue (not "Close"). That dialog box will give you the option to set the fix-version. Later on during the release process, there should be a JIRA step that involves bulk-closing all issues resolved for this version.
[jira] [Commented] (LUCENE-7708) Track PositionLengthAttribute abuse
[ https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928198#comment-15928198 ]

Jim Ferenczi commented on LUCENE-7708:

[~elyograg] one shingle filter problem is fixed in LUCENE-7708; it appeared in 6.3, when support for graph analysis was added to the QueryBuilder.
The other shingle filter problem I can think of is when the number of paths is gigantic and produces an OOM. I opened LUCENE-7747 to fix this.
For now, I think the workaround is to disable graph query analysis when the analyzer contains a shingle filter that produces shingles of different sizes. In this case the graph analysis builds all possible paths, since each position has different side paths.
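To give a feel for why the number of paths can become gigantic, here is a minimal sketch in plain Java (no Lucene dependency). It assumes a simplified model that is not taken from the Lucene code: every position emits a unigram plus shingles of every size up to `maxShingleSize`, each with a correct position length, so each node has outgoing edges of length 1 to `maxShingleSize`. The class and method names are hypothetical.

```java
import java.math.BigInteger;

// Hypothetical helper, not part of Lucene: counts paths in a simplified
// shingle token graph where node i has edges to i+1 .. i+maxShingleSize.
final class ShinglePathCount {

    /** Number of distinct paths from node 0 to node numTerms. */
    static BigInteger countPaths(int numTerms, int maxShingleSize) {
        BigInteger[] ways = new BigInteger[numTerms + 1];
        ways[0] = BigInteger.ONE; // exactly one way to be at the start node
        for (int node = 1; node <= numTerms; node++) {
            BigInteger sum = BigInteger.ZERO;
            // An edge of length len arrives at this node from node - len.
            for (int len = 1; len <= Math.min(maxShingleSize, node); len++) {
                sum = sum.add(ways[node - len]);
            }
            ways[node] = sum;
        }
        return ways[numTerms];
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 10, 50}) {
            System.out.println(n + " terms, shingles up to size 3: "
                + countPaths(n, 3) + " paths");
        }
    }
}
```

Under this model the count grows exponentially with the number of terms, which is consistent with the OOM described above: any code that enumerates every path has to materialize that many term sequences.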
[jira] [Commented] (LUCENE-7708) Track PositionLengthAttribute abuse
[ https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928145#comment-15928145 ]

Shawn Heisey commented on LUCENE-7708:

There's no fix version here. CHANGES.txt says it's in 6.5.0.

(looking for possible causes of a shingle filter problem confirmed in Solr 6.3 and 6.4, this couldn't be the cause)
[jira] [Commented] (LUCENE-7708) Track PositionLengthAttribute abuse
[ https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883672#comment-15883672 ]

Jim Ferenczi commented on LUCENE-7708:

Thanks [~sar...@syr.edu] and [~mikemccand] !
[jira] [Commented] (LUCENE-7708) Track PositionLengthAttribute abuse
[ https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883671#comment-15883671 ]

ASF subversion and git services commented on LUCENE-7708:

Commit 6c63df0b15f735907438514f3b4b702680d74588 in lucene-solr's branch refs/heads/branch_6x from [~jim.ferenczi]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6c63df0 ]

LUCENE-7708: Fix position length attribute set by the ShingleFilter when outputUnigrams=false
[jira] [Commented] (LUCENE-7708) Track PositionLengthAttribute abuse
[ https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883655#comment-15883655 ]

ASF subversion and git services commented on LUCENE-7708:

Commit 57a42e4ec54aebac40c1ef7dc93d933cd00dbe1e in lucene-solr's branch refs/heads/master from [~jim.ferenczi]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=57a42e4 ]

LUCENE-7708: Fix position length attribute set by the ShingleFilter when outputUnigrams=false
[jira] [Commented] (LUCENE-7708) Track PositionLengthAttribute abuse
[ https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883528#comment-15883528 ]

Steve Rowe commented on LUCENE-7708:

+1, LGTM, all {{lucene/analysis/common/}} tests pass for me with the latest patch.

Also, 1000 beasting iterations of TestRandomChains didn't trigger any failures with this patch (other than the unrelated one at LUCENE-7711).
[jira] [Commented] (LUCENE-7708) Track PositionLengthAttribute abuse
[ https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883526#comment-15883526 ]

Michael McCandless commented on LUCENE-7708:

+1, thanks [~jim.ferenczi]!
[jira] [Commented] (LUCENE-7708) Track PositionLengthAttribute abuse
[ https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883363#comment-15883363 ]

Steve Rowe commented on LUCENE-7708:

I'm beasting 1000 iterations of TestRandomChains with the patch, and run 110 found the following reproducing seed - maybe it's ShingleFilter's fault? (I didn't investigate further):

{noformat}
   [junit4] Suite: org.apache.lucene.analysis.core.TestRandomChains
   [junit4]   2> TEST FAIL: useCharFilter=false text='\ufac4\u0552H \ua954\ua944 \ud0d2\uaddd\ub6cb\uc388\uc344\uca88\ud224\uc462\uaf42 g '
   [junit4]   2> Exception from random analyzer:
   [junit4]   2> charfilters=
   [junit4]   2>   org.apache.lucene.analysis.charfilter.HTMLStripCharFilter(java.io.StringReader@3fb9d00e, [, , , ])
   [junit4]   2> tokenizer=
   [junit4]   2>   org.apache.lucene.analysis.standard.StandardTokenizer(org.apache.lucene.util.AttributeFactory$1@c893af9b)
   [junit4]   2> filters=
   [junit4]   2>   org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter(ValidatingTokenFilter@7e1e9fe2 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,payload=null,keyword=false)
   [junit4]   2>   org.apache.lucene.analysis.cjk.CJKBigramFilter(ValidatingTokenFilter@12c3fb1b term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,payload=null,keyword=false)
   [junit4]   2>   org.apache.lucene.analysis.shingle.ShingleFilter(ValidatingTokenFilter@31c463b5 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,payload=null,keyword=false, 49)
   [junit4]   2>   org.apache.lucene.analysis.in.IndicNormalizationFilter(ValidatingTokenFilter@3f72787 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,payload=null,keyword=false)
   [junit4]   2> offsetsAreCorrect=false
   [junit4]   2> NOTE: reproduce with: ant test -Dtestcase=TestRandomChains -Dtests.method=testRandomChains -Dtests.seed=E532502212098AC7 -Dtests.slow=true -Dtests.locale=ko-KR -Dtests.timezone=Atlantic/Jan_Mayen -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] ERROR 0.76s | TestRandomChains.testRandomChains <<<
   [junit4]    > Throwable #1: java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset; got startOffset=10,endOffset=9
   [junit4]    > at __randomizedtesting.SeedInfo.seed([E532502212098AC7:D8D37943551B9707]:0)
   [junit4]    > at org.apache.lucene.analysis.tokenattributes.PackedTokenAttributeImpl.setOffset(PackedTokenAttributeImpl.java:110)
   [junit4]    > at org.apache.lucene.analysis.shingle.ShingleFilter.incrementToken(ShingleFilter.java:345)
   [junit4]    > at org.apache.lucene.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:67)
   [junit4]    > at org.apache.lucene.analysis.in.IndicNormalizationFilter.incrementToken(IndicNormalizationFilter.java:40)
   [junit4]    > at org.apache.lucene.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:67)
   [junit4]    > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:731)
   [junit4]    > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
   [junit4]    > at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:540)
   [junit4]    > at org.apache.lucene.analysis.core.TestRandomChains.testRandomChains(TestRandomChains.java:853)
   [junit4]    > at java.lang.Thread.run(Thread.java:745)
   [junit4] OK 1.64s | TestRandomChains.testRandomChainsWithLargeStrings
   [junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): {dummy=PostingsFormat(name=LuceneVarGapFixedInterval)}, docValues:{}, maxPointsInLeafNode=542, maxMBSortInHeap=7.773738401752009, sim=RandomSimilarity(queryNorm=false): {}, locale=ko-KR, timezone=Atlantic/Jan_Mayen
   [junit4]   2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation 1.8.0_77 (64-bit)/cpus=16,threads=1,free=400845920,total=514850816
   [junit4]   2> NOTE: All tests run in this JVM: [TestRandomChains]
   [junit4] Completed [1/1 (1!)] in 6.03s, 2 tests, 1 error <<< FAILURES!
{noformat}
[jira] [Commented] (LUCENE-7708) Track PositionLengthAttribute abuse
[ https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883169#comment-15883169 ]

Steve Rowe commented on LUCENE-7708:

+1 to the idea, but some tests are failing with the patch:

{noformat}
   [junit4] Tests with failures [seed: 4D8AED66905F8617]:
   [junit4]   - org.apache.lucene.analysis.shingle.ShingleFilterTest.testOutputUnigramsIfNoShinglesSingleTokenCase
   [junit4]   - org.apache.lucene.analysis.shingle.ShingleFilterTest.testOutputUnigramsIfNoShinglesWithMultipleInputTokens
   [junit4]   - org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapperTest.testOutputUnigramsIfNoShinglesSingleToken
   [junit4]   - org.apache.lucene.analysis.shingle.TestShingleFilterFactory.testOutputUnigramsIfNoShingles
{noformat}
[jira] [Commented] (LUCENE-7708) Track PositionLengthAttribute abuse
[ https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882616#comment-15882616 ]

Jim Ferenczi commented on LUCENE-7708:

The CJKBigramFilter is working correctly because it sets the position length attribute only if outputUnigrams is set. So only the ShingleFilter is problematic, since it does not check outputUnigrams when it sets the position length. For instance, with shingles of size 2, the input "foo bar baz" would create two tokens, "foo bar" and "bar baz", each with a position length of 2 and with a position increment in between, which forms a disconnected graph.
I'll work on a patch shortly.
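The "foo bar baz" case can be checked with a small stand-alone sketch. This is plain Java rather than the Lucene API; the `Tok` class, the `isConnected` method, and the position increment of 1 on each shingle are assumptions made for illustration. A token is modelled as an edge from its start node (the running sum of position increments) to start plus position length, and the graph counts as connected only if every token starts at a node that an earlier token reaches.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of a token stream as a graph; not Lucene code.
final class PositionGraphSketch {

    /** Only the attributes that matter for graph connectivity. */
    static final class Tok {
        final String term;
        final int posInc; // position increment relative to the previous token
        final int posLen; // number of positions the token spans
        Tok(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /** True if every token starts at a node reachable from the first node. */
    static boolean isConnected(Tok[] tokens) {
        Set<Integer> reachable = new HashSet<>();
        int pos = -1; // same convention as Lucene consumers: first increment lands on 0
        for (Tok t : tokens) {
            pos += t.posInc;
            if (reachable.isEmpty()) {
                reachable.add(pos); // the first token defines the start node
            }
            if (!reachable.contains(pos)) {
                return false; // no earlier token ends where this one starts
            }
            reachable.add(pos + t.posLen);
        }
        return true;
    }

    public static void main(String[] args) {
        Tok[] reported = { new Tok("foo bar", 1, 2), new Tok("bar baz", 1, 2) };
        Tok[] lengthOne = { new Tok("foo bar", 1, 1), new Tok("bar baz", 1, 1) };
        System.out.println(isConnected(reported));  // prints false
        System.out.println(isConnected(lengthOne)); // prints true
    }
}
```

In the reported case "foo bar" spans nodes 0 to 2 while "bar baz" starts at node 1, which nothing leads into. With a position length of 1 on each shingle the two tokens chain as 0 to 1 to 2. The sketch ignores position gaps such as those left by removed stop words, which a real check would need to treat separately.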