[jira] [Commented] (LUCENE-7708) Track PositionLengthAttribute abuse

2017-03-17 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15930152#comment-15930152
 ] 

Steve Rowe commented on LUCENE-7708:


bq. Looks like 6.5.0 isn't a valid version yet. Easy enough to add, but if I do 
so, would I be doing the right thing?

I see Jim already set the version to 6.5, but FYI [~elyograg], historically 
people have excluded the trailing ".0" in minor release labels here on JIRA.

> Track PositionLengthAttribute abuse
> ---
>
> Key: LUCENE-7708
> URL: https://issues.apache.org/jira/browse/LUCENE-7708
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/queryparser, modules/analysis
>Reporter: Jim Ferenczi
> Fix For: master (7.0), 6.5
>
> Attachments: LUCENE-7708.patch, LUCENE-7708.patch
>
>
> Some token filters use the position length attribute of the token stream to 
> encode the number of terms they put in a single token. 
> This breaks query parsing because it creates a disconnected graph. 
> I've tracked down the abusive cases to 2 candidates:
> * ShingleFilter, which sets the position length attribute to the length of the 
> shingle.
> * CJKBigramFilter, which always sets the position length attribute to 2.
> I don't think these filters should set the position length at all, so the best 
> fix would be to remove the attribute from these token filters, but this could 
> break backward compatibility (BWC).
> This is a serious bug, though, since shingles and CJK bigrams now produce 
> invalid queries.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7708) Track PositionLengthAttribute abuse

2017-03-16 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928885#comment-15928885
 ] 

Shawn Heisey commented on LUCENE-7708:
--

Looks like 6.5.0 isn't a valid version yet.  Easy enough to add, but if I do 
so, would I be doing the right thing?




[jira] [Commented] (LUCENE-7708) Track PositionLengthAttribute abuse

2017-03-16 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928511#comment-15928511
 ] 

Jim Ferenczi commented on LUCENE-7708:
--

Thanks [~dsmiley]. I updated the status.




[jira] [Commented] (LUCENE-7708) Track PositionLengthAttribute abuse

2017-03-16 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928311#comment-15928311
 ] 

David Smiley commented on LUCENE-7708:
--

[~jim.ferenczi] what we do after committing/all-done is "Resolve" the issue 
(not "Close").  That dialog box will give you the option to set the 
fix-version.  Later on during the release process, there should be a JIRA step 
that involves bulk-closing all issues resolved for this version.




[jira] [Commented] (LUCENE-7708) Track PositionLengthAttribute abuse

2017-03-16 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928198#comment-15928198
 ] 

Jim Ferenczi commented on LUCENE-7708:
--

[~elyograg] one shingle filter problem is fixed by LUCENE-7708; it appeared in 
6.3, when support for graph analysis was added to the QueryBuilder. 
The other shingle filter problem I can think of occurs when the number of paths 
is gigantic and produces an OOM. I opened LUCENE-7747 to fix this.
For now, I think the workaround is to disable graph query analysis when the 
analyzer contains a shingle filter that produces shingles of different sizes. 
The graph analysis in this case builds all possible paths, since each position 
has different side paths.
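
To give a rough sense of why the number of paths can become gigantic, here is a minimal sketch in plain Java (no Lucene dependency). It assumes a simplified model that is not taken from the Lucene source: a token of every size between a hypothetical `minSize` and `maxSize` starts at every input position, and a token of size s spans s positions. The class and method names are illustrative only.

```java
final class ShinglePathCount {

    /**
     * Counts the distinct paths from position 0 to position n in the assumed
     * model: at every position, one token of each size in [minSize, maxSize]
     * starts and spans that many positions.
     */
    static long countPaths(int n, int minSize, int maxSize) {
        long[] paths = new long[n + 1];
        paths[0] = 1; // exactly one way to stand at the start: the empty path
        for (int i = 1; i <= n; i++) {
            // A path reaching position i ends with a token of some size s,
            // so it extends a path that reached position i - s.
            for (int s = minSize; s <= maxSize && s <= i; s++) {
                paths[i] += paths[i - s];
            }
        }
        return paths[n];
    }

    public static void main(String[] args) {
        // Unigrams plus shingles of size 2 and 3, for growing input lengths.
        for (int n : new int[] {10, 20, 40}) {
            System.out.println(n + " terms -> " + countPaths(n, 1, 3) + " paths");
        }
    }
}
```

With a single token size there is only one path, whereas mixing sizes makes the count grow exponentially with the number of input terms, which is consistent with the OOM risk described above.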




[jira] [Commented] (LUCENE-7708) Track PositionLengthAttribute abuse

2017-03-16 Thread Shawn Heisey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15928145#comment-15928145
 ] 

Shawn Heisey commented on LUCENE-7708:
--

There's no fix version here.  CHANGES.txt says it's in 6.5.0.

(looking for possible causes of a shingle filter problem confirmed in Solr 6.3 
and 6.4, this couldn't be the cause)





[jira] [Commented] (LUCENE-7708) Track PositionLengthAttribute abuse

2017-02-24 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883672#comment-15883672
 ] 

Jim Ferenczi commented on LUCENE-7708:
--

Thanks [~sar...@syr.edu] and [~mikemccand] !




[jira] [Commented] (LUCENE-7708) Track PositionLengthAttribute abuse

2017-02-24 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883671#comment-15883671
 ] 

ASF subversion and git services commented on LUCENE-7708:
-

Commit 6c63df0b15f735907438514f3b4b702680d74588 in lucene-solr's branch 
refs/heads/branch_6x from [~jim.ferenczi]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6c63df0 ]

LUCENE-7708: Fix position length attribute set by the ShingleFilter when 
outputUnigrams=false





[jira] [Commented] (LUCENE-7708) Track PositionLengthAttribute abuse

2017-02-24 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883655#comment-15883655
 ] 

ASF subversion and git services commented on LUCENE-7708:
-

Commit 57a42e4ec54aebac40c1ef7dc93d933cd00dbe1e in lucene-solr's branch 
refs/heads/master from [~jim.ferenczi]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=57a42e4 ]

LUCENE-7708: Fix position length attribute set by the ShingleFilter when 
outputUnigrams=false





[jira] [Commented] (LUCENE-7708) Track PositionLengthAttribute abuse

2017-02-24 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883528#comment-15883528
 ] 

Steve Rowe commented on LUCENE-7708:


+1, LGTM, all {{lucene/analysis/common/}} tests pass for me with the latest 
patch.

Also, 1000 beasting iterations of TestRandomChains didn't trigger any failures 
with this patch (other than the unrelated one at LUCENE-7711).




[jira] [Commented] (LUCENE-7708) Track PositionLengthAttribute abuse

2017-02-24 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883526#comment-15883526
 ] 

Michael McCandless commented on LUCENE-7708:


+1, thanks [~jim.ferenczi]!




[jira] [Commented] (LUCENE-7708) Track PositionLengthAttribute abuse

2017-02-24 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883363#comment-15883363
 ] 

Steve Rowe commented on LUCENE-7708:


I'm beasting 1000 iterations of TestRandomChains with the patch, and run 110 
found the following reproducing seed - maybe it's ShingleFilter's fault?  (I 
didn't investigate further):

{noformat}
   [junit4] Suite: org.apache.lucene.analysis.core.TestRandomChains
   [junit4]   2> TEST FAIL: useCharFilter=false text='\ufac4\u0552H \ua954\ua944 \ud0d2\uaddd\ub6cb\uc388\uc344\uca88\ud224\uc462\uaf42 g '
   [junit4]   2> Exception from random analyzer: 
   [junit4]   2> charfilters=
   [junit4]   2>   org.apache.lucene.analysis.charfilter.HTMLStripCharFilter(java.io.StringReader@3fb9d00e, [, , , ])
   [junit4]   2> tokenizer=
   [junit4]   2>   org.apache.lucene.analysis.standard.StandardTokenizer(org.apache.lucene.util.AttributeFactory$1@c893af9b)
   [junit4]   2> filters=
   [junit4]   2>   org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter(ValidatingTokenFilter@7e1e9fe2 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,payload=null,keyword=false)
   [junit4]   2>   org.apache.lucene.analysis.cjk.CJKBigramFilter(ValidatingTokenFilter@12c3fb1b term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,payload=null,keyword=false)
   [junit4]   2>   org.apache.lucene.analysis.shingle.ShingleFilter(ValidatingTokenFilter@31c463b5 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,payload=null,keyword=false, 49)
   [junit4]   2>   org.apache.lucene.analysis.in.IndicNormalizationFilter(ValidatingTokenFilter@3f72787 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,flags=0,payload=null,keyword=false)
   [junit4]   2> offsetsAreCorrect=false
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestRandomChains -Dtests.method=testRandomChains -Dtests.seed=E532502212098AC7 -Dtests.slow=true -Dtests.locale=ko-KR -Dtests.timezone=Atlantic/Jan_Mayen -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] ERROR   0.76s | TestRandomChains.testRandomChains <<<
   [junit4]    > Throwable #1: java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset; got startOffset=10,endOffset=9
   [junit4]    >   at __randomizedtesting.SeedInfo.seed([E532502212098AC7:D8D37943551B9707]:0)
   [junit4]    >   at org.apache.lucene.analysis.tokenattributes.PackedTokenAttributeImpl.setOffset(PackedTokenAttributeImpl.java:110)
   [junit4]    >   at org.apache.lucene.analysis.shingle.ShingleFilter.incrementToken(ShingleFilter.java:345)
   [junit4]    >   at org.apache.lucene.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:67)
   [junit4]    >   at org.apache.lucene.analysis.in.IndicNormalizationFilter.incrementToken(IndicNormalizationFilter.java:40)
   [junit4]    >   at org.apache.lucene.analysis.ValidatingTokenFilter.incrementToken(ValidatingTokenFilter.java:67)
   [junit4]    >   at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkAnalysisConsistency(BaseTokenStreamTestCase.java:731)
   [junit4]    >   at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:642)
   [junit4]    >   at org.apache.lucene.analysis.BaseTokenStreamTestCase.checkRandomData(BaseTokenStreamTestCase.java:540)
   [junit4]    >   at org.apache.lucene.analysis.core.TestRandomChains.testRandomChains(TestRandomChains.java:853)
   [junit4]    >   at java.lang.Thread.run(Thread.java:745)
   [junit4] OK  1.64s | TestRandomChains.testRandomChainsWithLargeStrings
   [junit4]   2> NOTE: test params are: codec=Asserting(Lucene70): {dummy=PostingsFormat(name=LuceneVarGapFixedInterval)}, docValues:{}, maxPointsInLeafNode=542, maxMBSortInHeap=7.773738401752009, sim=RandomSimilarity(queryNorm=false): {}, locale=ko-KR, timezone=Atlantic/Jan_Mayen
   [junit4]   2> NOTE: Linux 4.1.0-custom2-amd64 amd64/Oracle Corporation 1.8.0_77 (64-bit)/cpus=16,threads=1,free=400845920,total=514850816
   [junit4]   2> NOTE: All tests run in this JVM: [TestRandomChains]
   [junit4] Completed [1/1 (1!)] in 6.03s, 2 tests, 1 error <<< FAILURES!
{noformat}


[jira] [Commented] (LUCENE-7708) Track PositionLengthAttribute abuse

2017-02-24 Thread Steve Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883169#comment-15883169
 ] 

Steve Rowe commented on LUCENE-7708:


+1 to the idea, but some tests are failing with the patch:

{noformat}
   [junit4] Tests with failures [seed: 4D8AED66905F8617]:
   [junit4]   - org.apache.lucene.analysis.shingle.ShingleFilterTest.testOutputUnigramsIfNoShinglesSingleTokenCase
   [junit4]   - org.apache.lucene.analysis.shingle.ShingleFilterTest.testOutputUnigramsIfNoShinglesWithMultipleInputTokens
   [junit4]   - org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapperTest.testOutputUnigramsIfNoShinglesSingleToken
   [junit4]   - org.apache.lucene.analysis.shingle.TestShingleFilterFactory.testOutputUnigramsIfNoShingles
{noformat}




[jira] [Commented] (LUCENE-7708) Track PositionLengthAttribute abuse

2017-02-24 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15882616#comment-15882616
 ] 

Jim Ferenczi commented on LUCENE-7708:
--

The CJKBigramFilter is working correctly because it sets the position length 
attribute only if outputUnigrams is set.
So only the ShingleFilter is problematic, since outputUnigrams is not checked 
when the position length is set. 
For instance, with shingles of size 2, the input "foo bar baz" would create 
two tokens, "foo bar" and "bar baz", each with a position length of 2 and a 
position increment in between, which forms a disconnected graph.
I'll work on a patch shortly.
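
The "foo bar baz" case can be checked with a small stand-alone model. This is a sketch in plain Java and does not use Lucene's classes: `Tok`, `TokenGraphSketch`, and `isConnected` are hypothetical names, and the rule that each token is an arc from its start position to start plus position length is a simplifying assumption about how a token stream is read as a graph.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class TokenGraphSketch {

    /** Hypothetical stand-in for a token's term and position attributes. */
    static final class Tok {
        final String term;
        final int posInc; // position increment relative to the previous token
        final int posLen; // position length, assumed to be >= 1

        Tok(String term, int posInc, int posLen) {
            this.term = term;
            this.posInc = posInc;
            this.posLen = posLen;
        }
    }

    /**
     * Returns true if every token starts at a node reachable from node 0.
     * Tokens arrive in non-decreasing start order, so one pass is enough.
     */
    static boolean isConnected(List<Tok> tokens) {
        Set<Integer> reachable = new HashSet<>();
        reachable.add(0);
        int pos = -1; // the first token with posInc=1 lands on node 0
        for (Tok t : tokens) {
            pos += t.posInc;
            if (!reachable.contains(pos)) {
                return false; // nothing leads to this token's start node
            }
            reachable.add(pos + t.posLen);
        }
        return true;
    }

    public static void main(String[] args) {
        // Shingles of size 2 without unigrams, position length 2 (the reported behaviour).
        List<Tok> reported = Arrays.asList(
            new Tok("foo bar", 1, 2), new Tok("bar baz", 1, 2));
        // The same shingles with a position length of 1.
        List<Tok> adjusted = Arrays.asList(
            new Tok("foo bar", 1, 1), new Tok("bar baz", 1, 1));
        System.out.println("posLen=2 connected: " + isConnected(reported));
        System.out.println("posLen=1 connected: " + isConnected(adjusted));
    }
}
```

In this model "foo bar" is an arc from node 0 to node 2 and "bar baz" starts at node 1, which no arc reaches, so the first stream is reported as disconnected. When unigrams are also emitted, the unigram arcs supply the missing node and the same position lengths form a connected graph.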
