[
https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882616#comment-15882616
]
Jim Ferenczi commented on LUCENE-7708:
--------------------------------------
The CJKBigramFilter is working correctly because it sets the position length
attribute only if outputUnigrams is set.
So only the ShingleFilter is problematic, since outputUnigrams is not checked
when the position length is set.
So for instance, with shingles of size 2, the input "foo bar baz" would create
two tokens, "foo bar" and "bar baz", each with a position length of 2 and a
position increment between them, which forms a disconnected graph.
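To make the disconnection concrete, here is a minimal sketch in plain Python. It
does not use the Lucene API: the (term, position increment, position length)
tuples and the helpers token_arcs and is_connected are hypothetical names made up
for this illustration. Each token is modelled as an arc from its start position
to start + position length, and the graph counts as connected only if every
token starts at a position that some earlier arc reaches.

```python
from typing import List, Tuple

# Hypothetical model of a token: (term, position_increment, position_length).
Token = Tuple[str, int, int]


def token_arcs(tokens: List[Token]) -> List[Tuple[str, int, int]]:
    """Turn tokens into (term, start_node, end_node) arcs."""
    arcs = []
    pos = -1  # positions start at -1 so a first increment of 1 lands on node 0
    for term, pos_inc, pos_len in tokens:
        pos += pos_inc
        arcs.append((term, pos, pos + pos_len))
    return arcs


def is_connected(tokens: List[Token]) -> bool:
    """True if every token starts at a node reachable from the first node."""
    arcs = token_arcs(tokens)
    if not arcs:
        return True
    reachable = {min(start for _, start, _ in arcs)}
    # Position lengths are >= 1, so visiting arcs in start order is enough:
    # any arc that ends at a node has already been visited before that node.
    for _, start, end in sorted(arcs, key=lambda arc: arc[1]):
        if start not in reachable:
            return False
        reachable.add(end)
    return True


# Shingles of size 2 without unigrams, position length set to 2 (the bug).
shingles_only = [("foo bar", 1, 2), ("bar baz", 1, 2)]

# Same tokens with a position length of 1 (one possible fix, assumed here).
shingles_fixed = [("foo bar", 1, 1), ("bar baz", 1, 1)]

# Shingles with unigrams: the unigrams fill in the intermediate nodes.
with_unigrams = [
    ("foo", 1, 1), ("foo bar", 0, 2),
    ("bar", 1, 1), ("bar baz", 0, 2),
    ("baz", 1, 1),
]

print(token_arcs(shingles_only))
print(is_connected(shingles_only), is_connected(shingles_fixed),
      is_connected(with_unigrams))
```

In the shingles-only case "foo bar" spans nodes 0 to 2 while "bar baz" starts at
node 1, which no token leads to. When unigrams are emitted, "foo" provides the
arc from 0 to 1, which is why a position length of 2 is consistent there.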
I'll work on a patch shortly.
> Track PositionLengthAttribute abuse
> -----------------------------------
>
> Key: LUCENE-7708
> URL: https://issues.apache.org/jira/browse/LUCENE-7708
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/queryparser, modules/analysis
> Reporter: Jim Ferenczi
>
> Some token filters use the position length attribute of the token stream to
> encode the number of terms they put in a single token.
> This breaks query parsing because it creates a disconnected graph.
> I've tracked down the abusive case to 2 candidates:
> * ShingleFilter which sets the position length attribute to the length of the
> shingle.
> * CJKBigramFilter which always sets the position length attribute to 2.
> I don't think these filters should set the position length at all, so the
> best option would be to remove the attribute from these token filters, but
> this could break BWC.
> Though this is a serious bug since shingles and cjk bigram now produce
> invalid queries.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)