[jira] [Updated] (LUCENE-7708) Track PositionLengthAttribute abuse
[ https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Ferenczi updated LUCENE-7708:
---------------------------------
    Fix Version/s: 6.5
                   master (7.0)

> Track PositionLengthAttribute abuse
> -----------------------------------
>
>                 Key: LUCENE-7708
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7708
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/queryparser, modules/analysis
>            Reporter: Jim Ferenczi
>             Fix For: master (7.0), 6.5
>
>         Attachments: LUCENE-7708.patch, LUCENE-7708.patch
>
>
> Some token filters use the position length attribute of the token stream to
> encode the number of terms they put in a single token.
> This breaks query parsing because it creates a disconnected graph.
> I've tracked down the abusive cases to 2 candidates:
> * ShingleFilter, which sets the position length attribute to the length of the
> shingle.
> * CJKBigramFilter, which always sets the position length attribute to 2.
> I don't think these filters should set the position length at all, so the best
> option would be to remove the attribute from these token filters, but this could
> break BWC.
> This is a serious bug, though, since shingles and CJK bigrams now produce
> invalid queries.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
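
To illustrate the "disconnected graph" problem described in the issue, here is a minimal sketch that does not use any Lucene classes. It treats each token as an arc from its start position to start position + position length, and checks that every token starts at a node some earlier token reaches. The names TokenGraphCheck, Token and isConnected are hypothetical and chosen only for this example; the actual check performed by the query parser may well differ.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public final class TokenGraphCheck {

    /** A token reduced to the attributes that define its arc in the token graph. */
    static final class Token {
        final String term;
        final int positionIncrement;
        final int positionLength;

        Token(String term, int positionIncrement, int positionLength) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.positionLength = positionLength;
        }
    }

    /**
     * Returns true if every token starts at a node that is reachable from the
     * start node of the first token. Assumes tokens arrive in stream order and
     * that every position length is at least 1.
     */
    static boolean isConnected(List<Token> tokens) {
        Set<Integer> reachable = new HashSet<>();
        int position = -1;
        for (Token t : tokens) {
            position += t.positionIncrement;
            if (reachable.isEmpty()) {
                reachable.add(position); // start node of the graph
            }
            if (!reachable.contains(position)) {
                return false; // no earlier arc ends where this token starts
            }
            reachable.add(position + t.positionLength);
        }
        return true;
    }

    public static void main(String[] args) {
        // Bigram shingles of "a b c" without unigrams, position length = shingle size (2).
        List<Token> abused = Arrays.asList(new Token("a b", 1, 2), new Token("b c", 1, 2));
        // Same shingles with position length 1.
        List<Token> fixed = Arrays.asList(new Token("a b", 1, 1), new Token("b c", 1, 1));

        System.out.println(isConnected(abused)); // prints false
        System.out.println(isConnected(fixed));  // prints true
    }
}
```

In the first stream, "a b" spans nodes 0 to 2 while "b c" starts at node 1, which nothing leads to; that is the kind of gap the issue refers to.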
[jira] [Updated] (LUCENE-7708) Track PositionLengthAttribute abuse
[ https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Ferenczi updated LUCENE-7708:
---------------------------------
    Attachment: LUCENE-7708.patch

Thanks Steve! I pushed a new patch that fixes the test failures.

> Track PositionLengthAttribute abuse
> -----------------------------------
>
>                 Key: LUCENE-7708
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7708
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/queryparser, modules/analysis
>            Reporter: Jim Ferenczi
>         Attachments: LUCENE-7708.patch, LUCENE-7708.patch
[jira] [Updated] (LUCENE-7708) Track PositionLengthAttribute abuse
[ https://issues.apache.org/jira/browse/LUCENE-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Ferenczi updated LUCENE-7708:
---------------------------------
    Attachment: LUCENE-7708.patch

Here is one patch for the ShingleFilter. When outputUnigrams is set to false, the position length for a shingle of size N is the number of positions created by shingles of smaller size: (N - minShingleSize) + 1.
[~mikemccand] can you take a look?

> Track PositionLengthAttribute abuse
> -----------------------------------
>
>                 Key: LUCENE-7708
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7708
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/queryparser, modules/analysis
>            Reporter: Jim Ferenczi
>         Attachments: LUCENE-7708.patch
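
The position length rule from the comment above, (N - minShingleSize) + 1 when outputUnigrams is false, can be sketched as a small standalone function. This is only an illustration of the arithmetic, not the code in the attached patch; the class and method names (ShinglePositionLength, positionLength) are hypothetical.

```java
public final class ShinglePositionLength {

    private ShinglePositionLength() {}

    /**
     * Position length for a shingle of shingleSize terms when unigrams are not
     * emitted, following the formula (N - minShingleSize) + 1 from the comment.
     */
    static int positionLength(int shingleSize, int minShingleSize) {
        if (minShingleSize < 1 || shingleSize < minShingleSize) {
            throw new IllegalArgumentException(
                "expected 1 <= minShingleSize <= shingleSize, got minShingleSize="
                    + minShingleSize + ", shingleSize=" + shingleSize);
        }
        return (shingleSize - minShingleSize) + 1;
    }

    public static void main(String[] args) {
        int minShingleSize = 2;
        for (int n = minShingleSize; n <= 4; n++) {
            // Prints position lengths 1, 2 and 3 for shingle sizes 2, 3 and 4.
            System.out.println("shingle size " + n + " -> position length "
                + positionLength(n, minShingleSize));
        }
    }
}
```

With minShingleSize = 2, the smallest shingle gets position length 1, so consecutive bigrams connect to each other, and each extra term in a longer shingle adds one position.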