[jira] [Commented] (LUCENE-8132) HyphenationDecompoundTokenFilter does not set position/offset attributes correctly
[ https://issues.apache.org/jira/browse/LUCENE-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335455#comment-16335455 ]

Adrien Grand commented on LUCENE-8132:
--------------------------------------

No, the hyphenation decompounder would have to be the first token filter in the analysis chain.

> HyphenationDecompoundTokenFilter does not set position/offset attributes correctly
> ----------------------------------------------------------------------------------
>
>                 Key: LUCENE-8132
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8132
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>   Affects Versions: 6.6.1, 7.2.1
>            Reporter: Holger Bruch
>            Priority: Major
>
> HyphenationDecompoundTokenFilter and DictionaryDecompoundTokenFilter set
> positionIncrement to 0 for all subwords, reuse the start/end offsets of the
> original token and ignore positionLength completely.
> As a consequence, the QueryBuilder generates a SynonymQuery comprising all
> subwords, which should rather be treated as individual terms.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
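The effect described in the issue summary can be illustrated with a small model of token positions. This is only a sketch: `PositionModel` and `Tok` are hypothetical stand-ins and not Lucene classes, the compound "fussballpumpe" is a made-up example, and the "sequential" variant is just one possible reading of what "individual terms" would look like. A token's absolute position is the running sum of its position increments, so every subword emitted with an increment of 0 lands on the same position as the original token, which is the situation in which the report says a SynonymQuery is built.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified model of token positions. These classes are illustrative
// stand-ins and are NOT part of Lucene.
final class PositionModel {

    /** A token reduced to its term text and position increment. */
    static final class Tok {
        final String term;
        final int posInc;

        Tok(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }
    }

    /** Groups terms by absolute position (running sum of the increments). */
    static Map<Integer, List<String>> groupByPosition(List<Tok> tokens) {
        Map<Integer, List<String>> byPos = new TreeMap<>();
        int pos = -1; // the first token with increment 1 lands on position 0
        for (Tok t : tokens) {
            pos += t.posInc;
            byPos.computeIfAbsent(pos, k -> new ArrayList<>()).add(t.term);
        }
        return byPos;
    }

    public static void main(String[] args) {
        // Behaviour described in the report: subwords stacked on the compound.
        List<Tok> stacked = List.of(
                new Tok("fussballpumpe", 1),
                new Tok("fussball", 0),
                new Tok("pumpe", 0));
        // prints {0=[fussballpumpe, fussball, pumpe]}
        System.out.println(groupByPosition(stacked));

        // One possible alternative (an assumption): subwords on their own positions.
        List<Tok> sequential = List.of(
                new Tok("fussball", 1),
                new Tok("pumpe", 1));
        // prints {0=[fussball], 1=[pumpe]}
        System.out.println(groupByPosition(sequential));
    }
}
```

In the stacked case all three terms share position 0, so a consumer that treats co-located terms as alternatives sees them as synonyms of one another.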
[jira] [Commented] (LUCENE-8132) HyphenationDecompoundTokenFilter does not set position/offset attributes correctly
[ https://issues.apache.org/jira/browse/LUCENE-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335427#comment-16335427 ]

Holger Bruch commented on LUCENE-8132:
--------------------------------------

I’m not as deep into Lucene as you are. What would be the pros and cons of ensuring the input is an instance of Tokenizer? Would it still be possible to apply token filters like WDF or a lowercase filter before the HyphenationDecompounder?
[jira] [Commented] (LUCENE-8132) HyphenationDecompoundTokenFilter does not set position/offset attributes correctly
[ https://issues.apache.org/jira/browse/LUCENE-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334273#comment-16334273 ]

Robert Muir commented on LUCENE-8132:
-------------------------------------

That's what HyphenationDecompoundTokenFilter already does. I think maybe the name is confusing, at least look at the class javadocs :)

In this case I'm sorry, but I think you are stretching (and you aren't correct). We should fix these filters and enforce a tokenizer as input, seriously.
[jira] [Commented] (LUCENE-8132) HyphenationDecompoundTokenFilter does not set position/offset attributes correctly
[ https://issues.apache.org/jira/browse/LUCENE-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334080#comment-16334080 ]

Adrien Grand commented on LUCENE-8132:
--------------------------------------

I haven't thought about concrete use-cases, but for instance I suspect some users perform decompounding using both an algorithm and a dictionary?
[jira] [Commented] (LUCENE-8132) HyphenationDecompoundTokenFilter does not set position/offset attributes correctly
[ https://issues.apache.org/jira/browse/LUCENE-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334074#comment-16334074 ]

Robert Muir commented on LUCENE-8132:
-------------------------------------

Why do you need to decompound more than once? The JapaneseTokenizer example is the same issue (as it already decompounds).
[jira] [Commented] (LUCENE-8132) HyphenationDecompoundTokenFilter does not set position/offset attributes correctly
[ https://issues.apache.org/jira/browse/LUCENE-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334065#comment-16334065 ]

Adrien Grand commented on LUCENE-8132:
--------------------------------------

I'm not sure how practical this would be: I think some tokenizers today sometimes set the position increment to 0 (JapaneseTokenizer?), and it would only allow one such filter in the analysis chain.
[jira] [Commented] (LUCENE-8132) HyphenationDecompoundTokenFilter does not set position/offset attributes correctly
[ https://issues.apache.org/jira/browse/LUCENE-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334027#comment-16334027 ]

Robert Muir commented on LUCENE-8132:
-------------------------------------

Maybe the right solution is just to fix it correctly and simply enforce {{input instanceof Tokenizer}}? Because it's really like an extension of tokenization.
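A rough sketch of what enforcing `input instanceof Tokenizer` could look like is given here. It does not use Lucene itself: the `Sketch*` classes are hypothetical stand-ins that only mirror the tokenizer/filter relationship, and the exception type and message are assumptions, not a patch from this issue. The point is only that a filter wrapping anything other than a tokenizer would be rejected at construction time, which matches the statement elsewhere in the thread that the decompounder would then have to be the first token filter in the chain.

```java
// Stand-in types that only mirror the shape of an analysis chain.
// They are NOT the real Lucene TokenStream/Tokenizer/TokenFilter classes.
abstract class SketchTokenStream {}

class SketchTokenizer extends SketchTokenStream {}

abstract class SketchTokenFilter extends SketchTokenStream {
    protected final SketchTokenStream input;

    SketchTokenFilter(SketchTokenStream input) {
        this.input = input;
    }
}

// An ordinary filter that accepts any upstream token stream.
class SketchLowerCaseFilter extends SketchTokenFilter {
    SketchLowerCaseFilter(SketchTokenStream input) {
        super(input);
    }
}

// A decompounding filter that insists on sitting directly on a tokenizer.
class SketchDecompoundFilter extends SketchTokenFilter {
    SketchDecompoundFilter(SketchTokenStream input) {
        super(requireTokenizer(input));
    }

    private static SketchTokenStream requireTokenizer(SketchTokenStream input) {
        if (!(input instanceof SketchTokenizer)) {
            // Exception type and message are assumptions for this sketch.
            throw new IllegalArgumentException(
                    "decompounding filter must be applied directly to a tokenizer");
        }
        return input;
    }
}

final class EnforceTokenizerDemo {
    /** Returns true if the decompound filter can be built on the given input. */
    static boolean accepts(SketchTokenStream input) {
        try {
            new SketchDecompoundFilter(input);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Directly on a tokenizer: accepted, prints true.
        System.out.println(accepts(new SketchTokenizer()));
        // Behind another filter: rejected, prints false.
        System.out.println(accepts(new SketchLowerCaseFilter(new SketchTokenizer())));
    }
}
```

The trade-off raised in the thread shows up directly: a chain that puts a lowercase filter, or a second decompounder, in front of this filter no longer constructs.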
[jira] [Commented] (LUCENE-8132) HyphenationDecompoundTokenFilter does not set position/offset attributes correctly
[ https://issues.apache.org/jira/browse/LUCENE-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334016#comment-16334016 ]

Holger Bruch commented on LUCENE-8132:
--------------------------------------

OK, this seems hard to get right for all cases. I wonder if the current implementation could work at query time for anyone. However, I’m working on a fix for HyphenationDecompoundTokenFilter that handles offset, posInc and posLength, though not when a synonym filter is applied before it.
[jira] [Commented] (LUCENE-8132) HyphenationDecompoundTokenFilter does not set position/offset attributes correctly
[ https://issues.apache.org/jira/browse/LUCENE-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333962#comment-16333962 ]

Adrien Grand commented on LUCENE-8132:
--------------------------------------

I agree this sounds wrong. Unfortunately, inserting positions in a token filter is hard to do right if the analysis chain has a preceding token filter that sets synonyms, as you need to fix positions on all paths. LUCENE-5012 touches on this problem a bit.