[jira] [Commented] (LUCENE-8132) HyphenationDecompoundTokenFilter does not set position/offset attributes correctly

2018-01-22 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335455#comment-16335455
 ] 

Adrien Grand commented on LUCENE-8132:
--

No, the hyphenation decompounder would have to be the first token filter in the 
analysis chain.

> HyphenationDecompoundTokenFilter does not set position/offset attributes 
> correctly
> --
>
> Key: LUCENE-8132
> URL: https://issues.apache.org/jira/browse/LUCENE-8132
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 6.6.1, 7.2.1
>Reporter: Holger Bruch
>Priority: Major
>
> HyphenationDecompoundTokenFilter and DictionaryDecompoundTokenFilter set 
> positionIncrement to 0 for all subwords, reuse start/endoffset of the 
> original token and ignore positionLength completely.
> In consequence, the QueryBuilder generates a SynonymQuery comprising all 
> subwords, which should rather be treated as individual terms.
>  
>  
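To make the reported behavior concrete, here is a minimal sketch in plain Java that does not use Lucene itself. The `Tok` class, the `byPosition` helper and the sample word are assumptions made for illustration. It only shows that tokens emitted with a position increment of 0 all land on the same position as the original token, which is the situation the description says leads to a SynonymQuery.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified stand-in for a token's term and position increment; not a Lucene class.
final class Tok {
    final String term;
    final int posInc;

    Tok(String term, int posInc) {
        this.term = term;
        this.posInc = posInc;
    }
}

final class StackedPositionsDemo {
    // Groups terms by absolute position. The counter starts at -1 so that
    // the first token with posInc=1 lands on position 0.
    static Map<Integer, List<String>> byPosition(List<Tok> tokens) {
        Map<Integer, List<String>> out = new TreeMap<>();
        int pos = -1;
        for (Tok t : tokens) {
            pos += t.posInc;
            out.computeIfAbsent(pos, k -> new ArrayList<>()).add(t.term);
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative input: the original token followed by subwords with posInc=0.
        List<Tok> current = Arrays.asList(
                new Tok("fussballpumpe", 1),
                new Tok("fussball", 0),
                new Tok("pumpe", 0));
        // All three terms share position 0.
        System.out.println(byPosition(current));
    }
}
```

Any consumer that treats terms sharing a position as alternatives of one another would read the three terms above as synonyms rather than as consecutive words.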



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8132) HyphenationDecompoundTokenFilter does not set position/offset attributes correctly

2018-01-22 Thread Holger Bruch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335427#comment-16335427
 ] 

Holger Bruch commented on LUCENE-8132:
--

I'm not as deep into Lucene as you are. What would be the pros and cons of 
ensuring the input is an instance of Tokenizer?
Would it still be possible to apply token filters like WDF or a lowercase 
filter before the hyphenation decompounder?




[jira] [Commented] (LUCENE-8132) HyphenationDecompoundTokenFilter does not set position/offset attributes correctly

2018-01-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334273#comment-16334273
 ] 

Robert Muir commented on LUCENE-8132:
-

That's what HyphenationDecompoundTokenFilter already does. I think maybe the 
name is confusing; at least look at the class javadocs :)

In this case I'm sorry, but I think you are stretching (and you aren't 
correct). We should fix these filters and enforce a tokenizer as input, seriously.




[jira] [Commented] (LUCENE-8132) HyphenationDecompoundTokenFilter does not set position/offset attributes correctly

2018-01-22 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334080#comment-16334080
 ] 

Adrien Grand commented on LUCENE-8132:
--

I haven't thought about concrete use-cases, but for instance I suspect some 
users perform decompounding using both an algorithm and a dictionary?




[jira] [Commented] (LUCENE-8132) HyphenationDecompoundTokenFilter does not set position/offset attributes correctly

2018-01-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334074#comment-16334074
 ] 

Robert Muir commented on LUCENE-8132:
-

Why do you need to decompound more than once? The JapaneseTokenizer example is 
the same issue (as it already decompounds).




[jira] [Commented] (LUCENE-8132) HyphenationDecompoundTokenFilter does not set position/offset attributes correctly

2018-01-22 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334065#comment-16334065
 ] 

Adrien Grand commented on LUCENE-8132:
--

I'm not sure how practical this would be: I think some tokenizers today 
sometimes set the pos inc to 0 (JapaneseTokenizer?), and it would allow only 
one such filter in the analysis chain.




[jira] [Commented] (LUCENE-8132) HyphenationDecompoundTokenFilter does not set position/offset attributes correctly

2018-01-22 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334027#comment-16334027
 ] 

Robert Muir commented on LUCENE-8132:
-

Maybe the right solution is just to fix it correctly and simply enforce 
{{input instanceof Tokenizer}}? Because it's really like an extension of tokenization.
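A rough sketch of what such a check could look like, written against small stand-in classes rather than Lucene's own. All class names here (`SketchTokenStream`, `SketchTokenizer`, `SketchFilter`, `SketchDecompounder`) are hypothetical, and nothing in this thread shows how an actual patch would do it.

```java
// Minimal stand-ins for the analysis-chain types; hypothetical, not Lucene's classes.
abstract class SketchTokenStream {
}

class SketchTokenizer extends SketchTokenStream {
}

class SketchFilter extends SketchTokenStream {
    final SketchTokenStream input;

    SketchFilter(SketchTokenStream input) {
        this.input = input;
    }
}

final class SketchDecompounder extends SketchFilter {
    SketchDecompounder(SketchTokenStream input) {
        // Validate before handing the input to the parent constructor.
        super(requireTokenizer(input));
    }

    // Rejects anything that is not a tokenizer, including other filters and null.
    static SketchTokenStream requireTokenizer(SketchTokenStream input) {
        if (!(input instanceof SketchTokenizer)) {
            String actual = (input == null) ? "null" : input.getClass().getSimpleName();
            throw new IllegalArgumentException(
                    "decompounder must directly follow a tokenizer, got: " + actual);
        }
        return input;
    }
}

final class EnforceTokenizerDemo {
    public static void main(String[] args) {
        // Accepted: the decompounder wraps a tokenizer directly.
        new SketchDecompounder(new SketchTokenizer());
        try {
            // Rejected: another filter sits between tokenizer and decompounder.
            new SketchDecompounder(new SketchFilter(new SketchTokenizer()));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Under a check like this the decompounder can only be the first filter after the tokenizer, so a lowercase or word-delimiter filter could follow it but not precede it.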




[jira] [Commented] (LUCENE-8132) HyphenationDecompoundTokenFilter does not set position/offset attributes correctly

2018-01-22 Thread Holger Bruch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16334016#comment-16334016
 ] 

Holger Bruch commented on LUCENE-8132:
--

Ok, seems hard to get right for all cases. I wonder whether the current 
implementation could work at query time for anyone. 
However, I'm working on a fix for HyphenationDecompoundTokenFilter that 
handles offset, posInc and posLength, though not when a synonym filter is 
applied before it.
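The fix itself is not shown in this thread. As one possible reading of "handles offset, posInc and posLength", the sketch below emits the original token spanning all subword positions, followed by subwords that advance the position and carry their own offsets. The `SubToken` class, the `decompound` helper and the sample word are hypothetical, and the sketch assumes the subwords are contiguous and non-overlapping, which a real decompounder does not guarantee.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical value type modelling four token attributes; not a Lucene class.
final class SubToken {
    final String term;
    final int posInc;
    final int posLen;
    final int startOffset;
    final int endOffset;

    SubToken(String term, int posInc, int posLen, int startOffset, int endOffset) {
        this.term = term;
        this.posInc = posInc;
        this.posLen = posLen;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    @Override
    public String toString() {
        return term + "[posInc=" + posInc + ",posLen=" + posLen
                + ",offsets=" + startOffset + "-" + endOffset + "]";
    }
}

final class DecompoundAttributesSketch {
    // cuts: ascending indices inside `term` where one subword ends and the next begins.
    static List<SubToken> decompound(String term, int startOffset, int[] cuts) {
        List<SubToken> out = new ArrayList<>();
        int parts = cuts.length + 1;
        // The original token spans as many positions as there are subwords.
        out.add(new SubToken(term, 1, parts, startOffset, startOffset + term.length()));
        if (cuts.length == 0) {
            return out; // nothing to split, keep only the original token
        }
        int from = 0;
        for (int i = 0; i < parts; i++) {
            int to = (i < cuts.length) ? cuts[i] : term.length();
            // The first subword shares the original's position; later ones advance by one.
            int posInc = (i == 0) ? 0 : 1;
            out.add(new SubToken(term.substring(from, to), posInc, 1,
                    startOffset + from, startOffset + to));
            from = to;
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative input: a 13-character token starting at offset 10, cut after 8 characters.
        for (SubToken t : decompound("fussballpumpe", 10, new int[] {8})) {
            System.out.println(t);
        }
    }
}
```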





[jira] [Commented] (LUCENE-8132) HyphenationDecompoundTokenFilter does not set position/offset attributes correctly

2018-01-21 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16333962#comment-16333962
 ] 

Adrien Grand commented on LUCENE-8132:
--

I agree this sounds wrong. Unfortunately, inserting positions in a token filter 
is hard to do right if the analysis chain has a preceding token filter that 
sets synonyms, as you need to fix positions on all paths. LUCENE-5012 touches 
on this problem a bit.
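The difficulty can be sketched with a toy token graph in plain Java. The `GraphTok` class, the `arcs` helper and the sample terms are assumptions made for illustration and are not Lucene code. Each token becomes an arc from its start position to its start position plus its position length.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a token's term, position increment and position length.
final class GraphTok {
    final String term;
    final int posInc;
    final int posLen;

    GraphTok(String term, int posInc, int posLen) {
        this.term = term;
        this.posInc = posInc;
        this.posLen = posLen;
    }
}

final class SynonymPathSketch {
    // Renders each token as an arc "from->to:term" in the token graph.
    static List<String> arcs(List<GraphTok> tokens) {
        List<String> out = new ArrayList<>();
        int pos = -1;
        for (GraphTok t : tokens) {
            pos += t.posInc;
            out.add(pos + "->" + (pos + t.posLen) + ":" + t.term);
        }
        return out;
    }

    public static void main(String[] args) {
        // After a synonym filter: "soccer" and "football" share position 0.
        List<GraphTok> before = Arrays.asList(
                new GraphTok("soccer", 1, 1),
                new GraphTok("football", 0, 1),
                new GraphTok("pump", 1, 1));
        // A later filter splits "football" into two positions but leaves "soccer" alone.
        List<GraphTok> naive = Arrays.asList(
                new GraphTok("soccer", 1, 1),
                new GraphTok("foot", 0, 1),
                new GraphTok("ball", 1, 1),
                new GraphTok("pump", 1, 1));
        // Consistent graph: "soccer" must now span both inserted positions.
        List<GraphTok> fixed = Arrays.asList(
                new GraphTok("soccer", 1, 2),
                new GraphTok("foot", 0, 1),
                new GraphTok("ball", 1, 1),
                new GraphTok("pump", 1, 1));
        System.out.println(arcs(before));
        System.out.println(arcs(naive));
        System.out.println(arcs(fixed));
    }
}
```

In the naive variant the "soccer" arc ends at node 1, where "ball" begins, so the graph contains a "soccer ball pump" path and loses "soccer pump". Repairing it means changing the position length of a token on the other path, which a streaming filter has typically already passed along by the time it sees the token it wants to split.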
