[jira] [Commented] (LUCENE-7622) Should BaseTokenStreamTestCase catch analyzers that create duplicate tokens?

2017-01-07 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15807640#comment-15807640
 ] 

Uwe Schindler commented on LUCENE-7622:
---

Hi Robert. I know that you can tune. Maybe I was a bit unclear. I wanted to say 
that unlike with stupid CrappyDefaultSim it's no longer possible to boost terms 
more or less unlimited (like a document with 1 times the same term no 
longer beats all others). So repeating terms at the same position with a repeater 
token filter is still useful, but the effect is no longer so drastic. Sorry for being 
unclear. Maybe I will change or remove the last sentence in my comment to clear up 
the misunderstanding.

> Should BaseTokenStreamTestCase catch analyzers that create duplicate tokens?
> 
>
> Key: LUCENE-7622
> URL: https://issues.apache.org/jira/browse/LUCENE-7622
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Michael McCandless
> Attachments: LUCENE-7622.patch
>
>
> The change to BTSTC is quite simple, to catch any case where the same term 
> text spans from the same position with the same position length. Such 
> duplicate tokens are silly to add to the index, or to search at search time.
> Yet, this change produced many failures, and I looked briefly at them, and 
> they are cases that I think are actually OK, e.g. 
> {{PatternCaptureGroupTokenFilter}} capturing (..)(..) on the string {{ktkt}} 
> will create a duplicate token.
> Other cases looked more dubious, e.g. {{WordDelimiterFilter}}.
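
For readers following along, the check described in the issue can be sketched roughly as below. This is only an illustration in plain Python, not the actual BTSTC patch: the tuple layout (term, position, position length) and the helper names capture_group_tokens and find_duplicates are assumptions made up for this example, and the capture-group emitter only imitates the {{ktkt}} case mentioned above.

```python
import re
from collections import Counter


def capture_group_tokens(text, pattern):
    """Emit every non-empty capture group as a token stacked at position 0.

    Each token is a (term, position, position_length) tuple. This is a toy
    stand-in for a capture-group filter, not Lucene's implementation.
    """
    match = re.search(pattern, text)
    if match is None:
        return []
    return [(group, 0, 1) for group in match.groups() if group]


def find_duplicates(tokens):
    """Return tokens seen more than once with the same term, position and length."""
    counts = Counter(tokens)
    return sorted(tok for tok, n in counts.items() if n > 1)


tokens = capture_group_tokens("ktkt", r"(..)(..)")
duplicates = find_duplicates(tokens)
print(tokens)      # both capture groups yield the term "kt" at the same position
print(duplicates)  # the duplicate that such a check would flag
```

The same term at a different position is not reported, since only an identical (term, position, position length) triple counts as a duplicate here.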



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7622) Should BaseTokenStreamTestCase catch analyzers that create duplicate tokens?

2017-01-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15807508#comment-15807508
 ] 

Robert Muir commented on LUCENE-7622:
-

BM25 does not make this harder. It just normalizes term frequency in a way that 
isn't as brain dead as {{sqrt}}. And unlike Crappy^H^H^H^HDefaultSimilarity, 
it's totally tunable without modifying source code, e.g. adjust the {{k1}} parameter 
to your needs.

Sorry, you are wrong: it only makes this kind of thing way easier.
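
To make the difference concrete, here is a rough sketch comparing the two term-frequency components. It uses the textbook BM25 formula and a plain square root; it is not Lucene's scoring code. The function names, the default b = 0.75, and the choice to hold the document length at the average are all assumptions for illustration.

```python
import math


def classic_tf(freq):
    """Square-root term frequency: keeps growing as freq grows."""
    return math.sqrt(freq)


def bm25_tf(freq, k1=1.2, b=0.75, doc_len=100.0, avg_doc_len=100.0):
    """Textbook BM25 term-frequency component; saturates below k1 + 1.

    doc_len is held fixed here (an assumption), so only freq varies.
    """
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return freq * (k1 + 1.0) / (freq + norm)


for freq in (1, 10, 100, 10000):
    print(freq, round(classic_tf(freq), 3), round(bm25_tf(freq), 3))
```

Under these assumptions, repeating a term still raises the BM25 component, but the gain levels off, and a larger k1 raises the ceiling.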




[jira] [Commented] (LUCENE-7622) Should BaseTokenStreamTestCase catch analyzers that create duplicate tokens?

2017-01-07 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15807277#comment-15807277
 ] 

Uwe Schindler commented on LUCENE-7622:
---

For the above boosting use cases, it would be better to have an additional 
attribute in TokenStreams that defaults to 1, but returns a "frequency" or 
"boost" if used. Then you could stop cloning the tokens. FYI: I know that BM25 
makes this type of boosting harder, but you can still add emphasis on tokens in 
a text by duplicating them.
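
A minimal sketch of that idea, in plain Python rather than against the TokenStream API: the Token class, its freq field, and term_frequencies are hypothetical names invented for this example. It only shows that one token carrying a frequency can stand in for several cloned tokens when term frequencies are counted.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    term: str
    position: int
    freq: int = 1  # hypothetical per-token frequency; 1 means "count once"


def term_frequencies(tokens):
    """Sum the per-token frequency for each term."""
    counts = Counter()
    for token in tokens:
        counts[token.term] += token.freq
    return counts


# Boosting by cloning the token three times at the same position ...
cloned = [Token("lucene", 0), Token("lucene", 0), Token("lucene", 0), Token("search", 1)]
# ... versus a single token that carries the frequency itself.
weighted = [Token("lucene", 0, freq=3), Token("search", 1)]

print(term_frequencies(cloned) == term_frequencies(weighted))
```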




[jira] [Commented] (LUCENE-7622) Should BaseTokenStreamTestCase catch analyzers that create duplicate tokens?

2017-01-07 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15807274#comment-15807274
 ] 

Uwe Schindler commented on LUCENE-7622:
---

I agree that by default TokenStreams should not produce duplicate tokens, but 
there are use cases (boosting) where you might want to do this. E.g., if you 
want to raise the boost of a term in a document (e.g., if it's inside an 
HTML tag and should have emphasis), you can duplicate the token to increase its 
frequency (with the same position). The alternative would be payloads and a payload 
query, but duplication is cheap to do.

Also: if you use ASCIIFoldingFilter or stemming and add the folded/stemmed 
terms together with the original ones to the index, those terms with no 
folding/stemming applied would get duplicated. But if you don't do this, the 
statistics would be wrong. I agree, for this case it would be better to have a 
separate field, but some people like to have it in the same field.
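
The folding case can be sketched like this. It is a simplified illustration, not ASCIIFoldingFilter itself: ascii_fold just strips combining marks via unicodedata, and fold_keep_original with its dedupe flag is a hypothetical helper that emits the original and folded term at the same position.

```python
import unicodedata


def ascii_fold(term):
    """Rough folding: decompose and drop combining marks (an approximation)."""
    decomposed = unicodedata.normalize("NFKD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold_keep_original(terms, dedupe=False):
    """Emit (term, position) for the original and the folded form.

    With dedupe=False an unchanged term is emitted twice at its position;
    with dedupe=True the folded copy is skipped when it equals the original.
    """
    tokens = []
    for position, term in enumerate(terms):
        folded = ascii_fold(term)
        tokens.append((term, position))
        if folded != term or not dedupe:
            tokens.append((folded, position))
    return tokens


with_duplicates = fold_keep_original(["caf\u00e9", "cafe"])
deduplicated = fold_keep_original(["caf\u00e9", "cafe"], dedupe=True)
print(with_duplicates)
print(deduplicated)
```

In the first mode every input term contributes two tokens, so a term that needed no folding appears twice at its position; in the second mode it appears once, which changes the per-term counts.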
