[jira] [Commented] (LUCENE-8943) Incorrect IDF in MultiPhraseQuery and SpanOrQuery

2019-08-12 Thread Christoph Goller (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905178#comment-16905178
 ] 

Christoph Goller commented on LUCENE-8943:
--

I agree, we cannot realistically approximate the doc freq of phrases.
And yes, actually the scoring problem I brought up is a kind of synonym issue.

Usually, if we are using synonyms, we want to score exact query matches higher 
than synonym matches. That's probably one of the reasons why SynonymQuery 
allows specifying boosts.

I have lots of multiword synonyms. W2k16, for example, is a synonym for "Windows 
Server 2016". Different boosts for multiword synonyms don't work reliably, since 
matches for "Windows Server 2016" may score much higher than those for W2k16 due 
to huge IDFs.

I am not so much looking for an optimal BM25 scoring for Phrases / Multiphrases 
/ Spans. Instead, I am looking for a 
way to score them within BM25 so that boosts work as expected.

One step in this direction would be to limit IDF values in the case of Phrases / 
Multiphrases / Spans. In BM25 it seems to be very important that IDF saturates, 
and currently the behavior of Phrases / Multiphrases / Spans contradicts that. 
With the solution I proposed we can get rid of huge IDF values for Phrases / 
Multiphrases / Spans, so I still think we should do it. It would also 
make scores more comparable, and boosts would work more reliably.
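
As a rough illustration of what such a limit could look like (my own sketch, not 
existing Lucene code; capping at the IDF of a hypothetical single term with 
docFreq 1 is just one possible choice):

{code:java}
// Sketch only: cap the summed phrase IDF so it can never exceed the IDF of a
// very rare single term. idf() is the BM25 IDF formula; the choice of cap is
// an assumption for illustration.
static double idf(double docFreq, double docCount) {
  return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
}

static double cappedPhraseIdf(long[] docFreqs, long docCount) {
  double sum = 0;
  for (long df : docFreqs) {
    sum += idf(df, docCount);
  }
  // saturate at the IDF a single term with docFreq 1 would get
  return Math.min(sum, idf(1, docCount));
}
{code}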

Your post made me think of the problem in another way. If we had something like 
MultiWordsSynonymQuery, we could have even more control. Similar to 
SynonymQuery, we could use one IDF value for all synonyms. Synonym boosts would 
work much more reliably.

MultiWordsSynonymQuery could be very general. In my last post I suggested 
approximating docFreq instead of IDFs in order to guarantee saturation. To 
implement it, I thought about adding a member variable pseudoStats 
(TermStatistics) to Weight, which would be used for computing the SimScorer. 
Usually the values for pseudoStats would be computed bottom-up (SpanWeight, 
PhraseWeight) from the subqueries. But we could implement a general 
MultiWordsSynonymQuery as a subclass of BooleanQuery (only allowing disjunction) 
which would set (adapt) pseudoStats in all its subweights (docFreq as the max 
docFreq of all synonyms, just as SynonymQuery currently does).
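
To make the idea more concrete, here is a rough sketch of my own (not existing 
Lucene API; the class and method names are hypothetical) of how such a pseudo 
TermStatistics could be built for a set of synonym terms and then handed to the 
Similarity instead of the per-term statistics:

{code:java}
import org.apache.lucene.search.TermStatistics;
import org.apache.lucene.util.BytesRef;

// Sketch only: docFreq is the max over the synonyms, as SynonymQuery does
// today. The result would be what a MultiWordsSynonymQuery-style weight
// passes to Similarity.scorer(...) instead of the individual term stats.
public final class PseudoStats {
  private PseudoStats() {}

  public static TermStatistics forSynonyms(BytesRef label, TermStatistics[] termStats) {
    long docFreq = 0;
    long totalTermFreq = 0;
    for (TermStatistics ts : termStats) {   // assumes at least one synonym term
      docFreq = Math.max(docFreq, ts.docFreq());
      totalTermFreq = Math.max(totalTermFreq, ts.totalTermFreq());
    }
    return new TermStatistics(label, docFreq, totalTermFreq);
  }
}
{code}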

> Incorrect IDF in MultiPhraseQuery and SpanOrQuery
> -
>
> Key: LUCENE-8943
> URL: https://issues.apache.org/jira/browse/LUCENE-8943
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/query/scoring
>Affects Versions: 8.0
>Reporter: Christoph Goller
>Priority: Major
>
> I recently stumbled across a very old bug in the IDF computation for 
> MultiPhraseQuery and SpanOrQuery.
> BM25Similarity and TFIDFSimilarity / ClassicSimilarity have a method for 
> combining IDF values from more than one term / TermStatistics.
> I mean the method:
> Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics 
> termStats[])
> It simply adds up the IDFs from all termStats[].
> This method is used e.g. in PhraseQuery where it makes sense. If we assume 
> that for the phrase "New York" the occurrences of both words are independent, 
> we can multiply their probabilities and, since IDFs are logarithmic, we add them 
> up. Seems to be a reasonable approximation. However, this method is also used 
> to add up the IDFs of all terms in a MultiPhraseQuery as can be seen in:
> Similarity.SimScorer getStats(IndexSearcher searcher)
> A MultiPhraseQuery is actually a PhraseQuery with alternatives at individual 
> positions. IDFs of alternative terms for one position should not be added up. 
> Instead we should use the minimum value as an approximation because this 
> corresponds to the docFreq of the most frequent term and we know that this is 
> a lower bound for the docFreq for this position.
> In SpanOrQuery we have the same problem. It uses buildSimWeight(...) from 
> SpanWeight and adds up all IDFs of all OR-clauses.
> If my arguments are not convincing, look at SynonymQuery / SynonymWeight in 
> the constructor:
> SynonymWeight(Query query, IndexSearcher searcher, ScoreMode scoreMode, float 
> boost) 
> A SynonymQuery is also a kind of OR-query and it uses the maximum of the 
> docFreq of all its alternative terms. I think this is how it should be.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-8943) Incorrect IDF in MultiPhraseQuery and SpanOrQuery

2019-08-06 Thread Christoph Goller (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901076#comment-16901076
 ] 

Christoph Goller edited comment on LUCENE-8943 at 8/6/19 1:54 PM:
--

{{Thanks for your quick response, Alan. I've been doing some thinking about 
adding up IDF values in the case of simple phrase queries, and I no longer think 
that is the way we should do it.}}

{{The problem is that we can get very high IDF values, i.e. values that are 
considerably higher than the maximum IDF value for a single term!}}

{{Consider an index with 10 million docs. The maximum IDF value (BM25) for a 
single term is 16.8. Assume we have 10 docs containing "wifi" and 10 docs 
containing "wi-fi" which is split by our tokenizer into 2 tokens. The IDF value 
for "wifi" will be 13.77. If we assume that "wi" and "fi" both occur only in 
"wi-fi" docs, we get an IDF of 27.5 for the "wi fi" phrase query which wee need 
in order to find our 10 "wi-fi" docs. If we search for wifi OR "wi fi" the docs 
containing "wi-fi" will score much higher!}}

{{Admittedly, it is easy to construct examples in which adding the IDF values 
of phrase parts yields values that are too high. The assumption of independence 
of phrase parts does not normally apply. But BM25 has a saturation for IDF 
values and adding up IDF values breaks it. This seems to be a serious 
drawback.}}

{{I propose to switch from combining IDF-values to calculating / approximating 
docFreq. For the OR-case SynonymQuery does this already. It uses the maximum. 
For the AND-case we could use something like}}

{{docFreqPhrase = (docFreq1 * docFreq2) / docCount}}

{{The intuition behind this is again independence of phrase parts. But by 
computing a docFreq we can guarantee the saturation for IDF.}}

{{For the "wi fi" example we get docFreqPhrase of 10^-5 leading to an IDF of 
16.8 (saturation) and the difference to the IDF of wifi is considerably smaller 
compared to adding up IDFs. If phrase parts are rare, we quickly run into 
saturation of the IDF. But we also get some reasonable values. Consider the 
phrase "New York". If we assume that 100,000 docs contain "new" and 10,000 docs 
contain "york". By applying the formula from above we get and IDF for the 
phrase "New York" of 11.5 which is roughly the number we get when we add up the 
IDFs of the parts (current Lucene behavior).}}
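
To make the arithmetic above easy to verify, here is a small self-contained 
sketch of my own, using the IDF formula log(1 + (N - df + 0.5) / (df + 0.5)) 
that BM25Similarity uses, with N = 10 million docs:

{code:java}
// Sketch: reproduces the IDF numbers from the examples above.
public class PhraseIdfExample {
  static double idf(double docFreq, double docCount) {
    return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
  }

  public static void main(String[] args) {
    double n = 10_000_000;
    System.out.println(idf(10, n));                     // "wifi" (docFreq 10): ~13.77
    System.out.println(2 * idf(10, n));                 // "wi fi" with IDFs added up: ~27.5
    System.out.println(idf(10.0 * 10 / n, n));          // proposed docFreqPhrase = 10^-5: ~16.8
    System.out.println(idf(100_000.0 * 10_000 / n, n)); // "New York", docFreqPhrase ~100: ~11.5
  }
}
{code}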

{{We could even have some simple adjustments for the fact that usually the 
independence assumption is not correct. For both the OR-case and the AND-case 
we could make values a little bit higher. The exact way for approximating 
docFreq for the AND-case and the OR-case could be defined in the Similarity and 
it could be configurable.}}

I also did some research with Google:

{{(multiword OR N-gram) AND BM25 AND IDF}}

Unfortunately I did not find anything that helps.

{{Do you know about the benchmarks used to evaluate scoring in Lucene? Are 
there any phrase queries involved?}}
 {{Robert told me it’s very TREC-like, so probably no phrase queries?}}

{{In my opinion something like BM25 can only get us to a certain level of 
relevance. Of course, we have to get it right. IDF values of phrases / 
SpanQueries should not have such a big effect on the score simply because they 
get too high IDF-values. We have to do something reasonable. But for real 
break-through improvements we need something like query segmentation or even 
query interpretation, and proximity of query terms in documents should have a 
high impact on the score. That's why I think it is important to integrate 
PhraseQueries and SpanQueries properly into BM25.}}


[jira] [Commented] (LUCENE-8943) Incorrect IDF in MultiPhraseQuery and SpanOrQuery

2019-08-06 Thread Christoph Goller (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901076#comment-16901076
 ] 

Christoph Goller commented on LUCENE-8943:
--

{{Thanks for your quick response Alan. I've been doing some thinking about 
adding up IDF values in case of simple phrase queries and I no longer think 
that is the way we should do it.}}

{{The problem is that we can get very high IDF values, i.e. values that are 
considerably higher than the maximum IDF value for a single term!}}

{{Consider an index with 10 million docs. The maximum IDF value (BM25) for a 
single term is 16.8. Assume we have 10 docs containing "wifi" and 10 docs 
containing "wi-fi" which is split by our tokenizer into 2 tokens. The IDF value 
for "wifi" will be 13.77. If we assume that "wi" and "fi" both occur only in 
"wi-fi" docs, we get an IDF of 27.5 for the "wi fi" phrase query which wee need 
in order to find our 10 "wi-fi" docs. If we search for wifi OR "wi fi" the docs 
containing "wi-fi" will score much higher!}}

{{Admittedly, it is easy to construct examples in which adding the IDF values 
of phrase parts yields values that are too high. The assumption of independence 
of phrase parts does not normally apply. But BM25 has a saturation for IDF 
values and adding up IDF values breaks it. This seems to be a serious 
drawback.}}

{{I propose to switch from combining IDF-values to calculating / approximating 
docFreq. For the OR-case SynonymQuery does this already. It uses the maximum. 
For the AND-case we could use something like}}

{{docFreqPhrase = (docFreq1 * docFreq2) / docCount}}

{{The intuition behind this is again independence of phrase parts. But by 
computing a docFreq we can guarantee the saturation for IDF.}}

{{For the "wi fi" example we get docFreqPhrase of 10^-5 leading to an IDF of 
16.8 (saturation) and the difference to the IDF of wifi is considerably smaller 
compared to adding up IDFs. If phrase parts are rare, we quickly run into 
saturation of the IDF. But we also get some reasonable values. Consider the 
phrase "New York". If we assume that 100,000 docs contain "new" and 10,000 docs 
contain "york". By applying the formula from above we get and IDF for the 
phrase "New York" of 11.5 which is roughly the number we get when we add up the 
IDFs of the parts (current Lucene behavior).}}

{{We could even have some simple adjustments for the fact that usually the 
independence assumption is not correct. For both the OR-case and the AND-case 
we could make values a little bit higher. The exact way for approximating 
docFreq for the AND-case and the OR-case could be defined in the Similarity and 
it could be configurable.}}

{{I also did some research with Google: (multiword OR N-gram) AND BM25 AND IDF}}
{{Unfortunately, I did not find anything that helps. }}
{{Do you know about the benchmarks used to evaluate scoring in Lucene? Are 
there any phrase queries involved?}}
 {{Robert told me it’s very TREC-like, so probably no phrase queries?}}

{{In my opinion something like BM25 can only get us to a certain level of 
relevance. Of course, we have to get it right. IDF values of phrases / 
SpanQueries should not have such a big effect on the score simply because they 
get too high IDF-values. We have to do something reasonable. But for real 
break-through improvements we need something like query segmentation or even 
query interpretation, and proximity of query terms in documents should have a 
high impact on the score. That's why I think it is important to integrate 
PhraseQueries and SpanQueries properly into BM25.}}


[jira] [Comment Edited] (LUCENE-8943) Incorrect IDF in MultiPhraseQuery and SpanOrQuery

2019-08-02 Thread Christoph Goller (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898859#comment-16898859
 ] 

Christoph Goller edited comment on LUCENE-8943 at 8/2/19 12:39 PM:
---

Why is this an issue?

Because IDFs of SpanOrQuery and MultiPhraseQuery can get gigantic, meaning that 
such queries have an unexpectedly high impact on the final score.



> Incorrect IDF in MultiPhraseQuery and SpanOrQuery
> -
>
> Key: LUCENE-8943
> URL: https://issues.apache.org/jira/browse/LUCENE-8943
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/query/scoring
>Affects Versions: 8.0
>Reporter: Christoph Goller
>Priority: Major
>
> I recently stumbled across a very old bug in the IDF computation for 
> MultiPhraseQuery and SpanOrQuery.
> BM25Similarity and TFIDFSimilarity / ClassicSimilarity have a method for 
> combining IDF values from more than one term / TermStatistics.
> I mean the method:
> Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics 
> termStats[])
> It simply adds up the IDFs from all termStats[].
> This method is used e.g. in PhraseQuery where it makes sense. If we assume 
> that for the phrase "New York" the occurrences of both words are independent, 
> we can multiply their probabilities and, since IDFs are logarithmic, we add them 
> up. Seems to be a reasonable approximation. However, this method is also used 
> to add up the IDFs of all terms in a MultiPhraseQuery as can be seen in:
> Similarity.SimScorer getStats(IndexSearcher searcher)
> A MultiPhraseQuery is actually a PhraseQuery with alternatives at individual 
> positions. IDFs of alternative terms for one position should not be added up. 
> Instead we should use the minimum value as an approximation because this 
> corresponds to the docFreq of the most frequent term and we know that this is 
> a lower bound for the docFreq for this position.
> In SpanOrQuery we have the same problem. It uses buildSimWeight(...) from 
> SpanWeight and adds up all IDFs of all OR-clauses.
> If my arguments are not convincing, look at SynonymQuery / SynonymWeight in 
> the constructor:
> SynonymWeight(Query query, IndexSearcher searcher, ScoreMode scoreMode, float 
> boost) 
> A SynonymQuery is also a kind of OR-query and it uses the maximum of the 
> docFreq of all its alternative terms. I think this is how it should be.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8943) Incorrect IDF in MultiPhraseQuery and SpanOrQuery

2019-08-02 Thread Christoph Goller (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16898859#comment-16898859
 ] 

Christoph Goller commented on LUCENE-8943:
--

Why is this an issue?

Because IDFs of SpanOrQuery and MultiPhraseQuery can get gigantic, meaning that 
such queries get an unexpectedly high impact on the final score.

> Incorrect IDF in MultiPhraseQuery and SpanOrQuery
> -
>
> Key: LUCENE-8943
> URL: https://issues.apache.org/jira/browse/LUCENE-8943
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/query/scoring
>Affects Versions: 8.0
>Reporter: Christoph Goller
>Priority: Major
>
> I recently stumbled across a very old bug in the IDF computation for 
> MultiPhraseQuery and SpanOrQuery.
> BM25Similarity and TFIDFSimilarity / ClassicSimilarity have a method for 
> combining IDF values from more than one term / TermStatistics.
> I mean the method:
> Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics 
> termStats[])
> It simply adds up the IDFs from all termStats[].
> This method is used e.g. in PhraseQuery where it makes sense. If we assume 
> that for the phrase "New York" the occurrences of both words are independent, 
> we can multiply their probabilities and, since IDFs are logarithmic, we add them 
> up. Seems to be a reasonable approximation. However, this method is also used 
> to add up the IDFs of all terms in a MultiPhraseQuery as can be seen in:
> Similarity.SimScorer getStats(IndexSearcher searcher)
> A MultiPhraseQuery is actually a PhraseQuery with alternatives at individual 
> positions. IDFs of alternative terms for one position should not be added up. 
> Instead we should use the minimum value as an approximation because this 
> corresponds to the docFreq of the most frequent term and we know that this is 
> a lower bound for the docFreq for this position.
> In SpanOrQuery we have the same problem. It uses buildSimWeight(...) from 
> SpanWeight and adds up all IDFs of all OR-clauses.
> If my arguments are not convincing, look at SynonymQuery / SynonymWeight in 
> the constructor:
> SynonymWeight(Query query, IndexSearcher searcher, ScoreMode scoreMode, float 
> boost) 
> A SynonymQuery is also a kind of OR-query and it uses the maximum of the 
> docFreq of all its alternative terms. I think this is how it should be.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8943) Incorrect IDF in MultiPhraseQuery and SpanOrQuery

2019-08-02 Thread Christoph Goller (JIRA)
Christoph Goller created LUCENE-8943:


 Summary: Incorrect IDF in MultiPhraseQuery and SpanOrQuery
 Key: LUCENE-8943
 URL: https://issues.apache.org/jira/browse/LUCENE-8943
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/query/scoring
Affects Versions: 8.0
Reporter: Christoph Goller


I recently stumbled across a very old bug in the IDF computation for 
MultiPhraseQuery and SpanOrQuery.

BM25Similarity and TFIDFSimilarity / ClassicSimilarity have a method for 
combining IDF values from more than one term / TermStatistics.

I mean the method:


Explanation idfExplain(CollectionStatistics collectionStats, TermStatistics 
termStats[])


It simply adds up the IDFs from all termStats[].

This method is used e.g. in PhraseQuery where it makes sense. If we assume that 
for the phrase "New York" the occurrences of both words are independent, we can 
multiply their probabilities and, since IDFs are logarithmic, we add them up. 
Seems to be a reasonable approximation. However, this method is also used to 
add up the IDFs of all terms in a MultiPhraseQuery as can be seen in:


Similarity.SimScorer getStats(IndexSearcher searcher)

A MultiPhraseQuery is actually a PhraseQuery with alternatives at individual 
positions. IDFs of alternative terms for one position should not be added up. 
Instead we should use the minimum value as an approximation because this 
corresponds to the docFreq of the most frequent term and we know that this is a 
lower bound for the docFreq for this position.

In SpanOrQuery we have the same problem. It uses buildSimWeight(...) from 
SpanWeight and adds up all IDFs of all OR-clauses.

If my arguments are not convincing, look at SynonymQuery / SynonymWeight in the 
constructor:

SynonymWeight(Query query, IndexSearcher searcher, ScoreMode scoreMode, float 
boost) 

A SynonymQuery is also a kind of OR-query and it uses the maximum of the 
docFreq of all its alternative terms. I think this is how it should be.
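
Purely as an illustration of the difference described above (my own sketch, not 
the actual idfExplain / getStats code; idf() is the BM25 IDF formula): summing 
per-term IDFs versus using the maximum docFreq, i.e. the minimum IDF, for the 
alternatives at one position of a MultiPhraseQuery:

{code:java}
// Illustration only, not the actual Lucene implementation.
public final class PositionIdfSketch {
  static double idf(long docFreq, long docCount) {
    return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
  }

  // Current behavior: IDFs of all terms are added up, including alternatives
  // at the same position, which can exceed any single-term IDF by far.
  static double summedIdf(long[] docFreqs, long docCount) {
    double sum = 0;
    for (long df : docFreqs) {
      sum += idf(df, docCount);
    }
    return sum;
  }

  // Suggested for alternatives at one position: the most frequent alternative
  // (max docFreq) is a lower bound for the position's docFreq, so its IDF is
  // an upper bound for what the position should contribute.
  static double positionIdf(long[] alternativeDocFreqs, long docCount) {
    long maxDocFreq = 0;
    for (long df : alternativeDocFreqs) {
      maxDocFreq = Math.max(maxDocFreq, df);
    }
    return idf(maxDocFreq, docCount);
  }
}
{code}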



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-8637) WeightedSpanTermExtractor unnecessarily enforces rewrite for some SpanQueries

2019-01-14 Thread Christoph Goller (JIRA)
Christoph Goller created LUCENE-8637:


 Summary: WeightedSpanTermExtractor unnecessarily enforces rewrite 
for some SpanQueries
 Key: LUCENE-8637
 URL: https://issues.apache.org/jira/browse/LUCENE-8637
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/highlighter
Affects Versions: 7.5, 7.3.1, 7.4, 7.6
Reporter: Christoph Goller


The method mustRewriteQuery(SpanQuery) returns true for SpanPositionCheckQuery, 
SpanContainingQuery, SpanWithinQuery, and SpanBoostQuery; however, these 
queries do not require rewriting. One effect of this is, e.g., that the 
UnifiedHighlighter does not work with OffsetSource Postings and switches to 
Analysis, which of course has consequences for performance.

I attach a patch for Lucene version 7.6.0. I have not checked whether it breaks 
existing unit tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-8000) Document Length Normalization in BM25Similarity correct?

2017-10-23 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214805#comment-16214805
 ] 

Christoph Goller edited comment on LUCENE-8000 at 10/23/17 8:17 AM:


??As an additional point, advanced use cases often utilize token "stacking" for 
additional uses as well and these would have further distortions on length.??

That's exactly what we are doing. Therefore using discountOverlaps = false 
could punish languages with more different word forms. I also prefer 
discountOverlaps = true. I have an intern (student) working on relevance tuning 
and benchmarks. I think we can try overriding 
{code:java}
protected float avgFieldLength(CollectionStatistics collectionStats)
{code}
 and see if it changes anything. We will also have a look at the Lucene benchmark 
module.
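
A minimal sketch of what that experiment could look like (my own illustration, 
assuming the Lucene version discussed here, where avgFieldLength is a protected 
method of BM25Similarity; the correction factor is a made-up placeholder, since 
CollectionStatistics does not expose overlap counts):

{code:java}
import org.apache.lucene.search.CollectionStatistics;
import org.apache.lucene.search.similarities.BM25Similarity;

// Sketch: scale the sumTotalTermFreq-based average length to compensate for
// stacked tokens that are discounted in the per-document norm but still
// counted in sumTotalTermFreq. The factor is a placeholder to experiment with.
public class AdjustedBM25Similarity extends BM25Similarity {
  private final float overlapFactor; // e.g. estimated share of non-stacked tokens

  public AdjustedBM25Similarity(float overlapFactor) {
    this.overlapFactor = overlapFactor;
  }

  @Override
  protected float avgFieldLength(CollectionStatistics collectionStats) {
    return super.avgFieldLength(collectionStats) * overlapFactor;
  }
}
{code}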

Thanks for your feedback.



> Document Length Normalization in BM25Similarity correct?
> 
>
> Key: LUCENE-8000
> URL: https://issues.apache.org/jira/browse/LUCENE-8000
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Christoph Goller
>Priority: Minor
>
> Length of individual documents only counts the number of positions of a 
> document since discountOverlaps defaults to true.
> {code}
>  @Override
>   public final long computeNorm(FieldInvertState state) {
> final int numTerms = discountOverlaps ? state.getLength() - 
> state.getNumOverlap() : state.getLength();
> int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
> if (indexCreatedVersionMajor >= 7) {
>   return SmallFloat.intToByte4(numTerms);
> } else {
>   return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
> }
>   }}
> {code}
> Measuring document length this way seems perfectly ok for me. What bothers 
> me is that
> average document length is based on sumTotalTermFreq for a field. As far as I 
> understand that sums up totalTermFreqs for all terms of a field, therefore 
> counting positions of terms including those that overlap.
> {code}
>  protected float avgFieldLength(CollectionStatistics collectionStats) {
> final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
> if (sumTotalTermFreq <= 0) {
>   return 1f;   // field does not exist, or stat is unsupported
> } else {
>   final long docCount = collectionStats.docCount() == -1 ? 
> collectionStats.maxDoc() : collectionStats.docCount();
>   return (float) (sumTotalTermFreq / (double) docCount);
> }
>   }
> }
> {code}
> Are we comparing apples and oranges in the final scoring?
> I haven't run any benchmarks and I am not sure whether this has a serious 
> effect. It just means that documents that have synonyms or, in my use case, 
> different normal forms of tokens on the same position are shorter and 
> therefore get higher scores than they should, and that we do not use the 
> whole spectrum of relative document length of BM25.
> I think for BM25, discountOverlaps should default to false.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8000) Document Length Normalization in BM25Similarity correct?

2017-10-23 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16214805#comment-16214805
 ] 

Christoph Goller commented on LUCENE-8000:
--

??As an additional point, advanced use cases often utilize token "stacking" for 
additional uses as well and these would have further distortions on length. ??

That's exactly what we are doing. Therefore using discountOverlaps = false 
could punish languages with more different word forms. I also prefer 
discountOverlaps = true. I have an intern (student) working on relevance tuning 
and benchmarks. I think we can try overriding 
{code:java}
protected float avgFieldLength(CollectionStatistics collectionStats)
{code}
 and see if it changes anything. We will also have a look at the Lucene benchmark 
module.

Thanks for your feedback.

> Document Length Normalization in BM25Similarity correct?
> 
>
> Key: LUCENE-8000
> URL: https://issues.apache.org/jira/browse/LUCENE-8000
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Christoph Goller
>Priority: Minor
>
> Length of individual documents only counts the number of positions of a 
> document since discountOverlaps defaults to true.
> {code}
>  @Override
>   public final long computeNorm(FieldInvertState state) {
> final int numTerms = discountOverlaps ? state.getLength() - 
> state.getNumOverlap() : state.getLength();
> int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
> if (indexCreatedVersionMajor >= 7) {
>   return SmallFloat.intToByte4(numTerms);
> } else {
>   return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
> }
>   }}
> {code}
> Measuring document length this way seems perfectly ok for me. What bothers 
> me is that
> average document length is based on sumTotalTermFreq for a field. As far as I 
> understand that sums up totalTermFreqs for all terms of a field, therefore 
> counting positions of terms including those that overlap.
> {code}
>  protected float avgFieldLength(CollectionStatistics collectionStats) {
> final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
> if (sumTotalTermFreq <= 0) {
>   return 1f;   // field does not exist, or stat is unsupported
> } else {
>   final long docCount = collectionStats.docCount() == -1 ? 
> collectionStats.maxDoc() : collectionStats.docCount();
>   return (float) (sumTotalTermFreq / (double) docCount);
> }
>   }
> }
> {code}
> Are we comparing apples and oranges in the final scoring?
> I haven't run any benchmarks and I am not sure whether this has a serious 
> effect. It just means that documents that have synonyms or, in my use case, 
> different normal forms of tokens on the same position are shorter and 
> therefore get higher scores than they should, and that we do not use the 
> whole spectrum of relative document length of BM25.
> I think for BM25, discountOverlaps should default to false.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-8000) Document Length Normalization in BM25Similarity correct?

2017-10-20 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212350#comment-16212350
 ] 

Christoph Goller edited comment on LUCENE-8000 at 10/20/17 8:42 AM:


??My point is that defaults are for typical use-cases, and the default of 
discountOverlaps meets that goal. It results in better (measured) performance 
for many tokenfilters that are commonly used such as common-grams, WDF, 
synonyms, etc. I ran these tests before proposing the default, it was not done 
flying blind.??

Understood. *I have not experienced any problems with the current default* and 
I have the option to set discountOverlaps to false. Therefore it's ok for me if 
the ticket gets closed.
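
For reference, a minimal sketch of that option (assuming the Lucene 7.x API, 
where BM25Similarity still exposes setDiscountOverlaps; the same similarity 
would also have to be set on the IndexWriterConfig so the norms are encoded 
consistently):

{code:java}
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;

// Sketch: count stacked tokens (synonyms, WDF splits, extra normal forms)
// in the length norm by switching off discountOverlaps at search time.
public final class OverlapCountingSearch {
  private OverlapCountingSearch() {}

  public static IndexSearcher newSearcher(IndexReader reader) {
    BM25Similarity similarity = new BM25Similarity();
    similarity.setDiscountOverlaps(false);
    IndexSearcher searcher = new IndexSearcher(reader);
    searcher.setSimilarity(similarity);
    return searcher;
  }
}
{code}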

I only think about this out of "scientific" curiosity in the context of  
relevance tuning.

What benchmarks have you used for measuring performance?

Is your opinion based on tests with Lucene Classic Similarity (it also uses 
discountOverlaps = true) or also on tests with BM25?

Do you have any idea / explanation why relevancy is better using discountOverlaps 
= true? My naive guess would be that stopwords or synonyms are either 
used on all documents or on none, and therefore it should not make much 
difference whether we count overlaps or not. Is the explanation that for some 
documents many stopwords / synonyms / WDF splits are used and for others not 
(for the same field)? Another possible explanation would be that some fields 
have synonyms and others do not. That would punish fields with synonyms 
compared to others, since their length is greater (in Classic Similarity with 
discountOverlaps = false), but in BM25 it should not have this effect, since 
BM25 uses relative length for scoring and not absolute length like Classic 
Similarity.

Sorry for bothering you with these questions. It's only my curiosity and maybe 
Jira is not the right place for this.




[jira] [Comment Edited] (LUCENE-8000) Document Length Normalization in BM25Similarity correct?

2017-10-20 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212350#comment-16212350
 ] 

Christoph Goller edited comment on LUCENE-8000 at 10/20/17 8:41 AM:


??My point is that defaults are for typical use-cases, and the default of 
discountOverlaps meets that goal. It results in better (measured) performance 
for many tokenfilters that are commonly used such as common-grams, WDF, 
synonyms, etc. I ran these tests before proposing the default, it was not done 
flying blind.??

Understood. *I have not experienced any problems with the current default* and 
I have the option to set discountOverlaps to false. Therefore it's ok for me if 
the ticket gets closed.

I only think about this out of "scientific" curiosity in the context of  
relevance tuning.

What benchmarks have you used for measuring performance?

Is your opinion based on tests with Lucene Classic Similarity (it also uses 
discountOverlaps = true) or also on tests with BM25?

Do you have any idea / explanation why relevancy is better using discountOverlaps 
= true? My naive guess would be that stopwords or synonyms are either 
used on all documents or on none, and therefore it should not make much 
difference whether we count overlaps or not. Is the explanation that for some 
documents many stopwords / synonyms / WDF splits are used and for others not 
(for the same field)? Another possible explanation would be that some fields 
have synonyms and others do not. That would punish fields with synonyms 
compared to others, since their length is greater (in Classic Similarity with 
discountOverlaps = false), but in BM25 it should not have this effect, since 
BM25 uses relative length for scoring and not absolute length like Classic 
Similarity.

Sorry for bothering you with these questions. It's only my curiosity and maybe 
Jira is not the right place for this.




[jira] [Comment Edited] (LUCENE-8000) Document Length Normalization in BM25Similarity correct?

2017-10-20 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212350#comment-16212350
 ] 

Christoph Goller edited comment on LUCENE-8000 at 10/20/17 8:39 AM:


??My point is that defaults are for typical use-cases, and the default of 
discountOverlaps meets that goal. It results in better (measured) performance 
for many tokenfilters that are commonly used such as common-grams, WDF, 
synonyms, etc. I ran these tests before proposing the default, it was not done 
flying blind.??

Understood. *I have not experienced any problems with the current default* and 
I have the option to set discountOverlaps to false. Therefore it's ok for me if 
the ticket gets closed.

I only think about this out of "scientific" curiosity in the context of  
relevance tuning.

What benchmarks have you used for measuring performance?

Is your opinion based on tests with Lucene Classic Similarity (it also uses 
discountOverlaps = true) or also on tests with BM25?

Do you have any idea / explanation why relevancy is better using discountOverlaps 
= true? My naive guess would be that stopwords or synonyms are either 
used on all documents or on none, and therefore it should not make much 
difference whether we count overlaps or not. Is the explanation that for some 
documents many stopwords / synonyms / WDF splits are used and for others not 
(for the same field)? Another possible explanation would be that some fields 
have synonyms and others do not. That would punish fields with synonyms 
compared to others, since their length is greater (in Classic Similarity with 
discountOverlaps = false), but in BM25 it should not have this effect, since 
BM25 uses relative length for scoring and not absolute length.

Sorry for bothering you with these questions. It's only my curiosity and maybe 
Jira is not the right place for this.



was (Author: gol...@detego-software.de):
??My point is that defaults are for typical use-cases, and the default of 
discountOverlaps meets that goal. It results in better (measured) performance 
for many tokenfilters that are commonly used such as common-grams, WDF, 
synonyms, etc. I ran these tests before proposing the default, it was not done 
flying blind.??

Understood. *I have not experienced any problems with the current default* and 
I have the option to set discountOverlaps to false. Therefore it's ok for me if 
the ticket gets closed.

I only think about this out of "scientific" curiosity in the context of  
relevance tuning.

What benchmarks have you used for measuring performance?

Is your opinion based on tests with Lucene Classic Similarity (it also uses 
discountOverlaps = true) or also on tests with BM25?

Have you any idea / explanation why relevance is better using discountOverlaps 
= true? My naive guess would be that since stopwords or synonyms are either 
used on all documents or on none, it should not make much difference whether 
we count overlaps or not. Is the explanation that for some documents many 
stopwords / synonyms / WDF splits are used and for others not (for the same 
field)?

Sorry for bothering you with these questions. It's only my curiosity and maybe 
Jira is not the right place for this.


> Document Length Normalization in BM25Similarity correct?
> 
>
> Key: LUCENE-8000
> URL: https://issues.apache.org/jira/browse/LUCENE-8000
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Christoph Goller
>Priority: Minor
>
> Length of individual documents only counts the number of positions of a 
> document since discountOverlaps defaults to true.
> {code}
>  @Override
>   public final long computeNorm(FieldInvertState state) {
> final int numTerms = discountOverlaps ? state.getLength() - 
> state.getNumOverlap() : state.getLength();
> int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
> if (indexCreatedVersionMajor >= 7) {
>   return SmallFloat.intToByte4(numTerms);
> } else {
>   return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
> }
>   }}
> {code}
> Measuring document length this way seems perfectly ok for me. What bothers 
> me is that
> average document length is based on sumTotalTermFreq for a field. As far as I 
> understand that sums up totalTermFreqs for all terms of a field, therefore 
> counting positions of terms including those that overlap.
> {code}
>  protected float avgFieldLength(CollectionStatistics collectionStats) {
> final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
> if (sumTotalTermFreq <= 0) {
>   return 1f;   // field does not exist, or stat is unsupported
> } else {
>   final long docCount = collectionStats.docCount() == -1 ? 
> collectionStats.maxDoc() : 

[jira] [Comment Edited] (LUCENE-8000) Document Length Normalization in BM25Similarity correct?

2017-10-20 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212350#comment-16212350
 ] 

Christoph Goller edited comment on LUCENE-8000 at 10/20/17 8:35 AM:


??My point is that defaults are for typical use-cases, and the default of 
discountOverlaps meets that goal. It results in better (measured) performance 
for many tokenfilters that are commonly used such as common-grams, WDF, 
synonyms, etc. I ran these tests before proposing the default, it was not done 
flying blind.??

Understood. *I have not experienced any problems with the current default* and 
I have the option to set discountOverlaps to false. Therefore it's ok for me if 
the ticket gets closed.

I only think about this out of "scientific" curiosity in the context of  
relevance tuning.

What benchmarks have you used for measuring performance?

Is your opinion based on tests with Lucene Classic Similarity (it also uses 
discountOverlaps = true) or also on tests with BM25?

Have you any idea / explanation why relevance is better using discountOverlaps 
= true? My naive guess would be that since stopwords or synonyms are either 
used on all documents or on none, it should not make much difference whether 
we count overlaps or not. Is the explanation that for some documents many 
stopwords / synonyms / WDF splits are used and for others not (for the same 
field)?

Sorry for bothering you with these questions. It's only my curiosity and maybe 
Jira is not the right place for this.



was (Author: gol...@detego-software.de):
??My point is that defaults are for typical use-cases, and the default of 
discountOverlaps meets that goal. It results in better (measured) performance 
for many tokenfilters that are commonly used such as common-grams, WDF, 
synonyms, etc. I ran these tests before proposing the default, it was not done 
flying blind.??

Understood. *I have not experienced any problems with the current default* and 
I have the option to set discountOverlaps to false. Therefore it's ok for me if 
the ticket gets closed.

I only think about this out of "scientific" curiosity in the context of  
relevance tuning.

What benchmarks have you used for measuring performance?

Is your opinion based on tests with Lucene Classic Similarity (it also uses 
discountOverlaps = true) or also on tests with BM25?

Have you any idea / explanation why relevance is better using discountOverlaps 
= true? My naive guess would be that since stopwords or synonyms are either 
used on all documents or on none, it should not make much difference whether 
we count overlaps or not. Is the explanation that for some documents many 
stopwords / synonyms / WDF splits are used and for others not (for the same 
field)?

Sorry for bothering you with these questions. It's only my curiosity and maybe 
Jira is not the right place for this.


> Document Length Normalization in BM25Similarity correct?
> 
>
> Key: LUCENE-8000
> URL: https://issues.apache.org/jira/browse/LUCENE-8000
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Christoph Goller
>Priority: Minor
>
> Length of individual documents only counts the number of positions of a 
> document since discountOverlaps defaults to true.
> {code}
>  @Override
>   public final long computeNorm(FieldInvertState state) {
> final int numTerms = discountOverlaps ? state.getLength() - 
> state.getNumOverlap() : state.getLength();
> int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
> if (indexCreatedVersionMajor >= 7) {
>   return SmallFloat.intToByte4(numTerms);
> } else {
>   return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
> }
>   }}
> {code}
> Measuring document length this way seems perfectly ok for me. What bothers 
> me is that
> average document length is based on sumTotalTermFreq for a field. As far as I 
> understand that sums up totalTermFreqs for all terms of a field, therefore 
> counting positions of terms including those that overlap.
> {code}
>  protected float avgFieldLength(CollectionStatistics collectionStats) {
> final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
> if (sumTotalTermFreq <= 0) {
>   return 1f;   // field does not exist, or stat is unsupported
> } else {
>   final long docCount = collectionStats.docCount() == -1 ? 
> collectionStats.maxDoc() : collectionStats.docCount();
>   return (float) (sumTotalTermFreq / (double) docCount);
> }
>   }
> }
> {code}
> Are we comparing apples and oranges in the final scoring?
> I haven't run any benchmarks and I am not sure whether this has a serious 
> effect. It just means that documents that have synonyms or in my use case 
> different normal forms of 

[jira] [Commented] (LUCENE-8000) Document Length Normalization in BM25Similarity correct?

2017-10-20 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212350#comment-16212350
 ] 

Christoph Goller commented on LUCENE-8000:
--

??My point is that defaults are for typical use-cases, and the default of 
discountOverlaps meets that goal. It results in better (measured) performance 
for many tokenfilters that are commonly used such as common-grams, WDF, 
synonyms, etc. I ran these tests before proposing the default, it was not done 
flying blind.??

Understood. *I have not experienced any problems with the current default* and 
I have the option to set discountOverlaps to false. Therefore it's ok for me if 
the ticket gets closed.

I only think about this out of "scientific" curiosity in the context of  
relevance tuning.

What benchmarks have you used for measuring performance?

Is your opinion based on tests with Lucene Classic Similarity (it also uses 
discountOverlaps = true) or also on tests with BM25?

Have you any idea / explanation why relevance is better using discountOverlaps 
= true? My naive guess would be that since stopwords or synonyms are either 
used on all documents or on none, it should not make much difference whether 
we count overlaps or not. Is the explanation that for some documents many 
stopwords / synonyms / WDF splits are used and for others not (for the same 
field)?

Sorry for bothering you with these questions. It's only my curiosity and maybe 
Jira is not the right place for this.


> Document Length Normalization in BM25Similarity correct?
> 
>
> Key: LUCENE-8000
> URL: https://issues.apache.org/jira/browse/LUCENE-8000
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Christoph Goller
>Priority: Minor
>
> Length of individual documents only counts the number of positions of a 
> document since discountOverlaps defaults to true.
> {code}
>  @Override
>   public final long computeNorm(FieldInvertState state) {
> final int numTerms = discountOverlaps ? state.getLength() - 
> state.getNumOverlap() : state.getLength();
> int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
> if (indexCreatedVersionMajor >= 7) {
>   return SmallFloat.intToByte4(numTerms);
> } else {
>   return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
> }
>   }}
> {code}
> Measuring document length this way seems perfectly ok for me. What bothers 
> me is that
> average document length is based on sumTotalTermFreq for a field. As far as I 
> understand that sums up totalTermFreqs for all terms of a field, therefore 
> counting positions of terms including those that overlap.
> {code}
>  protected float avgFieldLength(CollectionStatistics collectionStats) {
> final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
> if (sumTotalTermFreq <= 0) {
>   return 1f;   // field does not exist, or stat is unsupported
> } else {
>   final long docCount = collectionStats.docCount() == -1 ? 
> collectionStats.maxDoc() : collectionStats.docCount();
>   return (float) (sumTotalTermFreq / (double) docCount);
> }
>   }
> }
> {code}
> Are we comparing apples and oranges in the final scoring?
> I haven't run any benchmarks and I am not sure whether this has a serious 
> effect. It just means that documents that have synonyms or in my use case 
> different normal forms of tokens on the same position are shorter and 
> therefore get higher scores than they should and that we do not use the 
> whole spectrum of relative document length of BM25.
> I think for BM25 discountOverlaps should default to false.






[jira] [Comment Edited] (LUCENE-8000) Document Length Normalization in BM25Similarity correct?

2017-10-20 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212350#comment-16212350
 ] 

Christoph Goller edited comment on LUCENE-8000 at 10/20/17 8:34 AM:


??My point is that defaults are for typical use-cases, and the default of 
discountOverlaps meets that goal. It results in better (measured) performance 
for many tokenfilters that are commonly used such as common-grams, WDF, 
synonyms, etc. I ran these tests before proposing the default, it was not done 
flying blind.??

Understood. *I have not experienced any problems with the current default* and 
I have the option to set discountOverlaps to false. Therefore it's ok for me if 
the ticket gets closed.

I only think about this out of "scientific" curiosity in the context of  
relevance tuning.

What benchmarks have you used for measuring performance?

Is your opinion based on tests with Lucene Classic Similarity (it also uses 
discountOverlaps = true) or also on tests with BM25?

Have you any idea / explanation why relevance is better using discountOverlaps 
= true? My naive guess would be that since stopwords or synonyms are either 
used on all documents or on none, it should not make much difference whether 
we count overlaps or not. Is the explanation that for some documents many 
stopwords / synonyms / WDF splits are used and for others not (for the same 
field)?

Sorry for bothering you with these questions. It's only my curiosity and maybe 
Jira is not the right place for this.



was (Author: gol...@detego-software.de):
??My point is that defaults are for typical use-cases, and the default of 
discountOverlaps meets that goal. It results in better (measured) performance 
for many tokenfilters that are commonly used such as common-grams, WDF, 
synonyms, etc. I ran these tests before proposing the default, it was not done 
flying blind.??

Understood. *I have not experienced any problems with the current default* and 
I have the option to set discountOverlaps to false. Therefore it's ok for me if 
the ticket gets closed.

I only think about this out of "scientific" curiosity in the context of  
relevance tuning.

What benchmarks have you used for measuring performance?

Is your opinion based on tests with Lucene Classic Similarity (it also uses 
discountOverlaps = true) or also on tests with BM25?

Have you any idea / explanation why relevance is better using discountOverlaps 
= true? My naive guess would be that since stopwords or synonyms are either 
used on all documents or on none, it should not make much difference whether 
we count overlaps or not. Is the explanation that for some documents many 
stopwords / synonyms / WDF splits are used and for others not (for the same 
field)?

Sorry for bothering you with these questions. It's only my curiosity and maybe 
Jira is not the right place for this.


> Document Length Normalization in BM25Similarity correct?
> 
>
> Key: LUCENE-8000
> URL: https://issues.apache.org/jira/browse/LUCENE-8000
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Christoph Goller
>Priority: Minor
>
> Length of individual documents only counts the number of positions of a 
> document since discountOverlaps defaults to true.
> {code}
>  @Override
>   public final long computeNorm(FieldInvertState state) {
> final int numTerms = discountOverlaps ? state.getLength() - 
> state.getNumOverlap() : state.getLength();
> int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
> if (indexCreatedVersionMajor >= 7) {
>   return SmallFloat.intToByte4(numTerms);
> } else {
>   return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
> }
>   }}
> {code}
> Measuring document length this way seems perfectly ok for me. What bothers 
> me is that
> average document length is based on sumTotalTermFreq for a field. As far as I 
> understand that sums up totalTermFreqs for all terms of a field, therefore 
> counting positions of terms including those that overlap.
> {code}
>  protected float avgFieldLength(CollectionStatistics collectionStats) {
> final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
> if (sumTotalTermFreq <= 0) {
>   return 1f;   // field does not exist, or stat is unsupported
> } else {
>   final long docCount = collectionStats.docCount() == -1 ? 
> collectionStats.maxDoc() : collectionStats.docCount();
>   return (float) (sumTotalTermFreq / (double) docCount);
> }
>   }
> }
> {code}
> Are we comparing apples and oranges in the final scoring?
> I haven't run any benchmarks and I am not sure whether this has a serious 
> effect. It just means that documents that have synonyms or in my use case 
> different normal forms of 

[jira] [Comment Edited] (LUCENE-8000) Document Length Normalization in BM25Similarity correct?

2017-10-20 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16212350#comment-16212350
 ] 

Christoph Goller edited comment on LUCENE-8000 at 10/20/17 8:34 AM:


??My point is that defaults are for typical use-cases, and the default of 
discountOverlaps meets that goal. It results in better (measured) performance 
for many tokenfilters that are commonly used such as common-grams, WDF, 
synonyms, etc. I ran these tests before proposing the default, it was not done 
flying blind.??

Understood. *I have not experienced any problems with the current default* and 
I have the option to set discountOverlaps to false. Therefore it's ok for me if 
the ticket gets closed.

I only think about this out of "scientific" curiosity in the context of  
relevance tuning.

What benchmarks have you used for measuring performance?

Is your opinion based on tests with Lucene Classic Similarity (it also uses 
discountOverlaps = true) or also on tests with BM25?

Have you any idea / explanation why relevance is better using discountOverlaps 
= true? My naive guess would be that since stopwords or synonyms are either 
used on all documents or on none, it should not make much difference whether 
we count overlaps or not. Is the explanation that for some documents many 
stopwords / synonyms / WDF splits are used and for others not (for the same 
field)?

Sorry for bothering you with these questions. It's only my curiosity and maybe 
Jira is not the right place for this.



was (Author: gol...@detego-software.de):
??My point is that defaults are for typical use-cases, and the default of 
discountOverlaps meets that goal. It results in better (measured) performance 
for many tokenfilters that are commonly used such as common-grams, WDF, 
synonyms, etc. I ran these tests before proposing the default, it was not done 
flying blind.??

Understood. *I have not experienced any problems with the current default* and 
I have the option to set discountOverlaps to false. Therefore it's ok for me if 
the ticket gets closed.

I only think about this out of "scientific" curiosity in the context of  
relevance tuning.

What benchmarks have you used for measuring performance?

Is your opinion based on tests with Lucene Classic Similarity (it also uses 
discountOverlaps = true) or also on tests with BM25?

Have you any idea / explanation why relevance is better using discountOverlaps 
= true? My naive guess would be that since stopwords or synonyms are either 
used on all documents or on none, it should not make much difference whether 
we count overlaps or not. Is the explanation that for some documents many 
stopwords / synonyms / WDF splits are used and for others not (for the same 
field)?

Sorry for bothering you with these questions. It's only my curiosity and maybe 
Jira is not the right place for this.


> Document Length Normalization in BM25Similarity correct?
> 
>
> Key: LUCENE-8000
> URL: https://issues.apache.org/jira/browse/LUCENE-8000
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Christoph Goller
>Priority: Minor
>
> Length of individual documents only counts the number of positions of a 
> document since discountOverlaps defaults to true.
> {code}
>  @Override
>   public final long computeNorm(FieldInvertState state) {
> final int numTerms = discountOverlaps ? state.getLength() - 
> state.getNumOverlap() : state.getLength();
> int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
> if (indexCreatedVersionMajor >= 7) {
>   return SmallFloat.intToByte4(numTerms);
> } else {
>   return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
> }
>   }}
> {code}
> Measuring document length this way seems perfectly ok for me. What bothers 
> me is that
> average document length is based on sumTotalTermFreq for a field. As far as I 
> understand that sums up totalTermFreqs for all terms of a field, therefore 
> counting positions of terms including those that overlap.
> {code}
>  protected float avgFieldLength(CollectionStatistics collectionStats) {
> final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
> if (sumTotalTermFreq <= 0) {
>   return 1f;   // field does not exist, or stat is unsupported
> } else {
>   final long docCount = collectionStats.docCount() == -1 ? 
> collectionStats.maxDoc() : collectionStats.docCount();
>   return (float) (sumTotalTermFreq / (double) docCount);
> }
>   }
> }
> {code}
> Are we comparing apples and oranges in the final scoring?
> I haven't run any benchmarks and I am not sure whether this has a serious 
> effect. It just means that documents that have synonyms or in my use case 
> different normal forms of tokens 

[jira] [Updated] (LUCENE-8000) Document Length Normalization in BM25Similarity correct?

2017-10-20 Thread Christoph Goller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christoph Goller updated LUCENE-8000:
-
Description: 
Length of individual documents only counts the number of positions of a 
document since discountOverlaps defaults to true.

{code}
 @Override
  public final long computeNorm(FieldInvertState state) {
final int numTerms = discountOverlaps ? state.getLength() - 
state.getNumOverlap() : state.getLength();
int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
if (indexCreatedVersionMajor >= 7) {
  return SmallFloat.intToByte4(numTerms);
} else {
  return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
}
  }}
{code}

Measuring document length this way seems perfectly ok for me. What bothers me 
is that
average document length is based on sumTotalTermFreq for a field. As far as I 
understand that sums up totalTermFreqs for all terms of a field, therefore 
counting positions of terms including those that overlap.

{code}
 protected float avgFieldLength(CollectionStatistics collectionStats) {
final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
if (sumTotalTermFreq <= 0) {
  return 1f;   // field does not exist, or stat is unsupported
} else {
  final long docCount = collectionStats.docCount() == -1 ? 
collectionStats.maxDoc() : collectionStats.docCount();
  return (float) (sumTotalTermFreq / (double) docCount);
}
  }
}
{code}

Are we comparing apples and oranges in the final scoring?

I haven't run any benchmarks and I am not sure whether this has a serious 
effect. It just means that documents that have synonyms or in my use case 
different normal forms of tokens on the same position are shorter and therefore 
get higher scores than they should and that we do not use the whole spectrum 
of relative document length of BM25.

I think for BM25 discountOverlaps should default to false.
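A hypothetical worked example (made-up numbers, not from a real index) of the mismatch 
described above, for a field in which every position also carries one overlapping 
synonym token:

{code}
// Hypothetical numbers only, to illustrate the norm vs. avgFieldLength mismatch.
public class OverlapMismatchSketch {
  public static void main(String[] args) {
    int positionsPerDoc = 100;            // what computeNorm stores with discountOverlaps = true
    int indexedTokensPerDoc = 200;        // what sumTotalTermFreq counts (overlaps included)
    double avgFieldLength = indexedTokensPerDoc;  // every doc is identical in this toy index
    double k1 = 1.2, b = 0.75;
    // BM25 length term with the mismatched statistics:
    double lengthTerm = k1 * ((1 - b) + b * positionsPerDoc / avgFieldLength);
    System.out.println(lengthTerm);       // 0.75: every doc looks half as long as "average"
    // With consistent statistics (both discounting overlaps) it would be:
    System.out.println(k1 * ((1 - b) + b * 100.0 / 100.0));  // 1.2
  }
}
{code}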



  was:
Length of individual documents only counts the number of positions of a 
document since discountOverlaps defaults to true.

{code}
 @Override
  public final long computeNorm(FieldInvertState state) {
final int numTerms = discountOverlaps ? state.getLength() - 
state.getNumOverlap() : state.getLength();
int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
if (indexCreatedVersionMajor >= 7) {
  return SmallFloat.intToByte4(numTerms);
} else {
  return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
}
  }}
{code}

Measuring document length this way seems perfectly ok for me. What bothers me 
is that
average document length is based on sumTotalTermFreq for a field. As far as I 
understand that sums up totalTermFreqs for all terms of a field, therefore 
counting positions of terms including those that overlap.

{code}
 protected float avgFieldLength(CollectionStatistics collectionStats) {
final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
if (sumTotalTermFreq <= 0) {
  return 1f;   // field does not exist, or stat is unsupported
} else {
  final long docCount = collectionStats.docCount() == -1 ? 
collectionStats.maxDoc() : collectionStats.docCount();
  return (float) (sumTotalTermFreq / (double) docCount);
}
  }
}
{code}

Are we comparing apples and oranges in the final scoring?

I haven't run any benchmarks and I am not sure whether this has a serious 
effect. It just means that documents that have synonyms or in our case 
different normal forms of tokens on the same position are shorter and therefore 
get higher scores than they should and that we do not use the whole spectrum 
of relative document length of BM25.

I think for BM25 discountOverlaps should default to false.




> Document Length Normalization in BM25Similarity correct?
> 
>
> Key: LUCENE-8000
> URL: https://issues.apache.org/jira/browse/LUCENE-8000
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Christoph Goller
>Priority: Minor
>
> Length of individual documents only counts the number of positions of a 
> document since discountOverlaps defaults to true.
> {code}
>  @Override
>   public final long computeNorm(FieldInvertState state) {
> final int numTerms = discountOverlaps ? state.getLength() - 
> state.getNumOverlap() : state.getLength();
> int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
> if (indexCreatedVersionMajor >= 7) {
>   return SmallFloat.intToByte4(numTerms);
> } else {
>   return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
> }
>   }}
> {code}
> Measuring document length this way seems perfectly ok for me. What bothers 
> me is that
> average document length is based on sumTotalTermFreq for a field. As far as I 
> 

[jira] [Updated] (LUCENE-8000) Document Length Normalization in BM25Similarity correct?

2017-10-20 Thread Christoph Goller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christoph Goller updated LUCENE-8000:
-
Description: 
Length of individual documents only counts the number of positions of a 
document since discountOverlaps defaults to true.

{code}
 @Override
  public final long computeNorm(FieldInvertState state) {
final int numTerms = discountOverlaps ? state.getLength() - 
state.getNumOverlap() : state.getLength();
int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
if (indexCreatedVersionMajor >= 7) {
  return SmallFloat.intToByte4(numTerms);
} else {
  return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
}
  }}
{code}

Measuring document length this way seems perfectly ok for me. What bothers me 
is that
average document length is based on sumTotalTermFreq for a field. As far as I 
understand that sums up totalTermFreqs for all terms of a field, therefore 
counting positions of terms including those that overlap.

{code}
 protected float avgFieldLength(CollectionStatistics collectionStats) {
final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
if (sumTotalTermFreq <= 0) {
  return 1f;   // field does not exist, or stat is unsupported
} else {
  final long docCount = collectionStats.docCount() == -1 ? 
collectionStats.maxDoc() : collectionStats.docCount();
  return (float) (sumTotalTermFreq / (double) docCount);
}
  }
}
{code}

Are we comparing apples and oranges in the final scoring?

I haven't run any benchmarks and I am not sure whether this has a serious 
effect. It just means that documents that have synonyms or in our case 
different normal forms of tokens on the same position are shorter and therefore 
get higher scores than they should and that we do not use the whole spectrum 
of relative document length of BM25.

I think for BM25 discountOverlaps should default to false.



  was:
Length of individual documents only counts the number of positions of a 
document since discountOverlaps defaults to true.

 { @Override
  public final long computeNorm(FieldInvertState state) {
final int numTerms = discountOverlaps ? state.getLength() - 
state.getNumOverlap() : state.getLength();
int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
if (indexCreatedVersionMajor >= 7) {
  return SmallFloat.intToByte4(numTerms);
} else {
  return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
}
  }}}

Measuring document length this way seems perfectly ok for me. What bothers me 
is that
average document length is based on sumTotalTermFreq for a field. As far as I 
understand that sums up totalTermFreqs for all terms of a field, therefore 
counting positions of terms including those that overlap.

{{  protected float avgFieldLength(CollectionStatistics collectionStats) {
final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
if (sumTotalTermFreq <= 0) {
  return 1f;   // field does not exist, or stat is unsupported
} else {
  final long docCount = collectionStats.docCount() == -1 ? 
collectionStats.maxDoc() : collectionStats.docCount();
  return (float) (sumTotalTermFreq / (double) docCount);
}
  }
}}
Are we comparing apples and oranges in the final scoring?

I haven't run any benchmarks and I am not sure whether this has a serious 
effect. It just means that documents that have synonyms or in our case 
different normal forms of tokens on the same position are shorter and therefore 
get higher scores than they should and that we do not use the whole spectrum 
of relative document length of BM25.

I think for BM25 discountOverlaps should default to false.




> Document Length Normalization in BM25Similarity correct?
> 
>
> Key: LUCENE-8000
> URL: https://issues.apache.org/jira/browse/LUCENE-8000
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Christoph Goller
>Priority: Minor
>
> Length of individual documents only counts the number of positions of a 
> document since discountOverlaps defaults to true.
> {code}
>  @Override
>   public final long computeNorm(FieldInvertState state) {
> final int numTerms = discountOverlaps ? state.getLength() - 
> state.getNumOverlap() : state.getLength();
> int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
> if (indexCreatedVersionMajor >= 7) {
>   return SmallFloat.intToByte4(numTerms);
> } else {
>   return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
> }
>   }}
> {code}
> Measuring document length this way seems perfectly ok for me. What bothers 
> me is that
> average document length is based on sumTotalTermFreq for a field. As far as I 
> understand that sums up 

[jira] [Updated] (LUCENE-8000) Document Length Normalization in BM25Similarity correct?

2017-10-20 Thread Christoph Goller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-8000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christoph Goller updated LUCENE-8000:
-
Description: 
Length of individual documents only counts the number of positions of a 
document since discountOverlaps defaults to true.

 { @Override
  public final long computeNorm(FieldInvertState state) {
final int numTerms = discountOverlaps ? state.getLength() - 
state.getNumOverlap() : state.getLength();
int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
if (indexCreatedVersionMajor >= 7) {
  return SmallFloat.intToByte4(numTerms);
} else {
  return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
}
  }}}

Measuring document length this way seems perfectly ok for me. What bothers me 
is that
average document length is based on sumTotalTermFreq for a field. As far as I 
understand that sums up totalTermFreqs for all terms of a field, therefore 
counting positions of terms including those that overlap.

{{  protected float avgFieldLength(CollectionStatistics collectionStats) {
final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
if (sumTotalTermFreq <= 0) {
  return 1f;   // field does not exist, or stat is unsupported
} else {
  final long docCount = collectionStats.docCount() == -1 ? 
collectionStats.maxDoc() : collectionStats.docCount();
  return (float) (sumTotalTermFreq / (double) docCount);
}
  }
}}
Are we comparing apples and oranges in the final scoring?

I haven't run any benchmarks and I am not sure whether this has a serious 
effect. It just means that documents that have synonyms or in our case 
different normal forms of tokens on the same position are shorter and therefore 
get higher scores than they should and that we do not use the whole spectrum 
of relative document length of BM25.

I think for BM25 discountOverlaps should default to false.



  was:
Length of individual documents only counts the number of positions of a 
document since discountOverlaps defaults to true.

 {quote} @Override
  public final long computeNorm(FieldInvertState state) {
final int numTerms = discountOverlaps ? state.getLength() - 
state.getNumOverlap() : state.getLength();
int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
if (indexCreatedVersionMajor >= 7) {
  return SmallFloat.intToByte4(numTerms);
} else {
  return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
}
  }{quote}

Measuring document length this way seems perfectly ok for me. What bothers me 
is that
average document length is based on sumTotalTermFreq for a field. As far as I 
understand that sums up totalTermFreqs for all terms of a field, therefore 
counting positions of terms including those that overlap.

{quote}  protected float avgFieldLength(CollectionStatistics collectionStats) {
final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
if (sumTotalTermFreq <= 0) {
  return 1f;   // field does not exist, or stat is unsupported
} else {
  final long docCount = collectionStats.docCount() == -1 ? 
collectionStats.maxDoc() : collectionStats.docCount();
  return (float) (sumTotalTermFreq / (double) docCount);
}
  }{quote}

Are we comparing apples and oranges in the final scoring?

I haven't run any benchmarks and I am not sure whether this has a serious 
effect. It just means that documents that have synonyms or in our case 
different normal forms of tokens on the same position are shorter and therefore 
get higher scores than they should and that we do not use the whole spectrum 
of relative document length of BM25.

I think for BM25 discountOverlaps should default to false.




> Document Length Normalization in BM25Similarity correct?
> 
>
> Key: LUCENE-8000
> URL: https://issues.apache.org/jira/browse/LUCENE-8000
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Christoph Goller
>Priority: Minor
>
> Length of individual documents only counts the number of positions of a 
> document since discountOverlaps defaults to true.
>  { @Override
>   public final long computeNorm(FieldInvertState state) {
> final int numTerms = discountOverlaps ? state.getLength() - 
> state.getNumOverlap() : state.getLength();
> int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
> if (indexCreatedVersionMajor >= 7) {
>   return SmallFloat.intToByte4(numTerms);
> } else {
>   return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
> }
>   }}}
> Measuring document length this way seems perfectly ok for me. What bothers 
> me is that
> average document length is based on sumTotalTermFreq for a field. As far as I 
> understand that sums up totalTermFreqs for 

[jira] [Created] (LUCENE-8000) Document Length Normalization in BM25Similarity correct?

2017-10-19 Thread Christoph Goller (JIRA)
Christoph Goller created LUCENE-8000:


 Summary: Document Length Normalization in BM25Similarity correct?
 Key: LUCENE-8000
 URL: https://issues.apache.org/jira/browse/LUCENE-8000
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Christoph Goller
Priority: Minor


Length of individual documents only counts the number of positions of a 
document since discountOverlaps defaults to true.

 {quote} @Override
  public final long computeNorm(FieldInvertState state) {
final int numTerms = discountOverlaps ? state.getLength() - 
state.getNumOverlap() : state.getLength();
int indexCreatedVersionMajor = state.getIndexCreatedVersionMajor();
if (indexCreatedVersionMajor >= 7) {
  return SmallFloat.intToByte4(numTerms);
} else {
  return SmallFloat.floatToByte315((float) (1 / Math.sqrt(numTerms)));
}
  }{quote}

Measuring document length this way seems perfectly ok for me. What bothers me 
is that
average document length is based on sumTotalTermFreq for a field. As far as I 
understand that sums up totalTermFreqs for all terms of a field, therefore 
counting positions of terms including those that overlap.

{quote}  protected float avgFieldLength(CollectionStatistics collectionStats) {
final long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
if (sumTotalTermFreq <= 0) {
  return 1f;   // field does not exist, or stat is unsupported
} else {
  final long docCount = collectionStats.docCount() == -1 ? 
collectionStats.maxDoc() : collectionStats.docCount();
  return (float) (sumTotalTermFreq / (double) docCount);
}
  }{quote}

Are we comparing apples and oranges in the final scoring?

I haven't run any benchmarks and I am not sure whether this has a serious 
effect. It just means that documents that have synonyms or in our case 
different normal forms of tokens on the same position are shorter and therefore 
get higher scores than they should and that we do not use the whole spectrum 
of relative document length of BM25.

I think for BM25 discountOverlaps should default to false.








[jira] [Commented] (LUCENE-7398) Nested Span Queries are buggy

2016-09-07 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15470053#comment-15470053
 ] 

Christoph Goller commented on LUCENE-7398:
--

I just found that the LUCENE-2878 work/branch may contain some interesting 
ideas about scoring and proximity search / Span*Queries.

> Nested Span Queries are buggy
> -
>
> Key: LUCENE-7398
> URL: https://issues.apache.org/jira/browse/LUCENE-7398
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 5.5, 6.x
>Reporter: Christoph Goller
>Assignee: Alan Woodward
>Priority: Critical
> Attachments: LUCENE-7398-20160814.patch, LUCENE-7398.patch, 
> LUCENE-7398.patch, TestSpanCollection.java
>
>
> Example for a nested SpanQuery that is not working:
> Document: Human Genome Organization , HUGO , is trying to coordinate gene 
> mapping research worldwide.
> Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping], 
> 0, true), body:gene]), body:research], 0, true)
> The query should match "coordinate gene mapping research" as well as 
> "coordinate gene research". It does not match  "coordinate gene mapping 
> research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It 
> probably stopped working with the changes on SpanQueries in 5.3. I will 
> attach a unit test that shows the problem.
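For readers following along, a sketch of how the quoted example query could be built 
with the SpanQuery API (assuming lucene-core on the classpath; the field name "body", 
slop 0 and in-order flag follow the example above). The inner SpanOrQuery is what gives 
the middle clause its variable match length:

{code}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.*;

public class NestedSpanExample {
  public static SpanQuery build() {
    SpanQuery geneMapping = new SpanNearQuery.Builder("body", true)
        .addClause(new SpanTermQuery(new Term("body", "gene")))
        .addClause(new SpanTermQuery(new Term("body", "mapping")))
        .setSlop(0)
        .build();
    // matches either "gene mapping" (2 positions) or just "gene" (1 position)
    SpanQuery geneOrGeneMapping = new SpanOrQuery(
        geneMapping,
        new SpanTermQuery(new Term("body", "gene")));
    return new SpanNearQuery.Builder("body", true)
        .addClause(new SpanTermQuery(new Term("body", "coordinate")))
        .addClause(geneOrGeneMapping)
        .addClause(new SpanTermQuery(new Term("body", "research")))
        .setSlop(0)
        .build();
  }
}
{code}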






[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy

2016-09-06 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15466888#comment-15466888
 ] 

Christoph Goller edited comment on LUCENE-7398 at 9/6/16 3:37 PM:
--

After thoroughly reviewing the current implementations of SpanNearQuery, 
PhraseQuery and MultiPhraseQuery I see some problems and inconsistencies. I 
volunteer to fix at least some of these problems, but first I would like to 
have a consensus about the desired behavior of SpanQuery. This ticket may not 
be the right place for such a discussion, so please point me to a better place 
if there is one. 

1) Missing Matches caused by lazy iteration:

I think lazy iteration is not a new thing in Lucene SpanNearQuery. As far as I 
know there never was an implementation that compared all possible combinations 
of subspan matches for SpanNearQuery in Lucene. So SpanNearQuery always missed 
some matches.

*) This ticket demonstrates missing matches for ordered SpanQuery. Documents 
that should match don't match. This is caused by subspans of SpanNearQuery 
having a variable match length. For these cases the lazy iteration 
implementation which tries to optimize the number of comparisons of subspan 
matches is not sufficient.

*) Tim tried these examples with unordered SpanQuery and got the same behavior. 
I think this is caused by a similar kind of lazy iteration in the unordered 
case.

*) In the unordered case lazy iteration also causes problems if the subspans do 
not have variable-length matches. This is demonstrated in LUCENE-5331 and 
LUCENE-2861. Tim, thanks for pointing to these tickets. In these examples all 
clauses of the SpanNearQuery were SpanTermQueries, but some occurred more than 
once. For PhraseQuery and MultiPhraseQuery and their implementation in 
SloppyPhraseScorer this seems to be a known problem that has been solved by a 
special complex treatment of repetitions that I currently don't understand in 
detail.

My current opinion: We should give up lazy iteration for the unordered and the 
ordered case to solve these problems. I think it can be done and the 
performance penalty should not be too big. We already iterate over all 
positions of all subspans. So we already have done the expensive operation of 
reading them. Should some more comparisons of int-values (positions) really 
matter so much? At least for the ordered case I am optimistic that I could 
implement it efficiently.

2) Inconsistent Scoring of SpanNearQuery

*) Lazy iteration means that some "redundant" matches in a document are skipped 
in order to have a faster matching algorithm. I am not sure how redundant was 
defined exactly for the idea of lazy iteration. It referred to matches with the 
same start position somehow. As long as different matches for the first clause 
are concerned, they are found, but not all matches for intermediate 
subclauses are considered. Skipping matches however reduces the frequency that is 
computed and consequently the score. See the Javadoc of phraseFreq() in 
SloppyPhraseScorer, which mentions the same phenomenon. This is quite important 
for my use case of SpanQueries. I have different versions/variants of the same 
term on the same position, e.g. one with case-normalization and one without and 
I want a higher score if the user-query matches for more than one variant, and 
I use this approach for clauses of SpanNearQuery.

*) In NearSpansOrdered the method width() (it is used to compute the sloppy 
frequency in SpanScorer) returns the number of gaps between the matches. If you 
have a perfect match it returns 0 (no sloppiness). In NearSpansUnordered it 
returns the length of the match, not the number of gaps. See atMatch() for the 
difference. The reason is probably that (maxEndPositionCell.endPosition() - 
minPositionCell().startPosition() - totalSpanLength) might even become negative 
if matches overlap. I would prefer something like Math.max(0, 
(maxEndPositionCell.endPosition() - minPositionCell().startPosition() - 
totalSpanLength)), as sketched below.

*) SpanOrQuery and SpanNearQuery completely ignore the scores of their 
subclauses (subweights are always generated as non-scoring). A SpanOrQuery 
should give a score similar to a BooleanQuery, shouldn't it? As long as we have 
this behavior, SpanBoostQuery does not make any sense, does it? So in my 
opinion the existence of SpanBoostQuery shows that others also had the idea 
that a nested SpanQuery should somehow use the scores of their clauses for the 
computation of their own score.
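The clamped width mentioned two points above could look like the following (a sketch 
only, with made-up helper parameters mirroring the expression quoted there, not the 
actual NearSpansUnordered code):

{code}
// Illustrative only: clamp the unordered width at zero so overlapping sub-matches
// cannot produce a negative "gap" value (mirrors the Math.max(0, ...) suggestion above).
public class WidthClampSketch {
  static int clampedWidth(int maxEndPosition, int minStartPosition, int totalSpanLength) {
    return Math.max(0, maxEndPosition - minStartPosition - totalSpanLength);
  }
  public static void main(String[] args) {
    System.out.println(clampedWidth(5, 2, 2));  // 1 gap between the sub-matches
    System.out.println(clampedWidth(4, 2, 3));  // overlapping sub-matches -> 0 instead of -1
  }
}
{code}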



was (Author: gol...@detego-software.de):
After thoroughly reviewing the current implementations of SpanNearQuery, 
PhraseQuery and MultiPhraseQuery I see some problems and inconsistencies. I 
volunteer to fix at least some of these problems, but first I would like to 
have a consensus about the desired bahavior of SpanQuery. This ticket may not 
be the right place for such a discussion, so 

[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy

2016-09-06 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15466888#comment-15466888
 ] 

Christoph Goller edited comment on LUCENE-7398 at 9/6/16 9:36 AM:
--

After thoroughly reviewing the current implementations of SpanNearQuery, 
PhraseQuery and MultiPhraseQuery I see some problems and inconsistencies. I 
volunteer to fix at least some of these problems, but first I would like to 
have a consensus about the desired behavior of SpanQuery. This ticket may not 
be the right place for such a discussion, so please point me to a better place 
if there is one. 

1) Missing Matches caused by lazy iteration:

I think lazy iteration is not a new thing in Lucene SpanNearQuery. As far as I 
know there never was an implementation that compared all possible combinations 
of subspan matches for SpanNearQuery in Lucene. So SpanNearQuery always missed 
some matches.

*) This ticket demonstrates missing matches for ordered SpanQuery. Documents 
that should match are not matching. They are caused by subspans of 
SpanNearQuery having a variable match length. For these cases the lazy 
iteration implementation which tries to optimize the number of comparisons of 
subspan matches is not sufficient.

*) Tim tried these examples with unordered SpanQuery and got the same behavior. 
I think this is caused by a similar kind of lazy iteration in the unordered 
case.

*) In the unordered case lazy iteration also causes problems if the subspans do 
not have variable-length matches. This is demonstrated in LUCENE-5331 and 
LUCENE-2861. Tim, thanks for pointing to these tickets. In these examples all 
clauses of the SpanNearQuery were SpanTermQueries, but some occurred more than 
once. For PhraseQuery and MultiPhraseQuery and their implementation in 
SloppyPhraseScorer this seems to be a known problem that has been solved by a 
special complex treatment of repetitions that I currently don't understand in 
detail.

My current opinion: We should give up lazy iteration for the unordered and the 
ordered case to solve these problems. I think it can be done and the 
performance penalty should not be too big. We already iterate over all 
positions of all subspans. So we already have done the expensive operation of 
reading them. Should some more comparisons of int-values (positions) really 
matter so much? At least for the ordered case I am optimistic that I could 
implement it efficiently.

2) Inconsistent Scoring of SpanNearQuery

*) Lazy iteration means that some "redundant" matches in a document are skipped 
in order to have a faster matching algorithm. I am not sure how redundant was 
defined exactly for the idea of lazy iteration. It referred to matches with the 
same start position somehow. As long as different matches for the first clause 
are concerned, they are found, but not all matches for intermediate 
subclauses are considered. Skipping matches however reduces the frequency that is 
computed and consequently the score. See the Javadoc of phraseFreq() in 
SloppyPhraseScorer, which mentions the same phenomenon. This is quite important 
for my use case of SpanQueries. I have different versions/variants of the same 
term on the same position, e.g. one with case-normalization and one without and 
I want a higher score if the user-query matches for more than one variant, and 
I use this approach for clauses of SpanNearQuery.

*) In NearSpansOrdered the method width() (it is used to compute sloppy 
frequency in SpanScore) returns the number of gaps between the matches. If you 
have a perfect match it returns 0 (no sloppiness). In NearSpansUnordered it 
returns the length of the match, not the number of gaps. See atMatch() for the 
difference. The reason is probably that (maxEndPositionCell.endPosition() - 
minPositionCell().startPosition() - totalSpanLength) might even become negative 
if matches overlap. I would prefer something like Math.max(0, 
(maxEndPositionCell.endPosition() - minPositionCell().startPosition() - 
totalSpanLength))

*) SpanOrQuery and SpanNearQuery completely ignore the scores of their 
subclauses  (subweights are always generated as non-scoring). A SpanOrQuery 
should give a Score similar to a BooleanQuery, shouldn't it? As long as we have 
this behavior, SpanBoostQuery does not make any sense, does it? So in my 
opinion the existence of SpanBoostQuery shows that others also had the idea 
that a nested SpanQuery should somehow use the scores of their clauses for the 
computation of their own score.



was (Author: gol...@detego-software.de):
After thoroughly reviewing the current implementations of SpanNearQuery, 
PhraseQuery and MultiPhraseQuery I see some problems and inconsistencies. I 
volunteer to fix at least some of these problems, but first I would like to 
have a consensus about the desired bahavior of SpanQuery. This ticket may not 
be the right place for such a discussion, so 

[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy

2016-09-06 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15466888#comment-15466888
 ] 

Christoph Goller edited comment on LUCENE-7398 at 9/6/16 9:14 AM:
--

After thoroughly reviewing the current implementations of SpanNearQuery, 
PhraseQuery and MultiPhraseQuery I see some problems and inconsistencies. I 
volunteer to fix at least some of these problems, but first I would like to 
have a consensus about the desired behavior of SpanQuery. This ticket may not 
be the right place for such a discussion, so please point me to a better place 
if there is one. 

1) Missing Matches caused by lazy iteration:

I think lazy iteration is not a new thing in Lucene SpanNearQuery. As far as I 
know there never was an implementation that compared all possible combinations 
of subspan matches for SpanNearQuery in Lucene. So SpanNearQuery always missed 
some matches.

*) This ticket demonstrates missing matches for ordered SpanQuery. They are 
caused by subspans of SpanNearQuery having a variable match length. For these 
cases the lazy iteration implementation which tries to optimize the number of 
comparisons of subspan matches is not sufficient.

*) Tim tried these examples with unordered SpanQuery and got the same behavior. 
I think this is caused by a similar kind of lazy iteration in the unordered 
case.

*) In the unordered case lazy iteration also causes problems if the subspans do 
not have variable-length matches. This is demonstrated in LUCENE-5331 and 
LUCENE-2861. Tim, thanks for pointing to these tickets. In these examples all 
clauses of the SpanNearQuery were SpanTermQueries, but some occurred more than 
once. For PhraseQuery and MultiPhraseQuery and their implementation in 
SloppyPhraseScorer this seems to be a known problem that has been solved by a 
special complex treatment of repetitions that I currently don't understand in 
detail.

My current opinion: We should give up lazy iteration for the unordered and the 
ordered case to solve these problems. I think it can be done and the 
performance penalty should not be too big. We already iterate over all 
positions of all subspans. So we already have done the expensive operation of 
reading them. Should some more comparisons of int-values (positions) really 
matter so much? At least for the ordered case I am optimistic that I could 
implement it efficiently.

2) Inconsistent Scoring of SpanNearQuery

*) Lazy iteration means that some "redundant" matches in a document are skipped 
in order to have a faster matching algorithm. I am not sure how redundant was 
defined exactly for the idea of lazy iteration. It referred to matches with the 
same start position somehow. As long as different matches for the first clause 
are concerned, they are found, but not all matches for intermediate 
subclauses are considered. Skipping matches however reduces the frequency that is 
computed and consequently the score. See the Javadoc of phraseFreq() in 
SloppyPhraseScorer, which mentions the same phenomenon. This is quite important 
for my use case of SpanQueries. I have different versions/variants of the same 
term on the same position, e.g. one with case-normalization and one without and 
I want a higher score if the user-query matches for more than one variant, and 
I use this approach for clauses of SpanNearQuery.

*) In NearSpansOrdered the method width() (it is used to compute sloppy 
frequency in SpanScore) returns the number of gaps between the matches. If you 
have a perfect match it returns 0 (no sloppiness). In NearSpansUnordered it 
returns the length of the match, not the number of gaps. See atMatch() for the 
difference. The reason is probably that (maxEndPositionCell.endPosition() - 
minPositionCell().startPosition() - totalSpanLength) might even become negative 
if matches overlap. I would prefer something like Math.max(0, 
(maxEndPositionCell.endPosition() - minPositionCell().startPosition() - 
totalSpanLength))

*) SpanOrQuery and SpanNearQuery completely ignore the scores of their 
subclauses  (subweights are always generated as non-scoring). A SpanOrQuery 
should give a Score similar to a BooleanQuery, shouldn't it? As long as we have 
this behavior, SpanBoostQuery does not make any sense, does it? So in my 
opinion the existence of SpanBoostQuery shows that others also had the idea 
that a nested SpanQuery should somehow use the scores of their clauses for the 
computation of their own score.



was (Author: gol...@detego-software.de):
After thoroughly reviewing the current implementations of SpanNearQuery, 
PhraseQuery and MultiPhraseQuery I see some problems and inconsistencies. I 
volunteer to fix at least some of these problems, but first I would like to 
have a consensus about the desired bahavior of SpanQuery. This ticket may not 
be the right place for such a discussion, so please point me to a better place 
if there is 

[jira] [Commented] (LUCENE-7398) Nested Span Queries are buggy

2016-09-06 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15466888#comment-15466888
 ] 

Christoph Goller commented on LUCENE-7398:
--

After thoroughly reviewing the current implementations of SpanNearQuery, 
PhraseQuery and MultiPhraseQuery I see some problems and inconsistencies. I 
volunteer to fix at least some of these problems, but first I would like to 
have a consensus about the desired behavior of SpanQuery. This ticket may not 
be the right place for such a discussion, so please point me to a better place 
if there is one. 

1) Missing Matches caused by lazy iteration:

I think lazy iteration is not a new thing in Lucene SpanNearQuery. As far as I 
know there never was an implementation that compared all possible combinations 
of subspan matches for SpanNearQuery in Lucene. So SpanNearQuery always missed 
some matches.

*) This ticket demonstrates missing matches for ordered SpanQuery. They are 
caused by subspans of SpanNearQuery having a variable match length. For these 
cases the lazy iteration implementation which tries to optimize the number of 
comparisons of subspan matches is not sufficient.

*) Tim tried these examples with unordered SpanQuery and got the same 
behavior. I think this is caused by a similar kind of lazy iteration in the 
unordered case.

*) In the unordered case lazy iteration also causes problems if the subspans do 
not have variable-length matches. This is demonstrated in LUCENE-5331 and 
LUCENE-2861. Tim, thanks for pointing to these tickets. In these examples all 
clauses of the SpanNearQuery were SpanTermQueries, but some occurred more than 
once. For PhraseQuery and MultiPhraseQuery and their implementation in 
SloppyPhraseScorer this seems to be a known problem that has been solved by a 
special, complex treatment of repetitions that I currently don't understand in 
detail.

My current opinion: We should give up lazy iteration for the unordered and the 
ordered case to solve these problems. I think it can be done and the 
performance penalty should not be too big. We already iterate over all 
positions of all subspans, so we have already done the expensive operation of 
reading them. Should some more comparisons of int values (positions) really 
matter so much? At least for the ordered case I am optimistic that I could 
implement it efficiently.
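
To make the cost argument more concrete, here is a toy sketch (plain Java, not 
Lucene code, all names invented) of an exhaustive comparison over cached 
positions for the ordered case; once the positions of all subspans have been 
read, the remaining work is only int comparisons:

{code}
// starts[i][k] / ends[i][k] hold the k-th match of clause i within one document.
static boolean matchesOrdered(int[][] starts, int[][] ends, int allowedSlop) {
  return search(starts, ends, 0, -1, 0, allowedSlop);
}

static boolean search(int[][] starts, int[][] ends,
                      int clause, int prevEnd, int slop, int allowedSlop) {
  if (slop > allowedSlop) return false;
  if (clause == starts.length) return true;   // all clauses placed within the slop
  for (int k = 0; k < starts[clause].length; k++) {
    if (prevEnd >= 0 && starts[clause][k] < prevEnd) continue; // ordered, non-overlapping
    int gap = (prevEnd < 0) ? 0 : starts[clause][k] - prevEnd;
    if (search(starts, ends, clause + 1, ends[clause][k], slop + gap, allowedSlop)) {
      return true;
    }
  }
  return false;
}
{code}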

2) Inconsistent Scoring of SpanNearQuery

*) Lazy iteration means that some "redundant" matches in a document are skipped 
in order to have a faster matching algorithm. I am not sure how redundant was 
defined exactly for the idea of lazy iteration. It referred somehow to matches 
with the same start position. Skipping matches, however, reduces the frequency 
that is computed and consequently the score. See the Javadoc of phraseFreq() in 
SloppyPhraseScorer, which mentions the same phenomenon. This is quite important 
for my use case of SpanQueries. I have different versions/variants of the same 
term at the same position, e.g. one with case-normalization and one without, 
and I want a higher score if the user query matches more than one variant; I 
use this approach for clauses of SpanNearQuery.

*) In NearSpansOrdered the method width() (it is used to compute the sloppy 
frequency in SpanScorer) returns the number of gaps between the matches. If you 
have a perfect match it returns 0 (no sloppiness). In NearSpansUnordered it 
returns the length of the match, not the number of gaps. See atMatch() for the 
difference. The reason is probably that (maxEndPositionCell.endPosition() - 
minPositionCell().startPosition() - totalSpanLength) might even become negative 
if matches overlap. I would prefer something like Math.max(0, 
(maxEndPositionCell.endPosition() - minPositionCell().startPosition() - 
totalSpanLength)).

*) SpanOrQuery and SpanNearQuery completely ignore the scores of their 
subclauses (subweights are always generated as non-scoring). A SpanOrQuery 
should give a score similar to a BooleanQuery, shouldn't it? As long as we have 
this behavior, SpanBoostQuery does not make any sense, does it? So in my 
opinion the existence of SpanBoostQuery shows that others also had the idea 
that a nested SpanQuery should somehow use the scores of its clauses for the 
computation of its own score.


> Nested Span Queries are buggy
> -
>
> Key: LUCENE-7398
> URL: https://issues.apache.org/jira/browse/LUCENE-7398
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 5.5, 6.x
>Reporter: Christoph Goller
>Assignee: Alan Woodward
>Priority: Critical
> Attachments: LUCENE-7398-20160814.patch, LUCENE-7398.patch, 
> LUCENE-7398.patch, TestSpanCollection.java
>
>
> Example for a nested SpanQuery that is not working:
> Document: Human Genome Organization , HUGO , is 

[jira] [Commented] (LUCENE-7398) Nested Span Queries are buggy

2016-09-05 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15465336#comment-15465336
 ] 

Christoph Goller commented on LUCENE-7398:
--

Good idea to try the nested tests from TestSpanCollection for the unordered 
case. The example from LUCENE-5331 shows the problems of incomplete 
backtracking (not comparing all combinations of span matches of all subspans) 
for the unordered case. In the ordered case we only have a problem with spans 
that have matches of different length; in the unordered case we also see a 
problem with overlapping span matches, even if they all have length 1.

> Nested Span Queries are buggy
> -
>
> Key: LUCENE-7398
> URL: https://issues.apache.org/jira/browse/LUCENE-7398
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 5.5, 6.x
>Reporter: Christoph Goller
>Assignee: Alan Woodward
>Priority: Critical
> Attachments: LUCENE-7398-20160814.patch, LUCENE-7398.patch, 
> LUCENE-7398.patch, TestSpanCollection.java
>
>
> Example for a nested SpanQuery that is not working:
> Document: Human Genome Organization , HUGO , is trying to coordinate gene 
> mapping research worldwide.
> Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping], 
> 0, true), body:gene]), body:research], 0, true)
> The query should match "coordinate gene mapping research" as well as 
> "coordinate gene research". It does not match  "coordinate gene mapping 
> research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It 
> probably stopped working with the changes on SpanQueries in 5.3. I will 
> attach a unit test that shows the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy

2016-09-05 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15465301#comment-15465301
 ] 

Christoph Goller edited comment on LUCENE-7398 at 9/5/16 4:07 PM:
--

Paul's 20160814 patch almost convinced me. Unfortunately, it does not fix the 
case when an intermediate span has a longer match that reduces overall 
sloppiness but overlaps with a match of a subsequent span and consequently 
requires advancing the subsequent span. Here is an example:

Document: w1 w2 w3 w4 w5
near/0(w1, or(w2, near/0(w2, w3, w4)), or(w5, near/0(w4, w5)))

Add the following code to the end of TestSpanCollection.testNestedNearQuery()

{code}
SpanNearQuery q234 = new SpanNearQuery(new SpanQuery[]{q2, q3, q4}, 0, true);
SpanOrQuery q2234 = new SpanOrQuery(q2, q234);
SpanTermQuery p5 = new SpanTermQuery(new Term(FIELD, "w5"));
SpanNearQuery q45 = new SpanNearQuery(new SpanQuery[]{q4, p5}, 0, true);
SpanOrQuery q455 = new SpanOrQuery(q45, p5);

SpanNearQuery q1q2234q445 = new SpanNearQuery(new SpanQuery[]{q1, q2234, q455}, 
0, true);
spans = q1q2234q445.createWeight(searcher, false, 
1f).getSpans(searcher.getIndexReader().leaves().get(0),SpanWeight.Postings.POSITIONS);
assertEquals(0, spans.advance(0));
{code}

I think we can only fix it if we give up lazy iteration. I don't think this is 
so bad for performance. If we implement a clever caching for positions in 
spans, a complete backtracking would only consist of making a few additional 
int comparisons. The expensive operation is iterating over all span positions 
(IO), and we do this already in advancePosition(Spans, int), don't we?


was (Author: gol...@detego-software.de):
Paul's fix almost convinced me. Unfortunately, it does not fix the case when an 
intermediate span has a longer match that reduces overall sloppiness but 
overlaps with a match of a subsequent span and consequently requires advancing 
the subsequent span. Here is an example:

Document: w1 w2 w3 w4 w5
near/0(w1, or(w2, near/0(w2, w3, w4)), or(w5, near/0(w4, w5)))

Add the following code to the end of TestSpanCollection.testNestedNearQuery()

{code}
SpanNearQuery q234 = new SpanNearQuery(new SpanQuery[]{q2, q3, q4}, 0, true);
SpanOrQuery q2234 = new SpanOrQuery(q2, q234);
SpanTermQuery p5 = new SpanTermQuery(new Term(FIELD, "w5"));
SpanNearQuery q45 = new SpanNearQuery(new SpanQuery[]{q4, p5}, 0, true);
SpanOrQuery q455 = new SpanOrQuery(q45, p5);

SpanNearQuery q1q2234q445 = new SpanNearQuery(new SpanQuery[]{q1, q2234, q455}, 
0, true);
spans = q1q2234q445.createWeight(searcher, false, 
1f).getSpans(searcher.getIndexReader().leaves().get(0),SpanWeight.Postings.POSITIONS);
assertEquals(0, spans.advance(0));
{code}

I think we can only fix it if we give up lazy iteration. I don't think this is 
so bad for performance. If we implement a clever caching for positions in 
spans, a complete backtracking would only consist of making a few additional 
int comparisons. The expensive operation is iterating over all span positions 
(IO), and we do this already in advancePosition(Spans, int), don't we?

> Nested Span Queries are buggy
> -
>
> Key: LUCENE-7398
> URL: https://issues.apache.org/jira/browse/LUCENE-7398
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 5.5, 6.x
>Reporter: Christoph Goller
>Assignee: Alan Woodward
>Priority: Critical
> Attachments: LUCENE-7398-20160814.patch, LUCENE-7398.patch, 
> LUCENE-7398.patch, TestSpanCollection.java
>
>
> Example for a nested SpanQuery that is not working:
> Document: Human Genome Organization , HUGO , is trying to coordinate gene 
> mapping research worldwide.
> Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping], 
> 0, true), body:gene]), body:research], 0, true)
> The query should match "coordinate gene mapping research" as well as 
> "coordinate gene research". It does not match  "coordinate gene mapping 
> research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It 
> probably stopped working with the changes on SpanQueries in 5.3. I will 
> attach a unit test that shows the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7398) Nested Span Queries are buggy

2016-09-05 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15465301#comment-15465301
 ] 

Christoph Goller commented on LUCENE-7398:
--

Paul's fix almost convinced me. Unfortunately, it does not fix the case when an 
intermediate span has a longer match that reduces overall sloppiness but 
overlaps with a match of a subsequent span and consequently requires advancing 
the subsequent span. Here is an example:

Document: w1 w2 w3 w4 w5
near/0(w1, or(w2, near/0(w2, w3, w4)), or(w5, near/0(w4, w5)))

Add the following code to the end of TestSpanCollection.testNestedNearQuery()

{code}
SpanNearQuery q234 = new SpanNearQuery(new SpanQuery[]{q2, q3, q4}, 0, true);
SpanOrQuery q2234 = new SpanOrQuery(q2, q234);
SpanTermQuery p5 = new SpanTermQuery(new Term(FIELD, "w5"));
SpanNearQuery q45 = new SpanNearQuery(new SpanQuery[]{q4, p5}, 0, true);
SpanOrQuery q455 = new SpanOrQuery(q45, p5);

SpanNearQuery q1q2234q445 = new SpanNearQuery(new SpanQuery[]{q1, q2234, q455}, 
0, true);
spans = q1q2234q445.createWeight(searcher, false, 
1f).getSpans(searcher.getIndexReader().leaves().get(0),SpanWeight.Postings.POSITIONS);
assertEquals(0, spans.advance(0));
{code}

I think we can only fix it if we give up lazy iteration. I don't think this is 
so bad for performance. If we implement a clever caching for positions in 
spans, a complete backtracking would only consist of making a few additional 
int comparisons. The expensive operation is iterating over all span positions 
(IO), and we do this already in advancePosition(Spans, int), don't we?

> Nested Span Queries are buggy
> -
>
> Key: LUCENE-7398
> URL: https://issues.apache.org/jira/browse/LUCENE-7398
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 5.5, 6.x
>Reporter: Christoph Goller
>Assignee: Alan Woodward
>Priority: Critical
> Attachments: LUCENE-7398-20160814.patch, LUCENE-7398.patch, 
> LUCENE-7398.patch, TestSpanCollection.java
>
>
> Example for a nested SpanQuery that is not working:
> Document: Human Genome Organization , HUGO , is trying to coordinate gene 
> mapping research worldwide.
> Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping], 
> 0, true), body:gene]), body:research], 0, true)
> The query should match "coordinate gene mapping research" as well as 
> "coordinate gene research". It does not match  "coordinate gene mapping 
> research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It 
> probably stopped working with the changes on SpanQueries in 5.3. I will 
> attach a unit test that shows the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5331) nested SpanNearQuery with repeating groups does not find match

2016-09-05 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15464619#comment-15464619
 ] 

Christoph Goller commented on LUCENE-5331:
--

As in LUCENE-2861, the problem is caused by overlapping matches for d, b, and c 
and an incomplete backtracking mechanism in unordered SpanQuery.

> nested SpanNearQuery with repeating groups does not find match
> --
>
> Key: LUCENE-5331
> URL: https://issues.apache.org/jira/browse/LUCENE-5331
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Jerry Zhou
> Attachments: NestedSpanNearTest.java, 
> NestedSpanNearTest_20160902.patch
>
>
> Nested spanNear queries do not work in some cases when repeating groups are 
> in the query.
> Test case is attached ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2861) Search doesn't return document via query

2016-09-05 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15464613#comment-15464613
 ] 

Christoph Goller commented on LUCENE-2861:
--

The problem is caused by overlapping matches within spanNear2. The first match 
for spanNear2 matches "intended message" and the second matches "message" 
against the same "message" in the text, so that the match for "addressed" is 
too far away. One possible fix would be to forbid overlapping matches or to add 
a special, very complex treatment like in SloppyPhraseScorer. I think it would 
be better to give up lazy backtracking and implement a correct backtracking 
(see LUCENE-7398).

> Search doesn't return document via query
> 
>
> Key: LUCENE-2861
> URL: https://issues.apache.org/jira/browse/LUCENE-2861
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 2.9.1, 2.9.4, 3.0.3
> Environment: Doesn't depend on environment
>Reporter: Zenoviy Veres
>
> The query doesn't return document that contain all words from query in 
> correct order.
> The issue might be within mechanism how do SpanQuerys actually match results 
> (http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/)
> Please refer for details below. The example text wasn't passed through 
> snowball analyzer, however the issue exists after analyzing too
> Query:
> (intend within 3 of message) within 5 of message within 3 of addressed.  
> Text within document:
> The contents of this e-mail message and
> any attachments are intended solely for the
> addressee(s) and may contain confidential
> and/or legally privileged information. If you
> are not the intended recipient of this message
> or if this message has been addressed to you
> in error, please immediately alert the sender
>  by reply e-mail and then delete this message
> and any attachments
> Result query:
> SpanNearQuery spanNear = new SpanNearQuery(new SpanQuery[] {
> new SpanTermQuery(new Term(BODY, "intended")),
> new SpanTermQuery(new Term(BODY, "message"))},
> 4,
> false);
> SpanNearQuery spanNear2 = new SpanNearQuery(new SpanQuery[] 
> {spanNear, new SpanTermQuery(new Term(BODY, "message"))}, 5, false);
> SpanNearQuery spanNear3 = new SpanNearQuery(new SpanQuery[] 
> {spanNear2, new SpanTermQuery(new Term(BODY, "addressed"))}, 3, false);



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-5396) SpanNearQuery returns single term spans

2016-09-05 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15464535#comment-15464535
 ] 

Christoph Goller commented on LUCENE-5396:
--

Is this a bug or desired behavior?

For me it is at least an acceptable behavior. I like the behavior of unordered 
SpanNearQuery to match if clauses overlap or match at the same position, and it 
would be quite difficult to find out whether two clauses match at the same 
index term or only at the same position.

Background: I am using a component for word decomposition. This might be a very 
rare case for English but it is a much more common phenomenon for German and 
Dutch. The two compound parts of "wallpaper" (wall and paper) go into the same 
index position as wallpaper. I am using spanNear([wall, paper], 0, false) to 
search for wallpaper and expect matches for "wallpaper" as well as for "wall 
paper".

So far we do not have a proper definition of what SpanQueries should do, and 
the only way to find out what they currently do is to look into the code. I 
think the current behavior is not very consistent. I will present some of my 
insights and ideas in LUCENE-7398.

> SpanNearQuery returns single term spans
> ---
>
> Key: LUCENE-5396
> URL: https://issues.apache.org/jira/browse/LUCENE-5396
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Reporter: Piotr Pęzik
>
> Let's assume we have an index with two documents:
> 1. contents: "test bunga bunga test"
> 2. contents: "test bunga test"
> We run two SpanNearQueries against this index:
> 1. spanNear([contents:bunga, contents:bunga], 0, true)
> 2. spanNear([contents:bunga, contents:bunga], 0, false)
> For the first query we get 1 hit. The first document in the example above 
> gets matched and the second one doesn't. This make sense, because we want the 
> term "bunga" followed by another "bunga" here.
> However, both documents get matched by the second query. This is also 
> problematic in cases where we have duplicate terms in longer (unordered) 
> spannear queries, e. g.: unordered 'A B A' will match spans such as 'A B' or 
> 'B A'.
> A complete example follows. 
> -
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.TextField;
> import org.apache.lucene.index.DirectoryReader;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.IndexWriterConfig;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.TopDocs;
> import org.apache.lucene.search.spans.SpanNearQuery;
> import org.apache.lucene.search.spans.SpanQuery;
> import org.apache.lucene.search.spans.SpanTermQuery;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.FSDirectory;
> import org.apache.lucene.store.RAMDirectory;
> import org.apache.lucene.util.Version;
> import java.io.StringReader;
> import static org.junit.Assert.assertEquals;
> class SpansBug {
> public static void main(String [] args) throws Exception {
> Directory dir = new RAMDirectory();
> Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
> IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45, 
> analyzer);
> IndexWriter writer = new IndexWriter(dir, iwc);
> String contents = "contents";
> Document doc1 = new Document();
> doc1.add(new TextField(contents, new StringReader("test bunga bunga 
> test")));
> Document doc2 = new Document();
> doc2.add(new TextField(contents, new StringReader("test bunga 
> test")));
> writer.addDocument(doc1);
> writer.addDocument(doc2);
> writer.commit();
> IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
> SpanQuery stq1 = new SpanTermQuery(new Term(contents,"bunga"));
> SpanQuery stq2 = new SpanTermQuery(new Term(contents,"bunga"));
> SpanQuery [] spqa = new SpanQuery[]{stq1,stq2};
> SpanNearQuery spanQ1 = new SpanNearQuery(spqa,0, true);
> SpanNearQuery spanQ2 = new SpanNearQuery(spqa,0, false);
> System.out.println(spanQ1);
> TopDocs tdocs1 = searcher.search(spanQ1,10);
> assertEquals(tdocs1.totalHits ,1);
> System.out.println(spanQ2);
> TopDocs tdocs2 = searcher.search(spanQ2,10);
> //I'd expect one hit here:
> assertEquals(tdocs2.totalHits ,1); // Assertion fails
> }
> }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7398) Nested Span Queries are buggy

2016-08-03 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406059#comment-15406059
 ] 

Christoph Goller commented on LUCENE-7398:
--

The whole idea of the patch is to change the order of the matches returned by 
SpanOrQuery.

{code}
SpanTermQuery q2 = new SpanTermQuery(new Term(FIELD, "w2"));
SpanTermQuery q3 = new SpanTermQuery(new Term(FIELD, "w3"));
SpanNearQuery q23 = new SpanNearQuery(new SpanQuery[]{q2, q3}, 0, true);
SpanOrQuery q223 = new SpanOrQuery(q2, q23);
{code}

For a document containing "w1 w2 w3 w4", query q223 now returns "w2 w3" (the 
longer match) as its first match and then "w2", while formerly it was the other 
way round. Both matches have the same start position but different end 
positions, and the contract about spans says that if start positions are equal 
we first get the match with the lower end position (Javadoc of Spans).

> Nested Span Queries are buggy
> -
>
> Key: LUCENE-7398
> URL: https://issues.apache.org/jira/browse/LUCENE-7398
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 5.5, 6.x
>Reporter: Christoph Goller
>Assignee: Alan Woodward
>Priority: Critical
> Attachments: LUCENE-7398.patch, LUCENE-7398.patch, 
> TestSpanCollection.java
>
>
> Example for a nested SpanQuery that is not working:
> Document: Human Genome Organization , HUGO , is trying to coordinate gene 
> mapping research worldwide.
> Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping], 
> 0, true), body:gene]), body:research], 0, true)
> The query should match "coordinate gene mapping research" as well as 
> "coordinate gene research". It does not match  "coordinate gene mapping 
> research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It 
> probably stopped working with the changes on SpanQueries in 5.3. I will 
> attach a unit test that shows the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy

2016-08-03 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405903#comment-15405903
 ] 

Christoph Goller edited comment on LUCENE-7398 at 8/3/16 1:21 PM:
--

After thoroughly looking into SpanQueries my conclusion is that we have a 
fundamental problem in the implementation of SpanNearQuery. The problem is not 
new; it probably already existed in the first version of SpanQueries, which as 
far as I know was implemented by Doug Cutting himself. I remember some attempts 
to describe in which cases SpanQueries work correctly and in which they do not 
(discussions about overlapping), but those explanations and definitions were 
never completely convincing for me. 

My best guess: NearSpansOrdered and NearSpansUnordered currently are only 
correct if for each clause of the SpanQuery we can guarantee that all its 
matches have the same length. In this case it is clear that (for the ordered 
case) if a match is too long (sloppy) we can skip to the first clause and call 
nextPosition. No alternative matches of intermediate clauses could improve the 
overall match. If we have clauses with varying match length (SpanOr or SpanNear 
with sloppiness) we would have to backtrack to intermediate clauses and check 
whether there are e.g. longer matches that could reduce the overall match 
length. Paul's last test case shows that even a match of the second clause that 
advances its position can reduce the overall length if it is itself longer. A 
match of an intermediate clause at an advanced position could be considerably 
shorter than its first match, requiring a reset of the spans of the following 
clauses. In my opinion this bug can only be fixed by implementing a 
backtracking search on the subspans, which also requires a limited possibility 
to reposition Spans to previous positions.

By the way, shrinkToAfterShortestMatch() in NearSpansOrdered of Lucene 4.10.4 
provided a kind of backtracking, which is the reason why my queries worked in 
Elasticsearch 1.7.x. However, I think that implementation also did not solve 
all cases:

{code}
  /** The subSpans are ordered in the same doc, so there is a possible match.
   * Compute the slop while making the match as short as possible by advancing
   * all subSpans except the last one in reverse order.
   */
  private boolean shrinkToAfterShortestMatch() throws IOException {
matchStart = subSpans[subSpans.length - 1].start();
matchEnd = subSpans[subSpans.length - 1].end();
Set possibleMatchPayloads = new HashSet<>();
if (subSpans[subSpans.length - 1].isPayloadAvailable()) {
  possibleMatchPayloads.addAll(subSpans[subSpans.length - 1].getPayload());
}

Collection possiblePayload = null;

int matchSlop = 0;
int lastStart = matchStart;
int lastEnd = matchEnd;
for (int i = subSpans.length - 2; i >= 0; i--) {
  Spans prevSpans = subSpans[i];
  if (collectPayloads && prevSpans.isPayloadAvailable()) {
Collection payload = prevSpans.getPayload();
possiblePayload = new ArrayList<>(payload.size());
possiblePayload.addAll(payload);
  }
  
  int prevStart = prevSpans.start();
  int prevEnd = prevSpans.end();
  while (true) { // Advance prevSpans until after (lastStart, lastEnd)
if (! prevSpans.next()) {
  inSameDoc = false;
  more = false;
  break; // Check remaining subSpans for final match.
} else if (matchDoc != prevSpans.doc()) {
  inSameDoc = false; // The last subSpans is not advanced here.
  break; // Check remaining subSpans for last match in this document.
} else {
  int ppStart = prevSpans.start();
  int ppEnd = prevSpans.end(); // Cannot avoid invoking .end()
  if (! docSpansOrderedNonOverlap(ppStart, ppEnd, lastStart, lastEnd)) {
break; // Check remaining subSpans.
  } else { // prevSpans still before (lastStart, lastEnd)
prevStart = ppStart;
prevEnd = ppEnd;
if (collectPayloads && prevSpans.isPayloadAvailable()) {
  Collection payload = prevSpans.getPayload();
  possiblePayload = new ArrayList<>(payload.size());
  possiblePayload.addAll(payload);
}
  }
}
  }

  if (collectPayloads && possiblePayload != null) {
possibleMatchPayloads.addAll(possiblePayload);
  }
  
  assert prevStart <= matchStart;
  if (matchStart > prevEnd) { // Only non overlapping spans add to slop.
matchSlop += (matchStart - prevEnd);
  }

  /* Do not break on (matchSlop > allowedSlop) here to make sure
   * that subSpans[0] is advanced after the match, if any.
   */
  matchStart = prevStart;
  lastStart = prevStart;
  lastEnd = prevEnd;
}

boolean match = matchSlop 

[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy

2016-08-03 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405903#comment-15405903
 ] 

Christoph Goller edited comment on LUCENE-7398 at 8/3/16 1:20 PM:
--

After thoroughly looking into SpanQueries my conclusion is that we have a 
fundamental problem in the implementation of SpanNearQuery. The problem is not 
new; it probably already existed in the first version of SpanQueries, which as 
far as I know was implemented by Doug Cutting himself. I remember some attempts 
to describe in which cases SpanQueries work correctly and in which they do not 
(discussions about overlapping), but those explanations and definitions were 
never completely convincing for me. 

My best guess: NearSpansOrdered and NearSpansUnordered currently are only 
correct if for each clause of the SpanQuery we can guarantee that all its 
matches have the same length. In this case it is clear that (for the ordered 
case) if a match is too long (sloppy) we can skip to the first clause and call 
nextPosition. No alternative matches of intermediate clauses could improve the 
overall match. If we have clauses with varying match length (SpanOr or SpanNear 
with sloppiness) we would have to backtrack to intermediate clauses and check 
whether there are e.g. longer matches that could reduce the overall match 
length. Paul's last test case shows that even a match of the second clause that 
advances its position can reduce the overall length if it is itself longer. A 
match of an intermediate clause at an advanced position could be considerably 
shorter than its first match, requiring a reset of the spans of the following 
clauses. In my opinion this bug can only be fixed by implementing a 
backtracking search on the subspans, which also requires a limited possibility 
to reposition Spans to previous positions.

By the way, shrinkToAfterShortestMatch() in NearSpansOrdered of Lucene 4.10.4 
provided a kind of backtracking, which is the reason why my queries worked in 
Elasticsearch 1.7.x. However, I think that implementation also did not solve 
all cases:

{code}
  /** The subSpans are ordered in the same doc, so there is a possible match.
   * Compute the slop while making the match as short as possible by advancing
   * all subSpans except the last one in reverse order.
   */
  private boolean shrinkToAfterShortestMatch() throws IOException {
matchStart = subSpans[subSpans.length - 1].start();
matchEnd = subSpans[subSpans.length - 1].end();
Set possibleMatchPayloads = new HashSet<>();
if (subSpans[subSpans.length - 1].isPayloadAvailable()) {
  possibleMatchPayloads.addAll(subSpans[subSpans.length - 1].getPayload());
}

Collection possiblePayload = null;

int matchSlop = 0;
int lastStart = matchStart;
int lastEnd = matchEnd;
for (int i = subSpans.length - 2; i >= 0; i--) {
  Spans prevSpans = subSpans[i];
  if (collectPayloads && prevSpans.isPayloadAvailable()) {
Collection payload = prevSpans.getPayload();
possiblePayload = new ArrayList<>(payload.size());
possiblePayload.addAll(payload);
  }
  
  int prevStart = prevSpans.start();
  int prevEnd = prevSpans.end();
  while (true) { // Advance prevSpans until after (lastStart, lastEnd)
if (! prevSpans.next()) {
  inSameDoc = false;
  more = false;
  break; // Check remaining subSpans for final match.
} else if (matchDoc != prevSpans.doc()) {
  inSameDoc = false; // The last subSpans is not advanced here.
  break; // Check remaining subSpans for last match in this document.
} else {
  int ppStart = prevSpans.start();
  int ppEnd = prevSpans.end(); // Cannot avoid invoking .end()
  if (! docSpansOrderedNonOverlap(ppStart, ppEnd, lastStart, lastEnd)) {
break; // Check remaining subSpans.
  } else { // prevSpans still before (lastStart, lastEnd)
prevStart = ppStart;
prevEnd = ppEnd;
if (collectPayloads && prevSpans.isPayloadAvailable()) {
  Collection payload = prevSpans.getPayload();
  possiblePayload = new ArrayList<>(payload.size());
  possiblePayload.addAll(payload);
}
  }
}
  }

  if (collectPayloads && possiblePayload != null) {
possibleMatchPayloads.addAll(possiblePayload);
  }
  
  assert prevStart <= matchStart;
  if (matchStart > prevEnd) { // Only non overlapping spans add to slop.
matchSlop += (matchStart - prevEnd);
  }

  /* Do not break on (matchSlop > allowedSlop) here to make sure
   * that subSpans[0] is advanced after the match, if any.
   */
  matchStart = prevStart;
  lastStart = prevStart;
  lastEnd = prevEnd;
}

boolean match = matchSlop 

[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy

2016-08-03 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405903#comment-15405903
 ] 

Christoph Goller edited comment on LUCENE-7398 at 8/3/16 1:19 PM:
--

After thoroughly looking into SpanQueries my conclusion is that we have a 
fundamental problem in the implementation of SpanNearQuery. The problem is not 
new; it probably already existed in the first version of SpanQueries, which as 
far as I know was implemented by Doug Cutting himself. I remember some attempts 
to describe in which cases SpanQueries work correctly and in which they do not 
(discussions about overlapping), but those explanations and definitions were 
never completely convincing for me. 

My best guess: NearSpansOrdered and NearSpansUnordered currently are only 
correct if for each clause of the SpanQuery we can guarantee that all its 
matches have the same length. In this case it is clear that (for the ordered 
case) if a match is too long (sloppy) we can skip to the first clause and call 
nextPosition. No alternative matches of intermediate clauses could improve the 
overall match. If we have clauses with varying match length (SpanOr or SpanNear 
with sloppiness) we would have to backtrack to intermediate clauses and check 
whether there are e.g. longer matches that could reduce the overall match 
length. Paul's last test case shows that even a match of the second clause that 
advances its position can reduce the overall length if it is itself longer. A 
match of an intermediate clause at an advanced position could be considerably 
shorter than its first match, requiring a reset of the spans of the following 
clauses. In my opinion this bug can only be fixed by implementing a 
backtracking search on the subspans, which also requires a limited possibility 
to reposition Spans to previous positions.

By the way, shrinkToAfterShortestMatch() in NearSpansOrdered of Lucene 4.10.4 
provided a kind of backtracking, which is the reason why my queries worked in 
Elasticsearch 1.7.x. However, I think that implementation also did not solve 
all cases:

{code}
  /** The subSpans are ordered in the same doc, so there is a possible match.
   * Compute the slop while making the match as short as possible by advancing
   * all subSpans except the last one in reverse order.
   */
  private boolean shrinkToAfterShortestMatch() throws IOException {
matchStart = subSpans[subSpans.length - 1].start();
matchEnd = subSpans[subSpans.length - 1].end();
Set possibleMatchPayloads = new HashSet<>();
if (subSpans[subSpans.length - 1].isPayloadAvailable()) {
  possibleMatchPayloads.addAll(subSpans[subSpans.length - 1].getPayload());
}

Collection possiblePayload = null;

int matchSlop = 0;
int lastStart = matchStart;
int lastEnd = matchEnd;
for (int i = subSpans.length - 2; i >= 0; i--) {
  Spans prevSpans = subSpans[i];
  if (collectPayloads && prevSpans.isPayloadAvailable()) {
Collection payload = prevSpans.getPayload();
possiblePayload = new ArrayList<>(payload.size());
possiblePayload.addAll(payload);
  }
  
  int prevStart = prevSpans.start();
  int prevEnd = prevSpans.end();
  while (true) { // Advance prevSpans until after (lastStart, lastEnd)
if (! prevSpans.next()) {
  inSameDoc = false;
  more = false;
  break; // Check remaining subSpans for final match.
} else if (matchDoc != prevSpans.doc()) {
  inSameDoc = false; // The last subSpans is not advanced here.
  break; // Check remaining subSpans for last match in this document.
} else {
  int ppStart = prevSpans.start();
  int ppEnd = prevSpans.end(); // Cannot avoid invoking .end()
  if (! docSpansOrderedNonOverlap(ppStart, ppEnd, lastStart, lastEnd)) {
break; // Check remaining subSpans.
  } else { // prevSpans still before (lastStart, lastEnd)
prevStart = ppStart;
prevEnd = ppEnd;
if (collectPayloads && prevSpans.isPayloadAvailable()) {
  Collection payload = prevSpans.getPayload();
  possiblePayload = new ArrayList<>(payload.size());
  possiblePayload.addAll(payload);
}
  }
}
  }

  if (collectPayloads && possiblePayload != null) {
possibleMatchPayloads.addAll(possiblePayload);
  }
  
  assert prevStart <= matchStart;
  if (matchStart > prevEnd) { // Only non overlapping spans add to slop.
matchSlop += (matchStart - prevEnd);
  }

  /* Do not break on (matchSlop > allowedSlop) here to make sure
   * that subSpans[0] is advanced after the match, if any.
   */
  matchStart = prevStart;
  lastStart = prevStart;
  lastEnd = prevEnd;
}

boolean match = matchSlop 

[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy

2016-08-03 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405903#comment-15405903
 ] 

Christoph Goller edited comment on LUCENE-7398 at 8/3/16 1:18 PM:
--

After thoroughly looking into SpanQueries my conclusion is that we have a 
fundamental problem in the implementation of SpanNearQuery. The problem is not 
new; it probably already existed in the first version of SpanQueries, which as 
far as I know was implemented by Doug Cutting himself. I remember some attempts 
to describe in which cases SpanQueries work correctly and in which they do not 
(discussions about overlapping), but those explanations and definitions were 
never completely convincing for me. 

My best guess: NearSpansOrdered and NearSpansUnordered currently are only 
correct if for each clause of the SpanQuery we can guarantee that all its 
matches have the same length. In this case it is clear that (for the ordered 
case) if a match is too long (sloppy) we can skip to the first clause and call 
nextPosition. No alternative matches of intermediate clauses could improve the 
overall match. If we have clauses with varying match length (SpanOr or SpanNear 
with sloppiness) we would have to backtrack to intermediate clauses and check 
whether there are e.g. longer matches that could reduce the overall match 
length. Paul's last test case shows that even a match of the second clause that 
advances its position can reduce the overall length if it is itself longer. A 
match of an intermediate clause at an advanced position could be considerably 
shorter than its first match, requiring a reset of the spans of the following 
clauses. In my opinion this bug can only be fixed by implementing a 
backtracking search on the subspans, which also requires a limited possibility 
to reposition Spans to previous positions.

By the way, shrinkToAfterShortestMatch() in NearSpansOrdered of Lucene 4.10.4 
provided a kind of backtracking, which is the reason why my queries worked in 
Elasticsearch 1.7.x. However, I think that implementation also did not solve 
all cases:

{code}
  /** The subSpans are ordered in the same doc, so there is a possible match.
   * Compute the slop while making the match as short as possible by advancing
   * all subSpans except the last one in reverse order.
   */
  private boolean shrinkToAfterShortestMatch() throws IOException {
matchStart = subSpans[subSpans.length - 1].start();
matchEnd = subSpans[subSpans.length - 1].end();
Set possibleMatchPayloads = new HashSet<>();
if (subSpans[subSpans.length - 1].isPayloadAvailable()) {
  possibleMatchPayloads.addAll(subSpans[subSpans.length - 1].getPayload());
}

Collection possiblePayload = null;

int matchSlop = 0;
int lastStart = matchStart;
int lastEnd = matchEnd;
for (int i = subSpans.length - 2; i >= 0; i--) {
  Spans prevSpans = subSpans[i];
  if (collectPayloads && prevSpans.isPayloadAvailable()) {
Collection payload = prevSpans.getPayload();
possiblePayload = new ArrayList<>(payload.size());
possiblePayload.addAll(payload);
  }
  
  int prevStart = prevSpans.start();
  int prevEnd = prevSpans.end();
  while (true) { // Advance prevSpans until after (lastStart, lastEnd)
if (! prevSpans.next()) {
  inSameDoc = false;
  more = false;
  break; // Check remaining subSpans for final match.
} else if (matchDoc != prevSpans.doc()) {
  inSameDoc = false; // The last subSpans is not advanced here.
  break; // Check remaining subSpans for last match in this document.
} else {
  int ppStart = prevSpans.start();
  int ppEnd = prevSpans.end(); // Cannot avoid invoking .end()
  if (! docSpansOrderedNonOverlap(ppStart, ppEnd, lastStart, lastEnd)) {
break; // Check remaining subSpans.
  } else { // prevSpans still before (lastStart, lastEnd)
prevStart = ppStart;
prevEnd = ppEnd;
if (collectPayloads && prevSpans.isPayloadAvailable()) {
  Collection payload = prevSpans.getPayload();
  possiblePayload = new ArrayList<>(payload.size());
  possiblePayload.addAll(payload);
}
  }
}
  }

  if (collectPayloads && possiblePayload != null) {
possibleMatchPayloads.addAll(possiblePayload);
  }
  
  assert prevStart <= matchStart;
  if (matchStart > prevEnd) { // Only non overlapping spans add to slop.
matchSlop += (matchStart - prevEnd);
  }

  /* Do not break on (matchSlop > allowedSlop) here to make sure
   * that subSpans[0] is advanced after the match, if any.
   */
  matchStart = prevStart;
  lastStart = prevStart;
  lastEnd = prevEnd;
}

boolean match = matchSlop 

[jira] [Commented] (LUCENE-7398) Nested Span Queries are buggy

2016-08-03 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405903#comment-15405903
 ] 

Christoph Goller commented on LUCENE-7398:
--

After thoroughly looking into SpanQueries my conclusion is that we have a 
fundamental problem in the implementation of SpanNearQuery. The problem is not 
new; it probably already existed in the first version of SpanQueries, which as 
far as I know was implemented by Doug Cutting himself. I remember some attempts 
to describe in which cases SpanQueries work correctly and in which they do not 
(discussions about overlapping), but those explanations and definitions were 
never completely convincing for me. 

My best guess: NearSpansOrdered and NearSpansUnordered currently are only 
correct if for each clause of the SpanQuery we can guarantee that all its 
matches have the same length. In this case it is clear that (for the ordered 
case) if a match is too long (sloppy) we can skip to the first clause and call 
nextPosition. No alternative matches of intermediate clauses could improve the 
overall match. If we have clauses with varying match length (SpanOr or SpanNear 
with sloppiness) we would have to backtrack to intermediate clauses and check 
whether there are e.g. longer matches that could reduce the overall match 
length. Paul's last test case shows that even a match of the second clause that 
advances its position can reduce the overall length if it is itself longer. A 
match of an intermediate clause at an advanced position could be considerably 
shorter than its first match, requiring a reset of the spans of the following 
clauses. In my opinion this bug can only be fixed by implementing a 
backtracking search on the subspans, which also requires a limited possibility 
to reposition Spans to previous positions.

By the way, shrinkToAfterShortestMatch() in NearSpansOrdered of Lucene 4.10.4 
provided a kind of backtracking, which is the reason why my queries worked in 
Elasticsearch 1.7.x. However, I think that implementation also did not solve 
all cases:

{code}
  /** The subSpans are ordered in the same doc, so there is a possible match.
   * Compute the slop while making the match as short as possible by advancing
   * all subSpans except the last one in reverse order.
   */
  private boolean shrinkToAfterShortestMatch() throws IOException {
matchStart = subSpans[subSpans.length - 1].start();
matchEnd = subSpans[subSpans.length - 1].end();
Set possibleMatchPayloads = new HashSet<>();
if (subSpans[subSpans.length - 1].isPayloadAvailable()) {
  possibleMatchPayloads.addAll(subSpans[subSpans.length - 1].getPayload());
}

Collection possiblePayload = null;

int matchSlop = 0;
int lastStart = matchStart;
int lastEnd = matchEnd;
for (int i = subSpans.length - 2; i >= 0; i--) {
  Spans prevSpans = subSpans[i];
  if (collectPayloads && prevSpans.isPayloadAvailable()) {
Collection payload = prevSpans.getPayload();
possiblePayload = new ArrayList<>(payload.size());
possiblePayload.addAll(payload);
  }
  
  int prevStart = prevSpans.start();
  int prevEnd = prevSpans.end();
  while (true) { // Advance prevSpans until after (lastStart, lastEnd)
if (! prevSpans.next()) {
  inSameDoc = false;
  more = false;
  break; // Check remaining subSpans for final match.
} else if (matchDoc != prevSpans.doc()) {
  inSameDoc = false; // The last subSpans is not advanced here.
  break; // Check remaining subSpans for last match in this document.
} else {
  int ppStart = prevSpans.start();
  int ppEnd = prevSpans.end(); // Cannot avoid invoking .end()
  if (! docSpansOrderedNonOverlap(ppStart, ppEnd, lastStart, lastEnd)) {
break; // Check remaining subSpans.
  } else { // prevSpans still before (lastStart, lastEnd)
prevStart = ppStart;
prevEnd = ppEnd;
if (collectPayloads && prevSpans.isPayloadAvailable()) {
  Collection payload = prevSpans.getPayload();
  possiblePayload = new ArrayList<>(payload.size());
  possiblePayload.addAll(payload);
}
  }
}
  }

  if (collectPayloads && possiblePayload != null) {
possibleMatchPayloads.addAll(possiblePayload);
  }
  
  assert prevStart <= matchStart;
  if (matchStart > prevEnd) { // Only non overlapping spans add to slop.
matchSlop += (matchStart - prevEnd);
  }

  /* Do not break on (matchSlop > allowedSlop) here to make sure
   * that subSpans[0] is advanced after the match, if any.
   */
  matchStart = prevStart;
  lastStart = prevStart;
  lastEnd = prevEnd;
}

boolean match = matchSlop <= allowedSlop;

if(collectPayloads && 

[jira] [Comment Edited] (LUCENE-7398) Nested Span Queries are buggy

2016-07-29 Thread Christoph Goller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15399414#comment-15399414
 ] 

Christoph Goller edited comment on LUCENE-7398 at 7/29/16 2:40 PM:
---

Please find attached an extended TestSpanCollection.java for Lucene 6.1 that 
shows the problem.


was (Author: gol...@detego-software.de):
Please find attached an extended TestSpanCollection.java that shows the 
problem.

> Nested Span Queries are buggy
> -
>
> Key: LUCENE-7398
> URL: https://issues.apache.org/jira/browse/LUCENE-7398
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 5.5, 6.x
>Reporter: Christoph Goller
>Priority: Critical
> Attachments: TestSpanCollection.java
>
>
> Example for a nested SpanQuery that is not working:
> Document: Human Genome Organization , HUGO , is trying to coordinate gene 
> mapping research worldwide.
> Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping], 
> 0, true), body:gene]), body:research], 0, true)
> The query should match "coordinate gene mapping research" as well as 
> "coordinate gene research". It does not match  "coordinate gene mapping 
> research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It 
> probably stopped working with the changes on SpanQueries in 5.3. I will 
> attach a unit test that shows the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7398) Nested Span Queries are buggy

2016-07-29 Thread Christoph Goller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christoph Goller updated LUCENE-7398:
-
Attachment: TestSpanCollection.java

Please find attached an extended TestSpanCollection.java that shows the 
problem.

> Nested Span Queries are buggy
> -
>
> Key: LUCENE-7398
> URL: https://issues.apache.org/jira/browse/LUCENE-7398
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/search
>Affects Versions: 5.5, 6.x
>Reporter: Christoph Goller
>Priority: Critical
> Attachments: TestSpanCollection.java
>
>
> Example for a nested SpanQuery that is not working:
> Document: Human Genome Organization , HUGO , is trying to coordinate gene 
> mapping research worldwide.
> Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping], 
> 0, true), body:gene]), body:research], 0, true)
> The query should match "coordinate gene mapping research" as well as 
> "coordinate gene research". It does not match  "coordinate gene mapping 
> research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It 
> probably stopped working with the changes on SpanQueries in 5.3. I will 
> attach a unit test that shows the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-7398) Nested Span Queries are buggy

2016-07-29 Thread Christoph Goller (JIRA)
Christoph Goller created LUCENE-7398:


 Summary: Nested Span Queries are buggy
 Key: LUCENE-7398
 URL: https://issues.apache.org/jira/browse/LUCENE-7398
 Project: Lucene - Core
  Issue Type: Bug
  Components: core/search
Affects Versions: 5.5, 6.x
Reporter: Christoph Goller
Priority: Critical


Example for a nested SpanQuery that is not working:

Document: Human Genome Organization , HUGO , is trying to coordinate gene 
mapping research worldwide.

Query: spanNear([body:coordinate, spanOr([spanNear([body:gene, body:mapping], 
0, true), body:gene]), body:research], 0, true)

The query should match "coordinate gene mapping research" as well as 
"coordinate gene research". It does not match  "coordinate gene mapping 
research" with Lucene 5.5 or 6.1, it did however match with Lucene 4.10.4. It 
probably stopped working with the changes on SpanQueries in 5.3. I will attach 
a unit test that shows the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Created: (LUCENE-2783) Deadlock in IndexWriter

2010-11-29 Thread Christoph Goller (JIRA)
Deadlock in IndexWriter
---

 Key: LUCENE-2783
 URL: https://issues.apache.org/jira/browse/LUCENE-2783
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9.3
 Environment: ALL
Reporter: Christoph Goller
 Fix For: 2.9.4


If autoCommit == true a merge usually triggers a commit. A commit 
(prepareCommit) can trigger a merge via the flush method. There is a 
synchronization mechanism for commit (commitLock) and a separate 
synchronization mechanism for merging (ConcurrentMergeScheduler.wait). If one 
thread holds the commitLock monitor and another one holds the 
ConcurrentMergeScheduler monitor, we have a deadlock.
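
A toy illustration of the lock-ordering problem described above (plain Java, 
all names invented; the real code paths are the commit lock and the 
ConcurrentMergeScheduler monitor):

{code}
final Object commitLock = new Object();
final Object mergeMonitor = new Object();

Thread committer = new Thread(() -> {
  synchronized (commitLock) {        // commit path takes the commit lock first ...
    synchronized (mergeMonitor) { }  // ... then waits for the merge monitor
  }
});
Thread merger = new Thread(() -> {
  synchronized (mergeMonitor) {      // merge path takes the merge monitor first ...
    synchronized (commitLock) { }    // ... then waits for the commit lock
  }
});
committer.start();
merger.start();                      // with unlucky timing, both threads block forever
{code}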

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] Closed: (LUCENE-2783) Deadlock in IndexWriter

2010-11-29 Thread Christoph Goller (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christoph Goller closed LUCENE-2783.


Resolution: Fixed

Already fixed with the introduction of the mergeDone flag in OneMerge in the 
upcoming Lucene 2.9.4.

 Deadlock in IndexWriter
 ---

 Key: LUCENE-2783
 URL: https://issues.apache.org/jira/browse/LUCENE-2783
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.9.3
 Environment: ALL
Reporter: Christoph Goller
 Fix For: 2.9.4

   Original Estimate: 2h
  Remaining Estimate: 2h

 If autoCommit == true a merge usually triggers a commit. A commit 
 (prepareCommit) can trigger a merge via the flush method. There is a 
 synchronization mechanism for commit (commitLock) and a separate 
 synchronization mechanism for merging (ConcurrentMergeScheduler.wait). If one 
 thread holds the commitLock monitor and another one holds the 
 ConcurrentMergeScheduler monitor we have a deadlock.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org