Robert Muir created LUCENE-6603:
-----------------------------------
Summary: consider restrictions on what Similarity.coord() can
return
Key: LUCENE-6603
URL: https://issues.apache.org/jira/browse/LUCENE-6603
Project: Lucene - Core
Issue Type: Bug
Reporter: Robert Muir
Today, Similarity.coord() can really return anything, though all of our
similarities are well-behaved, issues are all about custom ones:
{code}
* @param overlap the number of query terms matched in the document
* @param maxOverlap the total number of terms in the query
* @return a score factor based on term overlap with the query
*/
public float coord(int overlap, int maxOverlap)
{code}
But problems arise when a custom similarity implements {{coord(1, 1)}} to
return something crazy (say 2). In this case their coord impl is ignored (see
LUCENE-4300). {{coord(1,1)}} is always treated as 1, or things make no sense,
for example A NOT B would score differently from A (which would be a simple
termquery and have no BQ around it). Same goes with filters, which should not
change scoring but would, if we didn't mandate this.
Now we see the same problem again, with LUCENE-6585. For this optimization to
work in the current world, it will have to check if {{coord(N,N)}} == 1 in
order to do it safely. Otherwise it cannot safely collapse conjunctions for
such custom similarities.
I would like to enforce that {{coord(N,N)}} is always treated as 1, to prevent
all these crazy codepaths for wierd, not-so-well-tested cases. So we could
change the current hack in BooleanWeight.coord():
{code}
-- } else if (maxOverlap == 1) {
++ } else if (overlap == maxOverlap) {
return 1f;
{code}
Alternatively though, we could change javadocs of Similarity.coord() and add a
check to BooleanWeight to throw an exception.
In either case, doing this would be a break, because it would break some custom
sims out there, at least this one
(https://mail-archives.apache.org/mod_mbox/lucene-java-user/201208.mbox/%[email protected]%3E).
That one returns {{1/overlap}} which is kinda like averaging the scores, and
caused us to look into this and open bugs like LUCENE-4297 and LUCENE-4300 and
probably others. But perhaps coord() is not the way such things should be done
and there is another way instead, to allow BQ to be simpler and more efficient.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]