Robert Muir created LUCENE-6603:
-----------------------------------

             Summary: consider restrictions on what Similarity.coord() can 
return
                 Key: LUCENE-6603
                 URL: https://issues.apache.org/jira/browse/LUCENE-6603
             Project: Lucene - Core
          Issue Type: Bug
            Reporter: Robert Muir


Today, Similarity.coord() can really return anything, though all of our 
similarities are well-behaved, issues are all about custom ones:

{code}
   * @param overlap the number of query terms matched in the document
   * @param maxOverlap the total number of terms in the query
   * @return a score factor based on term overlap with the query
   */
  public float coord(int overlap, int maxOverlap)
{code}

But problems arise when a custom similarity implements {{coord(1, 1)}} to 
return something crazy (say 2). In this case their coord impl is ignored (see 
LUCENE-4300). {{coord(1,1)}} is always treated as 1, or things make no sense, 
for example A NOT B would score differently from A (which would be a simple 
termquery and have no BQ around it). Same goes with filters, which should not 
change scoring but would, if we didn't mandate this.

Now we see the same problem again, with LUCENE-6585. For this optimization to 
work in the current world, it will have to check if {{coord(N,N)}} == 1 in 
order to do it safely. Otherwise it cannot safely collapse conjunctions for 
such custom similarities.

I would like to enforce that {{coord(N,N)}} is always treated as 1, to prevent 
all these crazy codepaths for wierd, not-so-well-tested cases. So we could 
change the current hack in BooleanWeight.coord():
{code}
--   } else if (maxOverlap == 1) {
++   } else if (overlap == maxOverlap) {
   return 1f;
{code}

Alternatively though, we could change javadocs of Similarity.coord() and add a 
check to BooleanWeight to throw an exception.

In either case, doing this would be a break, because it would break some custom 
sims out there, at least this one 
(https://mail-archives.apache.org/mod_mbox/lucene-java-user/201208.mbox/%[email protected]%3E).
 That one returns {{1/overlap}} which is kinda like averaging the scores, and 
caused us to look into this and open bugs like LUCENE-4297 and LUCENE-4300 and 
probably others. But perhaps coord() is not the way such things should be done 
and there is another way instead, to allow BQ to be simpler and more efficient.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to