The problem is that the query combines the native TermQuery score (which depends on the document length and the term's statistics) with the edit distance, which is multiplied in as a boost. When the difference in term statistics is too large, the edit distance no longer matters. This is perfectly fine and also happens with other types of queries: when you have rare terms in short documents, those matches will always come out on top. You hit the same problem if, for example, you try to boost cheaper products to the top.

If you are only interested in the edit distance, you should configure IndexSearcher to use BooleanSimilarity, which ignores the term statistics, and disable norms on the field (during indexing or with a wrapper on the IndexReader): https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/search/similarities/BooleanSimilarity.html

You can also restrict this to a specific field: https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/search/similarities/PerFieldSimilarityWrapper.html

Uwe

On 09.07.2022 at 14:08, Michael Sokolov wrote:
I am no expert with this, but I got curious and looked at
FuzzyQuery/MultiTermQuery and I don't see any way to "boost" exact
matches, or even to incorporate the edit distance more generally into
the per-term score, although it does seem like that would be something
people would generally expect. So maybe FuzzyQuery should somehow do
that? But without changing it, you could also use a query that does it
explicitly; if you get a term "foo", you could maybe search for "foo
OR foo~" ?
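To get a feel for what "foo OR foo~" would do to the two-document example quoted further down in this thread, here is a rough sketch in plain Java (no Lucene calls). It rests on two assumptions that are not confirmed anywhere in the thread: that the scores of optional clauses are simply added, and that an explicit exact-term clause would score the same 1.4054651 that the explanation shows for "spark". The class and method names are made up for illustration.

```java
// Hypothetical sketch: estimates the effect of "spark OR spark~" from the
// per-clause scores printed in the ClassicSimilarity explanations.
final class ExactPlusFuzzySketch {

    // Per-clause scores copied from the explanations in this thread.
    static final double SPARK_SCORE = 1.4054651;  // doc 0, term "spark"
    static final double SPAR_SCORE = 0.74536043;  // doc 1, term "spar"
    static final double SPARS_SCORE = 0.79505116; // doc 1, term "spars"

    /** Score of the fuzzy query alone: sum of the matching expanded terms. */
    static double fuzzyOnlyScore(int doc) {
        return doc == 0 ? SPARK_SCORE : SPAR_SCORE + SPARS_SCORE;
    }

    /**
     * Assumed score of "spark OR spark~": the fuzzy score plus one more
     * exact-term contribution for documents that contain "spark".
     */
    static double exactOrFuzzyScore(int doc) {
        double exactClause = doc == 0 ? SPARK_SCORE : 0.0;
        return exactClause + fuzzyOnlyScore(doc);
    }

    public static void main(String[] args) {
        for (int doc = 0; doc < 2; doc++) {
            System.out.println("doc " + doc
                + " fuzzy only = " + fuzzyOnlyScore(doc)
                + ", exact OR fuzzy = " + exactOrFuzzyScore(doc));
        }
    }
}
```

Under those assumptions the exact-match document moves ahead in this tiny index, but the margin still depends on term statistics, so it is not a guarantee for a larger corpus.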

On Fri, Jul 8, 2022 at 4:14 PM Mike Drob <md...@mdrob.com> wrote:
Hi folks,

I'm working with some fuzzy queries and trying my best to understand what
is the expected behaviour of the searcher. I'm not sure if this is a
similarity bug or an incorrect usage on my end.

The problem is when I do a fuzzy search for a term "spark~" then instead of
matching documents with spark first, it will match other documents that
have multiple other near terms like "spar" and "spars". I see this same
thing with both ClassicSimilarity and BM25.

This is from a much smaller (two document) index when I was trying to
isolate and reproduce the issue, but I see comparable behaviour with more
varied scoring on a much larger corpus. The two documents are:

addDoc("spark spark", writer); // exact match

addDoc("spar spars", writer); // multiple fuzzy terms

The non-zero edit distance terms get a slight down-boost, but it's not
enough to overcome their sum exceeding even the TF boost for the desired
document.
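The size of that down-boost can be sketched in plain Java. The formula used here, 1 - editDistance / min(query term length, matched term length), is an assumption: it is consistent with the 0.75 and 0.8 boosts in the explanations below, but it is not taken from the thread. The sketch uses plain Levenshtein distance and String.length() for simplicity, and the class name is hypothetical.

```java
// Hypothetical sketch: reproduces the boosts seen in the score explanations,
// it does not call into Lucene.
final class FuzzyBoostSketch {

    /** Levenshtein distance (insert, delete, substitute) with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed boost: 1 - distance / length of the shorter of the two terms. */
    static float boost(String queryTerm, String matchedTerm) {
        int distance = editDistance(queryTerm, matchedTerm);
        int minLength = Math.min(queryTerm.length(), matchedTerm.length());
        return 1.0f - (float) distance / (float) minLength;
    }

    public static void main(String[] args) {
        for (String term : new String[] {"spark", "spar", "spars"}) {
            System.out.println(term + " -> boost " + boost("spark", term));
        }
    }
}
```

With this formula a one-character difference costs at most a quarter of the score for terms of this length, which is why two near matches together can outweigh one exact match.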

A full reproducible unit test is at
https://github.com/apache/lucene/commit/dbf8e788cd2c2a5e1852b8cee86cb21a792dc546

What is the recommended approach to get the document with exact term
matching for me again? I don't see an option to tweak the internal boost
provided by FuzzyQuery, that's one idea I had. Or is this a different
change that needs to be fixed at the lucene level rather than application
level?

Thanks,
Mike



More detail:


The first document with the field "spark spark" has a score explanation:

1.4054651 = sum of:
   1.4054651 = weight(field:spark in 0) [ClassicSimilarity], result of:
     1.4054651 = score(freq=2.0), product of:
       1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
         1 = docFreq, number of documents containing term
         2 = docCount, total number of documents with field
       1.4142135 = tf(freq=2.0), with freq of:
         2.0 = freq, occurrences of term within document
       0.70710677 = fieldNorm

And a document with the field "spar spars" comes in ever so slightly higher
at

1.5404116 = sum of:
   0.74536043 = weight(field:spar in 1) [ClassicSimilarity], result of:
     0.74536043 = score(freq=1.0), product of:
       0.75 = boost
       1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
         1 = docFreq, number of documents containing term
         2 = docCount, total number of documents with field
       1.0 = tf(freq=1.0), with freq of:
         1.0 = freq, occurrences of term within document
       0.70710677 = fieldNorm
   0.79505116 = weight(field:spars in 1) [ClassicSimilarity], result of:
     0.79505116 = score(freq=1.0), product of:
       0.8 = boost
       1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
         1 = docFreq, number of documents containing term
         2 = docCount, total number of documents with field
       1.0 = tf(freq=1.0), with freq of:
         1.0 = freq, occurrences of term within document
       0.70710677 = fieldNorm
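The arithmetic in these two explanations can be re-derived with a few lines of plain Java. This is only a sketch of the formulas as they are printed above (idf = log((docCount+1)/(docFreq+1)) + 1, and a tf that matches sqrt(freq)); treating fieldNorm as 1/sqrt(number of terms in the field) is an assumption that happens to match 0.70710677 for these two-term documents. The class and method names are hypothetical.

```java
// Hypothetical sketch: recomputes the explanation values by hand,
// without using any Lucene classes.
final class ClassicScoreSketch {

    /** idf as printed in the explanation: log((docCount+1)/(docFreq+1)) + 1. */
    static double idf(long docFreq, long docCount) {
        return Math.log((docCount + 1) / (double) (docFreq + 1)) + 1.0;
    }

    /** tf consistent with the explanation values: sqrt(freq). */
    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    /** Assumed length norm: 1/sqrt(number of terms in the field). */
    static double fieldNorm(int fieldLength) {
        return 1.0 / Math.sqrt(fieldLength);
    }

    /** One term's contribution: boost * idf * tf * fieldNorm. */
    static double termScore(double boost, double freq, long docFreq,
                            long docCount, int fieldLength) {
        return boost * idf(docFreq, docCount) * tf(freq) * fieldNorm(fieldLength);
    }

    public static void main(String[] args) {
        // "spark spark": one exact term, freq 2, boost 1.0
        double exactDoc = termScore(1.0, 2.0, 1, 2, 2);
        // "spar spars": two fuzzy terms, freq 1 each, boosts 0.75 and 0.8
        double fuzzyDoc = termScore(0.75, 1.0, 1, 2, 2)
                        + termScore(0.8, 1.0, 1, 2, 2);
        System.out.println("spark spark: " + exactDoc);
        System.out.println("spar spars:  " + fuzzyDoc);
    }
}
```

For the exact document, tf and fieldNorm cancel out (sqrt(2) * 1/sqrt(2) = 1), so its score is just the idf. The fuzzy document collects 0.75 + 0.8 = 1.55 times idf * fieldNorm, which is how it ends up slightly ahead.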
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

