Fuzzy Query Similarity

Mike Drob Fri, 08 Jul 2022 13:14:36 -0700

Hi folks,

I'm working with some fuzzy queries and trying to understand what the
expected behaviour of the searcher is. I'm not sure if this is a
similarity bug or incorrect usage on my end.


The problem is that when I do a fuzzy search for the term "spark~", instead
of ranking documents containing "spark" first, the searcher ranks documents
that contain several other near-match terms, like "spar" and "spars", above
them. I see the same thing with both ClassicSimilarity and BM25Similarity.

The example here is from a tiny two-document index that I built to isolate
and reproduce the issue, but I see comparable behaviour, with more varied
scoring, on a much larger corpus. The two documents are:

addDoc("spark spark", writer); // exact match

addDoc("spar spars", writer); // multiple fuzzy terms

The terms with a non-zero edit distance get a slight down-boost, but it is
not enough: the sum of their scores still exceeds the score of the desired
document, even with its TF boost.
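
For reference, the 0.75 and 0.8 boosts that show up in the explanation
further down are consistent with a per-term boost of
1 - editDistance / min(queryTermLength, matchedTermLength). The sketch below
is my reconstruction from those numbers, not a copy of the FuzzyQuery code;
the class name FuzzyBoostSketch and its methods are hypothetical, and it uses
String.length() where Lucene may count code points.

```java
// Sketch only: the boost formula is an assumption inferred from the explain
// output (0.75 for "spar", 0.8 for "spars"), not quoted from Lucene source.
final class FuzzyBoostSketch {

    // Plain Levenshtein distance (insert, delete, substitute), two-row DP.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed boost: 1.0 for an exact match, otherwise
    // 1 - distance / (length of the shorter of the two terms).
    static float boost(String queryTerm, String matchedTerm) {
        int distance = editDistance(queryTerm, matchedTerm);
        if (distance == 0) {
            return 1.0f;
        }
        int minLength = Math.min(queryTerm.length(), matchedTerm.length());
        return 1.0f - (float) distance / (float) minLength;
    }

    public static void main(String[] args) {
        for (String term : new String[] {"spark", "spar", "spars"}) {
            System.out.println(term + " -> boost " + boost("spark", term));
        }
    }
}
```

If that reading is right, a term one edit away never drops below a boost of
0.75 for a five-letter query, so two such terms in one document contribute
about 1.55 term weights before TF and norms are applied.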

A full reproducible unit test is at
https://github.com/apache/lucene/commit/dbf8e788cd2c2a5e1852b8cee86cb21a792dc546

What is the recommended approach to get the document with the exact term
match ranked first again? One idea I had was to tweak the internal boost
applied by FuzzyQuery, but I don't see an option for that. Or is this
something that needs to be fixed at the Lucene level rather than the
application level?

Thanks,
Mike



More detail:


The first document with the field "spark spark" has a score explanation:

1.4054651 = sum of:
  1.4054651 = weight(field:spark in 0) [ClassicSimilarity], result of:
    1.4054651 = score(freq=2.0), product of:
      1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
        1 = docFreq, number of documents containing term
        2 = docCount, total number of documents with field
      1.4142135 = tf(freq=2.0), with freq of:
        2.0 = freq, occurrences of term within document
      0.70710677 = fieldNorm

And the second document, with the field "spar spars", comes in slightly
higher at:

1.5404116 = sum of:
  0.74536043 = weight(field:spar in 1) [ClassicSimilarity], result of:
    0.74536043 = score(freq=1.0), product of:
      0.75 = boost
      1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
        1 = docFreq, number of documents containing term
        2 = docCount, total number of documents with field
      1.0 = tf(freq=1.0), with freq of:
        1.0 = freq, occurrences of term within document
      0.70710677 = fieldNorm
  0.79505116 = weight(field:spars in 1) [ClassicSimilarity], result of:
    0.79505116 = score(freq=1.0), product of:
      0.8 = boost
      1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
        1 = docFreq, number of documents containing term
        2 = docCount, total number of documents with field
      1.0 = tf(freq=1.0), with freq of:
        1.0 = freq, occurrences of term within document
      0.70710677 = fieldNorm
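
To double-check the arithmetic in the two explanations, here is a small
stand-alone sketch that recomputes both totals from the printed factors. It
does not call Lucene at all; the class ClassicScoreSketch and its method
names are hypothetical, the idf and tf formulas are the ones the explain
output itself reports (tf being the square root of freq), and the boosts and
fieldNorm are simply copied from the output.

```java
// Sketch only: plain arithmetic on the factors printed in the explain
// output. Nothing here is a Lucene API.
final class ClassicScoreSketch {

    // idf as printed: log((docCount+1)/(docFreq+1)) + 1
    static double idf(long docFreq, long docCount) {
        return Math.log((docCount + 1) / (double) (docFreq + 1)) + 1.0;
    }

    // tf as printed: sqrt(freq), e.g. tf(2.0) = 1.4142135
    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    // One term's contribution: boost * idf * tf * fieldNorm
    static double termScore(double boost, double freq, long docFreq,
                            long docCount, double fieldNorm) {
        return boost * idf(docFreq, docCount) * tf(freq) * fieldNorm;
    }

    public static void main(String[] args) {
        double fieldNorm = 0.70710677; // copied from the explain output

        // Document 0: "spark spark", one exact term with freq=2, boost 1.0
        double exact = termScore(1.0, 2.0, 1, 2, fieldNorm);

        // Document 1: "spar spars", two fuzzy terms with freq=1 each
        double fuzzy = termScore(0.75, 1.0, 1, 2, fieldNorm)
                     + termScore(0.8, 1.0, 1, 2, fieldNorm);

        System.out.println("exact doc score: " + exact);
        System.out.println("fuzzy doc score: " + fuzzy);
        System.out.println("fuzzy outranks exact: " + (fuzzy > exact));
    }
}
```

With these factors, tf(2) * fieldNorm is almost exactly 1 for the exact
document, so its score is effectively one idf, while the fuzzy document
collects 0.75 + 0.8 = 1.55 idf-sized contributions scaled by the same
fieldNorm.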
