Hi folks, I'm working with some fuzzy queries and trying my best to understand what the expected behaviour of the searcher is. I'm not sure if this is a similarity bug or incorrect usage on my end.
The problem is that when I do a fuzzy search for the term "spark~", instead of ranking documents containing "spark" first, it ranks other documents higher when they contain several near terms such as "spar" and "spars". I see the same thing with both ClassicSimilarity and BM25.

The example here is from a much smaller (two-document) index that I built to isolate and reproduce the issue, but I see comparable behaviour, with more varied scoring, on a much larger corpus. The two documents are:

    addDoc("spark spark", writer); // exact match
    addDoc("spar spars", writer);  // multiple fuzzy terms

The terms with a non-zero edit distance get a slight down-boost, but it's not enough: their summed scores still exceed the score of the desired document, even with its TF boost.

A full reproducible unit test is at https://github.com/apache/lucene/commit/dbf8e788cd2c2a5e1852b8cee86cb21a792dc546

What is the recommended approach to get the document with the exact term match ranked first again? One idea I had was to tweak the internal boost applied by FuzzyQuery, but I don't see an option for that. Or is this something that needs to be fixed at the Lucene level rather than at the application level?
Thanks,
Mike

More detail:

The first document, with the field "spark spark", has this score explanation:

1.4054651 = sum of:
  1.4054651 = weight(field:spark in 0) [ClassicSimilarity], result of:
    1.4054651 = score(freq=2.0), product of:
      1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
        1 = docFreq, number of documents containing term
        2 = docCount, total number of documents with field
      1.4142135 = tf(freq=2.0), with freq of:
        2.0 = freq, occurrences of term within document
      0.70710677 = fieldNorm

And the document with the field "spar spars" comes in ever so slightly higher, at:

1.5404116 = sum of:
  0.74536043 = weight(field:spar in 1) [ClassicSimilarity], result of:
    0.74536043 = score(freq=1.0), product of:
      0.75 = boost
      1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
        1 = docFreq, number of documents containing term
        2 = docCount, total number of documents with field
      1.0 = tf(freq=1.0), with freq of:
        1.0 = freq, occurrences of term within document
      0.70710677 = fieldNorm
  0.79505116 = weight(field:spars in 1) [ClassicSimilarity], result of:
    0.79505116 = score(freq=1.0), product of:
      0.8 = boost
      1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
        1 = docFreq, number of documents containing term
        2 = docCount, total number of documents with field
      1.0 = tf(freq=1.0), with freq of:
        1.0 = freq, occurrences of term within document
      0.70710677 = fieldNorm
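
As a sanity check on the two explanations, here is a minimal plain-Java sketch (no Lucene dependency) that recomputes both scores from the factors printed in the explain output. The class and method names are hypothetical. Two things in it are assumptions rather than anything the explain output states: that fieldNorm is 1/sqrt(number of terms in the field), and that the fuzzy boost is 1 - editDistance / min(queryLength, termLength). Both happen to be consistent with the 0.70710677, 0.75 and 0.8 values shown, but I have not confirmed them against the Lucene source.

```java
// Hypothetical helper class; it only redoes the arithmetic from the explain output.
public class FuzzyScoreSketch {

    // idf exactly as printed: log((docCount+1)/(docFreq+1)) + 1
    static double idf(int docFreq, int docCount) {
        return Math.log((docCount + 1.0) / (docFreq + 1.0)) + 1.0;
    }

    // tf(freq=2.0) is printed as 1.4142135, i.e. sqrt(freq)
    static double tf(double freq) {
        return Math.sqrt(freq);
    }

    // ASSUMPTION: fieldNorm = 1/sqrt(terms in field); gives 0.7071... for two terms
    static double fieldNorm(int termsInField) {
        return 1.0 / Math.sqrt(termsInField);
    }

    // ASSUMPTION: boost = 1 - editDistance / min(queryLength, termLength)
    static double fuzzyBoost(int editDistance, int queryLength, int termLength) {
        return 1.0 - (double) editDistance / Math.min(queryLength, termLength);
    }

    // One "weight(field:term in doc)" line: boost * idf * tf * fieldNorm
    static double termScore(double boost, double freq, int docFreq, int docCount, int termsInField) {
        return boost * idf(docFreq, docCount) * tf(freq) * fieldNorm(termsInField);
    }

    // Document "spark spark": one exact term, freq 2, no down-boost
    static double exactDocScore() {
        return termScore(1.0, 2.0, 1, 2, 2);
    }

    // Document "spar spars": two fuzzy terms, freq 1 each, summed
    static double fuzzyDocScore() {
        double spar = termScore(fuzzyBoost(1, 5, 4), 1.0, 1, 2, 2);
        double spars = termScore(fuzzyBoost(1, 5, 5), 1.0, 1, 2, 2);
        return spar + spars;
    }

    public static void main(String[] args) {
        System.out.println("spark spark: " + exactDocScore());
        System.out.println("spar spars:  " + fuzzyDocScore());
    }
}
```

The point this illustrates is that the exact document's tf (sqrt(2)) is cancelled by its fieldNorm (1/sqrt(2)), leaving it with a single idf, while the fuzzy document collects 0.75 + 0.8 = 1.55 boosted copies of idf times fieldNorm, which is enough to come out ahead.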