I'm no expert here, but I got curious and looked at FuzzyQuery/MultiTermQuery, and I don't see any way to "boost" exact matches, or more generally to fold the edit distance into the per-term score, although that does seem like something people would expect. So maybe FuzzyQuery should do that somehow? Without changing it, though, you could use a query that does it explicitly: if you get a term "foo", you could maybe search for "foo OR foo~"?
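To see why the extra exact clause might help, here is a rough sketch in plain Java (no Lucene dependency) that replays the ClassicSimilarity arithmetic from the explain output quoted below. The class and method names are made up for illustration, the numbers are copied from that output, and it assumes the scores of the two optional clauses are simply added together, as I'd expect from a BooleanQuery with two SHOULD clauses (a TermQuery plus the FuzzyQuery). I have not run this against a real index.

```java
// Rough model of the ClassicSimilarity arithmetic in the quoted explain output.
// Hypothetical names throughout; the constants are copied from that output.
class FuzzyScoreSketch {
    static final double IDF = 1.4054651;         // idf reported for every term
    static final double FIELD_NORM = 0.70710677; // fieldNorm reported for both documents

    // One term's contribution: boost * idf * tf * fieldNorm, with tf = sqrt(freq).
    static double termScore(double boost, int freq) {
        return boost * IDF * Math.sqrt(freq) * FIELD_NORM;
    }

    // "spark spark" under spark~ alone: one exact term, freq 2, no down-boost.
    static double exactDocFuzzyOnly() {
        return termScore(1.0, 2);
    }

    // "spar spars" under spark~ alone: two near terms, boosts 0.75 and 0.8.
    static double nearDocFuzzyOnly() {
        return termScore(0.75, 1) + termScore(0.8, 1);
    }

    // "spark OR spark~": ASSUMPTION that the exact clause's score is added
    // on top of the fuzzy clause's score for documents containing "spark".
    static double exactDocWithExactClause() {
        return exactDocFuzzyOnly() + termScore(1.0, 2);
    }

    // "spar spars" does not contain "spark", so the exact clause adds nothing.
    static double nearDocWithExactClause() {
        return nearDocFuzzyOnly();
    }

    public static void main(String[] args) {
        System.out.printf("spark~          exact=%.4f near=%.4f%n",
                exactDocFuzzyOnly(), nearDocFuzzyOnly());
        System.out.printf("spark OR spark~ exact=%.4f near=%.4f%n",
                exactDocWithExactClause(), nearDocWithExactClause());
    }
}
```

If that assumption holds, the exact-match document should come out ahead once the plain term clause is added, and wrapping that clause in a BoostQuery would give you a knob to widen the gap on a larger corpus.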

On Fri, Jul 8, 2022 at 4:14 PM Mike Drob <md...@mdrob.com> wrote:
>
> Hi folks,
>
> I'm working with some fuzzy queries and trying my best to understand what
> is the expected behaviour of the searcher. I'm not sure if this is a
> similarity bug or an incorrect usage on my end.
>
> The problem is when I do a fuzzy search for a term "spark~" then instead of
> matching documents with spark first, it will match other documents that
> have multiple other near terms like "spar" and "spars". I see this same
> thing with both ClassicSimilarity and BM25.
>
> This is from a much smaller (two document) index when I was trying to
> isolate and reproduce the issue, but I see comparable behaviour with more
> varied scoring on a much larger corpus. The two documents are:
>
> addDoc("spark spark", writer); // exact match
>
> addDoc("spar spars", writer); // multiple fuzzy terms
>
> The non-zero edit distance terms get a slight down-boost, but it's not
> enough to overcome their sum exceeding even the TF boost for the desired
> document.
>
> A full reproducible unit test is at
> https://github.com/apache/lucene/commit/dbf8e788cd2c2a5e1852b8cee86cb21a792dc546
>
> What is the recommended approach to get the document with exact term
> matching for me again? I don't see an option to tweak the internal boost
> provided by FuzzyQuery, that's one idea I had. Or is this a different
> change that needs to be fixed at the lucene level rather than application
> level?
>
> Thanks,
> Mike
>
>
>
> More detail:
>
>
> The first document with the field "spark spark" has a score explanation:
>
> 1.4054651 = sum of:
>   1.4054651 = weight(field:spark in 0) [ClassicSimilarity], result of:
>     1.4054651 = score(freq=2.0), product of:
>       1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
>         1 = docFreq, number of documents containing term
>         2 = docCount, total number of documents with field
>       1.4142135 = tf(freq=2.0), with freq of:
>         2.0 = freq, occurrences of term within document
>       0.70710677 = fieldNorm
>
> And a document with the field "spar spars" comes in ever so slightly higher
> at
>
> 1.5404116 = sum of:
>   0.74536043 = weight(field:spar in 1) [ClassicSimilarity], result of:
>     0.74536043 = score(freq=1.0), product of:
>       0.75 = boost
>       1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
>         1 = docFreq, number of documents containing term
>         2 = docCount, total number of documents with field
>       1.0 = tf(freq=1.0), with freq of:
>         1.0 = freq, occurrences of term within document
>       0.70710677 = fieldNorm
>   0.79505116 = weight(field:spars in 1) [ClassicSimilarity], result of:
>     0.79505116 = score(freq=1.0), product of:
>       0.8 = boost
>       1.4054651 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:
>         1 = docFreq, number of documents containing term
>         2 = docCount, total number of documents with field
>       1.0 = tf(freq=1.0), with freq of:
>         1.0 = freq, occurrences of term within document
>       0.70710677 = fieldNorm