Re: FuzzyLikeThis query and exact matches

Mark Harwood Thu, 27 Aug 2009 07:25:43 -0700

I think the boosts shown reflect the edit distance. What we can't see from 
this is that the Similarity class used at execution time applies the same IDF 
to all terms. The other factors at play are the term frequency in the doc, 
its length and any doc boost.
I don't have access to the code right now, but that is how I remember it 
working. There may be an option to turn term frequency off too.
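
To make the edit-distance point concrete, here is a minimal sketch that computes the Levenshtein distance between the source term "desy" and the variants from the quoted query. The class name and the length-normalised similarity() helper are assumptions made for illustration; they are not taken from FuzzyLikeThisQuery and may differ from the formula Lucene actually uses.

```java
final class EditDistanceSketch {

    // Classic Levenshtein distance, computed with two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // "" -> first j chars of b costs j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // first i chars of a -> "" costs i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Assumed normalisation: 1 - distance / length of the shorter term.
    static double similarity(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) {
            return a.length() == b.length() ? 1.0 : 0.0;
        }
        return 1.0 - (double) levenshtein(a, b) / shorter;
    }

    public static void main(String[] args) {
        String source = "desy";
        String[] variants = {"dey", "des", "dest", "desk", "desi",
                             "desf", "desc", "deny", "defy", "desy"};
        for (String v : variants) {
            System.out.println(v + " distance=" + levenshtein(source, v)
                    + " similarity=" + similarity(source, v));
        }
    }
}
```

Under this assumed normalisation the variants fall into three tiers (three-letter variants, four-letter variants, the exact term), which matches the ordering of the three boost levels in the query, though not their magnitudes.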




On 27 Aug 2009, at 14:25, Berkes Adam <adam.ber...@intland.com> wrote:

After searching for the term "desy", which has a lot of variants in our index, 
the rewritten (sub)query looks like this:

(text:dey^0.22828968 text:des^0.22828968 text:dest^1.1557184 
text:desk^1.1557184 text:desi^1.1557184 text:desf^1.1557184 text:desc^1.1557184 
text:deny^1.1557184 text:defy^1.1557184 text:desy^8.218443)

but what I would like to achieve is to have all exact matches on top (or as 
high as possible), even if the rankings "validly" send them to the end of the 
matches, while letting the variants follow them according to their relevancy.

Maybe I understand it wrongly, but the edit distance is not a factor in that 
query type: the index is searched for terms with an edit distance within a 
certain limit, IDF is eliminated (with the factors above), and then a 
coordinationless boolean query is created. I might play around with 
(post-modifying) the scoring for the exact-match subterm, but I'm not sure 
that is a working solution.
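
One possible reading of the post-modification idea is a two-tier re-sort of the hits after the query has run: exact matches first, then variants, each tier ordered by its original score. The sketch below assumes a hypothetical Hit holder (not a Lucene class) and leaves open how the exactMatch flag would be determined for a given document.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ExactFirstRanking {

    // Hypothetical stand-in for a search hit; not part of Lucene.
    static final class Hit {
        final String docId;
        final boolean exactMatch; // true if the doc contains the source term itself
        final double score;       // relevance score produced by the fuzzy query

        Hit(String docId, boolean exactMatch, double score) {
            this.docId = docId;
            this.exactMatch = exactMatch;
            this.score = score;
        }
    }

    // Returns a new list: exact matches first, then variants,
    // each group ordered by descending score.
    static List<Hit> exactFirst(List<Hit> hits) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator
                // Boolean.FALSE sorts before Boolean.TRUE, so exact hits lead.
                .comparing((Hit h) -> !h.exactMatch)
                .thenComparing(
                        Comparator.comparingDouble((Hit h) -> h.score).reversed()));
        return sorted;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("d1", false, 2.1));
        hits.add(new Hit("d2", true, 0.4));
        hits.add(new Hit("d3", false, 0.9));
        hits.add(new Hit("d4", true, 1.3));
        for (Hit h : exactFirst(hits)) {
            System.out.println(h.docId + " exact=" + h.exactMatch + " score=" + h.score);
        }
    }
}
```

This keeps the relevance order inside each tier and does not touch the scores themselves, so it only works where the full hit list can be re-sorted after the search.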

Best regards,
Adam

Despite making IDF a constant, the edit distance should remain a factor in the 
rankings, so I would have thought this would give you what you need.

Can you supply a more detailed example? Either print the rewritten query or use 
the explain function.

Cheers
Mark

On 27 Aug 2009, at 13:22, Berkes Adam wrote:

Hi,

In our Java project we use a (slightly modified) version of the FuzzyLikeThis 
query, which is described as follows:

"For each source term the fuzzy variants are held in a BooleanQuery with no 
coord factor (because we are not looking for matches on multiple variants in 
any one doc). Additionally, a specialized TermQuery is used for variants and 
does not use that variant term's IDF because this would favour rarer terms eg 
misspellings. Instead, all variants use the same IDF ranking (the one for the 
source query term) and this is factored into the variant's boost. If the source 
query term does not exist in the index the average IDF of the variants is used."
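
The reason a per-variant IDF would favour misspellings can be sketched with the classic formula 1 + ln(numDocs / (docFreq + 1)), which is assumed here to be the IDF in use. The class name, the method names and the document counts are hypothetical; sharedIdf() only mirrors the fallback rule described in the quoted text.

```java
final class SharedIdfSketch {

    // Classic inverse document frequency: rarer terms get larger values.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // One IDF for all variants: the source term's IDF if the source term
    // occurs in the index, otherwise the average IDF of the variants.
    static double sharedIdf(int sourceDocFreq, int[] variantDocFreqs, int numDocs) {
        if (sourceDocFreq > 0) {
            return idf(sourceDocFreq, numDocs);
        }
        if (variantDocFreqs.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (int df : variantDocFreqs) {
            sum += idf(df, numDocs);
        }
        return sum / variantDocFreqs.length;
    }

    public static void main(String[] args) {
        int numDocs = 10_000;        // hypothetical index size
        int sourceDocFreq = 500;     // hypothetical doc count for the source term
        int[] variantDocFreqs = {2, 40}; // hypothetical doc counts for two variants

        // With per-term IDF, the rare variant outweighs the common source term.
        System.out.println("idf(source)       = " + idf(sourceDocFreq, numDocs));
        System.out.println("idf(rare variant) = " + idf(variantDocFreqs[0], numDocs));

        // With a shared IDF, every variant gets the same value.
        System.out.println("shared idf        = "
                + sharedIdf(sourceDocFreq, variantDocFreqs, numDocs));
    }
}
```

With the IDF held constant across variants, differences between variant clauses come from the boost and the remaining per-document factors rather than from how rare each variant happens to be.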

In most cases it performs well, but for a short query term with (as usual) a 
large number of variants, the exact matches stay scattered among the others, 
which is not very useful. The results should be "sorted" so that exact matches 
come first (or are forcibly made more relevant), followed by the variant 
matches according to relevancy.
Is there any simple solution, or an already implemented contrib query class, 
for this problem?

Best regards,
Adam Berkes,
Intland Software

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org


