Re: fuzziness & score computation

Zachary Tong Thu, 20 Mar 2014 07:03:50 -0700

You are correct in your analysis of the fuzzy scoring.  Fuzzy variants are 
scored (relatively) the same as the exact match, because they are treated 
the same when executed internally.

If you want to score exact matches higher, I would use a boolean 
combination of an exact match and a fuzzy match.  Semi-pseudo-query here:

{
    "query": {
        "bool": {
            "should": [
                {
                    "match" : {
                        "my_field" : {
                            "query" : "car renting london",
                            "operator" : "and",
                            "boost" : 2
                        }
                    }
                },
                {
                    "fuzzy_like_this": {}
                }
            ]
        }
    }
}

Basically, the match query is set to AND operator (so all terms are 
required) and it is given a boost of 2.  That means that exact matches will 
be boosted preferentially over the fuzzy matches, which will have the 
default boost of 1.
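
If it helps, here is a minimal Python sketch that builds the same request 
body as a dict. The helper name exact_over_fuzzy and the field name 
"my_field" are just illustrative, and the fuzzy clause is assumed to be the 
fuzzy_like_this_field query from your original message:

```python
import json


def exact_over_fuzzy(field, text, fuzzy_clause, exact_boost=2.0):
    """Build a bool query whose all-terms match clause is boosted above a fuzzy clause."""
    exact_clause = {
        "match": {
            field: {
                "query": text,
                "operator": "and",     # every term must be present
                "boost": exact_boost,  # the fuzzy clause keeps the default boost of 1
            }
        }
    }
    return {"query": {"bool": {"should": [exact_clause, fuzzy_clause]}}}


# Assumption: the fuzzy clause mirrors the fuzzy_like_this_field query
# quoted further down in this thread.
fuzzy_clause = {
    "fuzzy_like_this_field": {
        "my_field": {"like_text": "car renting london", "fuzziness": 0.5}
    }
}

body = exact_over_fuzzy("my_field", "car renting london", fuzzy_clause)
print(json.dumps(body, indent=2))
```

A document only needs to satisfy one of the two "should" clauses to be 
returned, but one that satisfies both collects the boosted score on top of 
the fuzzy one.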

> Also I get results with more terms getting the same score, like "cheap car 
> renting London", "offers car renting London". 
>

You are seeing results like this because you are using the 
fuzzy_like_this query, which is a combination of more_like_this and fuzzy. 
MLT takes all the individual terms in your query, builds one big boolean 
query out of them and searches the index with it.  Docs just need to contain 
the terms, in no particular order.  Fuzzy Like This works the same way, 
except that terms are allowed to match fuzzily.  With MLT and FLT, you're 
bound to find "off-target" results because these queries are sorta like 
shotguns: they look for a wide spread of terms.
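
To make the "terms, in no particular order" point concrete, here is a toy 
Python sketch. It is not Lucene's actual scoring, just a bag-of-terms check 
I made up for illustration (matched_query_terms is a hypothetical helper), 
and it skips the fuzzy part entirely:

```python
def matched_query_terms(query, doc):
    """Count the query terms found in doc, ignoring order and extra doc terms."""
    doc_terms = set(doc.lower().split())
    return sum(1 for term in query.lower().split() if term in doc_terms)


query = "car renting london"
docs = [
    "car renting London",
    "cheap car renting London",
    "London renting car",
    "car hire Paris",
]

# Extra terms and word order make no difference to the count.
counts = {doc: matched_query_terms(query, doc) for doc in docs}
for doc, count in counts.items():
    print(count, doc)
```

The first three docs all match every query term, so a query that only asks 
"are the terms there?" has nothing to tell them apart.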

> *2) fuzzy query*
>
> That doesn't do what I want since it does not analyze the query (I 
> think) and so it will treat the query in an unexpected way for my purposes 
> of "free text" search
>

As an alternative, you can use the Match query and set the "fuzziness" 
parameter.  You'll get fuzzy matching like the fuzzy query, plus the 
analysis of the Match query.
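
A rough sketch of such a request body, built in Python. The helper name 
fuzzy_match_query, the field name and the fuzziness value of 1 are 
assumptions for illustration, so tune them for your data:

```python
import json


def fuzzy_match_query(field, text, fuzziness=1):
    """Build a match query that analyzes the text and allows fuzzy term matches."""
    return {
        "query": {
            "match": {
                field: {
                    "query": text,           # analyzed like any other match query
                    "fuzziness": fuzziness,  # assumed value: allowed edit distance
                }
            }
        }
    }


fuzzy_body = fuzzy_match_query("my_field", "car ranting Londen")
print(json.dumps(fuzzy_body, indent=2))
```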

As a general comment, dealing with misspellings and fuzziness is always a 
trade-off between precision (the fraction of returned results that are 
correct) and recall (the fraction of correct results that are returned).  As 
you increase fuzziness, you increase recall: more of your correct results 
are in your search hits.  But you lose precision, and those correct results 
may end up at position 200.  You'll always be fighting the precision/recall 
battle.
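
A small worked example in Python, with made-up doc ids, in case the 
trade-off is easier to see with numbers:

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a set of returned ids against the relevant ids."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: four docs are actually relevant to the search.
relevant = {"d1", "d2", "d3", "d4"}

# A strict search returns few docs, all of them correct.
strict = precision_recall({"d1", "d2"}, relevant)

# A fuzzier search finds every relevant doc, plus four irrelevant ones.
fuzzy = precision_recall({"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}, relevant)

print("strict:", strict)  # (1.0, 0.5)
print("fuzzy: ", fuzzy)   # (0.5, 1.0)
```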

I would instead search for exact matches, and prompt the user to fix 
misspellings with suggesters.  This makes your search and relevancy *vastly* 
simpler, and tends to provide a better user experience because they can 
just click the as-you-type suggestion or the "Did you mean?" link.  Win-win 
for everyone.
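
As a rough sketch, a term suggester request body could be built like this 
in Python. The suggestion name "did_you_mean", the helper name and the 
field name are all assumptions, so check the suggester docs for your 
version before relying on the exact shape:

```python
import json


def did_you_mean_request(field, text, name="did_you_mean"):
    """Build a request body asking the term suggester for per-term corrections."""
    return {
        "suggest": {
            name: {                       # arbitrary label, echoed in the response
                "text": text,             # the user's input, possibly misspelled
                "term": {"field": field}  # suggest terms taken from this field
            }
        }
    }


suggest_body = did_you_mean_request("my_field", "car ranting londen")
print(json.dumps(suggest_body, indent=2))
```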

-Zach

On Thursday, March 20, 2014 4:46:49 AM UTC-5, Adrian Luna wrote:
>
> Hi, 
>
> Sorry, I am relatively new to elasticsearch, so please don't be too 
> harsh.
>
> I feel like I can't understand the behaviour of any of the 
> fuzzy queries in ES.
>
> *1) match with fuzziness enabled*
>
> {
>   "query": {
>     "fuzzy_like_this_field": {
>       "field_name": {
>         "like_text": "car renting London",
>         "fuzziness": "0.5"
>       }
>     }
>   }
> }
>
> As I see it from my tests, this kind of query will give the same score to 
> documents with field_name="car renting London" and "car ranting London" or 
> "car renting Londen" for example. That means it will not penalize 
> misspellings in the score. I can imagine that first the possible 
> variants are computed and then the score is just computed with a 
> "representative score" which is the same for every variant that matches 
> the requirements. 
>
> Am I right? If I am, is there any way to boost the exact match over the 
> fuzzy match?
>
> Also I get results with more terms getting the same score, like "cheap car 
> renting London", "offers car renting London". That's something I cannot get 
> to understand. When I use the explain API, it seems that the resulting 
> score is a sum of the different matches with its internal weightings, 
> tf-idf, etc. but it seems to not be considering the terms outside the 
> query, while I would expect the exact match to score at least slightly 
> higher. 
>
> Am I missing something here? Is it just the expected result and I am just 
> being too demanding?
>
> *2) fuzzy query*
>
> That doesn't do what I want since it does not analyze the query (I 
> think) and so it will treat the query in an unexpected way for my purposes 
> of "free text" search
>
> *3) fuzzy_like_this or fuzzy_like_this_field*
>
> This other search gets rid of the first problem in point 1, since as I 
> read from the documentation, it seems to use some tricks to avoid favouring 
> rare terms (misspellings will be here) over more frequent terms, etc. but 
> it's still giving the same score to the exact match and to matches where 
> other terms are present. 
>
> Is there any way to get the expected behaviour? By this I mean being able 
> to execute almost free-text queries with some fuzziness to get rid of 
> possible misspellings in the query terms, but with an (at least for me) 
> more exhaustive score computation. If not, is there any other more complex 
> query or a function_score to get such behaviour?
>
> Thank you very much, any comment will be much appreciated. Also, if 
> I am not right in my suppositions, any clarification will be very welcome.
>
>
