[jira] Commented: (LUCENE-2557) FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches

Robert Muir (JIRA) Mon, 26 Jul 2010 07:12:20 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892315#action_12892315 ]


Robert Muir commented on LUCENE-2557:
-------------------------------------

bq. If this policy is a performance concern then we could reduce the number of 
terms as you suggest or just ignore IDF entirely in this case but I'm not sure 
the averaging costs represent any kind of real performance concern given the IO 
costs of accessing TermDocs.

I suggested reducing the number of terms (for the averaging), but also the 
number of default expansions.
I think in general expanding to 1024 is obscene...

But also, if we reduce this number, FuzzyTermsEnum itself gets faster, too.
FuzzyTermsEnum is aware (via an attribute) when the priority queue is filled, 
and it knows the minimal score to be competitive.
When a certain edit distance is no longer competitive, it optimizes itself by 
swapping in a more efficient Automaton.
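This is safe because the priority queue's comparator orders by score first, 
then by the term's compareTo (lexicographic order), so a term seen later 
cannot displace an equal-scoring term already in the queue.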

Simple example: let's say you ask for a maximum of 1 expansion, with a fuzzy 
query of max edit distance 1.
As soon as the enum finds a term of ed=1, further terms of ed=1 are no longer 
competitive, so it will then try to seek
to an exact match (swapping in an ed=0 automaton) and exit, instead of wasting 
time seeking to useless terms.

It's a bit more complicated, since the boost value really depends not just on 
edit distance but also on string length, but I think this illustration works.
It's one reason why I think we should try to 'improve the defaults'.
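
To make the simple example above concrete, here is a toy sketch of the idea. 
It is not Lucene code: FuzzyExpansionSketch, Candidate, and topTerms are 
hypothetical names, the "term dictionary" is just a sorted array, and 
candidates are ranked by raw Levenshtein distance only (ignoring the string 
length component of the real boost). Where the real enum would swap in a 
smaller automaton, the sketch simply lowers the edit distance it still accepts.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class FuzzyExpansionSketch {

    /** Hypothetical holder for one candidate expansion. */
    static final class Candidate {
        final String term;
        final int edits;
        Candidate(String term, int edits) { this.term = term; this.edits = edits; }
    }

    /** Classic Levenshtein distance (insert, delete, substitute; no transpositions). */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    /**
     * Walks sortedTerms (assumed to be in lexicographic order) and keeps the
     * best maxExpansions candidates, best first.
     */
    static List<String> topTerms(String query, int maxEdits, int maxExpansions, String... sortedTerms) {
        List<String> best = new ArrayList<>();
        if (maxExpansions <= 0) return best;

        // Head of the queue is the WORST kept candidate: more edits is worse,
        // and on a tie the lexicographically later term is worse.
        Comparator<Candidate> worstFirst = (x, y) -> x.edits != y.edits
                ? Integer.compare(y.edits, x.edits)
                : y.term.compareTo(x.term);
        PriorityQueue<Candidate> pq = new PriorityQueue<>(worstFirst);

        int competitiveEdits = maxEdits; // largest edit distance still worth collecting
        for (String term : sortedTerms) {
            if (competitiveEdits < 0) break; // nothing left can compete: exit early
            int edits = editDistance(query, term);
            if (edits > competitiveEdits) continue;
            pq.add(new Candidate(term, edits));
            if (pq.size() > maxExpansions) pq.poll(); // drop the worst
            if (pq.size() == maxExpansions) {
                // Queue is full. Terms arrive in sorted order, so a later term
                // with the same edit distance as the worst kept one loses the
                // tie-break. Only strictly smaller distances remain competitive.
                competitiveEdits = pq.peek().edits - 1;
            }
        }
        while (!pq.isEmpty()) best.add(0, pq.poll().term); // polled worst first
        return best;
    }
}
```

With the sorted terms "smiht", "smiith", "smith", "smithe", "smyth", a query 
of "smith", max edits 1 and max expansions 1, the sketch accepts "smiith" 
(ed=1), tightens to ed=0, replaces it with "smith", and stops without looking 
at "smithe" or "smyth". The smaller the expansion limit, the sooner the 
accepted edit distance shrinks.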


> FuzzyQuery - fuzzy terms and misspellings are ranked higher than exact matches
> ------------------------------------------------------------------------------
>
>                 Key: LUCENE-2557
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2557
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Query/Scoring
>    Affects Versions: 3.0.2
>            Reporter: Jingkei Ly
>         Attachments: idf-scoring-test-case.patch, LUCENE-2557.patch
>
>
> The FuzzyQuery often causes misspellings to be ranked higher than the exact 
> match, which seems to be an undesirable property generally. 
> For example, in an index of surnames, if I search using a FuzzyQuery for 
> "smith", the misspellings such as "smiith", or "smiht" would appear near the 
> top of the search results ahead of documents that match "smith".

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


