Re: Performance improvements for fuzzy queries ?

Paul Taylor Thu, 08 Mar 2012 14:00:16 -0800

On 03/02/2012 15:01, Paul Taylor wrote:

Using Lucene 3.5, I created a query parser based on the dismax parser, but in order to get matches on misspellings etcetera I additionally do a fuzzy search and a wildcard search:
http://svn.musicbrainz.org/search_server/trunk/servlet/src/main/java/org/musicbrainz/search/servlet/DismaxQueryParser.java
So a search for 'echo bunneymen' searches over three fields (alias, sortname, artist) and becomes disjunction searches on these plus a phrase search:
custom(+((
alias:echo~0.5^0.71999997 | alias:echo*^0.71999997 | alias:echo^0.9
| sortname:echo~0.5^0.88000005 | sortname:echo*^0.88000005 | sortname:echo^1.1
| artist:echo~0.5^1.04 | artist:echo*^1.04 | artist:echo^1.3)~0.1
(
alias:bunneymen~0.5^0.71999997 | alias:bunneymen*^0.71999997 | alias:bunneymen^0.9
| sortname:bunneymen~0.5^0.88000005 | sortname:bunneymen*^0.88000005 | sortname:bunneymen^1.1
| artist:bunneymen~0.5^1.04 | artist:bunneymen*^1.04 | artist:bunneymen^1.3)~0.1)
(alias:"echo bunneymen"^0.2 | sortname:"echo bunneymen"^0.2 | artist:"echo bunneymen"^0.2)~0.1)
and it gives me exactly the results and scoring that I want; the trouble is that it's TOO SLOW.
I tried using a different rewrite method as recommended, new MultiTermQuery.TopTermsBoostOnlyBooleanQueryRewrite(100), but then it doesn't consider the query idf, which makes sense so that rare query terms aren't boosted. But neither does it consider the idf or field/norm of the matching document, which seems wrong because these still seem relevant. More problematically, the fuzzy query scores are so much lower than normal and phrase matches that it doesn't seem to work when fuzzy queries are mixed in with other queries. Is there a better option, or even some better documentation on the rewrite method so I can understand it better?
Alternatively, is there an analyzer I can use to analyse the fields using the fuzzy/Levenshtein logic, so I can do this at index time instead and then just use a normal term query with the same analyzer instead of a fuzzy query?
Paul

FYI, it turns out the performance problems were more to do with the fact that I hadn't changed prefixLength from zero. Although I only did fuzzy queries when the term length was at least 4 characters, I didn't realise that unless I set the prefix length to four this wouldn't prevent matching the query term to terms shorter than 4.
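
To make the prefixLength point concrete, here is a rough, self-contained sketch in plain Java (no Lucene classes). It is an illustration only, not Lucene's implementation: the class and method names are hypothetical, and the similarity formula (1 - editDistance / length of the shorter string) is a simplifying assumption. It shows that a non-zero prefix length discards any index term that does not start with the same leading characters as the query term before edit distance is computed, which both shrinks the candidate set and rules out terms shorter than the prefix.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Lucene.
class PrefixFuzzySketch {

    // Classic dynamic-programming Levenshtein edit distance.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // ASSUMPTION: similarity = 1 - distance / shorter length (simplified).
    static double similarity(String query, String candidate) {
        int shorter = Math.min(query.length(), candidate.length());
        if (shorter == 0) {
            return 0.0;
        }
        return 1.0 - (double) editDistance(query, candidate) / shorter;
    }

    // Returns the terms that share the first prefixLength characters with
    // the query and whose similarity is strictly above minSimilarity.
    static List<String> fuzzyMatches(String query, List<String> terms,
                                     double minSimilarity, int prefixLength) {
        String prefix = query.substring(0, Math.min(prefixLength, query.length()));
        List<String> matches = new ArrayList<String>();
        for (String term : terms) {
            if (!term.startsWith(prefix)) {
                continue; // cheap rejection, no edit distance needed
            }
            if (similarity(query, term) > minSimilarity) {
                matches.add(term);
            }
        }
        return matches;
    }
}
```

With the terms "echo", "ech", "eco", "echos" and "bunnymen", calling fuzzyMatches("echo", terms, 0.5, 0) in this sketch accepts the three-character terms "ech" and "eco" as well as "echo" and "echos", whereas a prefix length of 4 leaves only "echo" and "echos".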

But interestingly I just came across http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html so I'm looking forward to the 4.0 release, whenever that happens.



Paul

