On 03/02/2012 15:01, Paul Taylor wrote:
Using Lucene 3.5, I created a query parser based on the dismax parser,
but in order to get matches on misspellings etc. I additionally do a
fuzzy search and a wildcard search:
http://svn.musicbrainz.org/search_server/trunk/servlet/src/main/java/org/musicbrainz/search/servlet/DismaxQueryParser.java
So a search for 'echo bunneymen' searches over three fields
(alias, sortname, artist) and becomes disjunction searches on these plus
a phrase search:
custom(+((
alias:echo~0.5^0.71999997 | alias:echo*^0.71999997 | alias:echo^0.9
| sortname:echo~0.5^0.88000005 | sortname:echo*^0.88000005 |
sortname:echo^1.1
| artist:echo~0.5^1.04 | artist:echo*^1.04 | artist:echo^1.3)~0.1
(
alias:bunneymen~0.5^0.71999997 | alias:bunneymen*^0.71999997 |
alias:bunneymen^0.9
| sortname:bunneymen~0.5^0.88000005 | sortname:bunneymen*^0.88000005 |
sortname:bunneymen^1.1
| artist:bunneymen~0.5^1.04 | artist:bunneymen*^1.04 |
artist:bunneymen^1.3)~0.1)
(alias:"echo bunneymen"^0.2 | sortname:"echo bunneymen"^0.2 |
artist:"echo bunneymen"^0.2)~0.1)
and it gives me exactly the results and scoring that I want; the trouble
is that it's TOO SLOW.
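For reference, the per-term expansion in the dump above can be sketched in plain Java (no Lucene involved). This only rebuilds the string form of one disjunction group. The class name, method name and the 0.8 factor applied to the fuzzy and wildcard boosts are assumptions inferred from the numbers in the dump, not taken from DismaxQueryParser.java.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: rebuilds the text of one per-term disjunction group. */
class DismaxExpansionSketch {

    // Assumption: fuzzy and wildcard clauses get 0.8 x the exact-match boost.
    static final float VARIANT_FACTOR = 0.8f;

    /** Expands one term into fuzzy, wildcard and exact clauses for every field. */
    static String expandTerm(String term, Map<String, Float> fieldBoosts,
                             float minSimilarity, float tieBreaker) {
        List<String> clauses = new ArrayList<String>();
        for (Map.Entry<String, Float> e : fieldBoosts.entrySet()) {
            String field = e.getKey();
            float exact = e.getValue();
            float variant = exact * VARIANT_FACTOR;
            clauses.add(field + ":" + term + "~" + minSimilarity + "^" + variant); // fuzzy
            clauses.add(field + ":" + term + "*^" + variant);                      // wildcard
            clauses.add(field + ":" + term + "^" + exact);                         // exact
        }
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < clauses.size(); i++) {
            if (i > 0) sb.append(" | ");
            sb.append(clauses.get(i));
        }
        // The trailing ~x is the tie-breaker of the disjunction.
        return sb.append(")~").append(tieBreaker).toString();
    }

    public static void main(String[] args) {
        Map<String, Float> boosts = new LinkedHashMap<String, Float>();
        boosts.put("alias", 0.9f);
        boosts.put("sortname", 1.1f);
        boosts.put("artist", 1.3f);
        System.out.println(expandTerm("echo", boosts, 0.5f, 0.1f));
        System.out.println(expandTerm("bunneymen", boosts, 0.5f, 0.1f));
    }
}
```

Each query term therefore turns into nine clauses (three per field), and six of those nine are multi-term queries that have to be rewritten against the term dictionary, which is where the cost comes from.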
I tried using a different rewrite method, as recommended: new
MultiTermQuery.TopTermsBoostOnlyBooleanQueryRewrite(100). But then it
doesn't consider the query idf, which makes sense so that rare query
terms aren't boosted, but neither does it consider the idf or
field norm of the matching document. This seems wrong, because these
still seem relevant. More problematically, the fuzzy query scores
are so much lower than those of normal
and phrase matches that it doesn't seem to work when fuzzy
queries are mixed in with other queries. Is there a better option, or even
some better documentation on the rewrite methods, so I can understand this
better?
Alternatively, is there an analyzer I can use to analyse the fields
using fuzzy/Levenshtein logic, so I can do this at index time
instead and then just use a normal term query with the same analyzer instead
of a fuzzy query?
Paul
FYI, it turns out the performance problems were more to do with the fact
that I hadn't changed prefixLength from zero. Although I only did fuzzy
queries when the term length was at least 4 characters, I didn't realise
that unless I set the prefix length to four this wouldn't prevent
matching the query term to terms shorter than 4.
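A rough sketch of the prefix-length effect described above, in plain Java rather than Lucene. It assumes the classic 3.x similarity of 1 - editDistance / (length of the shorter string) and a strict "greater than minimum similarity" test; the class and method names are made up for illustration. In Lucene 3.x itself the prefix length is, as far as I know, the third argument of the FuzzyQuery(Term, float, int) constructor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of fuzzy matching with a required common prefix. */
class FuzzyPrefixSketch {

    /** Standard Levenshtein edit distance, computed with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    /** Assumed similarity: 1 - distance / length of the shorter string. */
    static float similarity(String query, String candidate) {
        int shorter = Math.min(query.length(), candidate.length());
        if (shorter == 0) return 0f;
        return 1f - (float) editDistance(query, candidate) / shorter;
    }

    /** Keeps terms that share the first prefixLength chars and are similar enough. */
    static List<String> fuzzyMatches(String query, List<String> terms,
                                     float minSimilarity, int prefixLength) {
        String prefix = query.substring(0, Math.min(prefixLength, query.length()));
        List<String> hits = new ArrayList<String>();
        for (String t : terms) {
            if (!t.startsWith(prefix)) continue; // cheap filter, no distance computed
            if (similarity(query, t) > minSimilarity) hits.add(t);
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("eco", "ech", "echo", "echos", "ec");
        System.out.println(fuzzyMatches("echo", terms, 0.5f, 0));
        System.out.println(fuzzyMatches("echo", terms, 0.5f, 4));
    }
}
```

With a prefix length of 0 every term in the list is a candidate, and the three-letter terms "eco" and "ech" pass the 0.5 threshold for "echo". With a prefix length of 4 only terms starting with "echo" are considered, so nothing shorter than four characters can match and far fewer edit distances need to be computed.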
Interestingly, I also just came across
http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html
so I'm looking forward to the 4.0 release, whenever that happens.
Paul