search FuzzyQuery.java FuzzyTermEnum.java

David Spencer Fri, 17 Sep 2004 13:39:48 -0700

Doug Cutting wrote:

Christoph Goller wrote:
Doug Cutting wrote:
It might also be good if one could set the non-fuzzy prefix length used by the QueryParser. As it stands, fuzzy queries with large indexes that use QueryParser are so slow they're unusable. But a default prefix of just a couple of characters would make a huge performance improvement.
I will think about extending QueryParser as you proposed (should not be too difficult, we only have to find a reasonable syntax)
I don't think we need to extend the query syntax, just add a method:
   QueryParser.setFuzzyPrefixLength(int)
and have the default value be 2. This is not back-compatible, but I think it makes fuzzy queries *much* more usable, and is hence worth the incompatibility.

Another idea might be to, rather than (or in addition to) limiting the number of expanded terms by similarity, to limit them by number. So one could keep, e.g., just the top-scoring 100 terms whose score is greater than 0.5, or somesuch. This way FuzzyQuery would never trigger BooleanQuery.TooManyClauses. What do you think?

Replying to the above (TooManyClauses) - if not being done, perhaps it could not throw this and instead just catch it and happily return the current, incomplete, query as a best effort kind of thing.

Also sounds reasonable. Of course it does not solve the efficiency problem of rewriting a FuzzyQuery. Do you think the expensive part is going through all terms of a field or is it the Levenstein-computation, or both?
Both. Simply enumerating all of the terms is pretty fast for small collections (<1M documents), but adding a Levenstein computation for all

I guess if you Lev distance is a parameter then you could check the diff of the string lengths first, and if they differ by more than that param you skip the Lev calculation.

words makes fuzzy search prohibitively slow. But the Levenstein computation will not be too expensive if it doesn't need to be computed for so many terms. On average, a two-character prefix should reduce the number of Levenstein computations by a factor of several hundred, no?

I hope you like my extensions to PraseQuery and PhrasePrefixQuery too :-)
Yes, very much!  I like all of your changes!  Thanks!
The QueryParser feature that would leverage this best might be to make asterisk (*) in a phrase increment the position. Thus a search for ""John * Kennedy" would match "John Fitzgerald Kennedy", in case you'd forgotten his middle name.
Doug
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/search FuzzyQuery.java FuzzyTermEnum.java

Reply via email to