Doug Cutting wrote:It might also be good if one could set the non-fuzzy prefix length used by the QueryParser. As it stands, fuzzy queries with large indexes that use QueryParser are so slow they're unusable. But a default prefix of just a couple of characters would make a huge performance improvement.
I will think about extending QueryParser as you proposed (should not be too difficult, we only have to find a reasonable syntax)
I don't think we need to extend the query syntax, just add a method:
QueryParser.setFuzzyPrefixLength(int)
and have the default value be 2. This is not back-compatible, but I think it makes fuzzy queries *much* more usable, and is hence worth the incompatibility.
Another idea might be to, rather than (or in addition to) limiting the number of expanded terms by similarity, to limit them by number. So one could keep, e.g., just the top-scoring 100 terms whose score is greater than 0.5, or somesuch. This way FuzzyQuery would never trigger BooleanQuery.TooManyClauses. What do you think?
Also sounds reasonable. Of course it does not solve the efficiency problem of rewriting a FuzzyQuery. Do you think the expensive part is going through all terms of a field or is it the Levenstein-computation, or both?
Both. Simply enumerating all of the terms is pretty fast for small collections (<1M documents), but adding a Levenstein computation for all words makes fuzzy search prohibitively slow. But the Levenstein computation will not be too expensive if it doesn't need to be computed for so many terms. On average, a two-character prefix should reduce the number of Levenstein computations by a factor of several hundred, no?
I hope you like my extensions to PraseQuery and PhrasePrefixQuery too :-)
Yes, very much! I like all of your changes! Thanks!
The QueryParser feature that would leverage this best might be to make asterisk (*) in a phrase increment the position. Thus a search for ""John * Kennedy" would match "John Fitzgerald Kennedy", in case you'd forgotten his middle name.
Doug
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]