"Erick Erickson" <[EMAIL PROTECTED]> wrote on 09/10/2006 13:09:21: > ... The kicker is that what we are indexing is > OCR data, some of which is pretty trashy. So you wind up with "interesting" > words in your index, things like rtyHrS. So the whole question of allowing > very specific queries on detailed wildcards (combined with spans) is under > discussion. It's not at all clear to me that there's any value to the end > users in the capability of, say, two character prefixes. And, it's an easy > rule that "prefix queries must specify at least 3 non-wildcard > characters"....

Erick,

I may be off course here, but, fwiw, have you considered n-gram indexing/search to get a degree of fuzziness that compensates for OCR errors? For a four-word query you would probably get ~20 tokens (bigrams?), no matter what the index size is.

You would then probably want to score higher by LA (lexical affinity: query terms appear close to each other in the document). I am not sure to what degree a span query (made of n-gram terms) would serve that, because (1) all terms in the span need to be there (well, I think :-)); and (2) you would like to increase the doc score for close-by terms only when they correspond to close-by query n-grams.

So there might not be a ready-to-use solution in Lucene for this, but perhaps this is a more robust direction to try than the wildcard approach. I mean, if users want to type a wildcard query, it is their right to do so, but as application logic this does not seem the best choice.
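
To make the n-gram idea a bit more concrete, here is a rough sketch in plain Python (not Lucene code; the function names `char_ngrams` and `overlap` are made up for illustration). It splits words into overlapping character bigrams and measures what fraction of the query's bigrams also occur in a piece of text, so a word mangled by OCR can still match partially. It ignores positions entirely, so it says nothing about the lexical affinity scoring discussed above.

```python
def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of each whitespace-separated word."""
    grams = []
    for word in text.lower().split():
        if len(word) < n:
            # Words shorter than n are kept whole so they are not lost.
            grams.append(word)
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def overlap(query, doc_text, n=2):
    """Fraction of distinct query n-grams that also occur in doc_text (0.0 to 1.0)."""
    query_grams = set(char_ngrams(query, n))
    doc_grams = set(char_ngrams(doc_text, n))
    if not query_grams:
        return 0.0
    return len(query_grams & doc_grams) / len(query_grams)


if __name__ == "__main__":
    # "indcx" stands in for an OCR misreading of "index" (hypothetical data).
    print(char_ngrams("index"))
    print(overlap("index", "indcx"))
```

A whole-term match of "index" against "indcx" would fail outright, while the bigram overlap still gives partial credit. How that credit should be weighted in a real index is left open here.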