Hello,
About 3 months ago I posted an idea about wildcard searching.

The main idea was to index every character of the input as a separate term and then search using PhraseQuery. For example, the word "12345" would be indexed as "1" "2" "3" "4" "5". To find "*23*" you can use a PhraseQuery with these two terms ("2" "3"). But this approach is limited to queries with wildcards only at the beginning or end.
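To make the idea concrete, here is a minimal sketch in plain Python rather than Lucene itself. It assumes a toy in-memory positional index; the names build_char_index and phrase_search are hypothetical and only illustrate how a phrase match over single-character terms behaves like "*23*".

```python
from collections import defaultdict

def build_char_index(words):
    """Map each character to {word_id: [positions]} (a toy positional index)."""
    index = defaultdict(lambda: defaultdict(list))
    for word_id, word in enumerate(words):
        for pos, ch in enumerate(word):
            index[ch][word_id].append(pos)
    return index

def phrase_search(index, chars):
    """Return ids of words that contain `chars` as consecutive characters."""
    if not chars:
        return set()
    hits = set()
    # Candidate start positions come from the first character's postings.
    for word_id, positions in index.get(chars[0], {}).items():
        for start in positions:
            # Every following character must sit at the next position.
            if all(start + i in index.get(ch, {}).get(word_id, ())
                   for i, ch in enumerate(chars)):
                hits.add(word_id)
                break
    return hits

words = ["12345", "54321", "22334", "923"]
index = build_char_index(words)
# Equivalent of the wildcard query "*23*".
matches = sorted(words[i] for i in phrase_search(index, "23"))
```

With these sample words, "54321" is not returned because it contains "3" and "2" only in the reverse order.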

Later I did some research and wrote an extension to PhraseQuery that allows a term's relative position to be set to a range of values (to insert gaps for "*" and "?"). This approach is good because it does not rewrite queries and never runs into OutOfMemory or TooManyClauses exceptions.
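The extension itself is not shown here, so the following is only a rough sketch of the gap-range idea in plain Python, not the actual PhraseQuery code. It assumes "?" adds exactly one skipped position and "*" allows any number of them; parse_pattern and matches are hypothetical names.

```python
import math

def parse_pattern(pattern):
    """Split a wildcard pattern into literals with allowed gap ranges.

    Returns (steps, tail): steps is a list of (char, min_gap, max_gap),
    tail is the (min_gap, max_gap) allowed after the last literal.
    """
    steps, lo, hi = [], 0, 0
    for ch in pattern:
        if ch == "?":
            lo, hi = lo + 1, hi + 1      # exactly one extra position
        elif ch == "*":
            hi = math.inf                # any number of extra positions
        else:
            steps.append((ch, lo, hi))
            lo, hi = 0, 0
    return steps, (lo, hi)

def matches(word, pattern):
    """Check `word` against `pattern` using character positions only."""
    steps, (tail_lo, tail_hi) = parse_pattern(pattern)
    # positions[ch] stands in for the term positions stored in the index.
    positions = {}
    for pos, ch in enumerate(word):
        positions.setdefault(ch, []).append(pos)
    ends = {-1}  # positions where the previous literal may have matched
    for ch, lo, hi in steps:
        ends = {p for p in positions.get(ch, ())
                if any(lo <= p - e - 1 <= hi for e in ends)}
        if not ends:
            return False
    # The gap between the last literal and the end of the word must fit too.
    return any(tail_lo <= len(word) - e - 1 <= tail_hi for e in ends)
```

The pattern is never expanded into a list of matching terms, which is why this style of matching avoids the clause explosion that query rewriting can cause.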

Regards,
Volodymyr Bychkoviak


14.03.2005 13:54

Dave Kor wrote:

Quoting Dave Kor <[EMAIL PROTECTED]>:

Quoting Erik Hatcher <[EMAIL PROTECTED]>:

Anyone tried this technique with Lucene?
Actually, the problem is that the wildcard code has to search over a large
subset of terms because the list of terms is, well, a linear structure.

If, for example, all terms in the index were arranged as a suffix tree, the sort
of wildcard search that is currently CPU intensive would no longer be CPU
intensive.

Hmm, I realized I should add a qualifier to the above statement. Searching for
matching terms would no longer be CPU intensive, especially for wildcards like
*foo* or *foo. The other wildcard search problem of having too many matching
terms to look up in the index still remains unsolved.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



