AFAIK Lucene is not able to look DNA strings up effectively. You would use DASG+Lev (see my previous post - 05/30/2003 1916CEST).

-g-

Jim Hargrave wrote:

Our application is a string similarity searcher where the query is an input string and we want to find all "fuzzy" variants of the input string in the DB. The Score is basically dice's coefficient: 2C/Q+D, where C is the number of terms (n-grams) in common, Q is the number of unique query terms and D is the number of unique document terms. Our documents will be sentences.

I know Lucene has a fuzzy search capability - but I assume this would be very slow since it must search through the entire term list to find candidates.

In order to do the calculation I will need to have 'C' - the number of terms in common between query and document. Is there an API that I can call to get this info? Any hints on what it will take to modify Lucene to handle these kinds of queries?




--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]



Reply via email to