Probably shouldn't have added that last bit. Our app isn't a DNA searcher, but DASG+Lev does look interesting. Ours is a linguistic application: we want to search for sentences that have many n-grams in common and rank them by the score below. This is similar to the TELLTALE system (do a Google search for TELLTALE + ngrams), but we are not interested in IR per se; we want to compute a score based on pure string similarity. Sentences are docs, n-grams are terms.

Jim
>>> [EMAIL PROTECTED] 06/05/03 03:55PM >>>
AFAIK Lucene is not able to look up DNA strings effectively. You would use DASG+Lev (see my previous post - 05/30/2003 1916CEST).

-g-

Jim Hargrave wrote:
>Our application is a string similarity searcher where the query is an input string
>and we want to find all "fuzzy" variants of the input string in the DB. The score is
>basically Dice's coefficient: 2C/(Q+D), where C is the number of terms (n-grams) in
>common, Q is the number of unique query terms and D is the number of unique document
>terms. Our documents will be sentences.
>
>I know Lucene has a fuzzy search capability - but I assume this would be very slow
>since it must search through the entire term list to find candidates.
>
>In order to do the calculation I will need to have 'C' - the number of terms in
>common between query and document. Is there an API that I can call to get this info?
>Any hints on what it will take to modify Lucene to handle these kinds of queries?
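
For illustration only, here is a minimal plain-Python sketch of the scoring discussed in this thread: sentences are documents, character n-grams are terms, and the score is Dice's coefficient 2C/(Q+D). It does not use Lucene and is not a Lucene API. The function names (ngrams, dice, build_index, search), the default of n=3, and the absence of any text normalization are all assumptions made for the example. The small inverted index shows one way to obtain C per sentence without comparing the query against every sentence in the collection.

```python
from collections import Counter, defaultdict


def ngrams(text, n=3):
    """Return the set of unique character n-grams in text (n=3 is an assumption)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def dice(query_terms, doc_terms):
    """Dice's coefficient 2C/(Q+D) over two sets of unique terms."""
    if not query_terms and not doc_terms:
        return 0.0
    common = len(query_terms & doc_terms)  # C: terms in common
    return 2.0 * common / (len(query_terms) + len(doc_terms))


def build_index(sentences, n=3):
    """Map each n-gram to the ids of the sentences that contain it."""
    postings = defaultdict(set)
    doc_terms = []  # doc_terms[i] is the unique n-gram set of sentence i
    for doc_id, sentence in enumerate(sentences):
        terms = ngrams(sentence, n)
        doc_terms.append(terms)
        for term in terms:
            postings[term].add(doc_id)
    return postings, doc_terms


def search(query, postings, doc_terms, n=3):
    """Rank sentences that share at least one n-gram with the query."""
    q_terms = ngrams(query, n)
    common = Counter()  # common[doc_id] accumulates C for that sentence
    for term in q_terms:
        for doc_id in postings.get(term, ()):
            common[doc_id] += 1
    scored = [
        (2.0 * c / (len(q_terms) + len(doc_terms[doc_id])), doc_id)
        for doc_id, c in common.items()
    ]
    # Highest score first; ties broken by sentence id.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))
```

Calling build_index once over the sentence collection and then search for each input string returns (score, sentence id) pairs. Sentences with no n-gram in common with the query never appear in the result, which is what keeps the lookup from touching the whole collection.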
