This would make an awesome addition to Lucene! This is similar to how Lucene's spellchecker identifies candidates, if I understand it right.
Would you be able to port it to java? Mike On Thu, Jun 18, 2009 at 7:12 AM, Varun Dhussa<va...@mapmyindia.com> wrote: > Hi, > > I wrote on this a long time ago, but haven't followed it up. I just finished > a C++ implementation of a spell check module in my software. I borrowed the > idea from Xapian. It is to use a trigram index to filter results, and then > use Edit Distance on the filtered set. Would such a solution be acceptable > to the Lucene Community? The details of my implementation are as follows: > > 1) QDBM data store hash map > 2) Trigram tokenizer on the input string > 3) Data store hash(key,value) = (trigram, keyword_id_list<kw1...kwN) > 4) Use trigram tokenizer and match with the trigram index > 5) Get the IDs within the input cutoff > 6) Run Edit Distance on the list and return > > In my tests on a Intel Core 2 Duo with 3 GB RAM and Windows XP 32 bit, it > runs in <0.5 sec with a keyword record count of about 1,000,000 records. > This is at least 3-4 times less than the current search times on Lucene. > > Since the results can be put in a thread safe hash table structure, the > trigram search can be distributed over a thread pool also. > > Does this seem like a workable suggestion to the community? > > Regards > > -- > Varun Dhussa > Product Architect > CE InfoSystems (P) Ltd > http://www.mapmyindia.com > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-dev-h...@lucene.apache.org > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org