If you look at the FuzzyQuery code, you will see that it works by computing the Levenshtein distance between the original term and every term in the index, keeping only the terms that fall within the specified relative distance of the original. This explains why FuzzyQuery may work well for small indexes, but for large indexes (I have ~5 million terms in mine) it is impossibly slow.
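Just to illustrate where the time goes, the idea boils down to something like the sketch below. This is only a rough illustration, not the actual FuzzyQuery source: it assumes the usual org.apache.lucene imports, a query string in a variable called word, a field also called "word", and a levenshtein() helper standing in for whatever edit-distance routine you prefer (FuzzyQuery's real cutoff is relative to term length; a fixed cutoff is used here just to keep the sketch short).

  IndexReader reader = IndexReader.open("index");          // index path is an assumption
  TermEnum terms = reader.terms(new Term("word", ""));     // start of the "word" field
  List matches = new ArrayList();
  do {
      Term t = terms.term();
      if (t == null || !t.field().equals("word")) break;   // ran past the field
      // an edit-distance computation against *every* term -- this is the expensive part
      if (levenshtein(word, t.text()) <= 2) {
          matches.add(t.text());
      }
  } while (terms.next());
  terms.close();
  reader.close();

With ~5 million terms, that loop body runs ~5 million times for every suggestion request, which is where the cost comes from.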
What n-gram based (or any other secondary-index based) spell checkers try to do is select a limited number of candidate terms very quickly and only then apply the distance algorithm to them (a rough sketch of that candidate-selection step is at the very end of this message). If you use the same cutoff rules as FuzzyQuery, you will get a very similar result set. Secondary-index based spell checkers also give you a lot more control over how many similar terms to bring back and in what order.

Regards,
Alexey

-----Original Message-----
From: Jonathan Hager [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 20, 2004 6:48 PM
To: Lucene Users List
Subject: Re: Spell checker

I investigated how the algorithm implemented in this spell checker compares with my simple implementation of a spell checker. First, here is what my implementation looks like:

  // Each word becomes a single Lucene Document in a dictionary index;
  // dictionarySearcher is an IndexSearcher opened on that index.
  // To find suggestions for a (possibly misspelled) word:
  FuzzyQuery fquery = new FuzzyQuery(new Term("word", word));
  Hits dicthits = dictionarySearcher.search(fquery);

(Building such a dictionary index is sketched at the end of this message.)

For a simple test I misspelled brown as follows:
* bronw
* bruwn
* brownz

To validate my test cases I checked whether Microsoft Word and Google had any idea what I was trying to spell. Google suggested brown, brown, and browns, respectively. Word's suggestions were:

  bronw  ==> brown, brow
  bruwn  ==> brown, brawn, bruin
  brownz ==> browns, brown

The suggestions using David Spencer/Nicolas Maisonneuve's algorithm against my index were:

  bronw  ==> jaron, brooks, citron, brookline
  bruwn  ==> brush
  brownz ==> bronze, brooks, brooke, brookline

The suggestions using my real simple algorithm against my index were:

  bronw  ==> brown, brwn, brush
  bruwn  ==> brown, brwn, brush
  brownz ==> brown, bronze

It appears that David Spencer/Nicolas Maisonneuve's spell-checking algorithm returns a broader result set than either most commercial algorithms or a real simple one. I will be the first to say that this is just anecdotal evidence and not a rigorous test of either algorithm. But until more extensive testing has been done, I'm going to stick with my real simple dictionary lookup.

Jonathan

On Wed, 20 Oct 2004 12:56:39 -0400, Aviran <[EMAIL PROTECTED]> wrote:
> Here: http://issues.apache.org/bugzilla/showattachment.cgi?attach_id=13009
>
> Aviran
> http://aviran.mordos.com
>
> -----Original Message-----
> From: Lynn Li [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, October 20, 2004 10:52 AM
> To: 'Lucene Users List'
> Subject: RE: Spell checker
>
> Where can I download it?
>
> Thanks,
> Lynn
>
> -----Original Message-----
> From: Nicolas Maisonneuve [mailto:[EMAIL PROTECTED]
> Sent: Monday, October 11, 2004 1:26 PM
> To: Lucene Users List
> Subject: Spell checker
>
> Hi Lucene users,
> I developed a spell checker for Lucene, inspired by the David Spencer code.
>
> See the wiki doc: http://wiki.apache.org/jakarta-lucene/SpellChecker
>
> Nicolas Maisonneuve
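For illustration, the candidate-selection step described at the top of this message could look roughly like the following. This is only a sketch under assumptions, not code from the David Spencer/Nicolas Maisonneuve spell checker: it assumes a secondary spell index in which every dictionary word is stored in a "word" field together with its 3-grams in a "gram3" field, an IndexSearcher called spellSearcher opened on that index, and the same hypothetical levenshtein() helper as above.

  BooleanQuery query = new BooleanQuery();
  for (int i = 0; i + 3 <= word.length(); i++) {
      // each 3-gram of the misspelled word becomes an optional clause
      query.add(new TermQuery(new Term("gram3", word.substring(i, i + 3))), false, false);
  }
  Hits hits = spellSearcher.search(query);
  int max = Math.min(hits.length(), 50);            // re-rank only a small candidate set
  List suggestions = new ArrayList();
  for (int i = 0; i < max; i++) {
      String candidate = hits.doc(i).get("word");
      // the expensive distance check now runs against ~50 terms, not ~5 million
      if (levenshtein(word, candidate) <= 2) {
          suggestions.add(candidate);
      }
  }

Candidates that share the most n-grams with the misspelled word score highest, so the distance check only has to confirm or reject a short, already-plausible list; that is also where the extra control over how many terms come back, and in what order, comes from.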
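For completeness, building the kind of one-word-per-Document dictionary index described in Jonathan's message might look roughly like this. Again just a sketch with assumed names (wordList as some collection of dictionary words, "dictionary-index" as the index path); the analyzer barely matters here because the field is indexed as a single untokenized keyword.

  IndexWriter writer = new IndexWriter("dictionary-index", new WhitespaceAnalyzer(), true);
  for (Iterator it = wordList.iterator(); it.hasNext();) {
      String w = (String) it.next();
      Document doc = new Document();
      doc.add(Field.Keyword("word", w));   // stored and indexed as one untokenized term
      writer.addDocument(doc);
  }
  writer.optimize();
  writer.close();

  // the dictionarySearcher used in the FuzzyQuery snippet would then be:
  IndexSearcher dictionarySearcher = new IndexSearcher("dictionary-index");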
