On Thu, Feb 3, 2011 at 8:55 AM, Emmanuel Espina <espinaemman...@gmail.com> wrote:
> It uses fuzzy queries instead of an ngram query, and then I rank the results
> by word frequency in the text with the aid of a python script (all that is
> explained in the post). I got pretty good results (between 50% and 90%
> improvements), but slower (about double the time).
>
Hi Emmanuel:

I think it's great you are evaluating different techniques here; our spelling could use some help :)

By the way: we added a new spellchecking technique that sounds quite similar to what you describe (DirectSpellChecker), but hopefully without the performance issues. It's only available in trunk (http://svn.apache.org/repos/asf/lucene/dev/trunk/).

I tried to do a very rough evaluation on its JIRA issue: https://issues.apache.org/jira/browse/LUCENE-2507, but nothing very serious or as in-depth as what it looks like you did.

Anyway, if you want to play, you can experiment with it either at the Lucene level (it's in contrib/spellchecker) or via Solr, by using DirectSolrSpellChecker... though I think the parameters in the example solrconfig are likely not the best :)

I have an app using this more fleshed-out config (in combination with the new collation options), and it seems to be reasonable:

<!-- a spellchecker that uses no auxiliary index -->
<lst name="spellchecker">
  <str name="name">default</str>
  <str name="field">text</str>
  <str name="classname">solr.DirectSolrSpellChecker</str>
  <str name="minPrefix">1</str>
  <str name="maxEdits">2</str>
  <str name="maxInspections">25</str> <!-- probably way too high for most apps though -->
  <str name="minQueryLength">3</str>
  <str name="comparatorClass">freq</str>
  <str name="thresholdTokenFrequency">1</str>
  <str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
</lst>
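
If you want to drive it at the Lucene level instead, here's a rough sketch of what that looks like (assuming the trunk API at the time; the index path, field name, and misspelled term below are just placeholders), with settings roughly mirroring the solrconfig snippet above:

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spell.DirectSpellChecker;
import org.apache.lucene.search.spell.JaroWinklerDistance;
import org.apache.lucene.search.spell.SuggestWord;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SpellTest {
  public static void main(String[] args) throws Exception {
    // hypothetical index location
    Directory dir = FSDirectory.open(new File("/path/to/index"));
    IndexReader reader = IndexReader.open(dir);

    DirectSpellChecker checker = new DirectSpellChecker();
    checker.setMaxEdits(2);                         // mirrors maxEdits above
    checker.setMinPrefix(1);                        // mirrors minPrefix above
    checker.setMinQueryLength(3);                   // mirrors minQueryLength above
    checker.setDistance(new JaroWinklerDistance()); // mirrors distanceMeasure above

    // ask for up to 5 suggestions for a (made-up) misspelled term
    SuggestWord[] suggestions =
        checker.suggestSimilar(new Term("text", "spellng"), 5, reader);
    for (SuggestWord s : suggestions) {
      System.out.println(s.string + " (docFreq=" + s.freq + ")");
    }

    reader.close();
  }
}

The nice thing is there's no auxiliary spelling index to build or keep in sync: suggestions come straight from the terms in your main index.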