On Thu, Feb 3, 2011 at 8:55 AM, Emmanuel Espina <espinaemman...@gmail.com> wrote:
> It uses fuzzy queries instead of an ngram query, and then I rank the results
> by word frequency in the text with the aid of a python script (all that is
> explained in the post). I got pretty good results (between 50% and 90%
> improvements), but slower (about double the time).
>
Hi Emmanuel:

I think it's great you are evaluating different techniques here; our spelling could use some help :)

By the way: we added a new spellchecking technique that sounds quite similar to what you describe (DirectSpellChecker), but hopefully without the performance issues. It's only available in trunk (http://svn.apache.org/repos/asf/lucene/dev/trunk/).

I tried to do a very rough evaluation on its JIRA issue: https://issues.apache.org/jira/browse/LUCENE-2507, but nothing very serious or as in-depth as what it looks like you did.

Anyway, if you want to play, you can experiment with it either at the Lucene level (it's in contrib/spellchecker) or via Solr, by using DirectSolrSpellChecker... though I think the parameters in the example solrconfig are likely not the best :)

I have an app using this more fleshed-out config (in combination with the new collation options), and it seems to be reasonable:

<!-- a spellchecker that uses no auxiliary index -->
<lst name="spellchecker">
  <str name="name">default</str>
  <str name="field">text</str>
  <str name="classname">solr.DirectSolrSpellChecker</str>
  <str name="minPrefix">1</str>
  <str name="maxEdits">2</str>
  <str name="maxInspections">25</str> <!-- probably way too high for most apps though -->
  <str name="minQueryLength">3</str>
  <str name="comparatorClass">freq</str>
  <str name="thresholdTokenFrequency">1</str>
  <str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
</lst>
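
If you want to drive it at the Lucene level instead, here's a rough sketch of what that looks like (assuming the trunk API at the time; the index path, field name, and misspelled term below are just placeholders), with settings roughly mirroring the solrconfig snippet above:

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spell.DirectSpellChecker;
import org.apache.lucene.search.spell.JaroWinklerDistance;
import org.apache.lucene.search.spell.SuggestWord;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SpellTest {
  public static void main(String[] args) throws Exception {
    // hypothetical index location
    Directory dir = FSDirectory.open(new File("/path/to/index"));
    IndexReader reader = IndexReader.open(dir);

    DirectSpellChecker checker = new DirectSpellChecker();
    checker.setMaxEdits(2);                         // mirrors maxEdits above
    checker.setMinPrefix(1);                        // mirrors minPrefix above
    checker.setMinQueryLength(3);                   // mirrors minQueryLength above
    checker.setDistance(new JaroWinklerDistance()); // mirrors distanceMeasure above

    // ask for up to 5 suggestions for a (made-up) misspelled term
    SuggestWord[] suggestions =
        checker.suggestSimilar(new Term("text", "spellng"), 5, reader);
    for (SuggestWord s : suggestions) {
      System.out.println(s.string + " (docFreq=" + s.freq + ")");
    }

    reader.close();
  }
}

The nice thing is there's no auxiliary spelling index to build or keep in sync: suggestions come straight from the terms in your main index.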