[jira] Commented: (LUCENE-2507) automaton spellchecker

Robert Muir (JIRA) Thu, 30 Sep 2010 19:02:56 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916757#action_12916757
 ]


Robert Muir commented on LUCENE-2507:
-------------------------------------

bq. Do you have any benchmarks for this spellchecker? I notice you mention a 
few times that you improved the performance. Do you know how it compares 
against the separate index approach?

In general I think the performance is fine. I did a lot of testing against the 
geonames database (> 2 million unique terms).
But, it completely depends upon the parameters you set. Here are some that can 
affect performance and quality:
* avoid doing work if the query term is already spelled correctly:
** minQueryLength (example: 4): query words of 3 characters or fewer are not 
checked. 
In general, with any metric, the candidates here will mostly be nonsense anyway.
** maxQueryFrequency (example: 1% or 1): if the query word is high frequency 
(e.g. appears in more than 1% of the documents), it's assumed to be correct, 
and no suggestions are given.
You can also set this to something like 1 if, say, you have a small product 
database, you feel all your products are spelled correctly in your index, 
and you don't want to *ever* suggest anything if the query term is in your 
products database.
* avoid doing work examining potentially bad suggestions:
** maxEdits (example: 1): the majority of misspellings are only 1 edit away. 
So if you lower this from the default "2" to 1, it's faster and more 
"lightweight" in the sense that you have less chance of getting a bad 
suggestion.
** minPrefix (example: 1): most misspellings don't occur in the first 
character. 
For the Solr example, I set this to zero (the wiki has an example correcting 
"hell" with "dell"), but in practice I think 1 is a good value. 
Additionally, this has a practical use for Solr users: you need a rather "raw" 
(e.g. not stemmed) analyzed field for spellchecking;
if you set this to 1 you can re-use your reverse-wildcard field for 
spellchecking too, and it will never suggest reversed terms.
** thresholdFrequency (example: 1% or 1): this plays the role of Solr's 
"HighFrequencyDictionary". 
In other words, you could set this to 1 to never suggest words that only appear 
in a single document... in many cases these are misspellings.
** maxInspections (example: 5): the existing spellchecker uses a hardcoded 10 
here. 
A lower value can work well here, since the algorithm used to draw candidates 
is actually Levenshtein. 
However, I set the default to 5 (instead of 1), because it's good to gather a 
few candidates for docFreq-sorting... 
but if you increase thresholdFrequency you can probably lower this.
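
As a rough sketch of how the frequency and length parameters above could gate 
a lookup: this is not the code from the patch, and the class and method names 
are made up for illustration. It assumes that a value >= 1 is an absolute 
document count and a value < 1 is a fraction of the documents in the index 
(maxDoc), and the exact boundary handling is my reading of the descriptions 
above.

```java
public class SpellcheckGates {

    /**
     * Decide whether a query term should be spellchecked at all.
     * Hypothetical helper: the boundary semantics are assumptions.
     */
    static boolean shouldCheck(String term, int docFreq, int maxDoc,
                               int minQueryLength, float maxQueryFrequency) {
        // minQueryLength: very short words mostly yield nonsense candidates.
        if (term.length() < minQueryLength) {
            return false;
        }
        // maxQueryFrequency as an absolute count (e.g. 1): a term that
        // already occurs that often is assumed to be spelled correctly.
        if (maxQueryFrequency >= 1f) {
            return docFreq < maxQueryFrequency;
        }
        // maxQueryFrequency as a fraction (e.g. 0.01 for 1%): skip terms
        // that appear in more than that share of the documents.
        return docFreq <= maxQueryFrequency * maxDoc;
    }

    /**
     * Decide whether a candidate is frequent enough to be suggested
     * (the thresholdFrequency role). Hypothetical helper.
     */
    static boolean frequentEnough(int candidateDocFreq, int maxDoc,
                                  float thresholdFrequency) {
        float floor = thresholdFrequency >= 1f
                ? thresholdFrequency            // absolute document count
                : thresholdFrequency * maxDoc;  // fraction of the index
        return candidateDocFreq > floor;
    }

    public static void main(String[] args) {
        int maxDoc = 1000;
        // Too short for minQueryLength = 4, so no work is done.
        System.out.println(shouldCheck("dog", 0, maxDoc, 4, 0.01f));
        // Appears in 5% of documents with a 1% limit: assumed correct.
        System.out.println(shouldCheck("hello", 50, maxDoc, 4, 0.01f));
        // Unknown term of sufficient length: worth checking.
        System.out.println(shouldCheck("helo", 0, maxDoc, 4, 0.01f));
        // thresholdFrequency = 1: a single-document candidate is dropped.
        System.out.println(frequentEnough(1, maxDoc, 1f));
    }
}
```

Both checks only look at the docFreq of a term, so they cost almost nothing 
compared to drawing candidates.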

bq. Equally, is this spellchecker a conceptual drop in replacement? By that I 
mean, are the suggestions it generates radically different to the separate 
index spellcheckers or are they along the same lines?

I think they are better, e.g. if you are ranking by an edit-distance like 
function such as Levenshtein or Jaro-Winkler, it makes more sense to get your 
*candidates* via the same or similar algorithm! The existing spellchecker gets 
candidates with n-grams... I think this causes a mismatch... (Of course the 
inverse is true, if you use NGramDistance, use the existing spellchecker!)

Again, I did a lot of testing with various corpora, and I'm not a spellchecking 
expert, but I didn't get particularly good results from the existing 
spellchecker.
And for some corpora such as geonames, it didn't seem to have the 
configurability I needed to tune the spellchecker to that domain.

For example, I queried on "zeeland" and the existing spellchecker returned 
freeland, leland, ireland, deland, and cleland as suggestions.
What's worse is that it created a 240MB auxiliary index when my original index 
was only 300MB, and it took 141 seconds to do this.
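
To look at the "zeeland" suggestions in edit-distance terms, here is a small 
sketch using a textbook dynamic-programming Levenshtein. It is not the 
automaton-based implementation from the patch; the class name and the 
isCandidate helper are made up for illustration, and they only mimic what 
maxEdits and minPrefix are described as doing.

```java
public class EditDistanceSketch {

    /** Plain two-row Levenshtein distance (insert, delete, substitute). */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Hypothetical filter: the term must share the first minPrefix
     * characters with the query and be within maxEdits edits of it.
     */
    static boolean isCandidate(String query, String term,
                               int maxEdits, int minPrefix) {
        if (!term.regionMatches(0, query, 0, minPrefix)) {
            return false;
        }
        return levenshtein(query, term) <= maxEdits;
    }

    public static void main(String[] args) {
        String query = "zeeland";
        String[] suggestions = {"freeland", "leland", "ireland", "deland", "cleland"};
        for (String s : suggestions) {
            System.out.println(s + " distance=" + levenshtein(query, s)
                    + " candidate(maxEdits=1, minPrefix=1)="
                    + isCandidate(query, s, 1, 1));
        }
    }
}
```

By this measure each of the five suggestions is 2 edits away from "zeeland" 
and none shares its first character, so a filter like this with maxEdits=1 or 
minPrefix=1 would not draw any of them.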

The idea here isn't to solve the world's spellchecking problems; it's mainly to 
get rid of the extra index. I think it's trivial to
set this one up to beat SpellChecker's suggestions, because I don't think 
SpellChecker's suggestions are very good.


> automaton spellchecker
> ----------------------
>
>                 Key: LUCENE-2507
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2507
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>             Fix For: 4.0
>
>         Attachments: LUCENE-2507.patch, LUCENE-2507.patch, LUCENE-2507.patch, 
> LUCENE-2507.patch
>
>
> The current spellchecker makes an n-gram index of your terms, and queries 
> this for spellchecking.
> The terms that come back from the n-gram query are then re-ranked by an 
> algorithm such as Levenshtein.
> Alternatively, we could just do a levenshtein query directly against the 
> index, then we wouldn't need
> a separate index to rebuild.


