[
https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669502#action_12669502
]
Mark Miller commented on LUCENE-1532:
-------------------------------------
A little experimentation showed better results. It may depend, though. I think
it's more useful when the dictionary contains lots of misspellings (as many
index-based spellcheck indexes do). In that case, I think it's more important
that docFreq play a role alongside edit distance to get good results (rather
than just being an edit distance tie breaker). The fact that one term appeared
30,000 times and another 36,700 doesn't make much of a difference in spell
checking. Words that are relatively similar in frequency get bucketed together,
and then edit distance can judge from there. Especially with misspellings, this
can work really well. The unaltered term frequencies are too widely distributed
to be very helpful as part of a weight. Normalizing them down into buckets lets
edit distance play a stronger role and keeps super-frequent terms from
clobbering good results, while still making the more frequent terms more likely
to be chosen as the suggestion. The edit distances will likely be similar too -
but say one word beats another by a small edit distance margin - it can
certainly make sense to choose the word that lost, because it has a frequency
of 10 and the winner a frequency of 1. You will satisfy more users. Even at 10
vs 4, or 10 vs 5, you will likely guess better.
Keep in mind, I'm no expert on spell checking though.
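To make the bucketing concrete, here is the kind of thing I have in mind (just
a sketch - the log-scale bucket function, the 0.05 boost, and the class itself
are made up for illustration, not anything in the patch):

    import org.apache.lucene.search.spell.LevensteinDistance;

    public class BucketedScorer {
        private final LevensteinDistance distance = new LevensteinDistance();

        // Collapse raw doc frequencies into coarse log-scale buckets, so that
        // 30,000 and 36,700 land in the same bucket and edit distance decides.
        static int freqBucket(int docFreq) {
            return docFreq <= 0 ? 0 : (int) Math.floor(Math.log10(docFreq));
        }

        // Higher is better. Similarity dominates; the bucket only adds a small
        // boost, so frequency can't clobber a clearly closer word.
        float score(String entered, String candidate, int docFreq) {
            float sim = distance.getDistance(entered, candidate); // 0..1
            return sim + 0.05f * freqBucket(docFreq);
        }
    }

With something like that, a 10 freq and a 1 freq differ by a whole bucket,
while 30,000 vs 36,700 don't differ at all.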
I have a feeling that a similar move would benefit a dictionary-based
spellchecker too. Breaking the freqs down into smaller buckets keeps
insignificant differences from playing a role in the correction. I'd love to
test a little and see how straight edit distance compares to a combined edit
distance / freq weight with a dictionary approach. I wouldn't be surprised if
slightly favoring more frequent words, by allowing a bit of edit distance
leeway, improved results. Choosing a word because it beats another by a slim
edit distance margin, when the loser is a high-frequency word in the language
and the winner a low-frequency one, makes little sense.
I also just kind of like the idea of unifying the two approaches. Really just
thinking out loud, though.
- Mark
> File based spellcheck with doc frequencies supplied
> ---------------------------------------------------
>
> Key: LUCENE-1532
> URL: https://issues.apache.org/jira/browse/LUCENE-1532
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/spellchecker
> Reporter: David Bowen
> Priority: Minor
>
> The file-based spellchecker treats all words in the dictionary as equally
> valid, so it can suggest a very obscure word rather than a more common word
> which is equally close to the misspelled word that was entered. It would be
> very useful to have the option of supplying an integer with each word that
> indicates its commonness. For example, the integer could be the document
> frequency in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by
> defining a DocFrequencyInfo interface for obtaining the doc frequency of a
> word, and a class which implements the interface by looking up the frequency
> in an index. So Lucene users can provide alternative implementations of
> DocFrequencyInfo. I could submit this as a patch if there is interest.
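> In rough sketch form, the interface and an index-backed implementation look
> something like this (simplified; exact names may differ in the patch):
>
>     import java.io.IOException;
>     import org.apache.lucene.index.IndexReader;
>     import org.apache.lucene.index.Term;
>
>     public interface DocFrequencyInfo {
>         int getDocFrequency(String word);
>     }
>
>     public class IndexDocFrequencyInfo implements DocFrequencyInfo {
>         private final IndexReader reader;
>         private final String field;
>
>         public IndexDocFrequencyInfo(IndexReader reader, String field) {
>             this.reader = reader;
>             this.field = field;
>         }
>
>         // Doc frequency of the word in the given field of the index.
>         public int getDocFrequency(String word) {
>             try {
>                 return reader.docFreq(new Term(field, word));
>             } catch (IOException e) {
>                 throw new RuntimeException(e);
>             }
>         }
>     }
>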
> Alternatively, it might be better to just extend the spellcheck API to have a
> way to supply the frequencies when you create a PlainTextDictionary, but that
> would mean storing the frequencies somewhere when building the spellcheck
> index, and I'm not sure how best to do that.