[
https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669502#action_12669502
]
Mark Miller commented on LUCENE-1532:
-------------------------------------
A little experimentation showed better results. It may depend, though. I think
it's more useful when the dictionary contains lots of misspellings (as many
index-based spellcheck indexes do). In that case, I think it's more important
that docFreq play a role alongside edit distance to get good results (rather
than just being an edit distance tie breaker). The fact that one term appeared
30,000 times and another 36,700 doesn't make much of a difference in spell
checking. Words that are relatively similar in frequency get bucketed together,
and then edit distance can judge from there. Especially with misspellings, this
can work really well. The unaltered term frequencies are too widely distributed
to be very helpful as part of a weight. Normalizing them down into buckets lets
edit distance play a stronger role and keeps super-frequent terms from
clobbering good results, while still making the more frequent terms more likely
to be chosen as the suggestion. The edit distances will likely be similar too -
but say one word beats another by a small edit distance margin - it can
certainly make sense to choose the word that lost, because it has a frequency
of 10 and the winner a frequency of 1. You will satisfy more users. Even at 10
vs 4, or 10 vs 5, you will likely guess better.
Keep in mind, I'm no expert on spell checking though.
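To make the bucketing concrete, here is the kind of thing I have in mind (just
a sketch - the log-scale bucket function, the 0.05 boost, and the class itself
are made up for illustration, not anything in the patch):

    import org.apache.lucene.search.spell.LevensteinDistance;

    public class BucketedScorer {
        private final LevensteinDistance distance = new LevensteinDistance();

        // Collapse raw doc frequencies into coarse log-scale buckets, so that
        // 30,000 and 36,700 land in the same bucket and edit distance decides.
        static int freqBucket(int docFreq) {
            return docFreq <= 0 ? 0 : (int) Math.floor(Math.log10(docFreq));
        }

        // Higher is better. Similarity dominates; the bucket only adds a small
        // boost, so frequency can't clobber a clearly closer word.
        float score(String entered, String candidate, int docFreq) {
            float sim = distance.getDistance(entered, candidate); // 0..1
            return sim + 0.05f * freqBucket(docFreq);
        }
    }

With something like that, a 10 freq and a 1 freq differ by a whole bucket,
while 30,000 vs 36,700 don't differ at all.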
I have a feeling that a similar move would benefit a dictionary-based
spellchecker too. Breaking the freqs down into smaller buckets keeps
insignificant differences from playing a role in the correction. I'd love to
test a little and see how straight edit distance compares to a combined edit
distance / freq weight with a dictionary approach. I wouldn't be surprised if
slightly favoring more frequent words, by allowing a bit of edit distance
leeway, improved results. Choosing a word because it beats another by a slim
edit distance margin, when the loser is a high-frequency word in the language
and the winner a low-frequency one, makes little sense.
I also just kind of like the idea of unifying the two approaches. Really just
thinking out loud, though.
- Mark
> File based spellcheck with doc frequencies supplied
> ---------------------------------------------------
>
> Key: LUCENE-1532
> URL: https://issues.apache.org/jira/browse/LUCENE-1532
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/spellchecker
> Reporter: David Bowen
> Priority: Minor
>
> The file-based spellchecker treats all words in the dictionary as equally
> valid, so it can suggest a very obscure word rather than a more common word
> which is equally close to the misspelled word that was entered. It would be
> very useful to have the option of supplying an integer with each word that
> indicates its commonness. For example, the integer could be the document
> frequency in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by
> defining a DocFrequencyInfo interface for obtaining the doc frequency of a
> word, and a class which implements the interface by looking up the frequency
> in an index. So Lucene users can provide alternative implementations of
> DocFrequencyInfo. I could submit this as a patch if there is interest.
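> In rough sketch form, the interface and an index-backed implementation look
> something like this (simplified; exact names may differ in the patch):
>
>     import java.io.IOException;
>     import org.apache.lucene.index.IndexReader;
>     import org.apache.lucene.index.Term;
>
>     public interface DocFrequencyInfo {
>         int getDocFrequency(String word);
>     }
>
>     public class IndexDocFrequencyInfo implements DocFrequencyInfo {
>         private final IndexReader reader;
>         private final String field;
>
>         public IndexDocFrequencyInfo(IndexReader reader, String field) {
>             this.reader = reader;
>             this.field = field;
>         }
>
>         // Doc frequency of the word in the given field of the index.
>         public int getDocFrequency(String word) {
>             try {
>                 return reader.docFreq(new Term(field, word));
>             } catch (IOException e) {
>                 throw new RuntimeException(e);
>             }
>         }
>     }
>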
> Alternatively, it might be better to just extend the spellcheck API to have a
> way to supply the frequencies when you create a PlainTextDictionary, but that
> would mean storing the frequencies somewhere when building the spellcheck
> index, and I'm not sure how best to do that.