[ https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669029#action_12669029 ]
Mark Miller commented on LUCENE-1532:
-------------------------------------

Our spellchecking definitely needs improvement. I like the idea of using a weight measure, something that combines frequency and edit distance. The spellchecker can make much better suggestions this way, and do things like only return a suggestion if it has a higher frequency or higher weight.

I've found that raw frequency is not a great stat to use, though. It works much better if you do something like normalize the frequency to a value between 1 and 10, then use that with the edit distance to calculate the weight. Or some such magic.

- Mark

> File based spellcheck with doc frequencies supplied
> ---------------------------------------------------
>
>                 Key: LUCENE-1532
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1532
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: David Bowen
>
> The file-based spellchecker treats all words in the dictionary as equally
> valid, so it can suggest a very obscure word rather than a more common word
> which is equally close to the misspelled word that was entered. It would be
> very useful to have the option of supplying an integer with each word which
> indicates its commonness, e.g. the document frequency in some index or set
> of indexes.
>
> I've implemented a modification to the spellcheck API to support this by
> defining a DocFrequencyInfo interface for obtaining the doc frequency of a
> word, and a class which implements the interface by looking up the frequency
> in an index. So Lucene users can provide alternative implementations of
> DocFrequencyInfo. I could submit this as a patch if there is interest.
>
> Alternatively, it might be better to just extend the spellcheck API to have a
> way to supply the frequencies when you create a PlainTextDictionary, but that
> would mean storing the frequencies somewhere when building the spellcheck
> index, and I'm not sure how best to do that.
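
For illustration only, the weighting idea from the comment (frequency normalized to a value between 1 and 10, then combined with edit distance) might be sketched roughly as below. This is not the Lucene spellchecker API and not the proposed patch: the class SuggestionWeigher, its Candidate holder, the log-scale normalization, and the "divide by 1 + edit distance" formula are all assumptions made up for this sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Lucene: ranks spelling candidates by a
// weight that combines normalized document frequency and edit distance.
final class SuggestionWeigher {

    // Hypothetical holder for one candidate suggestion.
    static final class Candidate {
        final String word;
        final int docFreq;       // document frequency supplied with the word
        final int editDistance;  // distance from the misspelled input

        Candidate(String word, int docFreq, int editDistance) {
            this.word = word;
            this.docFreq = docFreq;
            this.editDistance = editDistance;
        }
    }

    // Maps a raw document frequency onto the range 1..10.
    // Assumption: a log scale relative to the largest frequency in the
    // dictionary; the comment only says "normalize freq to 1-10".
    static double normalizeFreq(int docFreq, int maxFreq) {
        if (docFreq <= 1 || maxFreq <= 1) {
            return 1.0;
        }
        double ratio = Math.log(docFreq) / Math.log(maxFreq);
        return 1.0 + 9.0 * Math.min(1.0, ratio);
    }

    // Assumption: weight grows with normalized frequency and shrinks as the
    // edit distance grows. Other combinations are equally plausible.
    static double weight(int docFreq, int maxFreq, int editDistance) {
        return normalizeFreq(docFreq, maxFreq) / (1.0 + editDistance);
    }

    // Returns a new list of candidates ordered from highest to lowest weight.
    static List<Candidate> rank(List<Candidate> candidates, int maxFreq) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble(
                (Candidate c) -> weight(c.docFreq, maxFreq, c.editDistance)).reversed());
        return sorted;
    }
}
```

With a scheme like this, two candidates at the same edit distance are ordered by commonness, which addresses the "obscure word suggested over a common word" problem described in the issue. The doc frequencies themselves could come from something like the DocFrequencyInfo interface mentioned above, whose exact shape is not shown in this thread.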