[ https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669595#action_12669595 ]
Eks Dev commented on LUCENE-1532:
---------------------------------

bq. but I'm not sure the exact frequency number at just word-level is really that useful for spelling correction, assuming a normal zipfian distribution.

You are probably right: you cannot expect high resolution from frequency, but exact frequency information is your "source information". Clustering it into buckets is just one algorithmic transformation, after which less information remains. Mark suggests 1-10, someone else would be happy with 1-3 ... who can tell? Therefore I would recommend storing the real frequency and leaving it to the end user to decide what to do with it.

A frequency distribution is not a simple measure; it depends heavily on corpus composition and size. In one corpus a document frequency of 3 means the word is probably a typo; in another it means nothing...

My proposal is to work with the real frequency, since there is no information loss there ...

> File based spellcheck with doc frequencies supplied
> ---------------------------------------------------
>
>                 Key: LUCENE-1532
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1532
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: David Bowen
>            Priority: Minor
>
> The file-based spellchecker treats all words in the dictionary as equally
> valid, so it can suggest a very obscure word rather than a more common word
> which is equally close to the misspelled word that was entered. It would be
> very useful to have the option of supplying an integer with each word which
> indicates its commonness. I.e. the integer could be the document frequency
> in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by
> defining a DocFrequencyInfo interface for obtaining the doc frequency of a
> word, and a class which implements the interface by looking up the frequency
> in an index. So Lucene users can provide alternative implementations of
> DocFrequencyInfo.
> I could submit this as a patch if there is interest.
> Alternatively, it might be better to just extend the spellcheck API to have a
> way to supply the frequencies when you create a PlainTextDictionary, but that
> would mean storing the frequencies somewhere when building the spellcheck
> index, and I'm not sure how best to do that.
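
To make the raw-versus-bucketed argument concrete, here is a minimal Java sketch. It is not the patch from the issue: the `DocFrequencyInfo` method signature, the map-backed lookup, and the log-scale `bucket` helper are all assumptions chosen for illustration, and nothing here calls the Lucene API. It ranks equally close candidates by raw document frequency, then shows how one possible bucketing policy collapses two different counts into the same level.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: names and signatures are assumptions, not the LUCENE-1532 patch.
class FrequencyRankingSketch {

    /** Assumed shape of the DocFrequencyInfo interface described in the issue. */
    interface DocFrequencyInfo {
        int docFreq(String word);
    }

    /** Map-backed stand-in for an index lookup; unknown words get frequency 0. */
    static DocFrequencyInfo fromMap(Map<String, Integer> freqs) {
        return word -> freqs.getOrDefault(word, 0);
    }

    /** Orders candidates by raw doc frequency, most common first (stable for ties). */
    static List<String> rankByRawFrequency(List<String> candidates, DocFrequencyInfo info) {
        List<String> ranked = new ArrayList<>(candidates);
        ranked.sort(Comparator.comparingInt((String w) -> info.docFreq(w)).reversed());
        return ranked;
    }

    /** One possible user-side policy: collapse raw counts into 1..levels on a log scale. */
    static int bucket(int docFreq, int maxDocFreq, int levels) {
        if (docFreq <= 0 || maxDocFreq <= 1) {
            return 1;
        }
        double scaled = Math.log(docFreq) / Math.log(maxDocFreq); // 0.0 for count 1, 1.0 at the maximum
        int level = 1 + (int) Math.floor(scaled * (levels - 1));
        return Math.min(levels, level); // clamp counts above maxDocFreq
    }

    public static void main(String[] args) {
        Map<String, Integer> freqs = new HashMap<>();
        freqs.put("their", 45);
        freqs.put("there", 40);
        freqs.put("theer", 3);
        DocFrequencyInfo info = fromMap(freqs);

        // Raw frequencies keep "their" (45) ahead of "there" (40).
        System.out.println(rankByRawFrequency(Arrays.asList("theer", "there", "their"), info));

        // Bucketed on a 1-3 scale, 45 and 40 become indistinguishable.
        System.out.println(bucket(45, 10000, 3) + " vs " + bucket(40, 10000, 3));
    }
}
```

Storing the raw count leaves a policy like `bucket` as a choice made at lookup time, so a 1-10 scale, a 1-3 scale, or a corpus-specific typo threshold can each be derived later without rebuilding the spellcheck index.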