[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

Mark Miller (JIRA) Mon, 02 Feb 2009 05:00:31 -0800

    [ https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669589#action_12669589 ]


Mark Miller commented on LUCENE-1532:
-------------------------------------

bq. but I'm not sure the exact frequency number at just word-level is really 
that useful for spelling correction, assuming a normal zipfian distribution. 

That's what normalizing down takes care of. The 1-10 range is just pulled out 
of a hat. You could use 1-3 and have low freq, med freq, hi freq.
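
As a rough illustration of "normalizing down", here is a minimal sketch that 
maps a raw frequency onto a small number of buckets. The class name 
FreqBuckets, the method bucket, and the choice of a log scale are all 
assumptions made for this example (a log scale is one common way to flatten a 
zipfian distribution); none of it is part of the Lucene spellchecker API.

```java
public final class FreqBuckets {
    private FreqBuckets() {}

    /**
     * Maps a raw frequency onto a bucket in [1, buckets] using a log scale.
     * Hypothetical helper: maxFreq is the largest frequency in the dictionary.
     */
    public static int bucket(long freq, long maxFreq, int buckets) {
        if (freq < 1 || maxFreq < 1 || buckets < 1) {
            throw new IllegalArgumentException("freq, maxFreq and buckets must be >= 1");
        }
        if (maxFreq == 1 || buckets == 1) {
            return 1; // nothing to distinguish
        }
        long clamped = Math.min(freq, maxFreq);
        // ratio is 0.0 for freq == 1 and 1.0 for freq == maxFreq
        double ratio = Math.log(clamped) / Math.log(maxFreq);
        int b = 1 + (int) Math.floor(ratio * buckets);
        return Math.min(b, buckets); // the top value would otherwise overflow by one
    }
}
```

With three buckets and a maximum frequency of 5000, a word seen twice lands in 
the low bucket and a word seen 5000 times lands in the high bucket.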

Consider I make a site called MarkMiller.com - it's full of stuff about Mark 
Miller. In my dictionary is Mike Muller though, who is mentioned on the site 
twice. Mark Miller is mentioned thousands of times. Now if I type something 
like Mlller and it suggests Muller just using edit distance - that type of 
thing will create a lot of bad suggestions. Muller is practically unheard of 
on my site, but I am suggesting it over Miller, which is all over the place. 
Edit distance by itself as the first cutoff creates too many of these close 
bad suggestions. So it's not that freq should be used heavily - but it can 
clear up these little oddities quite nicely.
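
The Miller/Muller case can be sketched as follows. This is a standalone 
illustration, not the contrib/spellchecker implementation: the class 
FreqAwareSuggester, its suggest method, and the in-memory word-to-frequency 
map are hypothetical. It uses plain Levenshtein distance as the first cutoff 
and frequency only to break ties between candidates at the same distance, 
which is the "not used heavily" role described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public final class FreqAwareSuggester {
    private FreqAwareSuggester() {}

    /** Plain Levenshtein distance (insert, delete, substitute each cost 1). */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Returns dictionary words within maxDistance of the input, ordered by
     * edit distance first and by descending frequency on ties.
     */
    public static List<String> suggest(String input,
                                       Map<String, Integer> dictionary,
                                       int maxDistance) {
        List<String> candidates = new ArrayList<>();
        for (String word : dictionary.keySet()) {
            if (editDistance(input, word) <= maxDistance) {
                candidates.add(word);
            }
        }
        candidates.sort((x, y) -> {
            int byDistance = Integer.compare(editDistance(input, x),
                                             editDistance(input, y));
            if (byDistance != 0) {
                return byDistance;
            }
            // Same distance: prefer the more frequent word.
            return Integer.compare(dictionary.get(y), dictionary.get(x));
        });
        return candidates;
    }
}
```

For a dictionary with Muller at frequency 2 and Miller at frequency 5000, both 
are one substitution away from Mlller, so the frequency tie-break puts Miller 
ahead of Muller. A normalized bucket could be used in place of the raw count 
if only coarse differences should matter.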


> File based spellcheck with doc frequencies supplied
> ---------------------------------------------------
>
>                 Key: LUCENE-1532
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1532
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: David Bowen
>            Priority: Minor
>
> The file-based spellchecker treats all words in the dictionary as equally 
> valid, so it can suggest a very obscure word rather than a more common word 
> which is equally close to the misspelled word that was entered.  It would be 
> very useful to have the option of supplying an integer with each word which 
> indicates its commonness.  I.e. the integer could be the document frequency 
> in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by 
> defining a DocFrequencyInfo interface for obtaining the doc frequency of a 
> word, and a class which implements the interface by looking up the frequency 
> in an index.  So Lucene users can provide alternative implementations of 
> DocFrequencyInfo.  I could submit this as a patch if there is interest.  
> Alternatively, it might be better to just extend the spellcheck API to have a 
> way to supply the frequencies when you create a PlainTextDictionary, but that 
> would mean storing the frequencies somewhere when building the spellcheck 
> index, and I'm not sure how best to do that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
