[ https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669599#action_12669599 ]

Mark Miller commented on LUCENE-1532:
-------------------------------------

bq. In one corpus doc. frequency of 3 means it is probably a typo, in another 
this means nothing...

I think typos are a separate problem, though. A corpus can contain too many 
high-frequency typos and low-frequency correct spellings, so this has to be 
attacked in other ways anyway - maybe by slightly favoring a true dictionary 
over the user dictionary. It's certainly hard to use frequency for it. 
Frequency does help keep those typos from being suggested, though, and the 
only cost is seeing fewer less common, but correct, suggestions.

bq. My proposal is to work with real frequency as you have no information loss 
there ... 

I don't think the info you are losing is helpful. If you heavily favor a word 
that occurs 70,000 times over words that occur 40,000 times, I think that 
works in favor of bad suggestions. On a scaled frequency chart they might 
actually be a 4 and a 5, or even the same value. Since the 70k vs. 40k gap 
doesn't likely tell you much about which is the better suggestion, scaling 
allows edit distance to play the larger role that it should in deciding.
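To make the scaling idea concrete, here is a toy sketch (not the actual patch; the class name, method, and log-based formula are assumptions for illustration) of compressing raw doc frequencies onto a small scale, so large absolute gaps like 70k vs. 40k collapse to the same or adjacent values:

```java
// Hypothetical sketch: compress raw doc frequencies onto a 0-9 scale
// with a log, so big absolute gaps become small rank differences and
// edit distance dominates the suggestion ranking.
class FreqScaling {
    // Map a raw doc frequency onto a small 0-9 scale.
    static int scaledFreq(int rawFreq) {
        if (rawFreq <= 0) {
            return 0;
        }
        return (int) Math.min(9, Math.round(Math.log10(rawFreq) * 2));
    }

    public static void main(String[] args) {
        // 70,000 and 40,000 land on the same scaled value.
        System.out.println(scaledFreq(70000)); // 9
        System.out.println(scaledFreq(40000)); // 9
        System.out.println(scaledFreq(3));     // 1
    }
}
```

With this particular scaling the two frequencies end up identical, so the tie is broken purely by edit distance.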

Of course it makes sense for the implementation to be able to work with the 
raw values as you say, though. We wouldn't want to hardcode the 
normalization. You're right - who knows what the right approach is, or 
whether you should even normalize at the end of the day. I don't. Casual 
experimentation showed good results though, and I think supplying something 
like that out of the box will really improve Lucene's spelling suggestions.
- Mark

> File based spellcheck with doc frequencies supplied
> ---------------------------------------------------
>
>                 Key: LUCENE-1532
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1532
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: David Bowen
>            Priority: Minor
>
> The file-based spellchecker treats all words in the dictionary as equally 
> valid, so it can suggest a very obscure word rather than a more common word 
> which is equally close to the misspelled word that was entered.  It would be 
> very useful to have the option of supplying an integer with each word which 
> indicates its commonness.  I.e. the integer could be the document frequency 
> in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by 
> defining a DocFrequencyInfo interface for obtaining the doc frequency of a 
> word, and a class which implements the interface by looking up the frequency 
> in an index.  So Lucene users can provide alternative implementations of 
> DocFrequencyInfo.  I could submit this as a patch if there is interest.  
> Alternatively, it might be better to just extend the spellcheck API to have a 
> way to supply the frequencies when you create a PlainTextDictionary, but that 
> would mean storing the frequencies somewhere when building the spellcheck 
> index, and I'm not sure how best to do that.
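The interface described above might look something like the sketch below. Only the name DocFrequencyInfo comes from the description; the method signature and the map-backed implementation are assumptions for illustration, not the actual patch (the real index-backed class would presumably delegate to IndexReader.docFreq):

```java
import java.util.Map;

// Hypothetical sketch of the DocFrequencyInfo idea: a pluggable source
// of per-word document frequencies for ranking spelling suggestions.
interface DocFrequencyInfo {
    // Return the document frequency of a word (0 if unknown).
    int docFrequency(String word);
}

// A trivial map-backed implementation, useful for tests or for
// frequencies loaded from a file alongside the dictionary.
class MapDocFrequencyInfo implements DocFrequencyInfo {
    private final Map<String, Integer> freqs;

    MapDocFrequencyInfo(Map<String, Integer> freqs) {
        this.freqs = freqs;
    }

    public int docFrequency(String word) {
        return freqs.getOrDefault(word, 0);
    }
}
```

Users could then swap in an implementation backed by their own index or corpus statistics.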

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

