[jira] Commented: (LUCENE-1532) File based spellcheck with doc frequencies supplied

Eks Dev (JIRA) Fri, 30 Jan 2009 12:55:21 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669018#action_12669018
 ]


Eks Dev commented on LUCENE-1532:
---------------------------------

bq. so it can suggest a very obscure word rather than a more common word which 
is equally close to the misspelled word that was entered

In my experience, frequency information helps a lot here, but its effect is 
not linear. The word with the higher frequency is not always the better 
suggestion. Common sense says that high-frequency words often get misspelled 
in many different ways in a normal corpus, which produces the following patterns: 

An HF (high-frequency) word paired with an LF (low-frequency) word that is 
close in edit distance is much more likely to be a typo/misspelling than an 
HF vs HF pair. 

Similar cases with HF vs LF
"the" against "hte"
"think" vs "tihnk"

Very similar, but HF vs HF 
"think" vs "thing"
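
As a rough illustration of the HF vs LF pattern, here is a minimal sketch that flags a pair of similar words as a probable typo pair when their frequencies differ by a large factor. The class name, the method name, and the ratio threshold are all hypothetical, and the counts in the usage note are made up; none of this comes from the spellchecker code.

```java
public class TypoPairHeuristic {

    /**
     * Hypothetical rule of thumb: two words that are already known to be close
     * in edit distance are treated as a likely typo pair when one of them is
     * at least minRatio times more frequent than the other.
     */
    public static boolean looksLikeTypoPair(long freqA, long freqB, double minRatio) {
        long hi = Math.max(freqA, freqB);
        long lo = Math.min(freqA, freqB);
        // Add-one smoothing so a word that never occurs does not divide by zero.
        return (hi + 1.0) / (lo + 1.0) >= minRatio;
    }
}
```

With invented counts, `looksLikeTypoPair(1000000, 3, 100.0)` is true ("the" vs "hte"), while `looksLikeTypoPair(50000, 60000, 100.0)` is false ("think" vs "thing"). A ratio test like this cannot separate the "thomas" vs "tomas" kind of pair, which needs context.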

Some cases that fall outside these ideas are synonyms, alternative spellings, 
and very common mistakes. They are very tricky to isolate using only a 
distance measure and frequency. Here you need context.
Similar, and HF vs HF:
"thomas" vs "tomas": sometimes a spelling mistake, sometimes different names...

It depends on what you are trying to achieve. If you expect mistakes in the 
query, assuming that HF suggestions are better works well. But if you are 
going for high recall, you also need to cover the case where the query term is 
correct and you have to dig into your corpus to find the incorrect words (the 
query "think about it" should find a document containing "tihnk about it").

A very challenging problem... but cutting to the chase: the proposal is to 
make it possible to define
 float Function(Edit distance, Query_Token_Freq, Corpus_Token_Freq)
that returns a measure that is higher for more similar pairs, taking both edit 
distance and frequency into account (the value that gets used as the ordering 
condition for the priority queue). The default could just work as you 
described. (Maybe this is already possible; I did not look at it.) 
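
To make the proposed function concrete, here is a minimal sketch of a pluggable scorer that feeds a priority queue. Everything in it is an assumption for illustration: `SuggestionScorer`, `Candidate`, `FreqAwareSuggester`, and both scoring formulas are hypothetical names and choices, not part of the existing spellchecker API. `DEFAULT` tries to mirror the behaviour described in the issue (edit distance decides, frequency only breaks ties), and `RELATIVE` is one possible non-linear alternative that compares the candidate's frequency with the query token's.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class FreqAwareSuggester {

    /** Hypothetical hook, not an existing Lucene interface: higher score = better suggestion. */
    public interface SuggestionScorer {
        float score(int editDistance, int queryTokenFreq, int corpusTokenFreq);
    }

    /** Assumed default: edit distance decides; the frequency term stays below 1 and only breaks ties. */
    public static final SuggestionScorer DEFAULT =
            (dist, queryFreq, corpusFreq) -> -dist + corpusFreq / (corpusFreq + 1.0f);

    /** Assumed alternative: reward candidates that are much more frequent than the query token. */
    public static final SuggestionScorer RELATIVE =
            (dist, queryFreq, corpusFreq) ->
                    (float) Math.log((corpusFreq + 1.0) / (queryFreq + 1.0)) - dist;

    /** A candidate word with its precomputed edit distance to the query token. */
    public static final class Candidate {
        public final String word;
        public final int editDistance;
        public final int corpusFreq;

        public Candidate(String word, int editDistance, int corpusFreq) {
            this.word = word;
            this.editDistance = editDistance;
            this.corpusFreq = corpusFreq;
        }
    }

    /** Returns up to n candidate words, best score first. */
    public static List<String> suggest(List<Candidate> candidates, int queryTokenFreq,
                                       SuggestionScorer scorer, int n) {
        Comparator<Candidate> byScore = Comparator.comparingDouble(
                c -> scorer.score(c.editDistance, queryTokenFreq, c.corpusFreq));
        // Min-heap on score: the weakest kept suggestion is evicted first.
        PriorityQueue<Candidate> heap = new PriorityQueue<>(byScore);
        for (Candidate c : candidates) {
            heap.offer(c);
            if (heap.size() > n) {
                heap.poll();
            }
        }
        List<String> best = new ArrayList<>();
        while (!heap.isEmpty()) {
            best.add(heap.poll().word); // ascending score
        }
        Collections.reverse(best);      // best first
        return best;
    }
}
```

With made-up candidates "thunk" (distance 1, freq 12), "think" (distance 1, freq 50000) and "than" (distance 2, freq 90000) for an unseen query token, `DEFAULT` keeps "think" then "thunk", while `RELATIVE` keeps "think" then "than". Note that the float tie-break in `DEFAULT` loses resolution for very large frequencies, so a real implementation might compare distance and frequency separately.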

  



 


> File based spellcheck with doc frequencies supplied
> ---------------------------------------------------
>
>                 Key: LUCENE-1532
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1532
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: David Bowen
>
> The file-based spellchecker treats all words in the dictionary as equally 
> valid, so it can suggest a very obscure word rather than a more common word 
> which is equally close to the misspelled word that was entered.  It would be 
> very useful to have the option of supplying an integer with each word which 
> indicates its commonness.  I.e. the integer could be the document frequency 
> in some index or set of indexes.
> I've implemented a modification to the spellcheck API to support this by 
> defining a DocFrequencyInfo interface for obtaining the doc frequency of a 
> word, and a class which implements the interface by looking up the frequency 
> in an index.  So Lucene users can provide alternative implementations of 
> DocFrequencyInfo.  I could submit this as a patch if there is interest.  
> Alternatively, it might be better to just extend the spellcheck API to have a 
> way to supply the frequencies when you create a PlainTextDictionary, but that 
> would mean storing the frequencies somewhere when building the spellcheck 
> index, and I'm not sure how best to do that.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


