Re: [lingu-dev] Spell checking metrics; was:[native-lang] Status update season!

eleonora46 Wed, 20 Dec 2006 02:46:42 -0800

Kevin,

Thanks for your excellent analysis. 
In fact ALL languages are minority languages, aren't they?


I would simplify your statements like this:
AI (Accepted invalid) should tend to 0
FV (Flagged valid) should tend to 0.

Therefore A should tend to AV
and F should tend to FI.

If both of the above hold, then the spell checker
has done a really good job.
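
As a rough illustration of that check, here is a minimal Python sketch. The function name `error_shares` and all counts are made up for the example; they are not measurements from any real spell checker.

```python
def error_shares(av, ai, fv, fi):
    """Return (AI/A, FV/F): the wrong share of the accepted and of the flagged words."""
    accepted = av + ai   # A = AV + AI
    flagged = fv + fi    # F = FV + FI
    ai_share = ai / accepted if accepted else 0.0
    fv_share = fv / flagged if flagged else 0.0
    return ai_share, fv_share


# Hypothetical counts for a 100,000-word text (illustrative only).
ai_share, fv_share = error_shares(av=97_000, ai=50, fv=450, fi=2_500)
print(f"AI/A = {ai_share:.4f}, FV/F = {fv_share:.4f}")
```

The closer both shares are to 0, the closer A is to AV and F is to FI.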

Recognizing obscure words is more a job for grammar checkers:
they should mark obscure words that resemble misspellings of
frequently used words.
Since this can be context-dependent, different
grammar checker settings are required for it.

-eleonora

> Anyway, this is a great question and something I've thought a lot
> about as part of my work on developing language technology for
> minority languages.
> 
> There are a few naive metrics one can use.  I usually set things up this
> way:
> 
> (1) Given any text, you can split the words into "valid" (V) and
> "invalid" (I) words, independent of your spellchecker.  I define
> "invalid" to mean "a word that a human proofreader would correct *in the
> given context*".  So, given a sequence of characters like "sed", it
> might be valid in one place ("Did you know sed is Turing complete?")
> and not in another ("Did you know Turing sed that?").   There are no
> hairs to split in most cases, of course -- something like
> "misssspelling" is surely invalid (except, of course, in the message
> I'm typing right now, where I definitely would not want it corrected!)
> 
> (2) Next, when you run a spellchecker on your text, it splits the same
> words into "accepted" (A) words or "flagged" (F) words.   So the whole
> text consists of words I label as
> AV=accepted and valid
> AI=accepted and invalid
> FV=flagged and valid
> FI=flagged and invalid
> 
> (3) With this notation, the standard metrics are "recall":  R=AV/V
> (i.e. what fraction of the valid words do you recognize) and
> "precision": P=AV/A (i.e. what fraction of the recognized words are
> actually valid).   You see these a lot in evaluating search engines
> (relevant vs. irrelevant documents) and spam filters (spam vs.
> non-spam).   Since they work against each other, it is useful to
> combine them into a single "F-score":   F=2PR/(P+R)
> 
> You can also write down recall/precision for the spellchecker's
> performance at flagging invalid words: R=FI/I, P=FI/F, but I prefer
> the approach above.
> 
> 
> (4) To estimate these, run your spellchecker on, say, 100,000 words of
> text and note which of the flagged words are valid or invalid.    You
> then have FV and FI, and you know the total number of accepted words
> (A = 100,000 - F).   The tricky part is to estimate AI.  These words
> are the problem children in the world of spellcheckers.   One source
> of AI words is misspellings in the word list - but hopefully with
> care and proofreading these can be avoided or eliminated.  The hard
> ones are things like "right", which is invalid (usually) in the
> context  "right of passage" (should be "rite") but is so commonly
> valid that it clearly cannot be removed from the word list.  Other
> cases, like the obscure word "yor" in English, should clearly not be
> included since they are most likely to be a misspelling of a common
> word.   The precision/recall measures give you a disciplined,
> mathematical way to decide between including/excluding a given word,
> and I've found it very useful for Irish and some other languages.
> 
> In any case, you might make the optimistic assumption that AI is very
> close to 0, so precision is 100%, and recall is just A/(A+FV) - this
> is a simple quality measure, easy to compute.
> 
> There are other measures one can use, for instance evaluating the
> quality of the suggestions made by the spellchecker, but I'll leave
> that for another time, since it is more the responsibility of the
> language-independent engine than of the dictionary.
> 
> -Kevin
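
The formulas in the quoted message can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names and all counts are hypothetical, and the zero-division guards are my own addition, not something the quoted message specifies.

```python
def spellcheck_metrics(av, ai, fv, fi):
    """Precision, recall and F-score for accepting valid words (inputs are word counts)."""
    valid = av + fv      # V = AV + FV
    accepted = av + ai   # A = AV + AI
    recall = av / valid if valid else 0.0            # R = AV/V
    precision = av / accepted if accepted else 0.0   # P = AV/A
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0  # 2PR/(P+R)
    return precision, recall, f_score


def optimistic_recall(total_words, fv, fi):
    """Recall assuming AI = 0: A/(A+FV), with A = total_words - F."""
    accepted = total_words - (fv + fi)
    denom = accepted + fv
    return accepted / denom if denom else 0.0


# Hypothetical counts for a 100,000-word text (illustrative only).
p, r, f = spellcheck_metrics(av=97_000, ai=50, fv=450, fi=2_500)
print(f"P = {p:.4f}, R = {r:.4f}, F-score = {f:.4f}")
print(f"optimistic recall = {optimistic_recall(100_000, fv=450, fi=2_500):.4f}")
```

Only the flagged words have to be reviewed by hand to get FV and FI; AV and AI need a separate estimate unless AI is assumed to be 0, which is what `optimistic_recall` does.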

