On 22.9.2011 20:59, [email protected] wrote:
> On 09/22, Warren Togami Jr. wrote:
>> On a separate note, I have a volunteer at school willing to help us build
>> a Mandarin language ham corpus a few months from now. That will be
>> interesting to see how that throws off our statistics. =)
>
> I've been wondering about SA's accuracy on other languages. It looks like
> the only corpus we have is your wt-jp1? What's the accuracy like on that?
> Is the accuracy available somewhere on ruleqa? I'm actually more curious
> about accuracy of *spam* in non-English, because I'd say a very
> significant portion of my missed spam is in a non-Latin alphabet.
> And I don't want to just tell SA to classify non-English as spam because
> it would be nice if SA was actually usable for people who speak these
> languages.
>
> 75 out of the 113 spams SA has missed so far this month have subjects in a
> non-Latin alphabet. 66.4%. That doesn't even include a bunch of the
> non-English stuff.
>
> (I'm also not using bayes.)
>
My smallish corpus (mostly ham) is mainly in Finnish, but it also has some
English in it. The spam is of course in English and other languages; there
is no Finnish spam available ;)
--
"I wonder", he said to himself, "what's in a book while it's closed. Oh, I
know it's full of letters printed on paper, but all the same, something must
be happening, because as soon as I open it, there's a whole story with
people I don't know yet and all kinds of adventures and battles."
-- Bastian B. Bux
