On Fri, Sep 23, 2011 at 10:30:18AM +0300, Jari Fredriksson wrote:
> 22.9.2011 20:59, [email protected] wrote:
> > On 09/22, Warren Togami Jr. wrote:
> >> On a separate note, I have a volunteer at school willing to help us
> >> build a Mandarin language ham corpus a few months from now. That will
> >> be interesting to see how that throws off our statistics. =)
> >
> > I've been wondering about SA's accuracy on other languages. It looks like
> > the only corpus we have is your wt-jp1? What's the accuracy like on that?
> > Is the accuracy available somewhere on ruleqa? I'm actually more curious
> > about accuracy of *spam* in non-English, because I'd say a very
> > significant portion of my missed spam is in a non-Latin alphabet.
> > And I don't want to just tell SA to classify non-English as spam because
> > it would be nice if SA was actually usable for people who speak these
> > languages.
> >
> > 75 out of the 113 spams SA has missed so far this month have subjects in a
> > non-Latin alphabet. 66.4%. That doesn't even include a bunch of the
> > non-English stuff.
> >
> > (I'm also not using bayes.)
> >
>
> My smallish corpus (mostly ham) is Finnish language, but also English in
> it. Spam is of course English and other languages, there is no Finnish
> spam available ;)
There isn't any Finnish spam per se, but loads of "badly autotranslated" Finnish-language spam/phishing comes in daily.
