On Fri, Sep 23, 2011 at 10:30:18AM +0300, Jari Fredriksson wrote:
> 22.9.2011 20:59, [email protected] wrote:
> > On 09/22, Warren Togami Jr. wrote:
> >>    On a separate note, I have a volunteer at school willing to help us
> >>    build a Mandarin language ham corpus a few months from now.  That will
> >>    be interesting to see how that throws off our statistics. =)
> > 
> > I've been wondering about SA's accuracy on other languages.  It looks like
> > the only corpus we have is your wt-jp1?  What's the accuracy like on that?
> > Is the accuracy available somewhere on ruleqa?  I'm actually more curious
> > about accuracy of *spam* in non-English, because I'd say a very
> > significant portion of my missed spam is in a non-Latin alphabet.
> > And I don't want to just tell SA to classify non-English as spam because
> > it would be nice if SA was actually usable for people who speak these
> > languages.
> > 
> > 75 out of the 113 spams SA has missed so far this month have subjects in a
> > non-Latin alphabet.  66.4%.  That doesn't even include a bunch of the
> > non-English stuff.
> > 
> > (I'm also not using bayes.)
> > 
> 
> My smallish corpus (mostly ham) is mainly Finnish, with some English in
> it. The spam is of course in English and other languages; there is no
> Finnish spam available ;)

There isn't any Finnish spam per se, but there are loads of "badly
autotranslated" Finnish-language spam/phishing coming in daily.
