On 22.9.2011 20:59, [email protected] wrote:
> On 09/22, Warren Togami Jr. wrote:
>>    On a separate note, I have a volunteer at school willing to help us build
>>    a Mandarin language ham corpus a few months from now.  That will be
>>    interesting to see how that throws off our statistics. =)
> 
> I've been wondering about SA's accuracy on other languages.  It looks like
> the only corpus we have is your wt-jp1?  What's the accuracy like on that?
> Is the accuracy available somewhere on ruleqa?  I'm actually more curious
> about accuracy of *spam* in non-English, because I'd say a very
> significant portion of my missed spam is in a non-Latin alphabet.
> And I don't want to just tell SA to classify non-English as spam because
> it would be nice if SA was actually usable for people who speak these
> languages.
> 
> 75 out of the 113 spams SA has missed so far this month have subjects in a
> non-Latin alphabet.  66.4%.  That doesn't even include a bunch of the
> non-English stuff.
> 
> (I'm also not using bayes.)
> 

My smallish corpus (mostly ham) is in Finnish, with some English in it
as well. The spam is of course in English and other languages; there is
no Finnish spam available ;)

-- 

"I wonder", he said to himself, "what's in a book while it's closed.  Oh, I
know it's full of letters printed on paper, but all the same, something must
be happening, because as soon as I open it, there's a whole story with people
I don't know yet and all kinds of adventures and battles."
                -- Bastian B. Bux

