https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6155

--- Comment #10 from Justin Mason <[email protected]>  2009-08-18 01:15:46 PST ---
(In reply to comment #9)
> http://ruleqa.spamassassin.org/20090817-r804903-n/TVD_SPACE_RATIO/detail
> 90% FP rate for Japanese
> http://ruleqa.spamassassin.org/20090817-r804903-n/PLING_QUERY/detail
> 52% FP rate for Japanese
> http://ruleqa.spamassassin.org/20090817-r804903-n/GAPPY_SUBJECT/detail
> 44% FP rate for Japanese
> 
> All three of these rules do very poorly with Japanese mail, and the total %
> SPAM is lower than the % FP.  Yet the GA scores are rather high since we don't
> have a statistically significant amount of Japanese mail in the corpus.
> 
> What language are the SPAM hits?  Perhaps many are examples of identifying
> foreign languages instead of determining if it is ham or spam?
> 
> Bug #6149 is related to this problem.

I plan to fix that, alright. 

> I am attempting to convince Japanese, Chinese and Korean users to join the
> nightly masscheck, but it is very difficult.

BTW, you could also take copies of their mail samples and add them to your own
corpora, in effect acting as a proxy for them.  That's easier for them than
setting up all the infrastructure.  (I thought you were already doing this ;)

You may need to be able to ask them if a mail _really_ is ham, down the line,
though, so it needs to remain a two-way arrangement.

-- 
Configure bugmail: 
https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
