... I don't know how best to guide people through this (I'm not
currently involved in the hand-classified-corpora process) and I
can't find the wonderful links Warren posted a while ago (all I can
find is http://wiki.apache.org/spamassassin/HandClassifiedCorpora )
... Maybe Warren can take the reins on issuing this request to the
users list?

----

Almost all mail in the current masscheck corpora that comes in from
Latin America / Caribbean (LACNIC) and Asia / Australia / Pacific
(APNIC) is spam.  This biases both region and language in our genetic
algorithm (and thus our scores) quite tremendously.  I haven't tested
AFRINIC, but I suspect the same is true there as well, /especially/
given the whole ~Nigerian scam thing.

WE SORELY NEED CONTRIBUTORS FROM THESE REGIONS!



A short while ago, I tested this theory with a few rules.

http://ruleqa.spamassassin.org/?rule=/RCVD_VIA
... or if you think the last --net run might be of relevance (it isn't),
http://ruleqa.spamassassin.org/?daterev=20100116-r899903-n&rule=/RCVD_VIA

Explanation of these rules:

LACNIC_LE examines only the connecting (last-external) server (much
like the DNSBL tests).  LACNIC itself just looks anywhere in the
external relays and is akin to APNIC_E rather than APNIC itself.

APNIC_LE again looks only at that final connecting server while
APNIC_E looks anywhere in the external relays.  APNIC_I looks only at
*internal* relays, which is why it has almost no hits.
APNIC proper, a rule Warren added from my channels before I had commit
access, is a slightly sloppier rule that looks for a relay anywhere.

(I created _I so that APNIC_E + APNIC_I should equal APNIC itself ...
it doesn't, which is interesting.  Even more interesting is that
APNIC_E is larger in both spam and ham than APNIC itself, indicating a
bug or two somewhere.)
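
To make the distinction between the _LE, _E, _I and plain variants
concrete, here is a rough sketch in Python.  It is not how the actual
rules are implemented; the names Relay, in_region, classify and
APNIC_SAMPLE are made up for illustration, and the two /8 prefixes are
only a small, incomplete sample of APNIC space.  It assumes the relay
list is ordered newest-first and that each relay is already marked as
internal or external.

```python
import ipaddress
from typing import NamedTuple

# Assumption: an illustrative, incomplete sample of APNIC /8 blocks.
# A real check would need the full registry allocation data.
APNIC_SAMPLE = [ipaddress.ip_network(n) for n in ("202.0.0.0/8", "203.0.0.0/8")]


class Relay(NamedTuple):
    """Hypothetical relay record: one hop taken from a Received header."""
    ip: str
    internal: bool  # True if the hop is inside our own network


def in_region(ip, nets):
    """Return True if the address falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in nets)


def classify(relays, nets=APNIC_SAMPLE):
    """Evaluate the four rule variants for one message.

    relays must be ordered newest-first, so the first external entry is
    the last-external (connecting) server.
    """
    external = [r for r in relays if not r.internal]
    internal = [r for r in relays if r.internal]
    return {
        "LE": bool(external) and in_region(external[0].ip, nets),
        "E": any(in_region(r.ip, nets) for r in external),
        "I": any(in_region(r.ip, nets) for r in internal),
        "ANY": any(in_region(r.ip, nets) for r in relays),
    }
```

Under this model, a message hits ANY exactly when it hits E or I, so
the E hit count can never exceed the ANY hit count.  Simply adding the
E and I counts can overshoot ANY, though, since a message with matching
relays on both sides is counted twice.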

From the extremely small volumes of ham reported by LACNIC and its
sister APNIC_E, we have clear evidence that there is almost no ham
volume coming from either of these two areas, especially LACNIC, which
accounted for only 65 hams last night (APNIC saw 7401, which I still
consider minuscule).


Warren has two -jp corpora which collectively account for 60% of the
APNIC ham in the overall corpus.  This is far better than nothing, but
I strongly suspect that the overwhelming majority of APNIC-originating
spam comes from southeast Asia while Japan and Australia/New Zealand
likely have among the lowest spam/ham ratios of APNIC's email-heavy
nations.

What we really need is corpora from sources in China (but not Hong
Kong) and Brazil, though at our current standing, anything from
outside ARIN and RIPE would be extremely useful.  RIPE sources from
Cyrillic-heavy areas would probably help too.
