... I don't know how best to guide people through this (I'm not currently involved in the hand-classified-corpora process), and I can't find the wonderful links Warren posted a while ago (all I can find is http://wiki.apache.org/spamassassin/HandClassifiedCorpora ) ... Maybe Warren can take the reins on issuing this request to the users list?
----

Almost all mail in the current masscheck corpora that comes in from Latin America / Caribbean (LACNIC) and Asia / Australia / Pacific (APNIC) is spam. This biases both region and language in our genetic algorithm (and thus our scores) quite tremendously. I haven't tested AFRINIC, but I suspect the same is true there as well, /especially/ given the whole ~Nigerian scam thing.

WE SORELY NEED CONTRIBUTORS FROM THESE REGIONS!

A short while ago, I tested this theory with a few rules.

http://ruleqa.spamassassin.org/?rule=/RCVD_VIA

... or if you think the last --net might be of relevance (it isn't),

http://ruleqa.spamassassin.org/?daterev=20100116-r899903-n&rule=/RCVD_VIA

Explanation of these rules: LACNIC_LE examines only the connecting (last-external) server (much like the DNSBL tests). LACNIC itself just looks anywhere in the external relays and is akin to APNIC_E rather than APNIC itself. APNIC_LE again looks only at that final connecting server, while APNIC_E looks anywhere in the external relays. APNIC_I looks only at the *internal* relays, which is why it has almost no hits. APNIC proper, a rule Warren added from my channels before I had commit access, is a slightly sloppier rule that looks for a relay anywhere. (I created _I so that APNIC_E + APNIC_I should equal APNIC itself ... it doesn't, which is interesting. Even more interesting is that APNIC_E is larger in both spam and ham than APNIC itself, indicating a bug or two somewhere.)

From the extremely small volumes of ham reported by LACNIC and its sister APNIC_E, we have clear evidence that there is almost no ham volume coming from either of these two areas, especially LACNIC, which accounted for only 65 hams last night (APNIC saw 7401, which I still consider minuscule). Warren has two -jp corpora which collectively account for 60% of the APNIC ham in the overall corpus.
This is far better than nothing, but I strongly suspect that the overwhelming majority of APNIC-originating spam comes from southeast Asia, while Japan and Australia/New Zealand likely have among the lowest spam/ham ratios of APNIC's email-heavy nations. What we really need are corpora from sources in China (but not Hong Kong) and Brazil, though given where we currently stand, anything from outside ARIN and RIPE would be extremely useful. RIPE sources from Cyrillic-heavy areas would probably help too.
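
For anyone trying to follow the difference between the _LE, _E, _I and plain variants described above, here is a rough Python sketch of how I understand the scoping. It is not the actual rule code (the real rules are SpamAssassin header rules). The `Relay` class, the `evaluate` function, the assumption that relays are listed newest first, and the single placeholder prefix (a documentation-only range, not a real APNIC allocation) are all made up for illustration.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

# Placeholder prefix only: 203.0.113.0/24 is a documentation range,
# NOT a real APNIC or LACNIC allocation. Swap in real registry data.
HYPOTHETICAL_REGION_NETS = [ip_network("203.0.113.0/24")]


@dataclass
class Relay:
    ip: str
    internal: bool  # True if the relay sits inside our own (internal) network


def in_region(ip: str) -> bool:
    """Return True if the address falls inside any placeholder region prefix."""
    addr = ip_address(ip)
    return any(addr in net for net in HYPOTHETICAL_REGION_NETS)


def evaluate(relays: list[Relay]) -> dict[str, bool]:
    """Evaluate the four scopes for one message.

    Assumption: relays are ordered newest first, so the first external
    relay in the list is the one that connected to our network.
    """
    external = [r for r in relays if not r.internal]
    internal = [r for r in relays if r.internal]
    last_external = external[0] if external else None
    return {
        "LE": last_external is not None and in_region(last_external.ip),
        "E": any(in_region(r.ip) for r in external),
        "I": any(in_region(r.ip) for r in internal),
        "ANY": any(in_region(r.ip) for r in relays),
    }


# One internal hop, then two external hops; only the oldest is in the region.
message = [
    Relay("192.0.2.10", internal=True),
    Relay("198.51.100.7", internal=False),
    Relay("203.0.113.5", internal=False),
]
result = evaluate(message)
```

Under these definitions, ANY is true for a message exactly when E or I is true, so the hit count for the _E variant can never exceed the hit count for the plain variant. That is why APNIC_E coming out larger than APNIC points at a bug in one of the rules rather than at anything in the mail itself.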
