On 3.7.2012 2:24, [email protected] wrote:
> On 07/02, RW wrote:
>> On Mon, 2 Jul 2012 12:01:32 -0700 (PDT)
>> John Hardin wrote:
>>> On Mon, 2 Jul 2012, Jari Fredriksson wrote:
>>>> http://wiki.apache.org/spamassassin/HandClassifiedCorpora?highlight=%28facebook%29
>>>
>>> That says to not include any _spams_ received via those channels, not
>>> to discard them _in toto_.
>>>
>> It actually says:
>>
>> DO NOT include such mail in either ham or spam folder. Just delete it.
>> Why? We don't want to count these as spam, causing false marks against
>> highly safe whitelist rules like USER_IN_DEF_DKIM_WL. They do not count
>> as ham either, because spam URL's or spam text would throw off the
>> statistics if they show up in the ham folder. Simply delete them
>
> Jari had been deleting non-spam from facebook. As John said, that wiki
> page says to not include *spam* from places like facebook. Legit mail
> from facebook, which Jari had been deleting, has value when appropriately
> reported as non-spam.
>
The current version of my script now deletes only two ham messages from
the whole corpus:

bin/delete-unwanted-mail.sh: removing unwanted HAM mail from corpus
Removing Maildir/.Confirmed-HAM/cur/1325191790.M834211P3551V000000000000FE00I000000000007650B_0.hurricane,S=8885:2,S... done.
Removing Maildir/.Confirmed-HAM/cur/1333374426.M539856P18381V000000000000FE00I00000000000605A5_4.hurricane,S=6426:2,S... done.
bin/delete-unwanted-mail.sh: removing unwanted SPAM mail from corpus

Those were not really bad ham, but each contained both ^List-Id and
^Received:.*MAILER-DAEMON in an attachment. I won't bother doing anything
about them; they are rare examples of ham. They were sent by ezmail from
Debian because I had something wrong on my server when they tried to send
a list post to me.

Only two deleted.

-- 
Among the lucky, you are the chosen one.
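
For anyone who wants to try the same test on their own corpus, here is a minimal Python sketch of the check described above. It is not the actual bin/delete-unwanted-mail.sh, which is not shown in this thread; the function name `is_unwanted` and the sample message are made up for illustration. It assumes the two expressions are applied line by line to the raw message text, so a match inside an attached message counts as well, and it does not handle folded Received: headers.

```python
import re

# The two expressions quoted in the message, compiled so that "^"
# anchors at the start of every line, not only the start of the text.
LIST_ID = re.compile(r"^List-Id", re.MULTILINE)
MAILER_DAEMON = re.compile(r"^Received:.*MAILER-DAEMON", re.MULTILINE)


def is_unwanted(raw_message: str) -> bool:
    """Return True when BOTH patterns match somewhere in the raw text.

    The whole raw message is searched, so headers of an attached
    message (for example a bounced list post) are matched too.
    """
    return bool(LIST_ID.search(raw_message)) and bool(
        MAILER_DAEMON.search(raw_message)
    )


if __name__ == "__main__":
    # Hypothetical bounce: the original list post travels as an attachment.
    sample = (
        "From: postmaster@example.org\n"
        "Subject: delivery problem\n"
        "\n"
        "--boundary\n"
        "Content-Type: message/rfc822\n"
        "\n"
        "Received: from relay by host (MAILER-DAEMON)\n"
        "List-Id: <some-list.example.org>\n"
    )
    print(is_unwanted(sample))
```

Requiring both patterns is what keeps the deletions rare: ordinary list mail has List-Id but no MAILER-DAEMON hop, and an ordinary bounce has no List-Id line.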
