On 3.7.2012 2:24, [email protected] wrote:
> On 07/02, RW wrote:
>> On Mon, 2 Jul 2012 12:01:32 -0700 (PDT)
>> John Hardin wrote:
>>> On Mon, 2 Jul 2012, Jari Fredriksson wrote:
>>>> http://wiki.apache.org/spamassassin/HandClassifiedCorpora?highlight=%28facebook%29
>>>
>>> That says to not include any _spams_ received via those channels, not
>>> to discard them _in toto_.
>>>
>> It actually says:
>>
>>
>> DO NOT include such mail in either ham or spam folder. Just delete it.
>> Why? We don't want to count these as spam, causing false marks against
>> highly safe whitelist rules like USER_IN_DEF_DKIM_WL. They do not count
>> as ham either, because spam URL's or spam text would throw off the
>> statistics if they show up in the ham folder. Simply delete them
> 
> Jari had been deleting non-spam from facebook.  As John said, that wiki
> page says to not include *spam* from places like facebook.  Legit mail
> from facebook, which Jari had been deleting, has value when appropriately
> reported as non-spam.
> 

The latest version of my script now deletes only two ham messages from
the whole corpus.

bin/delete-unwanted-mail.sh: removing unwanted HAM mail from corpus
Removing
Maildir/.Confirmed-HAM/cur/1325191790.M834211P3551V000000000000FE00I000000000007650B_0.hurricane,S=8885:2,S...
done.
Removing
Maildir/.Confirmed-HAM/cur/1333374426.M539856P18381V000000000000FE00I00000000000605A5_4.hurricane,S=6426:2,S...
done.
bin/delete-unwanted-mail.sh: removing unwanted SPAM mail from corpus

Those were not really bad ham, but they contained both ^List-Id and
^Received:.*MAILER-DAEMON in an attachment. I will not bother to do
anything about them; they are rare examples of ham. They were sent by
ezmail from Debian because something was wrong on my server when they
tried to send a list post to me.
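
For illustration, a check of that kind can be sketched as below. This is
not the actual bin/delete-unwanted-mail.sh, which is not shown here; the
function names matches_both and list_candidates are made up, and the
sketch assumes the two patterns are matched against the raw message file,
which is why a bounce carried as an attachment matches too. It only lists
candidates and deletes nothing.

```shell
#!/bin/sh
# Hypothetical sketch, not the real bin/delete-unwanted-mail.sh.

# Succeed when the raw message file contains both a line starting with
# "List-Id" and a line matching "^Received:.*MAILER-DAEMON". grep scans
# the whole file, so lines inside an attached message count as well.
matches_both() {
    grep -q '^List-Id' "$1" && grep -q '^Received:.*MAILER-DAEMON' "$1"
}

# Print every message under <folder>/cur that matches both patterns.
# This is a dry run: it prints paths and removes nothing.
list_candidates() {
    for f in "$1"/cur/*; do
        [ -f "$f" ] || continue      # skip an unmatched glob or non-files
        if matches_both "$f"; then
            printf '%s\n' "$f"
        fi
    done
}
```

Called as list_candidates Maildir/.Confirmed-HAM, it would print the
matching paths one per line so they can be reviewed before anything is
removed.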

Only two deleted.

-- 

Among the lucky, you are the chosen one.


