On 2.7.2012 5:27, John Hardin wrote:
> On Sun, 1 Jul 2012, [email protected] wrote:
> 
>> On 07/01, Jari Fredriksson wrote:
>>> Did re-read wiki about cleaning corpus, and removed all mail from Facebook
>>> and Linkedin etc. from corpus. Also mail from MAILER-DAEMON and from
>>> ALL_TRUSTED removed.
>>
>> I wouldn't remove the facebook stuff... linkedin seems kind of evil though.
>> But if you got a legit email from facebook, and it hit a blacklist, that
>> was a legit failure of that blacklist, and valuable information.
>> Especially since things like sought have a bad habit of inappropriately
>> causing stuff from facebook to get flagged as spam.
>>
>> Removing MAILER-DAEMON and ALL_TRUSTED stuff is probably fine.
> 
> I'd mildly disagree. Having ALL_TRUSTED hams is useful for FP analysis
> and prevention, and having an ALL_TRUSTED spam is equally valuable.
> ALL_TRUSTED means "not forged", not "not spam".
> 

I am following the wiki page. I have now implemented the following:

function remove-unwanted-mail
{
    echo "$0: removing unwanted $1 mail from corpus"
    # Pass 1: mailing-list mail, newsletter/social-network senders, bounces.
    # grep -l prints only file names; -m 1 stops at the first matching line.
    for file in $(grep -E -l -m 1 \
"^List-id:|^(Reply-To|From|Received):.*(uusisuomi\.fi|talentum\.com|linkedin\.com|hs\.fi|iltalehti\.fi|nytimes\.com|facebook\.com|facebookmail\.com)|^Delivered-To: washingtonpost@fred.*\.fi|^From: .*MAILER-DAEMON" \
                  Maildir/.Confirmed-$1/cur/*)
    do
      if test -f "$file"; then
        echo -n "Removing $file... "
        rm "$file" || exit 1
        echo "done."
      fi
    done

    # Pass 2: anything that hit ALL_TRUSTED in the mass-check logs.
    # The third field of each matching log line is the message path.
    for file in $(grep ALL_TRUSTED masscheckwork/*_mass_check/masses/*am-jarif.log | awk '{print $3}')
    do
      if test -f "$file"; then
        echo -n "Removing $file... "
        rm "$file" || exit 1
        echo "done."
      fi
    done
}

remove-unwanted-mail HAM
remove-unwanted-mail SPAM
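
One caveat with the first loop: grep scans the whole file, so a body line that happens to start with "From:" (a forwarded message, for instance) can also trigger a removal. Below is a minimal sketch of a header-only dry run, assuming one message per file with the headers ending at the first empty line. The names header_matches and list_header_matches are made up for illustration and are not part of the script above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script): succeed if any
# header line of message file $2 matches the extended regex $1.
# Assumes one message per file, headers ending at the first empty line.
header_matches() {
    # sed prints lines up to and including the first empty line, then quits.
    sed '/^$/q' "$2" | grep -E -q -e "$1"
}

# Dry run: print, without deleting, the files in directory $2
# whose headers match the extended regex $1.
list_header_matches() {
    for file in "$2"/*; do
        [ -f "$file" ] || continue
        if header_matches "$1" "$file"; then
            printf '%s\n' "$file"
        fi
    done
}
```

Something like list_header_matches '^From:.*facebook\.com' Maildir/.Confirmed-HAM/cur would then show what would go before anything is actually removed. Like the original pattern, this does not unfold continuation lines, so a match that spans a folded header is still missed.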


-- 

Your analyst has you mixed up with another patient.  Don't believe a
thing he tells you.


