On 2.7.2012 5:27, John Hardin wrote:
> On Sun, 1 Jul 2012, [email protected] wrote:
>
>> On 07/01, Jari Fredriksson wrote:
>>> Did re-read wiki about cleaning corpus, and removed all mail from
>>> Facebook and Linkedin etc. from corpus. Also mail from MAILER-DAEMON
>>> and from ALL_TRUSTED removed.
>>
>> I wouldn't remove the facebook stuff... linkedin seems kind of evil
>> though.
>> But if you got a legit email from facebook, and it hit a blacklist, that
>> was a legit failure of that blacklist, and valuable information.
>> Especially since things like sought have a bad habit of inappropriately
>> causing stuff from facebook to get flagged as spam.
>>
>> Removing MAILDER-DAEMON and ALL_TRUSTED stuff is probably fine.
>
> I'd mildly disagree. Having ALL_TRUSTED hams is useful for FP analysis
> and prevention, and having an ALL_TRUSTED spam is equally valuable.
> ALL_TRUSTED means "not forged", not "not spam".
>

I am following the wiki page, and have now implemented the following:

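function remove-unwanted-mail
{
    echo "$0: removing unwanted $1 mail from corpus"

    # Lines that mark a message for removal: mailing-list traffic,
    # newsletters and social-network notifications, and bounces.
    pattern='^List-id:'
    pattern="$pattern|^(Reply-To|From|Received):.*(uusisuomi\.fi|talentum\.com|linkedin\.com|hs\.fi|iltalehti\.fi|nytimes\.com|facebook\.com|facebookmail\.com)"
    pattern="$pattern|^Delivered-To: washingtonpost@fred.*\.fi"
    pattern="$pattern|^From: .*MAILER-DAEMON"

    # -l prints only the names of matching files; -m 1 stops reading
    # each file at its first match.
    for file in `grep -E -l -m 1 "$pattern" Maildir/.Confirmed-$1/cur/*`
    do
        if test -f "$file"; then
            echo -n "Removing $file... "
            rm "$file" || exit 1
            echo "done."
        fi
    done

    # Remove every message that hit ALL_TRUSTED in the mass-check logs;
    # the third field of a log line is the message path.
    for file in `grep ALL_TRUSTED masscheckwork/*_mass_check/masses/*am-jarif.log | awk '{print $3}'`
    do
        if test -f "$file"; then
            echo -n "Removing $file... "
            rm "$file" || exit 1
            echo "done."
        fi
    done
}

remove-unwanted-mail HAM
remove-unwanted-mail SPAM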
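
Since the function deletes mail outright, it may be worth checking first which files a given pattern would select. The following is only a rough sketch of such a dry run, not part of my actual script: the name list_unwanted_mail and its two arguments (a directory and an extended regex) are made up for illustration. Like the function above, it scans whole files rather than only the header block.

```shell
#!/bin/bash
# Hypothetical dry-run helper (the name and arguments are assumptions):
# print the files in DIR that contain a line matching PATTERN, without
# deleting anything.
# Usage: list_unwanted_mail DIR PATTERN
list_unwanted_mail() {
    dir=$1
    pattern=$2
    for file in "$dir"/*; do
        # Skip anything that is not a regular file (this also covers
        # the case where the glob matched nothing).
        [ -f "$file" ] || continue
        # -E: extended regex, -q: no output, exit status only.
        if grep -E -q -e "$pattern" "$file"; then
            printf '%s\n' "$file"
        fi
    done
}

# Example call, using one of the patterns from the function above:
#   list_unwanted_mail Maildir/.Confirmed-HAM/cur '^From: .*MAILER-DAEMON'
```

Running this against the ham and spam folders and reading through the output would show, for example, how many ALL_TRUSTED or Facebook messages are about to leave the corpus before anything is actually removed.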
--
Your analyst has you mixed up with another patient. Don't believe a
thing he tells you.
