On 2.7.2012 9:00, Jari Fredriksson wrote:
> On 2.7.2012 5:27, John Hardin wrote:
>> On Sun, 1 Jul 2012, [email protected] wrote:
>>
>>> On 07/01, Jari Fredriksson wrote:
>>>> Did re-read wiki about cleaning corpus, and removed all mail from Facebook
>>>> and Linkedin etc. from corpus. Also mail from MAILER-DAEMON and from
>>>> ALL_TRUSTED removed.
>>>
>>> I wouldn't remove the facebook stuff... linkedin seems kind of evil though.
>>> But if you got a legit email from facebook, and it hit a blacklist, that
>>> was a legit failure of that blacklist, and valuable information.
>>> Especially since things like sought have a bad habit of inappropriately
>>> causing stuff from facebook to get flagged as spam.
>>>
>>> Removing MAILDER-DAEMON and ALL_TRUSTED stuff is probably fine.
>>
>> I'd mildly disagree. Having ALL_TRUSTED hams is useful for FP analysis
>> and prevention, and having an ALL_TRUSTED spam is equally valuable.
>> ALL_TRUSTED means "not forged", not "not spam".
>>
>
> I follow the wiki page. I have now implemented the following
>
> function remove-unwanted-mail
> {
>     echo "$0: removing unwanted $1 mail from corpus"
>     for file in `egrep -l -m 1 "^List-id\:|^(Reply-To|From|Received)\:.*(uusisuomi\.fi|talentum\.com|linkedin\.com|hs\.fi|iltalehti\.fi|nytimes\.com|facebook\.com|facebookmail\.com)|^Delivered-To: washingtonpost@fred.*\.fi|^From\: .*MAILER-DAEMON" \
>         Maildir/.Confirmed-$1/cur/*`
>     do
>         if test -f "$file"; then
>             echo -n "Removing $file... "
>             rm "$file" || exit 1
>             echo "done."
>         fi
>     done
>
>     for file in `grep ALL_TRUSTED masscheckwork/*_mass_check/masses/*am-jarif.log | awk '{print $3}'`
>     do
>         if test -f "$file"; then
>             echo -n "Removing $file... "
>             rm "$file" || exit 1
>             echo "done."
>         fi
>     done
> }
>
> remove-unwanted-mail HAM
> remove-unwanted-mail SPAM
>

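For anyone adapting the quoted function, a dry run is a cheap way to see what it would throw away before anything is deleted. The following is only a minimal sketch, assuming bash, GNU grep and a find that supports `-exec ... {} +`. The names list_unwanted_mail and UNWANTED_PATTERN are made up for illustration, and the pattern is a shortened stand-in for the one in the quoted function, not a replacement for it. It prints the matching files instead of removing them, and it goes through find so that a very large cur/ folder is not expanded onto a single command line.

```shell
#!/usr/bin/env bash
# Dry-run sketch: list corpus files that a header pattern would match.
# list_unwanted_mail and UNWANTED_PATTERN are hypothetical names; the
# pattern is a shortened stand-in for the one in the quoted function.

UNWANTED_PATTERN='^List-id:|^(Reply-To|From|Received):.*(linkedin\.com|facebook\.com|facebookmail\.com)|^From: .*MAILER-DAEMON'

list_unwanted_mail()
{
    # $1 = folder holding the messages, e.g. Maildir/.Confirmed-HAM/cur
    local dir=$1

    # find hands the file names to grep in batches, so a huge folder
    # cannot run into the "argument list too long" limit of a cur/* glob.
    # -E: extended regex (what egrep does), -l: print file names only,
    # -m 1: stop reading each file at its first matching line.
    find "$dir" -type f -exec grep -E -l -m 1 -e "$UNWANTED_PATTERN" {} +
}
```

Something like `list_unwanted_mail Maildir/.Confirmed-HAM/cur | wc -l` then shows how much ham would go. Like the quoted function, the sketch matches anywhere in the file and is case-sensitive, so it is no stricter or looser about headers than the original.
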
My remove-unwanted-mail function now always runs before the masscheck.

It defeats Warren's idea, though. He urged me to subscribe to and collect Finnish ham from news agencies and the like, to get a sample of what Finnish email users receive in their inboxes.

Some mass email of that kind, not personal to me, still remains: the railways (vr.fi) and the airline (finnair.(fi|com)), to name a couple. But lots of mail from news agencies like talentum.com, hs.fi and iltasanomat.fi will be removed. Lots of mail that is not personal. That is the spirit of the wiki page.

-- 
You will forget that you ever knew me.
