On Mon, July 2, 2012 09:37, Jari Fredriksson wrote:
>> function remove-unwanted-mail
>> {
>> echo "$0: removing unwanted $1 mail from corpus"
>> for file in `egrep -l -m 1 "^List-id\:|^(Reply-To|From|Received)\:.*(uusisuomi\.fi|talentum\.com|linkedin\.com|hs\.fi|iltalehti\.fi|nytimes\.com|facebook\.com|facebookmail\.com)|^Delivered-To: washingtonpost@fred.*\.fi|^From\: .*MAILER-DAEMON" \
>> Maildir/.Confirmed-$1/cur/*`
>> do
>> if test -f "$file"; then
>> echo -n "Removing $file... "
>> rm "$file" || exit 1
>> echo "done."
>> fi
>> done
>>
>> for file in `grep ALL_TRUSTED masscheckwork/*_mass_check/masses/*am-jarif.log | awk '{print $3}'`
>> do
>> if test -f "$file"; then
>> echo -n "Removing $file... "
>> rm "$file" || exit 1
>> echo "done."
>> fi
>> done
>> }
>>
>> remove-unwanted-mail HAM
>> remove-unwanted-mail SPAM
>>
>>
>
> This now always runs before the masscheck. It defeats the idea from
> Warren, who urged me to subscribe to and collect Finnish ham mail from
> news agencies and such, to get a sample of what Finnish email users
> receive in their inboxes.
>
> There is still this kind of mass email, which is not personal to me:
> the railroad (vr.fi) and the airline finnair.(fi|com), to name a couple.
> Lots of stuff will be removed from news agencies like talentum.com,
> hs.fi, iltasanomat.fi. Lots of mail that is not personal, per the
> spirit of the wiki page.
>
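To judge how much each part of such a filter would take out before anything is deleted, the matches can be counted per pattern. A minimal sketch follows; `count_hits` is a hypothetical helper invented for this example and is not part of the script quoted above.

```shell
#!/bin/sh
# Sketch only: count_hits is a hypothetical helper, not part of the
# original script. For each pattern given, it prints the number of files
# under a directory that contain at least one matching line, then a tab,
# then the pattern itself. Nothing is deleted.
count_hits() {
    dir=$1
    shift
    for pattern in "$@"; do
        # grep -l prints each matching file name once; find hands the
        # files over in batches, so a large corpus is not a problem.
        n=$(find "$dir" -type f -exec grep -E -l -e "$pattern" {} + | wc -l | tr -d ' ')
        printf '%s\t%s\n' "$n" "$pattern"
    done
}
```

It could be called as `count_hits Maildir/.Confirmed-HAM/cur '^From:.*talentum\.com' '^From:.*hs\.fi'`. Note that grep scans whole files here, so a matching line in a message body counts too, just as it does in the quoted script.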
The filter was too aggressive on ^From: and on Finnish mail in general.
I fixed it and restored the corpus from backup. The filter is now:
filter="^List-id\:"
filter="$filter|^Received\:.*(linkedin\.com|hs\.fi|facebook\.com|facebookmail\.com)"
filter="$filter|^From:.*(MAILER-DAEMON|nytdirect\@nytimes\.com)"
filter="$filter|^Delivered-To: washingtonpost@fred.*\.fi"
for file in `egrep -l -m 1 "$filter" Maildir/.Confirmed-$1/cur/*`
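For reference, a minimal sketch of how the complete loop might look with the reduced filter. It is not the exact script in use: the `MAILDIR_BASE` and `DRY_RUN` variables are assumptions added for illustration, the function name uses underscores so that it also works in plain sh, and `find -exec` replaces the backticks so that a large corpus cannot overflow the command line.

```shell
#!/bin/sh
# Sketch only: MAILDIR_BASE and DRY_RUN are hypothetical knobs added for
# this example; the original script hard-codes Maildir/ and always deletes.
remove_unwanted_mail() {
    kind=$1                                   # HAM or SPAM
    dir="${MAILDIR_BASE:-Maildir}/.Confirmed-$kind/cur"

    filter='^List-id:'
    filter="$filter|^Received:.*(linkedin\.com|hs\.fi|facebook\.com|facebookmail\.com)"
    filter="$filter|^From:.*(MAILER-DAEMON|nytdirect@nytimes\.com)"
    filter="$filter|^Delivered-To: washingtonpost@fred.*\.fi"

    [ -d "$dir" ] || return 0

    echo "removing unwanted $kind mail from corpus"

    # grep -l prints each matching file name once. Reading the names line
    # by line keeps file names with spaces intact.
    find "$dir" -type f -exec grep -E -l -e "$filter" {} + |
    while IFS= read -r file; do
        if [ -n "$DRY_RUN" ]; then
            echo "Would remove $file"
        else
            # The loop runs in a pipeline subshell, so this exit only
            # stops the loop, not the calling script.
            rm -- "$file" || exit 1
            echo "Removed $file"
        fi
    done
}
```

Running it once with `DRY_RUN=1` set and reading the list before the real run would have caught the overly broad ^From: match without needing the backup.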
