On Mon, July 2, 2012 09:37, Jari Fredriksson wrote:
>> function remove-unwanted-mail
>> {
>>     echo "$0: removing unwanted $1 mail from corpus"
>>     for file in `egrep -l -m 1 "^List-id\:|^(Reply-To|From|Received)\:.*(uusisuomi\.fi|talentum\.com|linkedin\.com|hs\.fi|iltalehti\.fi|nytimes\.com|facebook\.com|facebookmail\.com)|^Delivered-To: washingtonpost@fred.*\.fi|^From\: .*MAILER-DAEMON" \
>>                   Maildir/.Confirmed-$1/cur/*`
>>     do
>>       if test -f "$file"; then
>>         echo -n "Removing $file... "
>>         rm "$file" || exit 1
>>         echo "done."
>>       fi
>>     done
>>
>>     for file in `grep ALL_TRUSTED masscheckwork/*_mass_check/masses/*am-jarif.log | awk '{print $3}'`
>>     do
>>       if test -f "$file"; then
>>         echo -n "Removing $file... "
>>         rm "$file" || exit 1
>>         echo "done."
>>       fi
>>     done
>> }
>>
>> remove-unwanted-mail HAM
>> remove-unwanted-mail SPAM
>>
>>
>
> This now always runs before the masscheck. It undermines the idea of
> Warren, who urged me to subscribe to and collect Finnish ham mail from
> news agencies and the like, to get a sample of what Finnish email users
> receive in their inboxes.
>
> There is still some mass email left that is not personal to me: the
> railroad (vr.fi) and the airline (finnair.fi, finnair.com), to name a
> couple. But lots of mail from news agencies like talentum.com, hs.fi
> and iltasanomat.fi will be removed, and that is exactly the kind of
> non-personal mail the wiki page asks for.
>

The filter was too aggressive on ^From: and on Finnish mail.

I fixed the filter and restored the corpus from backup.

    filter="^List-id\:"
    filter="$filter|^Received\:.*(linkedin\.com|hs\.fi|facebook\.com|facebookmail\.com)"
    filter="$filter|^From:.*(MAILER-DAEMON|nytdirect\@nytimes\.com)"
    filter="$filter|^Delivered-To: washingtonpost@fred.*\.fi"
    for file in `egrep -l -m 1 "$filter" Maildir/.Confirmed-$1/cur/*`
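
Since the first filter removed more than intended, it may help to preview
what a filter matches before anything is deleted. The following is only a
sketch under some assumptions: the function names build_filter and
list_unwanted are hypothetical, the backslashes before ":" and "@" are
dropped because those characters are not special in extended regular
expressions, and each file is tested separately so that file names are
never word-split. Like the original, it searches the whole message, not
just the headers.

```shell
#!/bin/sh
# Hypothetical dry-run helper: lists the files the filter would remove,
# without deleting anything.

# Print the narrowed filter as one extended regular expression.
build_filter() {
    filter='^List-id:'
    filter="$filter|^Received:.*(linkedin\.com|hs\.fi|facebook\.com|facebookmail\.com)"
    filter="$filter|^From:.*(MAILER-DAEMON|nytdirect@nytimes\.com)"
    filter="$filter|^Delivered-To: washingtonpost@fred.*\.fi"
    printf '%s\n' "$filter"
}

# list_unwanted DIR
# Print the path of every regular file in DIR that has at least one
# line matching the filter.
list_unwanted() {
    dir=$1
    pattern=$(build_filter)
    for file in "$dir"/*; do
        # Skip anything that is not a regular file (including an unmatched glob).
        [ -f "$file" ] || continue
        if grep -E -q "$pattern" "$file"; then
            printf '%s\n' "$file"
        fi
    done
}

# Example (assumed layout, as in the script above):
#   list_unwanted Maildir/.Confirmed-HAM/cur
```

Reviewing that list first, and only then feeding it to rm, would have
shown the over-matching on ^From: before the corpus was touched.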



