[EMAIL PROTECTED] wrote:
> To help me understand this, please could you define what you mean by
> 'pollution' and how others could identify if it has happened to them.
Paul,
Pollution *to me* is repetitive content that keeps getting added to the
corpus, incorrectly bolstering the scoring of that content. It's
completely subjective: it depends on your organization and the spam it
gets.
Here is what specifically causes pollution at the location where I am
currently administering ASSP: repetitive whitelisted personal email.
And I mean this on a large scale: hundreds of emails per user going
in/out between recipients a day. Email being used on the communication
scale of IM. Emails that are small, one or two sentences, and are
continuously quoted replies.
So, what *I* do to circumvent it is to use the redRe (I previously used
external scripts until the redRe behavior was modified recently). I
look for phrase matches in personal email - such as how people typically
start off a personal conversation (things I have seen repeated on a
daily basis for these problematic users). Certain things they say, and
certain slang they use. It's targeted to the types of conversations
that I believe pollute my corpus.
Here are two examples of redRe Regular Expressions that I use:
subject: ?(fw|fwd):# forwarded messages
subject:.{0,32}(re|fw|fwd):(?!(\w| +\w))# replied to or forwarded messages that had no subject
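
As a rough way to check what these two expressions catch, here is a
minimal Python sketch. It assumes the text after each "#" above is only
an annotation and not part of the pattern, and that matching is
case-insensitive. The helper name is_red and the sample subject lines
are hypothetical, made up for illustration. This is not how ASSP itself
evaluates redRe.

```python
import re

# Patterns copied from the message, with the trailing "# ..." notes removed.
# Case-insensitive matching is an assumption made for this sketch.
FORWARDED = re.compile(r"subject: ?(fw|fwd):", re.IGNORECASE)
EMPTY_REPLY = re.compile(r"subject:.{0,32}(re|fw|fwd):(?!(\w| +\w))", re.IGNORECASE)


def is_red(header_line: str) -> bool:
    """Hypothetical helper: True if either pattern matches the header line."""
    return bool(FORWARDED.search(header_line) or EMPTY_REPLY.search(header_line))


# Made-up subject lines, only to show which ones the patterns select.
samples = [
    "Subject: Fwd: funny picture",   # forwarded message
    "Subject: Re:",                  # reply with no subject text
    "Subject: Re: lunch on Friday",  # reply that still has a subject
    "Subject: quarterly report",     # ordinary message
]

for line in samples:
    print(f"{line!r}: {is_red(line)}")
```

The negative lookahead in the second pattern is what separates a bare
"Re:" from a reply that still carries a real subject: the match is
rejected as soon as a word character follows the prefix, with or
without spaces in between.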
_______________________________________________
Assp-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/assp-user