[EMAIL PROTECTED] wrote:
> To help me understand this, please could you define what you mean by
> 'pollution' and how others could identify if it has happened to them.

Paul,

Pollution *to me* is repetitive content that keeps getting added to the 
corpus, incorrectly bolstering the scoring of that content. It's 
completely subjective to your organization and the spam it gets.

Here is what specifically causes pollution at the site where I am 
currently administering ASSP: repetitive whitelisted personal 
email. And I mean this on a large scale: hundreds of emails per 
user, per day, going back and forth between recipients. Email being 
used on the communication scale of IM. Emails that are small, one or 
two sentences, and are continuously quoted replies.

So, what *I* do to circumvent it is to use the redRe (I previously used 
external scripts until the redRe behavior was modified recently).  I 
look for phrase matches in personal email - such as how people typically 
start off a personal conversation (things I have seen repeated on a 
daily basis for these problematic users).  Certain things they say, and 
certain slang they use.  It's targeted to the types of conversations 
that I believe pollute my corpus.

Here are two examples of redRe Regular Expressions that I use:

subject: ?(fw|fwd):#                      forwarded messages
subject:.{0,32}(re|fw|fwd):(?!(\w| +\w))# replied to or forwarded messages that had no subject
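
If you want to see what these two patterns catch before putting them in redRe, here is a rough Python sketch for trying them against sample subject lines. It rests on a few assumptions that are mine, not ASSP's: the text after the `#` is treated as an annotation and left out of the pattern, matching is done case-insensitively, and the `classify` helper and the sample subjects are made up for illustration.

```python
import re

# The two patterns from above, without the trailing "# ..." annotation.
# Assumption: matching is case-insensitive.
FORWARDED = re.compile(r"subject: ?(fw|fwd):", re.IGNORECASE)
EMPTY_REPLY = re.compile(r"subject:.{0,32}(re|fw|fwd):(?!(\w| +\w))", re.IGNORECASE)


def classify(header_line):
    """Return the names of the patterns that match one header line.

    Hypothetical helper for testing only; it is not part of ASSP.
    """
    hits = []
    if FORWARDED.search(header_line):
        hits.append("forwarded")
    if EMPTY_REPLY.search(header_line):
        hits.append("empty_reply")
    return hits


if __name__ == "__main__":
    # Made-up sample subjects, one per case of interest.
    samples = [
        "Subject: FW: lunch today",
        "Subject: Fwd:",
        "Subject: Re:",
        "Subject: Re: lunch",
        "Subject: Quarterly report",
    ]
    for line in samples:
        print(f"{line!r:32} -> {classify(line)}")
```

The negative lookahead in the second pattern is what limits it to replies and forwards with nothing after the prefix: as soon as a word character follows `re:` or `fw:`, with or without spaces in between, the pattern no longer matches at that prefix.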





_______________________________________________
Assp-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/assp-user
