On 2/2/2011 1:01 AM, Justin Mason wrote:
2011/2/2 Warren Togami Jr. <[email protected]>:
On 2/1/2011 1:02 PM, Karsten Bräckelmann wrote:
Yikes indeed.
Maybe Joao should answer these himself...
Given the numbers, is that purely trap driven? Is there a legion of
human users manually verifying the spam?
What exactly does "filter duplicates" mean? If that includes identical
payloads sent to different users, these dupes should not be eliminated,
I believe, since removing them would bias the results. A random sample
will already eliminate most duplicates while preserving the
distribution.
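The point about sampling versus deduplication can be illustrated with a toy sketch (the corpus and payload names below are made up; this is not from any actual mass-check data):

```python
import random

# Toy corpus: 600 one-off payloads plus two spam campaigns that were
# blasted to many users, so "duplicates" are part of the real mix.
corpus = (["unique-%d" % i for i in range(600)]
          + ["campaign-A"] * 250
          + ["campaign-B"] * 150)

random.seed(42)

# Deduplicating first throws away the campaign weighting entirely:
deduped = set(corpus)               # 602 distinct payloads, weight lost

# A random sample keeps roughly the original distribution:
sample = random.sample(corpus, 100)
frac_dupes = sum(1 for m in sample if m.startswith("campaign")) / len(sample)
# frac_dupes stays near 0.4, the campaigns' true share of the corpus
```

The sample still contains campaign messages in roughly their true proportion, which is exactly the property deduplication would destroy.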
Good point. +1
+1.
My approach btw when dealing with traps is to (a) upload those using a
distinct filename if possible (e.g. "ham-jm-traps.log" or similar),
and (b) sample randomly to get the volume down to something comparable
to the other corpora. Trap spam tends to contain bounce blowback and
other "noise" that we don't necessarily want in large numbers in our
corpora.
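Step (b), cutting an oversized trap corpus down to a volume comparable to the other corpora, could be sketched with classic reservoir sampling (the helper name and log format below are made up for illustration; nothing here is from the actual mass-check tooling):

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Uniformly sample k items from a stream of unknown length
    (Algorithm R), so a huge trap log can be shrunk without
    holding it all in memory. Hypothetical helper."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # item i survives with probability k / (i + 1)
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# e.g. shrink a 50,000-line trap log down to 2,000 entries:
trap_log = ("Y 12 trap-msg-%05d ..." % i for i in range(50000))
sampled = reservoir_sample(trap_log, 2000, seed=1)
```

Because the sample is uniform, bounce blowback and other noise end up in the sampled corpus only in proportion to their share of the trap feed, rather than dominating it by sheer volume.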
Good point about bounce blowback (or backscatter, as some people call
it). I forgot about that because my traps automatically filter it out
of the corpus.
Warren