On 2/2/2011 1:01 AM, Justin Mason wrote:
2011/2/2 Warren Togami Jr. <[email protected]>:
On 2/1/2011 1:02 PM, Karsten Bräckelmann wrote:
Yikes indeed.
Maybe Joao should answer these himself...
Given the numbers, is that purely trap driven? Is there a legion of
human users manually verifying the spam?
What exactly does "filter duplicates" mean? If that includes identical
payloads sent to different users, these dupes should not be eliminated,
I believe, since removing them would bias the results. A random sample
will already eliminate most duplicates while preserving the
distribution.
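The point about sampling versus deduplication can be illustrated with a toy sketch (the corpus and payload names below are made up; this is not from any actual mass-check data):

```python
import random

# Toy corpus: 600 one-off payloads plus two spam campaigns that were
# blasted to many users, so "duplicates" are part of the real mix.
corpus = (["unique-%d" % i for i in range(600)]
          + ["campaign-A"] * 250
          + ["campaign-B"] * 150)

random.seed(42)

# Deduplicating first throws away the campaign weighting entirely:
deduped = set(corpus)               # 602 distinct payloads, weight lost

# A random sample keeps roughly the original distribution:
sample = random.sample(corpus, 100)
frac_dupes = sum(1 for m in sample if m.startswith("campaign")) / len(sample)
# frac_dupes stays near 0.4, the campaigns' true share of the corpus
```

The sample still contains campaign messages in roughly their true proportion, which is exactly the property deduplication would destroy.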
Good point. +1
+1.
My approach btw when dealing with traps is to (a) upload those using a
distinct filename if possible (e.g. "ham-jm-traps.log" or similar),
and (b) sample randomly to get the volume down to something comparable
to the other corpora. Trap spam tends to contain bounce blowback and
other "noise" that we don't necessarily want in large numbers in our
corpora.
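Step (b), cutting an oversized trap corpus down to a volume comparable to the other corpora, could be sketched with classic reservoir sampling (the helper name and log format below are made up for illustration; nothing here is from the actual mass-check tooling):

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Uniformly sample k items from a stream of unknown length
    (Algorithm R), so a huge trap log can be shrunk without
    holding it all in memory. Hypothetical helper."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # item i survives with probability k / (i + 1)
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# e.g. shrink a 50,000-line trap log down to 2,000 entries:
trap_log = ("Y 12 trap-msg-%05d ..." % i for i in range(50000))
sampled = reservoir_sample(trap_log, 2000, seed=1)
```

Because the sample is uniform, bounce blowback and other noise end up in the sampled corpus only in proportion to their share of the trap feed, rather than dominating it by sheer volume.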
Good point about bounce blowback (or backscatter, as some people call
it). I forgot about that because my traps automatically filter it out
of the corpus.
Warren