----- "Warren Togami Jr." <[email protected]> wrote:

> On 2/2/2011 5:25 PM, Karsten Bräckelmann wrote:
> > Spam to live accounts strongly preferred, human reviewed by "trained
> > monkeys". Emphasis on trained. ;) Some crap like backscatter should be
> > filtered from the trap data, if possible, and trap volume kept lower --
> > best done by random sampling, rather than dupe elimination.
> >
> > How much will that add to the corpus? In particular, how much would the
> > first class be, without trap data at all?
>
> Karsten brings up a good point about two types of spam. How about
> something like:
>
> * We want a total of 70K spam in your nightly corpus over the past week.
>   This means 10K spam per day.
> * 3K spam on Monday is from trained monkeys. Include 7K from a random
>   selection of trap spam.
> * 2K spam on Tuesday is from trained monkeys. Include 8K from a random
>   selection of trap spam.
> * etc.
>
> You could even split it into two separate masscheck runs.
> anubis-monkey
> anubis-trap

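
The daily quota described in the quoted proposal could be sketched roughly as follows. This is only an illustration under assumptions: the function name, the message identifiers, and the cap on human-reviewed spam are hypothetical, and nothing here is part of the masscheck tooling. It keeps all the human-reviewed ("trained monkey") spam for the day and fills the rest of the 10K target with a uniform random sample of trap spam, rather than deduplicating.

```python
import random

# 70K spam per week / 7 days, per the figures in the quoted proposal.
DAILY_TARGET = 10_000


def build_daily_corpus(monkey_spam, trap_spam, daily_target=DAILY_TARGET, seed=None):
    """Return (monkey_part, trap_part) for one day's corpus.

    Hypothetical helper: human-reviewed spam is kept as-is (capped at the
    daily target, an assumption), and the remaining quota is filled with a
    uniform random sample of trap spam.
    """
    rng = random.Random(seed)
    monkey_part = list(monkey_spam)[:daily_target]
    remaining = daily_target - len(monkey_part)
    trap_pool = list(trap_spam)
    # Sample without replacement; take everything if the trap pool is small.
    trap_part = rng.sample(trap_pool, min(remaining, len(trap_pool)))
    return monkey_part, trap_part


# Monday from the example: 3K human-reviewed, so 7K come from the trap.
monday_monkey = [f"monkey-{i}" for i in range(3_000)]
monday_trap = [f"trap-{i}" for i in range(50_000)]
monkey_part, trap_part = build_daily_corpus(monday_monkey, monday_trap, seed=1)
print(len(monkey_part), len(trap_part))  # 3000 7000
```

Returning the two parts separately also fits the suggestion of two separate masscheck runs (anubis-monkey and anubis-trap), since each list can feed its own run.
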
Thanks for the clear specs Warren, that helps ;-)
We shall try to do it like that. I still need to set up a proper environment for this, hopefully this coming weekend.

> > Given we're talking original figures of 1 million spam per *day*,
> > already discussing ways to cut that down to 50-100k -- over a period of
> > up to 2 months for spam, 60 days, mind you -- which is less than 2k a
> > day...
>
> It seems his spam is lacking spamassassin headers, so without "reuse" we
> are unable to determine delivery-time status of the network rules. I
> suggested that as long as his mail is lacking spamassassin headers,
> perhaps his random sample should be limited to the past week. Although
> not perfect, the past week might be closest to "reuse" in results.
>
> A better alternative would be to add spamassassin headers as each message
> was decided to be added to nightly masscheck corpus. The random subset
> of trap spam would have headers from seconds after delivery, and
> trained-monkey spam headers would be from whenever it was sorted.
> "reuse" would then be possible, and the age of spam included in the
> nightly masscheck can be calibrated based upon how much this corpus
> overwhelms everyone else's recent spam.
>
> Warren

--
João Gouveia
