On Sun, 26 Jun 2005, Justin Mason spake:
> Nix writes:
>> I'm still not sure how intensely to de-dupe: should I zap articles with
>> identical bodies? identical bodies except for MIME headers? identical
>> bodies except for identifiable bayes poison? Until the obfu rules came
>> in, I'd have said the latter... but now I'm just zapping articles with
>> identical bodies and rule hits, as the obfu rules make it very likely
>> that two articles differing only in bayes poison will end in different
>> rule-hit partitions anyway.)
>
> Yeah, I think de-duping is a bit of a lost cause. I'd say if you see 50
> copies of the same message arriving one after the other, go ahead and
> de-dupe, but in general, the volume is just too high to be able to humanly
> achieve this any more, so let's just not worry about it :(

Well, my dumb-as-stumps deduping script (`zap identically-scored
identical messages') eliminates some 5% of my corpus (hand-checked this
time to make sure they really are duplicates), but only 2% if I
constrain it to remove consecutive identical messages only. It takes
about two minutes to chew over 30,000 messages on this 500MHz UltraSPARC.
(It's critical to ignore all the headers when doing the comparisons,
especially Subject and From, as the spamware seems to perturb these
automatically even when it leaves Bayes poison and things in the body
alone.)
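The comparison described above can be sketched roughly as follows. This is
a minimal Python illustration under stated assumptions, not the actual
script: the (raw message, score, rule hits) tuple shape and the names
body_of and dedupe are hypothetical, invented for the example.

```python
import hashlib


def body_of(raw: bytes) -> bytes:
    """Return everything after the first blank line, i.e. drop all headers."""
    raw = raw.replace(b"\r\n", b"\n")
    _headers, sep, body = raw.partition(b"\n\n")
    return body if sep else b""


def dedupe(entries, consecutive_only=False):
    """Yield entries that are not duplicates of an earlier entry.

    entries: iterable of (raw_message, score, rule_hits) tuples
             (a hypothetical shape chosen for this sketch).
    Two entries count as duplicates when their bodies, scores and
    rule hits all match; headers such as Subject and From are ignored.
    With consecutive_only=True, an entry is compared only against the
    entry immediately before it.
    """
    seen = set()
    previous = None
    for raw, score, hits in entries:
        key = (hashlib.sha1(body_of(raw)).digest(), score, frozenset(hits))
        if consecutive_only:
            duplicate = key == previous
        else:
            duplicate = key in seen
            seen.add(key)
        previous = key
        if not duplicate:
            yield raw, score, hits
```

Hashing the body keeps memory use modest on a large corpus, since only a
digest per distinct message is retained rather than the message itself.
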
Doing it by hand is of course totally impractical (especially if you're
as lazy as I am). When I've finished debugging the script I'll post it
here or something so other people can laugh at it.

One downside: the comments inserted by mass-check describing skipped
messages get moved to the top of the mass-check log by the deduper (I
said it wasn't very smart). Since there should be no skippable messages
left in the spam corpus by the time the real mass-check happens, I doubt
this matters. ;)

--
`I lost interest in "blade servers" when I found they didn't throw knives
at people who weren't supposed to be in your machine room.'
--- Peter da Silva