Justin Mason wrote:
Well, it'd be worth cc'ing the dev list, if that's ok. With any luck
there'll be future people trying similar stuff and it'll be handy to have
a thread URL to point at ;)
Quick intro - I have been working on automatically generating rules from
the Sane Security ClamAV signatures. With a fair bit of help from
Justin I have something up and running, so I wanted to share what I have
done so far and get some feedback.
I have a small Perl script that extracts the rules from the scam.ndb and
phish.ndb files and generates 2 MAMMOTH rulesets (60000 rules!).
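
To make the extraction step concrete, here is a rough sketch in Python (not
the Perl script mentioned above). It assumes the usual ClamAV .ndb layout of
MalwareName:TargetType:Offset:HexSignature, handles only plain hex signatures
(anything containing ClamAV wildcards is skipped), and assumes the rule name
is SANE_ plus an MD5 of the hex signature. The post does not say what is
actually hashed, so that part is a guess. The function names are made up for
illustration.

```python
import hashlib


def hex_to_regex(hex_sig):
    """Turn a plain hex signature into a Perl-style regex body.

    Returns None when the signature is not pure hex (for example when it
    uses ClamAV wildcards such as ?? or *), since this sketch skips those.
    """
    try:
        raw = bytes.fromhex(hex_sig)
    except ValueError:
        return None
    if not raw:
        return None
    parts = []
    for b in raw:
        ch = chr(b)
        if b < 128 and ch.isalnum():
            parts.append(ch)               # ASCII letters and digits as-is
        elif 32 <= b < 127:
            parts.append("\\" + ch)        # escape printable punctuation, incl. "/"
        else:
            parts.append("\\x%02x" % b)    # everything else as a hex escape
    return "".join(parts)


def ndb_line_to_rule(line, score="0.01"):
    """Build body/describe/score lines for one .ndb entry, or None if skipped."""
    fields = line.strip().split(":")
    if len(fields) < 4:
        return None
    name, _target, _offset, hex_sig = fields[:4]
    regex = hex_to_regex(hex_sig)
    if regex is None:
        return None
    # Assumption: the rule name is derived from an MD5 of the hex signature.
    rule = "SANE_" + hashlib.md5(hex_sig.encode("ascii")).hexdigest()
    return "\n".join([
        "body %s /%s/" % (rule, regex),
        "describe %s %s" % (rule, name),
        "score %s %s" % (rule, score),
    ])


if __name__ == "__main__":
    # Illustrative entry only; not copied from the real scam.ndb or phish.ndb.
    print(ndb_line_to_rule("Email.Malware.Sanesecurity.08022207u:4:*:2e636e2f"))
```

For the illustrative entry, the hex 2e636e2f decodes to ".cn/", so the body
pattern comes out as /\.cn\//. Skipping wildcard signatures keeps the sketch
short; a real converter would need to translate them into regex equivalents.
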
I then run mass-check and then hit-frequencies.
Then the selection of rules to import is based on Justin's suggestion:
More or less -- I'd keep it even simpler. Select if column 2 ("SPAM %
hit") > 0.5, and discard if column 3 ("HAM % hit") > 0.
The reason is, this is an automatically generated ruleset -- avoiding FPs
in auto-generated stuff is critical in my opinion. Some of those are
pretty bad: an 8.8% false positive rate, ouch!!
The rule of thumb for false positives is that you will only see a fraction
of the "real-world" false positive rate in any measurement, since the
degree of variation between people's ham collections can be very large.
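
The selection rule quoted above can be sketched roughly as follows, again in
Python. It assumes the hit-frequencies output is whitespace-separated with
OVERALL%, SPAM% and HAM% as the first three columns and the rule name as the
last one; the function name and the sample lines are made up for
illustration, not taken from a real run.

```python
def select_rules(freq_lines, min_spam_pct=0.5):
    """Keep rules with SPAM% above the threshold and no ham hits at all."""
    selected = []
    for line in freq_lines:
        fields = line.split()
        if len(fields) < 4:
            continue
        try:
            spam_pct = float(fields[1])   # column 2: "SPAM % hit"
            ham_pct = float(fields[2])    # column 3: "HAM % hit"
        except ValueError:
            continue                      # header or other non-data line
        if spam_pct > min_spam_pct and ham_pct <= 0:
            selected.append(fields[-1])   # assumed: rule name is the last column
    return selected


if __name__ == "__main__":
    sample = [
        "OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME",
        "  1.200   2.400   0.000   1.000   0.90    0.01  SANE_aaa",
        "  4.000   0.900   8.800   0.093   0.10    0.01  SANE_bbb",
        "  0.100   0.200   0.000   1.000   0.50    0.01  SANE_ccc",
    ]
    print(select_rules(sample))
```

With the made-up sample, only SANE_aaa survives: SANE_bbb hits ham and
SANE_ccc is below the 0.5% spam threshold. Treating any ham hit as
disqualifying follows the quoted advice that false positives in
auto-generated rules are the thing to avoid.
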
Finally I run mkrules (it took a while to work out where all the
files had to be - either that or I can't read documentation ;-))
And I have a first stab at a ruleset available:
http://www.coders.co.uk/80_sane.cf
I am concerned about the results of some of the rules, e.g.:
##{ SANE_f48d6d7bf39ebd0b4e830b808d5b45bd
body SANE_f48d6d7bf39ebd0b4e830b808d5b45bd /\.cn\//
describe SANE_f48d6d7bf39ebd0b4e830b808d5b45bd Email.Malware.Sanesecurity.08022207u
score SANE_f48d6d7bf39ebd0b4e830b808d5b45bd 0.01
##} SANE_f48d6d7bf39ebd0b4e830b808d5b45bd
Sorry the rule names are long - I haven't truncated the hash yet!
It isn't automatically updating at the moment, and all of the scores are
set to 0.01.
matt