Matt Hampton writes:
> Justin Mason wrote:
> 
> > Well, it'd be worth cc'ing the dev list, if that's ok.   With any luck
> > there'll be future people trying similar stuff and it'll be handy to have
> > a thread URL to point at ;)
> 
> Quick intro - I have been working on automatically generating rules from
> the Sane Security Clamav signatures.  With a fair bit of help from 
> Justin I have something up and running so I wanted to share what I have 
> done so far to see what people think and for some feedback.
> 
> I have a small perl script that extracts the rules from the scam.ndb and 
> phish.ndb files and generates 2 MAMMOTH rulesets (60000 rules!).
> 
> I then run a mass-check and then hit-frequencies.
> 
> Then the selection of rules to import is based on Justin's suggestion:
> > More or less -- I'd keep it even simpler.  Select if column 2 ("SPAM %
> > hit") > 0.5, and discard if column 3 ("HAM % hit") > 0.
> >
> > The reason is, this is an automatically generated ruleset -- avoiding FPs
> > in auto-generated stuff is critical in my opinion.  Some of those are
> > pretty bad: an 8.8% false positive rate, ouch!!
> >
> > The rule of thumb for false positives is that you will only see a fraction
> > of the "real-world" false positive rate in any measurement, since the
> > degree of variation between people's ham collections can be very large.
> >
> Finally I run a mkrules (that took a while to work out where all the 
> files had to be - either that or I can't read documentation ;-))

er, yeah, sorry about the lack of documentation on that tool ;)

> And have a first stab at a ruleset available:
> 
> http://www.coders.co.uk/80_sane.cf
> 
> I am concerned with the results of some of the rules e.g.
> 
> ##{ SANE_f48d6d7bf39ebd0b4e830b808d5b45bd
> body SANE_f48d6d7bf39ebd0b4e830b808d5b45bd /\.cn\//
> describe SANE_f48d6d7bf39ebd0b4e830b808d5b45bd Email.Malware.Sanesecurity.08022207u
> score SANE_f48d6d7bf39ebd0b4e830b808d5b45bd 0.01
> ##} SANE_f48d6d7bf39ebd0b4e830b808d5b45bd

yeah, that seems a bit dangerous.

It might be worthwhile discarding rules that are less than a certain
length, in characters.  That's another thing the "sought" ruleset does.
It, again, reduces FPs nicely.
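
A rough sketch of that kind of length filter, in Python purely for
illustration.  The function names and the 8-character cutoff are made up
for this example (they are not what "sought" actually uses), and it only
handles patterns built from plain literals and backslash escapes:

```python
import re

# Hypothetical cutoff, chosen only for illustration.
MIN_LITERAL_CHARS = 8

# Matches lines such as:  body RULE_NAME /pattern/
BODY_RULE = re.compile(r"^body\s+(\S+)\s+/(.*)/\w*\s*$")


def literal_length(pattern):
    """Count the characters matched by a pattern of literals and \\x escapes."""
    # Each backslash escape such as "\." stands for one matched character.
    return len(re.sub(r"\\(.)", r"\1", pattern))


def is_long_enough(rule_line, min_chars=MIN_LITERAL_CHARS):
    """Return True if a 'body' rule matches at least min_chars characters."""
    match = BODY_RULE.match(rule_line.strip())
    if match is None:
        return False  # not a simple body rule, so do not keep it
    return literal_length(match.group(2)) >= min_chars


if __name__ == "__main__":
    rule = r"body SANE_f48d6d7bf39ebd0b4e830b808d5b45bd /\.cn\//"
    print(is_long_enough(rule))
```

The /\.cn\// rule quoted above matches only 4 characters (".cn/"), so any
cutoff above 4 would discard it.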

Also, is there any way to get it to produce _more_ rules? That 80_sane.cf
seems pretty short, compared to the 60k-rules input ;)   Sounds like
0.5% spam hits is too high a threshold, I think.
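
To make it easy to experiment with that threshold, the selection step
could look something like the sketch below (Python, for illustration
only).  It assumes whitespace-separated hit-frequencies output with SPAM%
in column 2, HAM% in column 3 and the rule name in the last column; the
function name and the sample lines are invented for the example:

```python
def select_rules(freq_lines, min_spam_pct=0.5, max_ham_pct=0.0):
    """Keep rules with SPAM% above min_spam_pct and HAM% not above max_ham_pct."""
    selected = []
    for line in freq_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or truncated line
        try:
            spam_pct = float(fields[1])  # column 2: SPAM % hit
            ham_pct = float(fields[2])   # column 3: HAM % hit
        except ValueError:
            continue  # header line or other non-numeric row
        if spam_pct > min_spam_pct and ham_pct <= max_ham_pct:
            selected.append(fields[-1])  # assumed: rule name is the last column
    return selected


if __name__ == "__main__":
    # Made-up sample rows, not real mass-check results.
    sample = [
        "OVERALL%  SPAM%  HAM%  S/O  RANK  SCORE  NAME",
        "  1.200  1.500  0.000  1.000  0.90  0.01  SANE_aaaa",
        "  0.300  0.350  0.000  1.000  0.70  0.01  SANE_bbbb",
        "  4.000  3.000  8.800  0.254  0.10  0.01  SANE_cccc",
    ]
    print(select_rules(sample))
    print(select_rules(sample, min_spam_pct=0.1))
```

Lowering min_spam_pct lets more rules through, while the HAM% check stays
at zero so nothing that hits ham gets selected.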

> Sorry the rule names are long - I haven't truncated the hash yet!
> 
> It isn't automatically updating at the moment and all of the scores are 
> set to 0.01

btw you can also safely drop the "require_version" line, that only makes
sense as part of the SpamAssassin source tree.

--j.
