I've been attempting to lighten the load for SpamAssassin a little by
creating signatures for the stock and pill spams that are flooding in
these days.  More specifically, I'm creating signatures for the attached
images in the spams.  (Upgrading SA, to be able to use OCR plugins and
so on, is not really possible, mostly due to system load.)

However, I'm having some odd problems with signatures that, so far as I
can tell, are *legitimate*, if perhaps a bit long.  Here's what I'm
doing to create signatures:

I take a set of images, manually sorted for rough similarity, and run
them through a script that calls sigtool --hex-dump, and picks out a
segment of the data.  (I started with just the first 400 characters of
hex, and pushed it up to 600;  with the current set I'm picking out ~600
characters starting with "2c00000000" from anywhere.)
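
A minimal Python sketch of this extraction step (an illustration, not the
actual script). The names hex_dump and extract_segment are hypothetical,
and it assumes sigtool --hex-dump reads the file on stdin and writes hex
to stdout. It also assumes the marker should only count when it starts on
a byte boundary, i.e. at an even offset in the hex text.

```python
import subprocess

MARKER = "2c00000000"   # marker from the description above
SEGMENT_LEN = 600       # number of hex characters to keep (300 bytes)


def hex_dump(path):
    """Return the lowercase hex dump of a file via `sigtool --hex-dump`.

    Assumes sigtool is on PATH and reads the file content from stdin.
    """
    with open(path, "rb") as fh:
        result = subprocess.run(
            ["sigtool", "--hex-dump"],
            stdin=fh,
            capture_output=True,
            check=True,
        )
    return result.stdout.decode("ascii").strip().lower()


def extract_segment(hex_text, marker=MARKER, length=SEGMENT_LEN):
    """Return `length` hex characters starting at the first byte-aligned marker.

    A hit at an odd offset straddles two bytes, so it is skipped.
    Returns None when no aligned marker is found.
    """
    pos = hex_text.find(marker)
    while pos != -1 and pos % 2 != 0:
        pos = hex_text.find(marker, pos + 1)
    if pos == -1:
        return None
    return hex_text[pos:pos + length]
```

Usage would be something like extract_segment(hex_dump("image.gif")) for
each image in the sorted set, where "image.gif" is a placeholder file name.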

I further sort the resulting data by hand to find similar data, and then
feed that through another script that splits each line up into octets
and notes which octet has been seen in which position for the entire
data set.  It then constructs what should be a "correct" signature that
will match each line of the input according to the rules for ClamAV
signatures.  (More than 5 different octets at a position get converted
to ??, and finally long segments of ??????...  get converted to {nn}.)
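
The construction described above can be sketched in Python roughly as
follows. This is an illustration under stated assumptions, not the actual
script: build_signature is a hypothetical name, input lines are truncated
to the shortest one, and the minimum run length at which ?? wildcards get
collapsed into {n} (min_run) is a guess, since only "long segments" is
specified.

```python
import re


def build_signature(lines, max_alternatives=5, min_run=4):
    """Build one hex signature that matches every input line.

    lines            -- hex strings, two characters per octet
    max_alternatives -- more distinct octets than this at a position give ??
    min_run          -- assumed threshold: runs of at least this many ??
                        wildcards are collapsed into {n}
    """
    # Only compare positions that exist in every line (whole octets only).
    length = min(len(line) for line in lines) // 2 * 2
    parts = []
    for i in range(0, length, 2):
        seen = sorted({line[i:i + 2].lower() for line in lines})
        if len(seen) == 1:
            parts.append(seen[0])                      # fixed octet
        elif len(seen) <= max_alternatives:
            parts.append("(" + "|".join(seen) + ")")   # alternation
        else:
            parts.append("??")                         # any octet
    signature = "".join(parts)
    # Collapse long runs of ?? into a single {n} skip of the same length.
    run = re.compile(r"(?:\?\?){%d,}" % min_run)
    return run.sub(lambda m: "{%d}" % (len(m.group(0)) // 2), signature)
```

For example, build_signature(["2c0011", "2c0022"]) gives "2c00(11|22)".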

However, far too often, ClamAV rejects it as a malformed signature.
Chopping {nn} bits off the end often fixes that issue, but not always;
in some cases I've had to trim further (aa|bb|cc) blocks, along with
trailing {nn} and/or ?? segments that may get "exposed" at the end.
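
The trimming described above could be automated along these lines. This is
only a sketch: trim_trailing_wildcards is a hypothetical helper, and it
does nothing beyond stripping wildcard tokens from the end of the string.
Whether the trimmed result is accepted is still up to ClamAV.

```python
import re

# Trailing ?? and {n} tokens only.
_TRAILING_SKIPS = re.compile(r"(?:\?\?|\{\d+\})+$")
# Trailing ??, {n} and (aa|bb|cc) tokens.
_TRAILING_ANY = re.compile(r"(?:\?\?|\{\d+\}|\([0-9a-fA-F|]+\))+$")


def trim_trailing_wildcards(signature, drop_alternations=False):
    """Strip wildcard tokens from the end of a signature.

    With drop_alternations=True, trailing (aa|bb|cc) blocks are removed
    as well, together with any ?? or {n} tokens that become exposed.
    """
    pattern = _TRAILING_ANY if drop_alternations else _TRAILING_SKIPS
    return pattern.sub("", signature)
```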

That still doesn't make a good signature for my purposes;  I often have
to trim *further* to get a signature that actually matches on the image
files I started with.  Manually spreading the data out shows it *should*
match fine before I've done any trimming.
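
The manual check can be approximated by translating a signature into a
Python regular expression and running it against the raw image bytes. This
is a stand-in for illustration, not ClamAV's matcher, so it cannot
reproduce any limits ClamAV itself applies. signature_to_regex is a
hypothetical helper, and it only understands the three constructs used
here: ??, {n}, and (aa|bb|cc).

```python
import re

_TOKEN = re.compile(r"\{(\d+)\}|\?\?|\(([0-9a-f|]+)\)|([0-9a-f]{2})")


def signature_to_regex(signature):
    """Translate a hex signature into a compiled bytes regex."""
    signature = signature.lower()
    out = []
    pos = 0
    while pos < len(signature):
        m = _TOKEN.match(signature, pos)
        if m is None:
            raise ValueError("unsupported token at offset %d" % pos)
        if m.group(1) is not None:
            out.append(b".{%d}" % int(m.group(1)))        # {n}: skip n bytes
        elif m.group(2) is not None:
            alts = [re.escape(bytes.fromhex(a)) for a in m.group(2).split("|")]
            out.append(b"(?:" + b"|".join(alts) + b")")   # (aa|bb|cc)
        elif m.group(3) is not None:
            out.append(re.escape(bytes.fromhex(m.group(3))))  # literal byte
        else:
            out.append(b".")                              # ??: any one byte
        pos = m.end()
    # DOTALL so that "." also matches the byte 0x0a.
    return re.compile(b"".join(out), re.DOTALL)
```

Reading an image with open(path, "rb").read() and calling
signature_to_regex(sig).search(data) then shows whether the untrimmed
signature matches the data it was built from.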

From the problems I'm having with supposedly malformed signatures, it
looks like there's an effective complexity limit;  from the problems in
*matching* a signature that's finally been found to be acceptable, it
looks like there's a (lower) limit on what ClamAV can actually use in
matching.

Any suggestions on what I might be doing wrong?

I can post the scripts and some example signatures if needed.

-kgd
_______________________________________________
http://lurker.clamav.net/list/clamav-users.html