Bill Randle wrote:
On Sat, 2006-10-28 at 16:54 -0400, Kris Deugau wrote:
Henrik Krohns wrote:
I don't get it... unless you have some big honeypot, maybe 5% of traffic
contains small images to be OCR'd. If your server can't handle that, I guess
it's running out of juice anyway. :)
Well... yeah.  <g>  The basic problem is that all the other garbage
(with the occasional inevitable exception) is getting caught by Clam
(viruses and most phishes) or SpamAssassin (all but a few text-based spams).

I've found *enough* similarities in the raw binary image data to
usefully make signatures for a lot of what is otherwise getting through;
at the moment this is just a stopgap until these machines can be retired.
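
The message does not say how those signatures were derived. As a rough
illustration only, here is one way to look for a byte run shared by several
spam images and print it as a hex string. The function names, the 16-byte
minimum, and the greedy pairwise approach are all assumptions made for this
sketch, not the poster's method.

```python
from difflib import SequenceMatcher
from functools import reduce
from typing import List, Optional


def common_run(a: bytes, b: bytes) -> bytes:
    """Longest contiguous byte run that appears in both a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


def candidate_signature(samples: List[bytes], min_len: int = 16) -> Optional[str]:
    """Hex string of a byte run shared by every sample, or None.

    Greedy: the run shared by the first two samples is narrowed against
    each further sample, so it can miss a longer run common to all.
    min_len (hypothetical default) rejects runs too short to be useful.
    """
    shared = reduce(common_run, samples)
    return shared.hex() if len(shared) >= min_len else None
```

SequenceMatcher is slow on large inputs, so this is only practical on small
images or on a slice of each file. Any candidate would still need checking
against legitimate mail before use.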

However, in the long run, OCR to feed the text to SpamAssassin's other
rules is a better solution;  it's much more flexible.

Indeed. For those interested in the topic of OCR to feed SpamAssassin,
there's an active project with its own mailing list that does just this.
It turns out to be a non-trivial task because many of these image spams
are animated GIFs, so you need to find the right frame to pass to the
OCR program.
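
To make the frame-selection problem concrete, here is a minimal sketch that
walks the block structure of a GIF and reads each frame's display delay from
its Graphic Control Extension. Picking the frame that is shown longest is
purely an assumed heuristic for this example, and the function names are
made up; this is not a description of how the plugin chooses a frame.

```python
import struct
from typing import List


def gif_frame_delays(data: bytes) -> List[int]:
    """Return the display delay (in 1/100 s) of every frame in a GIF."""
    if data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF file")

    def skip_sub_blocks(p: int) -> int:
        # Sub-blocks are length-prefixed; a zero length ends the chain.
        while data[p] != 0:
            p += 1 + data[p]
        return p + 1

    flags = data[10]                      # packed field of the screen descriptor
    pos = 13                              # header (6) + screen descriptor (7)
    if flags & 0x80:                      # global color table present
        pos += 3 * (2 << (flags & 0x07))

    delays: List[int] = []
    pending = 0                           # delay announced for the next image
    while pos < len(data):
        marker = data[pos]
        if marker == 0x3B:                # trailer
            break
        if marker == 0x21:                # extension block
            if data[pos + 1] == 0xF9:     # Graphic Control Extension
                pending = struct.unpack_from("<H", data, pos + 4)[0]
            pos = skip_sub_blocks(pos + 2)
        elif marker == 0x2C:              # image descriptor: one frame
            flags = data[pos + 9]
            pos += 10
            if flags & 0x80:              # local color table present
                pos += 3 * (2 << (flags & 0x07))
            pos = skip_sub_blocks(pos + 1)  # +1 skips the LZW code size byte
            delays.append(pending)
            pending = 0
        else:
            raise ValueError("unexpected block marker 0x%02X" % marker)
    return delays


def longest_shown_frame(data: bytes) -> int:
    """Index of the frame with the largest delay (assumed heuristic)."""
    delays = gif_frame_delays(data)
    if not delays:
        raise ValueError("GIF contains no frames")
    return max(range(len(delays)), key=delays.__getitem__)
```

The returned index could then be handed to whatever image tool extracts that
frame for the OCR program. A spammer can defeat this particular heuristic by
splitting the text across frames, which is part of why the task is hard.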

Start here: http://wiki.apache.org/spamassassin/FuzzyOcrPlugin then
subscribe to the Devel-Spam mailing list (there's a link on that page).


You might want to consider the next level of image spam before you go too far down the OCR path:

http://www.iss.net/threats/Animated%20GIF.html

dp
_______________________________________________
http://lurker.clamav.net/list/clamav-users.html
