Martin Blapp wrote:

> I already log possible text (I count alphanumeric chars in the OCR output)

I think it would be interesting to add a new text/plain part to the e-mail
consisting of the OCR'd text, and feed that into Bayes.  Even if OCR gets
some words wrong, I bet the same misspelled tokens would quickly rise
to the top of the "spammy" token list.
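
To make the idea concrete, here is a rough sketch in Python using only the
standard email package. It is not how MIMEDefang does it; the function name
add_ocr_part and the ocr callable are hypothetical, and the OCR engine itself
is assumed to exist elsewhere. It collects OCR output from every image part
and appends it as one extra text/plain part:

```python
from email.message import Message
from email.mime.text import MIMEText
from typing import Callable


def add_ocr_part(msg: Message, ocr: Callable[[bytes], str]) -> bool:
    """Append OCR output from image parts as a new text/plain part.

    `ocr` is a hypothetical callable: raw image bytes in, recognized text out.
    Returns True if a part was added, False otherwise.
    """
    if not msg.is_multipart():
        return False  # a single-part message has no image attachments

    texts = []
    # Snapshot the parts first so we do not modify the tree while walking it.
    for part in list(msg.walk()):
        if part.get_content_maintype() != "image":
            continue
        data = part.get_payload(decode=True)  # undo base64/quoted-printable
        if not data:
            continue
        text = ocr(data).strip()
        if text:
            texts.append(text)

    if not texts:
        return False

    # One extra text part holding all OCR output, attached at the top level.
    msg.attach(MIMEText("\n\n".join(texts), "plain", "utf-8"))
    return True
```

With the extra part in place, a Bayesian filter that tokenizes text parts
should see the OCR'd words (including any consistently garbled ones) like any
other body text, with no changes to the filter itself.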

We did some tests along these lines and, as a side benefit, discovered
some SARE stock-scam tests firing on the OCR output.

Regards,

David.
_______________________________________________
NOTE: If there is a disclaimer or other legal boilerplate in the above
message, it is NULL AND VOID.  You may ignore it.

Visit http://www.mimedefang.org and http://www.roaringpenguin.com
MIMEDefang mailing list [email protected]
http://lists.roaringpenguin.com/mailman/listinfo/mimedefang