I got an idea the other day: detecting whether an image contains mostly text is much easier than recognizing what the text says (OCR).

So I wrote a program that does just that. It's short and it's in Python. It's not polished or fast; I loop through every pixel at the Python level, which is avoidable. Decoding the image is necessary, though.

But mostly the idea and proof of concept might be of interest for catching image spam, on the assumption that images that look like a block of text probably are spam.

I'm lazy and it's been a while since I've touched Perl, so the odds of me getting anywhere with integrating this into SpamAssassin are low. Does anyone else want it?

The code is at http://mawbid.com/detext.py
The code plus a few text and non-text images for testing are at http://mawbid.com/detext.tar.gz (2.2MB). It's on a slow ADSL line, so don't fetch it if you have images around for testing.

You need PIL to run it (on Debian, the package is python-imaging), and Psyco is used if available (python-psyco on Debian).


What the code does is basically this:

Decode the image.
For each horizontal line:
        Calculate its "frequency" (how often you go from light to dark). This measure is very high for text.
        Store whether the frequency is above a threshold.
See how many runs of above-threshold lines there are and how long they are. Throw away runs of 5 lines or less (assuming text would be taller than 5 pixels). Check whether a few run lengths are common (corresponding to the regular beat of over/under-threshold lines expected in text).
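The steps above could be sketched roughly like this. This is a simplified reconstruction, not the actual detext.py: it works on a plain list of grayscale rows instead of a PIL image, and the constants (DARK, FREQ_THRESHOLD, the "few common run lengths" ratio) are illustrative guesses, not the ones the program uses.

```python
from collections import Counter

DARK = 128          # grayscale value below which a pixel counts as dark (guess)
FREQ_THRESHOLD = 6  # light-to-dark transitions needed to call a line "busy" (guess)
MIN_RUN = 6         # throw away runs of 5 lines or less

def line_frequency(row):
    """Count light-to-dark transitions along one row of grayscale pixels."""
    transitions = 0
    prev_dark = False
    for px in row:
        dark = px < DARK
        if dark and not prev_dark:
            transitions += 1
        prev_dark = dark
    return transitions

def looks_like_text(pixels):
    """pixels: a list of rows, each row a list of grayscale values (0-255)."""
    busy = [line_frequency(row) >= FREQ_THRESHOLD for row in pixels]
    # Collect the lengths of consecutive runs of above-threshold lines.
    runs, length = [], 0
    for b in busy:
        if b:
            length += 1
        elif length:
            runs.append(length)
            length = 0
    if length:
        runs.append(length)
    runs = [r for r in runs if r >= MIN_RUN]
    if not runs:
        return False
    # Text has a regular beat: a few run lengths should account for most runs.
    counts = Counter(runs)
    common = sum(c for _, c in counts.most_common(2))
    return common / len(runs) > 0.5
```

With PIL, `pixels` could come from something like `list(img.convert("L").getdata())` reshaped into rows; a text-like image (regular bands of busy lines separated by blank gaps) comes out True, while a blank or low-frequency image comes out False.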

If you find that the code fails with a larger/different set of test images, you can play around with the constants and add a high threshold to deal with stipple areas. The constants in the program are just guesses. I tweaked them over the course of just a few runs and got a 100% detection rate. I was kind of amazed.

