I got an idea the other day: detecting whether an image contains mostly text is much easier than recognizing what the text says (OCR).

So I wrote a program that does just that. It's short and it's in Python. It's not polished or fast; I loop through every pixel at the Python level, which is avoidable. Decoding the image is necessary, though.

But mostly the idea and proof of concept might be of interest for catching image spam, on the assumption that images that look like a block of text probably are spam.

I'm lazy and it's been a while since I've touched Perl, so the odds of me getting anywhere with integrating this into SpamAssassin are low. Does anyone else want it?

The code is at http://mawbid.com/detext.py
The code plus a few text and non-text images for testing are at http://mawbid.com/detext.tar.gz (2.2MB). It's on a slow ADSL line, so don't fetch it if you have images around for testing.

You need PIL to run it (on Debian, the package is python-imaging), and Psyco is used if available (python-psyco on Debian).


What the code does is basically this:

Decode the image.
For each horizontal line:
        Calculate its "frequency" (how often you go from light to dark). This measure is very high for text.
        Store whether the frequency is above a threshold.
See how many runs of above-threshold lines there are and how long they are. Throw away runs of 5 lines or less (assuming text would be taller than 5 pixels). Check whether a few run lengths are common (corresponding to the regular beat of over/under-threshold lines expected in text).
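The steps above could be sketched roughly like this. This is a simplified reconstruction, not the actual detext.py: it works on a plain list of grayscale rows instead of a PIL image, and the constants (DARK, FREQ_THRESHOLD, the "few common run lengths" ratio) are illustrative guesses, not the ones the program uses.

```python
from collections import Counter

DARK = 128          # grayscale value below which a pixel counts as dark (guess)
FREQ_THRESHOLD = 6  # light-to-dark transitions needed to call a line "busy" (guess)
MIN_RUN = 6         # throw away runs of 5 lines or less

def line_frequency(row):
    """Count light-to-dark transitions along one row of grayscale pixels."""
    transitions = 0
    prev_dark = False
    for px in row:
        dark = px < DARK
        if dark and not prev_dark:
            transitions += 1
        prev_dark = dark
    return transitions

def looks_like_text(pixels):
    """pixels: a list of rows, each row a list of grayscale values (0-255)."""
    busy = [line_frequency(row) >= FREQ_THRESHOLD for row in pixels]
    # Collect the lengths of consecutive runs of above-threshold lines.
    runs, length = [], 0
    for b in busy:
        if b:
            length += 1
        elif length:
            runs.append(length)
            length = 0
    if length:
        runs.append(length)
    runs = [r for r in runs if r >= MIN_RUN]
    if not runs:
        return False
    # Text has a regular beat: a few run lengths should account for most runs.
    counts = Counter(runs)
    common = sum(c for _, c in counts.most_common(2))
    return common / len(runs) > 0.5
```

With PIL, `pixels` could come from something like `list(img.convert("L").getdata())` reshaped into rows; a text-like image (regular bands of busy lines separated by blank gaps) comes out True, while a blank or low-frequency image comes out False.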

If you find that the code fails with a larger/different set of test images, you can play around with the constants and add a high threshold to deal with stipple areas. The constants in the program are just guesses. I tweaked them over the course of just a few runs and got a 100% detection rate. I was kind of amazed.

