On Thursday 02 November 2006 21:42, Steve Holdoway wrote:
> Hey,
>
> If you're interested, bounce your ideas off us! If we can work out a new
> way of fuzzy matching on these images, then you'll be helping the OSS cause
> big time. I haven't manipulated images in this way in the best part of 20
> years, so I'm a bit rusty!

I did some very preliminary tests last night and found that the Tesseract OCR [1] engine made a fair to muddling hack of extracting text from a couple of spam attachments. It worked much better when I enlarged the image by about 3 times, and converted to pure black and white.
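
For anyone wanting to try the same preprocessing, here is a minimal sketch of the enlarge-and-threshold step in plain Python. It is not the procedure actually used in the tests above, only an illustration under assumptions: the image is taken to be already decoded into rows of grayscale values (0-255), the helper names are made up for this example, the scaling is simple nearest-neighbour, and the threshold of 128 is an arbitrary starting point. Only the factor of 3 comes from the tests described above.

```python
def enlarge(pixels, factor=3):
    """Nearest-neighbour upscale of a grayscale image (rows of 0-255 ints)."""
    out = []
    for row in pixels:
        # Repeat every pixel `factor` times horizontally...
        wide = [value for value in row for _ in range(factor)]
        # ...and repeat the widened row `factor` times vertically.
        out.extend(list(wide) for _ in range(factor))
    return out


def to_black_and_white(pixels, threshold=128):
    """Map every pixel to pure black (0) or pure white (255).

    The threshold is an assumption; real spam images may need tuning.
    """
    return [[255 if value >= threshold else 0 for value in row]
            for row in pixels]


def preprocess(pixels, factor=3, threshold=128):
    """Enlarge first, then reduce to two levels, ready to hand to an OCR engine."""
    return to_black_and_white(enlarge(pixels, factor), threshold)


if __name__ == "__main__":
    tiny = [[10, 200],
            [130, 90]]
    for row in preprocess(tiny):
        print(row)
```

The result would still have to be written out in an image format the OCR engine accepts before running it; that step is left out here because it depends on the imaging library in use.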
The accuracy of the text extraction does not matter all that much, does it? Surely it only has to be able to say that there is text in the image.

BUT it puts such a load on the computer, several seconds to OCR a single image, that it seems to me some sort of distributed system like [EMAIL PROTECTED] would be needed to get sufficient speed.

I am somewhat interested in pursuing this further, but I'm not as up with the state of the art in programming as I used to be.

[1] http://sourceforge.net/projects/tesseract-ocr

--
CS
