I'm more than happy to discuss any of this offlist (please!!!), but can't get technical for fairly obvious reasons!
Cheers,

Steve

On Tue, 07 Nov 2006 14:40:16 +1300
Christopher Sawtell <[EMAIL PROTECTED]> wrote:

> On Thursday 02 November 2006 21:42, Steve Holdoway wrote:
> > Hey,
> >
> > If you're interested, bounce your ideas off us! If we can work out a new
> > way of fuzzy matching on these images, then you'll be helping the OSS cause
> > big time. I haven't manipulated images in this way in the best part of 20
> > years, so I'm a bit rusty!
> I did some very preliminary tests last night and found that the
> Tesseract OCR [1] engine made a fair to muddling hack of extracting text from
> a couple of spam attachments. It worked much better when I enlarged the image
> by about 3 times, and converted to pure black and white.
>
> The accuracy of the text extraction does not matter all that much does it?
> Surely it's only got to be able to say that there is text in this image?
>
> BUT it all takes such a load on the computer, like several seconds to OCR an
> image, that it would appear to me that some sort of distributed system like
> [EMAIL PROTECTED] would be needed to get sufficient speed.
>
> I am somewhat interested in pursuing this more, but I'm not as up with the
> state of the art with programming as I used to be.
>
> [1] http://sourceforge.net/projects/tesseract-ocr
>
> --
> CS
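The preprocessing described in the quoted message (enlarge the image about 3 times, then convert it to pure black and white before OCR) can be sketched roughly as below. This is only an illustration under assumptions, not what was actually run: the image is taken to be a plain grid of grey levels from 0 to 255, the enlargement is nearest-neighbour, the cut-off of 128 is arbitrary, and the function names are hypothetical. The call to Tesseract itself is left out.

```python
def enlarge(pixels, factor=3):
    """Nearest-neighbour upscale of a 2D grid of grey levels (0-255)."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    out = []
    for row in pixels:
        # Repeat every pixel `factor` times across the row...
        wide = [value for value in row for _ in range(factor)]
        # ...then repeat the widened row `factor` times down the image.
        out.extend(list(wide) for _ in range(factor))
    return out


def to_black_and_white(pixels, threshold=128):
    """Map every grey level to 0 (black) or 255 (white).

    The threshold of 128 is an assumed value; real spam images
    would probably need it tuned per image.
    """
    return [[255 if value >= threshold else 0 for value in row]
            for row in pixels]


def preprocess(pixels, factor=3, threshold=128):
    """Enlarge first, then reduce to pure black and white."""
    return to_black_and_white(enlarge(pixels, factor), threshold)


if __name__ == "__main__":
    tiny = [[0, 200],
            [130, 90]]
    for row in preprocess(tiny):
        print(row)
```

The result of preprocess() would then be written out as an image file and handed to the OCR engine. Since the several seconds per image are spent in the OCR step, this part does nothing to reduce the load mentioned above.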
