I'm more than happy to discuss any of this offlist (please!!!), but can't get 
technical for fairly obvious reasons!

Cheers,

Steve

On Tue, 07 Nov 2006 14:40:16 +1300
Christopher Sawtell <[EMAIL PROTECTED]> wrote:

> On Thursday 02 November 2006 21:42, Steve Holdoway wrote:
> > Hey,
> >
> > If you're interested, bounce your ideas off us! If we can work out a new
> > way of fuzzy matching on these images, then you'll be helping the OSS cause
> > big time. I haven't manipulated images in this way in the best part of 20
> > years, so I'm a bit rusty!
> I did some very preliminary tests last night and found that the
> Tesseract OCR [1] engine made a fair to middling job of extracting text from 
> a couple of spam attachments. It worked much better when I enlarged the image 
> by about 3 times, and converted to pure black and white.
> 
> The accuracy of the text extraction does not matter all that much, does it? 
> Surely it's only got to be able to say that there is text in this image?
> 
> BUT it all puts such a load on the computer, something like several seconds
> to OCR an image, that it would appear to me that some sort of distributed system like 
> [EMAIL PROTECTED] would be needed to get sufficient speed.
> 
> I am somewhat interested in pursuing this more, but I'm not as up with the 
> state of the art with programming as I used to be.
> 
> [1] http://sourceforge.net/projects/tesseract-ocr
> 
> --
> CS
> 
> 
> 
> 

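For anyone who wants to experiment with the preprocessing described in the quoted message (enlarge the image about 3 times, then reduce it to pure black and white), here is a minimal pure-Python sketch. It works on a grayscale image held as rows of 0-255 integers. The function names, the threshold of 128, and the ten-letter cutoff for "there is text in this image" are all illustrative assumptions, not anything taken from the thread, and the OCR step itself (for example, running Tesseract on the result) is deliberately left out.

```python
def upscale(pixels, factor=3):
    """Nearest-neighbour enlargement of a grayscale image.

    `pixels` is a list of rows, each row a list of ints in 0..255.
    Every pixel becomes a factor x factor block of the same value.
    """
    out = []
    for row in pixels:
        wide = [value for value in row for _ in range(factor)]
        # Repeat the widened row `factor` times, copying so rows stay independent.
        out.extend(list(wide) for _ in range(factor))
    return out


def binarize(pixels, threshold=128):
    """Convert grayscale to pure black (0) and white (255).

    The threshold of 128 is an assumed default; real spam images
    may well need a different or adaptive value.
    """
    return [[255 if value >= threshold else 0 for value in row] for row in pixels]


def preprocess(pixels, factor=3, threshold=128):
    """Enlarge first, then binarize, in the order described in the quoted message."""
    return binarize(upscale(pixels, factor), threshold)


def looks_like_text(ocr_output, min_letters=10):
    """Crude yes/no check on whatever string the OCR engine returned.

    Accuracy of the extracted words is ignored; the only question is
    whether enough letters came back to say "this image contains text".
    The cutoff of 10 letters is a hypothetical value.
    """
    letters = sum(ch.isalpha() for ch in ocr_output)
    return letters >= min_letters
```

The idea would be to call preprocess() on each attachment, hand the result to the OCR engine, and pass the returned string to looks_like_text(). Nearest-neighbour scaling is used only because it is the simplest thing to write by hand; a smoother interpolation from an imaging library may give the OCR engine better input.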