Re: Document Management

Alex Dean Tue, 03 Nov 2009 18:15:01 -0800


On Nov 2, 2009, at 11:32 AM, Craig White wrote:

On Mon, 2009-11-02 at 08:31 -0700, Matt Graham wrote:

I spent 3 or 4 years doing stuff like this on the NYT, Wall Street
Journal, Christian Science Monitor, and Boston Globe.  You will NOT
be able to get decent OCR with free software.  Newspapers require
a different approach than most OCR packages take; you have to split
each article up into multiple individual image files and OCR each
file separately, then stitch the results back together.  And editing
the results is totally necessary since newspaper text is so horrible
in quality.
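A rough sketch of the split / OCR / stitch workflow described above, in Python. It assumes the article has already been cropped into one image file per region, listed in reading order, and that the tesseract command-line tool is on the PATH. The names run_tesseract, stitch, and ocr_article are made up for illustration, and the hyphen handling is a naive heuristic, not what any newspaper pipeline actually used.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def run_tesseract(image_path: str) -> str:
    """OCR one image file with the tesseract CLI (assumed to be installed)."""
    with tempfile.TemporaryDirectory() as tmp:
        out_base = Path(tmp) / "out"
        # `tesseract <image> <outputbase>` writes its text to <outputbase>.txt
        subprocess.run(
            ["tesseract", image_path, str(out_base)],
            check=True,
            capture_output=True,
        )
        return out_base.with_suffix(".txt").read_text(encoding="utf-8")


def stitch(pieces: Iterable[str]) -> str:
    """Join per-region OCR text in order, one paragraph per region."""
    paragraphs = []
    for text in pieces:
        merged = ""
        for line in (ln.strip() for ln in text.splitlines()):
            if not line:
                continue  # drop blank lines inside a region
            if merged.endswith("-"):
                # Naive: treat a trailing hyphen as a word split across lines.
                merged = merged[:-1] + line
            elif merged:
                merged += " " + line
            else:
                merged = line
        if merged:
            paragraphs.append(merged)
    return "\n\n".join(paragraphs)


def ocr_article(
    region_paths: Iterable[str],
    ocr: Callable[[str], str] = run_tesseract,
) -> str:
    """OCR each region image separately, then stitch the results together."""
    return stitch(ocr(path) for path in region_paths)
```

Passing the OCR step in as a function makes it possible to swap engines or try the stitching on canned text first. The output would still need the manual editing pass mentioned above.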

----
I don't know anything about GOCR at all.

A few years ago I set up tesseract and it worked as well as I have seen
any OCR program work (in terms of accuracy) though clearly there are
many limitations compared to something like Omnipage. In the end it was
rather easy to install and get it working.

http://code.google.com/p/tesseract-ocr/

Google uses tesseract in their ocropus project. Ocropus seems promising, but is still at a fairly early stage.

http://code.google.com/p/ocropus/

alex


---------------------------------------------------
PLUG-discuss mailing list - [email protected]
To subscribe, unsubscribe, or to change your mail settings:
http://lists.PLUG.phoenix.az.us/mailman/listinfo/plug-discuss
