> I dabbled in the possibility of using some automated OCR processing to
> handle this mountain of work, and the initial results seemed promising
> (using Tesseract, a Python wrapper (PyTesser), and the Python Imaging
> Library (PIL)). I think what makes our project highly amenable to
> OCR is: (1) segmentation is easy because the pages are all alike, and
> it is not too hard to find the desired column regions on each page
> with some simple image processing routines; (2) the characters are
> typically numeric only; (3) there are usually checksums in columns
> and rows.

That preprocessing work should then be usable with OCRopus as well.
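
The checksums in point (3) can also be used after recognition, whichever engine produces the digits. Below is a minimal sketch in plain Python, assuming (this layout is my assumption, not something stated above) that the last cell of each row is the row total and the last row holds the column totals. The function names are hypothetical and not part of Tesseract or OCRopus.

```python
def failing_rows(table):
    """Return indices of rows whose last cell is not the sum of the other cells."""
    bad = []
    for i, row in enumerate(table):
        *values, total = row
        if sum(values) != total:
            bad.append(i)
    return bad


def failing_columns(table):
    """Return indices of columns whose total (assumed to be in the last row)
    does not match the sum of the cells above it."""
    *body, totals = table
    bad = []
    for j, total in enumerate(totals):
        if sum(row[j] for row in body) != total:
            bad.append(j)
    return bad


# Recognized table with one misread digit: row 1, column 1 should be 5, not 6.
ocr_table = [
    [1, 2, 3],
    [4, 6, 9],
    [5, 7, 12],
]
suspect_cells = [(r, c) for r in failing_rows(ocr_table)
                 for c in failing_columns(ocr_table)]
```

A row that fails together with a column that fails points at the cell where they cross, which is a reasonable first candidate for re-recognition or manual review.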

> Regarding (2), in an earlier version of Tesseract, I was able to make
> a kludgy modification to the program to allow dynamic setting of the
> recognition character whitelist via an environment variable, so I
> could restrict the allowed character set, and change it, depending
> upon the column on the page I was working with. A crude but effective
> language model. I don't know how to get the same dynamic character
> whitelist hack to work in the recent version of Tesseract (the code
> has changed substantially), and I am wondering if something similar
> might be possible to implement in OCRopus?

There are two aspects to that.  First, there is the character
recognizer, which you probably want to retrain for numerical data.
Second, there is the language model; it essentially allows you to
specify a set of regular expressions or a grammar for your data.
Unlike ordinary regular expressions or grammars, though, the language
models in OCRopus are weighted by probabilities, so they can prefer
likely strings rather than only accept or reject them.
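
To illustrate the difference between a hard whitelist and a weighted model, here is a rough sketch in plain Python. It is not OCRopus code and does not use any OCRopus API. It assumes the recognizer returns, for each character position, a few candidate characters with probabilities; the function `best_string` and the per-column weight table are hypothetical names made up for this example.

```python
import math


def best_string(candidates, char_weights):
    """Pick the most likely character at each position.

    candidates: list of dicts, one per position, mapping a candidate
        character to the recognizer's probability for it.
    char_weights: dict mapping each allowed character to a language-model
        probability; characters missing from it are treated as disallowed.
    """
    result = []
    for position in candidates:
        best_char, best_score = None, -math.inf
        for ch, p_rec in position.items():
            p_lm = char_weights.get(ch, 0.0)
            if p_rec <= 0.0 or p_lm <= 0.0:
                continue  # disallowed by the model or impossible per recognizer
            # Combine recognizer and language-model evidence in log space.
            score = math.log(p_rec) + math.log(p_lm)
            if score > best_score:
                best_char, best_score = ch, score
        # Mark positions where no allowed candidate survived.
        result.append(best_char if best_char is not None else "?")
    return "".join(result)


# A numeric column: all ten digits are equally likely, nothing else is allowed.
digits_only = {d: 0.1 for d in "0123456789"}

# Hypothetical recognizer output for a three-character field.
field = [
    {"l": 0.6, "1": 0.4},
    {"O": 0.5, "0": 0.45, "o": 0.05},
    {"7": 0.9, "/": 0.1},
]
text = best_string(field, digits_only)
```

Using a different weight table per column gives the same effect as switching the whitelist per column, and unequal weights let the model express a preference instead of a hard restriction. This sketch scores each position independently; a grammar or regular-expression style model would also constrain sequences of characters.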

There will be more examples and information about how to do this kind
of customization starting with the beta release.

Tom
