We have a similar historical effort at the University of California,
Berkeley, involving printed mortality statistics for the United
States and other countries (Japan, France). In our case, the source
material is page after page of nearly identically formatted tables of
numbers.

I dabbled in using automated OCR to handle this mountain of work, and
the initial results seemed promising (using Tesseract, a Python
wrapper (PyTesser), and the Python Imaging Library (PIL)). I think
what makes our project highly amenable to OCR is: (1) segmentation is
easy, because the pages are all alike and it is not too hard to find
the desired column regions on each page with some simple image
processing routines; (2) the characters are typically numeric only;
(3) there are usually check sums in the columns and rows, which can
catch recognition errors automatically (a rough sketch of points (1)
and (3) follows).
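For instance, here is a minimal sketch of what I mean by (1) and (3),
assuming clean scans with white gutters between the columns. The
white_threshold value, the file name, and the convention that the
check sum sits in the last row of a column are assumptions about our
particular tables, not general rules:

    from PIL import Image

    def find_column_gaps(page, white_threshold=250):
        """Locate the white gutters between printed columns with a
        crude vertical projection profile (point (1) above)."""
        gray = page.convert("L")
        width, height = gray.size
        pixels = gray.load()
        gaps, start = [], None
        for x in range(width):
            blank = all(pixels[x, y] >= white_threshold
                        for y in range(height))
            if blank and start is None:
                start = x
            elif not blank and start is not None:
                gaps.append((start, x))
                start = None
        if start is not None:
            gaps.append((start, width))
        return gaps

    def column_checks_out(values):
        """Point (3): the printed check sum in the last row should
        equal the sum of the entries above it."""
        return len(values) > 1 and values[-1] == sum(values[:-1])

    if __name__ == "__main__":
        page = Image.open("table_page.png")    # hypothetical scan
        print(find_column_gaps(page))
        print(column_checks_out([10, 20, 30, 60]))    # True

Columns that fail the check-sum test can be flagged for a second pass
or manual review rather than silently accepted.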

Regarding (2): in an earlier version of Tesseract, I was able to make
a kludgy modification to the program that allowed dynamic setting of
the recognition character whitelist via an environment variable, so I
could restrict the allowed character set and change it depending on
which column of the page I was working with. A crude but effective
language model. I don't know how to get the same dynamic whitelist
hack working in the recent version of Tesseract (the code has changed
substantially), and I am wondering whether something similar might be
possible to implement in Ocropus.
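I can't speak to the Ocropus internals, but for what it's worth,
recent Tesseract releases let you set config variables such as
tessedit_char_whitelist per invocation, so the per-column whitelist no
longer needs a source patch. Here is a minimal sketch using the
pytesseract wrapper (in place of PyTesser); the column names, crop
boxes, and file name are made up, and whether the whitelist is honored
depends on the Tesseract version and recognition engine in use:

    import pytesseract
    from PIL import Image

    # Per-column whitelists: digits for count columns, digits plus a
    # decimal point for rate columns. Names and boxes are hypothetical.
    COLUMNS = {
        "deaths": ((120, 200, 260, 1400), "0123456789"),
        "rate":   ((280, 200, 420, 1400), "0123456789."),
    }

    def ocr_column(page, box, whitelist):
        # "-c var=value" sets a Tesseract config variable for this call
        # only, so the whitelist can change column by column without
        # patching Tesseract itself.
        config = "--psm 6 -c tessedit_char_whitelist=" + whitelist
        return pytesseract.image_to_string(page.crop(box), config=config)

    page = Image.open("table_page.png")
    for name, (box, whitelist) in COLUMNS.items():
        print(name, ocr_column(page, box, whitelist))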

Any advice would be appreciated.

--Carl