We have a similar historical effort at the University of California, Berkeley, involving printed mortality statistics for the United States and other countries (Japan, France). In our case, the source material is page after page of nearly identically formatted tables of numbers.
I experimented with automating this mountain of work with OCR, and the initial results seemed promising (using Tesseract, a Python wrapper (PyTesser), and the Python Imaging Library (PIL)). I think what makes our project highly amenable to OCR is:

(1) segmentation is easy, because the pages are all alike and it is not too hard to find the desired column regions on each page with some simple image-processing routines (see the first sketch below);

(2) the characters are typically numeric only;

(3) there are usually checksums in the columns and rows, so many recognition errors can be caught automatically (see the last sketch below).

Regarding (2): in an earlier version of Tesseract, I was able to make a kludgy modification to the program that allowed the recognition character whitelist to be set dynamically via an environment variable, so I could restrict the allowed character set and change it depending on which column of the page I was working on. A crude but effective language model (see the second sketch below). I don't know how to get the same dynamic-whitelist hack working in the recent version of Tesseract (the code has changed substantially), and I am wondering whether something similar might be possible to implement in OCRopus. Any advice would be appreciated.

Rough sketches of the three pieces I mean follow.

--Carl
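First, the column-finding I have in mind is nothing fancier than a vertical projection profile: binarize the page, count the dark pixels in each pixel column, and treat long low-ink runs as gutters between table columns. A minimal sketch with PIL; the file name, threshold, and gutter fraction are made-up values for illustration, and it assumes a clean, deskewed scan:

    from PIL import Image

    def find_column_regions(path, ink_thresh=128, gutter_frac=0.02):
        """Locate table columns via a vertical projection profile.

        ink_thresh and gutter_frac are illustrative defaults only.
        """
        img = Image.open(path).convert("L")   # grayscale
        w, h = img.size
        pix = img.load()

        # Count "inked" (dark) pixels in every pixel column.
        profile = [sum(1 for y in range(h) if pix[x, y] < ink_thresh)
                   for x in range(w)]

        # A pixel column belongs to a gutter if it has almost no ink.
        max_gutter_ink = gutter_frac * h
        in_col, start, regions = False, 0, []
        for x, ink in enumerate(profile):
            if ink > max_gutter_ink and not in_col:
                in_col, start = True, x
            elif ink <= max_gutter_ink and in_col:
                in_col = False
                regions.append((start, x))
        if in_col:
            regions.append((start, w))
        return regions   # list of (left, right) pixel ranges

    # Crop each detected column out of the page for per-column OCR.
    page = Image.open("mortality_p001.tif")
    for i, (left, right) in enumerate(find_column_regions("mortality_p001.tif")):
        page.crop((left, 0, right, page.size[1])).save("col_%02d.tif" % i)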
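Second, what my environment-variable hack amounted to can be approximated without patching Tesseract at all, by writing a small config file per column and naming it on the command line, the same way PyTesser shells out to the binary. I believe the relevant variable is tessedit_char_whitelist, though I have not verified that it survives unchanged in the newest code; the column names and file paths here are hypothetical:

    import subprocess

    # Per-column character sets: digits-only for count columns,
    # digits plus a decimal point for rate columns, and so on.
    WHITELISTS = {
        "deaths": "0123456789",
        "rate":   "0123456789.",
    }

    def ocr_column(image_path, out_base, charset):
        """OCR one column image with the allowed characters restricted.

        Writes a throwaway config file setting tessedit_char_whitelist
        (assumed variable name), then shells out to the tesseract
        binary, which writes out_base + ".txt".  Caveats: some builds
        also want the stock 'nobatch' config listed first, and some
        look for config files under tessdata/configs rather than
        accepting a bare path.
        """
        cfg = out_base + ".cfg"
        with open(cfg, "w") as f:
            f.write("tessedit_char_whitelist %s\n" % charset)
        subprocess.call(["tesseract", image_path, out_base, "nobatch", cfg])
        return open(out_base + ".txt").read()

    text = ocr_column("col_00.tif", "col_00", WHITELISTS["deaths"])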
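Third, once a table has been parsed into numbers, the printed marginal totals give a free consistency check. A minimal sketch, assuming the last column holds row totals and the last row holds column totals (the actual layout varies by publication):

    def check_table(rows):
        """Flag rows/columns whose printed total disagrees with the sum.

        rows: list of lists of ints, where rows[i][-1] is the printed
        row total and rows[-1] is the printed column-total row.
        Returns a list of human-readable error strings (empty = clean).
        """
        errors = []
        body = rows[:-1]   # data rows, excluding the totals row
        for i, row in enumerate(body):
            if sum(row[:-1]) != row[-1]:
                errors.append("row %d: cells sum to %d, printed total %d"
                              % (i, sum(row[:-1]), row[-1]))
        totals = rows[-1]
        for j in range(len(totals)):
            col_sum = sum(row[j] for row in body)
            if col_sum != totals[j]:
                errors.append("col %d: cells sum to %d, printed total %d"
                              % (j, col_sum, totals[j]))
        return errors

    # Example: two data rows plus a totals row; last column is row totals.
    table = [[12,  7, 19],
             [ 3,  5,  8],
             [15, 12, 27]]
    assert check_table(table) == []

Any cell the OCR gets wrong will almost always break at least one row sum and one column sum, which both localizes the error and tells you which pages need a human pass.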
