Not to seem disrespectful, but we have tested the OCR performance of this package, and it is not up to the standard we can get from ABBYY and other commercial packages.
But I would like to ask for help with a new form of OCR that I believe will work, though I do not know for sure. We would like to do OCR at the book level and move toward language-independent OCR.

The Internet Archive is in the process of scanning a large number of books and making them publicly available (http://www.archive.org/details/texts). The books we scan ourselves (for instance http://www.archive.org/details/americana) are in a very consistent form, and we have a lot of control over how they are imaged and processed.

What we want is OCR output in an XML format that records the pixel locations of words and the UTF-8 text of each word. We believe we can create large training sets for a large number of languages to train a word-based OCR engine. This could lead to language-independent OCR based on relatively simple pattern matching. See http://www.archive.org/details/document-word-segmenter for an overview of the idea.

What we believe we need: a word segmenter, a trainable system for word recognition, and large training sets. The large training sets we can get from the output of other programs, so we already have them for several Romance languages.

Is anyone interested in helping with this? We can pay something, but most of the "compensation" will be in doing something publicly good and at huge scale.

-brewster
Digital Librarian
Internet Archive

_______________________________________________
Bug-ocrad mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/bug-ocrad
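To make the proposed output format concrete, here is a rough sketch of what per-word XML with pixel locations and UTF-8 text might look like. The element and attribute names (page, word, l/t/r/b) are only assumptions for illustration, not a settled schema:

```python
# Sketch of per-word OCR output: each word carries its pixel
# bounding box (l, t, r, b) and its UTF-8 text content.
# Element and attribute names here are hypothetical.
import xml.etree.ElementTree as ET

def words_to_xml(page_number, words):
    """words: list of (left, top, right, bottom, text) tuples."""
    page = ET.Element("page", number=str(page_number))
    for left, top, right, bottom, text in words:
        word = ET.SubElement(page, "word",
                             l=str(left), t=str(top),
                             r=str(right), b=str(bottom))
        word.text = text
    return ET.tostring(page, encoding="unicode")

print(words_to_xml(1, [(10, 20, 85, 38, "Internet"),
                       (92, 20, 150, 38, "Archive")]))
```

Existing conventions such as hOCR or ALTO cover similar ground, so the real format could equally be one of those rather than an ad hoc schema.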
