> I could not say the same about ocropus. ocropus recognized a lot less
> text and it took considerably longer to process the image (I used
> ocropus page <jpeg image name>).

Please read the release notes:

http://code.google.com/p/ocropus/wiki/ReleaseNotes

The recognizer shipping with OCRopus 0.4 is new (only about nine months old) and has been trained on only a fairly small sample (1.5M characters).

> Now the question is what can I do - as user not programmer - to
> improve ocropus at recognizing the text on printed documents, to get
> to the same levels of recognition as the ms office ocr engine? In my
> naive world I thought that ocropus would be capable of recognizing
> printed text out of the box with an accuracy of at least 95%.

In our benchmarks, the OCRopus recognizer achieves excellent recognition rates (<0.7% error on a standard test set of scanned documents, comparable to good commercial engines). However, there's a big difference between benchmark performance and real-world performance.

To get good real-world performance, two things need to happen. First, the recognizer needs to be trained on much more data. Second, the recognizer needs to be made robust against the many idiosyncrasies that show up in real-world documents. That's what we're planning on doing for 0.5.

In other words, think of the OCRopus 0.4 recognizer as a very smart kid who is competing with experienced adults. Even if the adults are not as smart as the kid, they will probably beat the kid for a little while longer.

I'm glad that you're interested in helping. For OCRopus 0.5, we're going to have a simple kind of "distributed model". With that, you can take any kind of scanned document you're interested in, run the OCRopus training procedure on it, and submit the trained model to a central repository. Your data doesn't need to be transcribed, and the contents of the document remain private. We collect the models from the central repository and combine them into a "super model". We hope that the first version of that code is going to be available in a few months.

Tom
