Re: ocropus vs. MS Office 2007 ocr

Thomas Breuel Thu, 18 Jun 2009 03:17:55 -0700

>
> I could not say the same about ocropus. ocropus recognized a lot less
> text and it took considerably longer to process the image (I used
> ocropus page <jpeg image name>).



Please read the release notes:

http://code.google.com/p/ocropus/wiki/ReleaseNotes

The recognizer shipping with OCRopus 0.4 is new (only about nine months old)
and has so far been trained on only a fairly small sample (1.5M characters).


> Now the question is what can I do - as user not programmer - to
> improve ocropus at recognizing the text on printed documents, to get
> to the same levels of recognition as the ms office ocr engine? In my
> naive world I thought that ocropus would be capable of recognizing
> printed text out of the box with an accuracy of at least 95%.


In our benchmarks, the OCRopus recognizer achieves excellent recognition
rates (<0.7% error on a standard test set of scanned documents, comparable
to good commercial engines).
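
For readers unfamiliar with how a figure like that is usually computed, here
is a minimal sketch. It assumes the error rate is a character-level edit
distance between the OCR output and a ground-truth transcription, divided by
the length of the ground truth. That is a common convention, not necessarily
the exact metric used in the benchmark above, and the function names are
hypothetical (they are not part of OCRopus).

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into "" takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by reference length (assumed metric)."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this convention, an error rate below 0.7% means fewer than 7 character
edits per 1000 characters of ground truth.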

However, there's a big difference between benchmark performance and
real-world performance.  To get good real-world performance, two things need
to happen.  First, the recognizer needs to be trained on much more data.
Second, the recognizer needs to be made robust against a lot of possible
idiosyncrasies in real-world documents.  That's what we're planning on doing
for 0.5.  In other words, think of the OCRopus 0.4 recognizer as a very
smart kid, but it's competing with experienced adults.  Even if the adults
are not as smart as the kid, they will probably beat it for a little while
longer.

I'm glad that you're interested in helping.  For OCRopus 0.5, we're going to
have a simple kind of "distributed model".  With that, you can use any kind
of scanned document you're interested in, run the OCRopus training procedure
on it, and submit the trained model to a central repository.  Your data
doesn't need to be transcribed, and the contents of the document remain
private.  We collect the models from the central repository and combine them
into a "super model".  We hope that the first version of that code is going
to be available in a few months.

Tom
