Finally, after managing to compile ocropus 0.4 (with tesseract), I
gave it a try with a printed document that I scanned and emailed to
myself as a PDF. I printed the document to MS OneNote 2007. OneNote
creates a set of pictures, one for each page. I selected one page and
saved it as a JPEG, then I extracted the text (right click and select
copy text from image) and also saved it in OneNote.

I was impressed by the MS OCR engine, which retrieved the text quite
accurately (the fonts of the pasted text also matched the fonts on the
paper).

I could not say the same about ocropus. It recognized a lot less
text and took considerably longer to process the image (I used
ocropus page <jpeg image name>).

Now the question is: what can I do, as a user and not a programmer, to
improve how well ocropus recognizes the text on printed documents, so
that it reaches the same level of recognition as the MS Office OCR
engine? In my naive world I thought that ocropus would be capable of
recognizing printed text out of the box with an accuracy of at least 95%.

Thanks