Finally, after managing to compile ocropus 0.4 (with tesseract), I gave it a try with a printed document that I had scanned and emailed to myself as a PDF. I printed the document to MS OneNote 2007. OneNote creates a set of pictures, one for each page. I selected one page and saved it as a JPEG, then I extracted the text (right click and select copy text from image) and saved that in OneNote as well.
I was impressed by the MS OCR engine, which retrieved the text quite accurately (the fonts of the pasted text also matched the fonts on the paper). I could not say the same about ocropus. ocropus recognized a lot less text, and it took considerably longer to process the image (I used ocropus page <jpeg image name>).

Now the question is: what can I do, as a user rather than a programmer, to improve ocropus's recognition of text on printed documents, so that it reaches the same level of recognition as the MS Office OCR engine? In my naive world I thought that ocropus would be capable of recognizing printed text out of the box with an accuracy of at least 95%.

Thanks
