Also, I tried testing some of the images with standalone Tesseract as the OCR recognizer and found that the results were, on average, better when the images do not demand any layout analysis (which is expected, I believe). I had come across discussion threads stating that Tesseract is not the default OCR engine for OCRopus and that the pluggable integration is still in the works. Any updates on this?
Yes, Tesseract is pretty fast and fairly good; a lot of time has gone into tuning it. Mostly, we haven't been tracking Tesseract because its API has been in flux. If you can get Tesseract to recognize lines, just run it over the line images. Keep in mind, however, that Tesseract does not output probabilities and that its language models work differently. The default character and language models you're using with OCRopus right now are not very good; we're training new ones that work better. Furthermore, you're probably comparing Tesseract output with adaptation and language modeling against OCRopus output without either. Finally, Unicode and ligature support was buggy but is much better now (ligatures are used for recognizing hard-to-segment character sequences like "oo" and have a significant influence on recognition rates).

Tom
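As a rough sketch of what "just run it over the line images" could look like, assuming the line images have already been written to a directory as PNG files and that a `tesseract` command-line binary is on the PATH: the function names, the directory layout, and the `*.png` pattern are all hypothetical, and the `stdout` output base and `--psm 7` (single text line) option are assumptions about a recent Tesseract release (older ones spell it `-psm 7`). This is not how OCRopus itself integrates Tesseract.

```python
import subprocess
from pathlib import Path


def tesseract_line_command(image_path, tesseract_bin="tesseract"):
    """Build the command line for recognizing one line image.

    Assumption: a recent Tesseract CLI that accepts "stdout" as the
    output base and "--psm 7" to treat the image as a single text line.
    """
    return [tesseract_bin, str(image_path), "stdout", "--psm", "7"]


def recognize_lines(line_dir, pattern="*.png", run=subprocess.run):
    """Run Tesseract over every line image in line_dir.

    Returns a dict mapping image file name to recognized text.
    The `run` parameter exists only so the subprocess call can be
    swapped out (for example in a dry run without Tesseract installed).
    """
    results = {}
    for image_path in sorted(Path(line_dir).glob(pattern)):
        proc = run(
            tesseract_line_command(image_path),
            capture_output=True,
            text=True,
            check=True,
        )
        results[image_path.name] = proc.stdout.strip()
    return results


if __name__ == "__main__":
    # "lines/" is a placeholder for wherever the segmented line images live.
    for name, text in recognize_lines("lines").items():
        print(name, text, sep="\t")
```

Note that this only gives you plain text per line; as mentioned above, there are no per-character probabilities to feed into a separate language model.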
