This was reported before and should be fixed in the latest version. http://code.google.com/p/ocropus/issues/detail?id=255#c2
The reason is that pages2lines splits up both the grayscale and the binary versions of the image, but it used the original version for one and the deskewed version for the other. Since extraction from the grayscale image involves masking, you're seeing JPEG noise through character-shaped masks.

If you give poor-quality input to the rest of OCRopus, you will get "index out of range" and "beam search" errors; the former usually indicates that it couldn't find something, and the latter indicates that the raw output from the character recognizer doesn't match the language model. We're trying to improve the error messages in future versions.

Please have a look at the new commands in ocropy/ocropus-*; they are functionally analogous, but much easier to read and modify, and they take command line switches and have more built-in documentation.

Tom

On Apr 29, 11:28 pm, Ben <[email protected]> wrote:
> I want to start by thanking you for helping out. I have been trying
> for a while to get ocropus working and have been having some issues.
>
> My first issue is that the version of ocropus I compiled (0.4.4) is
> behaving very poorly on most images. For example, the image
> "alice_1.png", which comes with ocropus in the "data/testimages"
> folder, returns very poor results when I run it through the
> recommended process of "book2pages", "pages2lines", "lines2fsts"...
> (see http://benhansen.me/ocropus/alice_1.png.html) but returns great
> results when I use the "ocropus page <imagefilename>" command line
> parameter (see http://benhansen.me/ocropus/alice_1.pngPageOutput.html).
> After looking into the problem it seems that images are losing a lot
> of quality in the "pages2lines" step of the process. The binarization
> process seems to be very clean and effective
> (see http://benhansen.me/ocropus/0001.bin.png)
> but the segmented line images have lost a lot of quality
> (see http://benhansen.me/ocropus/010003.png). I am confused on the reason
> behind this quality loss.
> Shouldn't the pages2lines process only split up the already touched up
> binarized image? Any ideas? Robert B. submitted "Which image for
> training?" on Apr 22; I couldn't find a response.
>
> Also I keep getting a lot of "[error] narray: index out of range", and
> "[error] beam search failed" errors. There was a discussion
> "introduction and request for help getting up and running" on May 9
> where these errors and garbage output were mentioned to be caused by a
> PPI of less than 300 or otherwise improperly formatted images. The
> images I have been using are 300 DPI as well as have a standard font
> size with regards to the dpi. Are there any other specifications that
> I should know about the images?
>
> I would like to start working on these problems. My feelings are that
> this might just be a glitch in the current version and I don't want to
> waste energy solving a problem that has already been solved.
>
> Thanks Again!
> -Ben
>
> --
> You received this message because you are subscribed to the Google Groups
> "ocropus" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group
> at http://groups.google.com/group/ocropus?hl=en.
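
P.S. For anyone who wants to see the effect described at the top of this message in isolation, here is a toy sketch in Python. It is not OCRopus code: every name in it (shift_right, binarize, extract_masked) is made up for illustration, and a plain horizontal shift stands in for the deskew transform. It builds a mask from the "deskewed" page and applies it once to the matching page and once to the original, which is roughly the mismatch pages2lines had.

```python
# Toy illustration only; none of these names come from OCRopus.
WHITE = 255


def shift_right(image, dx, fill):
    """Shift every row right by dx pixels (a stand-in for deskewing)."""
    return [[fill] * dx + row[:len(row) - dx] for row in image]


def binarize(image, threshold=128):
    """Return a mask that is True where the pixel is dark (ink)."""
    return [[pixel < threshold for pixel in row] for row in image]


def extract_masked(gray, mask, background=WHITE):
    """Keep gray pixels under the mask; paint everything else as background."""
    return [
        [g if m else background for g, m in zip(gray_row, mask_row)]
        for gray_row, mask_row in zip(gray, mask)
    ]


# A 3x6 "page": a dark stroke (value 20) in columns 1-2 on a noisy,
# not-quite-white background, as you would get from a JPEG.
original = [
    [250, 20, 20, 240, 245, 235],
    [248, 20, 20, 238, 250, 242],
    [251, 20, 20, 236, 247, 239],
]
deskewed = shift_right(original, 2, WHITE)

# The mask is computed from the deskewed page.
mask = binarize(deskewed)

good = extract_masked(deskewed, mask)  # same geometry: clean ink
bad = extract_masked(original, mask)   # mismatched: background noise

print("matched:   ", good[0])
print("mismatched:", bad[0])
```

In the matched case the stroke comes out as dark ink. In the mismatched case the same character-shaped region is filled with light background values instead, which is the "JPEG noise through character-shaped masks" seen in the segmented line images.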
