Thanks, Tom. I did some more experimenting.
I opened a page of text from the NY Times in print mode and captured a portion of the screen. Ocropus got the headline perfectly! I was happy to see that. However, the rest of the article came out as gibberish. The body text was a reasonably large font size, probably a bit larger than I usually use to read. I measured a lowercase h in pixels: it's 14 pixels high and 10 pixels wide. Maybe that's too small?

I also took an image containing both graphics and text and ran it through ocropus. At first, it didn't get any of the text. I tripled the image size in Photoshop with resampling and saved it as a PNG. Then I got a "PNG error" when trying the recognition. Not sure what that's about. In case the image was now too large, I cropped it to include only a portion of the original and ran that through ocropus. This time, it was able to detect some of the text in the image and interpret it properly. The portion it looked at contained both non-text graphics and text.

Since part of the data returned was non-gibberish, it seems we could at least use this to detect whether or not an image contains text. I.e., if ocropus returns a string with at least X characters in a-z, then the image probably contains text. The gibberish seems to be mainly symbols such as #. Knowing whether or not an image contains text could be useful info in itself.

Does ocropus want the entire image to be just text? Does it get confused if text is present along with graphics? Is there perhaps a way to run it so that it only tries to identify obvious text and reports whether or not it thinks text is present on the page? We could potentially do the programming for that. If that's all the info that's needed, I assume it could run a lot faster. We'd like it to be able to do a lot more, but I'm just exploring some possibilities here.
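To make the "at least X characters of a-z" idea concrete, here is a rough sketch in Python. It only looks at a string of OCR output, however that was obtained, and does not call ocropus itself. The function name, the default of 20 letters for X, and the extra letter-ratio check (to filter out output that is mostly symbols like #) are all placeholders I made up for illustration and would need tuning on real output.

```python
import re


def looks_like_text(ocr_output, min_letters=20, min_letter_ratio=0.6):
    """Guess whether a string of OCR output came from an image with real text.

    min_letters plays the role of "X" above; min_letter_ratio is an added
    assumption meant to reject output dominated by symbols.
    """
    # Drop whitespace so spacing and line breaks don't skew the ratio.
    visible = re.sub(r"\s+", "", ocr_output)
    if not visible:
        return False

    # Count plain ASCII letters (a-z, either case).
    letters = len(re.findall(r"[A-Za-z]", visible))

    return letters >= min_letters and letters / len(visible) >= min_letter_ratio


if __name__ == "__main__":
    print(looks_like_text("The quick brown fox jumps over the lazy dog"))
    print(looks_like_text("#@ ~~ ## %% ^^ ** #### ||"))
```

A check like this would run after recognition, so it would not by itself make anything faster; it only turns the existing output into a yes/no answer.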
