> Hi, I am trying to do some analysis of the journal content and wish to
> assemble sentences from the OCR text. I need to identify things like
> page headers and footers, footnotes, pullquotes, etc. so that I can
> ignore them and just assemble sentences from the main body text.
> Though the OCR XML I have contains region information, it isn't very
> reliable for identifying these kinds of layout elements.
>
> So, my questions are:
>
> 1. Is Ocropus suitable for this task?

OCRopus has functions that will help you, but it's not a turnkey
solution. In particular, right now, OCRopus has good physical layout
analysis (finding "text blocks").

> 2. If so, where do I start looking to see how to use Ocropus to do this?

The primary layout analysis methods are the ones implemented by
ISegmentPage. In addition, OCRopus contains some Lua scripts that
implement a limited form of logical labeling.

> 3. Of the layout analysis algorithms included with Ocropus, which do
> you recommend for this task?

The RAST algorithm will perform physical layout analysis for you.
OCRopus as a whole does some limited logical layout analysis.

> 4. Should I be analyzing the page images directly, or could I use the
> region, line, and word bounding boxes from the OCR XML?

Which one is better depends on how good the preprocessing is.

There is a lot of research-quality code that we and others have for
more sophisticated kinds of layout analysis and that isn't part of
OCRopus yet. The bottleneck is funding and people to clean it up and
wrap it up (see below); there is a big difference between code that one
can run by hand over a bunch of databases, and code that's actually
useful for production.

Upcoming releases of OCRopus will incorporate additional forms of
logical layout analysis, including trainable logical layout analysis.

Cheers,
Thomas.

PS: here are two pointers to the kinds of approaches we are pursuing.
Our focus is on methods that can be made robust even in the presence of
noise and other image quality problems.

Structural Mixtures for Statistical Layout Analysis
*Faisal Shafait, Joost van Beusekom, Daniel Keysers, Thomas M. Breuel*
Proc. 8th Int. Workshop on Document Analysis Systems (DAS)
Accepted for publication
<http://pubs.iupr.org/#2008-IUPR-07Aug_0828>

Layout Analysis by Exploring the Space of Segmentation Parameters
*T.M. Breuel*
Proceedings of the International Association for Pattern Recognition
Workshop (Document Analysis Systems)
Also selected for inclusion in the post-conference book.
<http://pubs.iupr.org/#2000-breuel-das>
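
On question 4: if you do start from the line bounding boxes in the OCR
XML, one simple heuristic for running headers and footers is to drop
lines that sit in the top or bottom margin and whose text repeats
across pages. Here is a minimal sketch of that idea in plain Python.
It is not part of OCRopus and not the method from the papers above.
The Line class, strip_running_heads, and the margin and min_share
defaults are all made up for illustration, and it assumes you have
already parsed the XML into per-page lists of lines with coordinates
normalized to the page height. It will not catch footnotes or
pullquotes.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical container for one OCR text line (not an OCRopus type)."""
    text: str
    y0: float  # top edge as a fraction of page height (0.0 = top of page)
    y1: float  # bottom edge as a fraction of page height (1.0 = bottom)


def _key(text):
    # Collapse digit runs so "Page 3" and "Page 4" compare equal.
    return re.sub(r"\d+", "#", text.strip().lower())


def strip_running_heads(pages, margin=0.1, min_share=0.5):
    """Drop margin lines whose normalized text repeats across pages.

    pages is a list of pages, each a list of Line objects. A line is a
    candidate if it lies entirely within the top or bottom `margin`
    band. Candidates are removed when their normalized text occurs on
    at least `min_share` of the pages (and on at least two pages).
    """
    def in_margin(line):
        return line.y1 <= margin or line.y0 >= 1.0 - margin

    counts = Counter()
    for page in pages:
        # Use a set so each candidate text counts once per page.
        counts.update({_key(l.text) for l in page if in_margin(l)})

    threshold = max(2, min_share * len(pages))
    repeated = {k for k, n in counts.items() if n >= threshold}

    return [
        [l for l in page if not (in_margin(l) and _key(l.text) in repeated)]
        for page in pages
    ]
```

Both thresholds are guesses and would need tuning per journal; a title
that appears only once in the margin band is deliberately kept.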
