Hello, I am hoping to use Ocropus for layout analysis and have a few questions about how to get started.

First, a bit about my task. I have a collection of about 50,000 page images from scanned journals, along with OCR XML generated from these images. I'm not sure what generated the OCR XML; it looks like this:

    <ocr-page type="coords" bpp="1" ocr-vers="1.0" res="600" width="9190" height="12996">
      <region region-id="0">
        <rc l="1562" t="1400" r="7802" b="11900"/>
        <ln lx="2894" ly="1406" rx="7745" ry="1670">
          <wd l="2894" t="1411" r="3218" b="1601" fs="11"><![CDATA[the]]></wd>
          ...

I am trying to do some analysis of the journal content and wish to assemble sentences from the OCR text. I need to identify things like page headers and footers, footnotes, pullquotes, etc. so that I can ignore them and just assemble sentences from the main body text. Though the OCR XML I have contains region information, it isn't very reliable for identifying these kinds of layout elements.

So, my questions are:

1. Is Ocropus suitable for this task?
2. If so, where do I start looking to see how to use Ocropus to do this?
3. Of the layout analysis algorithms included with Ocropus, which do you recommend for this task?
4. Should I be analyzing the page images directly, or could I use the region, line, and word bounding boxes from the OCR XML?

Thank you,
Ryan Shaw
School of Information
University of California, Berkeley
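
P.S. To make question 4 concrete, here is a minimal Python sketch of what "using the bounding boxes from the OCR XML" might look like. It is only an illustration under several assumptions: the closing tags and the extra header and footer lines in SAMPLE are made up to complete the excerpt above, ly and ry are assumed to be the top and bottom y coordinates of a line, and the 8% margin and the name body_lines are hypothetical. None of this comes from Ocropus or from whatever tool produced the XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page. The element and attribute names follow the excerpt
# in this message; the closing tags, the header line, the footer line, and the
# second body word are assumed for illustration.
SAMPLE = """<ocr-page type="coords" bpp="1" ocr-vers="1.0" res="600" width="9190" height="12996">
  <region region-id="0">
    <rc l="1562" t="300" r="7802" b="12700"/>
    <ln lx="2894" ly="300" rx="7745" ry="560">
      <wd l="2894" t="305" r="3600" b="550" fs="9"><![CDATA[JOURNAL]]></wd>
    </ln>
    <ln lx="2894" ly="1406" rx="7745" ry="1670">
      <wd l="2894" t="1411" r="3218" b="1601" fs="11"><![CDATA[the]]></wd>
      <wd l="3300" t="1411" r="3900" b="1601" fs="11"><![CDATA[text]]></wd>
    </ln>
    <ln lx="4400" ly="12400" rx="4800" ry="12650">
      <wd l="4400" t="12405" r="4800" b="12640" fs="9"><![CDATA[42]]></wd>
    </ln>
  </region>
</ocr-page>"""


def body_lines(xml_text, margin=0.08):
    """Return the text of each line that lies outside the top/bottom margins.

    Assumes ly/ry are the top/bottom y coordinates of a line, and that
    headers and footers sit within `margin` (a fraction of the page height)
    of the page edges. Both are guesses, not facts about the format.
    """
    page = ET.fromstring(xml_text)
    height = int(page.get("height"))
    top_limit = margin * height
    bottom_limit = (1.0 - margin) * height

    lines = []
    for ln in page.iter("ln"):
        line_top = int(ln.get("ly"))
        line_bottom = int(ln.get("ry"))
        # Skip lines that start in the top margin or end in the bottom margin.
        if line_top < top_limit or line_bottom > bottom_limit:
            continue
        # CDATA sections are exposed as ordinary element text.
        words = [wd.text for wd in ln.iter("wd") if wd.text]
        lines.append(" ".join(words))
    return lines


if __name__ == "__main__":
    for text in body_lines(SAMPLE):
        print(text)
```

A fixed margin like this would not catch footnotes or pullquotes, which is why I am asking whether proper layout analysis on the page images is the better route.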
