Hello,

I am hoping to use Ocropus for layout analysis and have a few
questions about how to get started.

First, a bit about my task. I have a collection of about 50,000 page
images from scanned journals and OCR XML generated from these images.
I'm not sure what generated the OCR XML; it looks like this:

<ocr-page type="coords" bpp="1" ocr-vers="1.0" res="600" width="9190"
height="12996">
<region region-id="0">
<rc l="1562" t="1400" r="7802" b="11900"/>
<ln lx="2894" ly="1406" rx="7745" ry="1670">
<wd l="2894" t="1411" r="3218" b="1601" fs="11"><![CDATA[the]]></wd>
...

I am trying to do some analysis of the journal content and wish to
assemble sentences from the OCR text. I need to identify things like
page headers and footers, footnotes, pullquotes, etc. so that I can
ignore them and just assemble sentences from the main body text.
Though the OCR XML I have contains region information, it isn't very
reliable for identifying these kinds of layout elements.

So, my questions are:

1. Is Ocropus suitable for this task?
2. If so, where do I start looking to see how to use Ocropus to do this?
3. Of the layout analysis algorithms included with Ocropus, which do
you recommend for this task?
4. Should I be analyzing the page images directly, or could I use the
region, line, and word bounding boxes from the OCR XML?
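
To make question 4 concrete, here is a minimal Python sketch of reading
the line and word boxes out of XML shaped like the excerpt above. It is
only an illustration under assumptions: the closing tags in SAMPLE are
guessed (the excerpt is truncated), lx/ly/rx/ry are taken to be the
left/top/right/bottom of a line, and the helper names and the margin
thresholds are hypothetical, not part of Ocropus or of whatever tool
produced the XML.

```python
import xml.etree.ElementTree as ET

# Assumed completion of the truncated excerpt; the closing tags are a guess.
SAMPLE = """<ocr-page type="coords" bpp="1" ocr-vers="1.0" res="600" width="9190" height="12996">
<region region-id="0">
<rc l="1562" t="1400" r="7802" b="11900"/>
<ln lx="2894" ly="1406" rx="7745" ry="1670">
<wd l="2894" t="1411" r="3218" b="1601" fs="11"><![CDATA[the]]></wd>
</ln>
</region>
</ocr-page>"""


def read_lines(xml_text):
    """Return one dict per <ln> element with its text and vertical position."""
    page = ET.fromstring(xml_text)
    page_height = int(page.get("height"))
    lines = []
    for region in page.iter("region"):
        for ln in region.iter("ln"):
            words = [(wd.text or "").strip() for wd in ln.iter("wd")]
            top = int(ln.get("ly"))      # assumed: top edge of the line
            bottom = int(ln.get("ry"))   # assumed: bottom edge of the line
            lines.append({
                "region": region.get("region-id"),
                "top": top,
                "bottom": bottom,
                # Position as a fraction of page height, independent of resolution.
                "rel_top": top / page_height,
                "rel_bottom": bottom / page_height,
                "text": " ".join(w for w in words if w),
            })
    return lines


def in_margin_band(line, top_frac=0.05, bottom_frac=0.95):
    """Hypothetical heuristic: flag lines lying entirely in the top or bottom band.

    The thresholds are placeholders and would need tuning per journal.
    """
    return line["rel_bottom"] <= top_frac or line["rel_top"] >= bottom_frac


if __name__ == "__main__":
    for line in read_lines(SAMPLE):
        print(line["region"], line["top"], in_margin_band(line), line["text"])
```

A position-only rule like this would catch running headers and footers
at best; footnotes and pullquotes would presumably need more signal,
such as the fs attribute (which looks like a font size) or the page
image itself.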

Thank you,

Ryan Shaw
School of Information
University of California, Berkeley
