Hi Tom,

I just tried the ocropus-pages command with the --hocr switch. It
generates XHTML with bbox info, but the output differs from what
ocropus-hocr generates. For example, a line of text from
ocropus-pages seems to use:

<span class='ocr_line' bbox='%d %d %d %d'>%s</span>

whereas in ocropus-hocr it uses:

<span class='ocr_line' title='bbox %d %d %d %d'>%s</span>

Is this intentional?
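
In case it is useful to anyone comparing the two outputs, here is a
minimal sketch of how the bbox='...' form could be rewritten into the
title='bbox ...' form after the fact. It is only an illustration: the
function name normalize_bbox_attrs is made up, it is not part of
OCRopus, and it assumes the attribute looks exactly like the format
string quoted above (four integers separated by spaces).

```python
import re

# Matches the non-standard form seen in the ocropus-pages output:
#   bbox='x0 y0 x1 y1'
# The exact attribute layout is an assumption taken from the format
# string quoted in this message, not from the OCRopus source.
BBOX_ATTR = re.compile(
    r"""\bbbox=(['"])\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*\1"""
)


def normalize_bbox_attrs(hocr_text):
    """Rewrite bbox='x0 y0 x1 y1' attributes as title='bbox x0 y0 x1 y1'.

    Spans that already carry title='bbox ...' are left untouched,
    because the pattern requires 'bbox' to be followed directly by '='.
    """
    def repl(match):
        quote = match.group(1)
        coords = " ".join(match.group(i) for i in range(2, 6))
        return "title=%sbbox %s%s" % (quote, coords, quote)

    return BBOX_ATTR.sub(repl, hocr_text)


if __name__ == "__main__":
    line = "<span class='ocr_line' bbox='313 324 733 1922'>...</span>"
    print(normalize_bbox_attrs(line))
```

A regular expression keeps the sketch short. For real documents an
XML parser would be more robust, since this approach would also
rewrite any matching bbox='...' text that happens to appear outside a
tag.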

Thanks again
Brendan

On May 13, 3:26 pm, Tom <[email protected]> wrote:
> We're converting the top level commands from C++ to Python.  Most of
> the OCRopus C++ command line programs now have Python command line
> equivalents.
>
> There are two ways of generating hOCR output.  You can do the
> traditional multi-step processing, in which case you use ocropus-hocr
> to generate the final output, or you can use ocropus-pages, which
> does all recognition in a single program.  Both output bounding boxes
> for lines.  We'll be adding support for word and character bounding
> boxes later as well.
>
> Tom
>
> On May 3, 6:59 pm, Wolfgang Schwarz <[email protected]> wrote:
>
> > Hello,
>
> > I've already tried to ask this question on 24 April, but it seems to
> > not have made it through to the group, so I'm trying again.
>
> > I would like to retrieve the layout coordinates of text blocks in a
> > document. This worked well with ocroscript in version 0.2, but in more
> > recent versions the hocr output has no coordinates. Example:
>
> > ocropus book2pages _temp data/testimages/simple.png
> > ocropus pages2lines _temp
> > ocropus lines2fsts _temp
> > ocropus fsts2text _temp
> > ocropus buildhtml _temp
>
> > The output is:
>
> > <!DOCTYPE html
> >    PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
> >    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> > <html>
> > <head>
> > <meta name="ocr-capabilities" content="ocr_line ocr_page" />
> > <meta name="ocr-langs" content="en" />
> > <meta name="ocr-scripts" content="Latn" />
> > <meta name="ocr-microformats" content="" />
> > <meta http-equiv="Content-Type" content="text/html;charset=utf-8" 
> > /><title>OCR Output</title>
>
> > </head>
> > <body>
> > <div class="ocr_page">
> > <span class="ocr_line">
> > This is a lot of 1 2 point text to test the</span><span
> > class="ocr_line">
> > ocr code and see if it works on all types</span><span
> > class="ocr_line">
> > of file format.</span><span class="ocr_line">
> > The quick brown dog jumped over the</span><span class="ocr_line">
> > lazy fox. The quick brown dog jumped</span><span class="ocr_line">
> > over the lazy fox. The quick brown dog</span><span class="ocr_line">
> > jumped over the lazy fox. The quick</span><span class="ocr_line">
> > brown dog jumped over the lazy fox.</span></div>
> > </body>
> > </html>
>
> > I was expecting each span element to have a title attribute like
>
> > <span class="ocr_line" title="bbox 313 324 733 1922">...</span>
>
> > Is there any way to turn this on?
>
> > Thanks,
> > Wolfgang
>
> > --
> > You received this message because you are subscribed to the Google Groups 
> > "ocropus" group.
> > To post to this group, send email to [email protected].
> > To unsubscribe from this group, send email to 
> > [email protected].
> > For more options, visit this group at
> > http://groups.google.com/group/ocropus?hl=en.
>

