Hi Tom,

I just tried using the ocropus-pages command with the --hocr switch and while it generates xhtml that has bbox info etc, the output appears to be different from that generated by ocropus-hocr. For example a span of text in ocropus-pages seems to use:

    <span class='ocr_line' bbox='%d %d %d %d'>%s</span>

whereas in ocropus-hocr it uses:

    <span class='ocr_line' title='bbox %d %d %d %d'>"% (x0,y0,x1,y1),text,"</span>

Is this intentional?

Thanks again
Brendan

On May 13, 3:26 pm, Tom <[email protected]> wrote:
> We're converting the top level commands from C++ to Python. Most of
> the OCRopus C++ command line programs now have Python command line
> equivalents.
>
> There are two ways of generating hOCR output. You can do the
> traditional multi-step processing, in which case you use ocropus-hocr
> to generate the final output, and you can use ocropus-pages, which
> does all recognition in a single program. Both output bounding boxes
> for lines. We'll be adding support for word and character bounding
> boxes later as well.
>
> Tom
>
> On May 3, 6:59 pm, Wolfgang Schwarz <[email protected]> wrote:
>
> > Hello,
> >
> > I've already tried to ask this question on 24 April, but it seems to
> > not have made it through to the group, so I'm trying again.
> >
> > I would like to retrieve the layout coordinates of text blocks in a
> > document. This worked well with ocroscript in version 0.2, but in more
> > recent versions the hocr output has no coordinates.
> > Example:
> >
> > ocropus book2pages _temp data/testimages/simple.png
> > ocropus pages2lines _temp
> > ocropus lines2fsts _temp
> > ocropus fsts2text _temp
> > ocropus buildhtml _temp
> >
> > The output is:
> >
> > <!DOCTYPE html
> > PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN
> > http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> > <html>
> > <head>
> > <meta name="ocr-capabilities" content="ocr_line ocr_page" />
> > <meta name="ocr-langs" content="en" />
> > <meta name="ocr-scripts" content="Latn" />
> > <meta name="ocr-microformats" content="" />
> > <meta http-equiv="Content-Type" content="text/html;charset=utf-8"
> > /><title>OCR Output</title>
> >
> > </head>
> > <body>
> > <div class="ocr_page">
> > <span class="ocr_line">
> > This is a lot of 1 2 point text to test the</span><span
> > class="ocr_line">
> > ocr code and see if it works on all types</span><span
> > class="ocr_line">
> > of file format.</span><span class="ocr_line">
> > The quick brown dog jumped over the</span><span class="ocr_line">
> > lazy fox. The quick brown dog jumped</span><span class="ocr_line">
> > over the lazy fox. The quick brown dog</span><span class="ocr_line">
> > jumped over the lazy fox. The quick</span><span class="ocr_line">
> > brown dog jumped over the lazy fox.</span></div>
> > </body>
> > </html>
> >
> > I was expecting each span element to have a title attribute like
> >
> > <span class="ocr_line" title="bbox 313 324 733 1922">...</span>
> >
> > Is there any way to turn this on?
> >
> > Thanks,
> > Wolfgang
> >
> > --
> > You received this message because you are subscribed to the Google Groups
> > "ocropus" group.
> > To post to this group, send email to [email protected].
> > To unsubscribe from this group, send email to
> > [email protected].
> > For more options, visit this group at
> > http://groups.google.com/group/ocropus?hl=en.
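
For anyone comparing the two outputs discussed in this thread, here is a minimal Python sketch (not taken from OCRopus itself) that reads line bounding boxes from either form of span: the title='bbox x0 y0 x1 y1' form that Wolfgang expected, or the separate bbox='x0 y0 x1 y1' attribute that Brendan saw from ocropus-pages. The names LineBoxParser, parse_bbox and read_line_boxes are made up for this illustration, and the sketch assumes flat ocr_line spans with no nested spans inside them.

```python
from html.parser import HTMLParser


def parse_bbox(attrs):
    """Return (x0, y0, x1, y1) from a span's attributes, or None.

    Checks the title attribute first (a semicolon-separated property
    list containing "bbox x0 y0 x1 y1"), then falls back to a plain
    bbox attribute holding four integers.
    """
    title = attrs.get("title") or ""
    for prop in title.split(";"):
        fields = prop.split()
        if len(fields) == 5 and fields[0] == "bbox":
            return tuple(int(v) for v in fields[1:])
    raw = attrs.get("bbox") or ""
    fields = raw.split()
    if len(fields) == 4:
        return tuple(int(v) for v in fields)
    return None


class LineBoxParser(HTMLParser):
    """Collect (bbox, text) pairs for every span with class ocr_line."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._in_line = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "ocr_line":
            self._in_line = True
            self._bbox = parse_bbox(attrs)
            self._text = []

    def handle_data(self, data):
        if self._in_line:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes no nested spans inside an ocr_line span.
        if tag == "span" and self._in_line:
            self.lines.append((self._bbox, "".join(self._text).strip()))
            self._in_line = False


def read_line_boxes(html):
    """Parse an hOCR-like string and return a list of (bbox, text)."""
    parser = LineBoxParser()
    parser.feed(html)
    parser.close()
    return parser.lines
```

Spans with no coordinates at all, as in Wolfgang's buildhtml output, come back with a bbox of None, so the three cases in the thread can be told apart from the parsed result.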
