We're converting the top-level commands from C++ to Python. Most of the OCRopus C++ command-line programs now have Python command-line equivalents.

There are two ways of generating hOCR output. You can do the traditional multi-step processing, in which case you use ocropus-hocr to generate the final output, or you can use ocropus-pages, which does all recognition in a single program. Both output bounding boxes for lines. We'll be adding support for word and character bounding boxes later as well.

Tom

On May 3, 6:59 pm, Wolfgang Schwarz <[email protected]> wrote:
> Hello,
>
> I've already tried to ask this question on 24 April, but it seems to
> not have made it through to the group, so I'm trying again.
>
> I would like to retrieve the layout coordinates of text blocks in a
> document. This worked well with ocroscript in version 0.2, but in more
> recent versions the hocr output has no coordinates. Example:
>
> ocropus book2pages _temp data/testimages/simple.png
> ocropus pages2lines _temp
> ocropus lines2fsts _temp
> ocropus fsts2text _temp
> ocropus buildhtml _temp
>
> The output is:
>
> <!DOCTYPE html
> PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN
> http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
> <html>
> <head>
> <meta name="ocr-capabilities" content="ocr_line ocr_page" />
> <meta name="ocr-langs" content="en" />
> <meta name="ocr-scripts" content="Latn" />
> <meta name="ocr-microformats" content="" />
> <meta http-equiv="Content-Type" content="text/html;charset=utf-8"
> /><title>OCR Output</title>
>
> </head>
> <body>
> <div class="ocr_page">
> <span class="ocr_line">
> This is a lot of 1 2 point text to test the</span><span
> class="ocr_line">
> ocr code and see if it works on all types</span><span
> class="ocr_line">
> of file format.</span><span class="ocr_line">
> The quick brown dog jumped over the</span><span class="ocr_line">
> lazy fox. The quick brown dog jumped</span><span class="ocr_line">
> over the lazy fox. The quick brown dog</span><span class="ocr_line">
> jumped over the lazy fox. The quick</span><span class="ocr_line">
> brown dog jumped over the lazy fox.</span></div>
> </body>
> </html>
>
> I was expecting each span element to have a title attribute like
>
> <span class="ocr_line" title="bbox 313 324 733 1922">...</span>
>
> Is there any way to turn this on?
>
> Thanks,
> Wolfgang
>
> --
> You received this message because you are subscribed to the Google Groups
> "ocropus" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/ocropus?hl=en.
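
For anyone who wants to read the line bounding boxes back out of an hOCR file, here is a minimal sketch using only the Python standard library. It is not part of OCRopus; the names `parse_bbox` and `LineBoxParser` are made up for illustration. It assumes each line is a `span` with class `ocr_line` and that the box, when present, appears in the `title` attribute as `bbox x0 y0 x1 y1`, as in the example quoted above.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title string, or None if absent."""
    # A title may hold several ';'-separated properties; look for "bbox".
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class LineBoxParser(HTMLParser):
    """Collect (bbox, text) pairs for every <span class="ocr_line">."""

    def __init__(self):
        super().__init__()
        self.lines = []    # list of (bbox tuple or None, line text)
        self._depth = 0    # span nesting depth while inside an ocr_line
        self._bbox = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        if self._depth:
            # Nested span (e.g. a word-level element) inside the current line.
            self._depth += 1
        elif "ocr_line" in (attrs.get("class") or "").split():
            self._depth = 1
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._chunks = []

    def handle_endtag(self, tag):
        if tag != "span" or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse runs of whitespace and newlines into single spaces.
            text = " ".join("".join(self._chunks).split())
            self.lines.append((self._bbox, text))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


if __name__ == "__main__":
    sample = (
        '<div class="ocr_page">'
        '<span class="ocr_line" title="bbox 313 324 733 1922">\n'
        "The quick</span>"
        '<span class="ocr_line">\nbrown dog</span></div>'
    )
    parser = LineBoxParser()
    parser.feed(sample)
    for bbox, text in parser.lines:
        print(bbox, text)
```

Lines without a `title` attribute, like the ones in the quoted output, come back with `None` in place of the box, which makes it easy to check whether a given hOCR file carries coordinates at all.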
