Hello,

I already tried to post this question on 24 April, but it does not seem
to have reached the group, so I'm trying again.

I would like to retrieve the layout coordinates of text blocks in a
document. This worked well with ocroscript in version 0.2, but in more
recent versions the hOCR output has no coordinates. Example:

ocropus book2pages _temp data/testimages/simple.png
ocropus pages2lines _temp
ocropus lines2fsts _temp
ocropus fsts2text _temp
ocropus buildhtml _temp

The output is:

<!DOCTYPE html
   PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<meta name="ocr-capabilities" content="ocr_line ocr_page" />
<meta name="ocr-langs" content="en" />
<meta name="ocr-scripts" content="Latn" />
<meta name="ocr-microformats" content="" />
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<title>OCR Output</title>
</head>
<body>
<div class="ocr_page">
<span class="ocr_line">
This is a lot of 1 2 point text to test the</span><span
class="ocr_line">
ocr code and see if it works on all types</span><span
class="ocr_line">
of file format.</span><span class="ocr_line">
The quick brown dog jumped over the</span><span class="ocr_line">
lazy fox. The quick brown dog jumped</span><span class="ocr_line">
over the lazy fox. The quick brown dog</span><span class="ocr_line">
jumped over the lazy fox. The quick</span><span class="ocr_line">
brown dog jumped over the lazy fox.</span></div>
</body>
</html>

I was expecting each span element to have a title attribute like

<span class="ocr_line" title="bbox 313 324 733 1922">...</span>

Is there any way to turn this on?
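
To illustrate what I'm after, here is a minimal Python sketch of how the
coordinates could be read from the hOCR output once the title attribute
is present. The names LineBoxParser and extract_line_boxes are
hypothetical (not part of ocropus), and the sketch assumes the bbox
values are plain integers as in the example above. It uses only the
standard-library html.parser module.

```python
from html.parser import HTMLParser


class LineBoxParser(HTMLParser):
    """Collect bbox coordinates from ocr_line spans in hOCR markup."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag != "span" or "ocr_line" not in classes:
            return
        title = attrs.get("title") or ""
        # A title may hold several ';'-separated properties; pick out "bbox".
        for prop in title.split(";"):
            parts = prop.split()
            if len(parts) == 5 and parts[0] == "bbox":
                # Assumption: the four values are integer pixel coordinates.
                self.boxes.append(tuple(int(p) for p in parts[1:]))


def extract_line_boxes(hocr_text):
    """Return one (x0, y0, x1, y1) tuple per ocr_line span that has a bbox."""
    parser = LineBoxParser()
    parser.feed(hocr_text)
    return parser.boxes


sample = '<span class="ocr_line" title="bbox 313 324 733 1922">...</span>'
print(extract_line_boxes(sample))
```

With the current output, where the spans carry no title attribute, this
returns an empty list, which is exactly the problem.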

Thanks,
Wolfgang
