On Sat, Sep 6, 2008 at 3:37 PM, René Rebe <[EMAIL PROTECTED]> wrote:
> Jussi: Do you have enough overview of the code to add bounding box
> information to the text lines of the HTML output? While the HTML
> output code is quite straightforward, I have not quickly found how to
> access the positional information of the elements written out.

I asked about this on the Russian Cuneiform forum:

http://openocr.org/forum/viewtopic.php?f=7&t=2829

I got the following information via email from a person at Cognitive:

---8<----
Cuneiform at the low level doesn't have such things as a "text
paragraph", but the rfrmt library takes blocks (text block, image block
and so on) and text fragments (lines of text) and creates an RTF-like
document description. The ced library is a container for the RTF-like
document description, and I think you can try to extract the paragraph
border or the set of text lines for a particular paragraph (and find
the paragraph border as a cover rectangle for the text line rectangles).
---8<----

> Are there any preferences where to add this hOCR related HTML
> annotation? Conditionalized into the existing html writer, or as a
> second copy of it adding those bounding boxes and possibly some post
> processing for the issues mentioned above?

I see no point in duplicating the HTML writer part, as hOCR just adds
some simple tags. The only reason not to always emit hOCR tags is that
they can bloat the file: a span/bbox for every single character
quickly adds up. Paragraph-sized bounding boxes would not bloat the
file all that much, but as mentioned above, they seem to be directly
accessible.
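
To make the "cover rectangle" idea concrete, here is a rough sketch in Python rather than in Cuneiform's own code. It assumes we can already get one rectangle per text line; the names Box, cover and hocr_paragraph are made up for illustration and are not part of the rfrmt or ced libraries. The markup uses the usual hOCR convention of a title attribute holding "bbox x0 y0 x1 y1", with per-line spans switchable to show the size trade-off.

```python
from html import escape
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical rectangle type: pixel coordinates, origin at top-left."""
    left: int
    top: int
    right: int
    bottom: int


def cover(boxes):
    """Return the smallest rectangle containing every box in a non-empty list."""
    if not boxes:
        raise ValueError("need at least one box")
    return Box(
        min(b.left for b in boxes),
        min(b.top for b in boxes),
        max(b.right for b in boxes),
        max(b.bottom for b in boxes),
    )


def bbox_title(box):
    """Format a box as an hOCR-style bbox property."""
    return "bbox {} {} {} {}".format(*box)


def hocr_paragraph(lines, with_line_boxes=True):
    """Build one paragraph element from a list of (text, Box) pairs.

    The paragraph box is derived from the line boxes. Per-line spans are
    optional because they are what makes the output grow.
    """
    par_box = cover([box for _, box in lines])
    parts = []
    for text, box in lines:
        if with_line_boxes:
            parts.append(
                '<span class="ocr_line" title="{}">{}</span>'.format(
                    bbox_title(box), escape(text)
                )
            )
        else:
            parts.append(escape(text))
    return '<p class="ocr_par" title="{}">{}</p>'.format(
        bbox_title(par_box), "\n".join(parts)
    )


# Made-up sample data, only to exercise the functions above.
lines = [
    ("First line of text", Box(100, 50, 620, 80)),
    ("second, shorter line", Box(100, 90, 540, 120)),
]
print(hocr_paragraph(lines))
```

Passing with_line_boxes=False keeps only the paragraph-level box, which is the cheap variant; the same switch could be applied at character level if that data ever becomes available.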

