Hi again,

It turns out that ocropus is of little use as a reference for hOCR
output. We need a bounding box at least per line to position the text
reasonably accurately behind the image layer in the PDF, and as far as
I have seen ocropus currently only outputs a bounding box for the
whole body. That is not much help if you want to place the individual
glyphs roughly correctly behind the image.
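
To illustrate why a per-line box matters, here is a minimal sketch of
mapping one hOCR line box to a PDF text position. It assumes the hOCR
bbox is given in pixels with the origin at the top-left, and that the
PDF page uses points (1/72 inch) with the origin at the bottom-left.
The function name and the dpi parameter are hypothetical, chosen only
for this example.

```python
def hocr_bbox_to_pdf(bbox, page_height_px, dpi):
    """Map an hOCR bbox (x0, y0, x1, y1) in pixels to PDF points.

    Returns (x, y, width, height), where (x, y) is the lower-left
    corner of the box in PDF coordinates.
    Assumption: the scan resolution (dpi) is known and uniform.
    """
    x0, y0, x1, y1 = bbox
    scale = 72.0 / dpi  # pixels -> points
    x = x0 * scale
    # hOCR measures y downwards from the top, PDF upwards from the
    # bottom, so the bottom edge (y1) has to be flipped.
    y = (page_height_px - y1) * scale
    width = (x1 - x0) * scale
    height = (y1 - y0) * scale
    return x, y, width, height


if __name__ == "__main__":
    # One text line on a 300 dpi scan that is 3300 pixels high.
    print(hocr_bbox_to_pdf((300, 300, 900, 360), 3300, 300))
```

With only a single box for the whole body, every line would end up
with the same position and size, so the hidden text could not line up
with the image.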

So I have to skip the first test with ocropus and go straight back to cuneiform.

Jussi: Do you have enough of an overview of the code to add bounding
box information to the text lines of the HTML output? While the HTML
output code is quite straightforward, I have not yet found how to
access the positional information of the elements written out.

Ideally, we should also generate span tags for each line of text to
have a chance to add the bounding box to each line.

I'll now try to probe the structures from within a debugger to
hopefully get a better visualization of the in-memory document
structure and find the bounding box information.

PS: Something is not yet working quite right for me with the cuneiform
Launchpad mailing list. Although I became a team member and subscribed
to the list, I apparently do not receive copies of the messages, which
makes following the development or replying a little more difficult
than it should be ...

René

_______________________________________________
Mailing list: https://launchpad.net/~cuneiform
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~cuneiform
More help   : https://help.launchpad.net/ListHelp
