Hi again,

I found out that using ocropus as the hOCR reference output is of little value: to position the text reasonably accurately in the PDF behind the image layer, we need a bounding box at least per line, and as far as I have seen, ocropus currently only outputs a bounding box for the whole body. That is not enough to place the individual glyphs more or less correctly behind the image.
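To make the requirement concrete, here is a minimal sketch of what per-line output could look like, assuming the hOCR convention of an `ocr_line` span carrying a `bbox x0 y0 x1 y1` property in its `title` attribute (pixel coordinates, origin at the top left). The function names are hypothetical and are not part of ocropus or cuneiform; the conversion to PDF coordinates assumes a known scan resolution and page height.

```python
import re
from html import escape

# Matches the hOCR "bbox x0 y0 x1 y1" property inside a title attribute.
BBOX_RE = re.compile(r'bbox (\d+) (\d+) (\d+) (\d+)')


def hocr_line(text, bbox, line_id):
    """Wrap one line of recognized text in a span with its bounding box.

    Hypothetical helper: bbox is (x0, y0, x1, y1) in image pixels.
    """
    x0, y0, x1, y1 = bbox
    return (f'<span class="ocr_line" id="{escape(line_id)}" '
            f'title="bbox {x0} {y0} {x1} {y1}">{escape(text)}</span>')


def parse_bbox(span):
    """Read the bounding box back out of a span; None if it has no bbox."""
    m = BBOX_RE.search(span)
    return tuple(int(v) for v in m.groups()) if m else None


def bbox_to_pdf(bbox, page_height_px, dpi):
    """Convert a pixel bbox (top-left origin) to PDF points (bottom-left origin).

    Assumes the page image was scanned at `dpi` and is `page_height_px` tall.
    """
    x0, y0, x1, y1 = bbox
    scale = 72.0 / dpi  # PDF user space uses 72 units per inch
    return (x0 * scale,
            (page_height_px - y1) * scale,
            x1 * scale,
            (page_height_px - y0) * scale)
```

With such a span per line, the PDF writer only needs `parse_bbox` and `bbox_to_pdf` to decide where to draw each line of hidden text.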
So I have to skip the first test with ocropus and go straight back to cuneiform.

Jussi: Do you have enough of an overview of the code to add bounding box information to the text lines of the HTML output? While the HTML output code is quite straightforward, I have not yet found how to access the positional information of the elements being written out. Ideally, we should also generate a span tag for each line of text, so that we have a chance to attach the bounding box to each line.

I'll now try to probe the structures from within a debugger, to hopefully get a better picture of the in-memory document structure and find the bounding box information.

PS: Something does not yet work quite right for me with the cuneiform Launchpad mailing list. Although I became a team member and subscribed to the list, I apparently do not receive copies of the messages, which makes following the development or replying a little more difficult than it should be ...

René

_______________________________________________
Mailing list: https://launchpad.net/~cuneiform
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~cuneiform
More help   : https://help.launchpad.net/ListHelp

