I got in touch with the developer. He has a lot to do, but I sent a donation and he looked into the issue (I exchanged a few emails with him). Here is his latest response so far:
On Mon, Sep 13, 2010 at 10:28, Rene Rebe <r...@exactcode.de> wrote:

Dear Martin,

the problem is that the latest cuneiform version completely changed the way the bounding box information is written. Actually in a way that makes no sense to me. Before, each glyph had a bounding box, which is exactly what we need to write a proper PDF. Now they have a bounding box per line (which we do not need at all) and then an additional array of x start positions. However, this can easily get out of sync with regard to multi-byte UTF-8 sequences, and also with regard to whitespace.

It would also be particularly ugly to adapt the hocr2pdf HTML parser to cope with these x position spans written out after the actual text. I doubt this is valid hOCR, and even if it is, it makes no sense to first write out the <span> with the text, and then another <span> just for the x coordinates. And for proper font size estimation we need the real y-height of the single glyphs in any case (information not present in the new format).

I suggest reverting the changes that mangled the hOCR annotation in cuneiform, ... That would be approximately these:

revno: 415
committer: julien <jul...@student.chalmers.se>
branch nick: cuneiform-linux
timestamp: Wed 2009-10-07 10:10:13 +0200
message: moved some tags around, now follows html spec and hocr spec. fixed russian comments that were destroyed during encoding
------------------------------------------------------------
revno: 414
committer: julien <jul...@student.chalmers.se>
branch nick: cuneiform-linux
timestamp: Fri 2009-10-02 21:48:45 +0200
message: separated ocr_line and character bboxes. now follows the hocr standard using the ocr_cinfo tag for char bboxes
------------------------------------------------------------
revno: 413
author: Dmitry Polevoy
committer: julien <jul...@student.chalmers.se>
branch nick: cuneiform-linux
timestamp: Thu 2009-10-01 17:07:51 +0200
message: hocr format now supports ocr_line.
Replaced cuneiform_src/Kern/rout/src/html.cpp to the patch submitted in the cuneiform mailing list the 24th of February by Dmitry Polevoy. Changed %d to %l in a few sprintf statements in html.cpp

--
Font size not correct in merged sandvich PDF
https://bugs.launchpad.net/bugs/623438

Status in Linux port of Cuneiform: Invalid
Status in “exactimage” package in Ubuntu: New

Bug description:
After processing with Cuneiform for Linux 1.0.0 and hOCR to PDF converter, version 0.7.4 (should be the most current version), I get a sandwich PDF that looks nice until I select text. See the sample 5AADFEE1-0000.* files in the attachment and the result.pdf. The effect is shown in screen087.png.

For another file (Test10pages.pdf) the effect is even worse: I can no longer really select text to copy, because I can only guess where to move the mouse.

It looks like the font size in the HTML is somehow not correct. I am not an expert, but this link might help you: http://www.emdpi.com/fontsize.html
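
To make the "out of sync" problem from Rene's reply concrete, here is a minimal Python sketch. It is not code from cuneiform or hocr2pdf. The sample line, the x values, and the helper name pair_chars_with_x are all made up for illustration, and it assumes the reader expects exactly one x start position per character. How cuneiform actually counts positions is not stated in the email.

```python
def pair_chars_with_x(text, x_starts):
    """Pair each character of a recognized line with its x start position.

    Raises ValueError when the two sequences differ in length, which is
    the "out of sync" situation described in the email.
    """
    chars = list(text)  # a Python str iterates by code point, not by byte
    if len(chars) != len(x_starts):
        raise ValueError(
            f"{len(chars)} characters but {len(x_starts)} x positions"
        )
    return list(zip(chars, x_starts))


line = "Größe 5"  # hypothetical OCR result with two non-ASCII letters

# Hypothetical x start positions, one per character including the space.
x_per_char = [10, 22, 30, 41, 50, 58, 63]

char_count = len(line)                  # counts code points
byte_count = len(line.encode("utf-8"))  # counts UTF-8 bytes

pairs = pair_chars_with_x(line, x_per_char)

# A writer that skips whitespace emits one position fewer than there are
# characters, so a reader expecting one per character cannot pair them.
x_without_space = [10, 22, 30, 41, 50, 63]
try:
    pair_chars_with_x(line, x_without_space)
    whitespace_in_sync = True
except ValueError:
    whitespace_in_sync = False
```

In this example the line has 7 characters but 9 UTF-8 bytes, because "ö" and "ß" take two bytes each. A parser that walks the text byte by byte would therefore run past the end of the position array, and a position array that leaves out the space fails to line up as well. With one bounding box per glyph, as in the old format, there is nothing to count and nothing to get out of step.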