Hi, I am new to this list and I am interested in using hOCR to generate a hidden text layer in DjVu books.
I see that hOCR support has improved greatly since the latest release. However, there are still some glitches.

First of all, the format currently used for x_bboxes data looks a bit strange: cuneiform first writes a text line and then an empty <span> element with the character bbox info, i.e.:

    <span class='ocr_line'...>Some text<span class='ocr_cinfo'...></span></span>

I may be wrong here, but according to my understanding of the spec this <span> should rather enclose the corresponding text, i.e.:

    <span class='ocr_line'...><span class='ocr_cinfo'...>Some text</span></span>

I am not sure that writing a parser for the currently produced hOCR would make any sense, as such a parser would probably be incompatible with the output of other hOCR-capable engines. Can anybody comment on this issue?

Moreover, one closing </span> per line is still written even if plain HTML output (i.e. no hOCR tags) is requested, so the generated HTML is essentially invalid.

-- 
Regards,
Alexey Kryukov <anagnost at yandex dot ru>

Moscow State University
Historical Faculty
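
P.S. To make the parsing concern concrete, here is a minimal sketch in Python using only the standard library. It collects the text found inside each ocr_cinfo span. The names CinfoTextCollector and cinfo_text, as well as the title attribute values in the two sample strings, are placeholders made up for illustration; the real attributes written by cuneiform are elided above and are not reproduced here.

```python
from html.parser import HTMLParser


class CinfoTextCollector(HTMLParser):
    """Collect the text enclosed by each <span class='ocr_cinfo'> element."""

    def __init__(self):
        super().__init__()
        self.open_spans = []   # class attribute of every currently open <span>
        self.cinfo_text = []   # one string per ocr_cinfo span seen so far

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        cls = dict(attrs).get("class") or ""
        self.open_spans.append(cls)
        if cls == "ocr_cinfo":
            self.cinfo_text.append("")

    def handle_endtag(self, tag):
        if tag == "span" and self.open_spans:
            self.open_spans.pop()

    def handle_data(self, data):
        # Text counts only while an ocr_cinfo span is open.
        if "ocr_cinfo" in self.open_spans:
            self.cinfo_text[-1] += data


def cinfo_text(markup):
    """Return the list of texts enclosed by ocr_cinfo spans in markup."""
    parser = CinfoTextCollector()
    parser.feed(markup)
    parser.close()
    return parser.cinfo_text


# Hypothetical attribute values; only the nesting matters here.
CURRENT_LAYOUT = (
    "<span class='ocr_line' title='bbox 0 0 90 20'>Some text"
    "<span class='ocr_cinfo' title='x_bboxes 0 0 10 20'></span></span>"
)
EXPECTED_LAYOUT = (
    "<span class='ocr_line' title='bbox 0 0 90 20'>"
    "<span class='ocr_cinfo' title='x_bboxes 0 0 10 20'>Some text</span></span>"
)

print(cinfo_text(CURRENT_LAYOUT))   # [''] : the bbox span encloses no text
print(cinfo_text(EXPECTED_LAYOUT))  # ['Some text']
```

With the current layout, a parser that expects the text inside the ocr_cinfo span gets an empty string and would have to look at the preceding sibling text instead, which is the engine-specific special case I would like to avoid.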