Hi all! Some time ago I encountered a document (I can provide it privately by e-mail on request) with a strange problem: the spacing between some characters within words was uneven, as if the kerning were broken.
When I tried to convert the PDF into HTML/XML, I noticed that the extra distance between characters caused pretty much all PDF conversion and reading tools to treat a single word as two or more words. I spent several weeks trying to salvage the document and eventually succeeded.
That work led me to hack on poppler to see whether I could make the task easier. The result is a fairly simple patch for pdftohtml, attached to this bug: https://bugs.freedesktop.org/show_bug.cgi?id=47022
It introduces a command line option to adjust the normally hard-coded coefficient of 0.1 that is used to detect where a word break should occur.
On one hand, the patch doesn't quite fit the whole picture: apparently pretty much all tools use the 0.1 coefficient for breaking up words. On the other hand, it would IMO be good to have at least one tool capable of salvaging such documents. The alternative is lengthy, tedious manual proofreading and editing.
Does the community have any opinion on the topic in general or the patch in particular?
Thanks.
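P.S. For anyone unfamiliar with the heuristic, here is a rough Python sketch of the idea as I understand it. It is not the actual poppler code: the function name, the (text, x_min, x_max) glyph tuples, and the use of the font size as the scale for the coefficient are all assumptions made for illustration. It only shows how a gap slightly above the threshold splits a word, and how a larger coefficient keeps it together.

```python
def split_words(glyphs, font_size, coeff=0.1):
    """Group glyphs on one text line into words (illustrative sketch only).

    glyphs: list of (text, x_min, x_max) tuples in reading order (hypothetical layout).
    A new word starts when the horizontal gap to the previous glyph
    exceeds coeff * font_size (the scale is an assumption).
    """
    words = []
    current = ""
    prev_x_max = None
    for text, x_min, x_max in glyphs:
        gap_too_wide = (
            prev_x_max is not None
            and (x_min - prev_x_max) > coeff * font_size
        )
        if gap_too_wide:
            words.append(current)
            current = ""
        current += text
        prev_x_max = x_max
    if current:
        words.append(current)
    return words


# Made-up glyph positions for "word" at font size 10,
# with an uneven 1.5 unit gap between "o" and "r".
glyphs = [("w", 0.0, 5.0), ("o", 5.0, 10.0), ("r", 11.5, 16.0), ("d", 16.0, 21.0)]

default_split = split_words(glyphs, font_size=10.0)            # threshold 1.0
relaxed_split = split_words(glyphs, font_size=10.0, coeff=0.2)  # threshold 2.0
```

With the default coefficient the 1.5 unit gap is above the threshold, so the word is broken in two. Raising the coefficient to 0.2 keeps it whole, and that is the kind of adjustment the proposed option is meant to allow.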
