Given the simplicity of the fix, I think this is a great addition. FWIW, there are other things we can do to correctly identify spaces without trial and error -- for instance, the threshold should not be fixed, but should depend upon the current character spacing, word spacing, and width of the "space" character (if present in the selected font). My colleagues and I have had some success automating this using some of these metrics.
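
To make the idea concrete, here is a rough sketch of a spacing-aware threshold. It is not the poppler code and not the exact method mentioned above; the function names, the glyph tuple layout (character, start x, end x), and the midpoint heuristic are all assumptions chosen for illustration. The only value taken from this thread is the 0.1 coefficient, used here as the fallback when the font has no space glyph.

```python
def word_break_threshold(font_size, char_spacing=0.0, word_spacing=0.0,
                         space_width=None, coeff=0.1):
    """Return the smallest gap that is treated as a word break.

    Hypothetical heuristic: letters inside a word are expected to sit about
    char_spacing apart, while a real space adds roughly space_width plus
    word_spacing on top of that. The threshold is the midpoint of the two.
    """
    if space_width is None:
        # No space glyph in the font: fall back to the fixed coefficient.
        return char_spacing + coeff * font_size
    return char_spacing + 0.5 * (space_width + word_spacing)


def split_words(glyphs, threshold):
    """Group (char, x_start, x_end) tuples into words by horizontal gap."""
    words = []
    current = ""
    prev_end = None
    for ch, x_start, x_end in glyphs:
        if prev_end is not None and x_start - prev_end > threshold:
            words.append(current)
            current = ""
        current += ch
        prev_end = x_end
    if current:
        words.append(current)
    return words


# Made-up glyph positions: every letter of "word" is 1.5 units apart
# (uneven, kerning-like spacing), then a 6.0 unit gap before "s".
glyphs = [("w", 0.0, 7.0), ("o", 8.5, 13.5), ("r", 15.0, 18.5),
          ("d", 20.0, 25.0), ("s", 31.0, 35.0)]

fixed = word_break_threshold(font_size=10)
adaptive = word_break_threshold(font_size=10, char_spacing=1.2,
                                space_width=3.0)

print(split_words(glyphs, fixed))     # every letter becomes its own word
print(split_words(glyphs, adaptive))  # "word" stays together
```

With the fixed threshold the 1.5 unit gaps all count as word breaks; once the character spacing and space width are taken into account, only the 6.0 unit gap does.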

--josh

On 3/12/12 3:45 PM, "Ihar `Philips` Filipau" <[email protected]> wrote:

>Hi All!
>
>Some time ago I encountered a document (can be provided privately
>via e-mail on request) which had a strange problem: spacing between
>some characters in words was uneven. Some sort of broken kerning or
>some such.
>
>When I tried to convert the PDF into HTML/XML, I noticed that the
>extra distance between characters was causing pretty much all PDF
>conversion and reading tools to not recognize the words as words - but
>instead as two or more words.
>
>I have spent several weeks trying to salvage the document and I've
>done it. The result of the work led me to try to hack on poppler and
>see if I can make that task somehow easier. The result is the pretty
>simple patch for pdftohtml attached to the bug:
>
>https://bugs.freedesktop.org/show_bug.cgi?id=47022
>
>It introduces a command line option to adjust the normally hard-coded
>coefficient 0.1 used to detect when a word break should occur.
>
>On one side, the patch somehow doesn't fit the whole picture:
>apparently pretty much all tools use the 0.1 coefficient for breaking
>up words.
>
>On the other side, it would be IMO good to have at least one tool
>capable of salvaging such documents. The alternative is lengthy and
>tedious menial proof-reading and editing.
>
>Does the community have any opinion on the topic in general or the
>patch in particular?
>
>Thanks.
>_______________________________________________
>poppler mailing list
>[email protected]
>http://lists.freedesktop.org/mailman/listinfo/poppler
