On 10/09/2017 at 02:17, Adrian Johnson wrote:
> On 10/09/17 03:33, Daniel Flipo wrote:
>>
>> Digging into the code (v. 0.59), I found the culprit: in file UTF.cc,
>> function UnicodeIsWhitespace lists all Unicode spaces on which to break
>> lines into words (used *only* in TextOutputDev.cc line 2610).
>>
>> UnicodeIsWhitespace includes both non-breaking spaces U+00A0 and U+202F
>> (all others are fine). Is it intended to break lines into words on
>> /non-breaking/ spaces?
>
> The bug that added that code is:
>
> https://bugs.freedesktop.org/show_bug.cgi?id=97399
>
> So at least for some PDFs, yes it is intentional. I tested with Adobe
> Reader and it is also converting non-breaking spaces to U+0020.

OK thanks, I understand the situation now.

> The solution is not as simple as removing U+00A0 from
> UnicodeIsWhitespace. That doesn't mean we can't do a better job of
> handling non-breaking space in PDFs. But it would require a non-trivial
> solution. Maybe check the ratio of non-breaking space characters to
> space characters on a page. If there is more non-breaking space than
> space, assume the PDF is broken and convert to space. If there is less
> non-breaking space than space, preserve the non-breaking space characters.

Instead of relying on a statistical test to decide whether U+00A0 is an
intentional nbsp or not, I suggest adding an option to pdftotext that
would remove U+00A0 and U+202F from UnicodeIsWhitespace for users who do
want nbsp to be honoured.

What do you think?
-- 
Daniel Flipo
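
For illustration only, the two approaches discussed above (an explicit
option versus a per-page ratio test) could be sketched as follows. This
is not poppler code: the function names, the honourNbsp flag and the list
of code points are all assumptions made for the sketch, and the list is
not a copy of the table in UTF.cc.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration; none of these names exist in poppler.

// True for the two non-breaking spaces discussed in this thread.
inline bool isNoBreakSpace(char32_t c) {
    return c == 0x00A0 || c == 0x202F;
}

// Option-based approach: when honourNbsp is true, U+00A0 and U+202F are
// not treated as word-breaking whitespace. The remaining code points are
// an illustrative list, assumed for this sketch.
inline bool isWordBreakSpace(char32_t c, bool honourNbsp) {
    if (isNoBreakSpace(c)) {
        return !honourNbsp;
    }
    return c == 0x0020 || (c >= 0x0009 && c <= 0x000D) || c == 0x0085 ||
           c == 0x1680 || (c >= 0x2000 && c <= 0x200A) ||
           c == 0x2028 || c == 0x2029 || c == 0x205F || c == 0x3000;
}

// Ratio-based approach: treat nbsp as intentional only if the page has
// strictly fewer nbsp characters than U+0020 characters. The thread does
// not say what to do on a tie; this sketch treats a tie as "broken PDF".
inline bool pageLooksIntentional(const std::u32string &page) {
    std::size_t nbsp = 0;
    std::size_t space = 0;
    for (char32_t c : page) {
        if (isNoBreakSpace(c)) {
            ++nbsp;
        } else if (c == 0x0020) {
            ++space;
        }
    }
    return nbsp < space;
}
```

With the option-based variant the caller decides once, for example from a
command-line switch, and passes the flag down; with the ratio-based
variant the result of pageLooksIntentional for each page would be passed
as honourNbsp instead.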
