On 10/09/17 03:33, Daniel Flipo wrote:
>
> On 02/09/2017 at 10:19, Daniel Flipo wrote:
>
>> Even when pdftotext is run with option "-enc UTF-8", it converts all
>> non-breaking spaces U+a0 and U+202f into U+20 (breakable). I wonder
>> whether this feature is intended or not.
>
> Digging into the code (v. 0.59), I found the culprit: in file UTF.cc,
> function UnicodeIsWhitespace lists all Unicode spaces on which to break
> lines into words (used *only* in TextOutputDev.cc line 2610).
>
> UnicodeIsWhitespace includes both non-breaking spaces U+a0 and U+202f
> (all others are fine). Is it intended to break lines into words on
> /non-breaking/ spaces?

The bug that added that code is:

https://bugs.freedesktop.org/show_bug.cgi?id=97399

So at least for some PDFs, yes, it is intentional. I tested with Adobe
Reader and it also converts non-breaking spaces to U+0020.

The solution is not as simple as removing U+00A0 from
UnicodeIsWhitespace. That doesn't mean we can't do a better job of
handling non-breaking spaces in PDFs, but it would require a non-trivial
solution. Maybe check the ratio of non-breaking space characters to
space characters on a page. If there are more non-breaking spaces than
spaces, assume the PDF is broken and convert them to spaces. If there
are fewer non-breaking spaces than spaces, preserve the non-breaking
space characters.

I suggest creating a bug for this and attaching your test cases. Also
attach some real-world examples so we can see the ratios of space to
non-breaking space characters.

>
> Deleting those two characters from UnicodeIsWhitespace and recompiling
> poppler built a binary pdftotext which works fine for me now… but I am
> not sure it doesn't break anything else in poppler.
>
> Could the option of removing 0x00A0 and 0x202F from UnicodeIsWhitespace
> be investigated?
>
> Thanks in advance, cheers,
> --
> Daniel Flipo

_______________________________________________
poppler mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/poppler
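
For illustration only, the per-page ratio check suggested in the reply could look roughly like the sketch below. It is written in Python rather than poppler's C++ and is not poppler code: the helper name `normalize_page_spaces` is hypothetical, and keeping the non-breaking spaces when the two counts are equal is an assumption, since the reply does not say what to do on a tie.

```python
# Non-breaking space characters discussed in the thread.
NBSP_CHARS = {"\u00a0", "\u202f"}  # NO-BREAK SPACE, NARROW NO-BREAK SPACE


def normalize_page_spaces(page_text: str) -> str:
    """Hypothetical helper (not a poppler API) applying the ratio heuristic.

    If a page contains more non-breaking spaces than ordinary spaces,
    treat the PDF as broken and convert them to U+0020. Otherwise keep
    the non-breaking spaces as they are. A tie keeps them (assumption).
    """
    nbsp_count = sum(1 for ch in page_text if ch in NBSP_CHARS)
    space_count = page_text.count(" ")

    if nbsp_count > space_count:
        # Non-breaking spaces dominate: probably used in place of real spaces.
        return "".join(" " if ch in NBSP_CHARS else ch for ch in page_text)

    # Ordinary spaces dominate (or tie): the non-breaking spaces look deliberate.
    return page_text
```

A real fix would have to work on the words and glyphs inside TextOutputDev rather than on a finished string, so this only shows the decision rule, not where it would be applied.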
