On 02/09/2017 at 10:19, Daniel Flipo wrote:
> Even when pdftotext is run with option "-enc UTF-8", it converts all
> non-breaking spaces U+a0 and U+202f into U+20 (breakable). I wonder
> whether this feature is intended or not.
Digging into the code (v. 0.59), I found the culprit: in file UTF.cc, the function UnicodeIsWhitespace lists all the Unicode spaces on which lines are broken into words (it is used *only* in TextOutputDev.cc, line 2610).

UnicodeIsWhitespace includes both non-breaking spaces, U+00A0 and U+202F (all the others are fine). Is it intended to break lines into words on /non-breaking/ spaces?

Deleting those two characters from UnicodeIsWhitespace and recompiling poppler produced a pdftotext binary that works fine for me now, but I am not sure this does not break anything else in poppler.

Could the option of removing 0x00A0 and 0x202F from UnicodeIsWhitespace be investigated?

Thanks in advance, cheers,
-- 
Daniel Flipo
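
To make the proposal concrete, here is a minimal, self-contained sketch of the idea. It is not poppler's actual code: the names isNoBreakSpace, isUnicodeWhitespace, isWordBreakSpace and splitWords are hypothetical, and the list of code points is the Unicode White_Space set, assumed here for illustration rather than copied from UTF.cc. The point is only that U+00A0 and U+202F stay inside a word when they are left out of the word-breaking test.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: the two non-breaking spaces named in this thread.
constexpr bool isNoBreakSpace(char32_t c) {
    return c == 0x00A0 || c == 0x202F;
}

// Code points with the Unicode White_Space property (illustrative list,
// not taken from poppler's UTF.cc).
constexpr bool isUnicodeWhitespace(char32_t c) {
    return (c >= 0x0009 && c <= 0x000D) || c == 0x0020 || c == 0x0085 ||
           c == 0x00A0 || c == 0x1680 || (c >= 0x2000 && c <= 0x200A) ||
           c == 0x2028 || c == 0x2029 || c == 0x202F || c == 0x205F ||
           c == 0x3000;
}

// Proposed behaviour: break words on whitespace, except on no-break spaces.
constexpr bool isWordBreakSpace(char32_t c) {
    return isUnicodeWhitespace(c) && !isNoBreakSpace(c);
}

// Split UTF-32 text into words using the predicate above.
std::vector<std::u32string> splitWords(const std::u32string &text) {
    std::vector<std::u32string> words;
    std::u32string current;
    for (char32_t c : text) {
        if (isWordBreakSpace(c)) {
            if (!current.empty()) {
                words.push_back(current);
                current.clear();
            }
        } else {
            current.push_back(c);
        }
    }
    if (!current.empty()) {
        words.push_back(current);
    }
    return words;
}

// Small self-check: the no-break space stays inside the first word.
inline void selfCheck() {
    const std::vector<std::u32string> words = splitWords(U"10\u00A0km away");
    assert(words.size() == 2);
    assert(words[0] == U"10\u00A0km");
}
```

With this predicate, a string such as "10", U+00A0, "km away" splits into two words and the no-break space is preserved in the first one. Whether other callers in poppler rely on the broader definition is exactly the open question above.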
