Re: [poppler] pdftotext converts all non-breaking spaces U+a0 and U+202f into U+20 (breakable)

Daniel Flipo Sat, 09 Sep 2017 11:02:51 -0700

On 02/09/2017 at 10:19, Daniel Flipo wrote:

> Even when pdftotext is run with option "-enc UTF-8", it converts all
> non-breaking spaces U+a0 and U+202f into U+20 (breakable). I wonder
> whether this feature is intended or not.


Digging into the code (v. 0.59), I found the culprit: in file UTF.cc,
function UnicodeIsWhitespace lists all Unicode spaces on which to break
lines into words (used *only* in TextOutputDev.cc line 2610).

UnicodeIsWhitespace includes both non-breaking spaces U+a0 and U+202f
(all the others are fine). Is it intended to break lines into words on
/non-breaking/ spaces?

Deleting those two characters from UnicodeIsWhitespace and recompiling
poppler produced a pdftotext binary which works fine for me now… but I
am not sure it doesn't break anything else in poppler.
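
To make the idea concrete, here is a minimal standalone sketch of the
distinction I have in mind. It is not the actual poppler source: the
function names (isUnicodeSpace, isNoBreakSpace, isWordBreakingSpace) are
hypothetical, and the list of code points is my assumption, taken from
the Unicode White_Space property rather than copied from UTF.cc.

```cpp
// Hypothetical stand-in for a "is this a Unicode space?" predicate.
// The code point list is an assumption (Unicode White_Space property),
// not a copy of poppler's UnicodeIsWhitespace.
bool isUnicodeSpace(char32_t u) {
    switch (u) {
    case 0x0009: case 0x000A: case 0x000B: case 0x000C: case 0x000D:
    case 0x0020: case 0x0085: case 0x00A0: case 0x1680:
    case 0x2000: case 0x2001: case 0x2002: case 0x2003: case 0x2004:
    case 0x2005: case 0x2006: case 0x2007: case 0x2008: case 0x2009:
    case 0x200A: case 0x2028: case 0x2029: case 0x202F: case 0x205F:
    case 0x3000:
        return true;
    default:
        return false;
    }
}

// The two characters discussed in this message:
// U+00A0 NO-BREAK SPACE and U+202F NARROW NO-BREAK SPACE.
bool isNoBreakSpace(char32_t u) {
    return u == 0x00A0 || u == 0x202F;
}

// A space on which it is acceptable to split a line into words:
// any Unicode space except the two no-break spaces above.
bool isWordBreakingSpace(char32_t u) {
    return isUnicodeSpace(u) && !isNoBreakSpace(u);
}
```

With such a split, word breaking would call isWordBreakingSpace, while
any other code that needs to recognise "some kind of space" could keep
the broader test.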

Could the option of removing 0x00A0 and 0x202F from UnicodeIsWhitespace
be investigated?

Thanks in advance, cheers,
--
Daniel Flipo