Tim - be aware that the PDF standard (ISO 32000-1:2008) refers to a specific version of Unicode (v4). Support for any newer version could potentially introduce compatibility issues.
For the next version of PDF (2.0, ISO 32000-2) we are evaluating updating that reference.

Leonard

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Tim Brody
Sent: Wednesday, May 11, 2011 2:58 AM
To: [email protected]
Subject: Re: [poppler] [PATCH] Fixup LaTeX composed characters

On Tue, 10 May 2011 19:15:51 +0100, Albert Astals Cid <[email protected]> wrote:
> On Tuesday, May 10, 2011, Tim Brody wrote:
>> > Sincerely i am quite hesitant to apply your patch since it "breaks"
>> > pdftotext usage in the console (since it seems most of the apps in
>> > the console are not able to understand the non-composed form)
>>
>> Anyway, my patch is only a fix-up of overprinting characters that would
>> otherwise get mangled by pdftotext. It just makes it more apparent that
>> your tool-chain is broken because it's producing more non-ASCII7
>> code-points.
>
> By tool-chain you mean pdftotext?

I mean whatever you're piping to. I haven't encountered a problem with decomposed Unicode in bash/less/vim.

>> I agree that pdftotext should by default output NFC but you need to
>> decide whether to implement an NFC against the out of date poppler
>> tables or link to icu.
>
> I don't think linking to icu (which last i checked is a huuuuuuuuuge
> monster way bigger than poppler itself in size), otoh why you say
> poppler tables are out of date? Nobody has complained about something
> not working :D

Normalisation relies on the canonical character compositions, which come from the Unicode tables. The poppler .h files are dated 2008 and there have been two new Unicode versions since 2008 (assuming the tables used then were current). I'm not saying they're broken, only that the Unicode tables have changed and will change again.

Regardless, I will normalise the output from pdftotext to NFKC anyway. I just need it to not mangle TeX-generated PDFs. I don't see this as dependent on fixing pdftotext's normalisation.

--
All the best,
Tim.
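
For readers following the NFC/NFKC discussion, here is a minimal sketch of what normalising pdftotext output after the fact could look like. It uses Python's standard unicodedata module rather than poppler's own tables or ICU, and the normalise() helper is a hypothetical name for illustration, not anything in poppler. The result depends on the Unicode version bundled with the interpreter, which is the same table-version concern raised above.

```python
import unicodedata


def normalise(text: str, form: str = "NFC") -> str:
    """Return text in the requested Unicode normalisation form.

    Hypothetical helper: it only wraps the standard library call and
    is not part of poppler or pdftotext.
    """
    return unicodedata.normalize(form, text)


if __name__ == "__main__":
    # A base letter followed by a combining acute accent, as might be
    # recovered from a TeX-generated PDF that overprints the accent.
    decomposed = "e\u0301"

    # NFC composes the pair into the single code point U+00E9.
    composed = normalise(decomposed, "NFC")
    print(len(decomposed), len(composed))

    # NFKC additionally folds compatibility characters such as the
    # "fi" ligature (U+FB01) into plain letters; NFC leaves it alone.
    print(normalise("\ufb01", "NFKC"))

    # The Unicode version of the tables this interpreter ships with.
    print(unicodedata.unidata_version)
```

Anything piped from pdftotext could be passed through such a helper, so the choice between NFC and NFKC stays with the consumer and does not depend on the tables shipped with poppler.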
_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
