Hi all, Finally I think I found the root of issue and I can propose a fix. pre-patch situation is like this: https://travis-ci.org/mpsuzuki/poppler/builds/357212162
post-patch situation is like this: https://travis-ci.org/mpsuzuki/poppler/builds/357956103 My fix consists from 2 parts. part 1) I replaced all detail::unicode_GooString_to_ustring() by ustring::from_utf8(), this was suggested by Adam. https://github.com/mpsuzuki/poppler/commit/7404f5effa8e303399e5101d54ff954ee5153e44 I think this rather simple fix was already reviewed by Albert. part 2) UTF-16 handling needs some improvement. the issue was reported by Jeroen. https://github.com/mpsuzuki/poppler/commit/b3230c7098b891da0b92742264d78c9bd86750bd 2-a: PDF's document metadata /Info dict provides sometimes ASCII (or UTF-8?) string, sometimes UTF-16 string (with BOM). current implementation assumes it is always UTF-8. I inserted a conditional to check BOM and deal as UTF-16 if BOM is found. for ustring object creation by UTF-16, I added a private method ustring::from_utf16(). I'm afraid there should be a room for discussion about the better design of UTF-16 in ustring(), I don't make ustring::from_utf16 at present. 2-b: the term "UTF-16" is not always same in various iconv(). "UTF-16" is sometimes regarded as UTF-16BE, sometimes as UTF-16LE. please compare the results of "./encoding ./hello.pdf" by Linux and Mac OS X. Linux: https://travis-ci.org/mpsuzuki/poppler/jobs/357212163#L1106 Mac OS X: https://travis-ci.org/mpsuzuki/poppler/jobs/357212164#L2463 my fix is conditionalization of the name by WORDS_BIGENDIAN macro (which is not newly-introduced, it was already used in the source CairoOutputDev). 2-c: unneeded twiced sized buffer is skipped. ======================== There is a room of the discussion about the design how ustring object store UTF-16 data. There is a direction thinking as the buffer should include BOM always, to clarify the byteorder. There is another direction thinking as we should not expect the exist of BOM (sometimes it may have, sometimes it may lack), because the handling of BOM is more complicated than std::basic_string's existing manipulations (like concatenation, splicing, replacing etc). I took the 2nd direction in my proposed fix, but other experts think the 1st direction would be better. In fact, I'm unfamiliar with how the cpp-frontend users think about a BOM in ustring object. If there are so many existing implementations assuming as if ustring always starts with a BOM (and they have their own routines for the concatenation, splicing and replacing), we should care for that. Please let me hear how the users think. Regards, mpsuzuki Jeroen Ooms wrote: > Thanks everyone for the work on this issue, really appreciate the > input. Also excited about mpsuzuki's suggestion to include font data > with the text_list, this will be super helpful. > > I have updated and cleaned my example code a little bit to make it > easier to test these issues. The updated test program and a small pdf > file (with expected output) are now here: > https://github.com/jeroen/popplertest . This makes it easier to update > the example for new features/issues. > > It seems there are at least 3 places where encoding is not behaving as > expected in 0.63: in the text_list and info_keys and toc. I think > mpsuzuki's patch addresses these problems. > _______________________________________________ > poppler mailing list > poppler@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/poppler _______________________________________________ poppler mailing list poppler@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/poppler