Am 15.04.2018 um 05:08 schrieb suzuki toshiya:
> Hi,
> Now I'm reworking with cpp-frontend encoding issue, and found that:
> previous patch fixes the issue for text drawn in a PDF, but does
> not fix the broken metadata if it is coded by PDFDocEncoding
> (sample PDF is attached).
> During the rework, I'm becoming questionable about the usage of
> iconv() in cpp frontend, by following reasons.
> 1) currently, the usage of iconv() in cpp frontend is only the
> conversion of Unicode: UTF-8 <-> UTF-16. This is completely
> mathematical conversion, using iconv() could be overkill.
> in fact, poppler has its own conversion routines (see UTF.h),
> which are independent from iconv().
> 2) iconv() requires the allocated buffer, but does not help
> the calculation of the required buffer size.
> because this difficulty, current usages of iconv() have ugly
> fallbacks for the cases that prepared buffer is too short,
> like this:
>     size_t ir = iconv(ic, (ICONV_CONST char **)&me_data, &me_len_char,
> &str_data, &str_len_left);
>     if ((ir == (size_t)-1) && (errno == E2BIG)) {
>         const size_t delta = str_data - &str[0];
>         str_len_left += str.size();
>         str.resize(str.size() * 2);
>         str_data = &str[delta];
>         ir = iconv(ic, (ICONV_CONST char **)&me_data, &me_len_char, &str_data,
> &str_len_left);
>         if (ir == (size_t)-1) {
>             return byte_array();
>         }
>     }
>     str.resize(str.size() - str_len_left);
>     return str;
> maybe due to this difficulty, glib or Qt frontends do not use
> iconv() directly, they use better utility functions in these
> frameworks.
> the functions in UTF.h have the functions to calculate the
> required buffer size from given UTF-8 or UTF-16 data.
> 3) iconv() receives the buffer filled by, or to be filled by
> UTF-16 data. currently, casting by reinterpret_cast<char *>()
> is used to interchange between ustring (the array of unsigned
> short) and a byte buffer, but it is not theoretically portable.
> On the systems whose short is greater than 16-bit, or its
> alignment is greater than 16-bit, the casted byte buffer would
> have unexpected data from MSB bits or padding, and it could
> make iconv() confused. using uint16_t (as the functions in UTF.h)
> would be safer.
> To do in a theoretically portable way, we should allocate the
> byte buffer for UTF-16 data, and copy to or copy from it,
> instead of the casting. This would be uneasy for the maintenance.
> ==
> How could I resolve this issue? There might be a few candidates.
> A) use the functions in UTF.h, and change ustring class
> to fit them.
> - pros: easiest maintainability, minimum duplication.
> - cons: API changed, and the memory management by gmem.h
> would be introduced in cpp frontend, it would be less
> consistent in cpp frontend internal.
> B) rewrite the functions in UTF.h by template programming
> and have the interfaces fitting to current ustring().
> - pros: minimum duplication, good consistency, API kept.
> - cons: maintainability would be decreased slightly.
> C) write diversions of the functions in UTF.h in cpp frontend.
> - pros: acceptable maintainability, good consistency, API kept.
> - cons: duplication of almost same code into 2 places.

I think what is best here is hard to decide without seeing actual code
that at least starts in one of these directions.

Concerning ustring, I think an API change is not possible except for
changing unsigned short to uint16_t since it already was broken where
sizeof(unsigned short) != 2.

Best regards,

> I hope to hear the comments from experts.
> Regards,
> mpsuzuki
> _______________________________________________
> poppler mailing list
> poppler@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/poppler

Attachment: signature.asc
Description: OpenPGP digital signature

poppler mailing list

Reply via email to