Oh, I should take a look. Do you think any change to the public API
of the cpp frontend is needed?


On 3/6/2018 12:29 AM, Jeroen Ooms wrote:
A minimal example of this in a simple C++ program: https://git.io/vAQFW

When running the example on a simple English PDF file,
page->text() gets printed correctly; however, the metadata fields as
well as the words from page->text_list() seem to get the wrong
encoding. What am I doing wrong here?

On Mon, Mar 5, 2018 at 3:10 PM, Jeroen Ooms <jer...@berkeley.edu> wrote:
I'm testing the new page::text_list() function but I run into an old
problem where the conversion of the ustring to UTF-8 doesn't do what I
expect:

   byte_array buf = x.to_utf8();
   std::string y(buf.begin(), buf.end());
   const char * str = y.c_str();

The resulting char * is not UTF-8. It contains random Chinese
characters for PDF files with plain English ASCII text. I can work
around the problem by using x.to_latin1(), which mostly gives the
correct text, but obviously it doesn't work for non-English text.
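
For readers following the thread: "Chinese characters from plain ASCII" is the pattern you would see if single-byte text were read as 16-bit code units at some point before the UTF-8 conversion. The thread does not confirm that this is what happens inside poppler, so the sketch below only reproduces the symptom. It does not use the poppler API, and the helper names (utf16_units_to_utf8, widen_each_byte, pack_byte_pairs) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Encode 16-bit code units as UTF-8. Only the Basic Multilingual Plane is
// handled; surrogate pairs are out of scope for this illustration.
std::string utf16_units_to_utf8(const std::vector<std::uint16_t>& units) {
    std::string out;
    for (std::uint16_t u : units) {
        assert(u < 0xD800 || u > 0xDFFF);  // no surrogates in this sketch
        if (u < 0x80) {
            out.push_back(static_cast<char>(u));
        } else if (u < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (u >> 6)));
            out.push_back(static_cast<char>(0x80 | (u & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xE0 | (u >> 12)));
            out.push_back(static_cast<char>(0x80 | ((u >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (u & 0x3F)));
        }
    }
    return out;
}

// Intended widening: every ASCII byte becomes its own code unit.
std::vector<std::uint16_t> widen_each_byte(const std::string& ascii) {
    std::vector<std::uint16_t> units;
    for (unsigned char c : ascii) {
        units.push_back(c);
    }
    return units;
}

// Assumed failure mode: consecutive byte pairs are read as big-endian
// 16-bit units, so two ASCII letters collapse into one code point.
// A trailing odd byte is dropped.
std::vector<std::uint16_t> pack_byte_pairs(const std::string& ascii) {
    std::vector<std::uint16_t> units;
    for (std::size_t i = 0; i + 1 < ascii.size(); i += 2) {
        const unsigned hi = static_cast<unsigned char>(ascii[i]);
        const unsigned lo = static_cast<unsigned char>(ascii[i + 1]);
        units.push_back(static_cast<std::uint16_t>((hi << 8) | lo));
    }
    return units;
}
```

With these helpers, utf16_units_to_utf8(widen_each_byte("Hello!")) returns the original six bytes, while utf16_units_to_utf8(pack_byte_pairs("Hello!")) returns three 3-byte sequences. The first comes from 'H','e' packed into U+4865, which lies in the CJK Extension A block. If a hex dump of the to_utf8() output shows this shape, the input bytes were probably widened incorrectly before the conversion, rather than the conversion itself being wrong.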

I remember running into this before, for example when reading a
toc_item->title() or document->info_key(): the conversion to UTF-8 also
doesn't seem to work. Perhaps I am misunderstanding how this works. Is
there some limitation on PDFs or ustrings that limits their ability to
be converted to UTF-8?

Somehow I am not getting this problem for ustrings from the page->text() method.
poppler mailing list
