On Sun, Dec 2, 2018 at 12:51 PM Adam Reichold <[email protected]> wrote:
>
> Hello,
>
> On 02.12.18 at 00:06, Albert Astals Cid wrote:
> > On Saturday, 1 December 2018, at 23:20:46 CET, Jeroen Ooms wrote:
> >> I maintain the poppler bindings for the R programming language and get
> >> a lot of bug reports about corrupted text extracted with poppler.
> >> Below a minimal example that illustrates the problem:
> >>
> >>   git clone https://github.com/jeroen/popplertest
> >>   cd popplertest
> >>   g++ -std=c++11 encoding.cpp -o encoding $(pkg-config --cflags --libs poppler-cpp)
> >>   ./encoding hello.pdf
> >>
> >> The output shows a lot of Chinese characters which is incorrect (all
> >> text in the pdf is english).
> >>
> >> Back in March 2018, Suzuki Toshiya had posted a patch with at least a
> >> partial solution:
> >> https://lists.freedesktop.org/archives/poppler/2018-March/012962.html
> >> . I hope we can revisit this.
> >
> > Can someone please post a patch to the new gitlab merge requests? It's
> > muuuuuch easier to keep track of what needs reviewing if we have it all
> > there.
>
> Created !129 [1]. Probably a big improvement but I am not completely
> convinced that this is all there is to do.

Thank you for reviving this! I tested your branch with my example
program [1] and can confirm that it now extracts the correct text for
all my example PDF files. I tested both plain English documents and
PDF files with Chinese text.

I am not sure this covers every edge case of rare PDF encoding
obscurities, but it is already an enormous improvement over the
current situation (in which ustrings contain only gibberish).
Hopefully this can be merged soon and we can tweak details in further
iterations.

[1] https://github.com/jeroen/popplertest/blob/master/encoding.cpp
_______________________________________________
poppler mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/poppler
