I learned a bit more about PDFs today :) I believe I've found the offending TJ:

    /C0_0 1 Tf
    15.9927 0 9.0157 13.2093 304.8821 331.25 Tm
    [<00170016001000000037>55<000e>74<00000033>9<0057>4<00410052>-24<0054005a00560049004c004c004500000032>-4<0044>20<000e>]TJ

Font:

    ...
    /Font << /C0_0 18 0 R
    ...
    %% Original object ID: 123 0
    18 0 obj
    <<
      /BaseFont /CDGGAZ+Myriad-Roman
      /DescendantFonts 66 0 R
      /Encoding /Identity-H
      /Subtype /Type0
      /Type /Font
    >>
    endobj

Notably, it's missing a /ToUnicode, which all of the other fonts have.

I extracted the font object, which has `/Subtype CIDFontType0C`, using pdftosrc. Unfortunately, `file` does not recognize the format and I'm struggling to find anything able to read it. Hints appreciated.

So, is there a poppler bug here? It seems that the Identity-H encoded characters (including nulls) are being emitted via the glib poppler_page_get_text API, which is messing up the C-string length. Should the API instead drop those characters for which there isn't a Unicode mapping?

Thanks in advance.

On 26 May 2015 at 12:56, Peter Waller wrote:
> I forgot to note that I transformed unprintable characters to "X" in
> my dumped representation.

_______________________________________________
poppler mailing list
http://lists.freedesktop.org/mailman/listinfo/poppler
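
To make the suspected failure mode concrete, here is a rough Python sketch that splits the TJ operand quoted above into its 2-byte Identity-H codes and shows where a NUL-terminated string would be cut off. The helper names (`decode_identity_h`, `naive_text`, `c_string_view`) are made up for illustration, and `naive_text` encodes an assumption: that a code with no /ToUnicode mapping is passed through as if it were a Unicode code point. I have not confirmed that this is what poppler actually does.

```python
import re

# TJ operand copied from the content stream quoted above.
TJ_OPERAND = (
    "[<00170016001000000037>55<000e>74<00000033>9<0057>4<00410052>-24"
    "<0054005a00560049004c004c004500000032>-4<0044>20<000e>]"
)


def decode_identity_h(tj_operand):
    """Return the 2-byte character codes from every hex string in a TJ array."""
    codes = []
    for hex_string in re.findall(r"<([0-9A-Fa-f]+)>", tj_operand):
        raw = bytes.fromhex(hex_string)
        # Identity-H: each code is two bytes, big-endian, and equals the CID.
        codes.extend(
            int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)
        )
    return codes


def naive_text(codes):
    """Assumed fallback (hypothetical): treat each code as a Unicode code point."""
    return "".join(chr(c) for c in codes)


def c_string_view(text):
    """Bytes a NUL-terminated C string consumer would see of the UTF-8 text."""
    return text.encode("utf-8").split(b"\x00", 1)[0]


codes = decode_identity_h(TJ_OPERAND)
text = naive_text(codes)
print(len(codes), "codes, of which", codes.count(0), "are 0x0000")
print(
    "bytes visible before the first NUL:",
    len(c_string_view(text)),
    "of",
    len(text.encode("utf-8")),
)
```

Under that assumption, anything that measures the result with strlen would stop at the first 0x0000 code, which would be consistent with the length problem described above.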
