Hi William,

We see a large number of PDFs with this kind of breakage in them from a
diverse set of sources. I used pdftk to extract the one page from it; the
same issue manifests before and after.

Thanks,

- Peter

On 26 May 2015 at 19:26, William Bader <[email protected]> wrote:
> Is the difference the italic text "760 W. Swartzville Rd. Reinholds, PA
> 17569"?
>
> That is not the address of Zook Interiors, right?
>
> Is that a hidden mark added by the person who created the PDF?
>
> Maybe they intentionally used an incorrect coding.
>
> Then the question might be how the two different methods of extracting
> information respond to invalid data in the PDF.
>
> pdftotext does not handle that text correctly, and ps2ascii (from
> ghostscript 9.16) crashes on it with
>
>    **** Warning: considering '0000000000 XXXXX n' as a free entry.
>
>    *** Warning: composite font characters dumped without decoding.
>
> If a PDF breaks both poppler and ghostscript, the problem is probably the
> PDF.
>
> pdfinfo shows that the file was made by pdftk 1.44, so it could be a bug or
> intentional change in pdftk.
>
> William
>
> ________________________________
> From: [email protected]
> Date: Tue, 26 May 2015 10:53:52 +0100
> To: [email protected]
> Subject: Re: [poppler] Incompatible number of glyphs from glib get_text{,
> layout}
>
>
> On 17 January 2014 at 10:30, Peter Waller <[email protected]> wrote:
>
> A screenshot from the poppler glib demo app demonstrates this, attached
> below. Poppler gets 696 characters and 1261 layout rectangles.
>
> <snip>
>
> http://pwaller.net/sw/2014-01-17-broken.pdf
>
> <snip>
>
> I've reported this on bugzilla here:
> https://bugs.freedesktop.org/show_bug.cgi?id=73885
>
>
> Link to old thread:
> http://thread.gmane.org/gmane.comp.freedesktop.poppler/8683
>
> I've investigated this briefly. An observation:
>
> http://cgit.freedesktop.org/poppler/poppler/tree/glib/poppler-page.cc?id=poppler-0.33.0#n825
>
> The sel_text->getLength() is 1283 (which doesn't match with the 1261 from
> poppler_page_get_layout).
>
> If I change this to use a g_strndup() with the correct length:
>
>     result = g_strndup (sel_text->getCString (), sel_text->getLength());
>
>
> And then look at result[696:], then I find that the rest of the string is
> filled with 0 bytes.
>
> I'm extremely keen to get this fixed, so any pointers would be appreciated.
> The rate of encountering this bug is increasing all the time!
>
> Thanks,
>
> - Peter
>
> _______________________________________________
> poppler mailing list
> [email protected]
> http://lists.freedesktop.org/mailman/listinfo/poppler
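
A note on the result[696:] observation in the quoted message, offered as a
possible reading rather than a diagnosis. GLib documents g_strndup() as
padding the new buffer with NULs when the source string is shorter than n, so
if the text contains an embedded NUL at offset 696, everything after it would
show up as zero bytes in the copy whether or not the original buffer holds
real data there. The sketch below reproduces that effect with the standard C
library only. The helper names (dup_cstring_padded, dup_bytes,
nonzero_after_first_nul) are hypothetical and are not part of poppler or
GLib; strncpy() stands in for the stop-at-NUL-then-pad behaviour.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: string-style duplicate of up to n bytes.
 * strncpy() stops reading at the first NUL in src and zero-fills the
 * remainder of the destination, so data after an embedded NUL is lost. */
static char *dup_cstring_padded(const char *src, size_t n)
{
    char *out = malloc(n + 1);
    if (out == NULL)
        return NULL;
    strncpy(out, src, n);
    out[n] = '\0';
    return out;
}

/* Hypothetical helper: byte-exact duplicate of len bytes.
 * memcpy() does not treat NUL specially, so embedded NULs and the data
 * following them are preserved. */
static char *dup_bytes(const char *src, size_t len)
{
    char *out = malloc(len + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, src, len);
    out[len] = '\0';
    return out;
}

/* Hypothetical helper: count the non-zero bytes that follow the first NUL
 * within the first len bytes of buf. Returns 0 if there is no NUL. */
static size_t nonzero_after_first_nul(const char *buf, size_t len)
{
    const char *nul = memchr(buf, '\0', len);
    size_t count = 0;

    if (nul == NULL)
        return 0;
    for (const char *p = nul; p < buf + len; p++) {
        if (*p != '\0')
            count++;
    }
    return count;
}
```

Running nonzero_after_first_nul() over a byte-exact copy of the text buffer
(taken with the reported length) would tell the two cases apart: a non-zero
count means there is real data behind an embedded NUL, while zero means the
buffer itself is zero-filled past that point.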
