On Wednesday 30 April 2008, Ross Moore wrote:
> Hello Albert,
>
> On 30/04/2008, at 7:56 AM, Albert Astals Cid wrote:
> > Available from
> > http://poppler.freedesktop.org/poppler-0.8.2.tar.gz
> >
> > Testing, patches and bug reports welcome.
>
> I joined this list recently, to see whether the Poppler versions
> of the Xpdf utilities worked any differently from the non-Poppler
> versions.
>
> I'm working on a Mac, with MacOS X v10.4.11, and have successfully
> built the utilities from this latest release.
>
> All of pdfinfo, pdffonts, pdftohtml, pdftotext, pdftops, pdftoppm
> and pdfimages work fine on a simple 1-page PDF that I created
> with pdfTeX.
>
> However, all of these fail with a "Bus error" on more
> complicated multi-page PDFs, which you can find here:
>
> http://www.maths.mq.edu.au/~ross/5019-e-cmap.pdf
> http://www.maths.mq.edu.au/~ross/5019-e-mmap.pdf

This is due to a problem in how Annotations are handled. I found a
possible way to track down the problem, but I'd like to ask you a
question first. In the PDF code I can see some buttons with texts like
"Shift-click image then move mouse to shift image; click again (no Shift)
anchors at destination". Can you see these buttons with Acrobat? I tried
to look at them but could not find them.

>
> I'm particularly interested in pdffonts, pdftohtml, pdftotext
> as I want a free tool to be able to correctly extract the text
> from documents such as the above PDFs.
>
> They must extract the *complete* textual contents, using the
> CMap font-encoding resources that these PDFs contain.
>
> Non-poppler versions of the utilities; e.g.
>
> rossmoor% pdftotext -v
> pdftotext version 3.02
> Copyright 1996-2007 Glyph & Cog, LLC
>
> work to some extent, but certainly not completely.
> (pdfimages works but the output is incomplete and useless
> and pdftoppm also gives a Bus error.)
>
> For example, this is part of the text extracted from 5019-e-mmap.pdf
> using pdftotext (v3.02)
>
> Figure 1: The Moebius strip. Consider the two-sheeted covering
> \pi : \BbbS 2 \rightar P and the inverse image \pi - 1 (L)
> of one of these circles.
>
> It's pretty good, except that \rightarrow has been truncated
> to 8 characters. There are many similar instances within the
> full text. However, the Poppler version doesn't get far enough
> through the document to see this --- at least not for me.

This would be a separate bug. Let's sort the first one first.

Albert

>
> BTW, the text selection in Adobe Reader (versions 7.* & 8.*)
> does extract the text more completely; so there is either
> a bug or a design flaw within the pdftotext utility.
>
> > Albert
>
> Hope this helps,
> and that you can help me.
>
> Cheers,
>
> Ross
>
> ------------------------------------------------------------------------
> Ross Moore                                       [EMAIL PROTECTED]
> Mathematics Department                           office: E7A-419
> Macquarie University                             tel: +61 (0)2 9850 8955
> Sydney, Australia 2109                           fax: +61 (0)2 9850 8114
> ------------------------------------------------------------------------
