Re: [Pythonmac-SIG] PDF reading

DavidW Sun, 25 Jan 2009 19:12:17 -0800

It's been a while since I did any PDF-to-text extraction.
Last time I did, I used some tools that are part of


http://www.hforge.org/itools

Which, I seem to remember, also does elementary decryption.

David.

On 26/01/2009, at 7:02 AM, Bill Janssen wrote:

Paul Brown <[email protected]> wrote:

Anyone have any pointers on reading a PDF file?
I need to extract the text content, page number, text style, block,
... all in XML if poss.

Paul


Hi, Paul.

I use a patched version of xpdf to get this stuff, which works pretty
well.  Extracts the text and wordbox info (page, word rectangle, font,
bold/italic, etc.)  for each word in the PDF.  You can download the
patch to xpdf from

http://downloads.sourceforge.net/doceng-toolkit/doceng-package-sources.zip

You'll have to unpack the zip file and look for it in there, then apply
it to the xpdf sources and build xpdf.  I've sent the patch to the xpdf
maintainer, but haven't heard more about it from him.  See the (patched)
xpdf man page for details of the output format (ASCII text, one word
record per line).
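
Since the original question asked for XML, here is a rough Python sketch
of turning one-word-per-line records into XML.  The record layout used
here (tab-separated page, x0, y0, x1, y1, font, bold, italic, word) is a
made-up assumption for illustration only; the real field order is whatever
the patched xpdf man page documents, so FIELDS and the split would need
adjusting.  The function name records_to_xml and the element names are
likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# HYPOTHETICAL record layout, NOT taken from the patched xpdf man page:
#   page <TAB> x0 <TAB> y0 <TAB> x1 <TAB> y1 <TAB> font <TAB> bold <TAB> italic <TAB> word
FIELDS = ("page", "x0", "y0", "x1", "y1", "font", "bold", "italic")


def records_to_xml(lines):
    """Group word records by page and return the root XML element."""
    root = ET.Element("document")
    pages = {}  # page number (as text) -> <page> element
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        parts = line.split("\t")
        if len(parts) != len(FIELDS) + 1:
            continue  # skip lines that do not match the assumed layout
        attrs = dict(zip(FIELDS, parts))
        page_no = attrs.pop("page")
        page = pages.get(page_no)
        if page is None:
            page = ET.SubElement(root, "page", number=page_no)
            pages[page_no] = page
        # Remaining fields become attributes; the word itself is the text.
        word = ET.SubElement(page, "word", attrs)
        word.text = parts[-1]
    return root


# Invented sample records in the assumed layout.
sample = [
    "1\t72.0\t700.0\t110.5\t712.0\tTimes-Roman\t0\t0\tHello",
    "1\t114.0\t700.0\t150.2\t712.0\tTimes-Bold\t1\t0\tworld",
    "2\t72.0\t700.0\t95.0\t712.0\tTimes-Roman\t0\t1\tagain",
]
print(ET.tostring(records_to_xml(sample), encoding="unicode"))
```

In practice the lines would come from the file the patched xpdf writes
rather than from a hard-coded list.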

This is also included in the UpLib release at http://uplib.parc.com/;
you'll have to register an account on the blog in order to get the
download link for that.  If you download and install one of the binary
builds of UpLib, the patched xpdf is included.

Bill
_______________________________________________
Pythonmac-SIG maillist  -  [email protected]
http://mail.python.org/mailman/listinfo/pythonmac-sig


________________________________________________
David Worrall.
- Sonic Communications Research Group:  creative.canberra.edu.au/scrg
- Experimental Polymedia:       www.avatar.com.au
- Education for Financial Independence: www.mindthemarkets.com.au


