It's been a while since I did any PDF text extraction.
Last time I did, I used some tools that are part of itools:

http://www.hforge.org/itools

which, I seem to remember, also does elementary decryption.

David.

On 26/01/2009, at 7:02 AM, Bill Janssen wrote:

Paul Brown <appwo...@mac.com> wrote:

Does anyone have any pointers on reading a PDF file?

I need to extract the text content, page number, text style, block,
... all in XML if possible.

Paul

Hi, Paul.

I use a patched version of xpdf to get this stuff, which works pretty
well.  Extracts the text and wordbox info (page, word rectangle, font,
bold/italic, etc.)  for each word in the PDF.  You can download the
patch to xpdf from
http://downloads.sourceforge.net/doceng-toolkit/doceng-package-sources.zip .
You'll have to unpack the zip file and look for it in there, then apply
it to the xpdf sources and build xpdf.  I've sent the patch to the xpdf
maintainer, but haven't heard more about it from him.  See the (patched)
xpdf man page for details of the output format (ASCII text, one word
record per line).
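
For the "text content and page number in XML" part of the request, a
minimal Python sketch is given here.  It assumes the stock (unpatched)
pdftotext command from xpdf is on the PATH, so it does not give the
per-word font and rectangle records that the patch adds.  The function
names and the <document>/<page> element names are made up for
illustration.  It relies on pdftotext writing a form feed after each
page, which is its default behaviour.

```python
import subprocess
import xml.etree.ElementTree as ET


def run_pdftotext(pdf_path):
    """Return the text of pdf_path using the stock pdftotext CLI."""
    # "-" as the output file name asks pdftotext to write to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def pages_to_xml(text):
    """Wrap form-feed-separated page text in a <document><page> tree."""
    pages = text.split("\f")
    # The last page also ends with a form feed, leaving an empty tail.
    if pages and not pages[-1].strip():
        pages.pop()
    root = ET.Element("document")
    for number, body in enumerate(pages, start=1):
        # Hypothetical schema: one <page number="N"> element per page.
        page = ET.SubElement(root, "page", number=str(number))
        page.text = body
    return ET.tostring(root, encoding="unicode")
```

Usage would be something like print(pages_to_xml(run_pdftotext("in.pdf"))),
where in.pdf is a placeholder name.  Text containing control characters
that XML does not allow would need cleaning first.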

This is also included in the UpLib release at http://uplib.parc.com/;
you'll have to register an account on the blog in order to get the
download link for that.  If you download and install one of the binary
builds of UpLib, the patched xpdf is included.

Bill
_______________________________________________
Pythonmac-SIG maillist  -  Pythonmac-SIG@python.org
http://mail.python.org/mailman/listinfo/pythonmac-sig


________________________________________________
David Worrall.
- Sonic Communications Research Group:  creative.canberra.edu.au/scrg
- Experimental Polymedia:       www.avatar.com.au
- Education for Financial Independence: www.mindthemarkets.com.au

