It's been a while since I did any PDF text extraction.
Last time I did, I used some tools that are part of itools
(http://www.hforge.org/itools), which, I seem to remember, also
does elementary decryption.
David.
On 26/01/2009, at 7:02 AM, Bill Janssen wrote:
Paul Brown <appwo...@mac.com> wrote:
Does anyone have any pointers on reading a PDF file?
I need to extract the text content, page number, text style, block,
... all in XML if possible.
Paul
Hi, Paul.
I use a patched version of xpdf to get this stuff, which works pretty
well. It extracts the text and wordbox info (page, word rectangle, font,
bold/italic, etc.) for each word in the PDF. You can download the
patch to xpdf from
http://downloads.sourceforge.net/doceng-toolkit/doceng-package-sources.zip
You'll have to unpack the zip file and look for it in there, then apply
it to the xpdf sources and build xpdf. I've sent the patch to the xpdf
maintainer, but haven't heard more about it from him. See the (patched)
xpdf man page for details of the output format (ASCII text, one word
record per line).
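
For anyone who only needs text plus page numbers in XML, here is a
minimal sketch that does not use the patch at all. It assumes the stock
pdftotext tool that ships with xpdf is on the PATH, and it relies on
pdftotext ending each page with a form feed character. The function
names (pages_to_xml, pdf_to_xml) and the <document>/<page> element
names are made up for illustration. It does not recover fonts, styles
or word rectangles; that is what the patched build is for.

```python
import subprocess
import xml.etree.ElementTree as ET


def pages_to_xml(raw_text):
    """Wrap form-feed-separated page text in a simple XML tree.

    The <document>/<page number="N"> layout is a hypothetical schema,
    not something defined by xpdf.
    """
    root = ET.Element("document")
    pages = raw_text.split("\f")
    # pdftotext ends every page with a form feed, so the final split
    # piece is normally empty; drop it rather than emit a blank page.
    if pages and not pages[-1].strip():
        pages.pop()
    for number, page_text in enumerate(pages, start=1):
        page = ET.SubElement(root, "page", number=str(number))
        page.text = page_text.strip()
    return root


def pdf_to_xml(pdf_path):
    """Run stock pdftotext on pdf_path and return the XML tree."""
    # "-" as the output file sends the extracted text to stdout.
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", "-layout", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    text = result.stdout.decode("utf-8", errors="replace")
    return pages_to_xml(text)
```

Something like ET.tostring(pdf_to_xml("some.pdf"), encoding="unicode")
then gives the XML as a string ("some.pdf" being a placeholder path).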
This is also included in the UpLib release at http://uplib.parc.com/;
you'll have to register an account on the blog in order to get the
download link for that. If you download and install one of the binary
builds of UpLib, the patched xpdf is included.
Bill
_______________________________________________
Pythonmac-SIG maillist - Pythonmac-SIG@python.org
http://mail.python.org/mailman/listinfo/pythonmac-sig
________________________________________________
David Worrall.
- Sonic Communications Research Group: creative.canberra.edu.au/scrg
- Experimental Polymedia: www.avatar.com.au
- Education for Financial Independence: www.mindthemarkets.com.au