Paul Brown <appwo...@mac.com> wrote:
> anyone have any pointers on reading a pdf file.
>
> i need to extract the text content , page number , text style , block
> , ... all in XML if poss
>
> Paul

Hi, Paul.

I use a patched version of xpdf to get this stuff, which works pretty well. It extracts the text and wordbox info (page, word rectangle, font, bold/italic, etc.) for each word in the PDF.

You can download the patch to xpdf from http://downloads.sourceforge.net/doceng-toolkit/doceng-package-sources.zip. You'll have to unpack the zip file and look for the patch in there, then apply it to the xpdf sources and build xpdf. I've sent the patch to the xpdf maintainer, but haven't heard more about it from him. See the (patched) xpdf man page for details of the output format (ASCII text, one word record per line).

The patched xpdf is also included in the UpLib release at http://uplib.parc.com/; you'll have to register an account on the blog in order to get the download link for that. If you download and install one of the binary builds of UpLib, the patched xpdf is included.

Bill
_______________________________________________
Pythonmac-SIG maillist - Pythonmac-SIG@python.org
http://mail.python.org/mailman/listinfo/pythonmac-sig
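
Since the original question asked for XML, here is a rough sketch of how the one-word-per-line records could be turned into XML with Python's standard library. The field order in FIELDS is purely an assumption made for illustration, as are the function name, the element names, and the sample records; the real record layout is whatever the patched xpdf man page describes, so FIELDS would need to be adjusted to match it.

```python
import xml.etree.ElementTree as ET

# ASSUMED field order, for illustration only. The real layout is documented
# in the patched xpdf man page; edit this tuple to match it.
FIELDS = ("page", "left", "top", "right", "bottom", "font", "bold", "italic")

# Made-up sample records in the assumed layout (not real xpdf output).
SAMPLE = [
    "1 72.0 90.5 110.2 102.5 Times-Roman 0 1 Hello",
    "1 114.0 90.5 150.0 102.5 Times-Roman 0 0 world",
    "2 72.0 90.5 101.3 102.5 Helvetica 1 0 Next",
]


def wordboxes_to_xml(lines):
    """Group word records by page and return a <document> element."""
    root = ET.Element("document")
    pages = {}  # page number (as string) -> <page> element
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split off the leading fields; whatever remains is the word text.
        parts = line.split(None, len(FIELDS))
        if len(parts) != len(FIELDS) + 1:
            continue  # record does not match the assumed layout, skip it
        attrs = dict(zip(FIELDS, parts))
        page_no = attrs.pop("page")
        page = pages.get(page_no)
        if page is None:
            page = ET.SubElement(root, "page", number=page_no)
            pages[page_no] = page
        word = ET.SubElement(page, "word", attrs)
        word.text = parts[-1]
    return root


if __name__ == "__main__":
    print(ET.tostring(wordboxes_to_xml(SAMPLE), encoding="unicode"))
```

To use it on real output, pass an open file (or any iterable of lines) holding the word records instead of SAMPLE. Grouping words into blocks or paragraphs would need an extra pass over the rectangles, which this sketch does not attempt.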