In article <mailman.2823.1238221222.11746.python-l...@python.org>,
Gabriel Genellina <gagsl-...@yahoo.com.ar> wrote:
>On Thu, 26 Mar 2009 18:31:31 -0300, M Kumar <tomanis...@gmail.com>
>wrote:
>
>> I need to read pdf files and extract data from it, is there any way to
>> do it
>> through python.
>
>If you are interested in the text, I'd use ghostscript pdf2text (you may
>invoke it from inside python).
>
>Actually extracting text from a PDF is rather difficult. It's a
>"presentation" format (or "display" format); every word in the document
>might be absolutely positioned, there is no paragraph structure you can
>rely on.
			.
			.
			.
I reinforce Gabriel's good advice with a few points of my own:
A.  I used to try to index PDF text extractors at
    <URL: http://phaseit.net/claird/comp.text.pdf/PDF_converters.html#pdf2txt >.
    While I haven't maintained this page in years, it would
    take only a little motivation for me to freshen it
    considerably.
B.  My current favorite is pdftotext.
C.  There are multiple "pdf2txt"-s, that is, different
    products which share a name.  Notice Gabriel's
    qualification that he is thinking of the *GS* one.
D.  Many times the best way to automate a business process
    involving PDF demands a trek farther "upstream", that
    is, identification of the source of a text *before* it
    was rendered as PDF.  Do you have access to such sources?
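
To make point B concrete, here is a minimal sketch of driving pdftotext
from Python.  It assumes the pdftotext binary (the one shipped with Xpdf
or Poppler) is installed and on your PATH; the function names
build_pdftotext_command and extract_text are mine, invented for
illustration, and not part of any library.  Passing "-" as the output
file asks pdftotext to write to standard output.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=False):
    """Return the argv list for one pdftotext run that writes to stdout.

    Hypothetical helper, named here only for illustration.
    """
    command = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout asks pdftotext to preserve the physical page layout
        # as well as it can, which sometimes helps with tabular data.
        command.append("-layout")
    # A trailing "-" sends the extracted text to standard output.
    command += [pdf_path, "-"]
    return command


def extract_text(pdf_path, keep_layout=False):
    """Run pdftotext on pdf_path and return the extracted text."""
    # Assumption: the external pdftotext program is installed.
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Something like extract_text("invoice.pdf", keep_layout=True) would then
hand back one string for the whole document, where "invoice.pdf" is only
a placeholder name.  As Gabriel warns, do not expect reliable paragraph
structure in what comes back.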
-- http://mail.python.org/mailman/listinfo/python-list