Re: Help in reading the pdf file

Cameron Laird Sat, 28 Mar 2009 10:10:42 -0700

In article <[email protected]>,
Gabriel Genellina <[email protected]> wrote:
>On Thu, 26 Mar 2009 18:31:31 -0300, M Kumar <[email protected]>
>wrote:
>
>> I need to read pdf files and extract data from it, is there any way to
>> do it through python.
>
>If you are interested in the text, I'd use ghostscript pdf2text (you may
>invoke it from inside python).
>
>Actually extracting text from a PDF is rather difficult. It's a
>"presentation" format (or "display" format); every word in the document
>might be absolutely positioned, there is no paragraph structure you can
>rely on.
                        .
                        .
                        .
I reinforce Gabriel's good advice with a few points of my own:
A.  I used to try to index PDF text extractors
    at <URL:
    http://phaseit.net/claird/comp.text.pdf/PDF_converters.html#pdf2txt >.
    While I haven't maintained this page in years,
    it would take only a little motivation for me
    to freshen it considerably.
B.  My current favorite is pdftotext.
C.  There are multiple "pdf2txt"-s, that is,
    different products which share a name.  Notice
    Gabriel's qualification that he is thinking
    of the *GS* one.
D.  Many times the best way to automate a business
    process involving PDF demands a trek farther
    "upstream", that is, identification of the 
    source of a text *before* it was rendered as
    PDF.  Do you have access to such sources?
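
For point B, here is a minimal sketch of invoking pdftotext from inside
Python, in the spirit of Gabriel's "you may invoke it from inside python".
It assumes a pdftotext binary (the xpdf or poppler one) is installed and on
your PATH, and that you are running Python 3. The helper names
build_pdftotext_command and extract_text are hypothetical, made up for this
illustration. Treat it as a starting point, not a finished tool.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=False):
    """Return the argument list for one pdftotext run (hypothetical helper)."""
    # -enc UTF-8 selects the encoding of the text that pdftotext writes.
    command = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical layout of the page
        # as far as it can, instead of its default reading order.
        command.append("-layout")
    # Giving "-" as the text-file argument sends the output to stdout.
    command.extend([pdf_path, "-"])
    return command


def extract_text(pdf_path, keep_layout=False):
    """Run pdftotext on pdf_path and return the extracted text as a str."""
    # Assumption: the binary is called "pdftotext" and is on PATH.
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text("report.pdf") (the file name is only a placeholder)
would hand back whatever text pdftotext manages to recover. As Gabriel
notes, that text carries no reliable paragraph structure, so expect to
post-process it.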

--
http://mail.python.org/mailman/listinfo/python-list
