It's possible (likely) that I came into this in the middle, so sorry if this was already thrown out... but have you looked at any of the following suggestions?
https://pypi.python.org/pypi?%3Aaction=search&term=pdf+convert&submit=search
http://stackoverflow.com/questions/6413441/python-pdf-library
https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167

-----Original Message-----
From: Python-list [mailto:python-list-bounces+d.strohl=f5....@python.org] On Behalf Of Scott Werner
Sent: Friday, November 06, 2015 2:30 PM
To: python-list@python.org
Subject: Re: Script to extract text from PDF files

On Tuesday, September 25, 2007 at 1:41:56 PM UTC-4, brad wrote:
> I have a very crude Python script that extracts text from some (and I
> emphasize some) PDF documents. On many PDF docs, I cannot extract
> text, but this is because I'm doing something wrong. The PDF spec is
> large and complex and there are various ways in which to store and
> encode text. I wanted to post here and ask if anyone is interested in
> helping make the script better which means it should accurately
> extract text from most any pdf file... not just some.
>
> I know the topic of reading/extracting the text from a PDF document
> natively in Python comes up every now and then on comp.lang.python...
> I've posted about it in the past myself. After searching for other
> solutions, I've resorted to attempting this on my own in my spare time.
> Using apps external to Python (pdftotext, etc.) is not really an
> option for me. If someone knows of a free native Python app that does
> this now, let me know and I'll use that instead!
>
> So, if other more experienced programmers are interested in helping
> make the script better, please let me know. I can host a website and
> the latest revision and do all of the grunt work.
>
> Thanks,
>
> Brad

As mentioned before, extracting plain text from a PDF document can be hit or miss. I have tried all the following applications (free/open source) on Arch Linux. Note: I executed the commands with subprocess and either captured stdout or read the plain text file created by the application.
* textract (uses pdftotext) - https://github.com/deanmalmgren/textract
* pdftotext - http://poppler.freedesktop.org/
  - cmd: pdftotext -layout "/path/to/document.pdf" -
  - cmd: pdftotext "/path/to/document.pdf" -
* Calibre - http://calibre-ebook.com/
  - cmd: ebook-convert "/path/to/document.pdf" "/path/to/plain.txt" --no-chapters-in-toc
* AbiWord - http://www.abiword.org/
  - cmd: abiword --to-name=fd://1 --to-TXT "/path/to/document.pdf"
* Apache Tika - https://tika.apache.org/
  - cmd: "/usr/bin/java" -jar "/path/to/standalone/tika-app-1.10.jar" --text-main "/path/to/document.pdf"

For my application, I saw the best results using Apache Tika. However, I still encounter strange encoding or extraction issues, e.g. "S P A C E D O U T H E A D E R S" and "\nBroken \nHeader\n". I ended up writing a lot of repairing/cleaning methods. I welcome an improved solution that has some intelligence, like comparing the order of the extracted plain text to a snapshot of the PDF page using OCR.

--
https://mail.python.org/mailman/listinfo/python-list
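For anyone who wants to try the subprocess approach described above, here is a rough sketch, not the code actually used in the message. The helper names (run_extractor, pdftotext_argv, collapse_spaced_caps) are made up for illustration, and it assumes the tool writes UTF-8 to stdout. The cleanup step only handles the spaced-out capitals case, and it assumes the gaps between words are wider than the single space between letters; if the extractor collapses those gaps, the word boundaries cannot be recovered this way.

```python
import re
import subprocess


def pdftotext_argv(pdf_path, layout=True):
    """Build the pdftotext command line; the trailing "-" sends text to stdout."""
    argv = ["pdftotext"]
    if layout:
        argv.append("-layout")
    argv += [pdf_path, "-"]
    return argv


def run_extractor(argv, timeout=120):
    """Run an external extractor and return what it wrote to stdout.

    Raises subprocess.CalledProcessError on a non-zero exit status.
    Decoding as UTF-8 is an assumption; undecodable bytes are replaced.
    """
    result = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


# Three or more capital letters, each separated by exactly one space.
_SPACED_CAPS = re.compile(r"\b(?:[A-Z] ){2,}[A-Z]\b")


def collapse_spaced_caps(text):
    """Join runs like "H E A D E R" into "HEADER" (heuristic, may misfire)."""
    return _SPACED_CAPS.sub(lambda m: m.group(0).replace(" ", ""), text)


if __name__ == "__main__":
    # Requires pdftotext (poppler) on PATH and a real file at this path.
    raw = run_extractor(pdftotext_argv("/path/to/document.pdf"))
    print(collapse_spaced_caps(raw))
```

The other commands in the list can be passed to run_extractor the same way, as a list of arguments rather than a single shell string, which avoids quoting problems with paths that contain spaces.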