On Sep 27, 12:08 pm, flebber <flebber.c...@gmail.com> wrote:
> On Sep 27, 10:39 am, flebber <flebber.c...@gmail.com> wrote:
>
> > On Sep 27, 9:38 am, "w.g.sned...@gmail.com" <w.g.sned...@gmail.com>
> > wrote:
>
> > > On Sep 26, 7:10 pm, flebber <flebber.c...@gmail.com> wrote:
>
> > > > I was trying to use pyPdf following a recipe from the ActiveState
> > > > cookbooks. However, I cannot get it to work. I am unsure whether it
> > > > is me or because sets are deprecated.
>
> > > > I have placed a PDF in my C:\ drive. It is called
> > > > "Components-of-Dot-NET.pdf". You could use anything; I was just
> > > > testing with it.
>
> > > > I was using the last script on that page, the most recently
> > > > updated one. I am using Python 2.6.
>
> > > > http://code.activestate.com/recipes/511465-pure-python-pdf-to-text-co...
>
> > > > import pyPdf
>
> > > > def getPDFContent(path):
> > > >     content = "C:\Components-of-Dot-NET.pdf"
> > > >     # Load PDF into pyPDF
> > > >     pdf = pyPdf.PdfFileReader(file(path, "rb"))
> > > >     # Iterate pages
> > > >     for i in range(0, pdf.getNumPages()):
> > > >         # Extract text from page and add to content
> > > >         content += pdf.getPage(i).extractText() + "\n"
> > > >     # Collapse whitespace
> > > >     content = " ".join(content.replace(u"\xa0", " ").strip().split())
> > > >     return content
>
> > > > print getPDFContent("Components-of-Dot-NET.pdf").encode("ascii",
> > > > "ignore")
>
> > > > This is my error.
>
> > > > Warning (from warnings module):
> > > >   File "C:\Documents and Settings\Family\Application Data\Python\Python26\site-packages\pyPdf\pdf.py", line 52
> > > >     from sets import ImmutableSet
> > > > DeprecationWarning: the sets module is deprecated
>
> > > > Traceback (most recent call last):
> > > >   File "C:/Python26/Pdfread", line 15, in <module>
> > > >     print getPDFContent("Components-of-Dot-NET.pdf").encode("ascii",
> > > > "ignore")
> > > >   File "C:/Python26/Pdfread", line 6, in getPDFContent
> > > >     pdf = pyPdf.PdfFileReader(file(path, "rb"))
> > > > ---> IOError: [Errno 2] No such file or directory: 'Components-of-Dot-NET.pdf'
>
> > > Looks like an issue with finding the file.
> > > How do you pass the path?
>
> > Okay, thanks. I thought that when I set content here
>
> > def getPDFContent(path):
> >     content = "C:\Components-of-Dot-NET.pdf"
>
> > that I was defining where it is.
>
> > But yeah, I updated the script to the one below and it works. That is,
> > the contents are displayed in the interpreter. How do I output to a
> > .txt file?
>
> > import pyPdf
>
> > def getPDFContent(path):
> >     content = "C:\Components-of-Dot-NET.pdf"
> >     # Load PDF into pyPDF
> >     pdf = pyPdf.PdfFileReader(file(path, "rb"))
> >     # Iterate pages
> >     for i in range(0, pdf.getNumPages()):
> >         # Extract text from page and add to content
> >         content += pdf.getPage(i).extractText() + "\n"
> >     # Collapse whitespace
> >     content = " ".join(content.replace(u"\xa0", " ").strip().split())
> >     return content
>
> > print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii",
> > "ignore")
>
> I have found far more advanced scripts searching around, but I will have
> to keep trying, as I cannot get an output file or specify the path.
>
> Edit: very strangely, whilst searching for examples I found my own post,
> just written here, ranking number 5 on Google within 2 hours. Bizarre.
>
> http://www.eggheadcafe.com/software/aspnet/36237766/errors-with-pypdf...
>
> It replicates our thread as theirs. I was searching Google with "pypdf
> return to txt file"

Traceback (most recent call last):
  File "C:/Python26/Pdfread", line 16, in <module>
    open('x.txt', 'w').write(content)
NameError: name 'content' is not defined

That is the error I get when I use:

import pyPdf

def getPDFContent(path):
    content = "C:\Components-of-Dot-NET.txt"
    # Load PDF into pyPDF
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    # Iterate pages
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")
open('x.txt', 'w').write(content)

-- 
http://mail.python.org/mailman/listinfo/python-list
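
A possible way around the NameError, sketched under assumptions: content exists only inside getPDFContent, so the returned string has to be bound to a name at module level before it can be written to a file. Starting content as an empty string, rather than a path, would also keep the path out of the extracted text. The sketch below uses a hypothetical stand-in, get_content, that takes a list of page strings instead of reading a PDF through pyPdf, so it runs without the library. The names save_text and out_path and the sample strings are likewise made up for illustration.

```python
import os
import tempfile


def get_content(pages):
    """Hypothetical stand-in for getPDFContent: joins already-extracted page text."""
    content = u""  # start empty, not with the file path
    for page_text in pages:
        # Add each page's text to content, one page per line
        content += page_text + u"\n"
    # Collapse whitespace, treating non-breaking spaces as plain spaces
    content = u" ".join(content.replace(u"\xa0", u" ").strip().split())
    return content


def save_text(text, out_path):
    """Write text to out_path as ASCII, dropping characters that cannot be encoded."""
    with open(out_path, "wb") as out_file:
        out_file.write(text.encode("ascii", "ignore"))


# Bind the return value to a name first; the function's local
# variable 'content' is not visible out here.
text = get_content([u"Hello\xa0 world", u"caf\xe9   page"])

# Illustrative output location (assumption): a file in the temp directory.
out_path = os.path.join(tempfile.gettempdir(), "x.txt")
save_text(text, out_path)
```

With pyPdf installed, the same pattern would presumably be text = getPDFContent(r"C:\Components-of-Dot-NET.pdf") followed by save_text(text, "x.txt"), though that has not been run here.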