On 27/09/2010 00:10, flebber wrote:
I was trying to use pyPdf following a recipe from the ActiveState
cookbooks. However, I cannot get it to work. I am unsure whether it is
me or whether it is because sets are deprecated.

The 'sets' module pre-dates the built-in 'set' class. The warning is
just to inform you that the module will be removed in due course (it's
still in Python 2.7, but not Python 3), so you can still use it in
those versions.
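
If the warning is a nuisance, one option is to filter it out before importing pyPdf. This is only a sketch: the filter matches the message text shown in your output, and the variable names are made up for illustration. It also shows the built-in types that replace the old module ('set' for sets.Set, 'frozenset' for sets.ImmutableSet).

```python
import warnings

# Ignore only the DeprecationWarning whose message starts with this text.
# The filter must be installed before the module that triggers it is imported.
warnings.filterwarnings(
    "ignore",
    message="the sets module is deprecated",
    category=DeprecationWarning,
)

# import pyPdf  # would go here, after the filter is in place

# Built-in replacements for the classes in the old 'sets' module.
pages_seen = set([1, 2, 3])          # mutable, like sets.Set
fixed_pages = frozenset(pages_seen)  # immutable, like sets.ImmutableSet
pages_seen.add(4)                    # only the mutable set changes
```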

I have placed a PDF in my C:\ drive. It is called
"Components-of-Dot-NET.pdf". You could use anything; I was just testing
with it.

I was using the last script on that page, the one most recently
updated. I am using Python 2.6.


import pyPdf

def getPDFContent(path):
     content = "C:\Components-of-Dot-NET.pdf"
     # Load PDF into pyPDF
     pdf = pyPdf.PdfFileReader(file(path, "rb"))
     # Iterate pages
     for i in range(0, pdf.getNumPages()):
         # Extract text from page and add to content
         content += pdf.getPage(i).extractText() + "\n"
     # Collapse whitespace
     content = " ".join(content.replace(u"\xa0", " ").strip().split())
     return content

print getPDFContent("Components-of-Dot-NET.pdf").encode("ascii", "ignore")

This is my error.

Warning (from warnings module):
   File "C:\Documents and Settings\Family\Application Data\Python\Python26\site-packages\pyPdf\pdf.py", line 52
     from sets import ImmutableSet
DeprecationWarning: the sets module is deprecated

Traceback (most recent call last):
   File "C:/Python26/Pdfread", line 15, in <module>
     print getPDFContent("Components-of-Dot-NET.pdf").encode("ascii",
   File "C:/Python26/Pdfread", line 6, in getPDFContent
     pdf = pyPdf.PdfFileReader(file(path, "rb"))
IOError: [Errno 2] No such file or directory: 'Components-of-Dot-

You put the file in C:\, but you didn't tell Python where it is. You
gave just the filename "Components-of-Dot-NET.pdf", and it's looking in
the current directory, which probably isn't C:\.
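
To see where Python is actually looking, a quick check along these lines may help. It is a rough sketch: describe_path is a made-up helper name, not something from pyPdf, and it only reports how a relative name would be resolved.

```python
import os

def describe_path(path):
    """Return (absolute_path, is_existing_file) for a possibly relative path."""
    # A relative name is resolved against the current working directory.
    absolute = os.path.abspath(path)
    return absolute, os.path.isfile(absolute)

# The directory that relative filenames are resolved against.
print(os.getcwd())

# Where the bare filename would be looked up, and whether it exists there.
print(describe_path("Components-of-Dot-NET.pdf"))
```

If the second value printed is False, the file is not in the current directory and a full path is needed.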

Try providing the full pathname:

print getPDFContent(r"C:\Components-of-Dot-NET.pdf").encode("ascii", "ignore")
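
The r prefix makes this a raw string, so backslashes are kept as they are. Your particular path would probably survive without it, but other Windows paths may not. A small illustration, using a hypothetical path chosen because it contains \n and \t:

```python
# Hypothetical path, picked only because it contains \n and \t.
plain = "C:\new\table.pdf"   # \n becomes a newline, \t becomes a tab
raw = r"C:\new\table.pdf"    # backslashes are kept literally

# The raw string is two characters longer, because each backslash
# and the letter after it stay as two separate characters.
print(len(plain))
print(len(raw))
print(repr(plain))
print(repr(raw))
```

Using forward slashes ("C:/new/table.pdf") is another way to avoid the problem on Windows.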
