I've been working on downloading all the PDF files from a site in one go using Scrapy.

I can't understand why it isn't downloading any files, even though I don't
see an error in the code.

Here's my code.

My spider:  

import scrapy  # needed for scrapy.Request below
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from cs.items import CsItem

class CsSpider(CrawlSpider):
    name = "cs"
    allowed_domains = ["cs.org"]
    start_urls = [
        "http://cs.org/projects.html",
    ]
    # allow_domains takes domain names, not full URLs
    rules = (
        Rule(SgmlLinkExtractor(allow_domains=('cs.org',)),
             callback='parse_urls', follow=True),
    )

    def parse_urls(self, response):
        hxs = HtmlXPathSelector(response)
        item = CsItem()
        # extract() returns strings; without it you get selector objects
        item['pdf_urls'] = hxs.select('//a/@href').extract()
        for url in item['pdf_urls']:
            # note: hrefs may be relative and would need joining
            # against response.url before being requested
            yield scrapy.Request(url, callback=self.save_pdf)

    def save_pdf(self, response):
        # the item isn't in scope here; the response itself knows its URL
        path = self.get_path(response.url)
        with open(path, "wb") as f:
            f.write(response.body)
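
Since settings.py already enables FilesPipeline, I suspect the idiomatic fix
is to stop writing files by hand and instead yield an item carrying the URLs,
letting the pipeline download them. Here's a minimal sketch of what I mean
(the '//a[contains(@href, ".pdf")]' XPath is my guess at how to pick out PDF
links, not taken from the actual page; this also relies on the settings tweak
shown further down to wire up the pdf_urls field):

from urlparse import urljoin

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from cs.items import CsItem

class CsFilesSpider(CrawlSpider):
    name = "cs_files"
    allowed_domains = ["cs.org"]
    start_urls = ["http://cs.org/projects.html"]
    rules = (
        Rule(SgmlLinkExtractor(allow_domains=('cs.org',)),
             callback='parse_urls', follow=True),
    )

    def parse_urls(self, response):
        hxs = HtmlXPathSelector(response)
        # hrefs are often relative; FilesPipeline needs absolute URLs
        urls = [urljoin(response.url, u) for u in
                hxs.select('//a[contains(@href, ".pdf")]/@href').extract()]
        item = CsItem()
        item['pdf_urls'] = urls
        yield item  # FilesPipeline fetches every URL in the configured field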



items.py:

import scrapy
from scrapy.item import Field

class CsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pdf_urls = Field()
    files = Field()
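
One thing I noticed while re-reading the docs: FilesPipeline looks for fields
named file_urls and files by default, so an item that only exposes pdf_urls
is silently ignored by the pipeline. The simplest fix may be to use the
default names, something like:

import scrapy

class CsItem(scrapy.Item):
    file_urls = scrapy.Field()  # URLs for FilesPipeline to download
    files = scrapy.Field()      # populated by the pipeline with results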



settings.py:

BOT_NAME = 'cs'

SPIDER_MODULES = ['cs.spiders']
NEWSPIDER_MODULE = 'cs.spiders'
ITEM_PIPELINES = {
    'scrapy.contrib.pipeline.files.FilesPipeline':1
}
FILES_STORE = '/home/amitoj/Projects/Scrapy/PDFScraper/cs/cs/downloads'
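
Alternatively, if I read the docs right, newer Scrapy versions let you keep
the pdf_urls name and point the pipeline at it from settings.py (on older
versions, renaming the item field to file_urls as above is probably safer):

FILES_URLS_FIELD = 'pdf_urls'   # which item field holds the URLs to fetch
FILES_RESULT_FIELD = 'files'    # where the pipeline records results (the default)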




I'd appreciate help in spotting possible flaws in the code, as well as
pointers to better ways of downloading multiple PDF files.



Thanks,

Amitoj
