I've been working on downloading all PDF files from a site in one go using Scrapy. Even though I don't see any errors, the spider isn't downloading any files, and I can't figure out why.
Here's my code.

My spider:

    from scrapy.spider import BaseSpider
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from cs.items import CsItem
    from scrapy.item import Item, Field
    from scrapy.selector import HtmlXPathSelector

    class CsSpider(CrawlSpider):
        name = "cs"
        allowed_domains = ["cs.org"]
        start_urls = [
            "http://cs.org/projects.html"
        ]

        rules = (
            Rule(SgmlLinkExtractor(allow_domains=('http://cs.org/projects.html',)),
                 callback='parse_urls', follow=True),
        )

        def parse_urls(self, response):
            hxs = HtmlXPathSelector(response)
            item = CsItem()
            item['pdf_urls'] = hxs.select('//a/@href')
            pdf_urls = hxs.select('//a/@href').extract()
            for url in pdf_urls:
                yield scrapy.Request(url, callback=self.save_pdf)

        def save_pdf(self, response):
            path = self.get_path(item['url'])
            with open(path, "wb") as f:
                f.write(response.body)

items.py:

    import scrapy
    from scrapy.item import Item, Field

    class CsItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        pdf_urls = Field()
        files = Field()

settings.py:

    BOT_NAME = 'cs'

    SPIDER_MODULES = ['cs.spiders']
    NEWSPIDER_MODULE = 'cs.spiders'

    ITEM_PIPELINES = {
        'scrapy.contrib.pipeline.files.FilesPipeline': 1
    }
    FILES_STORE = '/home/amitoj/Projects/Scrapy/PDFScraper/cs/cs/downloads'

I'd appreciate help finding possible loopholes in the code, as well as other (better) ways of downloading multiple PDF files.

Thanks,
Amitoj
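One thing I noticed while debugging is that the hrefs I extract with `//a/@href` are often relative, so the requests I yield may not be valid absolute URLs. A minimal, framework-free sketch of how I was thinking of normalizing and filtering them before yielding requests (this is a hypothetical helper, not yet part of my spider; `urllib.parse` is Python 3, on Python 2 it would be `urlparse`):

```python
from urllib.parse import urljoin

def extract_pdf_urls(base_url, hrefs):
    """Resolve relative hrefs against the page URL and keep only PDF links."""
    return [urljoin(base_url, h) for h in hrefs if h.lower().endswith(".pdf")]

# Example: relative links get resolved, non-PDF links get dropped.
urls = extract_pdf_urls(
    "http://cs.org/projects.html",
    ["docs/a.pdf", "about.html", "http://cs.org/papers/b.PDF"],
)
```

Would something like this, combined with putting the resulting URLs into the item field that FilesPipeline reads, be the right direction?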