Is the spider actually crawling the pdf pages? Can you see the 200 responses in the log?
How are you accessing the item here?

    path = self.get_path(item['url'])

`item` isn't defined in that method, and the field name is wrong. If that is the whole code you are trying to run, read it more carefully, because it is full of errors not related to Scrapy.

On Monday, September 1, 2014 at 15:21:47 UTC-3, Amitoj wrote:
>
> I've been working on downloading all pdf files in one go using Scrapy.
>
> I can't understand why, even though I don't see an error in the code, it's
> not downloading any files.
>
> Here's my code.
>
> My spider:
>
> from scrapy.spider import BaseSpider
> from scrapy.contrib.spiders import CrawlSpider, Rule
> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
> from cs.items import CsItem
> from scrapy.item import Item, Field
> from scrapy.selector import HtmlXPathSelector
>
> class CsSpider(CrawlSpider):
>     name = "cs"
>     allowed_domains = ["cs.org"]
>     start_urls = [
>         "http://cs.org/projects.html"
>     ]
>     rules = (Rule(SgmlLinkExtractor(allow_domains=('http://cs.org/projects.html', )), callback='parse_urls', follow=True),)
>
>     def parse_urls(self, response):
>         hxs=HtmlXPathSelector(response)
>         item=CsItem()
>         item['pdf_urls']=hxs.select('//a/@href')
>         pdf_urls=hxs.select('//a/@href').extract()
>         for url in pdf_urls:
>             yield scrapy.Request(url, callback=self.save_pdf)
>
>     def save_pdf(self, response):
>         path=self.get_path(item['url'])
>         with open(path, "wb") as f:
>             f.write(response.body)
>
> items.py:
>
> import scrapy
> from scrapy.item import Item, Field
>
> class CsItem(scrapy.Item):
>     # define the fields for your item here like:
>     # name = scrapy.Field()
>     pdf_urls=Field()
>     files=Field()
>
> settings.py:
>
> BOT_NAME = 'cs'
>
> SPIDER_MODULES = ['cs.spiders']
> NEWSPIDER_MODULE = 'cs.spiders'
> ITEM_PIPELINES = {
>     'scrapy.contrib.pipeline.files.FilesPipeline': 1
> }
> FILES_STORE = '/home/amitoj/Projects/Scrapy/PDFScraper/cs/cs/downloads'
>
> I'd appreciate help in finding out possible loopholes in the code and other
> (better) ways of downloading multiple pdf files.
>
> Thanks,
> Amitoj