Re: scrapy multilevel web scrapping

Nicolás Alejandro Ramírez Quiros Mon, 18 Aug 2014 11:23:26 -0700

Your problem is that you are using spider attributes to store the data. Spider 
attributes are shared by every callback, so each brand page overwrites the 
values set by the previous one. If you check the Request documentation 
<http://doc.scrapy.org/en/latest/topics/request-response.html#request-objects>, 
you will notice that there is a meta attribute for passing information from a 
request to its own callback.
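
To illustrate why shared attributes go wrong, here is a toy simulation in plain 
Python. It is not Scrapy: the SITE data, the crawl helper and both classes are 
made up for this sketch, and the FIFO queue only stands in for the real 
scheduler. In this toy run both brand pages are parsed before any category 
page is handled, so the shared attribute holds only the last brand.

```python
from collections import deque

# Made-up sample data: brand -> categories.
SITE = {"Nutripe": ["Dog Products", "Cat Products"],
        "OtherBrand": ["Bird Products"]}


def crawl(start_requests):
    """Run (callback, kwargs) pairs in FIFO order, like a tiny scheduler."""
    queue = deque(start_requests)
    rows = []
    while queue:
        callback, kwargs = queue.popleft()
        for result in callback(**kwargs):
            if isinstance(result, tuple):   # a follow-up "request"
                queue.append(result)
            else:                           # a scraped row
                rows.append(result)
    return rows


class SharedState:
    """Keeps the brand on the instance, like self.brand_name in the question."""

    def parse_brand(self, brand):
        self.brand = brand  # overwritten when the next brand page is parsed
        for category in SITE[brand]:
            yield (self.parse_category, {"category": category})

    def parse_category(self, category):
        yield {"brand": self.brand, "category": category}


class PerRequestState:
    """Sends the brand along with each follow-up request instead."""

    def parse_brand(self, brand):
        for category in SITE[brand]:
            yield (self.parse_category, {"brand": brand, "category": category})

    def parse_category(self, brand, category):
        yield {"brand": brand, "category": category}


def run(spider):
    return crawl([(spider.parse_brand, {"brand": b}) for b in SITE])


shared_rows = run(SharedState())
per_request_rows = run(PerRequestState())
```

With the shared attribute every row ends up under the last brand parsed, while 
the per-request version keeps each category under its own brand.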


I usually do something like this: 
https://gist.github.com/nramirezuy/1bf4a6d635d98a1e4df0

On Monday, 11 August 2014 13:19:00 UTC-3, Gaurang shah wrote:
>
> Hi Guys, 
>
> I have just started with Scrapy and am trying to scrape a website that 
> requires crawling pages at multiple levels.
>
> From the following website I need to map each brand to its categories and 
> each category to its products, i.e. like this: 
> Nutripe,Dog Products,prod1
> Nutripe,Dog Products,prod2
> Nutripe,Dog Products,prod3
> Nutripe,Cat Products,prod1
> Nutripe,Cat Products,prod2
> However, Scrapy is really pissing me off. It looks easy, but it's really 
> messy. I am not even able to map products to categories. 
>
> I am getting something like this: 
>
> Nutripe     Dog Products,Cat Products
>
> I would really appreciate it if someone would help me understand what I 
> am doing wrong. 
>
>
>
>
>     def get_url(self,string):
>         """Return complete url"""
>         return "http://link2linkco.com/" + string
>
>
>     def parse(self, response):
>         hxs = HtmlXPathSelector(response)
>         brands = hxs.select("//div[@id='contentFull']/div/p/a/@href")
>         # self.item = Link2LinkItem()
>         for brand in brands:
>             brand_page = brand.extract()
>             # print self.complete_url(brand_page)
>             yield Request(self.get_url(brand_page),
>                           callback=self.parse_brands)
>
>
>
>     def parse_brands(self, response):
>
>         index = 1
>         hxs = HtmlXPathSelector(response)
>         item = Link2LinkItem()
>         self.brand_name = hxs.select(
>             "//*[@id='contentFull']/h1/text()").extract()
>         brands = hxs.select(
>             "//div[@id='contentFull']/fieldset[2]/div/p/a/@href")
>         for brand in brands:
>
>             brand_link = brand.extract()
>
>             self.products_category = hxs.select(
>                 "//*[@id='contentFull']/fieldset[2]/div/p[2]/a/text()").extract()
>             print self.get_url(brand_link)
>             # yield Request(self.complete_url(brand_name),
>             #               callback=self.parse_catatories)
>             item['Brand'] = self.brand_name
>             item['Products_Category'] = self.products_category
>         return item
>
>
>
