Hi guys,
I have just started with Scrapy and am trying to scrape a website that
requires crawling pages at multiple levels.
From this website I need to map each brand to its categories and each
category to its products, i.e. like this:
Nutripe,Dog Products,prod1
Nutripe,Dog Products,prod2
Nutripe,Dog Products,prod3
Nutripe,Cat Products,prod1
Nutripe,Cat Products,prod2
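
To make the target shape concrete, here is a minimal sketch in plain Python (no Scrapy involved) of the rows I am after. The nested dict, the function name flatten_catalog, and the product names are placeholders made up for illustration; they are not data from the site.

```python
def flatten_catalog(catalog):
    """Turn {brand: {category: [products]}} into (brand, category, product) rows."""
    rows = []
    for brand, categories in catalog.items():
        for category, products in categories.items():
            for product in products:
                # One row per product, repeating its brand and category.
                rows.append((brand, category, product))
    return rows


# Hypothetical sample data mirroring the example above; not scraped from the site.
catalog = {
    "Nutripe": {
        "Dog Products": ["prod1", "prod2", "prod3"],
        "Cat Products": ["prod1", "prod2"],
    }
}

rows = flatten_catalog(catalog)
# Each row joined with commas gives one line of the output format shown above.
lines = [",".join(row) for row in rows]
```

In other words, every product should end up in its own row together with the brand and category it was found under.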
However, Scrapy is really driving me crazy. It looks easy, but it gets
really messy. I am not even able to map products to their category.
I am getting something like this:
Nutripe Dog Products,Cat Products
I would really appreciate it if someone could help me understand what I am
doing wrong.

def get_url(self, string):
    """Return complete url"""
    return "http://link2linkco.com/" + string

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    brands = hxs.select("//div[@id='contentFull']/div/p/a/@href")
    # self.item = Link2LinkItem()
    for brand in brands:
        brand_page = brand.extract()
        # print self.complete_url(brand_page)
        yield Request(self.get_url(brand_page),
                      callback=self.parse_brands)

def parse_brands(self, response):
    index = 1
    hxs = HtmlXPathSelector(response)
    item = Link2LinkItem()
    self.brand_name = hxs.select("//*[@id='contentFull']/h1/text()").extract()
    brands = hxs.select("//div[@id='contentFull']/fieldset[2]/div/p/a/@href")
    for brand in brands:
        brand_link = brand.extract()
        self.products_category = hxs.select("//*[@id='contentFull']/fieldset[2]/div/p[2]/a/text()").extract()
        print self.get_url(brand_link)
        # yield Request(self.complete_url(brand_name), callback=self.parse_catatories)
    item['Brand'] = self.brand_name
    item['Products_Category'] = self.products_category
    return item