1. The site might be banning you.
2. We deprecated the select() method when we added CSS support; you now have xpath() and css() methods available (http://doc.scrapy.org/en/latest/topics/selectors.html).
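For reference, a minimal sketch of the migration, using one of the XPaths from your spider (the Selector import and the CSS variant are illustrative, not taken from your original code):

from scrapy.selector import Selector

def parse_listing_page(self, response):
    sel = Selector(response)  # replaces HtmlXPathSelector(response)
    # .xpath() is the drop-in replacement for the deprecated .select()
    title = sel.xpath("//h1[@id='share_jobtitle']/text()").extract()
    # the same data via the new .css() method (illustrative CSS form)
    title_css = sel.css("h1#share_jobtitle::text").extract()

As for the varying job counts: if the site is throttling or banning you, slowing the crawl down (for example with DOWNLOAD_DELAY in settings.py) may help.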
On Monday, September 1, 2014 at 13:33:44 UTC-3, james josh wrote:
>
> Question no. 1:
> I am having trouble crawling multiple pages using the next-page link;
> each run crawls a different number of jobs (for example 20 jobs, 45
> jobs, 200 jobs).
>
> Question no. 2:
> Why does this warning appear while debugging, and how do I solve it?
>
> scrapy_demo\spiders\test.py:43: ScrapyDeprecationWarning: Call to
> deprecated function select. Use .xpath() instead.
>
> Please check this.
>
> Thanks
> james
>
> My scrapy code follows:
> -------------------------

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import urlparse
from scrapy.http.request import Request
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.item import Item, Field


class ScrapyDemoSpiderItem(Item):
    link = Field()
    title = Field()
    city = Field()
    salary = Field()
    content = Field()


class ScrapyDemoSpider(BaseSpider):
    name = 'eujobs77'
    allowed_domains = ['eujobs77.com']
    start_urls = ['http://www.eujobs77.com/jobs']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        listings = hxs.select('//div[@class="jobSearchBrowse jobSearchBrowsev1"]')
        links = []
        # scrape the listings page to get the listing links
        for listing in listings:
            link = listing.select('//h2[@class="jobtitle"]/a[@class="blue"]/@href').extract()
            links.extend(link)
        # request each listing url to get the content of the listing page
        for link in links:
            item = ScrapyDemoSpiderItem()
            item['link'] = link
            yield Request(urlparse.urljoin(response.url, link),
                          meta={'item': item}, callback=self.parse_listing_page)

        # get the next-button link
        next_page = None
        if hxs.select('//div[@class="paggingNext"]/a[@class="blue"]/@href').extract():
            next_page = hxs.select('//div[@class="paggingNext"]/a[@class="blue"]/@href').extract()[0]
        if next_page:
            yield Request(urlparse.urljoin(response.url, next_page), self.parse)

    # scrape the listing page to get its content
    def parse_listing_page(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.request.meta['item']
        item['link'] = response.url
        item['title'] = hxs.select("//h1[@id='share_jobtitle']/text()").extract()
        item['city'] = hxs.select("//html/body/div[3]/div[3]/div[2]/div[1]/div[3]/ul/li[1]/div[2]/text()").extract()
        item['salary'] = hxs.select("//html/body/div[3]/div[3]/div[2]/div[1]/div[3]/ul/li[3]/div[2]/text()").extract()
        item['content'] = hxs.select("//div[@class='detailTxt deneL']/text()").extract()
        yield item