Hi Paul,

Thanks again. It is retrieving only the first 3 items, but I want the first 5. I don't know where it went wrong. Could you please help me? Here is my code:

class ScrapePriceSpider(CrawlSpider):
    name = 'ScrapeItems'
    allowed_domains = ['steamcommunity.com']
    start_urls = ['http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems']

    rules = (
        Rule(LinkExtractor(restrict_xpaths=('(//a[re:test(@href, "^http://steamcommunity.com/sharedfiles/filedetails/")])[position()<6]',)),
             process_links=lambda l: l[:5],
             callback='parse_items'),
    )
    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        item = ExtractitemsItem()
        uniqueVisits = hxs.select("//table[@class='stats_table']/tr[1]/td[1]/text()").extract()
        CurrentFavorites = hxs.select("//table[@class='stats_table']/tr[2]/td[1]/text()").extract()
        itemname = hxs.select("//div[@class='workshopItemTitle']/text()").extract()
        item["Item"] = str(itemname)[3:-2]
        item["UniqueVisits"] = str(uniqueVisits)[3:-2]
        item["CurrentFavorites"] = str(CurrentFavorites)[3:-2]
        return item

On Monday, September 29, 2014 10:33:23 AM UTC-7, Paul Tremberth wrote:
>
> You can try with LinkExtractor and XPath:
>
> LinkExtractor(restrict_xpaths=('(//a[re:test(@href, "^http://steamcommunity.com/sharedfiles/filedetails/")])[position()<6]',))
>
> On Monday, September 29, 2014 7:01:58 PM UTC+2, Chetan Motamarri wrote:
>>
>> Hi Paul,
>>
>> It worked, thank you very much, but it is not taking the first 5 URLs on the
>> start-URL page
>> "http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems";
>> instead it is crawling a random 5 links that start with
>> "http://steamcommunity.com/sharedfiles/filedetails/".
>>
>> Can we restrict CrawlSpider to crawl only the *first 5 links on a page* that
>> start with the above URL?
>>
>> On Monday, September 29, 2014 4:14:13 AM UTC-7, Paul Tremberth wrote:
>>>
>>> Hi,
>>>
>>> You can use process_links for this:
>>>
>>> Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)),
>>>      process_links=lambda l: l[:5],
>>>      callback='parse_items'),
>>>
>>> On Monday, September 29, 2014 9:06:31 AM UTC+2, Chetan Motamarri wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am new to using CrawlSpider.
>>>>
>>>> My problem is, *I need to extract the top 5 items' data from this link* (
>>>> http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems).
>>>> I have done it like this:
>>>>
>>>> start_urls = [
>>>>     'http://steamcommunity.com/sharedfiles/filedetails/?id=317972390&searchtext='
>>>> ]
>>>>
>>>> and specified the rules as:
>>>>
>>>> rules = (
>>>>     Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)),
>>>>          callback='parse_items'),
>>>> )
>>>>
>>>> Now it is crawling through all URLs that start with
>>>> "http://steamcommunity.com/sharedfiles/filedetails" on the start_url page.
>>>>
>>>> My problem is that it should crawl through only the first 5 URLs that start
>>>> with "http://steamcommunity.com/sharedfiles/filedetails/" on the start_url page.
>>>>
>>>> Can we do this with a CrawlSpider restriction or by any other means?
>>>>
>>>> *My code:*
>>>>
>>>> class ScrapePriceSpider(CrawlSpider):
>>>>
>>>>     name = 'ScrapeItems'
>>>>     allowed_domains = ['steamcommunity.com']
>>>>     start_urls = [
>>>>         'http://steamcommunity.com/sharedfiles/filedetails/?id=317972390&searchtext='
>>>>     ]
>>>>
>>>>     rules = (
>>>>         Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)),
>>>>              callback='parse_items'),
>>>>     )
>>>>
>>>>     def parse_items(self, response):
>>>>         hxs = HtmlXPathSelector(response)
>>>>         item = ExtractitemsItem()
>>>>         item["Item Name"] = hxs.select("//div[@class='workshopItemTitle']/text()").extract()
>>>>         item["Unique Visits"] = hxs.select("//table[@class='stats_table']/tr[1]/td[1]/text()").extract()
>>>>         item["Current Favorites"] = hxs.select("//table[@class='stats_table']/tr[2]/td[1]/text()").extract()
>>>>         return item
>>>
-- 
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scrapy-users+unsubscr...@googlegroups.com.
To post to this group, send email to scrapy-users@googlegroups.com.
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.
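[Editor's note] A plausible explanation for "3 items instead of 5": on the workshop browse page each entry typically exposes more than one anchor pointing at the same filedetails URL (thumbnail and title), and LinkExtractor deduplicates extracted links by URL. So restricting to the first six anchors with `[position()<6]` can yield only ~3 unique links. The likely fix is to drop the `position()` predicate and let `process_links=lambda l: l[:5]` take the first 5 already-deduplicated links. A minimal, self-contained sketch of that dedup-then-slice logic in plain Python (`first_n_unique` is a hypothetical helper mimicking LinkExtractor's URL deduplication, not a Scrapy API):

```python
def first_n_unique(urls, n=5):
    """Keep the first n distinct URLs in page order
    (hypothetical helper illustrating LinkExtractor's dedup behaviour)."""
    seen, out = set(), []
    for u in urls:
        if u not in seen:
            seen.add(u)
            out.append(u)
        if len(out) == n:
            break
    return out

# Six anchors from the browse page: thumbnail + title per item -> only 3 items.
anchors = [
    'http://steamcommunity.com/sharedfiles/filedetails/?id=1',
    'http://steamcommunity.com/sharedfiles/filedetails/?id=1',
    'http://steamcommunity.com/sharedfiles/filedetails/?id=2',
    'http://steamcommunity.com/sharedfiles/filedetails/?id=2',
    'http://steamcommunity.com/sharedfiles/filedetails/?id=3',
    'http://steamcommunity.com/sharedfiles/filedetails/?id=3',
]
print(len(first_n_unique(anchors)))  # 3 -- position()<6 over anchors != 5 unique items
```

In the spider this would correspond to keeping `allow=(r'^http://steamcommunity\.com/sharedfiles/filedetails/',)` (or the restrict_xpaths expression without `[position()<6]`) and relying solely on `process_links=lambda l: l[:5]` to cap the count, since the list passed to `process_links` has already been deduplicated.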