Hi Paul,

It worked, thank you very much, but it is not taking the first 5 URLs on that start URL page "http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems"; instead it is crawling 5 random links that start with "http://steamcommunity.com/sharedfiles/filedetails/".
Can we restrict the CrawlSpider to crawl only the first 5 links on the page that start with the above URL?

On Monday, September 29, 2014 4:14:13 AM UTC-7, Paul Tremberth wrote:
>
> Hi,
>
> You can use process_links for this:
>
>     Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)),
>          process_links=lambda l: l[:5],
>          callback='parse_items'),
>
> On Monday, September 29, 2014 9:06:31 AM UTC+2, Chetan Motamarri wrote:
>>
>> Hi,
>>
>> I am new to using CrawlSpider...
>>
>> My problem is, I need to extract the top 5 items' data from this link
>> (http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems).
>> I have done it like this:
>>
>>     start_urls = [
>>         'http://steamcommunity.com/sharedfiles/filedetails/?id=317972390&searchtext='
>>     ]
>>
>> and specified the rules as:
>>
>>     rules = (
>>         Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)),
>>              callback='parse_items'),
>>     )
>>
>> Now it is crawling through all URLs that start with
>> "http://steamcommunity.com/sharedfiles/filedetails" on the start_url page
>> (http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems).
>>
>> My problem is that it should crawl through only the first 5 URLs that start with
>> "http://steamcommunity.com/sharedfiles/filedetails/" on that page.
>> Can we do this with a CrawlSpider restriction or by any other means?
>>
>> My code:
>>
>>     class ScrapePriceSpider(CrawlSpider):
>>
>>         name = 'ScrapeItems'
>>         allowed_domains = ['steamcommunity.com']
>>         start_urls = [
>>             'http://steamcommunity.com/sharedfiles/filedetails/?id=317972390&searchtext='
>>         ]
>>
>>         rules = (
>>             Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)),
>>                  callback='parse_items'),
>>         )
>>
>>         def parse_items(self, response):
>>             hxs = HtmlXPathSelector(response)
>>
>>             item = ExtractitemsItem()
>>
>>             item["Item Name"] = hxs.select("//div[@class='workshopItemTitle']/text()").extract()
>>             item["Unique Visits"] = hxs.select("//table[@class='stats_table']/tr[1]/td[1]/text()").extract()
>>             item["Current Favorites"] = hxs.select("//table[@class='stats_table']/tr[2]/td[1]/text()").extract()
>>             return item
>>

--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scrapy-users+unsubscr...@googlegroups.com.
To post to this group, send email to scrapy-users@googlegroups.com.
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.