Hi Paul,

Thanks again. It is retrieving only the first 3 items, but I want the first 5. I don't know where it went wrong. Could you please help me? Here is my code:

class ScrapePriceSpider(CrawlSpider):
    name = 'ScrapeItems'
    allowed_domains = ['steamcommunity.com']
    start_urls = ['http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems']

    rules = (
        Rule(LinkExtractor(restrict_xpaths=('(//a[re:test(@href, "^http://steamcommunity.com/sharedfiles/filedetails/")])[position()<6]',)),
             process_links=lambda l: l[:5],
             callback='parse_items'),
    )
    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        item = ExtractitemsItem()
        uniqueVisits = hxs.select("//table[@class='stats_table']/tr[1]/td[1]/text()").extract()
        CurrentFavorites = hxs.select("//table[@class='stats_table']/tr[2]/td[1]/text()").extract()
        itemname = hxs.select("//div[@class='workshopItemTitle']/text()").extract()
        item["Item"] = str(itemname)[3:-2]
        item["UniqueVisits"] = str(uniqueVisits)[3:-2]
        item["CurrentFavorites"] = str(CurrentFavorites)[3:-2]
        return item

On Monday, September 29, 2014 10:33:23 AM UTC-7, Paul Tremberth wrote:
>
> You can try with LinkExtractor and XPath:
>
> LinkExtractor(restrict_xpaths=('(//a[re:test(@href, "^http://steamcommunity.com/sharedfiles/filedetails/")])[position()<6]',))
>
> On Monday, September 29, 2014 7:01:58 PM UTC+2, Chetan Motamarri wrote:
>>
>> Hi Paul,
>>
>> It worked, thank you very much, but it is not taking the first 5 URLs on the
>> start-URL page
>> "http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems";
>> instead it is crawling a random 5 links that start with
>> "http://steamcommunity.com/sharedfiles/filedetails/".
>>
>> Can we restrict CrawlSpider to crawl only the *first 5 links on a page* that
>> start with the above URL?
>>
>> On Monday, September 29, 2014 4:14:13 AM UTC-7, Paul Tremberth wrote:
>>>
>>> Hi,
>>>
>>> You can use process_links for this:
>>>
>>> Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)),
>>>      process_links=lambda l: l[:5],
>>>      callback='parse_items'),
>>>
>>> On Monday, September 29, 2014 9:06:31 AM UTC+2, Chetan Motamarri wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am new to using CrawlSpider.
>>>>
>>>> My problem is, *I need to extract the top 5 items' data from this link* (
>>>> http://steamcommunity.com/workshop/browse/?appid=570&section=mtxitems).
>>>> I have done it like this:
>>>>
>>>> start_urls = [
>>>>     'http://steamcommunity.com/sharedfiles/filedetails/?id=317972390&searchtext='
>>>> ]
>>>>
>>>> and specified the rules as:
>>>>
>>>> rules = (
>>>>     Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)),
>>>>          callback='parse_items'),
>>>> )
>>>>
>>>> Now it is crawling through all URLs that start with
>>>> "http://steamcommunity.com/sharedfiles/filedetails" on the start_url page.
>>>>
>>>> My problem is that it should crawl through only the first 5 URLs that start
>>>> with "http://steamcommunity.com/sharedfiles/filedetails/" on the start_url page.
>>>>
>>>> Can we do this with a CrawlSpider restriction or by any other means?
>>>>
>>>> *My code:*
>>>>
>>>> class ScrapePriceSpider(CrawlSpider):
>>>>
>>>>     name = 'ScrapeItems'
>>>>     allowed_domains = ['steamcommunity.com']
>>>>     start_urls = [
>>>>         'http://steamcommunity.com/sharedfiles/filedetails/?id=317972390&searchtext='
>>>>     ]
>>>>
>>>>     rules = (
>>>>         Rule(SgmlLinkExtractor(allow=("http://steamcommunity.com/sharedfiles/filedetails/",)),
>>>>              callback='parse_items'),
>>>>     )
>>>>
>>>>     def parse_items(self, response):
>>>>         hxs = HtmlXPathSelector(response)
>>>>         item = ExtractitemsItem()
>>>>         item["Item Name"] = hxs.select("//div[@class='workshopItemTitle']/text()").extract()
>>>>         item["Unique Visits"] = hxs.select("//table[@class='stats_table']/tr[1]/td[1]/text()").extract()
>>>>         item["Current Favorites"] = hxs.select("//table[@class='stats_table']/tr[2]/td[1]/text()").extract()
>>>>         return item
>>>
-- 
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to scrapy-users+unsubscr...@googlegroups.com.
To post to this group, send email to scrapy-users@googlegroups.com.
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.
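[Editor's note] A plausible explanation for "3 items instead of 5": on the workshop browse page each entry typically exposes more than one anchor pointing at the same filedetails URL (thumbnail and title), and LinkExtractor deduplicates extracted links by URL. So restricting to the first six anchors with `[position()<6]` can yield only ~3 unique links. The likely fix is to drop the `position()` predicate and let `process_links=lambda l: l[:5]` take the first 5 already-deduplicated links. A minimal, self-contained sketch of that dedup-then-slice logic in plain Python (`first_n_unique` is a hypothetical helper mimicking LinkExtractor's URL deduplication, not a Scrapy API):

```python
def first_n_unique(urls, n=5):
    """Keep the first n distinct URLs in page order
    (hypothetical helper illustrating LinkExtractor's dedup behaviour)."""
    seen, out = set(), []
    for u in urls:
        if u not in seen:
            seen.add(u)
            out.append(u)
        if len(out) == n:
            break
    return out

# Six anchors from the browse page: thumbnail + title per item -> only 3 items.
anchors = [
    'http://steamcommunity.com/sharedfiles/filedetails/?id=1',
    'http://steamcommunity.com/sharedfiles/filedetails/?id=1',
    'http://steamcommunity.com/sharedfiles/filedetails/?id=2',
    'http://steamcommunity.com/sharedfiles/filedetails/?id=2',
    'http://steamcommunity.com/sharedfiles/filedetails/?id=3',
    'http://steamcommunity.com/sharedfiles/filedetails/?id=3',
]
print(len(first_n_unique(anchors)))  # 3 -- position()<6 over anchors != 5 unique items
```

In the spider this would correspond to keeping `allow=(r'^http://steamcommunity\.com/sharedfiles/filedetails/',)` (or the restrict_xpaths expression without `[position()<6]`) and relying solely on `process_links=lambda l: l[:5]` to cap the count, since the list passed to `process_links` has already been deduplicated.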