That's weird,
I get nearly 2000 items running your spider (with a custom ItemLoader/Item):

https://gist.github.com/redapple/aa274c729ee912de46ce


On Saturday, February 28, 2015 at 8:10:36 PM UTC+1, JEBI93 wrote:
>
> Here's the full script: http://pastebin.com/13eNky9W. After I changed 
> from parse to parse_page I don't get anything scraped.
>
> On Saturday, February 28, 2015 at 16:51:10 UTC+1, Paul Tremberth wrote:
>>
>> Hi,
>>
>> CrawlSpider and a custom parse() method do not play well together. See 
>> the warning a bit below 
>> http://doc.scrapy.org/en/latest/topics/spiders.html#crawling-rules
>> It's easy to miss.
>>
>> Try renaming your parse() method to something like parse_page(), and 
>> reference this new callback name in your rule.
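>>
>> For instance, a minimal sketch of what I mean (untested here; keep 
>> whatever extraction logic you already have, just move it into the 
>> renamed method):
>>
>> from scrapy.contrib.spiders import CrawlSpider, Rule
>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>
>> class ItemspiderSpider(CrawlSpider):
>>     name = "itemspider"
>>     allowed_domains = ["openstacksummitnovember2014paris.sched.org"]
>>     start_urls = [
>>         'http://openstacksummitnovember2014paris.sched.org/directory/attendees/']
>>
>>     rules = (
>>         # callback is 'parse_page', not 'parse', so CrawlSpider's
>>         # built-in parse() is left alone to drive the rules
>>         Rule(SgmlLinkExtractor(allow=r'/directory/attendees/\d+'),
>>              callback='parse_page', follow=True),
>>     )
>>
>>     def parse_page(self, response):
>>         # the extraction code from your old parse() goes here
>>         pass
>>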
>>  On Feb 28, 2015 at 16:17, "JEBI93" <[email protected]> wrote:
>>
>>> Hey guys, I have a small problem when trying to crawl 10+ pages. Here's 
>>> the code:
>>>
>>> from scrapy.contrib.spiders import CrawlSpider, Rule
>>> from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
>>>
>>> class ItemspiderSpider(CrawlSpider):
>>>     name = "itemspider"
>>>     allowed_domains = ["openstacksummitnovember2014paris.sched.org"]
>>>     start_urls = [
>>>         'http://openstacksummitnovember2014paris.sched.org/directory/attendees/']
>>>
>>>     rules = (
>>>         Rule(SgmlLinkExtractor(allow=r'/directory/attendees/\d+'),
>>>              callback='parse', follow=True),
>>>     )
>>>
>>> The problem is that when I run this code I only get results from the first 
>>> page, not the others. I tried modifying start_urls to something like this 
>>> and it worked fine:
>>>
>>> start_urls = [
>>>     'http://openstacksummitnovember2014paris.sched.org/directory/attendees/1',
>>>     'http://openstacksummitnovember2014paris.sched.org/directory/attendees/2',
>>>     'http://openstacksummitnovember2014paris.sched.org/directory/attendees/3',
>>>     'http://openstacksummitnovember2014paris.sched.org/directory/attendees/4',
>>>     # etc.
>>> ]
>>>
>>> I'm guessing I messed up the allow part; my regex is probably not right.
>>>
