Hi everyone,
I faced the same situation (logging in and then crawling with CrawlSpider) 
and solved it by overriding parse_start_url() (all the code below should be 
placed in your spider file):

# Here you set the start page, where your spider can log in
start_urls = ["http://forums.website.com"]

# Here you override the function to log in to the website, setting the
# callback function to check that everything is OK after the login
def parse_start_url(self, response):
    return [FormRequest.from_response(response,
                formdata={'login': 'myUsername', 'password': 'myPassword'},
                callback=self.after_login)]

# Here you do the after-login check; if everything is OK, you build a
# request object with the real start page from which your spider can
# start to crawl and parse
def after_login(self, response):
    if "Incorrect login or password" in response.body:
        self.log("### Login failed ###", level=log.ERROR)
        return
    else:
        self.log("### Successfully logged in! ###")
        lnk = 'http://website.com/realstartpage.php'
        request = Request(lnk)
        return request

To make it work, don't forget to import the request classes (and the log 
module, used for the error level above) at the beginning of your spider file:

from scrapy import log
from scrapy.http import Request, FormRequest

Hope this helps someone.



On Wednesday, July 17, 2013, 3:12:07 UTC+4, Capi Etheriel wrote:
>
> it's documented in 0.17: 
> http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.contrib.spiders.CrawlSpider.parse_start_url
>  
>
> On Thursday, July 11, 2013, 17:02:31 UTC-3, Paul Tremberth wrote:
>>
>> Hi
>> CrawlSpider has an overridable method parse_start_url() that could be 
>> used in your case (I think)
>>
>> http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.contrib.spiders.CrawlSpider.parse_start_url
>>
>> It's not mentioned in the docs for 0.16 (the links you provided) but 
>> it's in the code for 0.16 and 0.17
>> https://github.com/scrapy/scrapy/blob/0.16/scrapy/contrib/spiders/crawl.py
>>
>> It's called in CrawlSpider's parse() method when the first URL is 
>> fetched and processed (in particular the start_urls you will define for 
>> your LoginSpider).
>>
>> So I would try and define parse_start_url() just as the LoginSpider 
>> example
>>
>>     def parse_start_url(self, response):
>>         return [FormRequest.from_response(response,
>>                     formdata={'username': 'john', 'password': 'secret'},
>>                     callback=self.after_login)]
>>
>>
>> *Note: as another user in the group recently had issues with this 
>> parse_start_url() method being called several times,*
>> *be sure to define a callback that is NOT parse() for your Rules()*
>>
>> Tell us how it goes.
>>
>> Paul.
>>
>> On Thursday, July 11, 2013 7:48:57 PM UTC+2, Fer wrote:
>>>
>>> Hi everyone!
>>> I'm trying to mix the LoginSpider example 
>>> (http://doc.scrapy.org/en/0.16/topics/request-response.html#topics-request-response-ref-request-userlogin) 
>>> with CrawlSpider 
>>> (http://doc.scrapy.org/en/0.16/topics/spiders.html#crawlspider-example), 
>>> but I cannot find a way. The idea is to first log in and then parse 
>>> using the rules, but the LoginSpider example modifies the parse method, 
>>> and the CrawlSpider docs say "if you override the parse method, the 
>>> crawl spider will no longer work". I would be grateful if you could 
>>> help me.
>>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"scrapy-users" group.
