Hi everyone, I faced the same situation (logging in and then crawling with CrawlSpider) and solved it with an overridden parse_start_url (all the code below should be placed in your spider file):
Here you set the start page, where your spider can log in:

    start_urls = ["http://forums.website.com"]

Here you override the method to log in on the website, setting a callback to check that everything went OK after the login:

    def parse_start_url(self, response):
        return [FormRequest.from_response(response,
                    formdata={'login': 'myUsername', 'password': 'myPassword'},
                    callback=self.after_login)]

Here you do the after-login check, and if it succeeded, build a Request object for the real start page from which the spider should crawl and parse:

    def after_login(self, response):
        if "Incorrect login or password" in response.body:
            self.log("### Login failed ###", level=log.ERROR)
            exit()
        else:
            self.log("### Successfully logged in! ###")
            lnk = 'http://website.com/realstartpage.php'
            request = Request(lnk)
            return request

To make it work, don't forget to import the request classes at the beginning of your spider file:

    from scrapy.http import Request, FormRequest

Hope it helps someone.

On Wednesday, July 17, 2013 3:12:07 AM UTC+4, Capi Etheriel wrote:
>
> it's documented in 0.17:
> http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.contrib.spiders.CrawlSpider.parse_start_url
>
> On Thursday, July 11, 2013 5:02:31 PM UTC-3, Paul Tremberth wrote:
>>
>> Hi
>> CrawlSpider has an overridable method parse_start_url() that could be
>> used in your case (I think):
>> http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.contrib.spiders.CrawlSpider.parse_start_url
>>
>> It's not mentioned in the docs for 0.16 (the links you provided) but
>> it's in the code for 0.16 and 0.17:
>> https://github.com/scrapy/scrapy/blob/0.16/scrapy/contrib/spiders/crawl.py
>>
>> It's called in CrawlSpider's parse() method, when the first URL is
>> fetched and processed (in particular the start_urls you define for your
>> LoginSpider).
>>
>> So I would try defining parse_start_url() just as in the LoginSpider
>> example:
>>
>>     def parse_start_url(self, response):
>>         return [FormRequest.from_response(response,
>>                     formdata={'username': 'john', 'password': 'secret'},
>>                     callback=self.after_login)]
>>
>> *Note: another user in the group recently had issues with this
>> parse_start_url() method being called several times,*
>> *so be sure to define a callback that is NOT parse() for your Rules()*
>>
>> Tell us how it goes.
>>
>> Paul.
>>
>> On Thursday, July 11, 2013 7:48:57 PM UTC+2, Fer wrote:
>>>
>>> Hi everyone!
>>> I'm trying to mix the LoginSpider example
>>> (http://doc.scrapy.org/en/0.16/topics/request-response.html#topics-request-response-ref-request-userlogin)
>>> with CrawlSpider
>>> (http://doc.scrapy.org/en/0.16/topics/spiders.html#crawlspider-example),
>>> but I can't find a way to do it. The idea is to log in first and then
>>> parse using the rules, but the LoginSpider example overrides the parse
>>> method, and the CrawlSpider docs say "if you override the parse method,
>>> the crawl spider will no longer work". I would be grateful if you could
>>> help me.
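One follow-up note for anyone landing on this thread with a newer setup than the Scrapy 0.16/0.17 discussed above (this is an assumption beyond the original thread): on Python 3, `response.body` is `bytes`, so the `"Incorrect login or password" in response.body` check from the answer raises a `TypeError` unless you compare bytes with bytes (or decode the body first). A minimal sketch of the pitfall, without Scrapy itself:

```python
# Simulating the after_login check: on Python 3 a response body is bytes,
# and a str-in-bytes membership test raises TypeError.
body = b"<html>Incorrect login or password</html>"  # stand-in for response.body

try:
    "Incorrect login or password" in body  # str needle, bytes haystack
except TypeError:
    print("str-in-bytes raises TypeError")

# Compare bytes with bytes instead:
print(b"Incorrect login or password" in body)  # True

# Or decode the body first and compare strings:
print("Incorrect login or password" in body.decode("utf-8"))  # True
```

Either of the last two forms keeps the login-failure check from the answer working on Python 3.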