Updated the SO answer with a functional example. Cheers.

On Friday, June 17, 2016 at 5:53:44 AM UTC+1, Jeremy D wrote:

Does seem to work. Using deferToThread I run into the same problem, where it doesn't get into the parse() method until the program closes. I'm open to other ideas for how to organically get a URL for Scrapy to crawl that isn't through a message queue, though this seems to be the most sensible option, if I can get it to work.

This is pretty messy, but here's what I have (I've never used deferToThread, or much threading in general for that matter, so I may be doing this wrong).

Full pastebin here (exactly what I have, minus AWS creds): http://pastebin.com/4cebXyTc

    def start_requests(self):
        self.logger.error("STARTING QUEUE")
        while True:
            queue = deferToThread(self.queue)
            self.logger.error(self.cpuz_url)
            if self.cpuz_url is None:
                time.sleep(10)
                continue
            yield Request(self.cpuz_url, self.parse)

I've then changed my queue() function to have a try/except after it gets the message:

    try:
        message = message[0]
        message_body = message.get_body()
        self.logger.error(message_body)
        message_body = str(message_body).split(',')
        message.delete()
        self.cpuz_url = message_body[0]
        self.uid = message_body[1]
    except:
        self.logger.error(message)
        self.logger.error(self.cpuz_url)
        self.cpuz_url = None

On Thu, Jun 16, 2016 at 8:23 PM, Neverlast N <never...@hotmail.com> wrote:

Thanks for bringing this up. I answered on SO. As a methodology, I would say: try to make the simplest working thing possible and then build up towards the more complex code you have. See at which point it breaks. Is it when you add an API call? Is it when you return something? What I did was to replace your queue() with this, and it seems to work:

    def queue(self):
        return 'http://www.example.com/?{}'.format(random.randint(0, 100000))

What can we infer from this?

------------------------------
From: jdavi...@gmail.com
Date: Thu, 16 Jun 2016 13:43:28 -0400
Subject: Trying to read from message queue, not parsing response in make_requests_from_url loop
To: scrapy...@googlegroups.com

I have this question on SO, but no answers unfortunately. Figured I'd try my luck here.

https://stackoverflow.com/questions/37770678/scrapy-not-parsing-response-in-make-requests-from-url-loop

I'm trying to get Scrapy to grab a URL from a message queue and then scrape that URL. I have the loop going just fine and grabbing the URL from the queue, but it never enters the parse() method once it has a URL; it just continues to loop (and sometimes the URL comes back around even though I've deleted it from the queue...).

While it's running in the terminal, if I Ctrl+C and force it to end, it enters the parse() method and crawls the page, then ends. I'm not sure what's wrong here. Scrapy needs to be running at all times to catch a URL as it enters the queue. Does anyone have ideas, or has anyone done something like this?

    class my_Spider(Spider):
        name = "my_spider"
        allowed_domains = ['domain.com']

        def __init__(self):
            super(my_Spider, self).__init__()
            self.url = None

        def start_requests(self):
            while True:
                # Crawl the url from queue
                yield self.make_requests_from_url(self._pop_queue())

        def _pop_queue(self):
            # Grab the url from queue
            return self.queue()

        def queue(self):
            url = None
            while url is None:
                conf = {
                    "sqs-access-key": "",
                    "sqs-secret-key": "",
                    "sqs-queue-name": "crawler",
                    "sqs-region": "us-east-1",
                    "sqs-path": "sqssend"
                }
                # Connect to AWS
                conn = boto.sqs.connect_to_region(
                    conf.get('sqs-region'),
                    aws_access_key_id=conf.get('sqs-access-key'),
                    aws_secret_access_key=conf.get('sqs-secret-key')
                )
                q = conn.get_queue(conf.get('sqs-queue-name'))
                message = conn.receive_message(q)
                # Didn't get a message back, wait.
                if not message:
                    time.sleep(10)
                    url = None
                else:
                    url = message
            if url is not None:
                message = url[0]
                message_body = str(message.get_body())
                message.delete()
                self.url = message_body
                return self.url

        def parse(self, response):
            ...
            yield item
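
For reference, the "simplest working thing" suggested in the reply above can be turned into a complete, runnable spider by swapping the SQS call for a stub. This is only an illustrative sketch (the class name StubQueueSpider and its layout are made up for the example, not taken from the pastebin). The point it demonstrates: the while True / yield loop itself is fine as long as nothing inside it blocks, because Scrapy pulls requests from start_requests lazily.

    import random

    from scrapy import Request, Spider


    class StubQueueSpider(Spider):
        """Minimal spider that 'polls' a fake queue without ever blocking."""
        name = "stub_queue_spider"

        def start_requests(self):
            while True:
                # The stub returns immediately, so the engine keeps pulling
                # requests from this generator and parse() gets called.
                yield Request(self.queue(), callback=self.parse)

        def queue(self):
            # Stand-in for the SQS receive; never sleeps, never blocks.
            return 'http://www.example.com/?{}'.format(random.randint(0, 100000))

        def parse(self, response):
            self.logger.info("Parsed %s", response.url)

If this version reaches parse() but the SQS version does not, the difference points at the blocking calls (time.sleep() and the synchronous receive_message()) rather than at the loop structure itself.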
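
On the deferToThread attempt: deferToThread() returns a Deferred, so its result has to be consumed through a callback rather than read back from a shared attribute, and the time.sleep() still blocks the Twisted reactor inside start_requests, which matches the symptom of parse() only firing once the process shuts down. Below is a rough sketch of a non-blocking shape for this, not the actual SO answer: the class and helper names (QueuePollingSpider, _schedule_poll, _receive_from_queue, _on_message) are placeholders, and the two-argument self.crawler.engine.crawl(request, spider) call reflects the Scrapy API current at the time of this thread.

    from twisted.internet import reactor
    from twisted.internet.threads import deferToThread

    from scrapy import Request, Spider, signals
    from scrapy.exceptions import DontCloseSpider


    class QueuePollingSpider(Spider):
        """Sketch: poll a queue in a worker thread and feed requests to the engine."""
        name = "queue_polling_spider"

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super(QueuePollingSpider, cls).from_crawler(crawler, *args, **kwargs)
            # Keep the spider alive while it waits for queue messages.
            crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
            return spider

        def start_requests(self):
            # Don't loop or sleep here; arm the first poll and return.
            self._schedule_poll()
            return []

        def spider_idle(self):
            # Without this, Scrapy would close the spider whenever no
            # requests are pending between polls.
            raise DontCloseSpider

        def _schedule_poll(self):
            # Run the blocking receive in a thread; consume the Deferred's
            # result in a callback instead of a shared attribute.
            d = deferToThread(self._receive_from_queue)
            d.addCallback(self._on_message)
            d.addErrback(lambda failure: self.logger.error(failure))

        def _receive_from_queue(self):
            # Placeholder for the boto/SQS receive_message() and delete logic
            # from the thread; return a URL string or None.
            return None

        def _on_message(self, url):
            if url:
                # Hand the request straight to the running engine; newer
                # Scrapy versions take only the request argument here.
                self.crawler.engine.crawl(Request(url, callback=self.parse), self)
            # Re-arm the poll after a delay without blocking the reactor.
            reactor.callLater(10, self._schedule_poll)

        def parse(self, response):
            self.logger.info("Parsed %s", response.url)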