Updated the SO answer with a functional example. Cheers.

On Friday, June 17, 2016 at 5:53:44 AM UTC+1, Jeremy D wrote:

Does seem to work. Using deferToThread I run into the same problem, where it doesn't get into the parse() method until the program closes. I'm open to other ideas for how to organically get a URL for Scrapy to crawl that isn't through a message queue, though this seems to be the most sensible option, if I can get it to work.

This is pretty messy, but here's what I have (I've never used deferToThread, or much threading in general for that matter, so I may be doing this wrong).

Full pastebin here (exactly what I have, minus AWS creds): http://pastebin.com/4cebXyTc

    def start_requests(self):
        self.logger.error("STARTING QUEUE")
        while True:
            queue = deferToThread(self.queue)
            self.logger.error(self.cpuz_url)
            if self.cpuz_url is None:
                time.sleep(10)
                continue
            yield Request(self.cpuz_url, self.parse)

I've then changed my queue() function to have a try/except after it gets the message:

    try:
        message = message[0]
        message_body = message.get_body()
        self.logger.error(message_body)
        message_body = str(message_body).split(',')
        message.delete()
        self.cpuz_url = message_body[0]
        self.uid = message_body[1]
    except:
        self.logger.error(message)
        self.logger.error(self.cpuz_url)
        self.cpuz_url = None

On Thu, Jun 16, 2016 at 8:23 PM, Neverlast N <never...@hotmail.com> wrote:

Thanks for bringing this up. I answered on SO. As a methodology, I would say: try to make the simplest working thing possible and then build up towards the more complex code you have. See at which point it breaks. Is it when you add an API call? Is it when you return something? What I did was to replace your queue() with this, and it seems to work:

    def queue(self):
        return 'http://www.example.com/?{}'.format(random.randint(0, 100000))

What can we infer from this?

------------------------------
From: jdavi...@gmail.com
Date: Thu, 16 Jun 2016 13:43:28 -0400
Subject: Trying to read from message queue, not parsing response in make_requests_from_url loop
To: scrapy...@googlegroups.com

I have this question on SO, but no answers unfortunately. Figured I'd try my luck here.

https://stackoverflow.com/questions/37770678/scrapy-not-parsing-response-in-make-requests-from-url-loop

I'm trying to get Scrapy to grab a URL from a message queue and then scrape that URL. I have the loop going just fine and grabbing the URL from the queue, but it never enters the parse() method once it has a URL; it just continues to loop (and sometimes the URL comes back around even though I've deleted it from the queue...).

While it's running in the terminal, if I Ctrl+C and force it to end, it enters the parse() method and crawls the page, then ends. I'm not sure what's wrong here. Scrapy needs to be running at all times to catch a URL as it enters the queue. Does anyone have ideas, or has anyone done something like this?

    class my_Spider(Spider):
        name = "my_spider"
        allowed_domains = ['domain.com']

        def __init__(self):
            super(my_Spider, self).__init__()
            self.url = None

        def start_requests(self):
            while True:
                # Crawl the url from queue
                yield self.make_requests_from_url(self._pop_queue())

        def _pop_queue(self):
            # Grab the url from queue
            return self.queue()

        def queue(self):
            url = None
            while url is None:
                conf = {
                    "sqs-access-key": "",
                    "sqs-secret-key": "",
                    "sqs-queue-name": "crawler",
                    "sqs-region": "us-east-1",
                    "sqs-path": "sqssend"
                }
                # Connect to AWS
                conn = boto.sqs.connect_to_region(
                    conf.get('sqs-region'),
                    aws_access_key_id=conf.get('sqs-access-key'),
                    aws_secret_access_key=conf.get('sqs-secret-key')
                )
                q = conn.get_queue(conf.get('sqs-queue-name'))
                message = conn.receive_message(q)
                # Didn't get a message back, wait.
                if not message:
                    time.sleep(10)
                    url = None
                else:
                    url = message
            if url is not None:
                message = url[0]
                message_body = str(message.get_body())
                message.delete()
                self.url = message_body
                return self.url

        def parse(self, response):
            ...
            yield item
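
For reference, the "simplest working thing" suggested in the reply above can be turned into a complete, runnable spider by swapping the SQS call for a stub. This is only an illustrative sketch (the class name StubQueueSpider and its layout are made up for the example, not taken from the pastebin). The point it demonstrates: the while True / yield loop itself is fine as long as nothing inside it blocks, because Scrapy pulls requests from start_requests lazily.

    import random

    from scrapy import Request, Spider


    class StubQueueSpider(Spider):
        """Minimal spider that 'polls' a fake queue without ever blocking."""
        name = "stub_queue_spider"

        def start_requests(self):
            while True:
                # The stub returns immediately, so the engine keeps pulling
                # requests from this generator and parse() gets called.
                yield Request(self.queue(), callback=self.parse)

        def queue(self):
            # Stand-in for the SQS receive; never sleeps, never blocks.
            return 'http://www.example.com/?{}'.format(random.randint(0, 100000))

        def parse(self, response):
            self.logger.info("Parsed %s", response.url)

If this version reaches parse() but the SQS version does not, the difference points at the blocking calls (time.sleep() and the synchronous receive_message()) rather than at the loop structure itself.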
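
On the deferToThread attempt: deferToThread() returns a Deferred, so its result has to be consumed through a callback rather than read back from a shared attribute, and the time.sleep() still blocks the Twisted reactor inside start_requests, which matches the symptom of parse() only firing once the process shuts down. Below is a rough sketch of a non-blocking shape for this, not the actual SO answer: the class and helper names (QueuePollingSpider, _schedule_poll, _receive_from_queue, _on_message) are placeholders, and the two-argument self.crawler.engine.crawl(request, spider) call reflects the Scrapy API current at the time of this thread.

    from twisted.internet import reactor
    from twisted.internet.threads import deferToThread

    from scrapy import Request, Spider, signals
    from scrapy.exceptions import DontCloseSpider


    class QueuePollingSpider(Spider):
        """Sketch: poll a queue in a worker thread and feed requests to the engine."""
        name = "queue_polling_spider"

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super(QueuePollingSpider, cls).from_crawler(crawler, *args, **kwargs)
            # Keep the spider alive while it waits for queue messages.
            crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
            return spider

        def start_requests(self):
            # Don't loop or sleep here; arm the first poll and return.
            self._schedule_poll()
            return []

        def spider_idle(self):
            # Without this, Scrapy would close the spider whenever no
            # requests are pending between polls.
            raise DontCloseSpider

        def _schedule_poll(self):
            # Run the blocking receive in a thread; consume the Deferred's
            # result in a callback instead of a shared attribute.
            d = deferToThread(self._receive_from_queue)
            d.addCallback(self._on_message)
            d.addErrback(lambda failure: self.logger.error(failure))

        def _receive_from_queue(self):
            # Placeholder for the boto/SQS receive_message() and delete logic
            # from the thread; return a URL string or None.
            return None

        def _on_message(self, url):
            if url:
                # Hand the request straight to the running engine; newer
                # Scrapy versions take only the request argument here.
                self.crawler.engine.crawl(Request(url, callback=self.parse), self)
            # Re-arm the poll after a delay without blocking the reactor.
            reactor.callLater(10, self._schedule_poll)

        def parse(self, response):
            self.logger.info("Parsed %s", response.url)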