That does seem to work. Using deferToThread I run into the same problem, where
it doesn't get into the parse() method until the program closes. I'm open to
other ideas for organically getting a URL for Scrapy to crawl that isn't
through a message queue, though this seems like the most sensible option,
if I can get it to work.
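
For reference, here's my (possibly wrong) understanding of how deferToThread
is meant to be used on its own, outside the spider: a standalone sketch, where
the Deferred it returns only fires via a callback once the reactor gets a
chance to run. That may be exactly what my while/sleep loop is preventing.

    # Standalone sketch, not my spider code: deferToThread() returns a
    # Deferred immediately; the blocking function's result only arrives
    # through a callback once the reactor is running.
    import random

    from twisted.internet import reactor
    from twisted.internet.threads import deferToThread

    def blocking_poll():
        # stand-in for a blocking SQS receive_message() call
        return 'http://www.example.com/?{}'.format(random.randint(0, 100000))

    def got_url(url):
        print('got url: {}'.format(url))
        reactor.stop()

    d = deferToThread(blocking_poll)  # returns right away with a Deferred
    d.addCallback(got_url)            # fires once blocking_poll() finishes
    reactor.run()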

This is pretty messy, but here's what I have (I've never used
deferToThread, or much threading in general for that matter, so I may be
doing this wrong).

Full pastebin here (exactly what I have, minus AWS creds):
http://pastebin.com/4cebXyTc

def start_requests(self):
    self.logger.error("STARTING QUEUE")
    while True:
        queue = deferToThread(self.queue)
        self.logger.error(self.cpuz_url)
        if self.cpuz_url is None:
            time.sleep(10)
            continue
        yield Request(self.cpuz_url, self.parse)
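
An alternative I've been wondering about (rough and untested): poll the queue
only when the spider goes idle and feed requests in through the spider_idle
signal, instead of looping inside start_requests(). The sketch below assumes
Scrapy's spider_idle signal, DontCloseSpider, and the engine.crawl(request,
spider) call that current versions expose.

    from scrapy import Request, Spider, signals
    from scrapy.exceptions import DontCloseSpider

    class QueueSpider(Spider):
        name = 'queue_spider'

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super(QueueSpider, cls).from_crawler(crawler, *args, **kwargs)
            crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
            return spider

        def spider_idle(self):
            # Called whenever the scheduler runs dry; a long blocking poll
            # here would still tie up the reactor, so keep the SQS wait short.
            url = self.queue()
            if url:
                self.crawler.engine.crawl(Request(url, callback=self.parse), self)
            # Keep the spider alive so it waits for the next message.
            raise DontCloseSpider

        def queue(self):
            # Placeholder for my existing SQS receive_message() logic.
            return None

        def parse(self, response):
            pass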

I've then changed my queue() function to have a try/except after it gets the message back from SQS:

        try:
            message = message[0]
            message_body = message.get_body()
            self.logger.error(message_body)
            message_body = str(message_body).split(',')
            message.delete()
            self.cpuz_url = message_body[0]
            self.uid = message_body[1]
        except:
            self.logger.error(message)
            self.logger.error(self.cpuz_url)
            self.cpuz_url = None
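
If it helps readability, the same block with specific exceptions instead of a
bare except might look like the sketch below; IndexError/AttributeError are
just my guess at what actually fails when the queue comes back empty or the
message is malformed.

        try:
            message = message[0]
            message_body = str(message.get_body())
            self.logger.error(message_body)
            parts = message_body.split(',')
            message.delete()
            self.cpuz_url = parts[0]
            self.uid = parts[1]
        except (IndexError, AttributeError) as exc:
            # Only swallow the failures I expect; anything else should surface.
            self.logger.error('Bad or empty SQS message %r: %s', message, exc)
            self.cpuz_url = None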



On Thu, Jun 16, 2016 at 8:23 PM, Neverlast N <neverla...@hotmail.com> wrote:

> Thanks for bringing this up. I answered on SO. As a methodology, I would
> say: try to make the simplest working thing possible and then build up
> towards the more complex code you have. See at which point it breaks. Is it
> when you add an API call? Is it when you return something? What I did was
> to replace your queue() with this, and it seems to work:
>
>     def queue(self):
>         return 'http://www.example.com/?{}'.format(random.randint(0,100000))
>
> What can we infer from this?
>
>
> ------------------------------
> From: jdavis....@gmail.com
> Date: Thu, 16 Jun 2016 13:43:28 -0400
> Subject: Trying to read from message queue, not parsing response in
> make_requests_from_url loop
> To: scrapy-users@googlegroups.com
>
>
> I have this question on SO, but no answers unfortunately. Figured I'd try
> my luck here.
>
>
> https://stackoverflow.com/questions/37770678/scrapy-not-parsing-response-in-make-requests-from-url-loop
>
> I'm trying to get Scrapy to grab a URL from a message queue and then
> scrape that URL. I have the loop going just fine and grabbing the URL from
> the queue, but it never enters the parse() method once it has a URL; it
> just continues to loop (and sometimes the URL comes back around even though
> I've deleted it from the queue...).
>
> While it's running in the terminal, if I CTRL+C and force it to end, it
> enters the parse() method and crawls the page, then ends. I'm not sure
> what's wrong here. Scrapy needs to be running at all times to catch a URL
> as it enters the queue. Has anyone got ideas or done something like this?
>
>
> class my_Spider(Spider):
>         name = "my_spider"
>         allowed_domains = ['domain.com']
>
>         def __init__(self):
>             super(my_Spider, self).__init__()
>             self.url = None
>
>         def start_requests(self):
>             while True:
>                 # Crawl the url from queue
>                 yield self.make_requests_from_url(self._pop_queue())
>
>         def _pop_queue(self):
>             # Grab the url from queue
>             return self.queue()
>
>         def queue(self):
>             url = None
>             while url is None:
>                 conf = {
>                     "sqs-access-key": "",
>                     "sqs-secret-key": "",
>                     "sqs-queue-name": "crawler",
>                     "sqs-region": "us-east-1",
>                     "sqs-path": "sqssend"
>                 }
>                 # Connect to AWS
>                 conn = boto.sqs.connect_to_region(
>                     conf.get('sqs-region'),
>                     aws_access_key_id=conf.get('sqs-access-key'),
>                     aws_secret_access_key=conf.get('sqs-secret-key')
>                 )
>                 q = conn.get_queue(conf.get('sqs-queue-name'))
>                 message = conn.receive_message(q)
>                 # Didn't get a message back, wait.
>                 if not message:
>                     time.sleep(10)
>                     url = None
>                 else:
>                     url = message
>             if url is not None:
>                 message = url[0]
>                 message_body = str(message.get_body())
>                 message.delete()
>                 self.url = message_body
>                 return self.url
>
>         def parse(self, response):
>             ...
>             yield item
>
>
