Hi, I am looking for a web-scraping framework that is easy to use and configure and that can also run as part of a large backend. The backend gets thousands of requests per second and needs to do scraping as part of its logic. We use RabbitMQ, and I wonder whether Scrapy can be part of such a system. Each request carries a different URL (there is a small set of domains, but the path/query string etc. is dynamic).
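To make the setup concrete, here is a minimal sketch of the consumer side; the queue name and the message schema (a JSON body with a "url" field) are assumptions for illustration, not our actual wire format:

import json

import pika

# Hypothetical RabbitMQ consumer; queue name and message shape are made up.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="scrape_requests", durable=True)

def on_message(ch, method, properties, body):
    # Each message carries one dynamic URL to scrape.
    url = json.loads(body)["url"]
    # ... hand the URL off to the scraping layer here ...
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="scrape_requests", on_message_callback=on_message)
channel.start_consuming()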
So, I wonder about the following questions:

1. Can I pass the URL as an argument? The spider is identified by the domain, but the URL is dynamic, so the spider has to receive it at launch time (see the spider sketch below).

2. Integration, performance and scale: I've read that in a running system Scrapy can be invoked through the scrapyd JSON API, which opens up a process for each job. So, in my system that passes lots of Rabbit messages around, each scrape would trigger a JSON request and we would end up with lots of concurrent processes (see the scheduling call at the end). If a single spider takes 2 seconds, the number of processes would keep growing until it chokes the server. I fear this model is problematic.

I'd appreciate your advice.
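For question 1, this is roughly the pattern I have in mind, using Scrapy's -a spider arguments; the spider name and the parse logic are just placeholders:

import scrapy

class DomainSpider(scrapy.Spider):
    # Placeholder name; in practice there would be one spider per domain.
    name = "domain_spider"

    def __init__(self, url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # The URL is supplied at launch time, e.g.:
        #   scrapy crawl domain_spider -a url=http://example.com/some/dynamic/path
        self.start_urls = [url] if url else []

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}

Is this the intended way to feed a dynamic URL into a spider, or is there a better mechanism?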

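For question 2, the scrapyd call I was picturing looks something like this; "myproject" and "domain_spider" are placeholders, and any extra POST parameter is forwarded to the spider as an argument:

import requests

# Schedule one crawl job via scrapyd's JSON API.
response = requests.post(
    "http://localhost:6800/schedule.json",
    data={
        "project": "myproject",      # placeholder project name
        "spider": "domain_spider",   # placeholder spider name
        "url": "http://example.com/some/dynamic/path",  # forwarded as -a url=...
    },
)
print(response.json())  # e.g. {"status": "ok", "jobid": "..."}

At thousands of Rabbit messages per second, each of these calls eventually becomes its own crawler process, which is exactly the growth I am worried about.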