Hi, I am looking for a web-scraping framework that is easy to use and configure and that can also run as part of a large backend. The backend gets thousands of requests per second and needs to do scraping as part of its logic. We use RabbitMQ, and I wonder whether Scrapy can be part of such a system. Each request carries a different URL (there is a small set of domains, but the path/query string etc. is dynamic).
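To make the setup concrete, here is a minimal sketch of the consumer side; the queue name and the message schema (a JSON body with a "url" field) are assumptions for illustration, not our actual wire format:

import json

import pika

# Hypothetical RabbitMQ consumer; queue name and message shape are made up.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="scrape_requests", durable=True)

def on_message(ch, method, properties, body):
    # Each message carries one dynamic URL to scrape.
    url = json.loads(body)["url"]
    # ... hand the URL off to the scraping layer here ...
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="scrape_requests", on_message_callback=on_message)
channel.start_consuming()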
So, I wonder about the following questions:

1. Can I pass the URL as an argument? The spider is identified by the domain, but the URL is dynamic, so the spider has to receive it at launch time (see the spider sketch below).

2. Integration, performance and scale: I've read that in a running system Scrapy can be invoked through the scrapyd JSON API, which opens up a process for each job. So, in my system that passes lots of Rabbit messages around, each scrape would trigger a JSON request and we would end up with lots of concurrent processes (see the scheduling call at the end). If a single spider takes 2 seconds, the number of processes would keep growing until it chokes the server. I fear this model is problematic.

I'd appreciate your advice.
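For question 1, this is roughly the pattern I have in mind, using Scrapy's -a spider arguments; the spider name and the parse logic are just placeholders:

import scrapy

class DomainSpider(scrapy.Spider):
    # Placeholder name; in practice there would be one spider per domain.
    name = "domain_spider"

    def __init__(self, url=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # The URL is supplied at launch time, e.g.:
        #   scrapy crawl domain_spider -a url=http://example.com/some/dynamic/path
        self.start_urls = [url] if url else []

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}

Is this the intended way to feed a dynamic URL into a spider, or is there a better mechanism?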

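For question 2, the scrapyd call I was picturing looks something like this; "myproject" and "domain_spider" are placeholders, and any extra POST parameter is forwarded to the spider as an argument:

import requests

# Schedule one crawl job via scrapyd's JSON API.
response = requests.post(
    "http://localhost:6800/schedule.json",
    data={
        "project": "myproject",      # placeholder project name
        "spider": "domain_spider",   # placeholder spider name
        "url": "http://example.com/some/dynamic/path",  # forwarded as -a url=...
    },
)
print(response.json())  # e.g. {"status": "ok", "jobid": "..."}

At thousands of Rabbit messages per second, each of these calls eventually becomes its own crawler process, which is exactly the growth I am worried about.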