If you only need to crawl one URL per process, why not just use python-requests? Scrapy is a high-performance scraping framework; it works best when you feed it a stream of URLs rather than launching it once per URL. Passing a single URL as an argument isn't really the model, but there are options like scrapy-redis (see the sketch below):
https://github.com/darkrho/scrapy-redis
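With scrapy-redis the spider stays running and pops its start URLs from a Redis list, so your RabbitMQ consumer only pushes URLs instead of spawning a scrapyd process per message. Roughly something like this (a sketch along the lines of the scrapy-redis README; the key name "myspider:start_urls" and the parse logic are just placeholders):

    # settings.py -- route scheduling through Redis so URLs can be fed at runtime
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    REDIS_URL = "redis://localhost:6379"

    # myspider.py -- long-running spider that reads start URLs from a Redis list
    from scrapy_redis.spiders import RedisSpider

    class MySpider(RedisSpider):
        name = "myspider"
        # the Redis list your backend pushes URLs into (placeholder name)
        redis_key = "myspider:start_urls"

        def parse(self, response):
            # extract whatever your backend needs; example fields only
            yield {"url": response.url, "title": response.css("title::text").get()}

Then your backend (or a quick test with redis-cli) just pushes URLs into that list, e.g. redis-cli lpush myspider:start_urls http://example.com/some/path, and the already-running spiders pick them up.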
On Wednesday, December 3, 2014 at 09:02:35 UTC-2, [email protected] wrote:
>
> Hi,
> I am looking for a web-scraping framework that is both easy to use/configure and can be part of a large backend. The backend gets thousands of requests per second and needs to do scraping as part of its logic. We use RabbitMQ, and I wonder if Scrapy can be part of such a system. Each request carries a different URL (there is a small set of domains, but the path/query etc. is dynamic).
>
> So, I wonder about the following questions:
> 1. Can I pass the URL as an argument? The spider is known by the domain, but the URL is dynamic, so the spider has to get it dynamically.
> 2. Integration, performance and scale: I've read that in a running system Scrapy can be invoked through the scrapyd JSON API, which actually opens up a process. So in my system, which passes lots of Rabbit messages around, each scrape would launch a JSON request and we would end up with lots of concurrent processes. Imagine a single spider takes 2 seconds; the number of processes would keep growing until it chokes the server. I fear this model is problematic.
>
> I'd appreciate your advice.
