I'm planning on consuming scrape requests from a RabbitMQ queue and publishing scraped items to another queue.
What's the preferred way of doing this? I've got a basic Django webservice that my scraping application lives under, and so far I've seen two possible approaches:

1. Write a singleton, say CrawlManager, that consumes from the queue and sends HTTP POST requests to a running instance of Scrapyd. The advantage here (correct me if I'm wrong) is that Scrapyd will take care of much of the scheduling/throttling logic for me (via MAX_PROC_PER_CPU, for example). There's a rough sketch of this option below.

2. Start crawler processes within the CrawlManager class myself, using the Core API. The disadvantage (again, correct me if I'm wrong) is not having Scrapyd manage scheduling and throttling of the multiple crawls for me, and not having the crawls synced up to the web job interface.

In either case, I'd have to find a way, via item pipelines or multiprocessing, to have the child processes talk back to the CrawlManager. I don't see an absolutely clear path to doing this in the first case. But in the second case, having read some multiprocessing tutorials last night, I think there are no major blockers to having a child process communicate with a parent one.

Which of these, with modifications, would be the preferred, non-hacky, and most useful way of scheduling spiders from a crawl request queue and publishing scraped items back to the queue via the CrawlManager instance?
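To make option 1 concrete, here's a minimal sketch of the shape I'm picturing. It assumes pika for the RabbitMQ side, a Scrapyd instance on localhost:6800, and placeholder queue/project/spider names (scrape_requests, scraped_items, myproject); none of this is running code yet.

# crawl_manager.py -- consumes scrape requests and schedules each one as a
# Scrapyd job via the schedule.json endpoint. Queue names, the Scrapyd URL,
# and the project/spider names are all placeholders.
import json

import pika
import requests

SCRAPYD_URL = "http://localhost:6800/schedule.json"

def on_request(channel, method, properties, body):
    """Turn one queued scrape request into a Scrapyd job."""
    request = json.loads(body)
    response = requests.post(SCRAPYD_URL, data={
        "project": "myproject",        # placeholder project name
        "spider": request["spider"],   # spider name carried in the message
        "url": request["url"],         # forwarded to the spider as an argument
    })
    response.raise_for_status()
    channel.basic_ack(delivery_tag=method.delivery_tag)

def run_crawl_manager():
    """Blocking consumer loop over the scrape-request queue."""
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="scrape_requests", durable=True)
    channel.basic_consume(queue="scrape_requests", on_message_callback=on_request)
    channel.start_consuming()

# pipelines.py (would live in the Scrapy project deployed to Scrapyd, enabled
# via ITEM_PIPELINES) -- publishes each scraped item to the results queue
# instead of reporting back to the parent process.
class RabbitMQPublishPipeline:
    def open_spider(self, spider):
        self.connection = pika.BlockingConnection(
            pika.ConnectionParameters("localhost"))
        self.channel = self.connection.channel()
        self.channel.queue_declare(queue="scraped_items", durable=True)

    def process_item(self, item, spider):
        self.channel.basic_publish(
            exchange="",
            routing_key="scraped_items",
            body=json.dumps(dict(item)),
        )
        return item

    def close_spider(self, spider):
        self.connection.close()

If the item pipeline publishes straight back to RabbitMQ like this, it seems like the crawl processes might never need to talk to the CrawlManager directly at all, but I'm not sure whether that's the sane way to close the loop, which is really what I'm asking.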
