I'm planning on consuming scrape requests from a RabbitMQ queue and 
publishing scraped items to another queue.

What's the preferred way of doing this? I've got a basic Django webservice 
that my scraping application lives under, and so far I've seen two possible 
approaches:

   1. Write a singleton, say CrawlManager, that consumes from the queue and 
   sends HTTP POST requests to a running instance of Scrapyd. The advantage of 
   this (correct me if I'm wrong) is that Scrapyd will take care of much of 
   the scheduling/throttling logic for me (via MAX_PROC_PER_CPU, for example). 
   A rough sketch of this follows the list.
   2. Start crawler processes within the CrawlManager class myself, using 
   the Core API. The disadvantage of this (correct me if I'm wrong) is not 
   having Scrapyd manage scheduling and throttling the multiple crawls for me, 
   and not having the crawls synced up to the web job interface.
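
For the first approach, something along these lines is what I had in mind: a 
blocking pika consumer that turns each queued request into a call to Scrapyd's 
schedule.json endpoint. The queue name, Scrapyd project name, and the fields I 
pull out of the message body are just placeholders for whatever I end up using:

import json

import pika      # RabbitMQ client
import requests  # for calling Scrapyd's HTTP JSON API

SCRAPYD_URL = "http://localhost:6800/schedule.json"  # assuming a local Scrapyd

def on_request(channel, method, properties, body):
    """Turn one queued scrape request into a Scrapyd job via schedule.json."""
    request = json.loads(body)
    response = requests.post(SCRAPYD_URL, data={
        "project": "myproject",        # placeholder Scrapyd project name
        "spider": request["spider"],   # placeholder message field
        "start_url": request["url"],   # passed through as a spider argument
    })
    print("scheduled job:", response.json().get("jobid"))
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="scrape_requests", durable=True)  # placeholder queue name
channel.basic_consume(queue="scrape_requests", on_message_callback=on_request)
channel.start_consuming()

Scrapyd would then apply its own process limits when it actually runs the 
jobs, which is the scheduling/throttling I'd rather not reimplement myself.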

In either case, I'd have to find a way, via item pipelines or 
multiprocessing, to have the child processes talk back to the CrawlManager. 
I don't see an absolutely clear path to doing this in the first case. But in 
the second case, having read some multiprocessing tutorials last night, I 
think there are no major blockers to having a child process communicate with 
a parent one.
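
For getting the items back onto a queue, I was picturing an item pipeline 
roughly like the one below, since pipelines run inside the crawl process under 
either approach. The broker address and queue name are again placeholders:

import json

import pika

class RabbitMQItemPipeline:
    """Publish each scraped item to an items queue as JSON."""

    def open_spider(self, spider):
        self.connection = pika.BlockingConnection(
            pika.ConnectionParameters("localhost"))  # placeholder broker address
        self.channel = self.connection.channel()
        self.channel.queue_declare(queue="scraped_items", durable=True)  # placeholder queue

    def process_item(self, item, spider):
        self.channel.basic_publish(
            exchange="",
            routing_key="scraped_items",
            body=json.dumps(dict(item)),
            properties=pika.BasicProperties(delivery_mode=2),  # persistent message
        )
        return item

    def close_spider(self, spider):
        self.connection.close()

This would be enabled through ITEM_PIPELINES in the project settings, e.g. 
ITEM_PIPELINES = {"myproject.pipelines.RabbitMQItemPipeline": 300}.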

Which of these, with modifications, would be the preferred, non-hacky, and 
most useful way of scheduling spiders from a crawl request queue and 
publishing scraped items back to the queue via the CrawlManager instance?
