With crawlers you'll likely spend most of your time waiting to download the HTML, not parsing it. Spawn a handful of Ruby threads that each download and parse pages (ideally from different domains) and you'll parallelize the work: less waiting, more working, and higher throughput per worker. This fits nicely within your existing model of scaling workers for more throughput; a sketch of the threaded approach follows below.
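A minimal sketch of that threaded approach, assuming the background job is handed a batch of URLs to analyze; `THREAD_COUNT`, the `parse` stub, and the queue-plus-sentinel shape are all placeholders to adapt to your analyzer:

```ruby
require "net/http"
require "uri"

# Hypothetical parse step; substitute the real site analysis here.
def parse(html)
  # ...
end

THREAD_COUNT = 10   # tune for your dyno size and the sites you crawl

def analyze(urls)
  queue = Queue.new
  urls.each { |u| queue << u }
  THREAD_COUNT.times { queue << :done }   # one sentinel per thread so each exits cleanly

  threads = Array.new(THREAD_COUNT) do
    Thread.new do
      while (url = queue.pop) != :done
        html = Net::HTTP.get(URI(url))    # thread blocks here on network I/O,
        parse(html)                       # leaving the others free to keep downloading
      end
    end
  end
  threads.each(&:join)
end
```

Plain threads work here because the bottleneck is blocking network I/O, during which MRI releases the GVL, so the downloads overlap; EventMachine would get a similar overlap with a single thread and callbacks, at the cost of restructuring the code around its event loop.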
Date: Wed, 10 Oct 2012 00:41:24 -0700
From: [email protected]
To: [email protected]
Subject: web crawler throughput with background jobs?

Hi everyone,

I've written a site analyzer (it crawls the site) and am trying to figure out the best way to deploy it on Heroku. The analyzer will have a never-ending list of sites to analyze and I want to maximize throughput while minimizing costs.

The worst case scenario is each site analysis is processed by the worker one at a time. To scale I increase workers. Is there a smarter/more efficient way? For example, could I use EventMachine within the background job to analyze multiple sites in 'parallel'? Or could I use Unicorn within the background job to achieve the same kind of desired parallelism?

Thanks for any replies in advance!
