We faced the same issue: we wanted to improve the throughput of a long-running web scraper running as a worker on Heroku. After looking at the available options (including EventMachine, which is clearly a good product), we chose Sidekiq. Sidekiq multi-threads jobs within a single worker process, and you can also run multiple worker processes if you need additional parallelism. We are still in development, partly because we want to scale workers as demand requires, and Sidekiq does not scale Heroku workers for you; you must do that yourself (a couple of people are working on HireFire-like add-ons that would handle this). So far, though, we have run up to 10 threads simultaneously in a single worker process, with a 6:1 reduction in wall-clock time (1,200 tasks going from ~66 minutes to ~11 minutes).
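For concreteness, here is a minimal sketch of the kind of job class I am describing. The names (SiteAnalyzerWorker, the queue name, the parsing step) are illustrative, not our actual code:

    # app/workers/site_analyzer_worker.rb
    require 'sidekiq'
    require 'net/http'

    class SiteAnalyzerWorker
      include Sidekiq::Worker
      sidekiq_options queue: :crawler, retry: 3

      # One job = one site analysis. Sidekiq runs many of these
      # concurrently, on separate threads, in a single worker process.
      def perform(url)
        response = Net::HTTP.get_response(URI(url))
        # ... parse response.body and store the results here ...
      end
    end

    # Enqueue from anywhere in the app:
    #   SiteAnalyzerWorker.perform_async('http://example.com')

The thread count is set when the worker process starts; on Heroku that means a Procfile line such as "worker: bundle exec sidekiq -c 10" to get the 10 threads mentioned above.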
We chose Sidekiq over EventMachine because we felt our workload fit a task structure better than an event architecture: each of our jobs consists of 1-10 web requests that follow from one another and cannot be predicted in advance, and we felt that serial chain of requests would map poorly onto EM's event-loop style.

Sidekiq does require a Redis instance, but you can sign up for one as an add-on at Heroku, and the price is justified by the speed with which Redis persists and retrieves data. We signed up for a separate Redis To Go instance to use for development/testing, but you can set up your own Redis instance on your own machine (Redis is open source) if you prefer that approach. The Sidekiq community is relatively good at answering newbie questions; I have no experience with the EM community, so I can't compare the two on that score.

One concern with Sidekiq I would mention is that there seems to be a 1-2 second overhead in the startup/shutdown of each task, so keep that in mind when structuring your jobs. Many of our tasks are very short (~5 seconds), so the overhead becomes a large fraction (~20%) of the task time. I would recommend that your tasks be no shorter than 10 seconds, with 20-30 seconds being most desirable.

In summary, while Sidekiq is certainly not the only good solution, I would recommend you consider it for your project if your needs are not well met by an event architecture. Hope this helps. (Two small sketches, one for the Redis wiring and one for batching short jobs, follow the quoted thread below.)

On Oct 10, 2012, at 4:13 AM, [email protected] wrote:

> web crawler throughput with background jobs?
> Craayzie <[email protected]> Oct 10 12:41AM -0700
>
> Hi everyone, I've written a site analyzer (it crawls the site) and am
> trying to figure out the best way to deploy it on Heroku.
>
> The analyzer will have a never ending list of sites to analyze and I want
> to maximize throughput while minimizing costs.
>
> The worst case scenario is each site analysis is processed by the worker
> one at a time. To scale I increase workers.
>
> Is there a smarter/more efficient way?
>
> For example, could I use EventMachine within the background job to analyze
> multiple sites in 'parallel'? Or could I use Unicorn within the background
> job to achieve the same kind of desired parallelism?
>
> Thanks for any replies in advance!
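On the Redis wiring mentioned above, here is a minimal sketch, assuming the Redis To Go add-on (which sets the REDISTOGO_URL config var on Heroku) and a local Redis for development; adjust the env var and fallback to your setup:

    # config/initializers/sidekiq.rb
    require 'sidekiq'

    # REDISTOGO_URL is set by the Redis To Go add-on on Heroku;
    # fall back to a local Redis instance for development/testing.
    redis_url = ENV['REDISTOGO_URL'] || 'redis://localhost:6379/0'

    Sidekiq.configure_server do |config|
      config.redis = { url: redis_url }
    end

    Sidekiq.configure_client do |config|
      config.redis = { url: redis_url }
    end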

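And one sketch of the point about short tasks: rather than one URL per job, enqueue small groups so each job runs 20-30 seconds and the 1-2 second per-job overhead is amortized. The names here (BatchAnalyzerWorker, all_urls, analyze) are hypothetical:

    require 'sidekiq'

    class BatchAnalyzerWorker
      include Sidekiq::Worker

      # Process several sites per job so the fixed per-job overhead
      # (~1-2s) stays small relative to the job's total run time.
      def perform(urls)
        urls.each { |url| analyze(url) }  # ~5s each, so 5 URLs ~= 25s
      end

      private

      def analyze(url)
        # application-specific fetch + parse
      end
    end

    # Enqueue in slices of 5 instead of one URL per job:
    all_urls.each_slice(5) do |batch|
      BatchAnalyzerWorker.perform_async(batch)
    end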