We faced the same issue, wanting to improve the efficiency of a long-running 
web scraper running as a worker on Heroku. After looking at the available 
options (including EventMachine, which is clearly a good product), we chose 
Sidekiq. Sidekiq multi-threads tasks within a single worker process (and you 
can run multiple worker processes if you need additional parallelism). We 
are still in development, in part because we want to scale workers as 
demand requires, and Sidekiq does not scale Heroku workers for you; you 
must do that yourself (a couple of people are working on HireFire-like 
add-ons that would handle this). So far, though, we have run up to 10 
threads simultaneously on a single worker with a 6:1 reduction in 
wall-clock time (1,200 tasks going from ~66 minutes to ~11 minutes).
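
For concreteness, here is a minimal sketch of what one of these workers 
looks like (the class names AnalyzeSiteWorker and SiteAnalyzer and the 
queue name are my own hypothetical examples; this assumes the current 
Sidekiq 2.x API):

    # A minimal Sidekiq worker sketch; AnalyzeSiteWorker and
    # SiteAnalyzer are hypothetical names for illustration.
    require 'sidekiq'

    class AnalyzeSiteWorker
      include Sidekiq::Worker
      sidekiq_options queue: 'scraper', retry: 3

      # Each job handles one site; Sidekiq runs many of these
      # concurrently on threads inside a single worker process.
      def perform(site_url)
        SiteAnalyzer.new(site_url).run
      end
    end

    # Enqueue from anywhere in the app:
    # AnalyzeSiteWorker.perform_async('http://example.com')

The thread count is set when you boot the worker process, e.g. 
"bundle exec sidekiq -c 10 -q scraper" for the 10 threads mentioned above.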

We chose Sidekiq over EventMachine because our architecture more closely 
fits a task structure than an event architecture (each of our tasks 
consists of 1-10 web requests that follow from one another and cannot be 
predicted in advance), and we felt the serial nature of those requests 
would be poorly reflected in EM's event-loop style.
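
To make the contrast concrete, here is a rough sketch of the serial 
pattern I mean (extract_next_url is a hypothetical helper):

    require 'net/http'
    require 'uri'

    # Each request depends on the previous response, so the code
    # reads top to bottom on one Sidekiq thread; under EM the same
    # chain would be scattered across callbacks.
    def perform(start_url)
      url = start_url
      while url
        response = Net::HTTP.get_response(URI.parse(url))
        # extract_next_url is hypothetical: it parses the page and
        # returns the next URL to follow, or nil when the chain ends.
        url = extract_next_url(response.body)
      end
    end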

Sidekiq does require a Redis instance, but you can sign up for one as a 
Heroku add-on, and the price is justified by the speed with which Redis 
persists and retrieves data. We signed up for a separate Redis To Go 
instance for development/testing, but Redis is open-source, so you can also 
set up your own instance on your own machine if you prefer that approach.
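
Wiring Sidekiq to the add-on is just a matter of pointing it at the 
REDISTOGO_URL config var the add-on sets; a minimal sketch (the 
initializer path and localhost fallback are assumptions):

    # config/initializers/sidekiq.rb
    require 'sidekiq'

    # Use the Redis To Go URL on Heroku, or a local Redis otherwise.
    redis_url = ENV['REDISTOGO_URL'] || 'redis://localhost:6379/0'

    Sidekiq.configure_server do |config|
      config.redis = { url: redis_url }
    end

    Sidekiq.configure_client do |config|
      config.redis = { url: redis_url }
    end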

The Sidekiq community is relatively good at answering newbie questions, but 
I have no experience with EM, so I can't compare the two on that front.

One concern I would mention with Sidekiq is that there seems to be a 1-2 
second startup/shutdown overhead per task, so keep that in mind when 
structuring your jobs: many of our tasks are very short (~5 seconds), so 
the overhead becomes a big part (around 20%) of the task time. I would 
recommend your tasks be no shorter than 10 seconds, with 20-30 seconds 
being most desirable.
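
One way to stay above that threshold is to batch several short units of 
work into a single job so the per-job overhead is amortized; a sketch 
(the batch size of 5 and the AnalyzeBatchWorker name are assumptions):

    # Batch several short analyses into one job to amortize the
    # per-job overhead; AnalyzeBatchWorker is a hypothetical name.
    require 'sidekiq'

    class AnalyzeBatchWorker
      include Sidekiq::Worker

      def perform(site_urls)
        site_urls.each { |url| SiteAnalyzer.new(url).run }
      end
    end

    # Enqueue in slices of 5 sites per job:
    # sites.each_slice(5) { |batch| AnalyzeBatchWorker.perform_async(batch) }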

In summary, while Sidekiq is certainly not the only good solution, I would 
recommend considering it for your project if your needs are not well met by 
an event architecture.

Hope this helps.
 
On Oct 10, 2012, at 4:13 AM, [email protected] wrote:

> web crawler throughput with background jobs?
> Craayzie <[email protected]> Oct 10 12:41AM -0700  
> 
> Hi everyone, I've written a site analyzer (it crawls the site) and am 
> trying to figure out the best way to deploy it on Heroku.
>  
> The analyzer will have a never ending list of sites to analyze and I want 
> to maximize throughput while minimizing costs.
>  
> The worst case scenario is each site analysis is processed by the worker 
> one at a time. To scale I increase workers.
>  
> Is there a smarter/more efficient way?
>  
> For example, could I use EventMachine within the background job to analyze 
> multiple sites in 'parallel'? Or could I use Unicorn within the background 
> job to achieve the same kind of desired parallelism?
>  
> Thanks for any replies in advance!
>  
