With crawlers you'll likely spend most of your time waiting to download the HTML, not parsing it. Spawn a handful of Ruby threads that each download and parse pages (ideally from different domains) and you'll parallelize the work: less waiting, more working, and higher throughput per worker. This fits nicely within your existing model of scaling workers for more throughput; a sketch of the threaded approach follows below.
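A minimal sketch of that threaded approach, assuming the background job is handed a batch of URLs to analyze; `THREAD_COUNT`, the `parse` stub, and the queue-plus-sentinel shape are all placeholders to adapt to your analyzer:

```ruby
require "net/http"
require "uri"

# Hypothetical parse step; substitute the real site analysis here.
def parse(html)
  # ...
end

THREAD_COUNT = 10   # tune for your dyno size and the sites you crawl

def analyze(urls)
  queue = Queue.new
  urls.each { |u| queue << u }
  THREAD_COUNT.times { queue << :done }   # one sentinel per thread so each exits cleanly

  threads = Array.new(THREAD_COUNT) do
    Thread.new do
      while (url = queue.pop) != :done
        html = Net::HTTP.get(URI(url))    # thread blocks here on network I/O,
        parse(html)                       # leaving the others free to keep downloading
      end
    end
  end
  threads.each(&:join)
end
```

Plain threads work here because the bottleneck is blocking network I/O, during which MRI releases the GVL, so the downloads overlap; EventMachine would get a similar overlap with a single thread and callbacks, at the cost of restructuring the code around its event loop.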
Date: Wed, 10 Oct 2012 00:41:24 -0700
From: [email protected]
To: [email protected]
Subject: web crawler throughput with background jobs?

Hi everyone,

I've written a site analyzer (it crawls the site) and am trying to figure out the best way to deploy it on Heroku. The analyzer will have a never-ending list of sites to analyze and I want to maximize throughput while minimizing costs.

The worst case scenario is each site analysis is processed by the worker one at a time. To scale I increase workers. Is there a smarter/more efficient way? For example, could I use EventMachine within the background job to analyze multiple sites in 'parallel'? Or could I use Unicorn within the background job to achieve the same kind of desired parallelism?

Thanks for any replies in advance!
