Has anyone tried running Hadoop on the Amazon Elastic Compute Cloud yet?


One way to use Hadoop on EC2 would be to:

1. Allocate a pool of machines.
2. Start Hadoop daemons.
3. Load the HDFS filesystem with input from Amazon S3.
4. Run a series of MapReduce computations.
5. Copy the final output from HDFS back to Amazon S3.
6. Deallocate the machines.
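As a rough sketch, the steps above might look something like the
following from the command line.  Everything here is hypothetical: the
AMI ID, bucket names, instance IDs, job jar, and the `s3-get` helper
are all placeholders, and the exact Hadoop script names vary by
version.  It also assumes the EC2 API tools and a Hadoop build are
already installed and configured on the nodes.

```shell
# 1. Allocate a pool of machines (AMI ID and count are placeholders).
ec2-run-instances ami-XXXXXXXX -n 100

# 2. Start the Hadoop daemons across the pool.
bin/start-all.sh

# 3. Load HDFS with input from S3.  Since Hadoop has no S3 FileSystem,
#    this uses a hypothetical standalone S3 client to stage the data.
s3-get my-crawl-bucket/input /tmp/input      # hypothetical S3 client
bin/hadoop fs -put /tmp/input /input

# 4. Run the MapReduce computations (jar and class are placeholders).
bin/hadoop jar my-job.jar MyJob /input /output

# 5. Copy the final output from HDFS back out to S3.
bin/hadoop fs -get /output /tmp/output
s3-put /tmp/output my-crawl-bucket/output    # hypothetical S3 client

# 6. Deallocate the machines (instance IDs are placeholders).
ec2-terminate-instances i-XXXXXXXX
```

Note that steps (3) and (5) are the clumsy part: without an S3-backed
Hadoop FileSystem, data has to be staged through local disk on its way
between S3 and HDFS.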

Steps (3) and (5) could be eliminated if a Hadoop FileSystem were
implemented on S3, so that input and output could be accessed directly
from S3.  (One might still use HDFS for intermediate data, as it should
be faster.)

The prices seem very reasonable.  100 nodes for 10 hours costs $100.
Storing a terabyte on S3 costs $150/month.  Transferring a terabyte of
offsite data (e.g., fetching 100M pages) costs $200.  So someone could
use Nutch to keep a 100M page crawl, refreshed monthly, for around
$500/month.  Such a crawl could be shared with other organizations who
would themselves pay, a la carte, for their computations over it.
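To make the arithmetic explicit, here is the estimate above worked out
with the quoted rates (these are the prices from this post, not
current pricing, and the crawl is assumed to fit in roughly a
terabyte):

```python
# Back-of-envelope monthly cost for a 100M-page crawl on EC2/S3,
# using the rates quoted above.
EC2_NODE_HOUR = 0.10       # $/node-hour: 100 nodes x 10 hours = $100
S3_STORAGE_TB_MONTH = 150  # $ to store one terabyte for a month
TRANSFER_TB = 200          # $ to transfer one terabyte from offsite

compute = 100 * 10 * EC2_NODE_HOUR  # one 100-node, 10-hour run
storage = 1 * S3_STORAGE_TB_MONTH   # ~1 TB of crawl data on S3
transfer = 1 * TRANSFER_TB          # refetching ~100M pages monthly

monthly = compute + storage + transfer
print(monthly)  # 450.0 -- i.e., "around $500/month"
```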

If anyone tries Hadoop on EC2, please tell us how it goes.
