Doug,

At Powerset we have used EC2 and Hadoop with a large number of nodes,
successfully running Map/Reduce computations and HDFS. Pretty much like you
describe, we use HDFS for intermediate results and caching, and periodically
extract data to our local network. We are not really using S3 at the moment
for persistent storage.

A nice feature of Hadoop in our use of EC2 has been the ability to fluidly
change the number of instances that are part of the cluster. Our instances
are set up to join the cluster and the DFS as soon as they are activated,
and when - for any reason - we lose those machines, the overall process
doesn't suffer. We have been quite happy with this, even with a significant
number of instances.

As a byproduct of running these experiments, we have implemented some
patches to Hadoop so that nodes can report to the master IPs and hostnames
different from the defaults returned by InetAddress's static getLocalHost()
method. This is because machines are sometimes assigned to specific networks
to deal with firewalls in various ways, so we may want to report IPs to the
JobTracker from a different interface, in order for the tracker to be able
to contact them back. In our model the interface is specified as part of the
configuration parameters.
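
To give an idea of what the patch does, here is a minimal sketch; the
property name "mapred.report.interface" and the helper class below are just
illustrative, not the exact code in our patch:

    import java.net.InetAddress;
    import java.net.NetworkInterface;
    import java.net.SocketException;
    import java.net.UnknownHostException;
    import java.util.Enumeration;

    import org.apache.hadoop.conf.Configuration;

    public class ReportAddress {

      // Returns the address a node should report to the JobTracker.
      // If the (illustrative) "mapred.report.interface" property is set,
      // the first address bound to that interface is used; otherwise we
      // fall back to the usual InetAddress.getLocalHost().
      public static InetAddress getReportAddress(Configuration conf)
          throws SocketException, UnknownHostException {
        String ifaceName = conf.get("mapred.report.interface");
        if (ifaceName != null) {
          NetworkInterface iface = NetworkInterface.getByName(ifaceName);
          if (iface != null) {
            Enumeration addrs = iface.getInetAddresses();
            if (addrs.hasMoreElements()) {
              return (InetAddress) addrs.nextElement();
            }
          }
        }
        return InetAddress.getLocalHost();
      }
    }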

Is this something that the Hadoop project would be interested in
incorporating?

Lorenzo Thione



On 8/25/06 3:39 PM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:

> Has anyone tried running Hadoop on the Amazon Elastic Compute Cloud yet?
> 
> http://www.amazon.com/gp/browse.html?node=201590011
> 
> One way to use Hadoop on this would be to:
> 
> 1. Allocate a pool of machines.
> 2. Start Hadoop daemons.
> 3. Load the HDFS filesystem with input from Amazon S3.
> 4. Run a series of MapReduce computations.
> 5. Copy the final output from HDFS back to Amazon S3.
> 6. Deallocate the machines.
> 
> Steps (3) and (5) could be eliminated if a Hadoop FileSystem were
> implemented on S3, so that input and output could be accessed directly
> from S3.  (One might still use HDFS for intermediate data, as it should
> be faster.)
> 
> The prices seem very reasonable.  100 nodes for 10 hours costs $100.
> Storing a terabyte on S3 costs $150/month.  Transferring a terabyte of
> offsite data (e.g., fetching 100M pages) costs $200.  So someone could
> use Nutch to keep a 100M page crawl, refreshed monthly, for around
> $500/month.  Such a crawl could be shared with other organizations who
> would themselves pay, a la carte, for their computations over it.
> 
> If anyone tries Hadoop on EC2, please tell us how it goes.
> 
> Doug
