Has anyone tried running Hadoop on the Amazon Elastic Compute Cloud yet?
http://www.amazon.com/gp/browse.html?node=201590011

One way to use Hadoop on this would be to:

1. Allocate a pool of machines.
2. Start the Hadoop daemons.
3. Load the HDFS filesystem with input from Amazon S3.
4. Run a series of MapReduce computations.
5. Copy the final output from HDFS back to Amazon S3.
6. Deallocate the machines.

Steps (3) and (5) could be eliminated if a Hadoop FileSystem were implemented on S3, so that input and output could be accessed directly from S3. (One might still use HDFS for intermediate data, as it should be faster.)

The prices seem very reasonable. 100 nodes for 10 hours costs $100. Storing a terabyte on S3 costs $150/month. Transferring a terabyte of offsite data (e.g., fetching 100M pages) costs $200. So someone could use Nutch to keep a 100M-page crawl, refreshed monthly, for around $500/month. Such a crawl could be shared with other organizations, who would themselves pay, a la carte, for their computations over it.

If anyone tries Hadoop on EC2, please tell us how it goes.

Doug
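P.S. A quick sanity check on the cost arithmetic above. The per-unit prices here are inferred from the stated totals (not quoted from Amazon's price list), and I work in integer cents to keep the arithmetic exact:

```python
# Sanity check of the EC2/S3 cost figures quoted above.
# Unit prices below are assumptions inferred from the stated totals:
#   ~$0.10 per instance-hour, ~$0.15/GB-month storage, ~$0.20/GB transfer.
EC2_CENTS_PER_INSTANCE_HOUR = 10
S3_CENTS_PER_GB_MONTH = 15
S3_CENTS_PER_GB_TRANSFER = 20
GB_PER_TB = 1000

compute_usd = 100 * 10 * EC2_CENTS_PER_INSTANCE_HOUR // 100   # 100 nodes for 10 hours
storage_usd = GB_PER_TB * S3_CENTS_PER_GB_MONTH // 100        # 1 TB stored for one month
transfer_usd = GB_PER_TB * S3_CENTS_PER_GB_TRANSFER // 100    # 1 TB fetched from offsite

print(compute_usd, storage_usd, transfer_usd)    # 100 150 200
print(compute_usd + storage_usd + transfer_usd)  # 450 -- i.e. "around $500/month"
```

With one 100-node refresh run per month, the three line items sum to about $450, consistent with the rough $500/month estimate.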