What seems to be emerging here is a pattern for another special node associated with a Hadoop cluster.
The need is to have a machine which can:

 * handle setup and shutdown of a Hadoop cluster on remote server resources
 * manage loading and retrieving data via a storage grid
 * interact and synchronize events via a message broker
 * capture telemetry (logging, exceptions, job/task stats) from the remote cluster

On our team, one of the engineers named it a CloudController, as distinct from the JobTracker and NameNode.

In the discussion here, the CloudController pattern derives from services provided by AWS. However, it could just as easily be mapped to other elastic services for servers / storage buckets / message queues -- based on other vendors, company data centers, etc.

This pattern has come up in several discussions I've had with other companies making heavy use of Hadoop. We're generally trying to address these issues:

 * long-running batch jobs and how to manage complex workflows for them
 * trade-offs between using a cloud provider (AWS, Flexiscale, AppNexus, etc.) and using company data centers
 * trade-offs between cluster size, batch window time, and total cost

Our team chose to implement this functionality in Python scripts, replacing the shell scripts. That makes it easier to handle the many potential exceptions of leasing remote elastic resources. FWIW, we are also moving these CloudController scripts to run under RightScale, to manage AWS resources more effectively -- especially the logging after node failures.

What do you think of having an optional CloudController added to the definition of a Hadoop cluster?

Paco

On Thu, Oct 23, 2008 at 10:52 AM, Chris K Wensel <[EMAIL PROTECTED]> wrote:
> Hey Stuart
>
> I did that for a client using Cascading events and SQS.
>
> When jobs completed, they dropped a message on SQS where a listener picked
> up new jobs and ran with them, or decided to kill off the cluster. The
> currently shipping EC2 scripts are suitable for having multiple simultaneous
> clusters for this purpose.
>
> Cascading has always supported, and now Hadoop supports (thanks Tom), raw
> file access on S3, so this is quite natural. This is the best approach, as
> data is pulled directly into the Mapper, instead of onto HDFS first and
> then read into the Mapper from HDFS.
>
> YMMV
>
> chris
>
> On Oct 23, 2008, at 7:47 AM, Stuart Sierra wrote:
>
>> Hi folks,
>> Anybody tried scripting Hadoop on EC2 to...
>> 1. Launch a cluster
>> 2. Pull data from S3
>> 3. Run a job
>> 4. Copy results to S3
>> 5. Terminate the cluster
>> ... without any user interaction?
>>
>> -Stuart
>
> --
> Chris K Wensel
> [EMAIL PROTECTED]
> http://chris.wensel.net/
> http://www.cascading.org/
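
[Editor's note: Stuart's five unattended steps, plus the exception handling Paco describes for leased elastic resources, could be sketched in Python roughly as below. This is a hypothetical illustration, not the team's actual CloudController scripts: the `hadoop-ec2` and `distcp` command names mirror the contrib/ec2 helper scripts and Hadoop's S3 access, but the bucket name, paths, jar, and the `execute` callback are made-up placeholders.]

```python
def workflow_commands(cluster, slaves, jar):
    """Ordered shell commands for one unattended batch run.

    Bucket, paths, and jar name are illustrative placeholders.
    """
    return [
        ["hadoop-ec2", "launch-cluster", cluster, str(slaves)],    # 1. launch cluster
        ["hadoop", "distcp", "s3n://my-bucket/input", "/input"],   # 2. pull data from S3
        ["hadoop", "jar", jar, "/input", "/output"],               # 3. run the job
        ["hadoop", "distcp", "/output", "s3n://my-bucket/output"], # 4. copy results to S3
    ]

def run_unattended(cluster, slaves, jar, execute):
    """Run the batch; always release the leased cluster, even on failure.

    `execute` is any callable that runs one command list, e.g. a
    subprocess wrapper -- kept abstract here so the control flow is clear.
    """
    try:
        for cmd in workflow_commands(cluster, slaves, jar):
            execute(cmd)
    finally:
        # 5. terminate -- runs whether the job succeeded or raised,
        # so a failed job never leaves EC2 instances running.
        execute(["hadoop-ec2", "terminate-cluster", cluster])
```

The try/finally is the part that matters for leased resources: any exception from launching, copying, or the job itself still propagates, but only after the cluster is torn down.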