What seems to be emerging here is a pattern for another special node
associated with a Hadoop cluster.

The need is to have a machine which can:
   * handle setup and shutdown of a Hadoop cluster on remote server resources
   * manage loading and retrieving data via a storage grid
   * coordinate and synchronize events via a message broker
   * capture telemetry (logging, exceptions, job/task stats) from the
remote cluster

On our team, one of the engineers named it a CloudController, as
distinct from JobTracker and NameNode.
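
To make that concrete, here is a rough sketch of what such a node's
interface might look like -- illustrative only, not our production
code; the class and method names are placeholders:

    # Sketch of the CloudController responsibilities listed above.
    # Names are placeholders, not our actual scripts.

    class CloudController(object):
        """Coordinates a remote Hadoop cluster, as distinct from
        the JobTracker and NameNode."""

        def __init__(self, cluster_name, num_nodes):
            self.cluster_name = cluster_name
            self.num_nodes = num_nodes

        def launch_cluster(self):
            """Lease remote server resources and start Hadoop daemons."""
            raise NotImplementedError

        def shutdown_cluster(self):
            """Release the leased resources when the batch window closes."""
            raise NotImplementedError

        def load_data(self, local_paths, bucket):
            """Push input data to the storage grid (e.g. S3)."""
            raise NotImplementedError

        def retrieve_results(self, bucket, local_path):
            """Pull job output back from the storage grid."""
            raise NotImplementedError

        def publish_event(self, queue_name, message):
            """Synchronize workflow steps via a message broker (e.g. SQS)."""
            raise NotImplementedError

        def collect_telemetry(self):
            """Capture logs, exceptions, and job/task stats from the
            remote cluster."""
            raise NotImplementedError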

In the discussion here, the CloudController pattern derives from
services provided by AWS. However, it could just as easily be mapped
to other elastic services for servers / storage buckets / message
queues -- based on other vendors, company data centers, etc.

This pattern has come up in several discussions I've had with other
companies making heavy use of Hadoop. We're generally trying to
address these issues:
   * long-running batch jobs and how to manage complex workflows for them
   * managing trade-offs between using a cloud provider (AWS,
Flexiscale, AppNexus, etc.) and using company data centers
   * managing trade-offs among cluster size, batch window time, and
total cost (a rough illustration follows)
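
As a back-of-the-envelope illustration of that last trade-off -- the
numbers below are hypothetical, not measurements from any real
workload:

    # Hypothetical numbers, purely to illustrate the cluster size vs.
    # batch window vs. total cost trade-off; real jobs rarely scale
    # this cleanly, and EC2 bills by whole instance-hours.

    node_hourly_rate = 0.10     # USD per node-hour (illustrative)
    total_node_hours = 400.0    # work in the job, if it scaled perfectly
    launch_overhead = 0.25      # hours lost per run to launch/teardown

    for num_nodes in (10, 25, 50, 100):
        batch_window = total_node_hours / num_nodes + launch_overhead
        total_cost = num_nodes * batch_window * node_hourly_rate
        print("%3d nodes: %6.2f hour window, $%6.2f" %
              (num_nodes, batch_window, total_cost))

As the cluster grows, the window gains get smaller while the overhead
cost keeps climbing -- that is the curve we end up tuning against.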

Our team chose to implement this functionality using Python scripts --
replacing the shell scripts. That makes it easier to handle the many
potential exceptions that arise when leasing remote elastic resources.
FWIW, we are also moving these CloudController scripts to run under
RightScale, to manage AWS resources more effectively -- especially the
logging after node failures.
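
For instance, a minimal sketch of that kind of exception handling,
assuming the boto library and a placeholder Hadoop AMI id
(illustrative only, not our actual scripts):

    # Sketch only: assumes boto, AWS credentials in the environment,
    # and a placeholder AMI id for a Hadoop image.
    import time
    import boto
    from boto.exception import EC2ResponseError

    HADOOP_AMI = "ami-xxxxxxxx"   # placeholder

    def lease_cluster(num_nodes):
        """Launch EC2 instances for a cluster, cleaning up on failure."""
        conn = boto.connect_ec2()
        try:
            reservation = conn.run_instances(
                HADOOP_AMI, min_count=num_nodes, max_count=num_nodes,
                instance_type="m1.small")
        except EC2ResponseError as err:
            # a failed lease is an expected condition, not a crash
            print("could not lease instances: %s" % err)
            return []

        instances = reservation.instances
        try:
            # wait for every node to leave the 'pending' state
            while any(i.state == "pending" for i in instances):
                time.sleep(15)
                for i in instances:
                    i.update()
            return instances
        except Exception:
            # release whatever we leased if the launch goes sideways
            for i in instances:
                i.terminate()
            raise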

What do you think of having an optional CloudController added to the
definition of a Hadoop cluster?

Paco

On Thu, Oct 23, 2008 at 10:52 AM, Chris K Wensel <[EMAIL PROTECTED]> wrote:
> Hey Stuart
>
> I did that for a client using Cascading events and SQS.
>
> When jobs completed, they dropped a message on SQS where a listener picked
> up new jobs and ran with them, or decided to kill off the cluster. The
> currently shipping EC2 scripts are suitable for having multiple simultaneous
> clusters for this purpose.
>
> Cascading has always supported raw file access on S3, and now Hadoop does
> too (thanks Tom), so this is quite natural. This is the best approach, as
> data is pulled directly into the Mapper instead of being copied onto HDFS
> first and then read into the Mapper from HDFS.
>
> YMMV
>
> chris
>
> On Oct 23, 2008, at 7:47 AM, Stuart Sierra wrote:
>
>> Hi folks,
>> Anybody tried scripting Hadoop on EC2 to...
>> 1. Launch a cluster
>> 2. Pull data from S3
>> 3. Run a job
>> 4. Copy results to S3
>> 5. Terminate the cluster
>> ... without any user interaction?
>>
>> -Stuart
>
> --
> Chris K Wensel
> [EMAIL PROTECTED]
> http://chris.wensel.net/
> http://www.cascading.org/
>
>
