Re: How to manage a large cluster?

2008-09-16 Thread Steve Loughran

Paco NATHAN wrote:

We use an EC2 image onto which we install Java, Ant, Hadoop, etc. To
make it simple, pull those from S3 buckets. That provides a flexible
pattern for managing the frameworks involved, more so than needing to
re-do an EC2 image whenever you want to add a patch to Hadoop.

Given that approach, you can add your Hadoop application code
similarly. Just upload the current stable build out of SVN, Git,
whatever, to an S3 bucket.


Nice. Your CI tool could upload the latest release tagged as good and 
the machines could pull it down.
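
A sketch of the upload half, with invented bucket and key names, and assuming
the boto library purely for illustration:

  # Hypothetical CI post-build step: publish the blessed build to S3.
  # Bucket and key names are made up.
  import boto

  conn = boto.connect_s3()          # credentials come from the environment
  bucket = conn.get_bucket("my-hadoop-releases")
  key = bucket.new_key("myapp-latest.tar.gz")
  key.set_contents_from_filename("build/myapp.tar.gz")

Promoting a build is then just overwriting one key; the worker nodes never
need to know the CI server exists.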


The goal of cluster management is to make the addition/removal of an 
extra node an O(1) problem; you edit one entry in one place to increment 
or decrement the number of machines, and that's it.


If you find you have lots of images to keep alive, then your costs go 
up. Keep the # of images you have to 1 and you will stay in control.




We use a set of Python scripts to manage a daily, (mostly) automated
launch of 100+ EC2 nodes for a Hadoop cluster.  We also run a listener
on a local server, so that the Hadoop job can send notification when
it completes, and allow the local server to initiate download of
results.  Overall, that minimizes the need for having a sysadmin
dedicated to the Hadoop jobs -- a small dev team can handle it, while
focusing on algorithm development and testing.


1. We have some components that use google talk to relay messages to 
local boxes behind the firewall. I could imagine hooking up hadoop 
status events to that too.


2. There's an old paper of mine, Making Web Services that Work, in 
which I talk about deployment centric development:

http://www.hpl.hp.com/techreports/2002/HPL-2002-274.html

The idea is that right from the outset, the dev team work on a cluster 
that resembles production, the CI server builds to it automatically, 
changes get pushed out to production semi-automatically (you tag the 
version you want pushed out in SVN, the CI server does the release). The 
article is focused on services exported to third parties, not back end 
stuff, so it may not all apply to hadoop deployments.


-steve





Re: How to manage a large cluster?

2008-09-16 Thread Paco NATHAN
Thanks, Steve -

Another flexible approach to handling messages across firewalls,
between the jobtracker and worker nodes, etc., would be to place an AMQP message
broker on the jobtracker and another inside our local network.  We're
experimenting with RabbitMQ for that.
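
Purely as a sketch, the worker side could be as small as this -- the pika
client, broker host, and queue name are illustrative assumptions, not what
we actually run:

  # Publish a job-status message to an AMQP broker on the jobtracker.
  # Broker host and queue name are hypothetical.
  import pika

  conn = pika.BlockingConnection(
      pika.ConnectionParameters(host="jobtracker.example.com"))
  channel = conn.channel()
  channel.queue_declare(queue="hadoop-status", durable=True)
  channel.basic_publish(exchange="",
                        routing_key="hadoop-status",
                        body="job_200809160001 COMPLETED")
  conn.close()

The broker inside the local network would then consume from that queue, so
nothing has to connect inbound through the firewall.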


On Tue, Sep 16, 2008 at 4:03 AM, Steve Loughran [EMAIL PROTECTED] wrote:

 We use a set of Python scripts to manage a daily, (mostly) automated
 launch of 100+ EC2 nodes for a Hadoop cluster.  We also run a listener
 on a local server, so that the Hadoop job can send notification when
 it completes, and allow the local server to initiate download of
 results.  Overall, that minimizes the need for having a sysadmin
 dedicated to the Hadoop jobs -- a small dev team can handle it, while
 focusing on algorithm development and testing.

 1. We have some components that use google talk to relay messages to local
 boxes behind the firewall. I could imagine hooking up hadoop status events
 to that too.

 2. There's an old paper of mine, Making Web Services that Work, in which I
 talk about deployment centric development:
 http://www.hpl.hp.com/techreports/2002/HPL-2002-274.html

 The idea is that right from the outset, the dev team work on a cluster that
 resembles production, the CI server builds to it automatically, changes get
 pushed out to production semi-automatically (you tag the version you want
 pushed out in SVN, the CI server does the release). The article is focused
 on services exported to third parties, not back end stuff, so it may not all
 apply to hadoop deployments.

 -steve


Re: How to manage a large cluster?

2008-09-15 Thread 叶双明
Sorry, but I can't open it:
http://wiki.smartfrog.org/wiki/display/sf/Patterns+of+Hadoop+Deployment

2008/9/13 Steve Loughran [EMAIL PROTECTED]

 James Moore wrote:

 On Thu, Sep 11, 2008 at 5:46 AM, Allen Wittenauer [EMAIL PROTECTED]
 wrote:

 On 9/11/08 2:39 AM, Alex Loddengaard [EMAIL PROTECTED] wrote:

 I've never dealt with a large cluster, though I'd imagine it is managed the
 same way as small clusters:

   Maybe. :)


 Depends how often you like to be paged, doesn't it :)


 Instead, use a real system configuration management package such as
 bcfg2, smartfrog, puppet, cfengine, etc.  [Steve, you owe me for the plug. :) ]


 Yes Allen, I owe you beer at the next apachecon we are both at.
 Actually, I think Y! were one of the sponsors at the UK event, so we owe
 you for that too.


  Or on EC2 and its competitors, just build a new image whenever you
 need to update Hadoop itself.



 1. It's still good to have as much automation of your image build as you
 can; if you can build new machine images on demand you can have fun/make a
 mess of things. Look at http://instalinux.com to see the web GUI for
 creating linux images on demand that is used inside HP.

 2. When you try and bring up everything from scratch, you have a
 choreography problem. DNS needs to be up early, and your authentication
 system, the management tools, then the other parts of the system. If you
 have a project where hadoop is integrated with the front end site, for
 example, your app servers have to stay offline until HDFS is live. So it
 does get complex.

 3. The Hadoop nodes are good here in that you aren't required to bring up
 the namenode first; the datanodes will wait; same for the task trackers and
 job tracker. But if you, say, need to point everything at a new hostname for
 the namenode, well, that's a config change that needs to be pushed out,
 somehow.



 I'm adding some stuff on different ways to deploy hadoop here:

 http://wiki.smartfrog.org/wiki/display/sf/Patterns+of+Hadoop+Deployment

 -steve




-- 
Sorry for my English! 明
Please help me correct my English expression and syntax errors


Re: How to manage a large cluster?

2008-09-15 Thread Paco NATHAN
We use an EC2 image onto which we install Java, Ant, Hadoop, etc. To
make it simple, pull those from S3 buckets. That provides a flexible
pattern for managing the frameworks involved, more so than needing to
re-do an EC2 image whenever you want to add a patch to Hadoop.

Given that approach, you can add your Hadoop application code
similarly. Just upload the current stable build out of SVN, Git,
whatever, to an S3 bucket.
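
Roughly, the bootstrap pull amounts to something like this -- bucket and key
names are invented, and it assumes boto is already baked into the image:

  # Instance bootstrap sketch: pull framework tarballs from S3 and unpack.
  # Bucket and key names are hypothetical.
  import tarfile
  import boto

  conn = boto.connect_s3()
  bucket = conn.get_bucket("my-cluster-frameworks")
  for name in ("jdk.tar.gz", "ant.tar.gz", "hadoop.tar.gz"):
      local = "/tmp/" + name
      bucket.get_key(name).get_contents_to_filename(local)
      tarfile.open(local, "r:gz").extractall("/usr/local")

Patching Hadoop then means replacing one tarball in a bucket rather than
burning a new image.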

We use a set of Python scripts to manage a daily, (mostly) automated
launch of 100+ EC2 nodes for a Hadoop cluster.  We also run a listener
on a local server, so that the Hadoop job can send notification when
it completes, and allow the local server to initiate download of
results.  Overall, that minimizes the need for having a sysadmin
dedicated to the Hadoop jobs -- a small dev team can handle it, while
focusing on algorithm development and testing.
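
The launch step itself is not much code. A stripped-down sketch, with
placeholder AMI id, key pair, and instance type (the real scripts do
considerably more, including the notification listener):

  # Daily cluster launch sketch with boto; all identifiers are placeholders.
  import boto

  ec2 = boto.connect_ec2()
  res = ec2.run_instances(image_id="ami-12345678",
                          min_count=100, max_count=100,
                          key_name="hadoop-keypair",
                          instance_type="m1.large")
  print "launched %d instances" % len(res.instances)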


  Or on EC2 and its competitors, just build a new image whenever you
 need to update Hadoop itself.


Re: How to manage a large cluster?

2008-09-12 Thread 叶双明
Er... so, is it:

Set up a DNS server, and use hostnames instead of raw IPs?

List all nodes in the slaves file, and put this file only on the namenode and
secondary namenode to prevent accidents?

Use a real system configuration management package to sync software across
all nodes of the cluster?

Thanks, everyone. Alex Loddengaard's creation of the wiki page is commendable;
everyone should help enrich it.

2008/9/12 Alex Loddengaard [EMAIL PROTECTED]

 My inexperience has been revealed ;).  I've taken your comments, James and
 Allen, and added them to the wiki:

 http://wiki.apache.org/hadoop/LargeClusterTips

 Alex

 On Fri, Sep 12, 2008 at 2:01 AM, James Moore [EMAIL PROTECTED]
 wrote:

  On Thu, Sep 11, 2008 at 5:46 AM, Allen Wittenauer [EMAIL PROTECTED]
  wrote:
   On 9/11/08 2:39 AM, Alex Loddengaard [EMAIL PROTECTED] wrote:
   I've never dealt with a large cluster, though I'd imagine it is managed
   the same way as small clusters:
  
  Maybe. :)
 
  Add me to the maybe :) column.  In my experience, large rarely turns
  out the same as small.
 
  What usually happens is that the developers build the small thing,
  keeping in mind that good sysadmins are going to need to do some work to
  turn it into the large thing.  (Never underestimate the value of good
  system admin people.  Speaking as a developer, good sysadmins will
  almost always know something about large that you haven't thought
  about.)
 
  I think what I was doing on a small cluster (100 machines) would take
  some modifications to scale up.
 
   -Use hostnames or ips, whichever is more convenient for you
  
  Use hostnames.  Seriously.  Who are you people using raw IPs for
  things?
   :)  Besides, you're going to need it for the eventual support of
  Kerberos.
 
  I suspect lots of people buy arrays by the hour from Amazon, so you're
  going to have a different batch of IP addresses every
  $WHATEVER_PERIOD.  Not having to worry about dynamic dns is probably
  interesting to someone.  (Our plan was to spin up an array of 100 or
  so servers every N days, work for a few hours, then shut down.)
 
  Dynamic DNS sounded like a pain to me only because I'm a really bad
  system administrator - it may be that it's worth it (or trivial).
 
  Instead, use a real system configuration management package such as
   bcfg2, smartfrog, puppet, cfengine, etc.  [Steve, you owe me for the plug. :) ]
 
  Or on EC2 and its competitors, just build a new image whenever you
  need to update Hadoop itself.
 
  --
  James Moore | [EMAIL PROTECTED]
  Ruby and Ruby on Rails consulting
  blog.restphone.com
 




-- 
Sorry for my English! 明
Please help me correct my English expression and syntax errors


Re: How to manage a large cluster?

2008-09-12 Thread Steve Loughran

James Moore wrote:

On Thu, Sep 11, 2008 at 5:46 AM, Allen Wittenauer [EMAIL PROTECTED] wrote:

On 9/11/08 2:39 AM, Alex Loddengaard [EMAIL PROTECTED] wrote:

I've never dealt with a large cluster, though I'd imagine it is managed the
same way as small clusters:

   Maybe. :)


Depends how often you like to be paged, doesn't it :)




   Instead, use a real system configuration management package such as
bcfg2, smartfrog, puppet, cfengine, etc.  [Steve, you owe me for the plug.
:) ]


Yes Allen, I owe you beer at the next apachecon we are both at.
Actually, I think Y! were one of the sponsors at the UK event, so we owe 
you for that too.




Or on EC2 and its competitors, just build a new image whenever you
need to update Hadoop itself.



1. It's still good to have as much automation of your image build as you 
can; if you can build new machine images on demand you can have
fun/make a mess of things. Look at http://instalinux.com to see the web 
GUI for creating linux images on demand that is used inside HP.


2. When you try and bring up everything from scratch, you have a 
choreography problem. DNS needs to be up early, and your authentication 
system, the management tools, then the other parts of the system. If you 
have a project where hadoop is integrated with the front end site, for 
example, your app servers have to stay offline until HDFS is live. So
it does get complex.
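
One piece of that choreography you can at least script: block each layer on
the one below it answering its port. A sketch, with a placeholder namenode
host and port:

  # Wait until a service port answers before starting whatever depends on it.
  # Host and port are placeholders.
  import socket
  import time

  def wait_for(host, port, timeout=600):
      deadline = time.time() + timeout
      while time.time() < deadline:
          s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
          s.settimeout(5)
          try:
              s.connect((host, port))
              s.close()
              return True
          except socket.error:
              s.close()
              time.sleep(10)
      return False

  if wait_for("namenode.example.com", 9000):
      pass  # safe to bring up the app servers here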


3. The Hadoop nodes are good here in that you aren't required to bring 
up the namenode first; the datanodes will wait; same for the task 
trackers and job tracker. But if you, say, need to point everything at a 
new hostname for the namenode, well, that's a config change that needs 
to be pushed out, somehow.




I'm adding some stuff on different ways to deploy hadoop here:

http://wiki.smartfrog.org/wiki/display/sf/Patterns+of+Hadoop+Deployment

-steve


How to manage a large cluster?

2008-09-11 Thread 叶双明
Hi, all!

How do you manage a large cluster, e.g. more than 2000 nodes?
How do you configure hostnames and IPs? Do you use DNS?
How do you configure the slaves? Do they all go in the slaves file?
How do you update software on all the nodes?

Any practices, articles, or suggestions are appreciated!
Thanks.

-- 
Sorry for my English! 明
Please help me correct my English expression and syntax errors


Re: How to manage a large cluster?

2008-09-11 Thread Alex Loddengaard
I've never dealt with a large cluster, though I'd imagine it is managed the
same way as small clusters:

-Use hostnames or ips, whichever is more convenient for you
-All the slaves need to go into the slave file
-You can update software by using bin/hadoop-daemons.sh.  Something like:
#bin/hadoop-daemons.sh rsync (mastersrcpath) (localdestpath)

I created a wiki page that currently contains one tip for managing large
clusters.  Could others add to this wiki page?

http://wiki.apache.org/hadoop/LargeClusterTips

Thanks.  Hope this helps!

Alex

On Thu, Sep 11, 2008 at 5:15 PM, 叶双明 [EMAIL PROTECTED] wrote:

 Hi, all!

 How do you manage a large cluster, e.g. more than 2000 nodes?
 How do you configure hostnames and IPs? Do you use DNS?
 How do you configure the slaves? Do they all go in the slaves file?
 How do you update software on all the nodes?

 Any practices, articles, or suggestions are appreciated!
 Thanks.

 --
 Sorry for my English! 明
 Please help me correct my English expression and syntax errors



Re: How to manage a large cluster?

2008-09-11 Thread Allen Wittenauer
On 9/11/08 2:39 AM, Alex Loddengaard [EMAIL PROTECTED] wrote:
 I've never dealt with a large cluster, though I'd imagine it is managed the
 same way as small clusters:

Maybe. :)

 -Use hostnames or ips, whichever is more convenient for you

Use hostnames.  Seriously.  Who are you people using raw IPs for things?
:)  Besides, you're going to need it for the eventual support of Kerberos.

 -All the slaves need to go into the slave file

We only put this file on the namenode and 2ndary namenode to prevent
accidents.

 -You can update software by using bin/hadoop-daemons.sh.  Something like:
 #bin/hadoop-daemons.sh rsync (mastersrcpath) (localdestpath)

We don't use that because it doesn't take into consideration down nodes
(and you *will* have down nodes!) or deal with nodes that are outside the
grid (such as our gateways/bastion hosts, data loading machines, etc).

Instead, use a real system configuration management package such as
bcfg2, smartfrog, puppet, cfengine, etc.  [Steve, you owe me for the plug.
:) ]

 I created a wiki page that currently contains one tip for managing large
 clusters.  Could others add to this wiki page?
 
 http://wiki.apache.org/hadoop/LargeClusterTips

Quite a bit of what we do is covered in the latter half of
http://tinyurl.com/5foamm .  This is a presentation I did at ApacheCon EU
this past April that included some of the behind-the-scenes of the large
clusters at Y!.  At some point I'll probably do an updated version that
includes more adminy things (such as why we push four different types of
Hadoop configurations per grid) while others talk about core Hadoop stuff.



Re: How to manage a large cluster?

2008-09-11 Thread James Moore
On Thu, Sep 11, 2008 at 5:46 AM, Allen Wittenauer [EMAIL PROTECTED] wrote:
 On 9/11/08 2:39 AM, Alex Loddengaard [EMAIL PROTECTED] wrote:
 I've never dealt with a large cluster, though I'd imagine it is managed the
 same way as small clusters:

Maybe. :)

Add me to the maybe :) column.  In my experience, large rarely turns
out the same as small.

What usually happens is that the developers build the small thing,
keeping in mind that good sysadmins are going to need to do some work
to turn it into the large thing.  (Never underestimate the value of good
system admin people.  Speaking as a developer, good sysadmins will
almost always know something about large that you haven't thought
about.)

I think what I was doing on a small cluster (100 machines) would take
some modifications to scale up.

 -Use hostnames or ips, whichever is more convenient for you

Use hostnames.  Seriously.  Who are you people using raw IPs for things?
 :)  Besides, you're going to need it for the eventual support of Kerberos.

I suspect lots of people buy arrays by the hour from Amazon, so you're
going to have a different batch of IP addresses every
$WHATEVER_PERIOD.  Not having to worry about dynamic dns is probably
interesting to someone.  (Our plan was to spin up an array of 100 or
so servers every N days, work for a few hours, then shut down.)

Dynamic DNS sounded like a pain to me only because I'm a really bad
system administrator - it may be that it's worth it (or trivial).

Instead, use a real system configuration management package such as
 bcfg2, smartfrog, puppet, cfengine, etc.  [Steve, you owe me for the plug.
 :) ]

Or on EC2 and its competitors, just build a new image whenever you
need to update Hadoop itself.

-- 
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com