Re: How to manage a large cluster?
Paco NATHAN wrote: We use an EC2 image onto which we install Java, Ant, Hadoop, etc. To make it simple, pull those from S3 buckets. That provides a flexible pattern for managing the frameworks involved, more so than needing to re-do an EC2 image whenever you want to add a patch to Hadoop. Given that approach, you can add your Hadoop application code similarly. Just upload the current stable build out of SVN, Git, whatever, to an S3 bucket.

Nice. Your CI tool could upload the latest release tagged as good and the machines could pull it down.

The goal of cluster management is to make the addition/removal of an extra node an O(1) problem; you edit one entry in one place to increment or decrement the number of machines, and that's it. If you find you have lots of images to keep alive, then your costs go up. Keep the number of images you have to 1 and you will stay in control.

We use a set of Python scripts to manage a daily, (mostly) automated launch of 100+ EC2 nodes for a Hadoop cluster. We also run a listener on a local server, so that the Hadoop job can send notification when it completes, and allow the local server to initiate download of results. Overall, that minimizes the need for having a sysadmin dedicated to the Hadoop jobs -- a small dev team can handle it, while focusing on algorithm development and testing.

1. We have some components that use Google Talk to relay messages to local boxes behind the firewall. I could imagine hooking up Hadoop status events to that too.

2. There's an old paper of mine, Making Web Services that Work, in which I talk about deployment-centric development: http://www.hpl.hp.com/techreports/2002/HPL-2002-274.html The idea is that right from the outset, the dev team work on a cluster that resembles production, the CI server builds to it automatically, and changes get pushed out to production semi-automatically (you tag the version you want pushed out in SVN, the CI server does the release).
The article is focused on services exported to third parties, not back end stuff, so it may not all apply to hadoop deployments. -steve
Re: How to manage a large cluster?
Thanks, Steve - Another flexible approach to handling messages across firewalls, between the jobtracker and worker nodes, etc., would be to place an AMQP message broker on the jobtracker and another inside our local network. We're experimenting with RabbitMQ for that.

On Tue, Sep 16, 2008 at 4:03 AM, Steve Loughran [EMAIL PROTECTED] wrote:

We use a set of Python scripts to manage a daily, (mostly) automated launch of 100+ EC2 nodes for a Hadoop cluster. We also run a listener on a local server, so that the Hadoop job can send notification when it completes, and allow the local server to initiate download of results. Overall, that minimizes the need for having a sysadmin dedicated to the Hadoop jobs -- a small dev team can handle it, while focusing on algorithm development and testing.

1. We have some components that use Google Talk to relay messages to local boxes behind the firewall. I could imagine hooking up Hadoop status events to that too.

2. There's an old paper of mine, Making Web Services that Work, in which I talk about deployment-centric development: http://www.hpl.hp.com/techreports/2002/HPL-2002-274.html The idea is that right from the outset, the dev team work on a cluster that resembles production, the CI server builds to it automatically, and changes get pushed out to production semi-automatically (you tag the version you want pushed out in SVN, the CI server does the release). The article is focused on services exported to third parties, not back-end stuff, so it may not all apply to Hadoop deployments.

-steve
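A rough sketch of that broker pattern. This is not RabbitMQ or AMQP client code; here a stdlib queue.Queue stands in for the broker so only the publish/consume shape of the idea is visible, and the routing key and message bodies are made up:

```python
import queue

# queue.Queue stands in for the AMQP broker: the jobtracker side publishes
# status messages toward it, and a consumer inside the local network drains
# them. A real deployment would use an AMQP client talking to RabbitMQ.

broker = queue.Queue()  # stand-in for an exchange/queue pair on the broker

def publish(channel, routing_key, body):
    """Jobtracker side: push a status message toward the local network."""
    channel.put((routing_key, body))

def consume(channel):
    """Local side: drain whatever status messages have arrived so far."""
    messages = []
    while not channel.empty():
        messages.append(channel.get())
    return messages

publish(broker, "hadoop.status", "job_0001 RUNNING map 45%")
publish(broker, "hadoop.status", "job_0001 SUCCEEDED")
for key, body in consume(broker):
    print(key, body)
```

Because both ends only ever talk to the broker, neither side needs to accept inbound connections through the firewall, which is the point of placing one broker on the jobtracker and one inside the local network.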
Re: How to manage a large cluster?
Sorry, but I can't open it: http://wiki.smartfrog.org/wiki/display/sf/Patterns+of+Hadoop+Deployment

2008/9/13 Steve Loughran [EMAIL PROTECTED]

James Moore wrote: On Thu, Sep 11, 2008 at 5:46 AM, Allen Wittenauer [EMAIL PROTECTED] wrote: On 9/11/08 2:39 AM, Alex Loddengaard [EMAIL PROTECTED] wrote: I've never dealt with a large cluster, though I'd imagine it is managed the same way as small clusters: Maybe. :)

Depends how often you like to be paged, doesn't it :)

Instead, use a real system configuration management package such as bcfg2, smartfrog, puppet, cfengine, etc. [Steve, you owe me for the plug. :) ]

Yes Allen, I owe you beer at the next ApacheCon we are both at. Actually, I think Y! were one of the sponsors at the UK event, so we owe you for that too.

Or on EC2 and its competitors, just build a new image whenever you need to update Hadoop itself.

1. It's still good to have as much automation of your image build as you can; if you can build new machine images on demand you can have fun/make a mess of things. Look at http://instalinux.com to see the web GUI for creating Linux images on demand that is used inside HP.

2. When you try and bring up everything from scratch, you have a choreography problem. DNS needs to be up early, and your authentication system, the management tools, then the other parts of the system. If you have a project where Hadoop is integrated with the front-end site, for example, your app servers have to stay offline until HDFS is live. So it does get complex.

3. The Hadoop nodes are good here in that you aren't required to bring up the namenode first; the datanodes will wait; same for the tasktrackers and jobtracker. But if you, say, need to point everything at a new hostname for the namenode, well, that's a config change that needs to be pushed out, somehow.

I'm adding some stuff on different ways to deploy Hadoop here: http://wiki.smartfrog.org/wiki/display/sf/Patterns+of+Hadoop+Deployment

-steve

-- Sorry for my English!
明 Please help me to correct my English expression and errors in syntax
Re: How to manage a large cluster?
We use an EC2 image onto which we install Java, Ant, Hadoop, etc. To make it simple, pull those from S3 buckets. That provides a flexible pattern for managing the frameworks involved, more so than needing to re-do an EC2 image whenever you want to add a patch to Hadoop. Given that approach, you can add your Hadoop application code similarly. Just upload the current stable build out of SVN, Git, whatever, to an S3 bucket.

We use a set of Python scripts to manage a daily, (mostly) automated launch of 100+ EC2 nodes for a Hadoop cluster. We also run a listener on a local server, so that the Hadoop job can send notification when it completes, and allow the local server to initiate download of results. Overall, that minimizes the need for having a sysadmin dedicated to the Hadoop jobs -- a small dev team can handle it, while focusing on algorithm development and testing.

Or on EC2 and its competitors, just build a new image whenever you need to update Hadoop itself.
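The listener-and-notify idea above can be sketched in a few lines. This is a hypothetical minimal version, not the actual scripts: the port, message format, and job id are made up, and a real deployment would add authentication and error handling before triggering any download:

```python
import socket
import threading

def make_listener(port):
    """Bind a listening socket on the local server before any job can notify."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    srv.listen(1)
    return srv

def serve_once(srv, on_complete):
    """Accept one connection, read the completion message, fire the callback
    (which is where the results download would be kicked off)."""
    conn, _ = srv.accept()
    msg = conn.recv(1024).decode().strip()
    conn.close()
    srv.close()
    on_complete(msg)

def notify(port, job_id):
    """What the cluster side runs as the job's final step."""
    with socket.create_connection(("127.0.0.1", port)) as s:
        s.sendall(("DONE " + job_id + "\n").encode())

if __name__ == "__main__":
    results = []
    srv = make_listener(9099)          # bind before starting, so notify can't race it
    t = threading.Thread(target=serve_once, args=(srv, results.append))
    t.start()
    notify(9099, "job_0001")           # in production this runs on the cluster side
    t.join()
    print(results[0])                  # DONE job_0001
```

The key property is that the connection originates on the cluster side, so the local network only needs one inbound port open (or a tunnel) rather than polling the cluster.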
Re: How to manage a large cluster?
er... is that:

Set up a DNS server, and use hostnames instead of raw IPs?
Configure all nodes in the slaves file, and put this file only on the namenode and secondary namenode to prevent accidents?
Use a real system configuration management package to sync software on all nodes of the cluster?

Thanks to all, and Alex Loddengaard's creation of the wiki is commendable -- everyone should enrich it.

2008/9/12 Alex Loddengaard [EMAIL PROTECTED]

My inexperience has been revealed ;). I've taken your comments, James and Allen, and added them to the wiki: http://wiki.apache.org/hadoop/LargeClusterTips Alex

On Fri, Sep 12, 2008 at 2:01 AM, James Moore [EMAIL PROTECTED] wrote:

On Thu, Sep 11, 2008 at 5:46 AM, Allen Wittenauer [EMAIL PROTECTED] wrote: On 9/11/08 2:39 AM, Alex Loddengaard [EMAIL PROTECTED] wrote: I've never dealt with a large cluster, though I'd imagine it is managed the same way as small clusters: Maybe. :)

Add me to the maybe :) column. In my experience, large rarely turns out the same as small. What usually happens is that the developers build the small thing, keeping in mind that good sysadmins are going to need to do some work to turn it into the large thing. (Never underestimate the value of good system admin people. Speaking as a developer, good sysadmins will almost always know something about large that you haven't thought about.) I think what I was doing on a small cluster (100 machines) would take some modifications to scale up.

-Use hostnames or ips, whichever is more convenient for you

Use hostnames. Seriously. Who are you people using raw IPs for things? :) Besides, you're going to need it for the eventual support of Kerberos.

I suspect lots of people buy arrays by the hour from Amazon, so you're going to have a different batch of IP addresses every $WHATEVER_PERIOD. Not having to worry about dynamic DNS is probably interesting to someone. (Our plan was to spin up an array of 100 or so servers every N days, work for a few hours, then shut down.) Dynamic DNS sounded like a pain to me only because I'm a really bad system administrator - it may be that it's worth it (or trivial).

Instead, use a real system configuration management package such as bcfg2, smartfrog, puppet, cfengine, etc. [Steve, you owe me for the plug. :) ]

Or on EC2 and its competitors, just build a new image whenever you need to update Hadoop itself.

-- James Moore | [EMAIL PROTECTED] Ruby and Ruby on Rails consulting blog.restphone.com

-- Sorry for my English!
明 Please help me to correct my English expression and errors in syntax
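James's scenario -- a different batch of addresses every run -- suggests generating the slaves file from whatever the cloud API reports at launch time rather than maintaining it by hand. A hypothetical sketch (the instance records and field names are made up, not any real API's response format):

```python
import os
import tempfile

def write_slaves_file(instances, path):
    """Write one hostname per line, skipping instances that aren't running yet.
    Regenerating this file each launch keeps it in sync with the fresh batch
    of nodes, instead of tracking changing addresses by hand."""
    hosts = sorted(i["private_dns"] for i in instances if i["state"] == "running")
    with open(path, "w") as f:
        f.write("\n".join(hosts) + "\n")
    return hosts

# Made-up instance records of the sort a cloud API would return.
instances = [
    {"private_dns": "ip-10-0-0-12.ec2.internal", "state": "running"},
    {"private_dns": "ip-10-0-0-13.ec2.internal", "state": "pending"},
    {"private_dns": "ip-10-0-0-14.ec2.internal", "state": "running"},
]

path = os.path.join(tempfile.gettempdir(), "slaves")
print(write_slaves_file(instances, path))
# ['ip-10-0-0-12.ec2.internal', 'ip-10-0-0-14.ec2.internal']
```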
Re: How to manage a large cluster?
James Moore wrote: On Thu, Sep 11, 2008 at 5:46 AM, Allen Wittenauer [EMAIL PROTECTED] wrote: On 9/11/08 2:39 AM, Alex Loddengaard [EMAIL PROTECTED] wrote: I've never dealt with a large cluster, though I'd imagine it is managed the same way as small clusters: Maybe. :)

Depends how often you like to be paged, doesn't it :)

Instead, use a real system configuration management package such as bcfg2, smartfrog, puppet, cfengine, etc. [Steve, you owe me for the plug. :) ]

Yes Allen, I owe you beer at the next ApacheCon we are both at. Actually, I think Y! were one of the sponsors at the UK event, so we owe you for that too.

Or on EC2 and its competitors, just build a new image whenever you need to update Hadoop itself.

1. It's still good to have as much automation of your image build as you can; if you can build new machine images on demand you can have fun/make a mess of things. Look at http://instalinux.com to see the web GUI for creating Linux images on demand that is used inside HP.

2. When you try and bring up everything from scratch, you have a choreography problem. DNS needs to be up early, and your authentication system, the management tools, then the other parts of the system. If you have a project where Hadoop is integrated with the front-end site, for example, your app servers have to stay offline until HDFS is live. So it does get complex.

3. The Hadoop nodes are good here in that you aren't required to bring up the namenode first; the datanodes will wait; same for the tasktrackers and jobtracker. But if you, say, need to point everything at a new hostname for the namenode, well, that's a config change that needs to be pushed out, somehow.

I'm adding some stuff on different ways to deploy Hadoop here: http://wiki.smartfrog.org/wiki/display/sf/Patterns+of+Hadoop+Deployment

-steve
How to manage a large cluster?
Hi, all! How do you manage a large cluster, e.g. more than 2000 nodes? How do you configure hostnames and IPs -- use DNS? How do you configure slaves -- all in the slaves file? How do you update software on all nodes? Any practices, articles, or suggestions are appreciated! Thanks. -- Sorry for my English!
明 Please help me to correct my English expression and errors in syntax
Re: How to manage a large cluster?
I've never dealt with a large cluster, though I'd imagine it is managed the same way as small clusters:

-Use hostnames or IPs, whichever is more convenient for you
-All the slaves need to go into the slaves file
-You can update software by using bin/hadoop-daemons.sh. Something like:

#bin/hadoop-daemons.sh rsync (mastersrcpath) (localdestpath)

I created a wiki page that currently contains one tip for managing large clusters. Could others add to this wiki page? http://wiki.apache.org/hadoop/LargeClusterTips Thanks. Hope this helps!

Alex

On Thu, Sep 11, 2008 at 5:15 PM, 叶双明 [EMAIL PROTECTED] wrote: Hi, all! How do you manage a large cluster, e.g. more than 2000 nodes? How do you configure hostnames and IPs -- use DNS? How do you configure slaves, all in the slaves file? How do you update software on all nodes? Any practices, articles, or suggestions are appreciated! Thanks. -- Sorry for my English! 明 Please help me to correct my English expression and errors in syntax
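The fan-out that hadoop-daemons.sh performs amounts to running the same command over ssh on every host listed in the slaves file. A hedged sketch of just the command construction (hostnames and paths are illustrative, and nothing is actually executed here):

```python
def rsync_commands(slaves, master_src, local_dest):
    """Build the per-node invocation that a hadoop-daemons.sh-style fan-out
    would ssh out to each slave. In practice you'd hand each list to
    subprocess.run; here we only construct the commands."""
    return [["ssh", host, "rsync", "-a", master_src, local_dest]
            for host in slaves]

for cmd in rsync_commands(["node01", "node02"],
                          "master:/opt/hadoop/conf/", "/opt/hadoop/conf/"):
    print(" ".join(cmd))
# ssh node01 rsync -a master:/opt/hadoop/conf/ /opt/hadoop/conf/
# ssh node02 rsync -a master:/opt/hadoop/conf/ /opt/hadoop/conf/
```

Seeing it written out this way also makes the limitation discussed later in the thread obvious: the loop assumes every host is up and reachable.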
Re: How to manage a large cluster?
On 9/11/08 2:39 AM, Alex Loddengaard [EMAIL PROTECTED] wrote: I've never dealt with a large cluster, though I'd imagine it is managed the same way as small clusters: Maybe. :)

-Use hostnames or ips, whichever is more convenient for you

Use hostnames. Seriously. Who are you people using raw IPs for things? :) Besides, you're going to need it for the eventual support of Kerberos.

-All the slaves need to go into the slave file

We only put this file on the namenode and secondary namenode to prevent accidents.

-You can update software by using bin/hadoop-daemons.sh. Something like: #bin/hadoop-daemons.sh rsync (mastersrcpath) (localdestpath)

We don't use that because it doesn't take into consideration down nodes (and you *will* have down nodes!) or deal with nodes that are outside the grid (such as our gateways/bastion hosts, data-loading machines, etc.). Instead, use a real system configuration management package such as bcfg2, smartfrog, puppet, cfengine, etc. [Steve, you owe me for the plug. :) ]

I created a wiki page that currently contains one tip for managing large clusters. Could others add to this wiki page? http://wiki.apache.org/hadoop/LargeClusterTips

Quite a bit of what we do is covered in the latter half of http://tinyurl.com/5foamm . This is a presentation I did at ApacheCon EU this past April that included some of the behind-the-scenes of the large clusters at Y!. At some point I'll probably do an updated version that includes more adminy things (such as why we push four different types of Hadoop configurations per grid) while others talk about core Hadoop stuff.
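Allen's point about down nodes can be sketched as a push loop that records failures instead of assuming every host is reachable. This is an illustrative pattern, not any particular tool's behavior; push_fn stands in for whatever transport (ssh/rsync, a config-management agent) actually delivers the update:

```python
def push_config(nodes, push_fn):
    """Attempt a push to every node; return (succeeded, failed) lists.
    A down node is noted and skipped so the rest of the cluster still gets
    updated, and the failed list can be retried later."""
    ok, failed = [], []
    for node in nodes:
        try:
            push_fn(node)
            ok.append(node)
        except OSError:
            failed.append(node)  # down or unreachable: note it, move on
    return ok, failed

def fake_push(node):
    """Simulated transport for the sketch: one node is 'down'."""
    if node == "node02":
        raise OSError("connection refused")

ok, failed = push_config(["node01", "node02", "node03"], fake_push)
print("pushed:", ok, "retry later:", failed)
# pushed: ['node01', 'node03'] retry later: ['node02']
```

Real configuration management packages go further (convergence on reboot, reporting, nodes outside the grid), which is why the thread recommends them over a bare rsync loop.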
Re: How to manage a large cluster?
On Thu, Sep 11, 2008 at 5:46 AM, Allen Wittenauer [EMAIL PROTECTED] wrote: On 9/11/08 2:39 AM, Alex Loddengaard [EMAIL PROTECTED] wrote: I've never dealt with a large cluster, though I'd imagine it is managed the same way as small clusters: Maybe. :)

Add me to the maybe :) column. In my experience, large rarely turns out the same as small. What usually happens is that the developers build the small thing, keeping in mind that good sysadmins are going to need to do some work to turn it into the large thing. (Never underestimate the value of good system admin people. Speaking as a developer, good sysadmins will almost always know something about large that you haven't thought about.) I think what I was doing on a small cluster (100 machines) would take some modifications to scale up.

-Use hostnames or ips, whichever is more convenient for you

Use hostnames. Seriously. Who are you people using raw IPs for things? :) Besides, you're going to need it for the eventual support of Kerberos.

I suspect lots of people buy arrays by the hour from Amazon, so you're going to have a different batch of IP addresses every $WHATEVER_PERIOD. Not having to worry about dynamic DNS is probably interesting to someone. (Our plan was to spin up an array of 100 or so servers every N days, work for a few hours, then shut down.) Dynamic DNS sounded like a pain to me only because I'm a really bad system administrator - it may be that it's worth it (or trivial).

Instead, use a real system configuration management package such as bcfg2, smartfrog, puppet, cfengine, etc. [Steve, you owe me for the plug. :) ]

Or on EC2 and its competitors, just build a new image whenever you need to update Hadoop itself.

-- James Moore | [EMAIL PROTECTED] Ruby and Ruby on Rails consulting blog.restphone.com