Hi Dan,
On 20 Oct, 2005, at 2:41 AM, Dan Podeanu wrote:
Interesting topic.
Indeed, I'm moving to a different employer and was considering a
similar setup...
I'm curious about a number of things:
What's the scale of the cluster you're using this setup on?
Would you be willing/able to share some of the work?
I'd be very interested to look at your setup before I start my own.
Any comments on the hardware stability of the nodes you're using?
Which make of blades are you using?
I was also wondering whether you are familiar with the work at
http://www.infrastructures.org/
Your setup has many of its characteristics.
Objectives:
1. Low maintenance costs: maintaining and applying patches to a
single build (Gentoo snapshots).
2. Low scalability overhead: scalability should be part of the
design; it should not take more than 10 minutes per server to
scale up.
3. Redundancy: permanent hardware failure of N-1 out of N nodes, or
temporary failure (power off) of all nodes, should allow fast
(10 minutes) recovery of all nodes in a cluster.
I read below that all nodes include configs for dhcp/tftp in order
to be able to take over from the golden (blade root) server. How do
you handle that? In case of downtime of the main blade root server,
which of the nodes gets to take over? Is that an automatic or a
manual process?
Additionally, did you test an all-node failure, and how did the
master blade root cope with the strain of all nodes booting at
once? What hardware are you using for the blade root server?
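For what it's worth, the manual takeover I had in mind for my own
setup would look roughly like this on a surviving node (a sketch
only; the config path and service names are my assumptions, not
from your setup):

    # Promote a standby node to blade root (sketch, untested):
    cp /etc/dhcp/dhcpd.conf.blade-root /etc/dhcp/dhcpd.conf  # pre-staged config
    rc-update add dhcpd default && /etc/init.d/dhcpd start
    rc-update add in.tftpd default && /etc/init.d/in.tftpd start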
Restrictions:
1. Single CPU architecture: I consider the cost of maintaining
several architectures to be greater than the cost of purchasing a
single architecture.
Are you running a full 64-bit setup or 32-bit compatibility mode?
What are your experiences with stability in the 64-bit case? I'm
especially curious about PHP and its diverse set of external libs.
I do agree, though. Any thoughts on the inevitable upgrade that
will come when your current hardware platform is no longer
available?
2. Unified packages tree: I consider the cost of maintaining
several Gentoo snapshots, just to deploy the minimum of packages
per server assigned to a specific application (mail server, web
server, etc.), to be greater than having a common build with all
packages and just starting the required services (i.e. all deployed
servers have both an MTA and Apache installed; web servers have
Apache started, and mail servers have it stopped and the MTA
running instead).
Agreed, it doesn't pay off to have separate base sets for the
different types of nodes, and it's good for redundancy: if needed,
a former web server can stand in as a database server, etc.
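To make that concrete, this is how I picture the role switch
working on Gentoo (a sketch; the service names are my assumptions):

    # Turn a node into a web server (sketch):
    rc-update add apache2 default
    rc-update del postfix default
    # ...or into a mail server:
    rc-update add postfix default
    rc-update del apache2 default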
3. An application that can act as a cluster with transparent
failover (web
with balancer and health checking, multiple MX servers, etc.)
I don't quite understand this restriction; could you elaborate on
why it's a requirement?
4. Remote storage for persistent data (like logs) helps (you will
see why); you can modify the partitioning or hard disk
configuration to maintain a stable filesystem on individual
servers.
<snipped>
Software:
One initial server (blade root) is installed with Gentoo. On top of
that, in a directory, another Gentoo is installed (the Gentoo
snapshot) that will be replicated onto the individual servers as
described further on; all maintenance on the snapshot is done in a
chroot.
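Just to check that I understand the workflow: I imagine maintenance
on the snapshot looks roughly like this (my own sketch; /snapshot
is a made-up path):

    # Enter the golden snapshot and update it in place (sketch):
    mount -t proc proc /snapshot/proc
    chroot /snapshot /bin/bash
    emerge --sync && emerge -uDN world   # refresh tree, update packages
    exit
    umount /snapshot/proc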
The Blade root runs DHCP and tftp and is able to answer PXE dhcp/tftp
requests (for network boot) and serve an initial bootloader (grub
0.95 with diskless and diskless-undi patches to allow detection of
Broadcom NICs), along with an initial initrd filesystem.
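For reference, the PXE side would presumably need a dhcpd.conf
fragment along these lines (a sketch; the subnet, addresses, and
boot image filename are assumptions on my part):

    # dhcpd.conf fragment for PXE network boot (sketch):
    subnet 10.0.0.0 netmask 255.255.255.0 {
        range 10.0.0.100 10.0.0.200;
        next-server 10.0.0.1;    # blade root, serving tftp
        filename "pxegrub";      # the patched grub 0.95 image
    }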
The Gentoo snapshot contains all the packages required for all
applications (roughly 2 GB on our systems), along with dhcp/tftp
and their configs, to allow it to act as blade root.
See my question above: is the switchover manual?
In addition, the Blade root contains individual configurations for
every deployed server (or, rather, only the changes relative to the
standard Gentoo config: per-blade IPs, custom application configs,
different sets of services to start at boot, etc.)
Do you use classes here (e.g. webserver, databaseserver,
mailserver, cachingserver, etc.)? Or do you maintain individual
setups for each server? And what scripting language did you choose
for the config scripts and related tooling, and why that one?
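For comparison, the per-server overlay I had been sketching myself
looks like this (a hypothetical layout and deploy step, not a
description of your setup):

    # One overlay directory per blade, holding only files that
    # differ from the snapshot (sketch):
    overlays/
        blade01/etc/conf.d/net          # per-blade IP
        blade01/etc/runlevels/default/  # services to start at boot
        blade02/etc/conf.d/net
    # Applied on top of a copy of the snapshot at deploy time:
    rsync -a overlays/blade01/ /snapshot-copy/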
<booting process snipped>
I'm also curious as to what QA procedures you have in place to
prevent accidental mistakes on the blade root server. I assume you
test beforehand? On all server classes? Modifications to the third
archive with the per-server configs seem rather difficult to test.
I hope this helps.
Oh, it sure did. It confirmed some ideas I was already thinking
about and gave me a real-world example showing that it can be done
:-)
Thanks,
Ramon
--
Change what you're saying,
Don't change what you said
The Eels
--
[email protected] mailing list