Hi Dan,

On 20 Oct 2005, at 2:41 AM, Dan Podeanu wrote:
Interesting topic.

Indeed, I'm moving to a different employer and was considering a similar setup...
I'm curious about a number of things:

What's the scale of the cluster you're using this setup on?
Would you be willing/able to share some of the work?
I'd be very interested to look at your setup before I start my own.

Any comments on the hardware stability of the nodes you're using?
Which make of blades are you using?

I was also wondering whether you are familiar with the work at http://www.infrastructures.org/
Your setup has many of its characteristics.

Objectives:

1. Low maintenance costs: maintaining and applying patches to a single build
(Gentoo snapshots).
2. Low scalability overhead: scalability should be part of the design; it
should not take more than 10 minutes per server to scale up.
3. Redundancy: Permanent hardware failure of N-1 out of N nodes, or
temporary failure (power off) of all nodes should allow fast (10 minutes) recovery of all nodes in a
cluster.

I read below that all nodes include configs for dhcp/tftp in order to be able to take over the golden (blade root) server. How do you handle that? In case of downtime of the main blade root server, which of the nodes gets to take over? Is that an automatic or a manual process?

Additionally, did you test an all-node failure, and how did the master blade root cope with the strain of all nodes booting at once? What hardware are you using for the blade root server?

Restrictions:

1. Single CPU architecture: I consider the cost of maintaining several
architectures to be bigger than the cost of purchasing a single architecture.

Are you running a full 64-bit setup or 32-bit compatibility mode?
What are your experiences with stability in the 64-bit case? I'm especially curious about PHP and its diverse set of external libs. I do agree, though; any thoughts on the inevitable upgrade that's going to show up some time in the future when your current hardware platform is no longer available?

2. Unified packages tree: I consider the cost of maintaining several Gentoo snapshots just to have deployed the minimum of packages per server assigned to a specific application (mail server, web server, etc.) to be bigger than having a common build with all packages and just starting the required services (i.e. all deployed servers have both an MTA and Apache installed; the web servers have Apache started, and the mail servers have it stopped and the MTA running instead).

Agreed; it doesn't pay off to have separate base sets for the different types of nodes, and it's good for redundancy: if needed, a former web server can stand in as a database server, etc.
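For what it's worth, I was picturing the per-role service selection as little more than a handful of rc-update calls driven by the per-server config; roughly like this (the role and service names are just my examples, not a guess at your actual setup):

    # hypothetical "web" role: run Apache, keep the MTA off
    rc-update del postfix default
    rc-update add apache2 default

    # hypothetical "mail" role: the inverse
    rc-update del apache2 default
    rc-update add postfix default

Is that roughly how you handle it, or is there more machinery involved?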

3. An application that can act as a cluster with transparent failover (web
with balancer and health checking, multiple MX servers, etc.)

I don't quite understand this restriction; could you elaborate?

4. Remote storage for persistent data (like logs) helps (you will see why); you can modify the partitioning or hard disk configuration to maintain a stable filesystem on individual servers.
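Makes sense. For the log part I was assuming plain remote syslog already covers most of it, something like this (assuming classic sysklogd; the loghost name is just a placeholder):

    # forward everything from each node to a central loghost
    printf '*.*\t@loghost.internal\n' >> /etc/syslog.conf
    /etc/init.d/sysklogd restart

Is that what you do, or do you use NFS or something heavier for persistent data?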

<snipped>

Software:

One initial server (blade root) is installed with Gentoo. On top of that, in a directory, another Gentoo is installed (Gentoo snapshot) that will be replicated on individual servers as further described, and all maintenance to the snapshot is done in chroot.
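Just to check that I'm reading this right: day-to-day maintenance of the snapshot would then boil down to something like the following (the snapshot path is my own guess):

    # make /proc available inside the snapshot, then work in the chroot
    mount -t proc none /srv/gentoo-snapshot/proc
    chroot /srv/gentoo-snapshot /bin/bash
    emerge --sync
    emerge --update --deep --newuse world
    # exit the chroot and umount /srv/gentoo-snapshot/proc when done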

The Blade root runs DHCP and tftp and is able to answer PXE dhcp/tftp
requests (for network boot) and serve an initial bootloader (grub 0.95 with diskless and diskless-undi patches to allow detection of Broadcom NICs), along with an initial initrd filesystem.
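That part I can follow; for my own notes, I assume the dhcpd side looks roughly like the stanza below (addresses, path and file name are my guesses, untested against the patched grub):

    # /etc/dhcp/dhcpd.conf on the blade root (relevant bit only)
    subnet 10.0.0.0 netmask 255.255.255.0 {
        range 10.0.0.100 10.0.0.200;
        next-server 10.0.0.1;       # blade root, also the tftp server
        filename "pxegrub";         # patched grub 0.95 PXE image
    }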

The Gentoo snapshot contains all the packages required for all applications (roughly 2 GB on our systems), along with dhcp/tftp and configs, to allow it to act as Blade root.

See my question above: is switching manual?
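If it is manual, I imagine promoting a surviving node is little more than starting the already-configured services on it, e.g. (the init script names are my guesses for Gentoo's dhcp and tftp-hpa packages):

    # on the node taking over the blade root role
    /etc/init.d/dhcpd start
    /etc/init.d/in.tftpd start
    rc-update add dhcpd default
    rc-update add in.tftpd default

Or is there something like a floating IP involved, so the other blades don't need to know which node is currently the root?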

In addition, the Blade root contains individual configurations for every individual deployed server (or, rather, only changes to the standard Gentoo config, i.e. per-blade IPs, custom application configs, different configuration for services to start at boot, etc.)

Do you use classes here (e.g. web server, database server, mail server, caching server, etc.)?
Or do you maintain individual setups for each server?
What scripting language did you choose for the config scripts, and why that one?
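For context, the naive scheme I had in mind for the per-server part is just a small overlay directory rsynced over the common snapshot, roughly (the layout and host name are purely hypothetical):

    # build the tree served to one blade: common snapshot first, then its overlay
    rsync -a /srv/gentoo-snapshot/ /srv/export/blade07/
    rsync -a /srv/config-overlays/blade07/ /srv/export/blade07/

Curious how far off that is from what you actually do.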

<booting process snipped>

I'm also curious as to what QA procedures you have in place to prevent accidental mistakes on the blade root server. I assume you test beforehand? On all server classes? Modifications to the third archive with the per-server config seem rather difficult to test.

I hope this helps.

Oh, it sure did; it confirmed some ideas I was already thinking about and gave me a real-world example that it can be done :-)

Thanks,

Ramon
--
Change what you're saying,
Don't change what you said

The Eels


