Dear Xavier,
Thank you for the nice and detailed response. This is very useful
'intelligence', particularly regarding the reliability of the controllers
and the IPMI firmware problems.
Have you ever lost data on the X4500 systems?
Would it be possible to get a (private) copy of your Jumpstart config file
and the custom install scripts? Reading these and modifying them will
probably be quite a bit quicker than developing our own from scratch.
I am traveling at the moment (just picking up email from airport wireless)
so I will need a bit of time to absorb everything below. I'll probably
write back with a few other questions and comments later.
(PS: do you see any good reason NOT to boot the boxes with root on an
NFS-exported file system? To me this makes sense, as it would permit an
'all-ZFS' and symmetric disk configuration.)
Cheers,
Bruce
PS: Loic: thanks for passing this on to your colleague!
PPS: We've also been doing some experiments with putting OpenSolaris+ZFS
on some of our generic (Supermicro + Areca) 16-disk RAID systems, which
were originally intended to run Linux.
On Mon, 14 Apr 2008, Xavier Canehan wrote:
Bruce Allen wrote:
Hi Loic,
Hello,
As I am one of Loic's colleagues and in charge of the Thumpers, I'm
answering here.
(I'm picking up a 6-month old thread.)
On Wed, 5 Sep 2007, Loic Tortay wrote:
As of today we have 112 X4500s; 112U is almost 3 racks, which is quite a
lot given our floor space constraints.
We're now commissioning our new cluster, which includes about 30 Linux
storage server boxes of ~10 TB each and (only!) 12 Sun X4500s.
We now have 146 X4500s. 3 machines are the spare/development kit, 2 are
running Linux (Scientific Linux 4.5 x64), and the rest are running Solaris
10 (Update 3 with a recent patch level).
We have cloning scripts set up (Debian FAI) to automatically build the
Linux boxes, but are less familiar with cloning in a Solaris
environment. What method do you use to install an OS and patches onto
your X4500s and ensure a homogeneous environment? If it's Sun's
Jumpstart, would it be possible to see the config and Jumpstart file(s)
that you use for this?
We use JumpStart for the system installation with several custom scripts
launched automatically after the "post_install" (which does as little as
possible).
Everything (OS patches, software installation & patches, storage
configuration, system services configuration) is done automatically, except
the installation and configuration of the actual storage applications
(dCache, Xrootd, SRB or HPSS), which are left to the application
administrators.
The system configuration is homogeneous simply because the installation is
done with the same set of scripts for all Solaris installations (with
machine-, service- & hardware-specific configuration files).
We only have ~2 Solaris admins (< 1 FTE), so things have to be kept simple.
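For readers who have not seen JumpStart before, a minimal setup along
these lines is typical; the rules entry, profile contents, disk slice and
finish script below are an illustrative sketch, not IN2P3's actual files:

    # rules: match every x86 client, no begin script ("-"),
    # apply the profile, then run the finish script
    karch i86pc - x4500_profile finish_install.sh

    # x4500_profile: keep the base install as small as possible
    install_type    initial_install
    system_type     standalone
    partitioning    explicit
    filesys         c5t0d0s0 free /
    cluster         SUNWCreq

    # finish_install.sh (a /bin/sh script): runs in the install
    # miniroot, where the target system is mounted on /a and
    # SI_CONFIG_DIR points at the JumpStart configuration directory
    mkdir -p /a/var/tmp/postinstall
    cp ${SI_CONFIG_DIR}/postinstall/* /a/var/tmp/postinstall/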
Since stock Solaris cannot boot from ZFS, I'm a bit reluctant to throw
away drives and storage space to host the OS separately on each X4500.
One attractive alternative is to NFS-boot the Thumpers from a single
central OS image. Have you tried this yourself? Where do you boot your
X4500 systems from?
We still use two of the internal disks for the system (in software RAID-1).
This has saved us once in fewer than 40 hardware incidents.
We have not tried to NFS-boot the X4500 for daily operation (only for
install and once for rescue).
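For reference, the usual Solaris Volume Manager recipe for that kind of
root mirror is roughly the following; the disk names are illustrative
(the two bootable slots on an X4500 are commonly c5t0d0 and c5t4d0):

    # state database replicas on a small slice of each disk
    metadb -a -f -c 2 c5t0d0s7 c5t4d0s7
    # submirrors: d10 = the existing root slice, d20 = the second disk
    metainit -f d10 1 1 c5t0d0s0
    metainit d20 1 1 c5t4d0s0
    # one-sided mirror over the live root; metaroot updates vfstab
    metainit d0 -m d10
    metaroot d0
    lockfs -fa
    # after a reboot, attach the second half and let it resync
    metattach d0 d20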
I'd be grateful for any advice, anecdotes, war stories, etc.
The Marvell SATA controllers in the X4500s are much more reliable than we
initially expected (one failure in 876 controllers over 18 months). So there
is no need to be over-cautious and configure the zpools as we first did
(i.e. in the Sun default configuration with 8 security disks).
We now use a configuration with only 6 security disks, which gives an
extra terabyte.
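For concreteness: with the X4500's 500 GB drives, two fewer disks spent
on redundancy is exactly that extra terabyte. The grouping below only
illustrates the command form on a few of the 46 data disks; it is a
guess, not IN2P3's documented layout:

    # one raidz vdev spread across the six Marvell controllers; a
    # full data pool would add more vdevs along the same lines
    zpool create tank raidz c0t1d0 c1t1d0 c2t1d0 c3t1d0 c4t1d0 c5t1d0
    # hot spares also count against the "security disk" budget
    zpool add tank spare c0t7d0 c4t7d0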
The war story is still raging.
The (Linux-based) service processor firmware in the X4500s (and apparently
other Sun X4x00 servers) may become unusable after some time.
This is an issue because when the service processor becomes unusable, IPMI
no longer works (no serial-over-LAN console, no ipmitool, etc.).
At that point, more often than not, the OS eventually becomes (very)
unresponsive. The only solution is then to unplug all power supplies (!)
The current Sun-supported work-around is to reboot the SP every 30 days,
60 at most. This should not be a big deal, but it has sometimes triggered
a reboot of the X4500 as well. As This Should Not Happen, Sun is actively
investigating, suspecting a related but different issue.
There is supposedly a corrected firmware for this, but it is scheduled to
be released "sometime" in the future (the current version is 2.0.2.1).
As for anecdotes, our X4500s have survived two catastrophic power
failures (the whole machine room instantly quiet): once due to human error
and once due to a power generator that nearly melted after a power cut.
The first one struck while 46 machines were in production: we lost only
one hard drive out of 2208. The second passed almost unnoticed. Almost.
A good point for support, whose training is now perfect: we no longer
encounter people asking us to check the "Hitachi disk array" connected to
the Sun server.
Most of the issues we've had with the X4500 have been forwarded to Sun
engineering and it seems that several of these remarks were taken into
account for the next generation of X4500 (of course, several other X4500
users provided input too).
We are planning to set up a tender to get at least this volume of usable
storage this year. Thumpers may be good candidates.
X.
--
| Xavier Canehan <[EMAIL PROTECTED]> - IN2P3 Computing Centre |
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf