Hello Steve and list
Steve Herborn wrote:
> The hardware suite is actually quite sweet, but has been mismanaged rather
> badly. It has been left in a machine room that is too hot & on power that
> is more than flaky with no line conditioners. One of the very first things
> I had to do was replace almost two-dozen Power Supplies that were DOA.
Yes, 24 power supplies may cost as much as whatever was saved by skipping
the UPS and line conditioners, plus the headache of replacing them, plus
the failed nodes.
> I think I have most of the hardware issues squared away right now and need
> to focus on getting her up & running, but even installing the OS on a
> head node is proving to be troublesome.
Besides my naive encouragement to use Rocks,
I remember some recent discussions here on the Beowulf list
about different techniques to set up a cluster.
See this thread, and check the postings by
Bogdan Costescu, from the University of Heidelberg.
He seems to administer a number of clusters, some of which have
constraints comparable to yours, and to use a variety of tools for this:
http://www.beowulf.org/archive/2008-October/023433.html
http://www.iwr.uni-heidelberg.de/services/equipment/parallel/
> I really wish I could get away with using ROCKS as there would be such a
> greater reach back for me over SUSE. Right now I am exploring AutoYaST to
> push the OS out to the compute nodes,
Long ago I looked into SystemImager, which was then part of OSCAR,
but I don't know whether it is still current/maintained:
http://wiki.systemimager.org/index.php/Main_Page
> but that is still going to leave me
> short on any management tools.
That is true.
Tell your bosses they are asking you to reinvent the Rocks wheel.
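If you do go the AutoYaST route, the usual mechanism is to PXE-boot the
compute nodes and hand the SLES installer an AutoYaST profile on the kernel
command line. A rough sketch of a pxelinux.cfg entry (host name and paths
below are just placeholders, adjust to your setup):

    # /tftpboot/pxelinux.cfg/default  (example only)
    default sles_autoyast
    prompt 0
    label sles_autoyast
        kernel linux
        append initrd=initrd install=http://headnode/install/sles \
               autoyast=http://headnode/profiles/compute.xml

You can generate a starting profile from an already-installed node with
YaST's clone_system / autoyast modules and then edit it by hand.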
Good luck,
Gus Correa
--
---------------------------------------------------------------------
Gustavo J. Ponce Correa, PhD - Email: [EMAIL PROTECTED]
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------
Steven A. Herborn
U.S. Naval Academy
Advanced Research Computing
410-293-6480 (Desk)
757-418-0505 (Cell)
-----Original Message-----
From: Gus Correa [mailto:[EMAIL PROTECTED]
Sent: Monday, December 08, 2008 1:45 PM
To: Beowulf
Cc: Steve Herborn
Subject: Re: [Beowulf] Personal Introduction & First Beowulf Cluster
Question
Hello Steve and list
In the likely case that the original vendor no longer supports this
five-year-old cluster,
you can try installing the Rocks cluster suite, which is free from SDSC,
and which you have already come across:
http://www.rocksclusters.org/wordpress/
This would be the path of least resistance, and may get your cluster up and
running again with relatively little effort.
Of course there are many other solutions, but they may require more effort
from the system administrator.
Rocks is well supported and documented.
It is based on CentOS (a free rebuild of RHEL).
There is no support for SLES on Rocks,
so if you must keep the current OS distribution, it won't work for you.
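If you can switch, the Rocks install path is roughly this (a sketch from
memory; check the docs for the version you pick):

    # 1. Boot the head node from the Rocks DVD (or Kernel/Boot Roll) and walk
    #    through the frontend install screens.
    #    Rocks convention: eth0 = private cluster network, eth1 = public.
    # 2. On the running frontend, capture the compute nodes one by one:
    insert-ethers        # choose "Compute", then PXE-boot each node in turn
    # 3. Once they are installed, a quick sanity check:
    cluster-fork uptime  # runs the command on every compute node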
I read your last paragraph, but you may argue to your bosses that the age
of this machine doesn't justify being picky about the particular OS flavor.
Bringing it back to life and making it a useful asset,
with a free software stack, would be a great benefit.
You would spend money only on application software (e.g. a Fortran
compiler, Matlab, etc.).
Other solutions (e.g. Moab) will cost money, and may not work with
this old hardware.
Sticking to SLES may be a catch-22, a shot in the foot.
Rocks has a relatively large user base, and an active mailing list for help.
Note that Rocks requires at least 1 GB of RAM on every node,
two Ethernet ports on the head node, and one Ethernet port on each
compute node.
Check the hardware you have.
Although PXE boot capability is not strictly required, it makes
installation much easier.
Check your motherboard and BIOS.
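A few quick commands to run on a node to see what you have (assuming a
standard Linux install; interface names may differ on your hardware):

    free -m                        # installed RAM (Rocks wants >= 1 GB)
    lspci | grep -i ethernet       # how many NICs, and what kind
    ethtool eth0 | grep -i speed   # is the port GigE-capable?
    dmidecode -t bios              # BIOS vendor/version, to look up PXE support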
I have a small cluster made of five salvaged Dell Precision 410s (dual
Pentium III)
running Rocks 4.3, and it works well.
For old hardware Rocks is a very good solution, requiring a modest
investment of time,
and virtually no money.
(In my case I only had to buy cheap SOHO switches and Ethernet cables,
but you probably already have switches.)
If you are going to run parallel programs with MPI,
the cheapest thing would be to have GigE ports and switches.
I wouldn't invest in a fancier interconnect on such an old machine.
(Do you have any fancier interconnect already, say Myrinet?)
However, you can buy cheap GigE NICs for $15-$20, and high end ones (say
Intel Pro 1000) for $30 or less.
This would be needed only if you don't have GigE ports on the nodes already.
Probably your motherboards have dual GigE ports, I don't know.
MPI over 100 Mbit (Fast) Ethernet is a real pain; don't do it unless you
are a masochist.
A 64-port GigE switch to support MPI traffic would also be a worthwhile
investment.
Keeping MPI on a separate network, distinct from the I/O and cluster
control net, is a good thing.
It avoids contention and improves performance.
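With Open MPI, for instance, you can pin the MPI traffic to the dedicated
interface; the interface name below is just an example:

    mpirun -np 16 -hostfile mynodes \
           --mca btl tcp,self \
           --mca btl_tcp_if_include eth1 \
           ./my_mpi_program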
A natural precaution would be to back up all home directories, and any
other precious data or filesystems, before you start.
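Plain rsync to another machine (or an external disk) is enough; host and
paths below are just placeholders:

    rsync -aH --numeric-ids /home/ backuphost:/backup/cluster_home/
    rsync -aH /etc/ backuphost:/backup/headnode_etc/   # configs are worth keeping too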
I suggest sorting out the hardware issues before anything else.
It would be good to evaluate the status of your RAID,
and perhaps use that particular node as a separate storage appliance.
You can try just rebuilding the RAID and see if it works, or else replace
the defective disk(s), if the RAID controller is still good.
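If it turns out to be Linux software RAID (md) rather than a hardware
controller, status and rebuild look roughly like this (device names are
examples; a hardware controller will have its own management tool instead):

    cat /proc/mdstat                  # overall array status
    mdadm --detail /dev/md0           # per-array detail, shows failed members
    mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1   # drop a bad disk
    mdadm /dev/md0 --add /dev/sdc1    # add the replacement; rebuild starts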
Another thing to look at is how functional your Ethernet (or GigE)
switch or switches are,
and, if you have more than one switch, how they are (or can be) connected
to each other.
(One for the whole cluster? Two or more separate? Some specific topology
connecting many switches?)
I hope this helps,
Gus Correa
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf