On Sun, 2007-10-14 at 07:24 +0800, Michael Calizo wrote:
> It's a High Performance Computing (HPC) setup.  To elaborate, we
> have a thinanywhere setup that users use to submit jobs to the
> computing nodes. The management node will then push those jobs to
> available computing nodes. 

I don't know much about that end of the Linux performance spectrum,
but I suspect you'll just have to provide more information.  there
is NO information in your posts that can help anyone help you.

e.g., are you running something like openMosix or openSSI?  or is it
similar to beowulf?  just what kind of HPC are you running?  what
distro is running?  what kernel?  how are tasks allocated?  how does
each node know to run something?  does it poll a central location,
pull the jobs it's supposed to run, and run them?  or does a central
server directly send it commands to run (e.g., similar to
ssh [run-some-program-on-your-cpu])?  are these jobs IO-bound or
CPU-bound?  it's probably a mix; what is the mix?  is disk shared among
multiple servers?  if yes, how is it shared?

> Cron job is not an option because we can not just
> kill/restart/powercycle those jobs/server on the compute node without
> informing the job owner. 

do the servers ever die by themselves?  e.g., OOM.  i would think that
it's always possible to kill a node at least via OOM.  or do you have
strong ulimit settings so that it's never possible to kill a node via
OOM or some similar denial of service?

> What I want is an opinion on whether it is safe to say that an upgrade
> is needed for those low-end compute nodes. This is actually a matter
> of how to defend my case to the boss :)

if your statistics are pretty good (e.g., we get N node failures a 
month, and of those N, 99% are on low-end nodes, or we've had M node
failures in 3 years and of those 99% were on low-end nodes) then you 
can certainly safely show those stats to your boss.  even 80% is
probably high enough that he'll agree to upgrade all the low end 
boxes.  if you're lower than around 80% (or, pick a number, any 
number higher than 50%) then you'll need to actually understand 
why those low end nodes are failing rather than making blanket
statements that your statistics don't support.
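
a minimal sketch of the kind of tally i mean, in Python.  the failure
log, the "low-end"/"high-end" labels and the function name are all made
up for illustration; plug in whatever your own records look like.

```python
from collections import Counter

def failure_share(failed_node_classes, node_class="low-end"):
    """Return the fraction of recorded failures that hit node_class."""
    if not failed_node_classes:
        raise ValueError("no failures recorded")
    counts = Counter(failed_node_classes)
    return counts[node_class] / len(failed_node_classes)

# hypothetical failure log: one entry per node failure, labelled by class
failures = ["low-end"] * 8 + ["high-end"] * 2
share = failure_share(failures)
print(f"{share:.0%} of failures were on low-end nodes")
```

it's worth setting that number beside the fraction of your nodes that
are low-end at all: if most of the cluster is low-end, a high share of
failures on low-end boxes says much less.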

is there some big vendor you can push that question to?  if you're
describing your setup accurately (and not just giving us big numbers
so our eyes will grow big too), then i'm sure you've got some sort
of expensive support agreement.  Maybe, if your vendor is, e.g., 
Red Hat, you can get them to have Alan Cox look at your setup.  Maybe
you can pay people on this list (Ed Tongson? Fooler? maybe Ian Sison
or Orly Andico [but maybe not, if they're very busy, unless they'd
look at it for fun :-]) to look at the issue.  Posting vague
descriptions of the problem though is certainly not going to get you
useful replies.  You need to be specific.  If there's too much that's
confidential, then you're just going to have to pay someone good
who will sign an NDA.

tiger


_________________________________________________
Philippine Linux Users' Group (PLUG) Mailing List
[email protected] (#PLUG @ irc.free.net.ph)
Read the Guidelines: http://linux.org.ph/lists
Searchable Archives: http://archives.free.net.ph