On Sun, 2007-10-14 at 07:24 +0800, Michael Calizo wrote:
> Its a HIGH Performance Computing (HPC). To elaborate the setup, we
> have a thinanywhere setup for user to use and to submit jobs to those
> computing nodes. The management node will then push the those jobs to
> available computing nodes.

I don't know much about that end of the Linux performance spectrum, but I suspect you'll just have to provide more information. There is NO information in your posts that can help anyone help you.

E.g., are you running something like openMosix or OpenSSI? Or is it similar to Beowulf? Just what kind of HPC are you running? What distro is running? What kernel?

How are tasks allocated? How does each node know to run something? Does it poll a central location and pull the jobs it's supposed to run? Or does a central server directly send it commands to run (e.g., similar to ssh [run-some-program-on-your-cpu])?

Are these jobs IO-bound or CPU-bound? It's probably a mix, so what is the mix? Is disk shared among multiple servers? If yes, how is it shared?

> Cron job is not an option because we can not just
> kill/restart/powercycle those jobs/server on the compute node without
> informing the job owner.

Do the servers ever die by themselves, e.g., from OOM? I would think that it's always possible to kill a node at least via OOM. Or do you have strong ulimit settings so that it's never possible to kill a node via OOM or some similar denial of service?

> What I want is an opinion if it is safe to say that upgrade is needed
> for those low-end computer node. This is actually a matter of how to
> defend my case to the boss :)

If your statistics are pretty good (e.g., we get N node failures a month, and of those N, 99% are on low-end nodes, or we've had M node failures in 3 years and of those 99% were on low-end nodes) then you can certainly safely show those stats to your boss. Even 80% is probably high enough that he'll agree to upgrade all the low-end boxes. If you're lower than around 80% (or, pick a number, any number higher than 50%) then you'll need to actually understand why those low-end nodes are failing rather than making blanket statements that your statistics don't support.

Is there some big vendor you can push that question to?
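
On the failure statistics above: a rough sketch of the arithmetic, in case it helps when putting numbers in front of the boss. The function names and the counts (45 of 50) are made up for illustration, and the 80% cutoff is just the ballpark figure from this mail, not a rule.

```python
def low_end_failure_share(low_end_failures, total_failures):
    """Fraction of all recorded node failures that hit low-end nodes."""
    if total_failures <= 0:
        raise ValueError("need at least one recorded failure")
    if not 0 <= low_end_failures <= total_failures:
        raise ValueError("low-end failures must be between 0 and the total")
    return low_end_failures / total_failures


def recommendation(share, threshold=0.80):
    """Apply the rough cutoff: at or above it, the stats alone make the case."""
    if share >= threshold:
        return "show the stats to the boss"
    return "find out why the low-end nodes fail first"


# Hypothetical counts: 50 failures recorded, 45 of them on low-end nodes.
share = low_end_failure_share(45, 50)
print(f"{share:.0%} of failures were on low-end nodes: {recommendation(share)}")
```

Plug in your real counts for whatever period you have logs for.
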
If you're describing your setup accurately (and not just giving us big numbers so our eyes will grow big too), then I'm sure you've got some sort of expensive support agreement. Maybe, if your vendor is, e.g., Red Hat, you can get them to have Alan Cox look at your setup. Maybe you can pay people on this list (Ed Tongson? Fooler? maybe Ian Sison or Orly Andico [but maybe not, if they're very busy, unless they'd look at it for fun :-]) to look at the issue.

Posting vague descriptions of the problem, though, is certainly not going to get you useful replies. You need to be specific. If there's too much that's confidential, then you're just going to have to pay someone good who will sign an NDA.

tiger

_________________________________________________
Philippine Linux Users' Group (PLUG) Mailing List
[email protected] (#PLUG @ irc.free.net.ph)
Read the Guidelines: http://linux.org.ph/lists
Searchable Archives: http://archives.free.net.ph

