Can you be more specific about your setup? Is it an HPC or an HTC cluster? Can you also elaborate on the problem? Do the jobs stay idle on the low-end nodes? The way you are dealing with the problem (stopping, killing, and restarting the jobs) is the typical response, but it can be done automatically via the job scheduler. And since you've already identified the problematic nodes, you might want to pull them out of the cluster, place them in a sandbox, and then troubleshoot them further.
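To give a rough idea of what "automatically via the job scheduler" could look like, here is a minimal sketch in Python. It is not the configuration of any particular scheduler. The node names, memory sizes, the `quarantined` flag, and the `place` helper are all made up for illustration; a real scheduler would express the same idea through its own node attributes and job resource requests.

```python
# Hypothetical sketch: node names, memory sizes, and job fields are invented
# for illustration and do not come from any real scheduler's API.
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    mem_gb: int
    quarantined: bool = False  # True once a node is pulled out for troubleshooting


@dataclass
class Job:
    job_id: str
    mem_gb: int  # memory the job asks for


def eligible_nodes(job, nodes):
    """Return nodes that are in service and have enough memory for the job."""
    return [n for n in nodes if not n.quarantined and n.mem_gb >= job.mem_gb]


def place(job, nodes):
    """Pick the smallest node that fits, keeping larger nodes free for larger jobs."""
    candidates = eligible_nodes(job, nodes)
    if not candidates:
        return None  # job stays queued instead of landing on an undersized node
    return min(candidates, key=lambda n: n.mem_gb)


nodes = [
    Node("node001", 2),
    Node("node002", 4),
    Node("node003", 2, quarantined=True),  # sandboxed low-end node
]
```

With something like this in place, a job that asks for 3GB is never sent to a 2GB node, and a node marked as quarantined receives no jobs at all until it is put back in service.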
On 10/13/07, Michael Calizo <[EMAIL PROTECTED]> wrote:
>
> Hi Guys,
>
> A newbie here needs an expert opinion regarding Linux HPC.
>
> In my current company we have a Linux(Redhat) cluster implementation, say
> 100 nodes per cluster.
> I notice that on the problematic cluster, some nodes are low end server
> say 2GB memory while the
> other nodes have 4GB memory. This past few weeks I noticed that user
> problem keeps on growing and
> base on my investigation, the leftover jobs is always on the compute nodes
> which are "low end".
> We manage to stop/kill/restart the jobs but I know that this is only a
> temporary solution and I wanted a permanent one.
>
> 1. I am suspecting that this might be a hardware related problem but I am
> not 100% sure. I want to get opinion/suggestion first from HPC guru before I
> make my move to approach the management and raise my case that hardware
> upgrade is needed.
>
> 2. Or can this problem be attributed to the cluster missconfiguration?
>
> Thanks in advance.
>
> --
> Mike Calizo
> Registered Linux User # 365113
>
_________________________________________________
Philippine Linux Users' Group (PLUG) Mailing List
[email protected] (#PLUG @ irc.free.net.ph)
Read the Guidelines: http://linux.org.ph/lists
Searchable Archives: http://archives.free.net.ph

