[gt-user] Job rescheduling & CPU number issues

yingying chen Wed, 10 Jun 2009 20:10:37 -0700

Hi. I have emailed the Gridway list many times, but it seems that nobody is
there. So I am resending the mail here. I hope someone can help me. Thanks!

---------- Forwarded message ----------
From: yingying chen <[email protected]>
Date: 2009/6/10
Subject: Job rescheduling & CPU number issues
To: [email protected]

Hi, everyone. I am using GT4.2 + Ganglia + PBS + Gridway for some research.

1) Gridway has a small problem:

When several jobs (just 10, for example) are submitted to Gridway, Gridway
allocates them to different cluster nodes. Some jobs complete successfully,
while others are rescheduled, sometimes many times (never more than the
configured rescheduling limit, of course). Sometimes such a job just stays
pending. I checked the job.log, where I found:

*unable to find exit code, assuming that the job failed or was cancelled*.

Why? I should add that this happens almost at random; it is not always the
same job that fails. Any hints? Thanks!
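
To help narrow down how often this happens, here is a minimal Python sketch
that counts occurrences of the message in a job.log. Only the message string
comes from the log quoted above; the function name, the sample text, and the
idea of passing the log contents in as a string are illustrative assumptions,
not part of Gridway.

```python
# Message fragment taken from the job.log line quoted above.
FAILURE_MARKER = "unable to find exit code"


def count_failed_attempts(log_text: str) -> int:
    """Count log lines that contain the failure message (case-insensitive)."""
    return sum(
        1 for line in log_text.splitlines()
        if FAILURE_MARKER in line.lower()
    )


# Hypothetical log contents; real job.log lines will look different.
sample = """\
(hypothetical line) job submitted
unable to find exit code, assuming that the job failed or was cancelled
(hypothetical line) job rescheduled
unable to find exit code, assuming that the job failed or was cancelled
"""

print(count_failed_attempts(sample))
```

Running this over the job.log of every job would show whether the failures
cluster on particular jobs or hosts, or are spread out as they appear to be.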

2) Besides, I use Ganglia to collect information about the cluster nodes. In
Gridway, the CPU count reported by the "gwhost" command, *N(U/F/T)*, looks
wrong. For example, my Cluster A has 7 machines, each with dual CPUs and
Hyperthreading, and the N(U/F/T) shown in Gridway is 0/7/14. I googled and
more or less concluded that this problem is caused by Hyperthreading. That
confuses me, since I expected the total number and the free number to be
the same (when I was not yet using any cluster resources), not to differ by
a factor of two.

Is my thinking wrong? Please help me. Thanks a lot!
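
The arithmetic behind the question can be written out as a small Python
sketch. It assumes that U/F/T stands for used/free/total and that
Hyperthreading exposes two logical CPUs per physical CPU; both are
assumptions for illustration, and the helper name is hypothetical.

```python
def parse_nodes(field: str) -> tuple[int, int, int]:
    """Split an 'U/F/T' string into (used, free, total) integers."""
    used, free, total = (int(part) for part in field.split("/"))
    return used, free, total


machines = 7
cpus_per_machine = 2
physical_cpus = machines * cpus_per_machine   # CPUs without Hyperthreading
logical_cpus = physical_cpus * 2              # assumes 2 threads per CPU

used, free, total = parse_nodes("0/7/14")     # value reported for Cluster A

print("total matches physical CPU count:", total == physical_cpus)
print("total matches logical CPU count:", total == logical_cpus)
print("slots neither used nor free:", total - used - free)
```

On an idle cluster one would expect used + free to equal total, so the
unaccounted-for slots are the part that needs explaining.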

-- 
Regards,
Elaine CHEN.
Polyu, HK.

