Hi. I have emailed Gridway many times, but it seems that nobody is there, so I am resending the mail here. I hope someone can help me. THANKS!
---------- Forwarded message ----------
From: yingying chen <[email protected]>
Date: 2009/6/10
Subject: Job rescheduling & CPU number issues
To: [email protected]

Hi, everyone.

I am using GT4.2 + Ganglia + PBS + GridWay for some research.

1) GridWay has a small problem: when several jobs (just 10, for example) are submitted to GridWay, it allocates them to different cluster nodes. Some jobs complete successfully, while others are rescheduled, sometimes many times (never more than the configured number of rescheduling attempts, of course). Sometimes such a job just stays pending. I checked job.log, where I found: *unable to find exit code, assuming that the job failed or was cancelled*. Why? I should add that this happens almost at random; it is not always the same job that fails. Any hints? Thanks!

2) Besides, I use Ganglia to collect information about the cluster nodes. In GridWay, the CPU count shown as *N(U/F/T)* (which can be obtained with the "gwhost" command) looks wrong. For example, my Cluster A has 7 machines, each with dual CPUs and Hyper-Threading, and GridWay reports N(U/F/T) as 0/7/14. I googled and more or less concluded that this problem is caused by Hyper-Threading. This confuses me, since I expected the total number and the free number to be the same (when I have not yet used any cluster resources), not to differ by a factor of two. Is my understanding wrong?

Please help me. Thanks a lot!

--
Regards,
Elaine CHEN.
Polyu, HK.
