I have been trying to test a new build of a torque/maui package for Rocks. I am running a head node and two test compute nodes in a VMware environment under CentOS 6. The issue I'm seeing, which also appears to have been reported on the torque list by someone else, is that everything seems to go as expected, yet all the work units end up on one node, and maui then reports that the load on that node is too high. I am pasting a variety of log and command outputs below in case anyone has useful observations or suggestions.
[tsadmin@compute-0-1 ~]$ cat /var/spool/torque/aux/36.numbat.arc.nasa.gov
compute-0-1
compute-0-1
compute-0-0
compute-0-0

-----------------

[root@numbat bin]# ./checknode compute-0-0
checking node compute-0-0

State:      Busy  (in current state for 00:26:26)
Configured Resources: PROCS: 2  MEM: 734M  SWAP: 1734M  DISK: 1M
Utilized   Resources: PROCS: 2  SWAP: 138M
Dedicated  Resources: PROCS: 2
Opsys:      linux  Arch: [NONE]
Speed:      1.00   Load: 0.000
Network:    [DEFAULT]
Features:   [NONE]
Attributes: [Batch]
Classes:    [default 0:2]

Total Time: 5:22:08:58  Up: 5:22:06:26 (99.97%)  Active: 3:03:18:08 (52.97%)

Reservations:
  Job '34'(x2)  -00:26:57 -> 7:33:03 (8:00:00)
JobList:  34

ALERT:  node is in state Busy but load is low (0.000)

[root@numbat bin]# ./checknode compute-0-1
checking node compute-0-1

State:      Busy  (in current state for 00:26:26)
Configured Resources: PROCS: 2  MEM: 734M  SWAP: 1734M  DISK: 1M
Utilized   Resources: PROCS: 2  SWAP: 1597M
Dedicated  Resources: PROCS: 2
Opsys:      linux  Arch: [NONE]
Speed:      1.00   Load: 5.180
Network:    [DEFAULT]
Features:   [NONE]
Attributes: [Batch]
Classes:    [default 0:2]

Total Time: 5:22:08:58  Up: 5:22:04:24 (99.95%)  Active: 4:31:48 (3.19%)

Reservations:
  Job '34'(x2)  -00:26:57 -> 7:33:03 (8:00:00)
JobList:  34

--------------

[root@numbat bin]# ./checkjob -v 34
checking job 34 (RM job '34.numbat.arc.nasa.gov')

State: Running
Creds:  user:tsadmin  group:tsadmin  class:default  qos:DEFAULT
WallTime: 00:26:03 of 8:00:00
SubmitTime: Thu Jun 20 10:43:24
  (Time Queued  Total: 00:00:01  Eligible: 00:00:01)

StartTime: Thu Jun 20 10:43:25
Total Tasks: 4

Req[0]  TaskCount: 4  Partition: DEFAULT
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Exec:  ''  ExecSize: 0  ImageSize: 0
Dedicated Resources Per Task: PROCS: 1
Utilized Resources Per Task:  PROCS: 1.52  MEM: 1.35  SWAP: 15.90
Avg Util Resources Per Task:  PROCS: 1.52
Max Util Resources Per Task:  PROCS: 1.52  MEM: 1.35  SWAP: 15.90
Average Utilized Memory: 131.65 MB
Average Utilized Procs: 5.82
NodeAccess: SHARED
TasksPerNode: 2  NodeCount: 2

Allocated Nodes:
[compute-0-1:2][compute-0-0:2]

Task Distribution: compute-0-1,compute-0-1,compute-0-0,compute-0-0

IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Flags:       RESTARTABLE

Reservation '34' (-00:25:55 -> 7:34:05  Duration: 8:00:00)
PE:  4.00  StartPriority:  1

------------

[root@numbat ~]# tracejob -v 34

/var/spool/torque/server_priv/accounting/20130620: Successfully located matching job records
/var/spool/torque/server_logs/20130620: Successfully located matching job records
/var/spool/torque/mom_logs/20130620: No such file or directory
/var/spool/torque/sched_logs/20130620: No such file or directory

Job: 34.numbat.arc.nasa.gov

06/20/2013 10:43:24  S  enqueuing into default, state 1 hop 1
06/20/2013 10:43:24  A  queue=default
06/20/2013 10:43:25  S  Job Run at request of [email protected]
06/20/2013 10:43:25  S  Not sending email: User does not want mail of this type.
06/20/2013 10:43:25  A  user=tsadmin group=tsadmin jobname=xhpl2node queue=default
                        ctime=1371750204 qtime=1371750204 etime=1371750204
                        start=1371750205 [email protected]
                        exec_host=compute-0-1/1+compute-0-1/0+compute-0-0/1+compute-0-0/0
                        Resource_List.neednodes=2:ppn=2 Resource_List.nodect=2
                        Resource_List.nodes=2:ppn=2 Resource_List.walltime=08:00:00

------------

06/20/2013 03:40:41;0002; pbs_mom.3431;Svr;pbs_mom;Torque Mom Version = 4.2.2, loglevel = 0
06/20/2013 03:43:06;0008; pbs_mom.3431;Job;34.numbat.arc.nasa.gov;JOIN JOB as node 1
06/20/2013 03:45:41;0002; pbs_mom.3431;Svr;pbs_mom;Torque Mom Version = 4.2.2, loglevel = 0
06/20/2013 10:40:52;0002; pbs_mom.6624;Svr;pbs_mom;Torque Mom Version = 4.2.2, loglevel = 0
06/20/2013 10:43:25;0001; pbs_mom.6624;Job;TMomFinalizeJob3;job 34.numbat.arc.nasa.gov started, pid = 36561
06/20/2013 10:45:52;0002; pbs_mom.6624;Svr;pbs_mom;Torque Mom Version = 4.2.2, loglevel = 0

-------------

Here is my standard test submit script:

#PBS -S /bin/bash
#PBS -l nodes=2:ppn=2,walltime=8:00:00
#PBS -j oe
#PBS -N xhpl2node
#PBS -m e
#
echo $PBS_O_WORKDIR
cd $PBS_O_WORKDIR
ln -fs HPL.dat2node HPL.dat
mpirun -v -bynode -np 4 ./xhpl

If I specify the hosts explicitly with mpirun's -H switch, the work goes to the correct nodes as expected. I suspect this means it's a maui issue rather than a torque issue, but I was hoping someone might have some ideas.

Thanks in advance

_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
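For what it's worth, the exec_host string in the accounting record encodes the allocation the same way the nodefile does, one host/slot pair per task. A quick one-liner (a sketch; the string is copied from the tracejob output above) tallies tasks per host from it:

```shell
# Split TORQUE's exec_host string into host/slot pairs and count
# tasks per host.  The value is taken verbatim from the accounting
# record in the tracejob output above.
EXEC_HOST='compute-0-1/1+compute-0-1/0+compute-0-0/1+compute-0-0/0'
echo "$EXEC_HOST" | tr '+' '\n' | cut -d/ -f1 | sort | uniq -c
# expected: two slots counted on each of compute-0-0 and compute-0-1
```

This confirms that the server-side record, at least, claims two tasks on each node, matching the Task Distribution line from checkjob.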

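Along the same lines, the per-node slot counts can be checked from inside a running job by tallying $PBS_NODEFILE. A minimal sketch; the fallback path is just the aux file shown at the top of the post and is included only as an illustration:

```shell
# Tally allocated slots per host from the TORQUE nodefile.
# $PBS_NODEFILE is only set inside a running job; the fallback path is
# the aux file shown earlier in this post, used here as an illustration.
NODEFILE="${PBS_NODEFILE:-/var/spool/torque/aux/36.numbat.arc.nasa.gov}"
if [ -r "$NODEFILE" ]; then
    # one hostname per slot, so the duplicate counts are per-node slot counts
    sort "$NODEFILE" | uniq -c
else
    echo "nodefile $NODEFILE not readable (run this inside a job)" >&2
fi
```

If the counts here disagree with where the MPI ranks actually land, that points at the launcher/scheduler handoff rather than the allocation itself.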