I have been trying to test a new build of a torque/maui package for Rocks. I am running a head node and two test compute nodes in a VMware environment under CentOS 6. The issue I'm seeing, which also appears to have been reported on the torque list by someone else, is that everything seems to go as expected, yet all the work units end up on one node, and maui then reports that the load on that node is too high. I am pasting a variety of log and command outputs below in case anyone has useful observations or suggestions.
[tsadmin@compute-0-1 ~]$ cat /var/spool/torque/aux/36.numbat.arc.nasa.gov
compute-0-1
compute-0-1
compute-0-0
compute-0-0

-----------------

[root@numbat bin]# ./checknode compute-0-0
checking node compute-0-0

State:      Busy  (in current state for 00:26:26)
Configured Resources: PROCS: 2  MEM: 734M  SWAP: 1734M  DISK: 1M
Utilized   Resources: PROCS: 2  SWAP: 138M
Dedicated  Resources: PROCS: 2
Opsys:      linux  Arch: [NONE]
Speed:      1.00   Load: 0.000
Network:    [DEFAULT]
Features:   [NONE]
Attributes: [Batch]
Classes:    [default 0:2]

Total Time: 5:22:08:58  Up: 5:22:06:26 (99.97%)  Active: 3:03:18:08 (52.97%)

Reservations:
  Job '34'(x2)  -00:26:57 -> 7:33:03 (8:00:00)
JobList:  34

ALERT:  node is in state Busy but load is low (0.000)

[root@numbat bin]# ./checknode compute-0-1
checking node compute-0-1

State:      Busy  (in current state for 00:26:26)
Configured Resources: PROCS: 2  MEM: 734M  SWAP: 1734M  DISK: 1M
Utilized   Resources: PROCS: 2  SWAP: 1597M
Dedicated  Resources: PROCS: 2
Opsys:      linux  Arch: [NONE]
Speed:      1.00   Load: 5.180
Network:    [DEFAULT]
Features:   [NONE]
Attributes: [Batch]
Classes:    [default 0:2]

Total Time: 5:22:08:58  Up: 5:22:04:24 (99.95%)  Active: 4:31:48 (3.19%)

Reservations:
  Job '34'(x2)  -00:26:57 -> 7:33:03 (8:00:00)
JobList:  34

--------------

[root@numbat bin]# ./checkjob -v 34
checking job 34 (RM job '34.numbat.arc.nasa.gov')

State: Running
Creds:  user:tsadmin  group:tsadmin  class:default  qos:DEFAULT
WallTime: 00:26:03 of 8:00:00
SubmitTime: Thu Jun 20 10:43:24
  (Time Queued  Total: 00:00:01  Eligible: 00:00:01)

StartTime: Thu Jun 20 10:43:25
Total Tasks: 4

Req[0]  TaskCount: 4  Partition: DEFAULT
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Exec:  ''  ExecSize: 0  ImageSize: 0
Dedicated Resources Per Task: PROCS: 1
Utilized Resources Per Task:  PROCS: 1.52  MEM: 1.35  SWAP: 15.90
Avg Util Resources Per Task:  PROCS: 1.52
Max Util Resources Per Task:  PROCS: 1.52  MEM: 1.35  SWAP: 15.90
Average Utilized Memory: 131.65 MB
Average Utilized Procs: 5.82
NodeAccess: SHARED
TasksPerNode: 2  NodeCount: 2

Allocated Nodes:
[compute-0-1:2][compute-0-0:2]

Task Distribution: compute-0-1,compute-0-1,compute-0-0,compute-0-0

IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [ALL]
Flags:       RESTARTABLE

Reservation '34' (-00:25:55 -> 7:34:05  Duration: 8:00:00)
PE:  4.00  StartPriority:  1

------------

[root@numbat ~]# tracejob -v 34

/var/spool/torque/server_priv/accounting/20130620: Successfully located matching job records
/var/spool/torque/server_logs/20130620: Successfully located matching job records
/var/spool/torque/mom_logs/20130620: No such file or directory
/var/spool/torque/sched_logs/20130620: No such file or directory

Job: 34.numbat.arc.nasa.gov

06/20/2013 10:43:24  S  enqueuing into default, state 1 hop 1
06/20/2013 10:43:24  A  queue=default
06/20/2013 10:43:25  S  Job Run at request of [email protected]
06/20/2013 10:43:25  S  Not sending email: User does not want mail of this type.
06/20/2013 10:43:25  A  user=tsadmin group=tsadmin jobname=xhpl2node queue=default
                        ctime=1371750204 qtime=1371750204 etime=1371750204
                        start=1371750205 [email protected]
                        exec_host=compute-0-1/1+compute-0-1/0+compute-0-0/1+compute-0-0/0
                        Resource_List.neednodes=2:ppn=2 Resource_List.nodect=2
                        Resource_List.nodes=2:ppn=2 Resource_List.walltime=08:00:00

------------

06/20/2013 03:40:41;0002; pbs_mom.3431;Svr;pbs_mom;Torque Mom Version = 4.2.2, loglevel = 0
06/20/2013 03:43:06;0008; pbs_mom.3431;Job;34.numbat.arc.nasa.gov;JOIN JOB as node 1
06/20/2013 03:45:41;0002; pbs_mom.3431;Svr;pbs_mom;Torque Mom Version = 4.2.2, loglevel = 0
06/20/2013 10:40:52;0002; pbs_mom.6624;Svr;pbs_mom;Torque Mom Version = 4.2.2, loglevel = 0
06/20/2013 10:43:25;0001; pbs_mom.6624;Job;TMomFinalizeJob3;job 34.numbat.arc.nasa.gov started, pid = 36561
06/20/2013 10:45:52;0002; pbs_mom.6624;Svr;pbs_mom;Torque Mom Version = 4.2.2, loglevel = 0

-------------

Here is my standard test submit script:

#PBS -S /bin/bash
#PBS -l nodes=2:ppn=2,walltime=8:00:00
#PBS -j oe
#PBS -N xhpl2node
#PBS -m e
#
echo $PBS_O_WORKDIR
cd $PBS_O_WORKDIR
ln -fs HPL.dat2node HPL.dat
mpirun -v -bynode -np 4 ./xhpl

If I specify the hosts explicitly with mpirun's -H switch, the work goes to the correct nodes as expected. I suspect this means it's a maui issue rather than a torque issue, but I was hoping someone might have some ideas.

Thanks in advance

_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
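For what it's worth, the exec_host string in the accounting record encodes the allocation the same way the nodefile does, one host/slot pair per task. A quick one-liner (a sketch; the string is copied from the tracejob output above) tallies tasks per host from it:

```shell
# Split TORQUE's exec_host string into host/slot pairs and count
# tasks per host.  The value is taken verbatim from the accounting
# record in the tracejob output above.
EXEC_HOST='compute-0-1/1+compute-0-1/0+compute-0-0/1+compute-0-0/0'
echo "$EXEC_HOST" | tr '+' '\n' | cut -d/ -f1 | sort | uniq -c
# expected: two slots counted on each of compute-0-0 and compute-0-1
```

This confirms that the server-side record, at least, claims two tasks on each node, matching the Task Distribution line from checkjob.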

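Along the same lines, the per-node slot counts can be checked from inside a running job by tallying $PBS_NODEFILE. A minimal sketch; the fallback path is just the aux file shown at the top of the post and is included only as an illustration:

```shell
# Tally allocated slots per host from the TORQUE nodefile.
# $PBS_NODEFILE is only set inside a running job; the fallback path is
# the aux file shown earlier in this post, used here as an illustration.
NODEFILE="${PBS_NODEFILE:-/var/spool/torque/aux/36.numbat.arc.nasa.gov}"
if [ -r "$NODEFILE" ]; then
    # one hostname per slot, so the duplicate counts are per-node slot counts
    sort "$NODEFILE" | uniq -c
else
    echo "nodefile $NODEFILE not readable (run this inside a job)" >&2
fi
```

If the counts here disagree with where the MPI ranks actually land, that points at the launcher/scheduler handoff rather than the allocation itself.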