Hi!
There is a problem with node allocation for MPI jobs in my test cluster,
which has 2 identical nodes with 8 cores each (i.e. 16 cores in total).
A job submitted from the UI hangs in 'scheduled' status.
maui.log shows:
[...]
05/02 12:31:48 INFO: processing node request line '1:ppn=8+1:ppn=2'
05/02 12:31:48 INFO: job '1351' loaded: 10 mlp001 mlp 259200 Idle 0 1304325107 [NONE] [NONE] [NONE] >= 0 >= 0 [NONE] 1304325108
05/02 12:31:48 INFO: 1 PBS jobs detected on RM base
05/02 12:31:48 INFO: jobs detected: 1
05/02 12:31:48 INFO: total jobs selected (ALL): 1/1
05/02 12:31:48 INFO: total jobs selected (ALL): 1/1
05/02 12:31:48 INFO: total jobs selected in partition ALL: 1/1
05/02 12:31:48 INFO: total jobs selected in partition ALL: 1/1
05/02 12:31:48 INFO: total jobs selected in partition DEFAULT: 1/1
05/02 12:31:48 ERROR: cannot allocate nodes to job '1351' in partition DEFAULT
05/02 12:31:48 MJobPReserve(1351,DEFAULT,ResCount,ResCountRej)
05/02 12:31:48 INFO: total jobs selected in partition DEFAULT: 1/1
05/02 12:31:48 INFO: total jobs selected in partition ALL: 1/1
05/02 12:31:48 INFO: total jobs selected in partition DEFAULT: 1/1
05/02 12:31:48 MQueueBackFill(BFQueue,HARD,DEFAULT)
05/02 12:31:48 INFO: total jobs selected in partition DEFAULT: 0/1 [ReserveTime: 1]
05/02 12:31:48 INFO: total jobs selected in partition ALL: 1/1
[...]
$ showq
ACTIVE JOBS--------------------
JOBNAME USERNAME STATE PROC REMAINING STARTTIME
0 Active Jobs 0 of 16 Processors Active (0.00%)
0 of 2 Nodes Active (0.00%)
IDLE JOBS----------------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
1351 mlp001 Idle 8 3:00:00:00 Mon May 2 12:40:02
1 Idle Job
BLOCKED JOBS----------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
Total Jobs: 1 Active Jobs: 0 Idle Jobs: 1 Blocked Jobs: 0
The job runs fine if the requested CPU number is <= 8. If it is > 8, Maui
behaves as follows:
np=9, 1:ppn=8+1:ppn=1 => maui failed to allocate nodes
np=10, 1:ppn=8+1:ppn=2 => maui failed to allocate nodes
np=11, 1:ppn=8+1:ppn=3 => maui failed to allocate nodes
np=12, 1:ppn=8+1:ppn=4 => maui failed to allocate nodes
np=13, 1:ppn=8+1:ppn=5 => maui is able to allocate nodes
np=14, 1:ppn=8+1:ppn=6 => maui is able to allocate nodes
np=15, 1:ppn=8+1:ppn=7 => maui is able to allocate nodes
np=16, 2:ppn=8 => maui is able to allocate nodes
To test the combinations listed above I used commands like the following:
$ echo "/opt/openmpi/1.4.3/bin/mpirun -np 9 /data/hello_mpi" | qsub -l nodes=1:ppn=8+1:ppn=1 -q mlp
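For reference, the test runs above can be scripted. This is just a sketch of
the loop I effectively ran by hand; the spec_for helper is my own naming, and
the mpirun path, binary and queue name are specific to my test setup:

```shell
#!/bin/sh
# Build the Torque node request for a given process count on two 8-core nodes.
spec_for() {
    np=$1
    if [ "$np" -le 8 ]; then
        echo "1:ppn=$np"
    elif [ "$np" -eq 16 ]; then
        echo "2:ppn=8"
    else
        echo "1:ppn=8+1:ppn=$(( np - 8 ))"
    fi
}

# Walk through np=9..16 and show (or submit) the corresponding request.
for np in 9 10 11 12 13 14 15 16; do
    spec=$(spec_for "$np")
    echo "np=$np -> nodes=$spec"
    # Actual submission, as used on my cluster (uncomment to run):
    # echo "/opt/openmpi/1.4.3/bin/mpirun -np $np /data/hello_mpi" | \
    #     qsub -l nodes="$spec" -q mlp
done
```

With this, np=9..12 produce the requests that Maui rejects and np=13..16
the ones it accepts, exactly as in the table above.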
maui.cfg contains the following lines:
[...]
ENABLEMULTINODEJOBS TRUE
ENABLEMULTIREQJOBS TRUE
[...]
As the LRMS I use Torque version 2.3.13 (recompiled from source). I tried
both maui-3.2.6p21 and maui-3.3.1, each rebuilt against that Torque version
(i.e. 2.3.13).
I don't know whether it is relevant, but a submit filter is configured as well:
$ cat /var/spool/pbs/torque.cfg
SUBMITFILTER /var/spool/pbs/submit_filter.pl
The submit filter was downloaded from http://devel.ifca.es/rep/submit_filter.pl.
I found a description of a similar problem (thread:
http://www.mail-archive.com/[email protected]/msg02604.html),
but neither the cause of this behavior nor a solution is mentioned there.
Has anybody observed the same Maui behavior? Any ideas what could cause it
and how to fix it?
Regards,
Nikolay.
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers