Hi!

There is a problem with node allocation for MPI jobs in my test cluster, 
which has 2 identical nodes with 8 cores each (i.e. 16 cores in total).
A job submitted from the UI hangs in the scheduled state.
maui.log shows:
[...]
05/02 12:31:48 INFO:     processing node request line '1:ppn=8+1:ppn=2'
05/02 12:31:48 INFO:     job '1351' loaded:  10   mlp001      mlp  259200       Idle   0 1304325107   [NONE] [NONE] [NONE] >=      0  >=      0 [NONE] 1304325108
05/02 12:31:48 INFO:     1 PBS jobs detected on RM base
05/02 12:31:48 INFO:     jobs detected: 1
05/02 12:31:48 INFO:     total jobs selected (ALL): 1/1
05/02 12:31:48 INFO:     total jobs selected (ALL): 1/1
05/02 12:31:48 INFO:     total jobs selected in partition ALL: 1/1
05/02 12:31:48 INFO:     total jobs selected in partition ALL: 1/1
05/02 12:31:48 INFO:     total jobs selected in partition DEFAULT: 1/1
05/02 12:31:48 ERROR:    cannot allocate nodes to job '1351' in partition DEFAULT
05/02 12:31:48 MJobPReserve(1351,DEFAULT,ResCount,ResCountRej)
05/02 12:31:48 INFO:     total jobs selected in partition DEFAULT: 1/1
05/02 12:31:48 INFO:     total jobs selected in partition ALL: 1/1
05/02 12:31:48 INFO:     total jobs selected in partition DEFAULT: 1/1
05/02 12:31:48 MQueueBackFill(BFQueue,HARD,DEFAULT)
05/02 12:31:48 INFO:     total jobs selected in partition DEFAULT: 0/1 [ReserveTime: 1]
05/02 12:31:48 INFO:     total jobs selected in partition ALL: 1/1
[...]

$ showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING STARTTIME


      0 Active Jobs       0 of   16 Processors Active (0.00%)
                          0 of    2 Nodes Active      (0.00%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT QUEUETIME

1351                 mlp001       Idle     8  3:00:00:00  Mon May  2 12:40:02

1 Idle Job

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT QUEUETIME


Total Jobs: 1   Active Jobs: 0   Idle Jobs: 1   Blocked Jobs: 0


The job runs fine if the requested number of CPUs is <= 8. For np > 8, 
Maui behaves as follows:
np=9,  1:ppn=8+1:ppn=1 => Maui fails to allocate nodes
np=10, 1:ppn=8+1:ppn=2 => Maui fails to allocate nodes
np=11, 1:ppn=8+1:ppn=3 => Maui fails to allocate nodes
np=12, 1:ppn=8+1:ppn=4 => Maui fails to allocate nodes
np=13, 1:ppn=8+1:ppn=5 => Maui is able to allocate nodes
np=14, 1:ppn=8+1:ppn=6 => Maui is able to allocate nodes
np=15, 1:ppn=8+1:ppn=7 => Maui is able to allocate nodes
np=16, 2:ppn=8         => Maui is able to allocate nodes

To test all the combinations listed above I used commands like the following (here for np=9):

$ echo "/opt/openmpi/1.4.3/bin/mpirun -np 9 /data/hello_mpi" | qsub -l nodes=1:ppn=8+1:ppn=1 -q mlp
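
For the record, the same sweep can be scripted; a minimal loop like the one 
below covers the np=9..15 cases (np=16 was submitted separately as 2:ppn=8, 
as listed above), reusing the same mlp queue and hello_mpi test binary:

$ for np in $(seq 9 15); do
>   echo "/opt/openmpi/1.4.3/bin/mpirun -np $np /data/hello_mpi" | \
>     qsub -l nodes=1:ppn=8+1:ppn=$((np - 8)) -q mlp
> done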

maui.cfg contains the following relevant lines:
[...]
ENABLEMULTINODEJOBS     TRUE
ENABLEMULTIREQJOBS      TRUE
[...]
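
If more diagnostic output would help, I can post what the usual inspection 
commands report, e.g.:

$ pbsnodes -a        # Torque's view of each node (both should report np=8)
$ diagnose -n        # Maui's view of configured/available processors
$ checkjob -v 1351   # detailed allocation diagnostics for the stuck job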

As the LRMS I use Torque version 2.3.13 (recompiled from source). I tried 
both maui-3.2.6p21 and maui-3.3.1, each rebuilt against that Torque version 
(i.e. 2.3.13).
I don't know whether it is relevant, but a submit filter is configured as well:
$ cat /var/spool/pbs/torque.cfg
SUBMITFILTER /var/spool/pbs/submit_filter.pl

The submit filter was downloaded from http://devel.ifca.es/rep/submit_filter.pl.
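
To rule the filter itself out, it can be run by hand: as far as I 
understand, qsub pipes the job script to the filter on stdin, passes its own 
command-line options as arguments, and submits whatever the filter prints on 
stdout. So something like

$ echo "/opt/openmpi/1.4.3/bin/mpirun -np 9 /data/hello_mpi" | \
    /var/spool/pbs/submit_filter.pl -l nodes=1:ppn=8+1:ppn=1 -q mlp

should show whether the filter rewrites the nodes= request before Torque 
ever sees it.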

I found a description of a similar problem in this thread:
http://www.mail-archive.com/[email protected]/msg02604.html
but neither the cause of this behavior nor a solution is mentioned there.

Has anybody else observed the same Maui behavior? Any ideas what could 
cause it and how to fix it?

Regards,
Nikolay.
