Hi,
I tried to start a 32-core job on 7 quadcore and 2 dualcore machines (9 nodes):
qsub -l nodes=7:ppn=4+2:ppn=2 ...
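For reference, the request above adds up to 7*4 + 2*2 = 32 processors. A small Python sketch of that arithmetic follows; the helper names are made up for illustration and are not part of torque or maui, and the parser assumes every chunk starts with a numeric node count (no explicit hostnames or node properties).

```python
def parse_nodes_spec(spec):
    """Expand a '-l nodes=' value like '7:ppn=4+2:ppn=2' into (count, ppn) pairs.

    Assumption: each '+'-separated chunk begins with a numeric node count.
    """
    chunks = []
    for part in spec.split("+"):
        fields = part.split(":")
        count = int(fields[0])
        ppn = 1  # one processor per node when ppn is not given
        for field in fields[1:]:
            if field.startswith("ppn="):
                ppn = int(field[len("ppn="):])
        chunks.append((count, ppn))
    return chunks


def total_procs(spec):
    """Total number of processors requested by a nodes spec."""
    return sum(count * ppn for count, ppn in parse_nodes_spec(spec))


print(parse_nodes_spec("7:ppn=4+2:ppn=2"))  # [(7, 4), (2, 2)]
print(total_procs("7:ppn=4+2:ppn=2"))       # 32
```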
With torque's FIFO scheduler (pbs_sched), the job starts as expected.
With maui, I have
ENABLEMULTIREQJOBS TRUE
but the job gets deferred and never starts.
The reason is revealed by an error message in maui.log:
12/14 13:24:31 ERROR: job '40' cannot be started: (rc: 15064
errmsg: 'Unknown node ' hostlist: 'dong2:ppn=6+dong3:ppn=4+dong4:ppn=6
+gdong1:ppn=4+gdong2:ppn=4+gdong3:ppn=4+gdong4:ppn=4')
12/14 13:24:31 ALERT: cannot start job 40 (RM 'base' failed in function
'jobstart')
The "hostlist" correctly lists the 7 quadcore hosts, but instead of
adding the 2 dualcores, it overloads two of the quadcores ("ppn=6").
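The overload can be checked mechanically. The following Python sketch parses the hostlist from the log message and compares each ppn value against the core count of the host. The function names are hypothetical, and the 4-core figure for every listed host is taken from the description above (all seven are quadcores), not queried from torque.

```python
def parse_hostlist(hostlist):
    """Turn 'hostA:ppn=4+hostB:ppn=6' into {'hostA': 4, 'hostB': 6}."""
    hosts = {}
    for entry in hostlist.split("+"):
        name, _, ppn = entry.partition(":ppn=")
        hosts[name] = int(ppn)
    return hosts


def overloaded(hostlist, cores):
    """Return the hosts that were assigned more processors than they have."""
    assigned = parse_hostlist(hostlist)
    return {h: p for h, p in assigned.items() if p > cores.get(h, 0)}


# Hostlist copied from the maui.log entry (line wrap removed).
HOSTLIST = ("dong2:ppn=6+dong3:ppn=4+dong4:ppn=6"
            "+gdong1:ppn=4+gdong2:ppn=4+gdong3:ppn=4+gdong4:ppn=4")

# Assumption: every host in the list is one of the quadcores.
CORES = {name: 4 for name in parse_hostlist(HOSTLIST)}

print(sum(parse_hostlist(HOSTLIST).values()))  # 32
print(overloaded(HOSTLIST, CORES))             # {'dong2': 6, 'dong4': 6}
```

So the total of 32 processors is preserved, but the 4 processors meant for the two dualcores end up on dong2 and dong4.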
The bug is seen with torque-2.5.9 and maui-3.3 as well as maui-3.3.1.
Testing with "smaller" requests like
qsub -l nodes=4:ppn=4+2:ppn=2
does indeed work in the same configuration. Maybe hostlists of a certain
size/complexity are needed to trigger the buggy behavior?
Please let me know if you need more info for debugging the case.
Best regards,
Burkhard Bunk.
----------------------------------------------------------------------
[email protected] Physics Institute, Humboldt University
fax: ++49-30 2093 7628 Newtonstr. 15
phone: ++49-30 2093 7980 12489 Berlin, Germany
----------------------------------------------------------------------