Hi,

I tried to start a 32-core job on 7 quadcore and 2 dualcore machines:

    qsub -l nodes=7:ppn=4+2:ppn=2 ...
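
As a sanity check on the request size, here is a small Python sketch that
adds up the processor slots in the nodes specification. The helper
total_slots is hypothetical, written only for this mail (it is not part of
torque or maui), and it assumes every chunk starts with a numeric node
count rather than a host name.

```python
def total_slots(nodes_spec):
    """Sum the processor slots in a nodes spec such as '7:ppn=4+2:ppn=2'."""
    total = 0
    for chunk in nodes_spec.split("+"):
        parts = chunk.split(":")
        count = int(parts[0])  # assumption: numeric count, not a host name
        ppn = 1                # one slot per node when no ppn is given
        for prop in parts[1:]:
            if prop.startswith("ppn="):
                ppn = int(prop[len("ppn="):])
        total += count * ppn
    return total


print(total_slots("7:ppn=4+2:ppn=2"))  # 32
```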

With torque's FIFO scheduler (pbs_sched), the job starts as expected.

With maui, I have set

    ENABLEMULTIREQJOBS TRUE

but the job gets deferred and never starts.
The reason shows up in an error message in maui.log:

    12/14 13:24:31 ERROR:    job '40' cannot be started: (rc: 15064
        errmsg: 'Unknown node '  hostlist: 'dong2:ppn=6+dong3:ppn=4+dong4:ppn=6
        +gdong1:ppn=4+gdong2:ppn=4+gdong3:ppn=4+gdong4:ppn=4')
    12/14 13:24:31 ALERT:    cannot start job 40 (RM 'base' failed in function
        'jobstart')

The "hostlist" correctly lists the 7 quadcore hosts, but instead of
adding the 2 dualcores, it overloads two of the quadcores (dong2 and
dong4 get "ppn=6").
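
To make this concrete, here is a small Python sketch that tallies the
hostlist copied from the log message above. The function name and the
CORES_PER_HOST constant are my own, chosen for illustration; the constant
encodes the assumption that every host named in the hostlist is a
quadcore. Nothing here is part of maui or torque.

```python
# Hostlist string as reported in maui.log (line wrapping removed).
hostlist = ("dong2:ppn=6+dong3:ppn=4+dong4:ppn=6"
            "+gdong1:ppn=4+gdong2:ppn=4+gdong3:ppn=4+gdong4:ppn=4")

CORES_PER_HOST = 4  # assumption: all hosts in the hostlist are quadcores


def parse_hostlist(text):
    """Return a dict mapping host name to the ppn value assigned to it."""
    slots = {}
    for entry in text.split("+"):
        host, _, ppn = entry.partition(":ppn=")
        slots[host] = int(ppn)
    return slots


slots = parse_hostlist(hostlist)
overloaded = sorted(h for h, n in slots.items() if n > CORES_PER_HOST)

print(len(slots))           # 7
print(sum(slots.values()))  # 32
print(overloaded)           # ['dong2', 'dong4']
```

So the total of 32 slots is preserved, but the 4 slots meant for the two
dualcores end up as 2 extra slots each on dong2 and dong4.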

The bug is seen with torque-2.5.9 and maui-3.3 as well as maui-3.3.1.

A smaller request such as

     qsub -l nodes=4:ppn=4+2:ppn=2

does work in the same configuration. Maybe the hostlist has to reach a
certain size or complexity to trigger the bug?

Please let me know if you need more info for debugging the case.

Best regards,
Burkhard Bunk.
----------------------------------------------------------------------
  [email protected]      Physics Institute, Humboldt University
  fax:    ++49-30 2093 7628     Newtonstr. 15
  phone:  ++49-30 2093 7980     12489 Berlin, Germany
----------------------------------------------------------------------