Hi,
We have multiple sets of nodes and queues. As far as possible, we try to
send jobs from a given queue to one set of nodes first and, only if those
are all busy, to another set.
queue parallel -> d,f,k nodes (10 of each, total 30)
queue medium64 -> a and l nodes + temp1 (total of 25)
In the past we have used SRCFG entries for this, as below:
# Tie E4400, E2160 and E6600 machines to the medium64 queue
SRCFG[medium64] HOSTLIST=a01,a02,a03,a04,a05,a06,a07,a08,a09,a10,a11,a12,a13,a14,a15,a16,a17,a18,a19,a20,l01,l02,l03,l04,temp1
SRCFG[medium64] CLASSLIST=medium64
SRCFG[medium64] PERIOD=INFINITY
SRCFG[medium64] RESOURCES=PROCS:-1
# Tie the parallel queue to the quad core phenoms cpus
SRCFG[parallel] HOSTLIST=f01,f02,f03,f04,f05,f06,f07,f08,f09,f10,d01,d02,d03,d04,d05,d06,d07,d08,d09,d10,k01,k02,k03,k04,k05,k06,k07,k08,k09,k10
SRCFG[parallel] CLASSLIST=parallel,medium64-
SRCFG[parallel] PERIOD=INFINITY
SRCFG[parallel] RESOURCES=PROCS:-1
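As a sanity check on the two host lists above, here is a minimal Python
sketch that counts the hosts in each reservation by name prefix. It is not
part of the Maui configuration; the helper name count_by_prefix is made up
for illustration, and the host strings are simply copied from the SRCFG
lines.

```python
import re
from collections import Counter

# Host lists copied verbatim from the SRCFG entries above.
MEDIUM64_HOSTS = (
    "a01,a02,a03,a04,a05,a06,a07,a08,a09,a10,"
    "a11,a12,a13,a14,a15,a16,a17,a18,a19,a20,"
    "l01,l02,l03,l04,temp1"
)
PARALLEL_HOSTS = (
    "f01,f02,f03,f04,f05,f06,f07,f08,f09,f10,"
    "d01,d02,d03,d04,d05,d06,d07,d08,d09,d10,"
    "k01,k02,k03,k04,k05,k06,k07,k08,k09,k10"
)


def count_by_prefix(hostlist):
    """Count hosts per leading letter group, e.g. 'a', 'l' or 'temp'."""
    counts = Counter()
    for host in hostlist.split(","):
        host = host.strip()
        if not host:
            continue  # skip empty fields left by stray commas
        prefix = re.match(r"[a-z]+", host).group(0)
        counts[prefix] += 1
    return dict(counts)


medium64 = count_by_prefix(MEDIUM64_HOSTS)
parallel = count_by_prefix(PARALLEL_HOSTS)
print("medium64:", medium64, "total", sum(medium64.values()))
print("parallel:", parallel, "total", sum(parallel.values()))
```

The totals should match the 25 and 30 nodes quoted at the top of this
message, which rules out a typo in either HOSTLIST.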
However, what now happens (with maui-3.2.21) is that any job submitted
to the medium64 queue is always sent to the f, d or k nodes first and
not to the a machines.
In fact, when evaluating nodes for the job, maui does not even consider
the a machines to be available:
03/17 11:01:40 INFO: processing node request line '1:ppn=1'
03/17 11:01:40 INFO: job '343129' loaded: 1 jon staff 1209600 Idle 0 1237258899 [NONE] [NONE] [NONE] >= 0 >= 0 [NONE] 1237258900
03/17 11:01:40 INFO: 15 PBS jobs detected on RM vanguard
03/17 11:01:40 INFO: jobs detected: 15
03/17 11:01:40 INFO: total jobs selected (ALL): 1/15 [State: 14]
03/17 11:01:40 INFO: total jobs selected (ALL): 1/15 [State: 14]
03/17 11:01:40 INFO: total jobs selected in partition ALL: 1/1
03/17 11:01:40 MQueueScheduleRJobs(Q)
03/17 11:01:40 INFO: total jobs selected in partition ALL: 1/1
03/17 11:01:40 INFO: total jobs selected in partition DEFAULT: 1/1
03/17 11:01:40 MQueueScheduleIJobs(Q,DEFAULT)
03/17 11:01:40 INFO: 370 feasible tasks found for job 343129:0 in partition DEFAULT (1 Needed)
03/17 11:01:40 INFO: tasks located for job 343129: 1 of 1 required (120 feasible)
03/17 11:01:40 MJobStart(343129)
03/17 11:01:40 MRMJobStart(343129,Msg,SC)
03/17 11:01:40 MPBSJobStart(343129,vanguard,Msg,SC)
03/17 11:01:40 MPBSJobModify(343129,Resource_List,Resource,k10)
03/17 11:01:40 MPBSJobModify(343129,Resource_List,Resource,1:ppn=1)
03/17 11:01:40 INFO: job '343129' successfully started
03/17 11:01:40 INFO: starting job '343129'
03/17 11:01:40 INFO: 1 jobs started on iteration 2
Active Jobs------
The "120 feasible" figure indicates that the a machines are not being
considered, because 120 is the number of CPUs available on the 30 d, k
and f machines.
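To make that arithmetic explicit, here is a small Python sketch that pulls
the feasible-task count out of the log line and compares it with the CPU
count of the parallel nodes. It assumes that every d, f and k node is a
quad-core machine (4 CPUs each), which is my reading of the "quad core
phenoms" comment in the config; the variable names are illustrative only.

```python
import re

# Log line from the scheduler output above, with the wrapped part rejoined.
LOG_LINE = (
    "03/17 11:01:40 INFO: tasks located for job 343129: "
    "1 of 1 required (120 feasible)"
)

PARALLEL_NODES = 30   # d, f and k nodes, 10 of each
CORES_PER_NODE = 4    # assumption: all d/f/k nodes are quad core

match = re.search(
    r"tasks located for job (\d+):\s+(\d+) of (\d+) required\s+"
    r"\((\d+) feasible\)",
    LOG_LINE,
)
job_id = match.group(1)
feasible = int(match.group(4))
expected = PARALLEL_NODES * CORES_PER_NODE

print("job", job_id, "feasible tasks:", feasible)
print("CPUs on d/f/k nodes:", expected)
print("matches d/f/k nodes only:", feasible == expected)
```

If the a, l and temp1 nodes were also being offered to the job, the
feasible count would have to be larger than the d/f/k total.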
This used to work in the past. NODEALLOCATIONPOLICY is set to
MINRESOURCE and BACKFILLPOLICY to BESTFIT.
We have a couple of other queues linked in a similar manner and they
seem to be working fine, but in this case it just doesn't work as I
expect it to. Obviously I have something wrong, so any help would be
appreciated.
Jon
_______________________________________________
mauiusers mailing list
http://www.supercluster.org/mailman/listinfo/mauiusers