Roy and I were able to find the cause of this. Backfill was breaking on NODEACCESSPOLICY SINGLEJOB. The following patch has the fix.
http://www.clusterresources.com/download/maui/snapshots/maui-3.2.6p21-snap.1243977349.tar.gz

Thanks,
Brian

Roy Dragseth wrote:
> Hi, sorry for the late reply.
>
> My config is as follows, four compute nodes with np=2, two have the "gige"
> feature and two have "ib". You'll find the config files below.
>
> Submit three jobs like this:
>
> echo sleep 1000 | qsub -lnodes=2:ppn=2:gige,walltime=3000
>
> wait for the first job to start, the two others should be queued.
>
> Then submit four jobs like this
>
> echo sleep 1000 | qsub -lnodes=1,walltime=3000
>
> What you should see then is three jobs starting and one job ending up queued
> leaving one job slot un-utilized:
>
> ACTIVE JOBS--------------------
> JOBNAME  USERNAME  STATE    PROC  REMAINING  STARTTIME
>
> 189      royd      Running  4     00:43:43   Sat Mar 28 23:47:52
> 192      royd      Running  1     00:45:21   Sat Mar 28 23:49:30
> 193      royd      Running  1     00:45:52   Sat Mar 28 23:50:01
> 194      royd      Running  1     00:45:52   Sat Mar 28 23:50:01
>
> 4 Active Jobs    7 of 8 Processors Active (87.50%)
>                  4 of 4 Nodes Active (100.00%)
>
> IDLE JOBS----------------------
> JOBNAME  USERNAME  STATE    PROC  WCLIMIT    QUEUETIME
>
> 190      royd      Idle     4     00:50:00   Sat Mar 28 23:47:52
> 191      royd      Idle     4     00:50:00   Sat Mar 28 23:47:53
> 195      royd      Idle     1     00:50:00   Sat Mar 28 23:49:31
>
> 3 Idle Jobs
>
> This illustrates the behaviour we see on our production cluster without the
> maui patch I submitted earlier.
>
> Here is the nodes file from my 4 node test cluster:
>
> compute-0-0 np=2 ib
> compute-0-1 np=2 ib
> compute-0-2 np=2 gige
> compute-0-3 np=2 gige
>
> and here is the maui.cfg
>
> RMPOLLINTERVAL        00:00:30
> JOBAGGREGATIONTIME    00:00:30
>
> SERVERHOST            hpc2.cc.uit.no
> SERVERPORT            42559
> SERVERMODE            NORMAL
> RMCFG[base]           TYPE=PBS
> ADMIN1                maui root
>
> LOGFILE               maui.log
> LOGFILEMAXSIZE        10000000
> LOGLEVEL              3
>
> BACKFILLPOLICY        FIRSTFIT
> RESERVATIONPOLICY     CURRENTHIGHEST
> NODEACCESSPOLICY      SINGLEUSER
>
> And here is the torque config, aka the output from qmgr -c " print server"
>
> $ qmgr -c "p s"
> #
> # Create queues and set their attributes.
> #
> #
> # Create and define queue default
> #
> create queue default
> set queue default queue_type = Execution
> set queue default enabled = True
> set queue default started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server acl_host_enable = False
> set server acl_hosts = hpc2.cc.uit.no
> set server managers = [email protected]
> set server managers += [email protected]
> set server default_queue = default
> set server log_events = 511
> set server mail_from = adm
> set server query_other_jobs = True
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server next_job_number = 196
>
> _______________________________________________
> mauiusers mailing list
> [email protected]
> http://www.supercluster.org/mailman/listinfo/mauiusers
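
For anyone who wants to re-run Roy's reproduction against the snapshot, the submission steps quoted above could be wrapped roughly as follows. This is only a sketch: the function names and the QSUB_CMD override are made up here for illustration (they are not part of Maui or Torque), and the resource strings are copied from the quoted qsub commands.

```shell
#!/bin/sh
# Sketch of the reproduction steps from the quoted message.
# QSUB_CMD is a hypothetical override so the script can be tried
# without a real Torque installation; it defaults to plain qsub.
QSUB_CMD="${QSUB_CMD:-qsub}"

# Submit one "sleep 1000" job with the given -l resource list.
submit() {
    echo "sleep 1000" | $QSUB_CMD -l "$1"
}

# Step 1: three jobs that each want two full "gige" nodes.
submit_wide_jobs() {
    for i in 1 2 3; do
        submit "nodes=2:ppn=2:gige,walltime=3000"
    done
}

# Step 2: four single-processor jobs with no feature request.
submit_narrow_jobs() {
    for i in 1 2 3 4; do
        submit "nodes=1,walltime=3000"
    done
}
```

Run submit_wide_jobs, wait until the first job is running, then run submit_narrow_jobs and check showq. Without the fix, the quoted output shows 7 of 8 processors active with one single-processor job left idle.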
