Hello,

For the most part our maui config is working well, but occasionally we get 
into a situation where larger jobs will not start.  I think this is mostly an 
issue when a user is competing with themselves, in this case "ema".  We have 3 
processor types on our system, so the qstat listing below shows only the 
jobs on the sandy bridge (san) nodes.  As you can see, job 314578 was submitted 
first (it's been in the queue for a week), but his other smaller jobs keep 
starting before the larger job, even though the start priority of the large 
job is double that of the smaller jobs.  It doesn't help that there are some 
other large jobs (316488 and 316511) in a non-blocked state; I'm sure maui 
is draining for these jobs.  But I would think that at some point ema's large 
job would get prioritized over his own smaller jobs.  There are no reservations 
on any of the san nodes.  Any ideas?

Thanks,
Darby



Job ID Username Queue   Jobname          N:ppn Proc Wall  S Elap      SP Features
------ -------- ------- ---------------- ----- ---- ----- - ----- ------ --------------
314578 ema      normal  F9_case2_refined 40:16  640 08:00 Q --     21165  san
316460 ema      normal  m0.60a175r-60_dr  6:16   96 05:30 R 01:45  10185  san
316467 dgs      normal  m0.175a0.0_rwb02  9:16  144 04:00 R 02:31  10102  san
316477 flumpkin normal  DC1_vent         16:16  256 08:00 R 02:21  10040  san
316483 ema      normal  m0.30a160r22.5_d  6:16   96 05:30 R 01:45  10066  san
316488 mwhite   normal  v8.0396_h55.943_ 21:16  336 08:00 Q --     10130  san
316511 flumpkin normal  DC1_vent         16:16  256 08:00 Q --     10067  san
316512 ema      normal  m0.30a150r-90_dr  6:16   96 05:30 R 01:01  10000  san
316513 stuart   normal  m0.00a180_ICC=1_  6:16   96 03:00 R 00:53  10000  san
316528 ema      normal  m0.90a165r-90_dr  6:16   96 05:30 R 00:12  10000  san


In the above output "SP" is the "StartPriority" of the jobs.
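
As a sanity check on the "double" priority, here is a rough Python sketch of how the large job's StartPriority seems to be built up. It assumes the queue-time component adds one point per minute of eligible time (QUEUETIMEWEIGHT is 1 in the config below) and that every job starts from a base of 10000. The base is only inferred from the freshly started jobs in the listing, it is not taken from maui.cfg, and the function names are made up for illustration.

```python
# Rough model: StartPriority ~= base + QUEUETIMEWEIGHT * minutes_queued.
# ASSUMPTION: the base of 10000 is inferred from the jobs that show
# SP 10000 in the listing; it does not appear in the maui.cfg excerpt.

def queue_minutes(days: int, hours: int, minutes: int) -> int:
    """Convert a d:hh:mm queue time into whole minutes."""
    return days * 24 * 60 + hours * 60 + minutes


def estimated_priority(base: int, weight: int, minutes: int) -> int:
    """Base priority plus the weighted queue-time component."""
    return base + weight * minutes


# "Eligible: 7:17:54:25" from the checkjob output for job 314578
eligible = queue_minutes(7, 17, 54)
print(eligible)                                # 11154
print(estimated_priority(10000, 1, eligible))  # 21154
```

That lands within a point of the 21153 that checkjob reports, so the large job's higher priority looks like it comes almost entirely from its week in the queue.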



# checkjob 314578


checking job 314578

State: Idle
Creds:  user:ema  group:eg3  class:normal  qos:DEFAULT
WallTime: 00:00:00 of 8:00:00
SubmitTime: Tue May  5 22:22:36
  (Time Queued  Total: 7:17:56:26  Eligible: 7:17:54:25)

Total Tasks: 640

Req[0]  TaskCount: 640  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [san]


IWD: [NONE]  Executable:  [NONE]
Bypass: 25  StartCount: 0
PartitionMask: [ALL]
Flags:       RESTARTABLE

PE:  640.00  StartPriority:  21153
job cannot run in partition DEFAULT.  (job 314578 violates active SOFT MAXJOB limit of 2 for user ema  (R: 1, U: 10)
)
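
To make the checkjob message concrete, here is a small Python sketch of the limit test as I read it. It assumes "R" is what this job requests (1 job slot) and "U" is what ema currently uses (10 running jobs, which matches the showq output below), and that the second number in MAXJOB=2,40 is the hard limit. This is a simplified model for illustration, not Maui's actual code, and the names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Limit:
    soft: int
    hard: int


def violates(limit_value: int, used: int, requested: int) -> bool:
    """True if starting the job would push usage past the limit."""
    return used + requested > limit_value


# ASSUMPTION: MAXJOB=2,40 means soft limit 2, hard limit 40.
maxjob = Limit(soft=2, hard=40)
used, requested = 10, 1  # "(R: 1, U: 10)" from checkjob

print(violates(maxjob.soft, used, requested))  # True
print(violates(maxjob.hard, used, requested))  # False
```

Under this reading, every one of ema's queued jobs exceeds the soft limit by the same margin while staying well inside the hard limit, which is consistent with all of them showing up under BLOCKED JOBS.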









# showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

316322             gsalaza3    Running   360    00:33:17  Wed May 13 08:52:50
316357             aschwing    Running   192    00:51:01  Wed May 13 09:10:34
316327             pmcclou1    Running   576     1:01:54  Tue May 12 21:21:27
316467                  dgs    Running   144     1:39:04  Wed May 13 13:58:37
316513               stuart    Running    96     2:16:53  Wed May 13 15:36:26
316257                  ema    Running    96     3:10:09  Wed May 13 13:59:42
316500             aschwing    Running    24     3:13:21  Wed May 13 15:32:54
316273                  ema    Running    96     3:55:01  Wed May 13 14:44:34
316460                  ema    Running    96     3:55:01  Wed May 13 14:44:34
316483                  ema    Running    96     3:55:01  Wed May 13 14:44:34
316363             gsalaza3    Running   360     3:55:09  Wed May 13 12:14:42
316491                  ema    Running    96     3:59:11  Wed May 13 14:48:44
316494                  ema    Running    96     4:03:18  Wed May 13 14:52:51
316367             aschwing    Running   192     4:16:23  Wed May 13 12:35:56
316423               mwhite    Running   168     4:19:45  Wed May 13 12:39:18
316512                  ema    Running    96     4:37:44  Wed May 13 15:27:17
316514                  ema    Running    96     5:03:09  Wed May 13 15:52:42
316516                  ema    Running    96     5:05:27  Wed May 13 15:55:00
316528                  ema    Running    96     5:27:24  Wed May 13 16:16:57
316477             flumpkin    Running   256     5:48:17  Wed May 13 14:07:50
316499              boliver    Running    24     6:49:56  Wed May 13 15:09:29
316496             marichal    Running   240     7:03:25  Wed May 13 15:22:58
316515              boliver    Running    56     7:33:25  Wed May 13 15:52:58

    23 Active Jobs    3648 of 3888 Processors Active (93.83%)
                       263 of  282 Nodes Active      (93.26%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

316458               mwhite       Idle   336     8:00:00  Wed May 13 11:12:53
316487                 jmai       Idle   240     8:00:00  Wed May 13 14:15:53
316488               mwhite       Idle   336     8:00:00  Wed May 13 14:18:39
316511             flumpkin       Idle   256     8:00:00  Wed May 13 15:22:48

4 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

314578                  ema       Idle   640     8:00:00  Tue May  5 22:22:36
316344                  ema       Idle    96     5:30:00  Tue May 12 20:41:40
316378                  ema       Idle    96     5:30:00  Wed May 13 00:39:24
316379                  ema       Idle    96     5:30:00  Wed May 13 00:51:31
316386                  ema       Idle    96     5:30:00  Wed May 13 01:32:58
316388                  ema       Idle    96     5:30:00  Wed May 13 01:40:10
316393                  ema       Idle    96     5:30:00  Wed May 13 02:28:01
316398                  ema       Idle    96     5:30:00  Wed May 13 03:09:06
316402                  ema       Idle    96     5:30:00  Wed May 13 03:44:30
316405                  ema       Idle    96     5:30:00  Wed May 13 04:21:13
316415                  ema       Idle    96     5:30:00  Wed May 13 06:14:25
316428             aschwing       Idle   192     8:00:00  Wed May 13 07:57:56
316445                  ema       Idle    96     5:30:00  Wed May 13 09:26:48
316464             pmcclou1       Idle   336     8:00:00  Wed May 13 11:59:21
316466             gsalaza3       Idle   360     8:00:00  Wed May 13 12:13:50
316468                  ema       Idle    96     5:30:00  Wed May 13 12:28:43
316470             aschwing       Idle   192     8:00:00  Wed May 13 12:35:01
316471             aschwing       Idle   192     8:00:00  Wed May 13 12:35:04
316472             aschwing       Idle   192     8:00:00  Wed May 13 12:35:06
316473                  ema       Idle    96     5:30:00  Wed May 13 12:38:55
316485                  ema       Idle    96     5:30:00  Wed May 13 13:59:37
316489                  ema       Idle    96     5:30:00  Wed May 13 14:23:59
316490                  ema       Idle    96     5:30:00  Wed May 13 14:33:10
316492             breddell       Idle    60     8:00:00  Wed May 13 14:49:22
316493             breddell       Idle    60     8:00:00  Wed May 13 14:49:28
316501             aschwing       Idle    24     4:00:00  Wed May 13 15:12:13
316502             aschwing       Idle    24     4:00:00  Wed May 13 15:12:13
316503             aschwing       Idle    24     4:00:00  Wed May 13 15:12:13
316504             aschwing       Idle    24     4:00:00  Wed May 13 15:12:13
316505             aschwing       Idle    24     4:00:00  Wed May 13 15:12:13
316506             aschwing       Idle    24     4:00:00  Wed May 13 15:12:13
316507             aschwing       Idle    24     4:00:00  Wed May 13 15:12:13
316508             aschwing       Idle    24     8:00:00  Wed May 13 15:12:13
316509             aschwing       Idle    24     4:00:00  Wed May 13 15:12:13
316510             aschwing       Idle    24     4:00:00  Wed May 13 15:12:13
316517             aschwing       Idle    24     4:00:00  Wed May 13 16:09:35
316518             aschwing       Idle    24     4:00:00  Wed May 13 16:09:35
316519             aschwing       Idle    24     4:00:00  Wed May 13 16:09:35
316520             aschwing       Idle    24     4:00:00  Wed May 13 16:09:35
316521             aschwing       Idle    24     4:00:00  Wed May 13 16:09:35
316522             aschwing       Idle    24     4:00:00  Wed May 13 16:09:35
316523             aschwing       Idle    24     4:00:00  Wed May 13 16:09:35
316524             aschwing       Idle    24     4:00:00  Wed May 13 16:09:35
316525             aschwing       Idle    24     8:00:00  Wed May 13 16:09:35
316526             aschwing       Idle    24     4:00:00  Wed May 13 16:09:35
316527             aschwing       Idle    24     4:00:00  Wed May 13 16:09:35

Total Jobs: 73   Active Jobs: 23   Idle Jobs: 4   Blocked Jobs: 46
[root@service0 etc]#





Here are the relevant parts of maui.cfg


QUEUETIMEWEIGHT       1
BACKFILLPOLICY        FIRSTFIT
RESERVATIONPOLICY     CURRENTHIGHEST

NODEALLOCATIONPOLICY  PRIORITY
NODEACCESSPOLICY      SINGLEJOB
ENABLEMULTIREQJOBS    TRUE
JOBNODEMATCHPOLICY    EXACTNODE


FSWEIGHT 0
CREDWEIGHT 1
CLASSWEIGHT 1

USERCFG[DEFAULT] MAXJOB=2,40 MAXPROC=769,10000
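
For anyone reading the USERCFG line, this is how I understand the two-number form, sketched as a small Python parser. It assumes each "NAME=a,b" pair is "soft,hard" (the checkjob message above does report a soft MAXJOB limit of 2) and treats a single value as both soft and hard, which is my own simplification. The function name is hypothetical.

```python
def parse_limits(line: str) -> dict[str, tuple[int, int]]:
    """Split a USERCFG line into {NAME: (soft, hard)} pairs."""
    _, *settings = line.split()  # drop the leading USERCFG[...] token
    limits = {}
    for setting in settings:
        name, _, value = setting.partition("=")
        parts = [int(p) for p in value.split(",")]
        soft = parts[0]
        # ASSUMPTION: a lone value acts as both the soft and the hard limit.
        hard = parts[1] if len(parts) > 1 else soft
        limits[name] = (soft, hard)
    return limits


limits = parse_limits("USERCFG[DEFAULT] MAXJOB=2,40 MAXPROC=769,10000")
print(limits)  # {'MAXJOB': (2, 40), 'MAXPROC': (769, 10000)}
```

With ema's ten running jobs at 96 processors each (960 in total), the soft MAXPROC value of 769 is already exceeded as well, before the 640-processor job is even considered.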
