Hello,
For the most part our Maui config is working well, but occasionally we get
into a situation where larger jobs will not start. I think this is mostly an
issue when a user is competing with themselves, in this case "ema". We have 3
processor types on our system, so the qstat listing below shows only the
jobs on the Sandy Bridge ("san") nodes. As you can see, job 314578 was
submitted first (it's been in the queue for a week), but his other, smaller
jobs keep starting before the larger job, even though the start priority of
the large job is double that of the smaller jobs. It doesn't help that there
are some other large jobs (316488 and 316511) in a non-blocked state; I'm
sure Maui is draining for these jobs. But I would think that at some point
ema's large job would get prioritized over his own smaller jobs. There are no
reservations on any of the san nodes. Any ideas?
Thanks,
Darby
Job ID Username Queue  Jobname          N:ppn Proc Wall  S Elap  SP    Features
------ -------- ------ ---------------- ----- ---- ----- - ----- ----- --------
314578 ema      normal F9_case2_refined 40:16  640 08:00 Q    -- 21165 san
316460 ema      normal m0.60a175r-60_dr  6:16   96 05:30 R 01:45 10185 san
316467 dgs      normal m0.175a0.0_rwb02  9:16  144 04:00 R 02:31 10102 san
316477 flumpkin normal DC1_vent         16:16  256 08:00 R 02:21 10040 san
316483 ema      normal m0.30a160r22.5_d  6:16   96 05:30 R 01:45 10066 san
316488 mwhite   normal v8.0396_h55.943_ 21:16  336 08:00 Q    -- 10130 san
316511 flumpkin normal DC1_vent         16:16  256 08:00 Q    -- 10067 san
316512 ema      normal m0.30a150r-90_dr  6:16   96 05:30 R 01:01 10000 san
316513 stuart   normal m0.00a180_ICC=1_  6:16   96 03:00 R 00:53 10000 san
316528 ema      normal m0.90a165r-90_dr  6:16   96 05:30 R 00:12 10000 san
In the above output "SP" is the "StartPriority" of the jobs.
# checkjob 314578
checking job 314578
State: Idle
Creds: user:ema group:eg3 class:normal qos:DEFAULT
WallTime: 00:00:00 of 8:00:00
SubmitTime: Tue May 5 22:22:36
(Time Queued Total: 7:17:56:26 Eligible: 7:17:54:25)
Total Tasks: 640
Req[0] TaskCount: 640 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [san]
IWD: [NONE] Executable: [NONE]
Bypass: 25 StartCount: 0
PartitionMask: [ALL]
Flags: RESTARTABLE
PE: 640.00 StartPriority: 21153
job cannot run in partition DEFAULT. (job 314578 violates active SOFT MAXJOB limit of 2 for user ema (R: 1, U: 10))
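The blocking message above can be read as a simple threshold check. The
following is a minimal Python sketch of that reading, not Maui's actual code.
It assumes that "R" is what the job requests (1 job), that "U" is the user's
current usage (10 running jobs), and that the two values in MAXJOB=2,40 are
the soft and hard limits, in that order. The Limit class and the classify
function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    """A soft/hard pair, as in MAXJOB=2,40 (assumed order: soft, hard)."""
    soft: int
    hard: int


def classify(usage: int, request: int, limit: Limit) -> str:
    """Compare current usage plus the new request against both limits."""
    total = usage + request
    if total > limit.hard:
        return "violates hard limit"
    if total > limit.soft:
        return "violates soft limit"
    return "within limits"


# Values taken from the checkjob output (R: 1, U: 10) and from maui.cfg.
maxjob = Limit(soft=2, hard=40)
print(classify(usage=10, request=1, limit=maxjob))
```

Under these assumptions job 314578 exceeds the soft limit but stays well
below the hard one, which is consistent with the "SOFT MAXJOB" wording in the
checkjob output.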
# showq
ACTIVE JOBS--------------------
JOBNAME USERNAME STATE PROC REMAINING STARTTIME
316322 gsalaza3 Running 360 00:33:17 Wed May 13 08:52:50
316357 aschwing Running 192 00:51:01 Wed May 13 09:10:34
316327 pmcclou1 Running 576 1:01:54 Tue May 12 21:21:27
316467 dgs Running 144 1:39:04 Wed May 13 13:58:37
316513 stuart Running 96 2:16:53 Wed May 13 15:36:26
316257 ema Running 96 3:10:09 Wed May 13 13:59:42
316500 aschwing Running 24 3:13:21 Wed May 13 15:32:54
316273 ema Running 96 3:55:01 Wed May 13 14:44:34
316460 ema Running 96 3:55:01 Wed May 13 14:44:34
316483 ema Running 96 3:55:01 Wed May 13 14:44:34
316363 gsalaza3 Running 360 3:55:09 Wed May 13 12:14:42
316491 ema Running 96 3:59:11 Wed May 13 14:48:44
316494 ema Running 96 4:03:18 Wed May 13 14:52:51
316367 aschwing Running 192 4:16:23 Wed May 13 12:35:56
316423 mwhite Running 168 4:19:45 Wed May 13 12:39:18
316512 ema Running 96 4:37:44 Wed May 13 15:27:17
316514 ema Running 96 5:03:09 Wed May 13 15:52:42
316516 ema Running 96 5:05:27 Wed May 13 15:55:00
316528 ema Running 96 5:27:24 Wed May 13 16:16:57
316477 flumpkin Running 256 5:48:17 Wed May 13 14:07:50
316499 boliver Running 24 6:49:56 Wed May 13 15:09:29
316496 marichal Running 240 7:03:25 Wed May 13 15:22:58
316515 boliver Running 56 7:33:25 Wed May 13 15:52:58
23 Active Jobs 3648 of 3888 Processors Active (93.83%)
263 of 282 Nodes Active (93.26%)
IDLE JOBS----------------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
316458 mwhite Idle 336 8:00:00 Wed May 13 11:12:53
316487 jmai Idle 240 8:00:00 Wed May 13 14:15:53
316488 mwhite Idle 336 8:00:00 Wed May 13 14:18:39
316511 flumpkin Idle 256 8:00:00 Wed May 13 15:22:48
4 Idle Jobs
BLOCKED JOBS----------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
314578 ema Idle 640 8:00:00 Tue May 5 22:22:36
316344 ema Idle 96 5:30:00 Tue May 12 20:41:40
316378 ema Idle 96 5:30:00 Wed May 13 00:39:24
316379 ema Idle 96 5:30:00 Wed May 13 00:51:31
316386 ema Idle 96 5:30:00 Wed May 13 01:32:58
316388 ema Idle 96 5:30:00 Wed May 13 01:40:10
316393 ema Idle 96 5:30:00 Wed May 13 02:28:01
316398 ema Idle 96 5:30:00 Wed May 13 03:09:06
316402 ema Idle 96 5:30:00 Wed May 13 03:44:30
316405 ema Idle 96 5:30:00 Wed May 13 04:21:13
316415 ema Idle 96 5:30:00 Wed May 13 06:14:25
316428 aschwing Idle 192 8:00:00 Wed May 13 07:57:56
316445 ema Idle 96 5:30:00 Wed May 13 09:26:48
316464 pmcclou1 Idle 336 8:00:00 Wed May 13 11:59:21
316466 gsalaza3 Idle 360 8:00:00 Wed May 13 12:13:50
316468 ema Idle 96 5:30:00 Wed May 13 12:28:43
316470 aschwing Idle 192 8:00:00 Wed May 13 12:35:01
316471 aschwing Idle 192 8:00:00 Wed May 13 12:35:04
316472 aschwing Idle 192 8:00:00 Wed May 13 12:35:06
316473 ema Idle 96 5:30:00 Wed May 13 12:38:55
316485 ema Idle 96 5:30:00 Wed May 13 13:59:37
316489 ema Idle 96 5:30:00 Wed May 13 14:23:59
316490 ema Idle 96 5:30:00 Wed May 13 14:33:10
316492 breddell Idle 60 8:00:00 Wed May 13 14:49:22
316493 breddell Idle 60 8:00:00 Wed May 13 14:49:28
316501 aschwing Idle 24 4:00:00 Wed May 13 15:12:13
316502 aschwing Idle 24 4:00:00 Wed May 13 15:12:13
316503 aschwing Idle 24 4:00:00 Wed May 13 15:12:13
316504 aschwing Idle 24 4:00:00 Wed May 13 15:12:13
316505 aschwing Idle 24 4:00:00 Wed May 13 15:12:13
316506 aschwing Idle 24 4:00:00 Wed May 13 15:12:13
316507 aschwing Idle 24 4:00:00 Wed May 13 15:12:13
316508 aschwing Idle 24 8:00:00 Wed May 13 15:12:13
316509 aschwing Idle 24 4:00:00 Wed May 13 15:12:13
316510 aschwing Idle 24 4:00:00 Wed May 13 15:12:13
316517 aschwing Idle 24 4:00:00 Wed May 13 16:09:35
316518 aschwing Idle 24 4:00:00 Wed May 13 16:09:35
316519 aschwing Idle 24 4:00:00 Wed May 13 16:09:35
316520 aschwing Idle 24 4:00:00 Wed May 13 16:09:35
316521 aschwing Idle 24 4:00:00 Wed May 13 16:09:35
316522 aschwing Idle 24 4:00:00 Wed May 13 16:09:35
316523 aschwing Idle 24 4:00:00 Wed May 13 16:09:35
316524 aschwing Idle 24 4:00:00 Wed May 13 16:09:35
316525 aschwing Idle 24 8:00:00 Wed May 13 16:09:35
316526 aschwing Idle 24 4:00:00 Wed May 13 16:09:35
316527 aschwing Idle 24 4:00:00 Wed May 13 16:09:35
Total Jobs: 73 Active Jobs: 23 Idle Jobs: 4 Blocked Jobs: 46
[root@service0 etc]#
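As a cross-check of the "U: 10" figure in the checkjob message, the ACTIVE
JOBS lines can be tallied per user. This is a small Python sketch over an
excerpt of the showq output above (ema's lines plus two others). The tally
function is a hypothetical helper, and the column positions are assumed from
the listing as pasted.

```python
from collections import defaultdict

# Excerpt of the ACTIVE JOBS section: all of ema's lines plus two other users.
SHOWQ_ACTIVE = """\
316467 dgs Running 144 1:39:04 Wed May 13 13:58:37
316513 stuart Running 96 2:16:53 Wed May 13 15:36:26
316257 ema Running 96 3:10:09 Wed May 13 13:59:42
316273 ema Running 96 3:55:01 Wed May 13 14:44:34
316460 ema Running 96 3:55:01 Wed May 13 14:44:34
316483 ema Running 96 3:55:01 Wed May 13 14:44:34
316491 ema Running 96 3:59:11 Wed May 13 14:48:44
316494 ema Running 96 4:03:18 Wed May 13 14:52:51
316512 ema Running 96 4:37:44 Wed May 13 15:27:17
316514 ema Running 96 5:03:09 Wed May 13 15:52:42
316516 ema Running 96 5:05:27 Wed May 13 15:55:00
316528 ema Running 96 5:27:24 Wed May 13 16:16:57
"""


def tally(showq_text: str) -> dict:
    """Return {user: (running_jobs, running_procs)} for 'Running' lines."""
    totals = defaultdict(lambda: [0, 0])
    for line in showq_text.splitlines():
        fields = line.split()
        # Expected columns: JOBNAME USERNAME STATE PROC REMAINING STARTTIME...
        if len(fields) < 4 or fields[2] != "Running":
            continue
        user, procs = fields[1], int(fields[3])
        totals[user][0] += 1
        totals[user][1] += procs
    return {user: tuple(counts) for user, counts in totals.items()}


usage = tally(SHOWQ_ACTIVE)
print(usage["ema"])  # (10, 960)
```

By this count ema is running 10 jobs on 960 processors, which is above both
soft values in USERCFG[DEFAULT] (MAXJOB soft 2, MAXPROC soft 769) before the
640-processor job is even considered.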
Here are the relevant parts of maui.cfg:
QUEUETIMEWEIGHT 1
BACKFILLPOLICY FIRSTFIT
RESERVATIONPOLICY CURRENTHIGHEST
NODEALLOCATIONPOLICY PRIORITY
NODEACCESSPOLICY SINGLEJOB
ENABLEMULTIREQJOBS TRUE
JOBNODEMATCHPOLICY EXACTNODE
FSWEIGHT 0
CREDWEIGHT 1
CLASSWEIGHT 1
USERCFG[DEFAULT] MAXJOB=2,40 MAXPROC=769,10000
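With QUEUETIMEWEIGHT 1, the queue-time part of a job's StartPriority can be
estimated from the eligible time that checkjob reports. The Python sketch
below assumes that Maui counts queue time in minutes and that no other weight
scales it; to_minutes is a hypothetical helper. It is arithmetic on the
numbers posted above, not a reproduction of Maui's priority code.

```python
def to_minutes(duration: str) -> int:
    """Convert a D:HH:MM:SS (or shorter) duration string to whole minutes."""
    parts = [int(p) for p in duration.split(":")]
    # Left-pad with zeros so shorter forms such as HH:MM:SS also work.
    parts = [0] * (4 - len(parts)) + parts
    days, hours, minutes, _seconds = parts
    return days * 1440 + hours * 60 + minutes


QUEUETIMEWEIGHT = 1          # from maui.cfg
START_PRIORITY = 21153       # from "checkjob 314578"

eligible_minutes = to_minutes("7:17:54:25")   # "Eligible" time from checkjob
queue_component = QUEUETIMEWEIGHT * eligible_minutes
remainder = START_PRIORITY - queue_component

print(eligible_minutes, queue_component, remainder)  # 11154 11154 9999
```

If these assumptions hold, about 11154 points of job 314578's priority come
from its week in the queue. The remaining 9999 or so would have to come from
another component, presumably a credential or class priority that is not part
of the excerpt above, and that would also explain why the recently submitted
jobs sit at roughly 10000.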
_______________________________________________
mauiusers mailing list
http://www.supercluster.org/mailman/listinfo/mauiusers