On Wed, Dec 10, 2008 at 03:57:02PM -0500, Steve Young wrote:
> I've used a routing queue to solve this problem. The queue that the user is
> running on can only utilize 32 CPUs. The thousands of jobs are 1 CPU each.
> So I have this for a routing queue:
>
> create queue physics
> set queue physics queue_type = Route
> set queue physics acl_group_enable = True
> set queue physics route_destinations += herc
> set queue physics enabled = True
> set queue physics started = True
>
> So jobs that go into here are moved to the herc execution queue. This queue
> has the following setting:
>
> set queue herc max_queuable = 36
>
> This way only 36 jobs at a time can be queued from the routing queue, so
> maui doesn't have to consider each of the thousands of jobs on every
> iteration. It only has to worry about scheduling the jobs for the
> resources it has to run on.
>
> I also use MAXIJOB in maui:
>
> CLASSCFG[herc] QLIST=md QDEF=md MAXIJOB=4
>
> This way, even if a user has lots of jobs in the queue, only their top 4
> idle jobs will get considered for scheduling. Others will be able to get
> their jobs to run without having to wait for maui to process thousands
> of jobs that can't run yet anyhow.
>
ok, i'm in this boat as well (lots of serial jobs). i attempted to implement
this thusly:
create queue sroute
set queue sroute queue_type = Route
set queue sroute acl_group_enable = True
set queue sroute route_destinations = serial
set queue sroute route_destinations += serial
set queue sroute enabled = True
set queue sroute started = True
create queue serial
set queue serial queue_type = Execution
set queue serial max_queuable = 36
set queue serial resources_max.walltime = 168:00:00
set queue serial resources_default.neednodes = serial
set queue serial resources_default.nodes = 12
set queue serial resources_default.walltime = 168:00:00
set queue serial enabled = True
set queue serial started = True
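as a rough illustration of what the sroute/serial pair above is meant to do, here is a minimal python sketch. it is not torque code: the function name route_jobs is hypothetical, and it assumes max_queuable simply caps the total number of jobs sitting in the execution queue. the job count of 8199 is the total that qstat -q reports in this message.

```python
from collections import deque


def route_jobs(routing, execution, max_queuable):
    """Move jobs from the routing queue to the execution queue.

    Assumed semantics (not taken from the TORQUE source): jobs are moved in
    submission order until the execution queue holds max_queuable jobs.
    Returns the number of jobs moved.
    """
    moved = 0
    while routing and len(execution) < max_queuable:
        execution.append(routing.popleft())
        moved += 1
    return moved


# 8199 jobs submitted to the routing queue, cap of 36 on the execution queue.
sroute = deque(range(8199))
serial = []
route_jobs(sroute, serial, max_queuable=36)
print(len(sroute), len(serial))  # 8163 36
```

under that assumption the scheduler only ever sees the 36 jobs in serial, and the rest wait in sroute until slots free up.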
jobs are dropping from the routing queue into the serial
queue but not running:
[r...@bioinfo server_logs]# qstat -q
server:
Queue            Memory CPU Time Walltime Node  Run   Que Lm  State
---------------- ------ -------- -------- ----  --- ----- --  -----
annotate           --      --    18:00:00   --    0     0 --   E R
sroute             --      --       --      --    0  8163 --   E R
md                 --      --    168:00:0   --   18     0 --   E R
serial             --      --    168:00:0   --    0    36 --   E R
                                                ----- -----
                                                   18  8199
[r...@bioinfo server_logs]# showq
ACTIVE JOBS--------------------
JOBNAME USERNAME STATE PROC REMAINING STARTTIME
51493 pkc Running 1 2:13:33:53 Thu Dec 11 21:41:56
68422 pkc Running 1 6:06:22:44 Mon Dec 15 14:30:47
68423 pkc Running 1 6:06:22:44 Mon Dec 15 14:30:47
18 Active Jobs 18 of 260 Processors Active (6.92%)
5 of 66 Nodes Active (7.58%)
IDLE JOBS----------------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
0 Idle Jobs
BLOCKED JOBS----------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME
68424 bci Idle 1 7:00:00:00 Tue Dec 16 03:38:51
68425 bci Idle 1 7:00:00:00 Tue Dec 16 03:38:51
68426 bci Idle 1 7:00:00:00 Tue Dec 16 03:38:51
68427 bci Idle 1 7:00:00:00 Tue Dec 16 03:41:21
68428 bci Idle 1 7:00:00:00 Tue Dec 16 03:45:53
68429 bci Idle 1 7:00:00:00 Tue Dec 16 03:48:23
68430 bci Idle 1 7:00:00:00 Tue Dec 16 03:51:53
68431 bci Idle 1 7:00:00:00 Tue Dec 16 03:52:53
68432 bci Idle 1 7:00:00:00 Tue Dec 16 03:58:56
68433 bci Idle 1 7:00:00:00 Tue Dec 16 03:59:26
68434 bci Idle 1 7:00:00:00 Tue Dec 16 03:59:26
68435 bci Idle 1 7:00:00:00 Tue Dec 16 04:00:26
68436 bci Idle 1 7:00:00:00 Tue Dec 16 04:00:56
68437 bci Idle 1 7:00:00:00 Tue Dec 16 04:05:28
68438 bci Idle 1 7:00:00:00 Tue Dec 16 04:07:00
68439 bci Idle 1 7:00:00:00 Tue Dec 16 04:08:30
68440 bci Idle 1 7:00:00:00 Tue Dec 16 04:10:34
68441 bci Idle 1 7:00:00:00 Tue Dec 16 04:16:36
68442 bci Idle 1 7:00:00:00 Tue Dec 16 04:18:36
68443 bci Idle 1 7:00:00:00 Tue Dec 16 04:20:39
68444 bci Idle 1 7:00:00:00 Tue Dec 16 04:20:39
68445 bci Idle 1 7:00:00:00 Tue Dec 16 04:26:19
68446 bci Idle 1 7:00:00:00 Tue Dec 16 04:28:19
68447 bci Idle 1 7:00:00:00 Tue Dec 16 04:28:19
68448 bci Idle 1 7:00:00:00 Tue Dec 16 04:29:26
68449 bci Idle 1 7:00:00:00 Tue Dec 16 04:34:02
68450 bci Idle 1 7:00:00:00 Tue Dec 16 04:35:04
68451 bci Idle 1 7:00:00:00 Tue Dec 16 04:38:40
68452 bci Idle 1 7:00:00:00 Tue Dec 16 04:40:42
68453 bci Idle 1 7:00:00:00 Tue Dec 16 04:41:42
68454 bci Idle 1 7:00:00:00 Tue Dec 16 04:50:05
68455 bci Idle 1 7:00:00:00 Tue Dec 16 05:02:38
68456 bci Idle 1 7:00:00:00 Tue Dec 16 05:03:08
68457 bci Idle 1 7:00:00:00 Tue Dec 16 05:04:42
68458 bci Idle 1 7:00:00:00 Tue Dec 16 05:11:24
68459 bci Idle 1 7:00:00:00 Tue Dec 16 05:25:36
Total Jobs: 54 Active Jobs: 18 Idle Jobs: 0 Blocked Jobs: 36
and checkjob on one of the blocked jobs:
[r...@bioinfo server_logs]# checkjob 68440
checking job 68440
State: Idle
Creds: user:bci group:bci class:serial qos:DEFAULT
WallTime: 00:00:00 of 7:00:00:00
SubmitTime: Tue Dec 16 04:10:34
(Time Queued Total: 4:00:04 Eligible: 00:00:00)
Total Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [serial]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 4
PartitionMask: [ALL]
Flags: HOSTLIST RESTARTABLE
HostList:
[c0-70:1]
Holds: Defer
Messages: job cannot be started - cannot set hostlist
PE: 1.00 StartPriority: 41
cannot select job 68440 for partition DEFAULT (job hold active)
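with 36 blocked jobs it gets tedious to read checkjob output one job at a time, so here is a minimal python sketch that pulls the State, Holds, and Messages lines out of pasted checkjob text. the helper name summarize_checkjob is hypothetical, and the sample is a shortened copy of the output above; it only assumes each field sits on one line as "Name: value".

```python
import re


def summarize_checkjob(text):
    """Return the State, Holds, and Messages fields found in checkjob output.

    Fields that do not appear in the text are left out of the result.
    """
    fields = {}
    for key in ("State", "Holds", "Messages"):
        # Match "Key: value" on a single line, allowing leading indentation.
        match = re.search(rf"^[ \t]*{key}:[ \t]*(.*)$", text, flags=re.MULTILINE)
        if match:
            fields[key] = match.group(1).strip()
    return fields


# Shortened copy of the checkjob output for job 68440.
sample = """\
checking job 68440
State: Idle
Creds: user:bci group:bci class:serial qos:DEFAULT
Holds: Defer
Messages: job cannot be started - cannot set hostlist
"""

print(summarize_checkjob(sample))
```

run against each blocked job, this would show at a glance whether they all carry the same Defer hold and the same hostlist message.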
i clearly made an error somewhere, just cannot see it. any help
greatly appreciated.
-- michael