Hi Michael,
First I would try submitting some jobs directly to the execution queue
to make sure it works on its own. Since the message says "job cannot be
started - cannot set hostlist", I'm wondering whether your
server_priv/nodes file actually lists "serial" as a feature on a
certain number of nodes for this queue. Another thing I wonder is what
the batch script for the job looks like. Is the user using -l
host=<name of node> in it? I'm not certain what the message is
supposed to mean, but it sounds like the scheduler isn't able to find
any nodes to allocate the job to. Hope this helps,
-Steve
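A quick way to run the nodes-file check is sketched below. It assumes the usual one-node-per-line layout (node name, then settings such as np=N, then property names). The helper name nodes_with_property, the np values, and the nodes c0-71 and c1-01 are made up for illustration; only c0-70 comes from the checkjob output.

```python
def nodes_with_property(nodes_text, prop):
    """Return the names of nodes whose entry lists `prop` as a property."""
    matches = []
    for line in nodes_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, *fields = line.split()
        # Fields containing "=" (for example np=4) are settings, not properties.
        props = {f for f in fields if "=" not in f}
        if prop in props:
            matches.append(name)
    return matches


# Hypothetical contents; on a real server, read server_priv/nodes instead.
sample = """\
c0-70 np=4 serial
c0-71 np=4 serial
c1-01 np=4 parallel
"""

serial_nodes = nodes_with_property(sample, "serial")
print(serial_nodes)
print("c0-70" in serial_nodes)
```

If the host named in the job's HostList is missing from the returned list, or the list is empty, the queue's neednodes default cannot be satisfied on that host.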
On Dec 16, 2008, at 8:12 AM, Michael Galloway wrote:
On Wed, Dec 10, 2008 at 03:57:02PM -0500, Steve Young wrote:
I've used a routing queue to solve this problem. The queue that the
user is running on can only utilize 32 CPUs. The thousands of jobs are
1 CPU each. So I have this for a routing queue:
create queue physics
set queue physics queue_type = Route
set queue physics acl_group_enable = True
set queue physics route_destinations += herc
set queue physics enabled = True
set queue physics started = True
So jobs that go into here are moved to the herc execution queue. This
queue has the following setting:
set queue herc max_queuable = 36
This way only 36 jobs at a time can be queued from the routing queue,
so maui doesn't have to consider each of the thousands of jobs on every
iteration. It only has to worry about scheduling the jobs for the
resources it has to run on.
I also use MAXIJOB in maui:
CLASSCFG[herc] QLIST=md QDEF=md MAXIJOB=4
This way, even if a user has lots of jobs in the queue, only their top
4 idle jobs will get considered for scheduling. Others will be able to
get their jobs to run without having to wait for maui to process
thousands of jobs that can't run yet anyhow.
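The effect of the two limits can be illustrated with a toy model. This is only a sketch of the behaviour described above, not how pbs_server or maui are implemented. The function names, the user name and the job count are hypothetical, and applying the idle cap per user follows the description above.

```python
from collections import deque


def route_jobs(route_queue, exec_queue, max_queuable):
    """Move jobs from the routing queue until the execution queue is full."""
    moved = 0
    while route_queue and len(exec_queue) < max_queuable:
        exec_queue.append(route_queue.popleft())
        moved += 1
    return moved


def idle_jobs_considered(exec_queue, max_idle_per_user):
    """Keep only the first `max_idle_per_user` idle jobs of each user."""
    seen = {}
    considered = []
    for job_id, user in exec_queue:
        seen[user] = seen.get(user, 0) + 1
        if seen[user] <= max_idle_per_user:
            considered.append(job_id)
    return considered


# Hypothetical workload: 5000 one-CPU jobs submitted by a single user.
route_queue = deque((job_id, "alice") for job_id in range(5000))
exec_queue = []

route_jobs(route_queue, exec_queue, max_queuable=36)
print(len(route_queue), len(exec_queue))  # 4964 36
print(idle_jobs_considered(exec_queue, max_idle_per_user=4))  # [0, 1, 2, 3]
```

The scheduler only ever sees the 36 jobs in the execution queue, and of those only 4 idle jobs per user are candidates on each pass; the rest stay parked in the routing queue.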
ok, i'm in this boat as well (lots of serial jobs). i attempted to
implement this thusly:
create queue sroute
set queue sroute queue_type = Route
set queue sroute acl_group_enable = True
set queue sroute route_destinations = serial
set queue sroute route_destinations += serial
set queue sroute enabled = True
set queue sroute started = True
create queue serial
set queue serial queue_type = Execution
set queue serial max_queuable = 36
set queue serial resources_max.walltime = 168:00:00
set queue serial resources_default.neednodes = serial
set queue serial resources_default.nodes = 12
set queue serial resources_default.walltime = 168:00:00
set queue serial enabled = True
set queue serial started = True
jobs are dropping from the routing queue into the serial queue but not
running:
[r...@bioinfo server_logs]# qstat -q
server:
Queue            Memory CPU Time Walltime Node  Run   Que Lm  State
---------------- ------ -------- -------- ----  ---   --- --  -----
annotate           --      --    18:00:00   --    0     0 --   E R
sroute             --      --       --      --    0  8163 --   E R
md                 --      --    168:00:0   --   18     0 --   E R
serial             --      --    168:00:0   --    0    36 --   E R
                                               ----- -----
                                                  18  8199
[r...@bioinfo server_logs]# showq
ACTIVE JOBS--------------------
JOBNAME   USERNAME  STATE    PROC  REMAINING   STARTTIME
51493     pkc       Running     1  2:13:33:53  Thu Dec 11 21:41:56
68422     pkc       Running     1  6:06:22:44  Mon Dec 15 14:30:47
68423     pkc       Running     1  6:06:22:44  Mon Dec 15 14:30:47
18 Active Jobs 18 of 260 Processors Active (6.92%)
5 of 66 Nodes Active (7.58%)
IDLE JOBS----------------------
JOBNAME   USERNAME  STATE    PROC  WCLIMIT     QUEUETIME
0 Idle Jobs
BLOCKED JOBS----------------
JOBNAME   USERNAME  STATE    PROC  WCLIMIT     QUEUETIME
68424     bci       Idle        1  7:00:00:00  Tue Dec 16 03:38:51
68425     bci       Idle        1  7:00:00:00  Tue Dec 16 03:38:51
68426     bci       Idle        1  7:00:00:00  Tue Dec 16 03:38:51
68427     bci       Idle        1  7:00:00:00  Tue Dec 16 03:41:21
68428     bci       Idle        1  7:00:00:00  Tue Dec 16 03:45:53
68429     bci       Idle        1  7:00:00:00  Tue Dec 16 03:48:23
68430     bci       Idle        1  7:00:00:00  Tue Dec 16 03:51:53
68431     bci       Idle        1  7:00:00:00  Tue Dec 16 03:52:53
68432     bci       Idle        1  7:00:00:00  Tue Dec 16 03:58:56
68433     bci       Idle        1  7:00:00:00  Tue Dec 16 03:59:26
68434     bci       Idle        1  7:00:00:00  Tue Dec 16 03:59:26
68435     bci       Idle        1  7:00:00:00  Tue Dec 16 04:00:26
68436     bci       Idle        1  7:00:00:00  Tue Dec 16 04:00:56
68437     bci       Idle        1  7:00:00:00  Tue Dec 16 04:05:28
68438     bci       Idle        1  7:00:00:00  Tue Dec 16 04:07:00
68439     bci       Idle        1  7:00:00:00  Tue Dec 16 04:08:30
68440     bci       Idle        1  7:00:00:00  Tue Dec 16 04:10:34
68441     bci       Idle        1  7:00:00:00  Tue Dec 16 04:16:36
68442     bci       Idle        1  7:00:00:00  Tue Dec 16 04:18:36
68443     bci       Idle        1  7:00:00:00  Tue Dec 16 04:20:39
68444     bci       Idle        1  7:00:00:00  Tue Dec 16 04:20:39
68445     bci       Idle        1  7:00:00:00  Tue Dec 16 04:26:19
68446     bci       Idle        1  7:00:00:00  Tue Dec 16 04:28:19
68447     bci       Idle        1  7:00:00:00  Tue Dec 16 04:28:19
68448     bci       Idle        1  7:00:00:00  Tue Dec 16 04:29:26
68449     bci       Idle        1  7:00:00:00  Tue Dec 16 04:34:02
68450     bci       Idle        1  7:00:00:00  Tue Dec 16 04:35:04
68451     bci       Idle        1  7:00:00:00  Tue Dec 16 04:38:40
68452     bci       Idle        1  7:00:00:00  Tue Dec 16 04:40:42
68453     bci       Idle        1  7:00:00:00  Tue Dec 16 04:41:42
68454     bci       Idle        1  7:00:00:00  Tue Dec 16 04:50:05
68455     bci       Idle        1  7:00:00:00  Tue Dec 16 05:02:38
68456     bci       Idle        1  7:00:00:00  Tue Dec 16 05:03:08
68457     bci       Idle        1  7:00:00:00  Tue Dec 16 05:04:42
68458     bci       Idle        1  7:00:00:00  Tue Dec 16 05:11:24
68459     bci       Idle        1  7:00:00:00  Tue Dec 16 05:25:36
Total Jobs: 54 Active Jobs: 18 Idle Jobs: 0 Blocked Jobs: 36
and checkjob on one of the blocked jobs:
[r...@bioinfo server_logs]# checkjob 68440
checking job 68440
State: Idle
Creds: user:bci group:bci class:serial qos:DEFAULT
WallTime: 00:00:00 of 7:00:00:00
SubmitTime: Tue Dec 16 04:10:34
(Time Queued Total: 4:00:04 Eligible: 00:00:00)
Total Tasks: 1
Req[0] TaskCount: 1 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [serial]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 4
PartitionMask: [ALL]
Flags: HOSTLIST RESTARTABLE
HostList:
[c0-70:1]
Holds: Defer
Messages: job cannot be started - cannot set hostlist
PE: 1.00 StartPriority: 41
cannot select job 68440 for partition DEFAULT (job hold active)
i clearly made an error somewhere, just cannot see it. any help
greatly appreciated.
-- michael
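With 36 blocked jobs, it can help to pull just the relevant fields out of each checkjob report. The sketch below assumes the field layout shown in the checkjob output above, which may differ between Maui versions; summarize_checkjob is a hypothetical helper, and the sample text is a trimmed copy of that output.

```python
import re


def summarize_checkjob(text):
    """Extract the fields most relevant to a job that will not start."""
    def field(pattern):
        match = re.search(pattern, text, re.MULTILINE)
        return match.group(1).strip() if match else None

    return {
        "state": field(r"^[ \t]*State:[ \t]*(\S+)"),
        "features": field(r"Features:[ \t]*\[([^\]]*)\]"),
        # The host list may sit on the line after the "HostList:" label.
        "hostlist": field(r"HostList:\s*\[([^\]]*)\]"),
        "holds": field(r"^[ \t]*Holds:[ \t]*(.+)$"),
        "messages": field(r"^[ \t]*Messages:[ \t]*(.+)$"),
    }


output = """\
State: Idle
Creds: user:bci group:bci class:serial qos:DEFAULT
Opsys: [NONE] Arch: [NONE] Features: [serial]
Flags: HOSTLIST RESTARTABLE
HostList:
[c0-70:1]
Holds: Defer
Messages: job cannot be started - cannot set hostlist
"""

summary = summarize_checkjob(output)
for key, value in summary.items():
    print(f"{key}: {value}")
```

For job 68440 this puts the requested feature (serial), the host the job is pinned to (c0-70) and the hold reason side by side.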
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers