Hi Michael,
First, I would try submitting some jobs directly to the execution queue to make sure it works. Since the message says "job cannot be started - cannot set hostlist", I wonder whether your server_priv/nodes file actually lists "serial" as a feature on the nodes meant for this queue.

Another thing I wonder is what the batch script for the job looks like. Is the user using -l host=<name of node> in it? I'm not certain what the message is supposed to mean, but it sounds like the scheduler isn't able to find any nodes to allocate the job to. Hope this helps,

-Steve
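
For reference, the nodes-file check suggested above can be sketched roughly as follows. This is a minimal sketch, assuming the usual server_priv/nodes layout of one host per line, followed by attributes such as np=N and then free-form property names. The sample file contents are hypothetical; only the host c0-70 and the "serial" feature come from this thread.

```python
def nodes_with_feature(nodes_text, feature):
    """Return hostnames whose nodes-file entry lists `feature` as a property."""
    matches = []
    for raw in nodes_text.splitlines():
        # Drop trailing comments and skip blank lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        host, *tokens = line.split()
        # Tokens with "=" (e.g. np=4) are attributes; the rest are properties.
        properties = [t for t in tokens if "=" not in t]
        if feature in properties:
            matches.append(host)
    return matches


# Hypothetical nodes file; np values and the other hosts are made up.
SAMPLE = """\
# hostname  np  properties
c0-70 np=4 serial
c0-71 np=4 serial
c1-01 np=8 md
"""

print(nodes_with_feature(SAMPLE, "serial"))
```

If the list comes back empty for the feature the queue requests through resources_default.neednodes, the scheduler has no node it could match the job to.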

On Dec 16, 2008, at 8:12 AM, Michael Galloway wrote:


On Wed, Dec 10, 2008 at 03:57:02PM -0500, Steve Young wrote:
I've used a routing queue to solve this problem. The queue that the user is running on can only utilize 32 CPUs. The thousands of jobs are 1 CPU each.
So I have this for a routing queue:

create queue physics
set queue physics queue_type = Route
set queue physics acl_group_enable = True
set queue physics route_destinations += herc
set queue physics enabled = True
set queue physics started = True

So jobs that go into here are moved to the herc execution queue. This queue
has the following setting:

set queue herc max_queuable = 36

This way only 36 jobs at a time can be queued from the routing queue, so
Maui doesn't have to consider all of the thousands of jobs on each
iteration. It only has to schedule jobs for the resources it actually
has to run them on.
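
The effect of max_queuable on a routing queue can be illustrated with a toy model. This is only a sketch of the idea, not how TORQUE implements it: the function name route_jobs and the job counts are made up for illustration.

```python
from collections import deque


def route_jobs(routing, execution, max_queuable):
    """Move jobs from the routing queue to the execution queue until it is full."""
    while routing and len(execution) < max_queuable:
        execution.append(routing.popleft())
    return routing, execution


# Illustrative numbers: several thousand 1-CPU jobs sitting in the routing queue.
routing = deque(range(8199))
execution = deque()
route_jobs(routing, execution, max_queuable=36)

# The scheduler only ever sees the jobs in the execution queue.
print(len(routing), len(execution))
```

Whenever a job leaves the execution queue, the next pass moves one more job over, so the scheduler only has to look at a small, bounded set of jobs.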

I also use MAXIJOB in maui:

CLASSCFG[herc]          QLIST=md QDEF=md MAXIJOB=4

This way, even if a user has lots of jobs in the queue, only their top 4 idle jobs will be considered for scheduling. Others will be able to get their jobs to run without having to wait for maui to process thousands
of jobs that can't run yet anyhow.
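
The idle-job limit can be pictured with another toy model. This is a sketch under assumptions, not Maui's actual logic: it keys the limit on the class, as in the CLASSCFG[herc] line above, and the job ids are invented. How Maui applies MAXIJOB per credential should be checked against its documentation.

```python
from collections import defaultdict


def eligible_idle_jobs(idle_jobs, limits):
    """Keep at most limits[cls] idle jobs per class, in priority order.

    idle_jobs: list of (job_id, cls), assumed sorted highest priority first.
    limits:    dict mapping class name to its idle-job limit; classes that
               are not listed have no limit in this model.
    """
    counts = defaultdict(int)
    eligible = []
    for job_id, cls in idle_jobs:
        limit = limits.get(cls)  # None means no idle-job limit
        if limit is None or counts[cls] < limit:
            counts[cls] += 1
            eligible.append(job_id)
    return eligible


# Hypothetical idle jobs: five in class herc, one in class md.
idle = [(101, "herc"), (102, "herc"), (103, "herc"),
        (104, "herc"), (105, "herc"), (106, "md")]
print(eligible_idle_jobs(idle, {"herc": 4}))
```

In this model the fifth herc job is skipped for the current iteration, while the md job is still considered.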


ok, i'm in this boat as well (lots of serial jobs). i attempted to implement this
thusly:

create queue sroute
set queue sroute queue_type = Route
set queue sroute acl_group_enable = True
set queue sroute route_destinations = serial
set queue sroute route_destinations += serial
set queue sroute enabled = True
set queue sroute started = True

create queue serial
set queue serial queue_type = Execution
set queue serial max_queuable = 36
set queue serial resources_max.walltime = 168:00:00
set queue serial resources_default.neednodes = serial
set queue serial resources_default.nodes = 12
set queue serial resources_default.walltime = 168:00:00
set queue serial enabled = True
set queue serial started = True

jobs are dropping from the routing queue into the serial
queue but not running:

[r...@bioinfo server_logs]# qstat -q

server:

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
annotate           --      --    18:00:00   --    0   0 --   E R
sroute             --      --       --      --    0 8163 --   E R
md                 --      --    168:00:0   --   18   0 --   E R
serial             --      --    168:00:0   --    0  36 --   E R
                                              ----- -----
                                                 18  8199


[r...@bioinfo server_logs]# showq
ACTIVE JOBS--------------------
JOBNAME   USERNAME   STATE     PROC   REMAINING    STARTTIME

51493     pkc        Running   1      2:13:33:53   Thu Dec 11 21:41:56
68422     pkc        Running   1      6:06:22:44   Mon Dec 15 14:30:47
68423     pkc        Running   1      6:06:22:44   Mon Dec 15 14:30:47

   18 Active Jobs      18 of  260 Processors Active (6.92%)
                        5 of   66 Nodes Active      (7.58%)

IDLE JOBS----------------------
JOBNAME USERNAME STATE PROC WCLIMIT QUEUETIME


0 Idle Jobs

BLOCKED JOBS----------------
JOBNAME   USERNAME   STATE   PROC   WCLIMIT      QUEUETIME

68424     bci        Idle    1      7:00:00:00   Tue Dec 16 03:38:51
68425     bci        Idle    1      7:00:00:00   Tue Dec 16 03:38:51
68426     bci        Idle    1      7:00:00:00   Tue Dec 16 03:38:51
68427     bci        Idle    1      7:00:00:00   Tue Dec 16 03:41:21
68428     bci        Idle    1      7:00:00:00   Tue Dec 16 03:45:53
68429     bci        Idle    1      7:00:00:00   Tue Dec 16 03:48:23
68430     bci        Idle    1      7:00:00:00   Tue Dec 16 03:51:53
68431     bci        Idle    1      7:00:00:00   Tue Dec 16 03:52:53
68432     bci        Idle    1      7:00:00:00   Tue Dec 16 03:58:56
68433     bci        Idle    1      7:00:00:00   Tue Dec 16 03:59:26
68434     bci        Idle    1      7:00:00:00   Tue Dec 16 03:59:26
68435     bci        Idle    1      7:00:00:00   Tue Dec 16 04:00:26
68436     bci        Idle    1      7:00:00:00   Tue Dec 16 04:00:56
68437     bci        Idle    1      7:00:00:00   Tue Dec 16 04:05:28
68438     bci        Idle    1      7:00:00:00   Tue Dec 16 04:07:00
68439     bci        Idle    1      7:00:00:00   Tue Dec 16 04:08:30
68440     bci        Idle    1      7:00:00:00   Tue Dec 16 04:10:34
68441     bci        Idle    1      7:00:00:00   Tue Dec 16 04:16:36
68442     bci        Idle    1      7:00:00:00   Tue Dec 16 04:18:36
68443     bci        Idle    1      7:00:00:00   Tue Dec 16 04:20:39
68444     bci        Idle    1      7:00:00:00   Tue Dec 16 04:20:39
68445     bci        Idle    1      7:00:00:00   Tue Dec 16 04:26:19
68446     bci        Idle    1      7:00:00:00   Tue Dec 16 04:28:19
68447     bci        Idle    1      7:00:00:00   Tue Dec 16 04:28:19
68448     bci        Idle    1      7:00:00:00   Tue Dec 16 04:29:26
68449     bci        Idle    1      7:00:00:00   Tue Dec 16 04:34:02
68450     bci        Idle    1      7:00:00:00   Tue Dec 16 04:35:04
68451     bci        Idle    1      7:00:00:00   Tue Dec 16 04:38:40
68452     bci        Idle    1      7:00:00:00   Tue Dec 16 04:40:42
68453     bci        Idle    1      7:00:00:00   Tue Dec 16 04:41:42
68454     bci        Idle    1      7:00:00:00   Tue Dec 16 04:50:05
68455     bci        Idle    1      7:00:00:00   Tue Dec 16 05:02:38
68456     bci        Idle    1      7:00:00:00   Tue Dec 16 05:03:08
68457     bci        Idle    1      7:00:00:00   Tue Dec 16 05:04:42
68458     bci        Idle    1      7:00:00:00   Tue Dec 16 05:11:24
68459     bci        Idle    1      7:00:00:00   Tue Dec 16 05:25:36

Total Jobs: 54   Active Jobs: 18   Idle Jobs: 0   Blocked Jobs: 36

and checkjob on one of the blocked jobs:

[r...@bioinfo server_logs]# checkjob 68440
checking job 68440

State: Idle
Creds:  user:bci  group:bci  class:serial  qos:DEFAULT
WallTime: 00:00:00 of 7:00:00:00
SubmitTime: Tue Dec 16 04:10:34
 (Time Queued  Total: 4:00:04  Eligible: 00:00:00)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [serial]


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 4
PartitionMask: [ALL]
Flags:       HOSTLIST RESTARTABLE
HostList:
 [c0-70:1]
Holds:    Defer
Messages:  job cannot be started - cannot set hostlist
PE:  1.00  StartPriority:  41
cannot select job 68440 for partition DEFAULT (job hold active)

i clearly made an error somewhere, just cannot see it. any help
greatly appreciated.

-- michael


_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers

