Ok, I just made a big newbie mistake, pardon my repost to correct it.
I finally got qmgr to list me the settings for the queues. The setting
that Lennart suggested was not set. So I added it and restarted the
server. It still reports a policy violation of 128 > 70.
This is the current setting for the queue low:
Queue low
queue_type = Execution
Priority = 10
total_jobs = 2
state_count = Transit:0 Queued:2 Held:0 Waiting:0 Running:0 Exiting:0
max_running = 10
resources_max.ncpus = 70
resources_max.nodect = 140
resources_max.walltime = 96:00:00
mtime = Wed Jul 18 11:33:12 2007
resources_assigned.ncpus = 0
resources_assigned.nodect = 0
enabled = True
started = True
This is the information from PBS about one of the jobs waiting because
of the policy violation:
Resource_List.ncpus = 1
Resource_List.nodect = 32
Resource_List.nodes = 32:ppn=4
What is the difference between .ncpus and .nodect? And which one does
the Maui scheduler look at?
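(For comparison, the two values usually come from different request forms; a sketch, assuming standard TORQUE qsub syntax and this server's defaults of ncpus=1 and nodect=1:)

```shell
# Requesting whole nodes: this sets Resource_List.nodes and
# Resource_List.nodect (here nodect = 32); Resource_List.ncpus
# is left at the server default of 1, as in the output above.
qsub -l nodes=32:ppn=4 job.sh

# Requesting processors directly: this sets Resource_List.ncpus
# (here 128); nodect is left at the server default of 1.
qsub -l ncpus=128 job.sh
```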
Thanks again to anyone who can help,
Kelli
-------------------------------------------------
Dr. K. Hendrickson
MIT Research Engineer, Vortical Flow Research Lab
[EMAIL PROTECTED] | 617-258-7675 | 5-326B
Lennart Karlsson wrote:
I would try to add a resources_max.nodect declaration in qmgr for
each PBS queue, as for example:
set queue short resources_max.nodect = 140
This sets the upper limit on how many processors/cores you may
use in a single job.
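A fuller qmgr session might look like this (a sketch; the queue names are taken from this thread, and standard qmgr -c command syntax is assumed):

```shell
# Set the per-job node-count limit on each execution queue ...
qmgr -c "set queue short resources_max.nodect = 140"
qmgr -c "set queue low resources_max.nodect = 140"

# ... then print a queue back to verify the setting took effect.
qmgr -c "print queue low"
```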
Regarding the count of 70, I do not know why you get it. Perhaps
it is due to a double-counting bug in Maui (see bug number 99 in
http://clusterresources.com/bugzilla/), but I am not sure whether
it already appears in p13 of Maui 3.2.6. The bug was introduced
sometime after p11 and at latest in the p14 snapshots. I run
snapshot versions of p16 (or p11) on our Maui systems, and both
are free of that bug. Perhaps you should upgrade to a p16 snapshot
or later (I run maui-3.2.6p16-snap.1157560841 here)?
Cheers,
-- Lennart Karlsson <[EMAIL PROTECTED]>
National Supercomputer Centre in Linkoping, Sweden
http://www.nsc.liu.se
+46 706 49 55 35
+46 13 28 26 24
Kelli Hendrickson wrote:
I've browsed the web and haven't been able to find a solution to this
problem, and the technical support for my cluster has more or less left
me hanging on this issue.
We've got Maui version 3.2.6p13, configured out of the box by my cluster
vendor, running on SuSE 10.1 with pbs_server 2.1.3.
The system has 35 dual-CPU, dual-core compute nodes. When the
system is free and we submit a job using qsub -l nodes=32:ppn=4 (i.e.
requesting 128 procs), the job starts immediately and runs.
However, if this job ever has to wait, the Maui scheduler puts it in
Batch hold. A checkjob reports a Policy Violation: the number of procs
is too high (128 > 70) - output below.
The job will run if you use the "runjob" command but not if you do a
"releasehold" command.
The diagnose -n command reports that there are 35 nodes and 140 procs.
The nodeallocationpolicy is set to minresource.
qmgr also reports the correct number of nodes/procs (output below).
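For reference, the checks described above can be reproduced with the standard Maui and TORQUE client commands (job id taken from the checkjob output below):

```shell
checkjob 1626           # job state, holds, and any policy-violation messages
diagnose -n             # Maui's view of the nodes and their configured procs
qmgr -c "print server"  # PBS server attributes, incl. resources_available
```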
So my question is this... where is Maui getting this 70 from?
Obviously the procs are available, because the job runs when started
with the runjob command. Is there another setting that was missed to
make this all work correctly? My vendor essentially suggested I switch
to the pbs scheduler instead, but that seems like giving up on what
looks like a simple matter.
Any help would be greatly appreciated.
Thanks,
Kelli
--------------------------------- checkjob output.
checking job 1626
State: Idle
Creds: user:kelli group:users class:low qos:DEFAULT
WallTime: 00:00:00 of 4:00:00:00
SubmitTime: Tue Jul 17 11:53:55
(Time Queued Total: 00:13:16 Eligible: 00:00:01)
Total Tasks: 128
Req[0] TaskCount: 128 Partition: ALL
Network: [NONE] Memory >= 0 Disk >= 0 Swap >= 0
Opsys: [NONE] Arch: [NONE] Features: [NONE]
IWD: [NONE] Executable: [NONE]
Bypass: 0 StartCount: 0
PartitionMask: [ALL]
Holds: Batch (hold reason: PolicyViolation)
Messages: procs too high (128 > 70)
PE: 128.00 StartPriority: 1
cannot select job 1626 for partition DEFAULT (job hold active)
---------------------------------------------------- qmgr output
server_state = Active
scheduling = True
total_jobs = 1
state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0
default_queue = short
log_events = 511
mail_from = adm
query_other_jobs = True
resources_available.ncpus = 140
resources_available.nodect = 35
resources_default.ncpus = 1
resources_default.nodect = 1
resources_assigned.ncpus = 0
resources_assigned.nodect = 0
scheduler_iteration = 120
node_check_rate = 150
tcp_timeout = 6
pbs_version = 2.1.3
--------------------------------------------------- end
--
-------------------------------------------------
Dr. K. Hendrickson
MIT Research Engineer, Vortical Flow Research Lab
[EMAIL PROTECTED] | 617-258-7675 | 5-326B
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers