Garrick Staples wrote:
On Fri, Oct 13, 2006 at 04:52:23PM -0400, Neelesh Arora alleged:
Garrick Staples wrote:
On Thu, Oct 12, 2006 at 06:58:09PM -0400, Neelesh Arora alleged:
- There are several jobs in the queue that are in the Q state. When I do
checkjob <jobid>, I get (among other things):
"job can run in partition DEFAULT (63 procs available. 1 procs required)"
but the job remains in Q forever. This is not a case of a resource
requirement not being met (as the above message indicates).
That means a reservation is set preventing the jobs from running.
- Restarting torque and maui did not help either.
Look at the reservations preventing the job from running.
If I do showres, I get the expected reservations for the running jobs.
By expected, I mean the number and names of the nodes assigned to each job
match what qstat/checkjob report. There is only one reservation for an idle job:
ReservationID  Type  S  Start     End       Duration  N/P   StartTime
88655          Job   I  INFINITY  INFINITY  INFINITY  5/10  Mon Nov 12 15:52:32
and,
# showres -n|grep 88655
node015  Job  88655  Idle  2  INFINITY  INFINITE  Mon Nov 12 15:52:32
node014  Job  88655  Idle  2  INFINITY  INFINITE  Mon Nov 12 15:52:32
node010  Job  88655  Idle  2  INFINITY  INFINITE  Mon Nov 12 15:52:32
node003  Job  88655  Idle  2  INFINITY  INFINITE  Mon Nov 12 15:52:32
node002  Job  88655  Idle  2  INFINITY  INFINITE  Mon Nov 12 15:52:32
So, this probably means that no other job can start on these nodes. That
still leaves 60+ nodes that have no reservations on them. Is there
something else I am missing here?
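As a cross-check of the listing above, here is a minimal Python sketch that tallies which nodes a given reservation holds. It assumes each `showres -n` record sits on one line with the columns in the order seen in this thread (node, type, reservation ID, state, task count, ...); the function name `reserved_nodes` and the `SAMPLE` string are made up for illustration and are not part of Maui.

```python
# Records copied from the `showres -n | grep 88655` output quoted in this
# thread, with each wrapped record joined back onto one line.
SAMPLE = """\
node015 Job 88655 Idle 2 INFINITY INFINITE Mon Nov 12 15:52:32
node014 Job 88655 Idle 2 INFINITY INFINITE Mon Nov 12 15:52:32
node010 Job 88655 Idle 2 INFINITY INFINITE Mon Nov 12 15:52:32
node003 Job 88655 Idle 2 INFINITY INFINITE Mon Nov 12 15:52:32
node002 Job 88655 Idle 2 INFINITY INFINITE Mon Nov 12 15:52:32
"""


def reserved_nodes(showres_n_text, res_id):
    """Return {node name: task count} for the reservation `res_id`."""
    nodes = {}
    for line in showres_n_text.splitlines():
        fields = line.split()
        # Assumed column order: node, type, reservation ID, state, tasks, ...
        if len(fields) >= 5 and fields[2] == res_id and fields[4].isdigit():
            nodes[fields[0]] = int(fields[4])
    return nodes


nodes = reserved_nodes(SAMPLE, "88655")
print(len(nodes), "nodes,", sum(nodes.values()), "procs reserved")
```

For the sample this reports 5 nodes and 10 procs, which agrees with the 5/10 in the N/P column of the plain showres output.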
You might need to increase RESERVATIONDEPTH; I have mine at 500.
Indeed, increasing RESERVATIONDEPTH fixed the issue. All stuck jobs
started running and there are more reservations for Idle jobs now.
Thanks.
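For anyone scripting this change, a rough Python sketch of setting the parameter in the text of a maui.cfg file follows. It assumes the file holds one "NAME value" pair per line, optionally written with an index such as NAME[0]; the helper name `set_param`, the sample config excerpt, and its starting value of 1 are assumptions for illustration only. How the scheduler picks up the edited file is not covered here.

```python
import re


def set_param(cfg_text, name, value):
    """Return cfg_text with `name` set to `value`, appending it if absent."""
    # Match "NAME value" or "NAME[index] value" at the start of a line.
    # Commented-out lines (e.g. "#NAME value") are deliberately not matched.
    pattern = re.compile(
        r"^(%s(?:\[\w+\])?)[ \t]+\S.*$" % re.escape(name), re.MULTILINE
    )
    if pattern.search(cfg_text):
        return pattern.sub(lambda m: "%s %s" % (m.group(1), value), cfg_text)
    # Parameter not present: append it on its own line.
    if cfg_text and not cfg_text.endswith("\n"):
        cfg_text += "\n"
    return cfg_text + "%s %s\n" % (name, value)


# Hypothetical maui.cfg excerpt; the starting value 1 is an assumption.
cfg = "RMPOLLINTERVAL 00:00:30\nRESERVATIONDEPTH 1\n"
print(set_param(cfg, "RESERVATIONDEPTH", 500))
```

The function only transforms a string, so reading and writing the real file (and keeping a backup) is left to the caller.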
Is there a good rule-of-thumb when deciding on the value for this
parameter? Or, as with most things, does one have to go through trial and error?
-Neel
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers