Garrick Staples wrote:
On Fri, Oct 13, 2006 at 04:52:23PM -0400, Neelesh Arora alleged:
Garrick Staples wrote:
On Thu, Oct 12, 2006 at 06:58:09PM -0400, Neelesh Arora alleged:
- There are several jobs in the queue that are in the Q state. When I run checkjob <jobid>, I get (among other things):
"job can run in partition DEFAULT (63 procs available.  1 procs required)"
but the job remains in Q forever. As the message above indicates, this is not a case of an unmet resource requirement.
That means a reservation is set preventing the jobs from running.

- Restarting Torque and Maui did not help either.
Look at the reservations preventing the job from running.

If I do showres, I get the expected reservations for the running jobs. By expected, I mean that the number and names of the nodes assigned to each job match what qstat/checkjob report. There is only one reservation for an idle job:

ReservationID  Type  S  Start     End       Duration  N/P   StartTime
88655          Job   I  INFINITY  INFINITY  INFINITY  5/10  Mon Nov 12 15:52:32

and,
# showres -n|grep 88655
node015  Job  88655  Idle  2  INFINITY  INFINITE  Mon Nov 12 15:52:32
node014  Job  88655  Idle  2  INFINITY  INFINITE  Mon Nov 12 15:52:32
node010  Job  88655  Idle  2  INFINITY  INFINITE  Mon Nov 12 15:52:32
node003  Job  88655  Idle  2  INFINITY  INFINITE  Mon Nov 12 15:52:32
node002  Job  88655  Idle  2  INFINITY  INFINITE  Mon Nov 12 15:52:32

So, this probably means that no other job can start on these nodes. That still leaves 60+ nodes that have no reservations on them. Is there something else I am missing here?

You might need to increase RESERVATIONDEPTH; I have mine at 500.
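
For reference, a minimal sketch of how that setting might look in the Maui configuration file. The value 500 is the one quoted above; the file name maui.cfg and the need to restart Maui for the change to take effect are assumptions about a typical installation, not details given in this thread.

```
# maui.cfg (assumed file name and location; adjust for your site)
# Allow more priority reservations to be created for idle jobs.
# 500 is the value mentioned in this thread, not a general recommendation.
RESERVATIONDEPTH  500
```

After a change like this, showres should list reservations for more idle jobs than before, which is one way to check that the new value was picked up.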


Indeed, increasing RESERVATIONDEPTH fixed the issue. All stuck jobs started running and there are more reservations for Idle jobs now.
Thanks.

Is there a good rule of thumb for choosing the value of this parameter? Or, like most things, does one have to go through trial and error?

-Neel
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
