On Fri, Oct 13, 2006 at 04:52:23PM -0400, Neelesh Arora alleged:
> Garrick Staples wrote:
> > On Thu, Oct 12, 2006 at 06:58:09PM -0400, Neelesh Arora alleged:
> > > - There are several jobs in the queue that are in the Q state. When I do
> > >   checkjob <jobid>, I get (among other things):
> > >   "job can run in partition DEFAULT (63 procs available. 1 procs required)"
> > >   but the job remains in Q forever. So this is not a case of an unmet
> > >   resource requirement (the message above shows the resources are there).
> >
> > That means a reservation is set preventing the jobs from running.
> >
> > > - restarting torque and maui did not help either
> >
> > Look at the reservations preventing the job from running.
>
> If I do showres, I get the expected reservations for the running jobs.
> By expected, I mean the number/name of nodes assigned to each job are as
> reported by qstat/checkjob. There is only one reservation for an idle job:
>
> ReservationID  Type  S  Start     End       Duration  N/P   StartTime
> 88655          Job   I  INFINITY  INFINITY  INFINITY  5/10  Mon Nov 12 15:52:32
>
> and,
>
> # showres -n | grep 88655
> node015  Job  88655  Idle  2  INFINITY  INFINITE  Mon Nov 12 15:52:32
> node014  Job  88655  Idle  2  INFINITY  INFINITE  Mon Nov 12 15:52:32
> node010  Job  88655  Idle  2  INFINITY  INFINITE  Mon Nov 12 15:52:32
> node003  Job  88655  Idle  2  INFINITY  INFINITE  Mon Nov 12 15:52:32
> node002  Job  88655  Idle  2  INFINITY  INFINITE  Mon Nov 12 15:52:32
>
> So, this probably means that no other job can start on these nodes. That
> still leaves 60+ nodes that have no reservations on them. Is there
> something else I am missing here?
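The first thing I would do is ask maui directly what it thinks is blocking
the job. Something along these lines should work (flags from memory, so
double-check them against your maui version; 88655 is just the job id from
your showres output above):

    # checkjob -v 88655    (verbose output includes why nodes are rejected)
    # diagnose -r          (per-reservation detail; shows what pins those 5 nodes)
    # showbf               (the backfill windows maui currently sees)

If diagnose -r shows the idle job's reservation holding nodes it can never
actually use, that is usually the smoking gun.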
You might need to increase RESERVATIONDEPTH; I have mine at 500.

> > > An update:
> > > I notice that when these jobs are stuck, one way to get them started is
> > > to set a walltime (using qalter) less than the default walltime. We set
> > > a default_walltime of 9999:00:00 at the server level and require the
> > > users to specify the needed cpu-time.
> > >
> > > This was set a long time ago and has not been causing any issues. But it
> > > seems now that if you have set this default and then a user submits a
> > > job with an explicit -l walltime=<time> specification, then that job
> > > runs while older jobs with the default walltime wait.
> > >
> > > Can someone please shed some light on this? I am out of clues here.
> >
> > Walltime is really important to maui. Smaller walltimes allow jobs to
> > run within backfill windows. If everyone has infinite walltimes, you
> > basically reduce yourself to a simple FIFO scheduler and might as well
> > just use pbs_sched.
>
> Well, we set default_walltime so high because maui does not care about
> the specified cpu-time (and we want to do job allocation based on
> cpu-time). Maui would take wall-time = cpu-time and kill the job if
> wall-time was exceeded, even if cpu-time was not. Refer to our previous
> discussion on this maui bug at:
> http://www.clusterresources.com/pipermail/torqueusers/2006-June/003729.html

It is a valid work-around, which has obviously served you well for a few
months, but the *optimum* case is to have correct walltimes so that you
can take advantage of backfill.

Maybe give pbs_sched a try. Not that I really recommend it, but it won't
try to be smart about things.
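To make the walltime point concrete, here is roughly the shape of it (the
job script name and the 2-hour figure are made up for illustration; the
qmgr attribute is the standard torque one):

    # qmgr -c "print server" | grep walltime
    set server resources_default.walltime = 9999:00:00

    $ qsub job.sh                        # inherits 9999:00:00, never fits a backfill window
    $ qsub -l walltime=2:00:00 job.sh    # short enough for maui to backfill around reservations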
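And for reference, the maui.cfg knobs I was talking about look something
like this (the values are just what I happen to run, not tuned
recommendations, so treat them as a starting point):

    RESERVATIONDEPTH    500
    BACKFILLPOLICY      FIRSTFIT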
