Garrick Staples wrote:
On Thu, Oct 12, 2006 at 06:58:09PM -0400, Neelesh Arora alleged:
- There are several jobs in the queue that are in the Q state. When I do checkjob <jobid>, I get (among other things):
"job can run in partition DEFAULT (63 procs available.  1 procs required)"
but the job remains in Q forever. This is not a case of an unmet resource requirement (as the message above shows, the resources are available).

That means a reservation is set preventing the jobs from running.

- restarting torque and maui did not help either

Look at the reservations preventing the job from running.

If I do showres, I get the expected reservations for the running jobs. By "expected" I mean the number/name of nodes assigned to each job match what qstat/checkjob report. There is only one reservation for an idle job:

ReservationID  Type  S     Start       End  Duration   N/P  StartTime
88655          Job   I  INFINITY  INFINITY  INFINITY  5/10  Mon Nov 12 15:52:32
and,
# showres -n | grep 88655
node015  Job  88655  Idle  2  INFINITY  INFINITE  Mon Nov 12 15:52:32
node014  Job  88655  Idle  2  INFINITY  INFINITE  Mon Nov 12 15:52:32
node010  Job  88655  Idle  2  INFINITY  INFINITE  Mon Nov 12 15:52:32
node003  Job  88655  Idle  2  INFINITY  INFINITE  Mon Nov 12 15:52:32
node002  Job  88655  Idle  2  INFINITY  INFINITE  Mon Nov 12 15:52:32

So this probably means that no other job can start on those five nodes. That still leaves 60+ nodes with no reservations on them at all. Is there something else I am missing here?
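In case it helps anyone following along, this is roughly how I have been poking at the stuck reservation (the job ID is the one from the showres output above; checkjob/diagnose/releaseres are the stock Maui client commands, so adjust if your build installs them elsewhere):

```shell
# Ask Maui for verbose detail on the stuck job, including its hold/block reason
checkjob -v 88655

# Maui's own reservation diagnostics: lists every reservation and flags problems
diagnose -r

# Last resort: release the idle job's reservation and let Maui recreate it
# on the next scheduling iteration
releaseres 88655
```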

>> An update:
>> I notice that when these jobs are stuck, one way to get them started is
>> to set a walltime (using qalter) less than the default walltime. We set
>> a default_walltime of 9999:00:00 at the server level and require the
>> users to specify the needed cpu-time.
>>
>> This was set a long time ago and has not been causing any issues. But it
>> seems now that if you have set this default and then a user submits a
>> job with an explicit -l walltime=<time> specification, then that job
>> runs while older jobs with default walltime wait.
>>
>> Can someone please shed some light on this? I am out of clues here.
>
> Walltime is really important to maui.  Smaller walltimes allow jobs to
> run within backfill windows.  If everyone has infinite walltimes, you
> basically reduce yourself to a simple FIFO scheduler and might as well
> just use pbs_sched.

Well, we set default_walltime so high because Maui does not honor the specified cpu-time (and we want to do job allocation based on cpu-time). Maui would take walltime = cpu-time and kill the job once walltime was exceeded, even when cpu-time was not. See our previous discussion of this Maui bug at: http://www.clusterresources.com/pipermail/torqueusers/2006-June/003729.html
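For reference, the setup and the workaround I described earlier boil down to the following (the walltime value passed to qalter is just an example; resources_default.walltime is the standard TORQUE server attribute, and 88655 is the idle job from the showres output above):

```shell
# Our server-wide default walltime, set once via TORQUE's qmgr
qmgr -c 'set server resources_default.walltime = 9999:00:00'

# Workaround for a stuck job: give it an explicit walltime below the default,
# which lets Maui fit it into a backfill window
qalter -l walltime=48:00:00 88655
```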

-Neel
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
