Garrick Staples wrote:
On Thu, Oct 12, 2006 at 06:58:09PM -0400, Neelesh Arora alleged:
>> - There are several jobs in the queue that are in the Q state. When I do
>> checkjob <jobid>, I get (among other things):
>> "job can run in partition DEFAULT (63 procs available. 1 procs required)"
>> but the job remains in Q forever. It is not the case of a resource
>> requirement not being met (as the above message indicates).
>
> That means a reservation is set preventing the jobs from running.
>
>> - restarting torque and maui did not help either
>
> Look at the reservations preventing the job from running.
If I do showres, I get the expected reservations for the running jobs. By
expected, I mean the number and names of the nodes assigned to each job
match what qstat/checkjob report. There is only one reservation for an
idle job:
ReservationID  Type  S  Start     End       Duration  N/P   StartTime
88655          Job   I  INFINITY  INFINITY  INFINITY  5/10  Mon Nov 12 15:52:32
and,
# showres -n | grep 88655
node015  Job  88655  Idle  2  INFINITY  INFINITE  Mon Nov 12 15:52:32
node014  Job  88655  Idle  2  INFINITY  INFINITE  Mon Nov 12 15:52:32
node010  Job  88655  Idle  2  INFINITY  INFINITE  Mon Nov 12 15:52:32
node003  Job  88655  Idle  2  INFINITY  INFINITE  Mon Nov 12 15:52:32
node002  Job  88655  Idle  2  INFINITY  INFINITE  Mon Nov 12 15:52:32
So, this probably means that no other job can start on these nodes. That
still leaves 60+ nodes that have no reservations on them. Is there
something else I am missing here?
>> An update:
>> I notice that when these jobs are stuck, one way to get them started is
>> to set a walltime (using qalter) less than the default walltime. We set
>> a default_walltime of 9999:00:00 at the server level and require the
>> users to specify the needed cpu-time.
>>
>> This was set a long time ago and has not been causing any issues. But it
>> seems now that if you have set this default and then a user submits a
>> job with an explicit -l walltime=<time> specification, then that job
>> runs while older jobs with default walltime wait.
>>
>> Can someone please shed some light on this? I am out of clues here.
>
> Walltime is really important to maui. Smaller walltimes allow jobs to
> run within backfill windows. If everyone has infinite walltimes, you
> basically reduce yourself to a simple FIFO scheduler and might as well
> just use pbs_sched.
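(For the archives: the difference Garrick describes shows up at submit
time. A job with an explicit, finite walltime can be backfilled; one that
inherits the effectively infinite default cannot. Commands below are
illustrative only; the script name and the 24-hour limit are made up.)

```shell
# Submitted with the server default (9999:00:00) -- effectively infinite
# walltime, so Maui cannot fit the job into any backfill window:
qsub job.sh

# Submitted with an explicit, finite walltime -- Maui can backfill this
# into idle node time ahead of earlier infinite-walltime jobs:
qsub -l walltime=24:00:00 job.sh
```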
Well, we set default_walltime so high because Maui does not honor the
specified cpu-time (and we want to do job allocation based on cpu-time).
Maui would treat wall-time as equal to cpu-time and kill the job once
wall-time was exceeded, even if cpu-time was not. See our previous
discussion of this Maui bug at:
http://www.clusterresources.com/pipermail/torqueusers/2006-June/003729.html
-Neel
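For reference, the server default and the workaround described above would
look roughly like this. The values are the ones mentioned in this thread,
but the exact qmgr attribute path and the 24-hour figure are illustrative;
these commands need a working Torque/Maui installation.

```shell
# Server-level default walltime, set once via qmgr (9999:00:00 is the
# value from this thread):
qmgr -c "set server resources_default.walltime = 9999:00:00"

# The workaround for a stuck job: use qalter to lower its walltime below
# that default (88655 is the stuck job from the showres output above):
qalter -l walltime=24:00:00 88655
```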
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers