> I've seen something like this from time to time, but not recently.
> It was a problem when a few jobs blocked all queues. In my
> observation the reason was:
> 1. job 1 started on WN wn-1
> 2. it had a problem during stage-in
> 3. job 1 ended with an error (in stage-in) on WN wn-1
> 4. job 1 returned to the queue (in the "W" state), as a stage-in error
>    isn't fatal from Torque's point of view
> 5. at this point job 1 still has WN wn-1 as its assigned resources
> 6. job 2, submitted next to the same WN wn-1, started and ran
>    without error for a (relatively) long time
> 
> Maui will try to run job 1 in the next scheduling cycle without
> changing the already assigned resources, but job 1 cannot run on
> WN wn-1, which is already occupied by job 2. As far as I have seen,
> at this point (or when there are a few such jobs) Maui gets stuck
> and terminates the scheduling cycle.


That I have seen too, but I doubt it is the case now: the level is
maintained, and jobs are scheduled and run, just not beyond a certain threshold,
and the error reported is a reservation issue. If I stop maui and start
pbs_sched, it happily schedules jobs. When I move back to maui, it lets the
level drop back to its comfort zone and then starts scheduling jobs again. The
scenario you describe usually leads to full blocking, where you see the job
count just dropping rather than a flow of jobs. Also, we use nodes with 27-36
job slots, so a single job can't really block a node.
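
To make the difference concrete, here is a toy model of the scenario quoted
above. It is only a sketch under stated assumptions, not Maui's actual
algorithm: the names Job, pinned_node, run_cycle and demo are hypothetical, and
"abort the cycle on the first job that cannot be placed" is simply the
behaviour described in the quoted message.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    name: str
    # Node kept from a failed earlier start (hypothetical field, toy model only).
    pinned_node: Optional[str] = None
    state: str = "W"


def run_cycle(queue: List[Job], free_slots: Dict[str, int],
              abort_on_failure: bool = True) -> List[str]:
    """Run one toy scheduling pass and return the names of the jobs started."""
    started = []
    for job in queue:
        if job.state != "W":
            continue
        # A pinned job may only use the node it was assigned earlier.
        candidates = [job.pinned_node] if job.pinned_node else sorted(free_slots)
        node = next((n for n in candidates if free_slots.get(n, 0) > 0), None)
        if node is None:
            if abort_on_failure:
                break  # assumed behaviour: the whole cycle ends here
            continue
        free_slots[node] -= 1
        job.state = "R"
        started.append(job.name)
    return started


def demo(slots_per_node: int) -> List[str]:
    # wn-1 already has one slot taken by job 2; wn-2 is idle.
    free = {"wn-1": slots_per_node - 1, "wn-2": slots_per_node}
    queue = [Job("job1", pinned_node="wn-1"), Job("job3"), Job("job4")]
    return run_cycle(queue, free)
```

With one slot per node, demo(1) starts nothing, because job1 cannot be placed
and the pass stops. With 32 slots per node, demo(32) starts all three jobs,
since job 2 occupies only one of the slots on wn-1.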

Mario Kadastik, PhD
Researcher

---
  "Physics is like sex, sure it may have practical reasons, but that's not why 
we do it" 
     -- Richard P. Feynman

_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
