Hi Reuti,

This was caused by one host having a SCSI disk error. sge_execd was still running, but it could not properly fire up the shepherd (we could not even log into the console because of the disk access errors). So the jobs failed with this message:

03/10/2011 07:14:38|worker|ge-seq-prod|W|job 9548360.1 failed on host node1182 invalid execution state because: shepherd exited with exit status 127: invalid execution state

And, man, did it chew through a lot of jobs fast. We set the load adjustment to 0.50 per job with a one-minute decay time and the load formula to slots. Things run fine and fast now, and the scheduler can really dispatch quickly, especially to a blackhole host.
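I am still going to write the watchdog from my original mail (quoted below). Here is a rough, untested sketch of what I have in mind, in Python rather than perl just for illustration; the messages file path, queue name, admin address, and timeout are placeholders for our setup:

#!/usr/bin/env python3
# Untested sketch of a blackhole-host watchdog, meant to run from cron.
# The messages file path, queue name, admin address and timeout below are
# site-specific placeholders -- adjust before use.

import re
import subprocess

MESSAGES_FILE = "/opt/sge/default/spool/qmaster/messages"  # qmaster messages file (site-specific)
QUEUE = "all.q"                         # queue whose instance gets disabled on the bad host
ADMIN_EMAIL = "ge-admin@example.com"    # placeholder notification address
QMOD_TIMEOUT = 30                       # seconds before we decide qmaster is down or too busy

# Matches lines like:
#   ...|W|job 9548360.1 failed on host node1182 ... shepherd exited with exit status 127 ...
FAIL_RE = re.compile(r"failed on host (\S+).*shepherd exited with exit status \d+")

def failing_hosts(messages_file, max_lines=1000):
    """Collect hosts with shepherd failures from the tail of the qmaster messages file."""
    with open(messages_file) as fh:
        tail = fh.readlines()[-max_lines:]
    hosts = set()
    for line in tail:
        match = FAIL_RE.search(line)
        if match:
            hosts.add(match.group(1))
    return hosts

def disable_queue_instance(host):
    """Disable the queue instance on the bad host, timing out if qmaster does not answer."""
    try:
        subprocess.run(["qmod", "-d", "%s@%s" % (QUEUE, host)],
                       check=True, timeout=QMOD_TIMEOUT)
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError) as err:
        print("could not disable %s@%s: %s" % (QUEUE, host, err))
        return False

if __name__ == "__main__":
    for host in failing_hosts(MESSAGES_FILE):
        if disable_queue_instance(host):
            print("disabled %s@%s -- notify %s" % (QUEUE, host, ADMIN_EMAIL))
            # TODO: mail ge_admin and dig the affected job IDs out of the log for the users

Run from cron every minute or so; qmod -d on an already disabled queue instance is harmless, so the sketch does not bother tracking state between runs.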
-Ed

> Hi,
>
> On 10.03.2011 at 16:50, Edward Lauzier wrote:
>
> > I'm looking for best practices and techniques to detect blackhole hosts
> > quickly and disable them. (Platform LSF already has this built in...)
> >
> > What I see as possible is:
> >
> > Using a cron job on a GE client node...
> >
> > - tail -n 1000 <qmaster_messages_file> | egrep '<for_desired_string>'
> > - if detected, use qmod -d '<queue_instance>' to disable it
> > - send email to the ge_admin list
> > - possibly send email about the failed jobs to the user(s)
> >
> > It must be robust enough to time out properly when GE is down or too busy
> > for qmod to respond, and to cope with filesystem problems, etc.
> >
> > (perl or php alarm and sig handlers for proc_open work well for
> > enforcing timeouts...)
> >
> > Any hints would be appreciated before I start on it...
> >
> > Won't take long to write the code, just looking for best practices and
> > maybe a setting I'm missing in the GE config...
>
> What is causing the blackhole? For example: if it's a full file system on a
> node, you could detect it with a load sensor in SGE and define an alarm
> threshold in the queue setup, so that no more jobs are scheduled to this
> particular node.
>
> -- Reuti
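For anyone who digs this thread out of the archives later: a load sensor along the lines Reuti describes does not need to be much more than the untested sketch below. Here disk_free is a made-up complex name (it would have to be added with qconf -mc and referenced in the queue's load_thresholds) and /scratch is only an example mount point:

#!/usr/bin/env python3
# Untested sketch of an SGE load sensor reporting free space on one filesystem.
# "disk_free" is an assumed custom complex and "/scratch" an example mount point.

import os
import socket
import sys

MOUNT = "/scratch"
COMPLEX = "disk_free"          # must exist in the complex list (qconf -mc)
HOST = socket.gethostname()    # should match the hostname SGE uses for this execd

def free_bytes(path):
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

# Load sensor protocol: execd pokes our stdin once per load report interval;
# "quit" means shut down, anything else means "print a report".
while True:
    line = sys.stdin.readline()
    if not line or line.strip() == "quit":
        break
    print("begin")
    print("%s:%s:%d" % (HOST, COMPLEX, free_bytes(MOUNT)))
    print("end")
    sys.stdout.flush()

Registered through the load_sensor parameter of the execd's configuration and combined with a load_thresholds entry such as disk_free=1G on the queue, the queue instance should go into alarm state on its own, no cron job involved.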