Hi Reuti,

This was caused by one host having a SCSI disk error. sge_execd was still running, but it could not properly fire up the shepherd (we could not even log into the console because of the disk access errors). So the jobs failed with this message:

03/10/2011 07:14:38|worker|ge-seq-prod|W|job 9548360.1 failed on host node1182 invalid execution state because: shepherd exited with exit status 127: invalid execution state

And, man, did it chew through a lot of jobs fast. We set the load adjustment to 0.50 per job with a one-minute decay time and the load formula to slots. Things run fine and fast now, and the scheduler can really dispatch quickly, especially to a blackhole host.
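I am still going to write the watchdog from my original mail (quoted below). Here is a rough, untested sketch of what I have in mind, in Python rather than perl just for illustration; the messages file path, queue name, admin address, and timeout are placeholders for our setup:

#!/usr/bin/env python3
# Untested sketch of a blackhole-host watchdog, meant to run from cron.
# The messages file path, queue name, admin address and timeout below are
# site-specific placeholders -- adjust before use.

import re
import subprocess

MESSAGES_FILE = "/opt/sge/default/spool/qmaster/messages"  # qmaster messages file (site-specific)
QUEUE = "all.q"                         # queue whose instance gets disabled on the bad host
ADMIN_EMAIL = "ge-admin@example.com"    # placeholder notification address
QMOD_TIMEOUT = 30                       # seconds before we decide qmaster is down or too busy

# Matches lines like:
#   ...|W|job 9548360.1 failed on host node1182 ... shepherd exited with exit status 127 ...
FAIL_RE = re.compile(r"failed on host (\S+).*shepherd exited with exit status \d+")

def failing_hosts(messages_file, max_lines=1000):
    """Collect hosts with shepherd failures from the tail of the qmaster messages file."""
    with open(messages_file) as fh:
        tail = fh.readlines()[-max_lines:]
    hosts = set()
    for line in tail:
        match = FAIL_RE.search(line)
        if match:
            hosts.add(match.group(1))
    return hosts

def disable_queue_instance(host):
    """Disable the queue instance on the bad host, timing out if qmaster does not answer."""
    try:
        subprocess.run(["qmod", "-d", "%s@%s" % (QUEUE, host)],
                       check=True, timeout=QMOD_TIMEOUT)
        return True
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError, OSError) as err:
        print("could not disable %s@%s: %s" % (QUEUE, host, err))
        return False

if __name__ == "__main__":
    for host in failing_hosts(MESSAGES_FILE):
        if disable_queue_instance(host):
            print("disabled %s@%s -- notify %s" % (QUEUE, host, ADMIN_EMAIL))
            # TODO: mail ge_admin and dig the affected job IDs out of the log for the users

Run from cron every minute or so; qmod -d on an already disabled queue instance is harmless, so the sketch does not bother tracking state between runs.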
-Ed

> Hi,
>
> On 10.03.2011 at 16:50, Edward Lauzier wrote:
>
> > I'm looking for best practices and techniques to detect blackhole hosts
> > quickly and disable them. (Platform LSF already has this built in...)
> >
> > What I see as possible is:
> >
> > Using a cron job on a GE client node...
> >
> > - tail -n 1000 <qmaster_messages_file> | egrep '<for_desired_string>'
> > - if detected, use qmod -d '<queue_instance>' to disable it
> > - send email to the ge_admin list
> > - possibly send email about the failed jobs to the user(s)
> >
> > It must be robust enough to time out properly when GE is down or too busy
> > for qmod to respond, and to cope with filesystem problems, etc.
> >
> > (perl or php alarm and sig handlers for proc_open work well for
> > enforcing timeouts...)
> >
> > Any hints would be appreciated before I start on it...
> >
> > Won't take long to write the code, just looking for best practices and
> > maybe a setting I'm missing in the GE config...
>
> What is causing the blackhole? For example: if it's a full file system on a
> node, you could detect it with a load sensor in SGE and define an alarm
> threshold in the queue setup, so that no more jobs are scheduled to this
> particular node.
>
> -- Reuti
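For anyone who digs this thread out of the archives later: a load sensor along the lines Reuti describes does not need to be much more than the untested sketch below. Here disk_free is a made-up complex name (it would have to be added with qconf -mc and referenced in the queue's load_thresholds) and /scratch is only an example mount point:

#!/usr/bin/env python3
# Untested sketch of an SGE load sensor reporting free space on one filesystem.
# "disk_free" is an assumed custom complex and "/scratch" an example mount point.

import os
import socket
import sys

MOUNT = "/scratch"
COMPLEX = "disk_free"          # must exist in the complex list (qconf -mc)
HOST = socket.gethostname()    # should match the hostname SGE uses for this execd

def free_bytes(path):
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

# Load sensor protocol: execd pokes our stdin once per load report interval;
# "quit" means shut down, anything else means "print a report".
while True:
    line = sys.stdin.readline()
    if not line or line.strip() == "quit":
        break
    print("begin")
    print("%s:%s:%d" % (HOST, COMPLEX, free_bytes(MOUNT)))
    print("end")
    sys.stdout.flush()

Registered through the load_sensor parameter of the execd's configuration and combined with a load_thresholds entry such as disk_free=1G on the queue, the queue instance should go into alarm state on its own, no cron job involved.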