On 02/03/2011 10:06 AM, Axel Kohlmeyer wrote:
> On Thu, Feb 3, 2011 at 12:33 PM, Naveed Near-Ansari <[email protected]> wrote:
>> Is there a way to allocate nodes more randomly? Currently our jobs seem
>> to be allocated to the nodes that were first added into Torque. This can
>> cause problems when one node starts having trouble: the jobs keep getting
>> allocated to the same node, causing the same failures, even when there
>> are hundreds of other nodes available. Obviously pulling the node as
>> soon as possible is the right thing to do, but sometimes this can take
>> a while (like in the middle of the night when people are working).
>
> The best way to address this is to run a node health check script.
> Most of the scenarios that cause problems with failing jobs can be
> checked for by parsing the output of dmesg or some log files.
> The health check can then set the node into a disabled state, and
> the scheduler will reschedule around it, skipping the "bad" node.
>
> I have been using this for years very successfully on a machine with
> Myrinet, where the Myrinet cards occasionally have a crash in their
> firmware.
>
> Randomizing node access only helps if most of your nodes are empty
> most of the time. If your machine is properly used, then the queued-up
> jobs will still go onto the broken node and crash, so you have to
> attack this problem differently.
>
> cheers,
>     axel.
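For readers of the archive, here is a minimal sketch of the kind of health check Axel describes, assuming a Torque cluster where the account running the check is allowed to call "pbsnodes -o"; the dmesg patterns below are placeholders for whatever your hardware actually logs, not real Myrinet messages.

    #!/usr/bin/env python
    # Minimal node health check sketch: scan dmesg for known-bad messages
    # and, if any match, mark this node offline so the scheduler skips it.
    # Assumptions: Torque's pbsnodes client is in PATH and this account has
    # operator rights to offline nodes; the patterns are only placeholders.
    import re
    import socket
    import subprocess

    BAD_PATTERNS = [
        r"firmware.*(crash|panic|error)",   # placeholder: interconnect firmware
        r"I/O error",                       # placeholder: failing disk
        r"Machine Check Exception",         # placeholder: CPU/memory trouble
    ]

    def bad_dmesg_lines():
        out = subprocess.check_output(["dmesg"]).decode("utf-8", "replace")
        return [line for line in out.splitlines()
                if any(re.search(p, line, re.IGNORECASE) for p in BAD_PATTERNS)]

    def offline_self(reason):
        node = socket.gethostname()
        # Stop the scheduler from placing new jobs on this node.
        subprocess.call(["pbsnodes", "-o", node])
        print("health_check: offlined %s: %s" % (node, reason))

    if __name__ == "__main__":
        hits = bad_dmesg_lines()
        if hits:
            offline_self(hits[0])

Something like this can run from cron on every compute node (or from Torque's mom health-check hook, if your version provides one); once the node is repaired, "pbsnodes -c" clears the offline flag.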
I do have a health_check script that catches most of the problems. This is
also a Myrinet cluster, and the script does catch most Myrinet problems. I
have recently been hitting hardware failures that show up with varying
symptoms, and I have been adding checks to the script as they happen. The
script checks dmesg and also runs a few commands and verifies their output
(a rough sketch of that part is at the bottom of this message). The problem
is that when a node has a failure the script does not catch, jobs keep
getting placed on it. The cluster is not always at 100% usage, so it is
irritating that jobs get stuck on a single broken machine while there are
plenty of functioning nodes available. Others have given me some
suggestions that I am trying out.

>> I would still like individual jobs sent to the smallest number of nodes
>> (a 16-core job on 2 nodes), but have the nodes assigned in a more random
>> fashion rather than just the next one available in the list. I have read
>> through the documentation and am not finding such an option, but perhaps
>> I missed or misunderstood something.
>>
>> Let me know if this should be on the Torque list, but I thought Maui was
>> responsible for the allocation.
>>
>> Naveed
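In case it is useful to anyone else piecing together such a script, here is a rough sketch of the "run a few commands and check their output" part mentioned above. The probe commands and expected strings are placeholders (substitute whatever your Myrinet diagnostics print on a healthy board), and wiring a failure into "pbsnodes -o" is left as a comment.

    #!/usr/bin/env python
    # Sketch of output-probing checks to bolt onto an existing health_check
    # script.  The (command, expected-substring) pairs are placeholders --
    # substitute the diagnostics your hardware actually provides.
    import subprocess

    PROBES = [
        (["uptime"], "load average"),
        # (["mx_info"], "Status: Running"),  # hypothetical Myrinet MX probe
    ]

    def probe_ok(cmd, expected):
        try:
            out = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
        except (OSError, subprocess.CalledProcessError):
            return False
        return expected in out.decode("utf-8", "replace")

    if __name__ == "__main__":
        failed = [" ".join(cmd) for cmd, want in PROBES if not probe_ok(cmd, want)]
        if failed:
            # The existing health_check could offline the node at this point,
            # e.g. by calling "pbsnodes -o <this node>".
            print("health_check: probes failed: " + ", ".join(failed))
            raise SystemExit(1)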
