On Thu, Feb 3, 2011 at 12:33 PM, Naveed Near-Ansari <[email protected]> wrote:
>
> Is there a way to allocate nodes more randomly? Currently our jobs seem
> to allocate to nodes first added into torque. This can cause problems
> when one node starts having problems. The jobs keep getting allocated
> to the same node, causing the same failures, even when there are
> hundreds of other nodes available. Obviously pulling the node as soon
> as possible is the right thing to do, but sometimes this can take a
> while (like in the middle of the night when people are working).
the best way to address this is to run a node health check script.
most of the scenarios that cause jobs to fail can be detected by
parsing the output of dmesg or some log files.
the health check can then put the node into a disabled state, and
the scheduler will reschedule, skipping the "bad" node.
i have been using this very successfully for years on a machine
with myrinet, where the firmware on the myrinet cards occasionally
crashes.
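
to make the idea concrete, here is a minimal sketch of such a check in python. it is not the script i run; the patterns, the function names and the "ERROR" output convention are all assumptions for illustration, so adapt them to the messages your failing nodes actually produce.

```python
#!/usr/bin/env python3
"""Hypothetical node health check: scan kernel messages for known-bad patterns."""
import re
import subprocess
import sys

# Illustrative patterns only (assumptions); replace them with the
# messages that your own node failures actually leave behind.
BAD_PATTERNS = [
    r"firmware.*(crash|hang)",
    r"I/O error",
    r"Out of memory",
]


def find_problems(log_text, patterns=BAD_PATTERNS):
    """Return the log lines that match any of the given regex patterns."""
    compiled = [re.compile(p, re.IGNORECASE) for p in patterns]
    return [
        line
        for line in log_text.splitlines()
        if any(rx.search(line) for rx in compiled)
    ]


def main():
    # Read the kernel ring buffer; on some systems dmesg needs
    # elevated privileges, in which case stdout will be empty.
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=False
    )
    problems = find_problems(result.stdout)
    if problems:
        # Assumed convention: report the first offending line prefixed
        # with ERROR and exit nonzero so the caller can react to it.
        print("ERROR " + problems[0].strip())
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

how the result takes the node out of service depends on your setup. one option is a small wrapper that calls "pbsnodes -o <nodename>" when the script exits nonzero. as far as i know, pbs_mom can also run a check script periodically through its node check settings, but please verify that against the documentation for your torque version.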
randomizing node access only helps if most of your nodes
are empty most of the time. if your machine is well utilized,
then the queued jobs will still land on the broken node and
crash. so you have to attack this problem differently.
cheers,
axel.
> I would still like individual jobs sent to the smallest number of nodes
> (a 16-core job on 2 nodes), but have the nodes assigned in a more random
> fashion rather than just the next one available in the list. I have
> read through the documentation and am not finding such an option, but
> perhaps I missed or misunderstood something.
>
> Let me know if this should be on the TORQUE list, but I thought Maui was
> responsible for the allocation.
>
> Naveed
> _______________________________________________
> mauiusers mailing list
> [email protected]
> http://www.supercluster.org/mailman/listinfo/mauiusers
>
--
Dr. Axel Kohlmeyer [email protected]
http://sites.google.com/site/akohlmey/
Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.