On 02/03/2011 10:06 AM, Axel Kohlmeyer wrote:
> On Thu, Feb 3, 2011 at 12:33 PM, Naveed Near-Ansari <[email protected]> 
> wrote:
>> Is there a way to allocate nodes more randomly?  Currently our jobs seem
>> to allocate to nodes first added into torque.  This can cause problems
>> when one node starts having problems.  The jobs keep getting allocated
>> to the same node, causing the same failures, even when there are
>> hundreds of other nodes available.  Obviously pulling the node as soon
>> as possible is the right thing to do, but sometimes this can take a
>> while (like in the middle of the night when people are working).
> the best way to address this, is to run a node health check script.
> most of the scenarios that cause problems with failing jobs can be
> checked for by parsing the output of dmesg or some log files.
> the health check can then set the node into a disabled state and
> the scheduler will reschedule, skipping the "bad" node.
>
> been using this for years very successfully on a machine with
> myrinet, where the myrinet cards occasionally have a crash in
> their firmware.
>
> randomizing node access only helps, if most of your nodes
> are empty most of the time. if your machine is properly used,
> then the queued up jobs will still go into the broken node and
> crash. so you have to attack this problem differently.
>
> cheers,
>     axel.
>
>
>

I do have a health_check script that catches most of the problems.  This
is also a Myrinet cluster, and I do catch most Myrinet problems.  I have
recently been having certain hardware failures that show up with
various symptoms, and I have been adding things to the check scripts as
they happen.  The script checks dmesg and also runs a few commands to
verify their output.  The problem occurs when a node has a fault that
the check scripts don't catch, so jobs keep getting placed on it.
The cluster is not always at 100% usage, so it is irritating that,
while plenty of functioning nodes are available, all jobs get stuck
because of a single machine.
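
For anyone following along, here is a minimal sketch of the kind of dmesg
check described above.  It is not my actual script.  The patterns are
made-up examples, and the "ERROR" output prefix is an assumption about
how pbs_mom interprets a node check script, so verify it against your
TORQUE version before relying on it.

```python
#!/usr/bin/env python3
"""Sketch of a node health check that scans kernel messages for known faults."""
import re
import subprocess

# Hypothetical patterns; replace them with the messages your own failures produce.
BAD_PATTERNS = [
    r"machine check",
    r"i/o error",
    r"out of memory",
]


def find_problems(log_text, patterns=BAD_PATTERNS):
    """Return the log lines that match any pattern (case-insensitive)."""
    regexes = [re.compile(p, re.IGNORECASE) for p in patterns]
    return [line for line in log_text.splitlines()
            if any(rx.search(line) for rx in regexes)]


def main():
    """Run dmesg, report the most recent matching line, return 1 on a hit."""
    try:
        result = subprocess.run(["dmesg"], capture_output=True, text=True,
                                errors="replace", check=False)
    except OSError:
        # dmesg is missing or not executable: report nothing rather than guess.
        return 0
    hits = find_problems(result.stdout)
    if hits:
        # Assumption: the MOM treats output starting with "ERROR" as a failed check.
        print("ERROR " + hits[-1].strip())
        return 1
    return 0


if __name__ == "__main__":
    main()
```

Each new failure mode then only needs one more entry in BAD_PATTERNS.
A check like this still misses any fault that leaves no trace in dmesg,
which is exactly the case I am running into.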

Others have given me some suggestions that I am trying out.

>> I would still like individual jobs sent to the smallest number of nodes
>> (a 16-core job on 2 nodes), but have the nodes assigned in a more random
>> fashion rather than just the next one available in the list.  I have
>> read through the documentation and am not finding such an option, but
>> perhaps I missed or misunderstood something.
>>
>> Let me know if this should be on the torque list, but I thought maui was
>> responsible for the allocation.
>>
>> Naveed
>> _______________________________________________
>> mauiusers mailing list
>> [email protected]
>> http://www.supercluster.org/mailman/listinfo/mauiusers
>>
>
>
