Most cluster setups are quite sensitive to node failures.  There are
certainly ways to increase fault tolerance, but they are currently
outside the scope of a default OSCAR setup.

From the error messages it seems that you are still in an environment
that thinks node 2 is part of your execution environment.

My suggestion is to delete the dead node using the install_cluster
wizard, as detailed in the installation manual, then add it again once
it is repaired.  This will clean up all the OSCAR-generated files
related to that node.  You will have to fix any host files, etc., that
you generated on your own.  If you are running under LAM/MPI, you at
least need to lamhalt, then lamboot with a host file that does not
contain the dead node.
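
As a rough sketch of that last step, assuming a plain host file with one
node per line (host name first, optional fields such as cpu=2 after it,
and # for comments), something like the following would write a copy
without the dead node.  The function names and the file names in the
note below are made up for illustration; they are not part of OSCAR,
LAM, or MPICH.

```python
def drop_nodes(lines, dead):
    """Return host-file lines with the entries for dead nodes removed.

    Assumption: the first whitespace-separated token on a line is the
    host name. Names are compared on their short form, so "oscarnode2"
    also matches "oscarnode2.oscardomain". Comments and blank lines
    are passed through unchanged.
    """
    dead_short = {name.split(".")[0] for name in dead}
    kept = []
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            host = stripped.split()[0]
            if host.split(".")[0] in dead_short:
                continue  # skip the dead node's entry
        kept.append(line)
    return kept


def write_filtered(src, dst, dead):
    """Copy host file src to dst, leaving out the dead nodes."""
    with open(src) as f:
        lines = f.readlines()
    with open(dst, "w") as f:
        f.writelines(drop_nodes(lines, dead))
```

For example, write_filtered("lamhosts", "lamhosts.alive", ["oscarnode2"])
followed by lamhalt and then lamboot lamhosts.alive should bring LAM
back up without the dead node.  The mpirun path in the quoted output
suggests MPICH with the ch_p4 device; there, as far as I know, the
equivalent is to pass the filtered file to mpirun with -machinefile.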

The MPI implementations do not take responsibility for knowing
anything about the state of the nodes they are running code on.  In
my limited experience, even the resource managers/schedulers (such as
Torque/Maui) need to be told when there is a node hardware failure.
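
Since nothing in the stack tracks node state for you, a small
pre-flight check before calling mpirun can save a failed launch.  This
is only a minimal sketch, assuming a node counts as alive when its ssh
port (22, as in the error output quoted below) accepts a TCP
connection; the function names are hypothetical.

```python
import socket


def is_reachable(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, unroutable hosts,
        # and host names that do not resolve.
        return False


def split_by_reachability(hosts, port=22, timeout=3.0):
    """Split hosts into (alive, dead) lists, preserving input order."""
    alive, dead = [], []
    for host in hosts:
        if is_reachable(host, port, timeout):
            alive.append(host)
        else:
            dead.append(host)
    return alive, dead
```

An open ssh port does not prove the node is healthy enough to run a
job, so treat this as a first filter rather than a real health check.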

On Mon, 28 Mar 2005 15:13:49 +0800, ullas ckm <[EMAIL PROTECTED]> wrote:
> Hi all,
> 
> We are facing problems while executing MPI programs. If all the client
> nodes are up, then the MPI application works fine. But
> if any one of the nodes is down, then the execution terminates.
> 
> Here is one sample output (specifically, oscarnode2 is dead):
> 
> # mpirun -np 4 ./a.out
> oscarserver.oscardomain
> Sun Mar 27 23:13:15 IST 2005
> oscarnode1.oscardomain
> Sun Mar 27 23:13:25 IST 2005
> ssh: connect to host oscarnode2 port 22: No route to host
> p0_14474:  p4_error: Child process exited while making connection to
> remote process on oscarnode2: 0
> Killed by signal 2.
> 
> /opt/mpich-1.2.5.10-ch_p4-gcc/bin/mpirun: line 1: 14474 Broken pipe
>           /home/oscartst/suresh/./a.out -p4pg
> /home/oscartst/suresh/PI14350 -p4wd /home/oscartst/suresh
> 
> But it should not affect the performance of the cluster even if one
> of the nodes is dead, right?
> 
> Eagerly waiting for the response.
> 
> ullas
> 
> _______________________________________________
> Oscar-users mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/oscar-users
>

