David Mathog wrote:
mg <[EMAIL PROTECTED]> wrote:
I use MPICH-1.2.5.2 to build and run a parallel FEM application.
During a parallel run, one process can crash, leaving the other
processes running, and OS commands have to be used to kill these zombies.
So, does anyone have a solution to avoid zombies after a failed
parallel run: can the crashed process kill the other processes?
I think what you're saying is one compute node dies and this causes the
master and processes on the other nodes to run forever, or at least
not exit even if they have stopped using CPU. Or are you really
asking about processes that show up in the unix "Zombie" state?
Assuming the former, <snip>
I think what the OP is asking is how to kill (automagically) all
processes in a parallel run once one process has crashed (due to a
segmentation fault or something similar).
Generally, if one process (out of the whole bunch of processes) crashes,
all other processes will wait forever from the moment they try to
communicate with the crashed process, or when they reach MPI_Finalize.
So how can one kill all remaining processes?
toon
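
To make the question concrete, here is a minimal sketch of the
"kill everything once one process dies" idea, done from the launcher
side rather than inside the application. It is not how MPICH's mpirun
works; run_group and its parameters (poll_interval, grace) are
hypothetical names chosen for illustration, and it only supervises
processes started on the local machine. Inside the MPI program itself,
a rank that detects a fatal error can call
MPI_Abort(MPI_COMM_WORLD, errorcode), but that does not help when the
process dies before it can make the call.

```python
import subprocess
import sys
import time


def run_group(commands, poll_interval=0.05, grace=2.0):
    """Start every command; if one exits non-zero, stop all the others.

    Returns the exit codes in the same order as `commands`.
    (Hypothetical helper, local processes only.)
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]

    # Poll until one process fails or every process has finished.
    while True:
        codes = [p.poll() for p in procs]
        if any(c not in (None, 0) for c in codes):
            break  # at least one process crashed or returned an error
        if all(c is not None for c in codes):
            break  # everything finished cleanly
        time.sleep(poll_interval)

    # Ask any survivors to stop (a no-op when everything already exited).
    for p in procs:
        if p.poll() is None:
            p.terminate()

    # Give them a short grace period, then kill the stubborn ones.
    deadline = time.monotonic() + grace
    for p in procs:
        try:
            p.wait(timeout=max(0.0, deadline - time.monotonic()))
        except subprocess.TimeoutExpired:
            p.kill()
            p.wait()

    return [p.returncode for p in procs]


if __name__ == "__main__":
    # One worker fails at once; the other two would otherwise hang.
    crash = [sys.executable, "-c", "import sys; sys.exit(1)"]
    hang = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(run_group([crash, hang, hang]))
```

On a real cluster the same supervision would have to reach the remote
nodes (for example through whatever remote shell or batch system
started the ranks), which this sketch does not attempt.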
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf