On Mon, 2006-11-27 at 15:57 -0800, Mark A. Grondona wrote:
> > On Mon, 2006-11-27 at 16:29 -0700, Brian W Barrett wrote:
> > > On Nov 27, 2006, at 4:19 PM, Matt Leininger wrote:
> > > 
> > > >  I've been running more tests of OpenMPI v1.2b.  I've run into several
> > > > cases where the app+MPI use too much memory and the OOM handler kills
> > > > off tasks.  Sometimes the ompi mpirun shuts down gracefully, but other
> > > > times the OOM handler may kill off 1 to 4 MPI tasks per node (when I'm
> > > > using 8 MPI tasks per node).  The remaining MPI tasks keep
> > > > running/polling and have to be killed off by hand.  Has anyone seen
> > > > this behavior before?
> > > 
> > > Are the orteds also getting killed? 
> > 
> >   Not sure.  I'll check the next time I see this.
> > 
> 
> I haven't seen any evidence that orteds are being killed by the Out of Memory
> killer. Only MPI application processes seem to be the chosen victim(s).

  I can confirm this.  I'm running a 2-node, 16-task MPI job.  On one
node all 8 MPI tasks were killed, and on the other node only 1 MPI task
was killed.  The orteds are still running on each node, but they are not
cleaning up.

  - Matt
> 
> 
> > > 
> > > I'm not really familiar with the OOM killer -- does it cause the  
> > > parent of the killed process to get a SIGCHLD?  If not, that could be  
> > > a fairly serious problem for us, as we rely on SIGCHLDs being  
> > > received by the orteds when things die...
> > 
> >   Mark Grondona could answer this.  His reply to devel-core bounced so
> > I'm including de...@open-mpi.org on this thread.
> 
> 
> No, being killed by the OOM killer should be the same as being sent
> SIGKILL as far as userspace is concerned. SIGCHLD to the parent will still
> be sent (and wait(2) will return, etc.)
> 
> mark
> 
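
Mark's description of what the parent sees can be checked with a small
experiment. The following is only a minimal sketch, not Open MPI or orted
code: it assumes Linux (signal.sigtimedwait is not available everywhere),
it sends an explicit SIGKILL as a stand-in for the OOM killer, and the
function name kill_child_and_reap is made up for illustration. It forks a
child, kills it, and reports whether the parent got SIGCHLD and what
wait(2) returned.

```python
import os
import signal
import time


def kill_child_and_reap(timeout=5.0):
    """Fork a child, SIGKILL it, and report what the parent observes.

    Hypothetical helper for illustration; SIGKILL stands in for the OOM killer.
    """
    # Install a no-op handler so SIGCHLD is not in its default "ignore" state,
    # then block it so it stays pending until we ask for it synchronously.
    old_handler = signal.signal(signal.SIGCHLD, lambda signum, frame: None)
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})
    try:
        pid = os.fork()
        if pid == 0:
            # Child: idle until killed; exit on its own if the kill never comes.
            time.sleep(60)
            os._exit(0)

        # Parent: kill the child the same way userspace would see an OOM kill.
        os.kill(pid, signal.SIGKILL)

        # Wait (with a timeout) for the pending SIGCHLD; None means timeout.
        info = signal.sigtimedwait({signal.SIGCHLD}, timeout)

        # Reap the child and decode its wait status.
        _, status = os.waitpid(pid, 0)
        return {
            "sigchld_received": info is not None,
            "killed_by_signal": os.WIFSIGNALED(status),
            "term_signal": signal.Signals(os.WTERMSIG(status)).name,
        }
    finally:
        # Restore the caller's signal mask and SIGCHLD handler.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
        signal.signal(signal.SIGCHLD, old_handler)


if __name__ == "__main__":
    print(kill_child_and_reap())
```

If the parent reports both the SIGCHLD and a SIGKILL termination status,
then the notification path is working at the OS level and the missing
cleanup has to be explained somewhere else.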

