I'd be in favor of bringing this to v1.3. Does the fix have other
dependencies, or would it be difficult to port?
Begin forwarded message:
From: "Open MPI" <b...@open-mpi.org>
Date: June 8, 2009 11:31:20 AM PDT
Cc: <b...@osl.iu.edu>
Subject: Re: [Open MPI] #1927: v1.3 COMM_SPAWN loop test fails after ~120 spawns
#1927: v1.3 COMM_SPAWN loop test fails after ~120 spawns
-----------------------+----------------------------------------------------
 Reporter:  jsquyres   |       Owner:  rhc
     Type:  defect     |      Status:  closed
 Priority:  critical   |   Milestone:  Open MPI 1.3.4
  Version:  1.3 branch |  Resolution:  fixed
 Keywords:             |
-----------------------+----------------------------------------------------
Changes (by rhc):
* status: new => closed
* resolution: => fixed
Comment:
This was due to a very tight loop on comm_spawn not giving enough time
for the prior proc to completely terminate (and thus free its file
descriptors) before the next proc was launched. Eventually, we built up
a backlog of terminations to process and ran out of fd's.
We introduced a check-and-delay: the code detects when we don't have
enough fd's to launch another proc, and then waits a second to see if
enough become free before aborting.
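To illustrate the check-and-delay idea, here is a minimal sketch in plain
POSIX C. It is not the code that went into trunk. The names fds_available,
wait_for_fds and MAX_PROBE_PIPES are hypothetical, and probing with pipe()
is only an assumed way of checking that enough descriptors are free.

```c
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical upper bound on how many pipes we probe for at once. */
#define MAX_PROBE_PIPES 16

/* Assumed check: try to open `needed_pipes` pipes, then close them again.
 * Returns true only if every pipe() call succeeded, i.e. enough file
 * descriptors were free at the moment of the probe. */
static bool fds_available(int needed_pipes)
{
    int fds[MAX_PROBE_PIPES][2];
    int opened;
    bool ok = true;

    if (needed_pipes < 0 || needed_pipes > MAX_PROBE_PIPES) {
        return false;
    }
    for (opened = 0; opened < needed_pipes; opened++) {
        if (pipe(fds[opened]) != 0) {
            ok = false;   /* typically EMFILE or ENFILE: out of fd's */
            break;
        }
    }
    /* Release whatever the probe managed to open. */
    for (int i = 0; i < opened; i++) {
        close(fds[i][0]);
        close(fds[i][1]);
    }
    return ok;
}

/* Check-and-delay: if there are not enough fd's, wait `delay_sec` seconds
 * for earlier terminations to be processed, then check once more.
 * Returns 0 if the launch can proceed, -1 if the caller should abort. */
static int wait_for_fds(int needed_pipes, unsigned delay_sec)
{
    if (fds_available(needed_pipes)) {
        return 0;
    }
    sleep(delay_sec);
    return fds_available(needed_pipes) ? 0 : -1;
}
```

A launcher would call something like wait_for_fds(3, 1) just before forking
the next proc and abort only if it returns -1. The probe is inherently racy
(descriptors can be taken between the check and the launch), so this only
shows the shape of the approach.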
Fixed in trunk - can see if we want to bring it to 1.3.
--
Ticket URL: <https://svn.open-mpi.org/trac/ompi/ticket/1927#comment:3>
Open MPI <http://www.open-mpi.org/>
--
Jeff Squyres
Cisco Systems