Just to throw out more info on this, the test code runs fine on
previous versions of OMPI. It only hangs on the 1.3 line when the cid
reaches 65536.
-david
--
David Gunter
HPC-3: Parallel Tools Team
Los Alamos National Laboratory
On Apr 30, 2009, at 12:28 PM, Edgar Gabriel wrote:
cids are in fact not recycled in the block algorithm. The problem
is that comm_free is not collective, so you cannot make any
assumptions about whether other procs have also released that
communicator.
But nevertheless, a cid in the communicator structure is a uint32_t,
so it should not hit the 16k limit there yet. This is not new, so if
there is a discrepancy between what the comm structure assumes that
a cid is and what the pml assumes, then this was in the code since
the very first days of Open MPI...
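
To make the point about non-recycled cids concrete, here is a toy Python sketch. It is not Open MPI code: the class, the function, and the two ceilings are hypothetical and chosen only for illustration. It assumes an allocator that hands out increasing ids, treats free as a no-op, and raises an error at a ceiling (the real library was reported to hang rather than raise).

```python
class NonRecyclingCidAllocator:
    """Toy model: ids only ever increase; a freed id is never reused."""

    def __init__(self, max_cid):
        self.max_cid = max_cid   # hypothetical ceiling on the id space
        self.next_cid = 0

    def allocate(self):
        if self.next_cid >= self.max_cid:
            raise RuntimeError("context id space exhausted")
        cid = self.next_cid
        self.next_cid += 1
        return cid

    def free(self, cid):
        # Deliberately a no-op: this models "cids are not recycled".
        pass


def iterations_until_exhausted(max_cid, limit):
    """Run an allocate/free loop (standing in for split/free).

    Returns the iteration index at which allocation fails,
    or None if the loop completes `limit` iterations.
    """
    alloc = NonRecyclingCidAllocator(max_cid)
    for i in range(limit):
        try:
            cid = alloc.allocate()
        except RuntimeError:
            return i
        alloc.free(cid)
    return None


if __name__ == "__main__":
    # A 16-bit ceiling is exhausted inside a 100000-iteration loop;
    # a 32-bit ceiling is not.
    print(iterations_until_exhausted(2**16, 100000))
    print(iterations_until_exhausted(2**32, 100000))
```

Under these assumptions, a loop that only ever holds one communicator at a time still exhausts a 16-bit id space, while a 32-bit space is nowhere near full after the same number of iterations. That is consistent with a narrower limit somewhere other than the uint32_t field being the one that is hit.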
Thanks
Edgar
Brian W. Barrett wrote:
On Thu, 30 Apr 2009, Ralph Castain wrote:
We seem to have hit a problem here - it looks like we are seeing a
built-in limit on the number of communicators one can create in a
program. The program basically does a loop, calling MPI_Comm_split
each time through the loop to create a sub-communicator, does a
reduce operation on the members of the sub-communicator, and then
calls MPI_Comm_free to release it (this is a minimized reproducer
for the real code). After 64k times through the loop, the program
fails.
This looks remarkably like a 16-bit index that hits a max value and
then blocks.
I have looked at the communicator code, but I don't immediately see
such a field. Is anyone aware of some other place where we would
have a limit that would cause this problem?
There's a maximum of 32768 communicator ids when using OB1 (each
PML can set the max contextid, although the communicator code is
the part that actually assigns a cid). Assuming that comm_free is
actually properly called, there should be plenty of cids available
for that pattern. However, I'm not sure I understand the block
algorithm someone added to cid allocation - I'd have to guess that
there's something funny with that routine and cids aren't being
recycled properly.
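
As a rough illustration of why recycling would leave "plenty of cids" for that pattern, here is a toy Python sketch. It is not the Open MPI allocator: the names are hypothetical, and it ignores the distributed agreement needed across processes. It assumes a freed id goes back into a pool and the lowest free id is handed out first.

```python
import heapq


class RecyclingCidAllocator:
    """Toy model: freed ids return to a pool; lowest free id is reused first."""

    def __init__(self, max_cid):
        self.max_cid = max_cid   # e.g. the 32768 ceiling mentioned for OB1
        self.next_unused = 0     # first id that has never been handed out
        self.freed = []          # min-heap of ids released by free()

    def allocate(self):
        if self.freed:
            return heapq.heappop(self.freed)
        if self.next_unused >= self.max_cid:
            raise RuntimeError("context id space exhausted")
        cid = self.next_unused
        self.next_unused += 1
        return cid

    def free(self, cid):
        heapq.heappush(self.freed, cid)


def distinct_cids_used(max_cid, iterations):
    """Run an allocate/free loop and count how many different ids it touched."""
    alloc = RecyclingCidAllocator(max_cid)
    seen = set()
    for _ in range(iterations):
        cid = alloc.allocate()
        seen.add(cid)
        alloc.free(cid)
    return len(seen)


if __name__ == "__main__":
    # With recycling, the loop reuses a single id no matter how long it runs.
    print(distinct_cids_used(32768, 100000))
```

In this model the split/free loop never consumes more than one id, so a 32768 ceiling is not a constraint. A failure after roughly 64k iterations therefore points at ids not being returned to the pool, rather than at the size of the pool.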
Brian
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab http://pstl.cs.uh.edu
Department of Computer Science University of Houston
Philip G. Hoffman Hall, Room 524 Houston, TX-77204, USA
Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335