On Thu, 30 Apr 2009, Edgar Gabriel wrote:
Brian W. Barrett wrote:
When we added the CM PML, we added a pml_max_contextid field to the PML
structure, which is the largest cid the PML can handle (because the
matching interfaces don't allow 32 bits to be used for the cid). At the
same time, the max cid for OB1 was shrunk significantly, so that the header
on a short message would be packed tightly with no alignment padding.
At the time, we believed 32k simultaneous communicators was plenty, and
that CIDs were reused (we checked, I'm pretty sure). It sounds like
someone removed the CID reuse code, which seems rather bad to me.
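To illustrate the header-packing point above, here is a minimal sketch in C.
The two structs and their field names are hypothetical and are not OB1's
actual match header; they only show why a 16-bit cid can sit next to two
one-byte fields without padding, while a 32-bit cid typically cannot. The
sizes assume a common ABI in which 32-bit integers are 4-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical short-message header with a 16-bit cid. */
typedef struct {
    uint8_t  type;
    uint8_t  flags;
    uint16_t ctx;    /* communicator id, fills the gap after type/flags */
    int32_t  src;
    int32_t  tag;
} hdr_cid16_t;

/* Same header with a 32-bit cid. */
typedef struct {
    uint8_t  type;
    uint8_t  flags;
    uint32_t ctx;    /* must be 4-byte aligned on typical ABIs */
    int32_t  src;
    int32_t  tag;
} hdr_cid32_t;

/* Bytes of padding = struct size minus the sum of the field sizes. */
static size_t padding_cid16(void) {
    return sizeof(hdr_cid16_t) - (1 + 1 + 2 + 4 + 4);
}

static size_t padding_cid32(void) {
    return sizeof(hdr_cid32_t) - (1 + 1 + 4 + 4 + 4);
}

/* Sanity check on where the cid lands in each layout. */
static void check_layout(void) {
    assert(offsetof(hdr_cid16_t, ctx) == 2);
    assert(offsetof(hdr_cid32_t, ctx) == 4);
}
```

Under those assumptions, padding_cid16() reports no padding and
padding_cid32() reports the two bytes the compiler inserts before ctx.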
Yes, we added the block algorithm. Not reusing a CID actually doesn't strike
me as that dramatic, and I am still not convinced that it is :-) We do
not have an empty array or something like that; it's just a number.
The reason for the block algorithm was that the performance of our
communicator creation code sucked, and the cid allocation was one portion of
that. We used to require *at least* 4 collective operations per communicator
creation at that time. We are now potentially down to 0, thanks in part
to the block algorithm.
However, let me think about reusing entire blocks. It's probably doable; it
just requires a little more bookkeeping...
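One way the "little more bookkeeping" for reusing entire blocks might look,
as a minimal single-process sketch in C. Everything here is an assumption
made for illustration: the block size, the MAX_CID limit (a stand-in for
pml_max_contextid), and the names cid_block_t, block_acquire, cid_take and
cid_release are hypothetical and not taken from comm_cid.c. It also ignores
the agreement that would be needed across processes before a block is
handed out again.

```c
#include <assert.h>
#include <stdint.h>

enum {
    BLOCK_SIZE = 8,                          /* hypothetical block size */
    MAX_CID    = 63,                         /* stand-in for pml_max_contextid */
    NUM_BLOCKS = (MAX_CID + 1) / BLOCK_SIZE
};

typedef struct {
    uint8_t in_use;  /* block is currently handed out */
    uint8_t next;    /* next offset in the block not yet issued */
    uint8_t live;    /* cids from this block that are still held */
} cid_block_t;

static cid_block_t blocks[NUM_BLOCKS];

/* Claim the lowest free block; returns its index, or -1 if none is free. */
static int block_acquire(void) {
    for (int b = 0; b < NUM_BLOCKS; b++) {
        if (!blocks[b].in_use) {
            blocks[b].in_use = 1;
            blocks[b].next = 0;
            blocks[b].live = 0;
            return b;
        }
    }
    return -1;
}

/* Issue the next cid from block b, or -1 once the block is used up. */
static int32_t cid_take(int b) {
    cid_block_t *blk = &blocks[b];
    if (blk->next == BLOCK_SIZE) {
        return -1;
    }
    int32_t cid = (int32_t)(b * BLOCK_SIZE + blk->next);
    blk->next++;
    blk->live++;
    assert(cid <= MAX_CID);  /* never exceed what the PML can match on */
    return cid;
}

/* Release a cid; a fully issued block with no live cids becomes free. */
static void cid_release(int32_t cid) {
    assert(cid >= 0 && cid <= MAX_CID);
    cid_block_t *blk = &blocks[cid / BLOCK_SIZE];
    assert(blk->in_use && blk->live > 0);
    blk->live--;
    if (blk->live == 0 && blk->next == BLOCK_SIZE) {
        blk->in_use = 0;
    }
}
```

The extra state is one small counter pair per block, so the fast path of
taking a cid from an already-owned block still needs no communication.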
There have to be unused CIDs in Ralph's example - is there a way to
fall back out of the block algorithm when it can't find a new CID and find
one it can reuse? Other than setting the multi-threaded case back on, that
is?
Remember that it's not the communicator id allocation that is failing at this
point, so the question is: do we have to 'validate' a cid with the PML before
we declare it to be ok?
Well, that's only because the code's doing something it shouldn't. Have a
look at comm_cid.c:185 - there's the check we added to the multi-threaded
case (which was the only case when we added it). The cid generation
should never generate a number larger than mca_pml.pml_max_contextid.
I'm actually somewhat amazed this fails gracefully, as OB1 doesn't appear
to check that it got a valid cid in add_comm, which it should probably do.
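The two checks described above can be sketched as follows, in C. This is
only an illustration under stated assumptions: PML_MAX_CONTEXTID is a
hypothetical small stand-in for mca_pml.pml_max_contextid, and cid_alloc,
cid_free and pml_add_comm are made-up names, not the functions in
comm_cid.c or OB1. The allocator picks the lowest free cid (so released
cids are reused) and never returns a value above the limit; the PML-side
function rejects an out-of-range cid rather than accepting it silently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { PML_MAX_CONTEXTID = 15 };  /* hypothetical stand-in for the PML limit */

static bool cid_used[PML_MAX_CONTEXTID + 1];

/* Return the lowest free cid, or -1 when every cid up to the limit is taken. */
static int32_t cid_alloc(void) {
    for (int32_t cid = 0; cid <= PML_MAX_CONTEXTID; cid++) {
        if (!cid_used[cid]) {
            cid_used[cid] = true;
            return cid;
        }
    }
    return -1;
}

/* Mark a cid as free so a later cid_alloc() can hand it out again. */
static void cid_free(int32_t cid) {
    assert(cid >= 0 && cid <= PML_MAX_CONTEXTID);
    cid_used[cid] = false;
}

/* PML-side validation: 0 if the cid is usable, -1 if it is out of range. */
static int pml_add_comm(int32_t cid) {
    if (cid < 0 || cid > PML_MAX_CONTEXTID) {
        return -1;
    }
    return 0;
}
```

With both checks in place, running out of cids shows up as an allocation
failure at communicator creation instead of as a mismatched header later.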
Looking at the differences between v1.2 and v1.3, the max_contextid code
was already in v1.2 and OB1 was setting it to 32k. So the cid blocking
code removed a rather critical feature and probably should be fixed or
removed for v1.3. On Portals, I only get 8k cids, so not having reuse is
a rather large problem.
Brian