On Thu, Apr 30, 2009 at 12:55 PM, Brian W. Barrett <brbar...@open-mpi.org> wrote:

> On Thu, 30 Apr 2009, Edgar Gabriel wrote:
>
>  Brian W. Barrett wrote:
>>
>>> When we added the CM PML, we added a pml_max_contextid field to the PML
>>> structure, which is the max size cid the PML can handle (because the
>>> matching interfaces don't allow 32 bits to be used for the cid).  At the same
>>> time, the max cid for OB1 was shrunk significantly, so that the header on a
>>> short message would be packed tightly with no alignment padding.
>>>
>>> At the time, we believed 32k simultaneous communicators was plenty, and
>>> that CIDs were reused (we checked, I'm pretty sure).  It sounds like someone
>>> removed the CID reuse code, which seems rather bad to me.
>>>
>>
>> yes, we added the block algorithm. Not reusing a CID actually doesn't bite
>> me as that dramatic, and I am still not sure and convinced about that:-) We
>> do not have an empty array or something like that, it's just a number.
>>
>> The reason for the block algorithm was that the performance of our
>> communicator creation code sucked, and the cid allocation was one portion of
>> that. We used to require *at least* 4 collective operations per communicator
>> creation at that time. We are now potentially down to 0, among others thanks
>> to the block algorithm.
>>
>> However, let me think about reusing entire blocks, it's probably doable,
>> it just requires a little more bookkeeping...
>>
>>  There have to be unused CIDs in Ralph's example - is there a way to
>>> fallback out of the block algorithm when it can't find a new CID and find
>>> one it can reuse?  Other than setting the multi-threaded case back on, that
>>> is?
>>>
>>
>> remember that it's not the communicator id allocation that is failing at
>> this point, so the question is do we have to 'validate' a cid with the pml
>> before we declare it to be ok?
>>
>
> well, that's only because the code's doing something it shouldn't.  Have a
> look at comm_cid.c:185 - there's the check we added to the multi-threaded
> case (which was the only case when we added it).  The cid generation should
> never generate a number larger than mca_pml.pml_max_contextid. I'm actually
> somewhat amazed this fails gracefully, as OB1 doesn't appear to check it got
> a valid cid in add_comm, which it should probably do.
>


Actually, as an FYI: it doesn't fail gracefully. It just hangs...ick.



>
> Looking at the differences between v1.2 and v1.3, the max_contextid code
> was already in v1.2 and OB1 was setting it to 32k.  So the cid blocking code
> removed a rather critical feature and probably should be fixed or removed
> for v1.3.  On Portals, I only get 8k cids, so not having reuse is a rather
> large problem.
>
>
> Brian
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
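To make the two requirements discussed above concrete (never hand out a cid larger than the PML's limit, and reuse cids once they are freed), here is a minimal sketch. It is not the Open MPI code: every name in it (cid_pool_t, cid_pool_alloc, SKETCH_MAX_CID, and so on) is hypothetical, the limit is a tiny stand-in for a value such as mca_pml.pml_max_contextid, and it only tracks cids locally, ignoring the agreement step needed across the processes of a communicator.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names throughout; this is NOT the Open MPI implementation. */

/* Compile-time cap for the sketch; a real limit (e.g. 32k or 8k) would come
 * from the PML at run time. Kept tiny here for illustration. */
#define SKETCH_MAX_CID 8

typedef struct {
    uint32_t max_cid;                          /* largest cid the PML accepts */
    unsigned char in_use[SKETCH_MAX_CID + 1];  /* 1 if the cid is taken */
} cid_pool_t;

/* Initialize the pool, clamping the requested limit to the sketch's cap. */
static void cid_pool_init(cid_pool_t *p, uint32_t max_cid)
{
    assert(p != NULL);
    memset(p->in_use, 0, sizeof(p->in_use));
    p->max_cid = (max_cid <= SKETCH_MAX_CID) ? max_cid : SKETCH_MAX_CID;
}

/* Return the lowest free cid, or -1 if every cid up to max_cid is taken.
 * Scanning from 0 is what makes freed cids get reused. */
static int cid_pool_alloc(cid_pool_t *p)
{
    for (uint32_t c = 0; c <= p->max_cid; ++c) {
        if (!p->in_use[c]) {
            p->in_use[c] = 1;
            return (int)c;
        }
    }
    return -1;  /* exhausted: report an error instead of exceeding the limit */
}

/* Release a cid so that a later allocation can hand it out again. */
static void cid_pool_free(cid_pool_t *p, int cid)
{
    if (cid >= 0 && (uint32_t)cid <= p->max_cid) {
        p->in_use[cid] = 0;
    }
}
```

The point of the sketch is only that exhaustion becomes an explicit error return the caller can act on, rather than a cid the PML cannot match on.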
