I'll file a ticket against it....oh joy!!! You all know how much I *love*
tickets!


On Thu, Apr 30, 2009 at 1:11 PM, Ralph Castain <r...@open-mpi.org> wrote:

>
> On Thu, Apr 30, 2009 at 12:55 PM, Brian W. Barrett
> <brbar...@open-mpi.org> wrote:
>
>> On Thu, 30 Apr 2009, Edgar Gabriel wrote:
>>
>>> Brian W. Barrett wrote:
>>>
>>>> When we added the CM PML, we added a pml_max_contextid field to the PML
>>>> structure, which is the largest cid the PML can handle (because the
>>>> matching interfaces don't allow 32 bits to be used for the cid).  At the
>>>> same time, the max cid for OB1 was shrunk significantly, so that the
>>>> header on a short message would be packed tightly with no alignment
>>>> padding.
>>>>
>>>> At the time, we believed 32k simultaneous communicators was plenty, and
>>>> that CIDs were reused (we checked, I'm pretty sure).  It sounds like
>>>> someone removed the CID reuse code, which seems rather bad to me.
>>>>
>>>
>>> yes, we added the block algorithm. Not reusing a CID actually doesn't
>>> strike me as that dramatic, and I am still not convinced about
>>> that :-) We do not have an empty array or something like that, it's just a
>>> number.
>>>
>>> The reason for the block algorithm was that the performance of our
>>> communicator creation code sucked, and the cid allocation was one portion of
>>> that. We used to require *at least* 4 collective operations per communicator
>>> creation at that time. We are now potentially down to 0, among others thanks
>>> to the block algorithm.
>>>
>>> However, let me think about reusing entire blocks; it's probably doable,
>>> it just requires a little more bookkeeping...
>>>
>>>> There have to be unused CIDs in Ralph's example - is there a way to
>>>> fall back out of the block algorithm when it can't find a new CID and find
>>>> one it can reuse?  Other than setting the multi-threaded case back on, that
>>>> is?
>>>>
>>>
>>> remember that it's not the communicator id allocation that is failing at
>>> this point, so the question is: do we have to 'validate' a cid with the pml
>>> before we declare it to be ok?
>>>
>>
>> well, that's only because the code's doing something it shouldn't.  Have a
>> look at comm_cid.c:185 - there's the check we added to the multi-threaded
>> case (which was the only case when we added it).  The cid generation should
>> never generate a number larger than mca_pml.pml_max_contextid. I'm actually
>> somewhat amazed this fails gracefully, as OB1 doesn't appear to check it got
>> a valid cid in add_comm, which it should probably do.
>>
>
>
> Actually, as an FYI: it doesn't fail gracefully. It just hangs...ick.
>
>
>
>>
>> Looking at the differences between v1.2 and v1.3, the max_contextid code
>> was already in v1.2 and OB1 was setting it to 32k.  So the cid blocking code
>> removed a rather critical feature and probably should be fixed or removed
>> for v1.3.  On Portals, I only get 8k cids, so not having reuse is a rather
>> large problem.
>>
>>
>> Brian
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>
>
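
For readers following the thread, the "reuse entire blocks" idea Edgar mentions could be sketched roughly as below. This is only an illustration of the extra bookkeeping, not the Open MPI implementation: block_pool_t, pool_take_block, pool_release_cid, BLOCK_SIZE and MAX_BLOCKS are all hypothetical names and values, and the sketch assumes every cid in a block is released exactly once and ignores the cross-process agreement a real allocator needs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4u   /* hypothetical number of cids per block */
#define MAX_BLOCKS 3u   /* hypothetical: (max contextid + 1) / BLOCK_SIZE */

typedef struct {
    uint32_t next_fresh;             /* next block index never handed out */
    uint32_t free_stack[MAX_BLOCKS]; /* blocks whose cids were all released */
    uint32_t free_count;
    uint32_t live[MAX_BLOCKS];       /* cids still outstanding per block */
} block_pool_t;

static void pool_init(block_pool_t *p)
{
    memset(p, 0, sizeof *p);
}

/* Hand out a whole block, preferring a recycled one.
 * Returns the first cid of the block, or -1 if the cid space is exhausted. */
static int32_t pool_take_block(block_pool_t *p)
{
    uint32_t b;
    if (p->free_count > 0) {
        b = p->free_stack[--p->free_count];
    } else if (p->next_fresh < MAX_BLOCKS) {
        b = p->next_fresh++;
    } else {
        return -1;
    }
    p->live[b] = BLOCK_SIZE;
    return (int32_t)(b * BLOCK_SIZE);
}

/* Release one cid; once the whole block is idle it becomes reusable. */
static void pool_release_cid(block_pool_t *p, uint32_t cid)
{
    uint32_t b = cid / BLOCK_SIZE;
    assert(b < MAX_BLOCKS && p->live[b] > 0);
    if (--p->live[b] == 0) {
        p->free_stack[p->free_count++] = b;
    }
}
```

The per-block counter is the "little more bookkeeping": a block only returns to the pool when its last cid is released, so no collective is needed to recycle it locally.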

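Brian's two points, that cid generation should never exceed mca_pml.pml_max_contextid and that freed cids should be reused, can be illustrated with a small single-process sketch. It is not the code in comm_cid.c: cid_table_t, cid_alloc, cid_release, cid_is_valid and DEMO_MAX_CONTEXTID are hypothetical, and a real allocator must also agree on the cid across all processes in the communicator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for mca_pml.pml_max_contextid. The thread mentions
 * 32k for OB1 and 8k on Portals; a tiny value keeps the demo short. */
#define DEMO_MAX_CONTEXTID 7u

typedef struct {
    uint32_t max_cid;                        /* largest cid the PML accepts */
    bool     in_use[DEMO_MAX_CONTEXTID + 1]; /* one flag per possible cid */
} cid_table_t;

static void cid_table_init(cid_table_t *t, uint32_t max_cid)
{
    assert(max_cid <= DEMO_MAX_CONTEXTID);
    t->max_cid = max_cid;
    memset(t->in_use, 0, sizeof t->in_use);
}

/* Return the lowest free cid, never one above max_cid; -1 if none is free. */
static int64_t cid_alloc(cid_table_t *t)
{
    for (uint32_t cid = 0; cid <= t->max_cid; ++cid) {
        if (!t->in_use[cid]) {
            t->in_use[cid] = true;
            return (int64_t)cid;
        }
    }
    return -1;
}

/* Mark a cid free so a later cid_alloc can hand it out again. */
static void cid_release(cid_table_t *t, uint32_t cid)
{
    assert(cid <= t->max_cid && t->in_use[cid]);
    t->in_use[cid] = false;
}

/* The kind of check a PML could make before accepting a communicator. */
static bool cid_is_valid(const cid_table_t *t, int64_t cid)
{
    return cid >= 0 && (uint64_t)cid <= t->max_cid;
}
```

With a check like cid_is_valid at the point where the PML accepts a communicator, running out of cids would surface as an error return instead of the hang Ralph reports.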