I'll file a ticket against it....oh joy!!! You all know how much I *love* tickets!
On Thu, Apr 30, 2009 at 1:11 PM, Ralph Castain <r...@open-mpi.org> wrote:

> On Thu, Apr 30, 2009 at 12:55 PM, Brian W. Barrett
> <brbar...@open-mpi.org> wrote:
>
>> On Thu, 30 Apr 2009, Edgar Gabriel wrote:
>>
>> Brian W. Barrett wrote:
>>>
>>>> When we added the CM PML, we added a pml_max_contextid field to the
>>>> PML structure, which is the max size cid the PML can handle (because
>>>> the matching interfaces don't allow 32 bits to be used for the cid).
>>>> At the same time, the max cid for OB1 was shrunk significantly, so
>>>> that the header on a short message would be packed tightly with no
>>>> alignment padding.
>>>>
>>>> At the time, we believed 32k simultaneous communicators was plenty,
>>>> and that CIDs were reused (we checked, I'm pretty sure). It sounds
>>>> like someone removed the CID reuse code, which seems rather bad to me.
>>>>
>>>
>>> yes, we added the block algorithm. Not reusing a CID actually doesn't
>>> strike me as that dramatic, and I am still not convinced about
>>> that :-) We do not have an empty array or something like that, it's
>>> just a number.
>>>
>>> The reason for the block algorithm was that the performance of our
>>> communicator creation code sucked, and the cid allocation was one
>>> portion of that. We used to require *at least* 4 collective operations
>>> per communicator creation at that time. We are now potentially down to
>>> 0, thanks to, among other things, the block algorithm.
>>>
>>> However, let me think about reusing entire blocks; it's probably
>>> doable, it just requires a little more bookkeeping...
>>>
>>>> There have to be unused CIDs in Ralph's example - is there a way to
>>>> fall back out of the block algorithm when it can't find a new CID and
>>>> find one it can reuse? Other than setting the multi-threaded case
>>>> back on, that is?
>>>>
>>>
>>> remember that it's not the communicator id allocation that is failing
>>> at this point, so the question is do we have to 'validate' a cid with
>>> the pml before we declare it to be ok?
>>>
>>
>> well, that's only because the code's doing something it shouldn't. Have
>> a look at comm_cid.c:185 - there's the check we added to the
>> multi-threaded case (which was the only case when we added it). The cid
>> generation should never generate a number larger than
>> mca_pml.pml_max_contextid. I'm actually somewhat amazed this fails
>> gracefully, as OB1 doesn't appear to check it got a valid cid in
>> add_comm, which it should probably do.
>>
>
> Actually, as an FYI: it doesn't fail gracefully. It just hangs...ick.
>
>>
>> Looking at the differences between v1.2 and v1.3, the max_contextid
>> code was already in v1.2 and OB1 was setting it to 32k. So the cid
>> blocking code removed a rather critical feature and probably should be
>> fixed or removed for v1.3. On Portals, I only get 8k cids, so not
>> having reuse is a rather large problem.
>>
>>
>> Brian
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>
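
For reference, here is a minimal sketch of the two properties argued for in the thread: freed CIDs get reused, and the allocator never hands out a number above the PML's limit. It is not the Open MPI code. MAX_CONTEXTID is a hypothetical stand-in for mca_pml.pml_max_contextid (shrunk to 7 so exhaustion is easy to see), and cid_in_use, cid_alloc and cid_free are made-up names. It is also single-process only; a real allocator would additionally have to make all ranks of the communicator agree on the same CID.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for mca_pml.pml_max_contextid. The thread mentions
 * 32k for OB1 and 8k on Portals; a tiny value is used here for illustration. */
#define MAX_CONTEXTID 7u

/* One flag per CID in the valid range 0..MAX_CONTEXTID (assumed layout). */
static bool cid_in_use[MAX_CONTEXTID + 1];

/* Return the lowest free CID, or -1 when every CID up to the limit is taken.
 * The loop bound is the "never exceed the max context id" check. */
static int cid_alloc(void)
{
    for (uint32_t cid = 0; cid <= MAX_CONTEXTID; cid++) {
        if (!cid_in_use[cid]) {
            cid_in_use[cid] = true;
            return (int)cid;
        }
    }
    return -1; /* exhausted: report an error rather than return a bad CID */
}

/* Release a CID so that a later cid_alloc() can hand it out again.
 * Out-of-range values are ignored. */
static void cid_free(int cid)
{
    if (cid >= 0 && (uint32_t)cid <= MAX_CONTEXTID) {
        cid_in_use[cid] = false;
    }
}
```

With reuse in place, a loop that creates and frees one communicator at a time can run indefinitely on a single CID. Without cid_free, the same loop would hit the -1 case after MAX_CONTEXTID + 1 iterations, which is roughly the situation described above, except that the caller here gets an error it can act on.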