On Thu, Apr 30, 2009 at 12:55 PM, Brian W. Barrett <brbar...@open-mpi.org> wrote:

> On Thu, 30 Apr 2009, Edgar Gabriel wrote:
>
>> Brian W. Barrett wrote:
>>
>>> When we added the CM PML, we added a pml_max_contextid field to the PML
>>> structure, which is the max size cid the PML can handle (because the
>>> matching interfaces don't allow 32 bits to be used for the cid). At the
>>> same time, the max cid for OB1 was shrunk significantly, so that the
>>> header on a short message would be packed tightly with no alignment
>>> padding.
>>>
>>> At the time, we believed 32k simultaneous communicators was plenty, and
>>> that CIDs were reused (we checked, I'm pretty sure). It sounds like
>>> someone removed the CID reuse code, which seems rather bad to me.
>>
>> yes, we added the block algorithm. Not reusing a CID actually doesn't
>> strike me as that dramatic, and I am still not sure and convinced about
>> that :-) We do not have an empty array or something like that, it's just
>> a number.
>>
>> The reason for the block algorithm was that the performance of our
>> communicator creation code sucked, and the cid allocation was one portion
>> of that. We used to require *at least* 4 collective operations per
>> communicator creation at that time. We are now potentially down to 0,
>> among others thanks to the block algorithm.
>>
>> However, let me think about reusing entire blocks, it's probably doable,
>> it just requires a little more bookkeeping...
>>
>>> There have to be unused CIDs in Ralph's example - is there a way to
>>> fall back out of the block algorithm when it can't find a new CID and
>>> find one it can reuse? Other than setting the multi-threaded case back
>>> on, that is?
>>
>> remember that it's not the communicator id allocation that is failing at
>> this point, so the question is do we have to 'validate' a cid with the pml
>> before we declare it to be ok?
>
> well, that's only because the code's doing something it shouldn't.
> Have a look at comm_cid.c:185 - there's the check we added to the
> multi-threaded case (which was the only case when we added it). The cid
> generation should never generate a number larger than
> mca_pml.pml_max_contextid. I'm actually somewhat amazed this fails
> gracefully, as OB1 doesn't appear to check it got a valid cid in add_comm,
> which it should probably do.

Actually, as an FYI: it doesn't fail gracefully. It just hangs...ick.

> Looking at the differences between v1.2 and v1.3, the max_contextid code
> was already in v1.2 and OB1 was setting it to 32k. So the cid blocking code
> removed a rather critical feature and probably should be fixed or removed
> for v1.3. On Portals, I only get 8k cids, so not having reuse is a rather
> large problem.
>
> Brian
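
To make the two points in this thread concrete (reuse freed CIDs, and never hand out a CID above the PML's limit), here is a minimal sketch in C. It is not the Open MPI code: every name in it (SKETCH_MAX_CONTEXTID, cid_table_t, cid_alloc, cid_free, cid_is_valid) is hypothetical, the limit is a tiny stand-in for mca_pml.pml_max_contextid, and it only tracks CIDs locally. It leaves out the agreement between processes that real communicator creation needs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the PML limit discussed above; treated as the
 * largest cid the PML can encode. 8 keeps the example small. */
#define SKETCH_MAX_CONTEXTID 8u

/* One flag per cid; true means a live communicator currently owns it. */
typedef struct {
    bool in_use[SKETCH_MAX_CONTEXTID + 1];
} cid_table_t;

static void cid_table_init(cid_table_t *t)
{
    memset(t->in_use, 0, sizeof(t->in_use));
}

/* Return the lowest free cid, or -1 if every cid up to the limit is taken.
 * The caller has to treat -1 as a hard error instead of carrying on. */
static int32_t cid_alloc(cid_table_t *t)
{
    for (uint32_t cid = 0; cid <= SKETCH_MAX_CONTEXTID; ++cid) {
        if (!t->in_use[cid]) {
            t->in_use[cid] = true;
            return (int32_t)cid;
        }
    }
    return -1;
}

/* Give a cid back so that a later communicator can reuse it. */
static void cid_free(cid_table_t *t, int32_t cid)
{
    assert(cid >= 0 && (uint32_t)cid <= SKETCH_MAX_CONTEXTID);
    t->in_use[cid] = false;
}

/* The kind of check an add_comm-style hook could make: reject any cid
 * that the PML cannot encode instead of accepting it silently. */
static bool cid_is_valid(int64_t cid)
{
    return cid >= 0 && cid <= (int64_t)SKETCH_MAX_CONTEXTID;
}
```

With a table like this, allocation fails explicitly once the limit is reached, and a freed CID becomes the next one handed out. A block-based scheme could presumably do the same bookkeeping per block instead of per CID.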