Well, since I'm the "guy who wrote the code", I'll offer my $0.0001 (my dollars went the way of the market...).

Jeff's memory about why we went to 16 bits isn't quite accurate. The fact is that we always had 32-bit jobids, and still do. Up to about a year ago, all of that space was available for comm_spawn. What changed at that time was a decision to make every mpirun independently create a unique identifier, so that two mpiruns could connect/accept without requiring a persistent orted to coordinate their names at launch. This was the subject of a lengthy discussion involving multiple institutions that spanned several months last year.

As a result of that discussion, we claimed 16 bits of the 32 for the mpirun identifier. We investigated using only 8 bits (thus leaving 24 bits for comm_spawn'd jobs), but the probability of duplicate identifiers was too high.
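For anyone trying to picture the split, the jobid now looks roughly like this - a sketch only, with invented names rather than the actual ORTE macros:

    #include <stdint.h>

    /* Hypothetical illustration only -- not the real ORTE code. */
    #define LOCAL_JOB_BITS 16

    /* Pack a 16-bit mpirun identifier and a 16-bit local job number
       (0 = the original job, 1..65535 = comm_spawn'd jobs) into one
       32-bit jobid. */
    static inline uint32_t make_jobid(uint16_t mpirun_id, uint16_t local_job)
    {
        return ((uint32_t)mpirun_id << LOCAL_JOB_BITS) | local_job;
    }

    static inline uint16_t jobid_mpirun(uint32_t jobid)
    {
        return (uint16_t)(jobid >> LOCAL_JOB_BITS);
    }

    static inline uint16_t jobid_local(uint32_t jobid)
    {
        return (uint16_t)(jobid & 0xffff);
    }

With that layout, each mpirun keeps its 32-bit jobid but has only 64k local job numbers to hand out to comm_spawn'd jobs.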

Likewise, we looked at increasing the total size of the jobid to 64 bits, but that seemed ridiculously high - and (due to the way memory gets allocated for structures) it meant we had to also increase the vpid size to 64 bits. Thus, the move to 64-bit ids would have doubled the size of the name struct from 64 bits to 128 bits - and now you do start to see a non-zero impact on memory footprint for extreme-scale clusters involving several hundred thousand processes.
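To make the footprint arithmetic concrete - this is purely illustrative, the real orte_process_name_t has its own layout - doubling both fields doubles every stored name:

    #include <stdint.h>
    #include <stdio.h>

    /* Simplified name structs -- only to show the size arithmetic. */
    struct name32 { uint32_t jobid; uint32_t vpid; };   /*  8 bytes */
    struct name64 { uint64_t jobid; uint64_t vpid; };   /* 16 bytes */

    int main(void)
    {
        /* Suppose a process caches a name for each of 300,000 peers
           (a made-up number, just to show the scaling). */
        size_t nprocs = 300000;
        printf("32-bit ids: %zu MB per process\n",
               nprocs * sizeof(struct name32) / (1024 * 1024));
        printf("64-bit ids: %zu MB per process\n",
               nprocs * sizeof(struct name64) / (1024 * 1024));
        return 0;
    }

Per process the difference is only a few MB, but multiplied across several hundred thousand processes (plus mpirun and the orteds) it adds up quickly.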

So we accepted the 16-bit limit on comm_spawn and moved on... until someone now wants to do 100k comm_spawns.

I don't believe Jeff's proposed solution will satisfy that user's request, as he was dynamically constructing a very large server farm (so the procs are not short-lived). However, IMHO this was a poorly designed application - it didn't need to be done that way, and could easily (and more efficiently) have been built to fit within the 64k constraint.

So, my suggestion is to stick with the 64k limit, perhaps add this reuse proposal, and just document that constraint.

Ralph


On Oct 27, 2008, at 4:14 PM, Jeff Squyres wrote:

On Oct 27, 2008, at 5:52 PM, Andreas Schäfer wrote:

I don't know any implementation details, but is making a 16-bit counter a 32-bit counter really so much harder than this fancy (overengineered? ;-) ) table construction? The way I see it, this table might become a real mess if there are multiple MPI_Comm_spawns issued simultaneously in different communicators. (Would that be legal MPI?)

FWIW, all the spawns are proxied back to the HNP (i.e., mpirun), so there would only be a need for one table. I don't think that a simple table lookup is overengineered. :-) It's a simple solution to the "need a global ID" issue. By capping the size of the table, you avoid it growing without bound as MPI jobs are run on more and more cores - particularly for the 99% of apps out there that never call comm_spawn.
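Something like the following - a rough sketch with invented names, assuming the table lives at the HNP, hands out 16-bit local job ids, and (per the reuse proposal) returns an id to the pool once the spawned job terminates:

    #include <stdbool.h>

    #define MAX_LOCAL_JOBS 65536   /* 16-bit local job space */

    /* One slot per possible local job id, kept only at the HNP. */
    static bool job_in_use[MAX_LOCAL_JOBS];

    /* Hand out the next free 16-bit local job id, or -1 if all 64k ids
       are simultaneously live.  Reuse is what keeps a long-running app
       under the 64k ceiling: ids come back via release_local_jobid()
       when the spawned job terminates. */
    int alloc_local_jobid(void)
    {
        for (int id = 1; id < MAX_LOCAL_JOBS; id++) {  /* 0 = mpirun's own job */
            if (!job_in_use[id]) {
                job_in_use[id] = true;
                return id;
            }
        }
        return -1;   /* 64k concurrent spawned jobs -- out of ids */
    }

    void release_local_jobid(int id)
    {
        job_in_use[id] = false;
    }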

We actually went down to 16 bits recently (it used to be 32) as one item toward reducing the memory footprint of MPI processes (and mpirun and the orteds), particularly when running very large scale jobs. So while increasing this one value back to 32 bits may not be tragic, it would be nice to keep it at 16 bits (IMHO).

Regardless of how big the value is (8, 16, 32, 64...), you still need a unique value for comm_spawn. Therefore, some kind of duplicate-detection mechanism is needed. If you increase the size of the container, you decrease the probability of collision, but it can still happen. And since machines keep growing in size and number of cores, widening the value may just postpone the collision until someone runs on a big enough machine. Regardless, I'd prefer to fix it the Right Way rather than rely on probability to prevent a problem. In my experience, "that could *never* happen!" is just an invitation for disaster, even if it's 1-5 years in the future. (Didn't someone say we'd never need more than 640k of RAM? :-) )
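To put rough numbers on the probability argument (my own back-of-the-envelope, using the standard birthday approximation, not anything from the code): if each mpirun picked its identifier uniformly at random from a k-bit space, the chance that any two of n concurrent mpiruns collide is about 1 - exp(-n(n-1)/2^(k+1)):

    #include <math.h>
    #include <stdio.h>

    /* Birthday approximation: P(collision) ~= 1 - exp(-n*(n-1)/2^(bits+1)) */
    static double collision_prob(double n, int bits)
    {
        return 1.0 - exp(-n * (n - 1.0) / exp2(bits + 1));
    }

    int main(void)
    {
        printf("100 mpiruns,  8-bit id: %.3f\n", collision_prob(100, 8));
        printf("100 mpiruns, 16-bit id: %.5f\n", collision_prob(100, 16));
        printf("100 mpiruns, 32-bit id: %.8f\n", collision_prob(100, 32));
        return 0;
    }

Even at 32 bits the probability is nonzero - which is exactly the point: a wider field only makes the collision rarer, it doesn't make it impossible.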

Just my IMHO, of course... (and I'm not the guy writing the code!) :-)

--
Jeff Squyres
Cisco Systems



