I do not like the fact that add_procs is called with every proc in MPI_COMM_WORLD. That needs to change, so I will not rely on the number of procs being added being the same as the world or universe size.
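To make that concrete, here is a rough, self-contained sketch (hypothetical names, not the actual ugni code) of what I mean: the per-peer mailboxes have to be sized from a job-size value supplied by the runtime, never from the nprocs argument of add_procs, because that count may cover only a subset of the job.

    /* Rough sketch with hypothetical names -- not the actual ugni code. */
    #include <stddef.h>
    #include <stdlib.h>

    struct endpoint {
        void  *mbox;
        size_t mbox_len;
    };

    /* hypothetical: filled in from the RTE during btl_init */
    static size_t job_size;

    /* smaller per-peer mailboxes as the job grows, to bound total memory */
    static size_t mbox_len_for_job(size_t njobprocs)
    {
        if (njobprocs <= 1024)  return 8192;
        if (njobprocs <= 16384) return 2048;
        return 512;
    }

    static int sketch_add_procs(struct endpoint **peers, size_t nprocs)
    {
        for (size_t i = 0; i < nprocs; ++i) {
            peers[i] = calloc(1, sizeof(**peers));
            if (NULL == peers[i]) {
                return -1;
            }
            /* sized from job_size, NOT from the nprocs passed in */
            peers[i]->mbox_len = mbox_len_for_job(job_size);
            peers[i]->mbox = malloc(peers[i]->mbox_len);
            if (NULL == peers[i]->mbox) {
                return -1;
            }
        }
        return 0;
    }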
-Nathan

On Thu, Jul 31, 2014 at 09:22:00AM -0600, George Bosilca wrote:
> I definitely think you misunderstood the scope of this RFC. The
> information that is so important to you for configuring the mailbox size
> is available to you when you need it. This information is made available
> by the PML through the call to add_procs, which comes with all the procs
> in MPI_COMM_WORLD. So, ugni doesn't need anything more than what is
> available today. [This is, of course, under the assumption that someone
> cleans the BTL and removes the usage of MPI_COMM_WORLD.]
>
> The real scope of this RFC is to move this information earlier, in order
> to allow the BTLs to have access to some possible number of processes
> between the call to btl_open and the call to btl_add_procs (in other
> words, during btl_init).
>
>   George.
>
> PS: here is the patch that fixes all issues in ugni.
>
> On Jul 31, 2014, at 10:58, Nathan Hjelm <hje...@lanl.gov> wrote:
> >
> > +2^10000000
> >
> > This information is absolutely necessary at this point. If someone has
> > a better solution they can provide it as an alternative RFC. Until then
> > this is how it should be done... Otherwise we lose uGNI support on the
> > trunk, because we ARE NOT going to remove the mailbox size optimization.
> >
> > -Nathan
> >
> > On Wed, Jul 30, 2014 at 10:00:18PM +0000, Jeff Squyres (jsquyres) wrote:
> >> WHAT: Should we make the job size (i.e., initial number of procs)
> >> available in OPAL?
> >>
> >> WHY: At least 2 BTLs are using this info (*more below)
> >>
> >> WHERE: usnic and ugni
> >>
> >> TIMEOUT: there have already been some inflammatory emails about this;
> >> let's discuss next Tuesday on the teleconf: Tue, 5 Aug 2014
> >>
> >> MORE DETAIL:
> >>
> >> This is an open question. We *have* the information at the time that
> >> the BTLs are initialized: do we allow that information to go down to
> >> OPAL?
> >>
> >> Ralph added this info down in OPAL in r32355, but George reverted it
> >> in r32361.
> >>
> >> Points for: YES, WE SHOULD
> >> +++ 2 BTLs were using it (usnic, ugni)
> >> +++ Other RTE job-related info is already in OPAL (num local ranks,
> >> local rank)
> >>
> >> Points for: NO, WE SHOULD NOT
> >> --- What exactly is this number (e.g., num currently-connected
> >> procs?), and when is it updated?
> >> --- We need to precisely delineate what belongs in OPAL vs. above-OPAL
> >>
> >> FWIW: here's how ompi_process_info.num_procs was used before the BTL
> >> move down to OPAL:
> >>
> >> - usnic: for a minor latency optimization / sizing of a shared receive
> >>   buffer queue length, and for the initial size of a peer lookup hash
> >> - ugni: to determine the size of the per-peer buffers used for
> >>   send/recv communication
> >>
> >> --
> >> Jeff Squyres
> >> jsquy...@cisco.com
> >> For corporate legal information go to:
> >> http://www.cisco.com/web/about/doing_business/legal/cri/
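
To spell out what Jeff's two bullets amount to: both are sizing heuristics driven by the job size. Roughly something like the following on the usnic side (hypothetical names and numbers, not the real usnic code):

    /* Illustrative only: how a job-size value might drive the shared
     * receive queue length and the initial peer-lookup hash size. */
    #include <stddef.h>

    static size_t recv_queue_len(size_t num_procs)
    {
        /* a few slots per peer, clamped so small jobs still get a
         * usable queue and huge jobs do not explode memory */
        size_t len = 4 * num_procs;
        if (len < 64)    len = 64;
        if (len > 65536) len = 65536;
        return len;
    }

    static size_t peer_hash_initial_size(size_t num_procs)
    {
        /* start the hash near the expected peer count to avoid rehashing */
        return num_procs ? num_procs : 16;
    }

The ugni case is the same idea, except the value feeds the per-peer mailbox size, which is why losing access to the job size at init time hurts.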