My comment here is that one will want different types of modex capabilities, depending on the type of system being targeted, so the instantiation of an interface needs to accommodate this, regardless of where the interface sits. When you have on the order of several hundred thousand endpoints, as large systems already do today, you likely don't need the information on all endpoints stored in a single location on a "node" that is being used for compute. So, as the BTL code moves, we should keep this in the back of our minds and consider what impact it may have (if any) on the code.

Rich

-----Original Message-----
From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Wednesday, June 13, 2012 9:09 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] Modex

I'm talking about how to implement it, not what level holds the interface. Besides, "pineapple" hit a roadblock during the call and is a totally separate discussion.

On Jun 13, 2012, at 7:03 AM, Richard Graham wrote:

> I would suggest exposing modex at the pineapple level, and not tie it to a
> particular instance of run-time instantiation. This decouples the
> instantiation from the details of the run-time, and also gives the freedom to
> provide different instantiations for different job scenarios.
>
> Rich
>
> -----Original Message-----
> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On
> Behalf Of Ralph Castain
> Sent: Wednesday, June 13, 2012 12:10 AM
> To: Open MPI Developers
> Subject: [OMPI devel] Modex
>
> George raised something during this morning's call that I wanted to follow-up
> on relating to improving our modex operation. I've been playing with an
> approach that sounded similar to what he suggested, and perhaps we could
> pursue it in accordance with moving the BTL's to OPAL.
>
> We currently block on exchange of contact information for the BTL's when we
> perform an all-to-all operation we term the "modex". At the end of that
> operation, each process constructs a list of information for all processes in
> the job, and each process contains the complete BTL contact info for every
> process in its modex database. This consumes a significant amount of memory,
> especially as we scale to ever larger applications. In addition, the modex
> operation itself is one of the largest time consumers during MPI_Init.
>
> An alternative approach is for the BTL's to "add proc" only on "first
> message" to or from that process - i.e., we would not construct a list of all
> procs during MPI_Init, but only add an entry for a process with which we
> communicate. The method would go like this:
>
> 1. during MPI_Init, each BTL posts its contact info to the local modex
>
> 2. the "modex" call in MPI_Init simply sends that data to the local daemon,
> which asynchronously executes an all-to-all collective with the other daemons
> in the job. At the end of that operation, each daemon holds a complete modex
> database for the job. Meantime, the application process continues to run.
>
> 3. we remove the "add_procs" call within MPI_Init, and perhaps can eliminate
> the ORTE barrier at the end of MPI_Init. The reason we had that barrier was
> to ensure that all procs were ready to communicate before we allowed anyone
> to send a message. However, with this method, that may no longer be required.
>
> 4. we modify the BTL's so they (a) can receive a message from an unknown
> source, adding that source to their local proc list, and (b) when sending a
> message to another process, obtain the required contact info from their local
> daemon if they don't already have it. Thus, we will see an increased latency
> on first message - but we will ONLY store info for processes with which we
> actually communicate (thus reducing the memory burden) and will wireup much
> faster than we do today.
>
> I'm not (yet) that familiar with the details of many of the BTLs, but my
> initial review of them didn't see any showstoppers for this approach. If
> people think this might work and be an interesting approach, I'd be happy to
> help implement a prototype to quantify its behavior.
>
> Ralph
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
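
For anyone who wants something concrete to poke at, here is a rough single-process sketch of the "add proc on first message" idea from step 4 of Ralph's proposal. It is not Open MPI code: `daemon_publish`, `daemon_lookup`, `lazy_send`, `lazy_recv`, the `peer_t` list, and the contact-string format are all hypothetical names made up for illustration, and the "daemon" is just an in-memory table standing in for the local daemon's complete modex database. It also assumes that a message from an unknown source triggers a daemon lookup, which the proposal leaves open.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define JOB_SIZE    8   /* hypothetical number of ranks in the job */
#define CONTACT_LEN 32  /* hypothetical max length of a contact string */

/* Stand-in for the local daemon's complete modex database (step 2). */
static char daemon_db[JOB_SIZE][CONTACT_LEN];
static int  daemon_queries = 0;   /* counts round trips to the daemon */

/* Steps 1-2: a process posts its contact info, which ends up at the daemon. */
static void daemon_publish(int rank, const char *contact)
{
    assert(rank >= 0 && rank < JOB_SIZE);
    snprintf(daemon_db[rank], CONTACT_LEN, "%s", contact);
}

/* Step 4(b): ask the local daemon for one peer's contact info. */
static const char *daemon_lookup(int rank)
{
    assert(rank >= 0 && rank < JOB_SIZE);
    daemon_queries++;
    return daemon_db[rank];
}

/* Local proc list: holds ONLY the peers this process has talked to. */
typedef struct peer {
    int          rank;
    char         contact[CONTACT_LEN];
    struct peer *next;
} peer_t;

static peer_t *peer_list  = NULL;
static int     peer_count = 0;

static peer_t *peer_find(int rank)
{
    for (peer_t *p = peer_list; p != NULL; p = p->next) {
        if (p->rank == rank) {
            return p;
        }
    }
    return NULL;
}

/* Add a peer on first contact; later calls reuse the cached entry. */
static peer_t *peer_add(int rank)
{
    peer_t *p = peer_find(rank);
    if (p != NULL) {
        return p;
    }
    p = malloc(sizeof(*p));
    if (p == NULL) {
        return NULL;
    }
    p->rank = rank;
    snprintf(p->contact, CONTACT_LEN, "%s", daemon_lookup(rank));
    p->next   = peer_list;
    peer_list = p;
    peer_count++;
    return p;
}

/* Step 4(b): the send path resolves the destination lazily. */
static const char *lazy_send(int dest)
{
    peer_t *p = peer_add(dest);
    return (p != NULL) ? p->contact : NULL;
}

/* Step 4(a): a message from an unknown source adds that source. */
static void lazy_recv(int src)
{
    (void)peer_add(src);
}
```

The first `lazy_send` to a given rank pays one daemon lookup (the extra first-message latency mentioned in step 4), and repeat sends hit the local list. The list only ever grows to the number of peers actually contacted, which is the memory saving the proposal is after.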