????? I'm talking about how to implement it, not what level holds the interface. Besides, "pineapple" hit a roadblock during the call and is a totally separate discussion.
On Jun 13, 2012, at 7:03 AM, Richard Graham wrote:

> I would suggest exposing modex at the pineapple level, and not tying it to a particular instance of run-time instantiation. This decouples the instantiation from the details of the run-time, and also gives the freedom to provide different instantiations for different job scenarios.
>
> Rich
>
> -----Original Message-----
> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Wednesday, June 13, 2012 12:10 AM
> To: Open MPI Developers
> Subject: [OMPI devel] Modex
>
> George raised something during this morning's call that I wanted to follow up on, relating to improving our modex operation. I've been playing with an approach that sounded similar to what he suggested, and perhaps we could pursue it in conjunction with moving the BTLs to OPAL.
>
> We currently block on the exchange of contact information for the BTLs when we perform the all-to-all operation we term the "modex". At the end of that operation, each process constructs a list of information for all processes in the job, so each process holds the complete BTL contact info for every process in its modex database. This consumes a significant amount of memory, especially as we scale to ever larger applications. In addition, the modex operation itself is one of the largest time consumers during MPI_Init.
>
> An alternative approach is for the BTLs to "add proc" only on "first message" to or from that process - i.e., we would not construct a list of all procs during MPI_Init, but would only add an entry for a process with which we actually communicate. The method would go like this:
>
> 1. During MPI_Init, each BTL posts its contact info to the local modex.
>
> 2. The "modex" call in MPI_Init simply sends that data to the local daemon, which asynchronously executes an all-to-all collective with the other daemons in the job. At the end of that operation, each daemon holds a complete modex database for the job. Meanwhile, the application process continues to run.
>
> 3. We remove the "add_procs" call within MPI_Init, and can perhaps eliminate the ORTE barrier at the end of MPI_Init. The reason we had that barrier was to ensure that all procs were ready to communicate before we allowed anyone to send a message; with this method, that may no longer be required.
>
> 4. We modify the BTLs so they (a) can receive a message from an unknown source, adding that source to their local proc list, and (b) when sending a message to another process, obtain the required contact info from their local daemon if they don't already have it. Thus, we will see increased latency on the first message - but we will ONLY store info for processes with which we actually communicate (reducing the memory burden), and wireup will be much faster than it is today.
>
> I'm not (yet) that familiar with the details of many of the BTLs, but my initial review of them didn't turn up any showstoppers for this approach. If people think this might work and would be an interesting approach, I'd be happy to help implement a prototype to quantify its behavior.
>
> Ralph
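To make step 2 of the quoted proposal concrete, here is a minimal stand-alone C sketch of the hand-off: the process posts its contact blob, the "modex" call returns immediately, and a background thread stands in for the daemon running the all-to-all collective while the application keeps going. Everything here (modex_db, daemon_allgather, the fake endpoint strings) is invented for illustration; it is not Open MPI code or API.

/* Toy sketch of step 2: the modex call hands the contact blob to the
 * local "daemon" and returns at once; the daemon completes the collective
 * in the background.  All names are hypothetical, not Open MPI code. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NPROCS 4                         /* pretend job size               */

static char modex_db[NPROCS][64];        /* daemon-side modex database     */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the asynchronous all-to-all among the daemons. */
static void *daemon_allgather(void *arg)
{
    (void)arg;
    sleep(1);                            /* pretend network/collective time */
    pthread_mutex_lock(&lock);
    for (int r = 0; r < NPROCS; r++)
        snprintf(modex_db[r], sizeof(modex_db[r]), "btl-tcp://10.0.0.%d:1024", r + 1);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t daemon;

    /* Step 1: post our own contact info to the local modex. */
    pthread_mutex_lock(&lock);
    snprintf(modex_db[0], sizeof(modex_db[0]), "btl-tcp://10.0.0.1:1024");
    pthread_mutex_unlock(&lock);

    /* Step 2: kick the daemon and return immediately - no process barrier. */
    pthread_create(&daemon, NULL, daemon_allgather, NULL);
    printf("modex call returned; daemons still gathering in the background\n");

    /* ... application work would continue here ... */

    pthread_join(daemon, NULL);
    printf("daemon-side modex complete: %s ... %s\n", modex_db[0], modex_db[NPROCS - 1]);
    return 0;
}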
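Step 4(b) can be sketched the same way: on the send path, contact info for a peer is pulled from the daemon's database only on first use and cached locally, so a process ends up storing entries only for the peers it actually talks to. Again, all names here (peer_entry_t, daemon_lookup, btl_send) are made up for illustration, not BTL interfaces.

/* Toy sketch of step 4(b): lazy "add proc" on first message.  The first
 * send to a peer pays a daemon round-trip; later sends hit the cache. */
#include <stdio.h>

#define JOB_SIZE 1024

typedef struct {
    int  known;            /* have we resolved this peer yet?   */
    char endpoint[64];     /* BTL contact string for the peer   */
} peer_entry_t;

static peer_entry_t peer_cache[JOB_SIZE];   /* sparse: filled on demand */
static int cached_peers = 0;

/* Stand-in for a query to the local daemon's modex database. */
static void daemon_lookup(int rank, char *buf, size_t len)
{
    snprintf(buf, len, "btl-tcp://10.0.0.%d:1024", rank);
}

static void btl_send(int dest, const char *msg)
{
    if (!peer_cache[dest].known) {          /* first message: extra latency */
        daemon_lookup(dest, peer_cache[dest].endpoint,
                      sizeof(peer_cache[dest].endpoint));
        peer_cache[dest].known = 1;
        cached_peers++;
    }
    printf("send to rank %d via %s: %s\n", dest, peer_cache[dest].endpoint, msg);
}

int main(void)
{
    /* Talk to only two peers out of 1024: only two entries get stored,
     * instead of a full modex list for the whole job. */
    btl_send(7, "hello");
    btl_send(7, "again");          /* cache hit, no daemon round-trip */
    btl_send(512, "hello");
    printf("cached entries: %d of %d ranks\n", cached_peers, JOB_SIZE);
    return 0;
}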