On Jun 13, 2012, at 06:09, Ralph Castain wrote:

> George raised something during this morning's call that I wanted to follow up
> on, relating to improving our modex operation. I've been playing with an
> approach that sounded similar to what he suggested, and perhaps we could
> pursue it in conjunction with moving the BTLs to OPAL.
>
> We currently block on the exchange of contact information for the BTLs when
> we perform an all-to-all operation we term the "modex". At the end of that
> operation, each process constructs a list of information for all processes in
> the job, and each process contains the complete BTL contact info for every
> process in its modex database. This consumes a significant amount of memory,
> especially as we scale to ever larger applications. In addition, the modex
> operation itself is one of the largest time consumers during MPI_Init.
>
> An alternative approach is for the BTLs to "add proc" only on "first
> message" to or from that process -
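Concretely, "add proc on first message" would mean that the endpoint for a peer
is created the first time we send to it or hear from it, rather than for every
process up front. A minimal, self-contained sketch of that control flow follows;
every name in it is hypothetical (none of these are existing BTL or ORTE symbols):

/* Sketch only: none of the names below are existing Open MPI symbols; they
 * are placeholders showing where "add proc on first message" would hook in. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

typedef struct { uint32_t jobid; uint32_t vpid; } proc_name_t;

typedef struct endpoint {
    proc_name_t      peer;
    char             addr[64];     /* opaque BTL contact info for the peer */
    struct endpoint *next;
} endpoint_t;

static endpoint_t *known_endpoints = NULL;   /* peers already wired up */

/* Stand-in for a (possibly blocking) query to the local daemon's modex DB. */
static int fetch_contact_info_from_daemon(proc_name_t peer, char addr[64])
{
    snprintf(addr, 64, "fake-contact-info-%u.%u",
             (unsigned)peer.jobid, (unsigned)peer.vpid);
    return 0;   /* pretend the daemon always knows the peer */
}

/* Called on first send to, or first receive from, a peer we do not know. */
static endpoint_t *get_or_create_endpoint(proc_name_t peer)
{
    for (endpoint_t *ep = known_endpoints; ep != NULL; ep = ep->next) {
        if (ep->peer.jobid == peer.jobid && ep->peer.vpid == peer.vpid) {
            return ep;                        /* fast path: already known */
        }
    }
    /* slow path, first message only: ask the daemon, then build the endpoint */
    char addr[64];
    if (fetch_contact_info_from_daemon(peer, addr) != 0) {
        return NULL;                          /* even the daemon has no info */
    }
    endpoint_t *ep = malloc(sizeof(*ep));
    if (NULL == ep) {
        return NULL;
    }
    ep->peer = peer;
    memcpy(ep->addr, addr, sizeof(addr));
    ep->next = known_endpoints;
    known_endpoints = ep;
    return ep;
}

int main(void)
{
    proc_name_t peer = { 1, 42 };
    endpoint_t *ep1 = get_or_create_endpoint(peer);  /* pays the lookup cost */
    endpoint_t *ep2 = get_or_create_endpoint(peer);  /* hits the fast path   */
    printf("%s (same endpoint: %s)\n", ep1->addr, ep1 == ep2 ? "yes" : "no");
    return 0;
}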
This is not easy. It requires completely different initialization steps. The
procs will become dynamic, and the BTL structures (endpoint and friends) must be
initialized on the first message. Plus, we have the issue related to the remote
architecture, and thus how we store the BTL headers. In addition, we should not
forget that some BTLs will still require a coordinated initialization.

> i.e., we would not construct a list of all procs during MPI_Init, but only
> add an entry for a process with which we communicate. The method would go
> like this:
>
> 1. during MPI_Init, each BTL posts its contact info to the local modex
>
> 2. the "modex" call in MPI_Init simply sends that data to the local daemon,
> which asynchronously executes an all-to-all collective with the other daemons
> in the job. At the end of that operation, each daemon holds a complete modex
> database for the job. Meantime, the application process continues to run.
>
> 3. we remove the "add_procs" call within MPI_Init, and perhaps can eliminate
> the ORTE barrier at the end of MPI_Init. The reason we had that barrier was
> to ensure that all procs were ready to communicate before we allowed anyone
> to send a message. However, with this method, that may no longer be required.
>
> 4. we modify the BTLs so they (a) can receive a message from an unknown
> source, adding that source to their local proc list, and (b) when sending a
> message to another process, obtain the required contact info from their local
> daemon if they don't already have it. Thus, we will see an increased latency
> on first message - but we will ONLY store info for processes with which we
> actually communicate (thus reducing the memory burden) and will wire up much
> faster than we do today.
>
> I'm not (yet) that familiar with the details of many of the BTLs, but my
> initial review of them didn't see any showstoppers for this approach. If
> people think this might work and be an interesting approach, I'd be happy to
> help implement a prototype to quantify its behavior.

It depends on the perspective from which you look at this. I guess that from an
engineering perspective, adding this to Open MPI should not hurt. From a
research perspective there is incentive; it has already been done quite a while
ago in the context of other runtimes ([1]). Having an asynchronous modex is the
last step missing from [2]. We expect it to improve the startup performance
significantly, especially for applications with sparse communication patterns.
(A toy sketch of what the application-side call in steps 1 and 2 could reduce
to is appended after the references.)

george.

1. J. Sridhar, M. Koop, J. Perkins, and D. K. Panda, "ScELA: Scalable and
   Extensible Launching Architecture for Clusters," International Conference
   on High Performance Computing (HiPC '08), December 2008.

2. G. Bosilca, T. Herault, A. Rezmerita, and J. Dongarra, "On Scalability for
   MPI Runtime Systems," International Workshop on Runtime and Operating
   Systems for Supercomputers, May 31, 2011.
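As promised above, a toy sketch of what the application-side "modex" call in
steps 1 and 2 could reduce to: publish the local contact info to the daemon and
return immediately, with the daemons' all-to-all happening in the background.
Everything here is hypothetical (a plain pthread stands in for the daemon-side
collective):

/* Sketch of the non-blocking modex in steps 1-2: the process hands its
 * contact info to the local daemon and returns; the all-to-all among the
 * daemons proceeds in the background. All names are made up for illustration. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_t collective_thread;

/* Stand-in for the daemons' asynchronous all-to-all exchange. */
static void *daemon_allgather(void *arg)
{
    (void)arg;
    sleep(1);                         /* pretend the exchange takes a while */
    printf("daemon: modex database complete\n");
    return NULL;
}

/* What the "modex" call in MPI_Init would reduce to: publish and return. */
static void modex_publish(const char *my_contact_info)
{
    printf("proc: sent '%s' to local daemon\n", my_contact_info);
    pthread_create(&collective_thread, NULL, daemon_allgather, NULL);
    /* no barrier here: MPI_Init continues while the daemons exchange data */
}

int main(void)
{
    modex_publish("btl-tcp://10.0.0.1:5000");
    printf("proc: MPI_Init continues without waiting for the modex\n");
    pthread_join(collective_thread, NULL);   /* only so the demo exits cleanly */
    return 0;
}

The only point is the control flow: nothing in MPI_Init waits for the
collective to finish, which is why the first-message path sketched earlier has
to cope with peers whose contact info has not arrived yet.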