????? I'm talking about how to implement it, not what level holds the interface. Besides, "pineapple" hit a roadblock during the call and is a totally separate discussion.
On Jun 13, 2012, at 7:03 AM, Richard Graham wrote:

> I would suggest exposing modex at the pineapple level, and not tying it to a particular instance of run-time instantiation. This decouples the instantiation from the details of the run-time, and also gives the freedom to provide different instantiations for different job scenarios.
>
> Rich
>
> -----Original Message-----
> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Wednesday, June 13, 2012 12:10 AM
> To: Open MPI Developers
> Subject: [OMPI devel] Modex
>
> George raised something during this morning's call that I wanted to follow up on, relating to improving our modex operation. I've been playing with an approach that sounded similar to what he suggested, and perhaps we could pursue it in conjunction with moving the BTLs to OPAL.
>
> We currently block on the exchange of contact information for the BTLs when we perform the all-to-all operation we term the "modex". At the end of that operation, each process constructs a list of information for all processes in the job, so each process holds the complete BTL contact info for every process in its modex database. This consumes a significant amount of memory, especially as we scale to ever larger applications. In addition, the modex operation itself is one of the largest time consumers during MPI_Init.
>
> An alternative approach is for the BTLs to "add proc" only on "first message" to or from that process - i.e., we would not construct a list of all procs during MPI_Init, but would only add an entry for a process with which we actually communicate. The method would go like this:
>
> 1. During MPI_Init, each BTL posts its contact info to the local modex.
>
> 2. The "modex" call in MPI_Init simply sends that data to the local daemon, which asynchronously executes an all-to-all collective with the other daemons in the job. At the end of that operation, each daemon holds a complete modex database for the job. Meanwhile, the application process continues to run.
>
> 3. We remove the "add_procs" call within MPI_Init, and can perhaps eliminate the ORTE barrier at the end of MPI_Init. The reason we had that barrier was to ensure that all procs were ready to communicate before we allowed anyone to send a message; with this method, that may no longer be required.
>
> 4. We modify the BTLs so they (a) can receive a message from an unknown source, adding that source to their local proc list, and (b) when sending a message to another process, obtain the required contact info from their local daemon if they don't already have it. Thus, we will see increased latency on the first message - but we will ONLY store info for processes with which we actually communicate (reducing the memory burden), and wireup will be much faster than it is today.
>
> I'm not (yet) that familiar with the details of many of the BTLs, but my initial review of them didn't turn up any showstoppers for this approach. If people think this might work and would be an interesting approach, I'd be happy to help implement a prototype to quantify its behavior.
>
> Ralph
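To make step 2 of the quoted proposal concrete, here is a minimal stand-alone C sketch of the hand-off: the process posts its contact blob, the "modex" call returns immediately, and a background thread stands in for the daemon running the all-to-all collective while the application keeps going. Everything here (modex_db, daemon_allgather, the fake endpoint strings) is invented for illustration; it is not Open MPI code or API.

/* Toy sketch of step 2: the modex call hands the contact blob to the
 * local "daemon" and returns at once; the daemon completes the collective
 * in the background.  All names are hypothetical, not Open MPI code. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NPROCS 4                         /* pretend job size               */

static char modex_db[NPROCS][64];        /* daemon-side modex database     */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the asynchronous all-to-all among the daemons. */
static void *daemon_allgather(void *arg)
{
    (void)arg;
    sleep(1);                            /* pretend network/collective time */
    pthread_mutex_lock(&lock);
    for (int r = 0; r < NPROCS; r++)
        snprintf(modex_db[r], sizeof(modex_db[r]), "btl-tcp://10.0.0.%d:1024", r + 1);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t daemon;

    /* Step 1: post our own contact info to the local modex. */
    pthread_mutex_lock(&lock);
    snprintf(modex_db[0], sizeof(modex_db[0]), "btl-tcp://10.0.0.1:1024");
    pthread_mutex_unlock(&lock);

    /* Step 2: kick the daemon and return immediately - no process barrier. */
    pthread_create(&daemon, NULL, daemon_allgather, NULL);
    printf("modex call returned; daemons still gathering in the background\n");

    /* ... application work would continue here ... */

    pthread_join(daemon, NULL);
    printf("daemon-side modex complete: %s ... %s\n", modex_db[0], modex_db[NPROCS - 1]);
    return 0;
}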
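Step 4(b) can be sketched the same way: on the send path, contact info for a peer is pulled from the daemon's database only on first use and cached locally, so a process ends up storing entries only for the peers it actually talks to. Again, all names here (peer_entry_t, daemon_lookup, btl_send) are made up for illustration, not BTL interfaces.

/* Toy sketch of step 4(b): lazy "add proc" on first message.  The first
 * send to a peer pays a daemon round-trip; later sends hit the cache. */
#include <stdio.h>

#define JOB_SIZE 1024

typedef struct {
    int  known;            /* have we resolved this peer yet?   */
    char endpoint[64];     /* BTL contact string for the peer   */
} peer_entry_t;

static peer_entry_t peer_cache[JOB_SIZE];   /* sparse: filled on demand */
static int cached_peers = 0;

/* Stand-in for a query to the local daemon's modex database. */
static void daemon_lookup(int rank, char *buf, size_t len)
{
    snprintf(buf, len, "btl-tcp://10.0.0.%d:1024", rank);
}

static void btl_send(int dest, const char *msg)
{
    if (!peer_cache[dest].known) {          /* first message: extra latency */
        daemon_lookup(dest, peer_cache[dest].endpoint,
                      sizeof(peer_cache[dest].endpoint));
        peer_cache[dest].known = 1;
        cached_peers++;
    }
    printf("send to rank %d via %s: %s\n", dest, peer_cache[dest].endpoint, msg);
}

int main(void)
{
    /* Talk to only two peers out of 1024: only two entries get stored,
     * instead of a full modex list for the whole job. */
    btl_send(7, "hello");
    btl_send(7, "again");          /* cache hit, no daemon round-trip */
    btl_send(512, "hello");
    printf("cached entries: %d of %d ranks\n", cached_peers, JOB_SIZE);
    return 0;
}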