(I'm catching up on email from an unanticipated absence - forgive the delay)

Pineapple did not hit a roadblock during the call. It is still on
track. I will start a separate thread for the discussion. As I have
said many (many, many) times, if the pineapple interface needs to
change for OMPI/ORTE/OPAL then we will change it. George's problem (as
best I could tell) was not with the interface, but with pineapple
being a separate project in the tree versus being a framework in OMPI.
But that is a discussion we can have on another thread.

-- Josh

On Wed, Jun 13, 2012 at 9:07 AM, Ralph Castain <r...@open-mpi.org> wrote:
> ?????
>
> I'm talking about how to implement it, not what level holds the interface. 
> Besides, "pineapple" hit a roadblock during the call and is a totally 
> separate discussion.
>
>
> On Jun 13, 2012, at 7:03 AM, Richard Graham wrote:
>
>> I would suggest exposing modex at the pineapple level, and not tie it to a 
>> particular instance of run-time instantiation.  This decouples the 
>> instantiation from the details of the run-time, and also gives the freedom 
>> to provide different instantiations for different job scenarios.
>>
>> Rich
>>
>> -----Original Message-----
>> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On 
>> Behalf Of Ralph Castain
>> Sent: Wednesday, June 13, 2012 12:10 AM
>> To: Open MPI Developers
>> Subject: [OMPI devel] Modex
>>
>> George raised something during this morning's call that I wanted to 
>> follow up on, relating to improving our modex operation. I've been playing 
>> with an approach that sounded similar to what he suggested, and perhaps we 
>> could pursue it in concert with moving the BTLs to OPAL.
>>
>> We currently block on the exchange of BTL contact information by 
>> performing an all-to-all operation we term the "modex". At the end of that 
>> operation, each process has constructed a list of information for all 
>> processes in the job, so every process holds the complete BTL contact info 
>> for every other process in its modex database. This consumes a significant 
>> amount of memory, especially as we scale to ever larger applications. In 
>> addition, the modex operation itself is one of the largest time consumers 
>> during MPI_Init.
>>
>> An alternative approach is for the BTLs to "add proc" only on "first 
>> message" to or from that process - i.e., we would not construct a list of 
>> all procs during MPI_Init, but only add an entry for a process with which we 
>> actually communicate. The method would go like this:
>>
>> 1. during MPI_Init, each BTL posts its contact info to the local modex
>>
>> 2. the "modex" call in MPI_Init simply sends that data to the local daemon, 
>> which asynchronously executes an all-to-all collective with the other 
>> daemons in the job. At the end of that operation, each daemon holds a 
>> complete modex database for the job. Meanwhile, the application process 
>> continues to run.
>>
>> 3. we remove the "add_procs" call within MPI_Init, and can perhaps eliminate 
>> the ORTE barrier at the end of MPI_Init. The reason we had that barrier was 
>> to ensure that all procs were ready to communicate before we allowed anyone 
>> to send a message. However, with this method, that may no longer be required.
>>
>> 4. we modify the BTLs so they (a) can receive a message from an unknown 
>> source, adding that source to their local proc list, and (b) when sending a 
>> message to another process, obtain the required contact info from their 
>> local daemon if they don't already have it. Thus, we will see an increased 
>> latency on first message - but we will ONLY store info for processes with 
>> which we actually communicate (thus reducing the memory burden) and will 
>> wireup much faster than we do today.
>>
>> I'm not (yet) that familiar with the details of many of the BTLs, but my 
>> initial review of them didn't turn up any showstoppers for this approach. If 
>> people think this might work and be an interesting approach, I'd be happy to 
>> help implement a prototype to quantify its behavior.
>>
>> Ralph
>>
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>
>



-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
