George raised something during this morning's call that I wanted to follow up
on, relating to improving our modex operation. I've been playing with an
approach that sounds similar to what he suggested, and perhaps we could pursue
it in conjunction with moving the BTLs to OPAL.

We currently block on the exchange of contact information for the BTLs when we
perform an all-to-all operation we term the "modex". At the end of that
operation, each process constructs a list of information for all processes in 
the job, and each process contains the complete BTL contact info for every 
process in its modex database. This consumes a significant amount of memory, 
especially as we scale to ever larger applications. In addition, the modex 
operation itself is one of the largest time consumers during MPI_Init.

An alternative approach is for the BTLs to "add proc" only on "first message"
to or from that process - i.e., we would not construct a list of all procs 
during MPI_Init, but only add an entry for a process with which we communicate. 
The method would go like this:

1. during MPI_Init, each BTL posts its contact info to the local modex

2. the "modex" call in MPI_Init simply sends that data to the local daemon, 
which asynchronously executes an all-to-all collective with the other daemons 
in the job. At the end of that operation, each daemon holds a complete modex 
database for the job. Meanwhile, the application process continues to run.

3. we remove the "add_procs" call within MPI_Init, and perhaps can eliminate 
the ORTE barrier at the end of MPI_Init. The reason we had that barrier was to 
ensure that all procs were ready to communicate before we allowed anyone to 
send a message. However, with this method, that may no longer be required.

4. we modify the BTLs so they (a) can receive a message from an unknown
source, adding that source to their local proc list, and (b) when sending a 
message to another process, obtain the required contact info from their local 
daemon if they don't already have it. Thus, we will see increased latency on
the first message - but we will ONLY store info for processes with which we
actually communicate (thus reducing the memory burden) and will wire up much
faster than we do today.
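
To make steps 1-4 a bit more concrete, here is a minimal sketch in Python. It
is not Open MPI code: the names Daemon, LazyBTL, publish, lookup, send, and
deliver are all hypothetical, a single Daemon object stands in for the state
after the daemons' all-to-all has completed (the asynchronous part is not
modeled), and it assumes a receiver resolves an unknown sender by asking its
local daemon.

```python
class Daemon:
    """Hypothetical stand-in for the local daemon's modex database."""

    def __init__(self):
        self.modex = {}    # rank -> contact info for every process in the job
        self.lookups = 0   # number of first-message lookups served

    def publish(self, rank, contact):
        # Steps 1-2: a process hands its BTL contact info to the daemon.
        self.modex[rank] = contact

    def lookup(self, rank):
        # Step 4(b): a process asks for a peer's contact info on demand.
        self.lookups += 1
        return self.modex[rank]


class LazyBTL:
    """Hypothetical BTL that adds a proc only on first message."""

    def __init__(self, rank, contact, daemon):
        self.rank = rank
        self.daemon = daemon
        self.procs = {}    # only peers we have actually communicated with
        daemon.publish(rank, contact)

    def _add_proc(self, peer):
        # Pay the lookup cost once; later messages use the cached entry.
        if peer not in self.procs:
            self.procs[peer] = self.daemon.lookup(peer)
        return self.procs[peer]

    def send(self, dest, payload):
        contact = self._add_proc(dest)
        return {"src": self.rank, "to": contact, "payload": payload}

    def deliver(self, message):
        # Step 4(a): a message from an unknown source adds that source
        # to the local proc list (assumed here to go through the daemon).
        self._add_proc(message["src"])
        return message["payload"]
```

In this sketch a process that never talks to a given peer never holds an entry
for it, and the extra daemon round trip is paid only on the first message in
each direction.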

I'm not (yet) that familiar with the details of many of the BTLs, but my
initial review of them didn't turn up any showstoppers for this approach. If people
think this might work and be an interesting approach, I'd be happy to help 
implement a prototype to quantify its behavior.

Ralph

