Re: [OMPI devel] Open MPI BTL meeting in Knoxville

George Bosilca Tue, 5 Mar 2013 11:08:00 -0500

All,

[This is in complement to the internal notes that Jeff sent out earlier.]


As you might have heard, some of us met a few weeks ago at UTK to talk 
about the BTLs and their possible move down to the OPAL level. As a result, 
several key components have been identified as candidates that must 
be moved prior to the BTLs. You might have already noticed that some of the 
changes identified during this meeting have already begun.

Here is a comprehensive list of things to be moved. The ones marked with * have 
already been completed.
* Modex (ortedb)
- Mpool + rcache + conv
* Help messaging (get rid of ompi_show_help by replacing it with opal_show_help)
- RML / OOB
- BTL

Two additional things to be addressed and clearly defined during this move are:

- Naming + Endpoints: For now we'll go with a uint64_t packaged as an OPAL 
type (to be defined). This naming will only be used during the initial steps, 
until the upper layer (RTE or OMPI) takes control and its own naming scheme 
is used. The name is provided by the upper layer; OPAL will only use it as 
an index in the opal_db.
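
A minimal sketch of the naming idea, under stated assumptions: the type name 
example_opal_name_t and the small linear table standing in for opal_db are 
hypothetical, since the real OPAL type is still to be defined. The point is 
only that the lower layer treats the 64-bit value as an opaque lookup key.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opaque process name: the lower layer never interprets the
 * value, it only compares it and uses it as a lookup key. */
typedef struct {
    uint64_t id;   /* supplied by the upper layer (RTE or OMPI) */
} example_opal_name_t;

/* Toy stand-in for one database entry (NOT the real opal_db layout). */
typedef struct {
    example_opal_name_t name;
    const char *value;
} example_db_entry_t;

#define EXAMPLE_DB_MAX 16

typedef struct {
    example_db_entry_t entries[EXAMPLE_DB_MAX];
    size_t count;
} example_db_t;

/* Store a value under a name; returns 0 on success, -1 if the table is full. */
int example_db_store(example_db_t *db, example_opal_name_t name,
                     const char *value)
{
    assert(db != NULL);
    if (db->count >= EXAMPLE_DB_MAX) {
        return -1;
    }
    db->entries[db->count].name = name;
    db->entries[db->count].value = value;
    db->count++;
    return 0;
}

/* Look up by name only; returns NULL when the name is unknown. */
const char *example_db_fetch(const example_db_t *db, example_opal_name_t name)
{
    assert(db != NULL);
    for (size_t i = 0; i < db->count; i++) {
        if (db->entries[i].name.id == name.id) {
            return db->entries[i].value;
        }
    }
    return NULL;
}
```

Wrapping the integer in a struct keeps callers from doing arithmetic on it, 
which fits the idea that only the upper layer knows what the value means.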

- Thread safety: Minimize the locking per unit of usage. For this we will 
clean up the locking to keep only two variants: lower-case and upper-case 
(almost as they are today). The lower-case variant __always__ takes the 
lock, while the upper-case variant will be surrounded by an 
"if(threads_active)". Moreover, the upper-case version will be moved from the 
OPAL level to the OMPI level (thus the names will change from OPAL_* to OMPI_*). 
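
The two locking variants described above can be sketched as follows. This is 
an illustration only: every name here (example_mutex_t, 
example_threads_active, EXAMPLE_OMPI_THREAD_LOCK, and so on) is hypothetical, 
and the acquisitions counter exists purely so the effect can be observed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical runtime flag: true once the upper layer asked for threads. */
bool example_threads_active = false;

typedef struct {
    pthread_mutex_t m;
    unsigned long acquisitions;   /* illustration only: counts real locks */
} example_mutex_t;

/* Lower-case variant: ALWAYS takes the lock (the flavor that stays low). */
void example_mutex_lock(example_mutex_t *mx)
{
    int rc = pthread_mutex_lock(&mx->m);
    assert(rc == 0);
    (void)rc;
    mx->acquisitions++;           /* updated while holding the lock */
}

void example_mutex_unlock(example_mutex_t *mx)
{
    int rc = pthread_mutex_unlock(&mx->m);
    assert(rc == 0);
    (void)rc;
}

/* Upper-case variant: locks only when threads are active (the flavor that
 * would move up and be renamed from OPAL_* to OMPI_*). */
#define EXAMPLE_OMPI_THREAD_LOCK(mx)        \
    do {                                    \
        if (example_threads_active) {       \
            example_mutex_lock(mx);         \
        }                                   \
    } while (0)

#define EXAMPLE_OMPI_THREAD_UNLOCK(mx)      \
    do {                                    \
        if (example_threads_active) {       \
            example_mutex_unlock(mx);       \
        }                                   \
    } while (0)
```

With example_threads_active left false, the upper-case pair is a no-op, while 
the lower-case functions still lock. The sketch assumes the flag does not 
change between a lock and its matching unlock.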

From a technical perspective, a few other ideas have been thrown around:

- orte_show_help should lose the DECLSPEC, and its usage should be confined to 
the ORTE layer.
- fix UDCM !!!
- everything from the MPI standard that is not performance critical will be 
protected by a big lock, with one lock per type of resource (attributes, info, 
whatever else)
- redo the dynamic processing layer
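
The "one big lock per type of resource" idea from the list above could look 
roughly like this. It is a sketch under assumptions: the enum, the lock array 
and the toy attribute table are all made up for illustration and do not 
correspond to existing Open MPI symbols.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical resource classes, one coarse lock each. */
typedef enum {
    EXAMPLE_RES_ATTRIBUTES = 0,
    EXAMPLE_RES_INFO,
    EXAMPLE_RES_COUNT
} example_resource_t;

pthread_mutex_t example_big_locks[EXAMPLE_RES_COUNT] = {
    PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER
};

/* Toy attribute storage standing in for a non-performance-critical resource. */
#define EXAMPLE_MAX_ATTRS 8
int example_attrs[EXAMPLE_MAX_ATTRS];

/* Every access to the attribute table goes through the attributes lock. */
int example_attr_set(int key, int value)
{
    if (key < 0 || key >= EXAMPLE_MAX_ATTRS) {
        return -1;
    }
    pthread_mutex_lock(&example_big_locks[EXAMPLE_RES_ATTRIBUTES]);
    example_attrs[key] = value;
    pthread_mutex_unlock(&example_big_locks[EXAMPLE_RES_ATTRIBUTES]);
    return 0;
}

int example_attr_get(int key, int *value)
{
    assert(value != NULL);
    if (key < 0 || key >= EXAMPLE_MAX_ATTRS) {
        return -1;
    }
    pthread_mutex_lock(&example_big_locks[EXAMPLE_RES_ATTRIBUTES]);
    *value = example_attrs[key];
    pthread_mutex_unlock(&example_big_locks[EXAMPLE_RES_ATTRIBUTES]);
    return 0;
}
```

Info objects would use example_big_locks[EXAMPLE_RES_INFO] in the same way, so 
attribute and info calls never contend with each other.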

After all these discussions we ended up with a plan to move forward.
Step -2: Remove Solaris threads
Step -1: Fix UDCM/openib
Step 0: opal_db/modex_db down into OPAL
Step 0.5: shared opal_db on the node (may be delayed, as it is not critical).
Step 1: move the BTLs and all the other needed components.
Step 2: Always enable locking in BTL. Evaluate the impact on the performance 
before enabling.
Step 3: Fix the atomics (lower case and upper case). The condition in 
upper-case should disappear.
Step 4: Fix the performance (if necessary), and redesign the locking strategy.

  George.






On Mar 5, 2013, at 16:33 , Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> Sorry it took so long to forward these notes to everyone.  Here are some notes 
> from the BTL meeting we had in Knoxville a few weeks ago.
> 
>> Date: 
>>   Feb. 12, 2013
>> 
>> People: 
>>   Thomas Herault
>>   George Bosilca
>>   Jeff Squyres
>>   Brian Barrett
>>   Aurelien Bouteiller
>>   Ralph Castain
>>   Nathan Hjelm
>> 
>> Goal: 
>>   Lay out the general design of moving the BTL framework into OPAL. 
>> 
>> 
>> -== Identifying dependencies ==-    
>> 
>> BTL
>> +------> Modex
>> +------> Mpool + rcache + conv
>> +------> bml / allocators
>> +------> Help/*
>> +------> Naming + Endpoints?
>> +------> (RML/OOB)
>> +------> Threads
>> 
>> 
>> ==== ACTION PLAN ====
>> 
>> 0. Remove Solaris Threads (--with-thread option is attached)
>> 1. Opal DB/modex
>> 1.b OpenIB UDCM independent from OOB
>> 2. Move BTL down to OPAL
>> 3. Move the locks to lowercase versions (that are always locking), look at 
>> perf.
>> 4. Look at conditions, atomics, etc
>> 4.5: add big locks on things that are maybe not thread safe and not 
>> performance critical
>> 5. Fix perf/redesign locking (in SM, in particular)
>> 6. Use BTL tcp in place of OOB in ORTE
>> 
>> 
>> ==== DETAIL OF ISSUES ====
>> 
>> -== IB BTL bootstrapping ==-
>> 
>> IB BTL is the only one that depends on OOB/RML
>> Options: 
>> 1. Use the TCP BTL to bootstrap IB BTL. Brian doesn't like this, because
>> making it available is an enabler for bad practice that will creep into 
>> the codebase
>> 2. Remove OOB, Fix UDCM so it stops doing things it should not have 
>> done anyway. 
>> We settle for option 2.
>> 
>> -== SM initialization ==- 
>> 
>> Some technical discussion on the way the shared segment is created and 
>> the sync mechanism for the shared file. A number of issues seem to be 
>> avoided only because the modex synchronizes before we attempt the file 
>> access. There may be trouble if the modex is removed (or is not 
>> synchronizing).
>> 
>> -== Process name scalability ==-
>> 
>> Process names use a lot of space.
>> Do we need the process information from everybody at all times?
>> 
>> Modex vs opal_db. (need to clarify, I was doing something else)
>> 
>> Too many things are going into the modex/db. On many architectures, we
>> don't need the hostname, or other info, because they can be derived. On
>> some other machines, the hostname has no meaning.
>> 
>> Brian: BTL should not have the hostname - at all - ? BTL should not 
>> report errors themselves, errors should go up and the BTL stay silent 
>> (also avoids some massive multinode error logs).
>> * error is reported upstack (no printf)
>> * Callback to get the error string later, when the pretty print happens
>> 
>> *** 
>> We need a temporary name during bootstrapping (before we get the OMPI 
>> names setup). Could be created from a 128bit hash, it should have low 
>> probability of collision, we can crash the job if we detect collision (?).
>> 
>> ompi_name is some sort of proxy for orte_name. Every time we use an
>> ompi_name, it gets converted to orte_name immediately after.
>> 
>> We also need an identifier to prevent random stuff from connecting to us.
>> There is an issue with dynamic processes, because names can still be
>> unknown at that point. DPM is expensive and saved by a modex; we'll have a 
>> problem later on if we make it fast. 
>> 
>> ***
>> When does the BTL need a name first ? 
>> 
>> opal_init, opal_init_util (from ompi_info)
>> add? opal_init_btl(name) + opal_fini_btl()
>> 
>> During opal_init_btl:
>> * Btls register with the name
>> * Add their local info to the DB (opal_db)
>>   use a hashtable for storing name{ key=value ... }
>>   Align the values to 64 bits, so that all keys are allocated (and sent) 
>>   in a single bulk. 
>> * Should some modex key appear as global shared, local shared, local, 
>>   (lazy propagated?)
>> * 
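
One possible reading of the "align to 64 bits, send in a single bulk" note 
above, as a sketch only: values are appended to one contiguous buffer, each 
starting at an 8-byte boundary. The helper names example_align8 and 
example_pack_value are hypothetical, and the real opal_db layout may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Round a size up to the next multiple of 8 bytes (64 bits). */
size_t example_align8(size_t n)
{
    return (n + 7u) & ~(size_t)7u;
}

/* Append one value to the bulk buffer at an 8-byte aligned offset.
 * Returns the offset where the value was written, or (size_t)-1 if the
 * value does not fit. *used tracks how many bytes are occupied so far. */
size_t example_pack_value(uint8_t *buf, size_t cap, size_t *used,
                          const void *val, size_t len)
{
    assert(buf != NULL && used != NULL && val != NULL);
    size_t off = example_align8(*used);
    if (off > cap || len > cap - off) {
        return (size_t)-1;
    }
    memset(buf + *used, 0, off - *used);   /* zero the padding bytes */
    memcpy(buf + off, val, len);
    *used = off + len;
    return off;
}
```

Because everything lives in one buffer, the whole set of values for a name can 
be handed to the transport in a single send of *used bytes.
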
>> 
>> -== BML ==-
>> 
>> We should not care about it and not move it around. We are fine using 
>> BTL only, the BML offers little functionality. We'll try, but if it 
>> is hard we'll forget it. 
>> 
>> * Addprocs: Assumption that we have to call addprocs for each 
>> endpoint. Maybe we can change this so that addprocs is called only once. 
>> * If orte uses BTL, it will have to be called twice, that is sorry 
>> (or not ?). It can be postponed for when ORTE moves to BTL. 
>> 
>> -== Active Message TAG numbers ==-
>> 
>> They have to move down too. The split in 32bits groups makes the tags 
>> sparse. We have a layer separation break here, but we may not want to 
>> have all PML_OB1 tags appear down in OPAL. We'll put down the header
>> file, we don't change it for now (we are not overcrowded so it's OK for 
>> it to be sparse). George will rearrange so that there are more possible 
>> families (at the expense of the number of possible tags per family).
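
To make the families-versus-tags trade-off concrete, here is a small sketch. 
It assumes an 8-bit tag split into a family field and a per-family index; the 
3/5 split and all EXAMPLE_* names are assumptions for illustration, not the 
values in the actual header file.

```c
#include <assert.h>
#include <stdint.h>

#define EXAMPLE_TAG_BITS     8u
#define EXAMPLE_FAMILY_BITS  3u   /* assumed: 8 families */
#define EXAMPLE_INDEX_BITS   (EXAMPLE_TAG_BITS - EXAMPLE_FAMILY_BITS)
#define EXAMPLE_INDEX_MASK   ((1u << EXAMPLE_INDEX_BITS) - 1u)  /* 32 tags */

/* Build a tag from a family number and an index within that family. */
uint8_t example_make_tag(unsigned family, unsigned index)
{
    assert(family < (1u << EXAMPLE_FAMILY_BITS));
    assert(index <= EXAMPLE_INDEX_MASK);
    return (uint8_t)((family << EXAMPLE_INDEX_BITS) | index);
}

/* Recover the family (upper bits) of a tag. */
unsigned example_tag_family(uint8_t tag)
{
    return (unsigned)tag >> EXAMPLE_INDEX_BITS;
}

/* Recover the index within the family (lower bits) of a tag. */
unsigned example_tag_index(uint8_t tag)
{
    return (unsigned)tag & EXAMPLE_INDEX_MASK;
}
```

Under these assumptions, raising EXAMPLE_FAMILY_BITS from 3 to 4 doubles the 
number of families to 16 and halves the tags per family to 16.
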
>> 
>> -== Thread safety ==-
>> 
>> Because the BTLs will now be used by upper layers whose behavior we 
>> cannot know, we have to assume that threads are on (by default), 
>> leading to a bad performance hit, due to using opal_list_t that are 
>> locked deep inside. 
>> 
>> What needs to be protected ? 
>> * per btl locking: huge cost on everything
>> * per endpoint locking? 
>> 
>> That's related to enabling async progress, but that is a big chunk of 
>> work. We just want to keep in mind that goal so that we don't make it 
>> worse than it already is. 
>> 
>> Do we want the ability to turn off/on thread safety at runtime ?
>> * lock, unlock, trylock:
>>   * accessors that are always safe
>>   * accessors that can be turned unsafe at runtime (only for OMPI level)
>> * swap, cmpswap, subtract, add (32, 64bit atomics): no change (we already 
>> have both), but the CAPS move to OMPI 
>> * signal and condition variables
>>   * When, how do we call progress when needed if we remove the 
>>   UPPER_CASE that calls it ?
>>   * appears in wait, test, free_list
>> * We'll try to remove as many as we can (upper case locks) and see
>>  where we are in 4-5 months from now. 
>> 
>> ===== OTHER issues found while talking =====
>> 
>> * DPM is slow, it needs a speedup. 
>> 
>> -== Error reporting / printf ==-
>> Replace all orte_show_help with opal_show_help. Make sure that the
>> symbol is not exposed anymore outside ORTE to force update.
>> * Deduplication happens in orte_show_help; ompi_show_help calls back into 
>> orte_show_help. Let's get rid of it completely. 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
