Sorry it took so long to forward these notes to everyone.  Here are some
notes from the BTL meeting we had in Knoxville a few weeks ago.

> Date: 
>    Feb. 12, 2013
> 
> People: 
>    Thomas Herault
>    George Bosilca
>    Jeff Squyres
>    Brian Barrett
>    Aurelien Bouteiller
>    Ralph Castain
>    Nathan Hjelm
> 
> Goal: 
>    Lay out the general design of moving the BTL framework into OPAL. 
> 
> 
> -== Identifying dependencies ==-    
> 
> BTL
> +------> Modex
> +------> Mpool + rcache + conv
> +------> bml / allocators
> +------> Help/*
> +------> Naming + Endpoints?
> +------> (RML/OOB)
> +------> Threads
> 
> 
> ==== ACTION PLAN ====
> 
> 0. Remove Solaris Threads (--with-thread option is attached)
> 1. Opal DB/modex
> 1.b OpenIB UDCM independent from OOB
> 2. Move BTL down to OPAL
> 3. Move locks to the lowercase versions (that always lock), look at 
> perf.
> 4. Look at conditions, atomics, etc
> 4.5: add big locks on things that are maybe not thread safe and not 
> performance critical
> 5. Fix perf/redesign locking (in SM, in particular)
> 6. Use BTL tcp in place of OOB in ORTE
> 
> 
> ==== DETAIL OF ISSUES ====
> 
> -== IB BTL bootstrapping ==-
> 
> IB BTL is the only one that depends on OOB/RML
> Options: 
>  1. Use the TCP BTL to bootstrap the IB BTL. Brian doesn't like this, 
>  because making it available is an enabler for bad practice that will 
>  creep into the codebase.
>  2. Remove OOB, fix UDCM so it stops doing things it should not have 
>  done anyway. 
> We settle for option 2.
> 
> -== SM initialization ==- 
> 
> Some technical discussion on the way the shared segment is created and 
> the sync mechanism for the shared file. There are a number of issues 
> that seem to be avoided only because the modex synchronizes before 
> we attempt the file access. There may be trouble if the modex is 
> removed (or is not synchronizing).
> 
> -== Process name scalability ==-
> 
> Process names use a lot of space.
> Do we need the process information from everybody at all times?
> 
> Modex vs opal_db. (need to clarify, I was doing something else)
> 
> Too many things are going into the modex/db. On many architectures, we 
> don't need the hostname, or other info, because they can be derived. On 
> some other machines, the hostname has no meaning.
> 
> Brian: BTLs should not have the hostname - at all - ? BTLs should not 
> report errors themselves; errors should go up and the BTL stays silent 
> (this also avoids some massive multinode error logs).
>  * error is reported upstack (no printf)
>  * Callback to get the error string later, when the pretty print happens
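
One way to read the two bullets above, as a minimal sketch only: the BTL
returns an error code and prints nothing, and the upper layer asks for the
string later through a callback, at the point where it decides to
pretty-print. Every name here (the sketch_ prefix, the error codes, the
module struct) is hypothetical and not part of the real BTL interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical error codes; not the real BTL error values. */
typedef enum {
    SKETCH_BTL_SUCCESS = 0,
    SKETCH_BTL_ERR_UNREACHABLE,
    SKETCH_BTL_ERR_OUT_OF_RESOURCE
} sketch_btl_err_t;

/* Callback the upper layer invokes later, when it pretty-prints. */
typedef const char *(*sketch_btl_strerror_fn)(sketch_btl_err_t err);

/* Hypothetical module descriptor carrying the callback. */
typedef struct {
    const char *name;
    sketch_btl_strerror_fn strerror_cb;
} sketch_btl_module_t;

static const char *sketch_tcp_strerror(sketch_btl_err_t err)
{
    switch (err) {
    case SKETCH_BTL_SUCCESS:             return "success";
    case SKETCH_BTL_ERR_UNREACHABLE:     return "peer unreachable";
    case SKETCH_BTL_ERR_OUT_OF_RESOURCE: return "out of resources";
    }
    return "unknown error";
}

/* The BTL never prints: it only hands a code back up the stack. */
static sketch_btl_err_t sketch_btl_send(const sketch_btl_module_t *btl,
                                        int peer_reachable)
{
    assert(btl != NULL);
    return peer_reachable ? SKETCH_BTL_SUCCESS : SKETCH_BTL_ERR_UNREACHABLE;
}

static const sketch_btl_module_t sketch_tcp_btl = {
    "tcp", sketch_tcp_strerror
};
```

The caller keeps the returned code and calls sketch_tcp_btl.strerror_cb(rc)
only if and when it reports the failure, so a failure seen on many nodes
can be reported once upstack.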
> 
> *** 
> We need a temporary name during bootstrapping (before we get the OMPI 
> names set up). It could be created from a 128-bit hash; it should have a 
> low probability of collision, and we can crash the job if we detect a 
> collision (?).
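
A rough sketch of such a temporary name, under loud assumptions: the notes
do not say what gets hashed or which hash is used. Here the input is
hostname + pid + a caller-supplied nonce, and the 128 bits are two 64-bit
FNV-1a passes with different seeds, which is only a stand-in for a real
128-bit hash. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-bit temporary name used only during bootstrap. */
typedef struct { uint64_t hi, lo; } sketch_tmp_name_t;

/* 64-bit FNV-1a with a caller-chosen starting value. */
static uint64_t sketch_fnv1a64(const void *data, size_t len, uint64_t seed)
{
    const unsigned char *p = data;
    uint64_t h = seed;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;          /* FNV 64-bit prime */
    }
    return h;
}

/* Build a name from (host, pid, nonce); the choice of inputs is assumed. */
static sketch_tmp_name_t sketch_make_tmp_name(const char *host,
                                              uint32_t pid, uint64_t nonce)
{
    unsigned char buf[256 + sizeof pid + sizeof nonce];
    size_t hlen;
    sketch_tmp_name_t n;

    assert(host != NULL);
    hlen = strlen(host);
    if (hlen > 256) hlen = 256;         /* truncate overly long hostnames */
    memcpy(buf, host, hlen);
    memcpy(buf + hlen, &pid, sizeof pid);
    memcpy(buf + hlen + sizeof pid, &nonce, sizeof nonce);
    hlen += sizeof pid + sizeof nonce;

    n.hi = sketch_fnv1a64(buf, hlen, 0xcbf29ce484222325ULL); /* FNV offset basis */
    n.lo = sketch_fnv1a64(buf, hlen, 0x84222325cbf29ce4ULL); /* arbitrary 2nd seed */
    return n;
}

/* Two peers presenting equal names would be the collision to abort on. */
static int sketch_tmp_name_equal(sketch_tmp_name_t a, sketch_tmp_name_t b)
{
    return a.hi == b.hi && a.lo == b.lo;
}
```

Collision handling would then be a matter of comparing each newly seen
name against the known ones with sketch_tmp_name_equal and aborting the
job on a match.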
> 
> ompi_name is some sort of proxy for orte_name. Every time we use an
> ompi_name, it gets converted to orte_name immediately after.
> 
> We also need an identifier to prevent random stuff from connecting to us.
> There is an issue with dynamic processes, because names can still be 
> unknown. DPM is expensive and saved by a modex, we'll have a problem 
> later on if we make it fast. 
> 
> ***
> When does the BTL need a name first ? 
> 
> opal_init, opal_init_util (from ompi_info)
> add? opal_init_btl(name) + opal_fini_btl()
> 
> During opal_init_btl:
>  * Btls register with the name
>  * Add their local info to the DB (opal_db)
>    use a hashtable for storing name{ key=value ... }
>    Align the values on 64 bits, so that all keys are allocated (and 
>    sent) in a single bulk. 
>  * Should some modex key appear as global shared, local shared, local, 
>    (lazy propagated?)
>  * 
> 
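
The "single bulk" idea above could look roughly like this: every key and
value is padded to a multiple of 8 bytes and appended to one contiguous
buffer, which can then be sent as-is. The record layout and all names are
assumptions for illustration; this is not the opal_db format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Round n up to the next multiple of 8 bytes (64 bits). */
#define SKETCH_ALIGN8(n) (((size_t)(n) + 7u) & ~(size_t)7u)

/* One contiguous buffer holding every key/value of a process. */
typedef struct {
    unsigned char *buf;
    size_t cap;
    size_t used;
} sketch_bulk_t;

/* Hypothetical record layout:
 *   [u32 key_len][u32 val_len][key, padded to 8][value, padded to 8] */
static int sketch_bulk_put(sketch_bulk_t *b, const char *key,
                           const void *val, uint32_t val_len)
{
    uint32_t key_len;
    size_t need;
    unsigned char *p;

    assert(b != NULL && key != NULL && val != NULL);
    key_len = (uint32_t)strlen(key) + 1;      /* include the NUL */
    need = 8 + SKETCH_ALIGN8(key_len) + SKETCH_ALIGN8(val_len);
    if (b->used + need > b->cap) return -1;   /* buffer full */

    p = b->buf + b->used;
    memset(p, 0, need);                       /* zero the padding */
    memcpy(p, &key_len, 4);
    memcpy(p + 4, &val_len, 4);
    memcpy(p + 8, key, key_len);
    memcpy(p + 8 + SKETCH_ALIGN8(key_len), val, val_len);
    b->used += need;
    return 0;
}

/* Linear scan; returns a pointer into the buffer or NULL if absent. */
static const void *sketch_bulk_get(const sketch_bulk_t *b, const char *key,
                                   uint32_t *val_len)
{
    size_t off = 0;
    while (off < b->used) {
        uint32_t klen, vlen;
        const unsigned char *k, *v;
        memcpy(&klen, b->buf + off, 4);
        memcpy(&vlen, b->buf + off + 4, 4);
        k = b->buf + off + 8;
        v = k + SKETCH_ALIGN8(klen);
        if (strcmp((const char *)k, key) == 0) {
            *val_len = vlen;
            return v;
        }
        off += 8 + SKETCH_ALIGN8(klen) + SKETCH_ALIGN8(vlen);
    }
    return NULL;
}
```

With an 8-byte-aligned backing buffer, every value starts on a 64-bit
boundary, and the first b.used bytes are the whole payload to ship.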
> -== BML ==-
> 
> We should not care about it and not move it around. We are fine using 
> BTL only, the BML offers little functionality. We'll try, but if it 
> is hard we'll forget it. 
> 
>  * Addprocs: the assumption is that we have to call addprocs for each 
>  endpoint. Maybe we can change this so that addprocs is called only once. 
>  * If ORTE uses the BTL, it will have to be called twice, which is 
>  unfortunate (or not ?). It can be postponed until ORTE moves to BTL. 
> 
> -== Active Message TAG numbers ==-
> 
> They have to move down too. The split in 32bits groups makes the tags 
> sparse. We have a layer separation break here, but we may not want to 
> have all PML_OB1 tags appear down in OPAL. We'll put down the header
> file, and we don't change it for now (we are not overcrowded so it's ok 
> for it to be sparse). George will rearrange so that there are more 
> possible families (at the expense of the number of possible tags per 
> family).
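
To make the families-versus-tags trade-off concrete, here is a sketch that
assumes an 8-bit tag whose high bits select the family and whose low bits
select the tag inside it. The bit widths and all names are hypothetical;
the notes do not give the actual layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 8-bit active-message tag. */
typedef uint8_t sketch_am_tag_t;

/* Raising FAMILY_BITS doubles the families and halves the tags in each. */
#define SKETCH_FAMILY_BITS     3u
#define SKETCH_TAG_BITS        (8u - SKETCH_FAMILY_BITS)
#define SKETCH_NUM_FAMILIES    (1u << SKETCH_FAMILY_BITS)
#define SKETCH_TAGS_PER_FAMILY (1u << SKETCH_TAG_BITS)

/* Compose a tag from a family number and an index inside that family. */
static sketch_am_tag_t sketch_make_tag(unsigned family, unsigned index)
{
    assert(family < SKETCH_NUM_FAMILIES);
    assert(index < SKETCH_TAGS_PER_FAMILY);
    return (sketch_am_tag_t)((family << SKETCH_TAG_BITS) | index);
}

/* Recover the family from the high bits. */
static unsigned sketch_tag_family(sketch_am_tag_t t)
{
    return (unsigned)t >> SKETCH_TAG_BITS;
}

/* Recover the per-family index from the low bits. */
static unsigned sketch_tag_index(sketch_am_tag_t t)
{
    return (unsigned)t & (SKETCH_TAGS_PER_FAMILY - 1u);
}
```

With 3 family bits this gives 8 families of 32 tags; moving to 4 family
bits would give 16 families of 16 tags, with the total fixed at 256.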
> 
> -== Thread safety ==-
> 
> Because BTLs are now being used by upper layers whose behavior we don't 
> know, we have to assume that threads are on (by default), leading to a 
> bad performance hit due to using opal_list_t's that are locked deep 
> inside. 
> 
> What needs to be protected ? 
>  * per btl locking: huge cost on everything
>  * per endpoint locking? 
> 
> That's related to enabling async progress, but that is a big chunk of 
> work. We just want to keep in mind that goal so that we don't make it 
> worse than it already is. 
> 
> Do we want the ability to turn off/on thread safety at runtime ?
> * lock, unlock, trylock:
>    * accessors that are always safe
>    * accessors that can be turned unsafe at runtime (only for OMPI level)
> * swap, cmpswap, subtract, add (32, 64bit atomics): no change (we already 
> have both), but the CAPS move to OMPI 
> * signal and condition variables
>    * When, how do we call progress when needed if we remove the 
>    UPPER_CASE that calls it ?
>    * appears in wait, test, free_list
> * We'll try to remove as many as we can (upper case locks) and see
>   where we are in 4-5 months from now. 
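
The two kinds of accessors discussed above can be sketched as follows: a
lowercase function that always locks, and an uppercase macro that checks a
runtime flag first. This uses a C11 atomic_flag spinlock purely to stay
self-contained; the names, the flag, and the acquisition counter are all
hypothetical and are not the OPAL implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock; the counter exists only to make the sketch observable. */
typedef struct {
    atomic_flag flag;
    unsigned acquisitions;
} sketch_mutex_t;

#define SKETCH_MUTEX_INIT { ATOMIC_FLAG_INIT, 0 }

/* Lowercase accessors: always lock, regardless of any runtime setting. */
static void sketch_mutex_lock(sketch_mutex_t *m)
{
    assert(m != NULL);
    while (atomic_flag_test_and_set_explicit(&m->flag, memory_order_acquire)) {
        /* spin until the holder releases */
    }
    m->acquisitions++;
}

static void sketch_mutex_unlock(sketch_mutex_t *m)
{
    assert(m != NULL);
    atomic_flag_clear_explicit(&m->flag, memory_order_release);
}

/* Returns true if the lock was taken, false if it was already held. */
static bool sketch_mutex_trylock(sketch_mutex_t *m)
{
    assert(m != NULL);
    if (atomic_flag_test_and_set_explicit(&m->flag, memory_order_acquire)) {
        return false;
    }
    m->acquisitions++;
    return true;
}

/* Runtime switch, meant to be flipped only by the top (OMPI) layer. */
static bool sketch_using_threads = true;

/* Uppercase accessors: become no-ops when thread safety is turned off. */
#define SKETCH_THREAD_LOCK(m) \
    do { if (sketch_using_threads) sketch_mutex_lock(m); } while (0)
#define SKETCH_THREAD_UNLOCK(m) \
    do { if (sketch_using_threads) sketch_mutex_unlock(m); } while (0)
```

Code moved down into OPAL would call only the lowercase functions, while
the uppercase macros would stay with the layer that owns the flag.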
> 
> ===== OTHER issues found while talking =====
> 
> * DPM is slow, it needs a speedup. 
> 
> -== Error reporting / printf ==-
> Replace all orte_show_help with opal_show_help. Make sure that the
> symbol is not exposed anymore outside ORTE to force update.
> * orte_show_help: deduplication happens in ORTE, and ompi_show_help 
>  calls back into orte_show_help. Let's get rid of it completely. 

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

