Sorry it took so long to forward these notes to everyone. Here are some notes from the BTL meeting we had in Knoxville a few weeks ago.
> Date:
> Feb. 12, 2013
>
> People:
> Thomas Herault
> George Bosilca
> Jeff Squyres
> Brian Barrett
> Aurelien Bouteiller
> Ralph Castain
> Nathan Hjelm
>
> Goal:
> Lay out the general design of moving the BTL framework into OPAL.
>
>
> -== Identifying dependencies ==-
>
> BTL
>  +------> Modex
>  +------> Mpool + rcache + conv
>  +------> bml / allocators
>  +------> Help/*
>  +------> Naming + Endpoints?
>  +------> (RML/OOB)
>  +------> Threads
>
>
> ==== ACTION PLAN ====
>
> 0. Remove Solaris Threads (--with-thread option is attached)
> 1. Opal DB/modex
> 1.b OpenIB UDCM independent from OOB
> 2. Move BTL down to OPAL
> 3. Move locks to the lowercase versions (which always lock), look at
>    perf.
> 4. Look at conditions, atomics, etc.
> 4.5: Add big locks on things that are maybe not thread safe and not
>    performance critical
> 5. Fix perf/redesign locking (in SM, in particular)
> 6. Use BTL tcp in place of OOB in ORTE
>
>
> ==== DETAIL OF ISSUES ====
>
> -== IB BTL bootstrapping ==-
>
> IB BTL is the only one that depends on OOB/RML.
> Options:
> 1. Use the TCP BTL to bootstrap IB BTL. Brian doesn't like this, because
>    making it available is an enabler for bad practice that will creep
>    into the codebase.
> 2. Remove OOB, fix UDCM so it stops doing things it should not have
>    done anyway.
> We settle for option 2.
>
> -== SM initialization ==-
>
> Some technical discussion on the way the shared segment is created and
> the sync mechanism for the shared file. There are a number of issues
> that seem to benefit from the fact that the modex synchronizes before
> we attempt the file access. There may be trouble if the modex is
> removed (or is not synchronizing).
>
> -== Process name scalability ==-
>
> Process names use a lot of space.
> Do we need the process information from everybody at all times?
>
> Modex vs opal_db. (need to clarify, I was doing something else)
>
> Too many things are going into the modex/db.
> On many architectures, we don't need
> the hostname, or other info, because they can be derived. On some other
> machines, the hostname has no meaning.
>
> Brian: BTL should not have the hostname - at all - ? BTLs should not
> report errors themselves; errors should go up and the BTL stay silent
> (also avoids some massive multinode error logs).
> * Error is reported upstack (no printf)
> * Callback to get the error string later, when the pretty print happens
>
> ***
> We need a temporary name during bootstrapping (before we get the OMPI
> names set up). It could be created from a 128-bit hash; it should have a
> low probability of collision, and we can crash the job if we detect a
> collision (?).
>
> ompi_name is some sort of proxy for orte_name. Every time we use an
> ompi_name, it gets converted to orte_name immediately after.
>
> We also need an identifier to prevent random stuff from connecting to us.
> There is an issue with dynamic processes, because names may not be known
> yet. DPM is expensive and saved by a modex; we'll have a problem later
> on if we make it fast.
>
> ***
> When does the BTL need a name first?
>
> opal_init, opal_init_util (from ompi_info)
> add? opal_init_btl(name) + opal_fini_btl()
>
> During opal_init_btl:
> * BTLs register with the name
> * Add their local info to the DB (opal_db)
>   Use a hashtable for storing name{ key=value ... }
>   Align the values on 64 bits, so that all keys are allocated (and sent)
>   in a single bulk.
> * Should some modex keys appear as global shared, local shared, local,
>   (lazy propagated?)
>
> -== BML ==-
>
> We should not care about it and not move it around. We are fine using
> BTL only; the BML offers little functionality. We'll try, but if it
> is hard we'll forget it.
>
> * Addprocs: Assumption that we have to call addprocs for each
>   endpoint. Maybe we can change this so that addprocs is called only once.
> * If ORTE uses BTL, it will have to be called twice, which is unfortunate
>   (or not?). It can be postponed until ORTE moves to BTL.
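
To make the "align the values on 64 bits ... single bulk" bullet in the opal_init_btl list above a bit more concrete, here is a rough sketch of how key/value pairs could be packed back to back into one contiguous buffer with every field padded to 8 bytes. This is only an illustration under my own assumptions: the record layout and the names align8, kv_packed_size, kv_pack and kv_find are hypothetical, not opal_db or modex code, and the buffer is assumed to be used on one machine (no byte-order conversion).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Round n up to the next multiple of 8 bytes (64 bits). */
size_t align8(size_t n)
{
    return (n + 7u) & ~(size_t)7u;
}

/* Hypothetical record layout (an assumption, not an Open MPI structure):
 *   [uint64_t key_len][uint64_t val_len][key bytes + pad][value bytes + pad]
 * key_len counts the terminating NUL of the key string. */
size_t kv_packed_size(size_t key_len, size_t val_len)
{
    return 2 * sizeof(uint64_t) + align8(key_len) + align8(val_len);
}

/* Append one key/value pair at offset off (must be 8-byte aligned).
 * Returns the new offset, or 0 if the record does not fit in cap bytes. */
size_t kv_pack(uint8_t *buf, size_t cap, size_t off,
               const char *key, const void *val, size_t val_len)
{
    size_t key_len = strlen(key) + 1;
    size_t need = kv_packed_size(key_len, val_len);
    uint64_t hdr[2];

    assert(off % 8 == 0);
    if (off > cap || need > cap - off) {
        return 0;
    }
    hdr[0] = (uint64_t)key_len;
    hdr[1] = (uint64_t)val_len;
    memset(buf + off, 0, need);                 /* zero the padding bytes */
    memcpy(buf + off, hdr, sizeof hdr);
    memcpy(buf + off + sizeof hdr, key, key_len);
    memcpy(buf + off + sizeof hdr + align8(key_len), val, val_len);
    return off + need;
}

/* Linear scan over the first `used` bytes of the bulk buffer.
 * Returns a pointer to the value and stores its length, or NULL. */
const void *kv_find(const uint8_t *buf, size_t used,
                    const char *key, size_t *val_len)
{
    size_t off = 0;

    while (off + 2 * sizeof(uint64_t) <= used) {
        uint64_t hdr[2];
        const char *k;

        memcpy(hdr, buf + off, sizeof hdr);
        k = (const char *)(buf + off + sizeof hdr);
        if (strcmp(k, key) == 0) {
            *val_len = (size_t)hdr[1];
            return buf + off + sizeof hdr + align8((size_t)hdr[0]);
        }
        off += kv_packed_size((size_t)hdr[0], (size_t)hdr[1]);
    }
    return NULL;
}
```

Because every record starts and ends on an 8-byte boundary, the whole buffer could be handed to a single send, and a receiver can walk it without per-key allocations. A real implementation would presumably index the records with a hashtable, as the notes suggest, rather than scanning linearly.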
>
> -== Active Message TAG numbers ==-
>
> They have to move down too. The split into 32-bit groups makes the tags
> sparse. We have a layer separation break here, but we may not want to
> have all PML_OB1 tags appear down in OPAL. We'll put down the header
> file; we don't change it for now (we are not overcrowded, so it's OK for
> it to be sparse). George will rearrange so that there are more possible
> families (at the expense of the number of possible tags per family).
>
> -== Thread safety ==-
>
> Because BTLs are now being used by upper layers whose behavior we don't
> know, we have to assume that threads are on (by default),
> leading to a bad performance hit, due to using opal_list_t that are
> locked deep inside.
>
> What needs to be protected?
> * per-BTL locking: huge cost on everything
> * per-endpoint locking?
>
> That's related to enabling async progress, but that is a big chunk of
> work. We just want to keep that goal in mind so that we don't make it
> worse than it already is.
>
> Do we want the ability to turn thread safety off/on at runtime?
> * lock, unlock, trylock:
>   * accessors that are always safe
>   * accessors that can be turned unsafe at runtime (only for OMPI level)
> * swap, cmpswap, subtract, add (32, 64-bit atomics): no change (we already
>   have both), but the CAPS move to OMPI
> * signal and condition variables
>   * When and how do we call progress when needed if we remove the
>     UPPER_CASE that calls it?
>   * appears in wait, test, free_list
> * We'll try to remove as many as we can (upper case locks) and see
>   where we are in 4-5 months from now.
>
> ===== OTHER issues found while talking =====
>
> * DPM is slow, it needs a speedup.
>
> -== Error reporting / printf ==-
> Replace all orte_show_help with opal_show_help. Make sure that the
> symbol is not exposed anymore outside ORTE to force the update.
> * orte_show_help: deduplication happens in ORTE; ompi_show_help calls
>   back into orte_show_help. Let's get rid of it completely.
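
For the thread safety item above (accessors that are always safe versus accessors that can be turned unsafe at runtime), here is a minimal sketch of the two flavors side by side. It is an illustration under my own assumptions only: all sketch_* names are hypothetical and are not the OPAL mutex API, and a C11 atomic_flag spinlock stands in for whatever mutex type the real code uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock type: a spinlock built on a C11 atomic_flag. */
typedef struct {
    atomic_flag held;
} sketch_lock_t;

/* Runtime switch. Assumed to be set once during init, before any
 * other thread exists, so a plain bool is enough for this sketch. */
static bool sketch_threads_enabled = true;

void sketch_set_threads_enabled(bool on)
{
    sketch_threads_enabled = on;
}

void sketch_lock_init(sketch_lock_t *l)
{
    assert(l != NULL);
    atomic_flag_clear(&l->held);
}

/* Always-safe accessors (the "lowercase" flavor): they always lock. */
bool sketch_trylock(sketch_lock_t *l)
{
    /* test_and_set returns the previous value: false means we got it. */
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

void sketch_lock(sketch_lock_t *l)
{
    while (!sketch_trylock(l)) {
        /* spin until the holder releases the lock */
    }
}

void sketch_unlock(sketch_lock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Switchable accessors (the "UPPER_CASE" flavor, kept at the OMPI
 * level): they become no-ops when thread safety is turned off. */
void sketch_lock_if_threaded(sketch_lock_t *l)
{
    if (sketch_threads_enabled) {
        sketch_lock(l);
    }
}

void sketch_unlock_if_threaded(sketch_lock_t *l)
{
    if (sketch_threads_enabled) {
        sketch_unlock(l);
    }
}
```

The point of the split is that code moved down to OPAL would call only the unconditional versions, while the OMPI layer keeps the option of skipping the lock entirely when it knows it is single-threaded. The cost of that unconditional path is what the "look at perf" action item is about.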
--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/