One question: there is a mention of a new PML that is essentially CM+matching.  Why is this not just another instance of CM?
Rich


On 11/26/07 7:54 PM, "Jeff Squyres" <jsquy...@cisco.com> wrote:

> OMPI OF Pow Wow Notes
> 26 Nov 2007
>
> ---------------------------------------------------------------------------
>
> Discussion of current / upcoming work:
>
> OCG (Chelsio):
> - Did a bunch of udapl work, but abandoned it.  Will commit it to a
>   tmp branch if anyone cares (likely not).
> - They have been directed to move to the verbs API; will be starting
>   on that next week.
>
> Cisco:
> - Likely to get more involved in PML/BTL issues since Galen + Brian
>   are now transferring out of these areas.
> - Race between Chelsio / Cisco as to who implements the RDMA CM
>   connect PC first (more on this below).  May involve some changes to
>   the connect PC interface.
> - Re-working libevent and progress engine stuff with George.
>
> LLNL:
> - Andrew Friedley is leaving LLNL in 3 weeks.
> - UD code is more or less functional, but working on reliability
>   stuff down in the BTL.  That part is not quite working yet.
> - When he leaves LLNL, the UD BTL may become unmaintained.
>
> IBM:
> - Has an interest in NUNAs.
> - May be interested in maintaining the UD BTL; worried about scale.
>
> Mellanox:
> - Just finished the first implementation of XRC.
> - Now working on QA issues with XRC: testing with multiple subnets,
>   different numbers of HCAs/ports on different hosts, etc.
>
> Sun:
> - Currently working full steam ahead on UDAPL.
> - Will likely join in on the openib BTL/etc. when Sun's verbs stack
>   is ready.
> - Will *hopefully* get early access to Sun's verbs stack for testing /
>   compatibility issues before the stack becomes final.
>
> ORNL:
> - Mostly working on non-PML/BTL issues these days.
> - Galen's advice: get the progress thread working for best IB overlap
>   and real application performance.
>
> Voltaire:
> - Working on XRC improvements.
> - Working on message coalescing.  Only sees benefit if you
>   drastically reduce the number of available buffers and credits --
>   i.e., be much more like the openib BTL before BSRQ (2 buffer sizes,
>   large and small, and very few small buffer credits).
>   <lots of discussion about message coalescing; this will be worth at
>   least an FAQ item to explain all the trade-offs.  There could be
>   non-artificial benefits for coalescing at scale because of limiting
>   the number of credits>
> - Moving HCA initialization stuff to a common area to share it with
>   collective components.
>
> ---------------------------------------------------------------------------
>
> Discussion of various "moving forward" proposals:
>
> - ORNL, Cisco, and Mellanox are discussing how to reduce the cost of
>   memory registration.  Currently running some benchmarks to figure
>   out where the bottlenecks are.  Cheap registration would *help*
>   (but not completely solve) overlap issues by reducing the number of
>   sync points -- e.g., always just do one big RDMA operation (vs. the
>   pipeline protocol).
>
> - Some discussion of a UD-based connect PC.  Gleb mentions that what
>   was proposed is effectively the same as the IBTA CM (i.e., ibcm).
>   Jeff will go investigate.
>
> - Gleb also mentions that the connect PC needs to be based on the
>   endpoint, not the entire module (for non-uniform hardware
>   networks).  Jeff took a to-do item to fix this.  Probably needs to
>   be done for v1.3.
>
> - To UD or not to UD?  Lots of discussion on this.
>
>   - Some data has been presented by OSU showing that UD drops don't
>     happen often.  Their tests were run in a large non-blocking
>     network.
>     Some in the group feel that in a busy blocking network, UD drops
>     will be [much] more common (there is at least some evidence that
>     in a busy non-blocking network, drops *are* rare).  This issue
>     affects how we design the recovery of UD drops: if drops are
>     rare, recovery can be arbitrarily expensive.  If drops are not
>     rare, recovery needs to be at least somewhat efficient.  If drops
>     are frequent, recovery needs to be cheap/fast/easy.
>
>   - Mellanox is investigating why ibv_rc_pingpong gives better
>     bandwidth than ibv_ud_pingpong (i.e., UD bandwidth is poor).
>
>   - Discussed the possibility of doing connection caching: only allow
>     so many RC connections at a time.  Maintain an LRU of RC
>     connections.  When you need to close one, also recycle (or free)
>     all of its posted buffers.
>
>   - Discussion of the MVAPICH technique for large UD messages:
>     "[receiver] zero copy UD".  Send a match header; the receiver
>     picks a UD QP from a ready pool and sends it back to the sender.
>     Fragments from the user's buffer are posted to that QP on the
>     receiver, so the sender sends straight into the receiver's target
>     buffer.  This scheme assumes no drops.  For OMPI, this scheme
>     also requires more complexity in our current multi-device
>     striping method: we'd want to stripe across large contiguous
>     portions of the message (vs. round-robining small fragments from
>     the message).
>
>   - One point specifically discussed: long message alltoall at scale
>     (i.e., large numbers of MPI processes).  Andrew Friedley is going
>     to ask around LLNL if anyone does this, but our guess is no,
>     because each host would need a *ton* of RAM to do this:
>     (num_procs_per_node * num_procs_total * length_of_buffer).  Our
>     suspicion is that alltoall for short messages is much more common
>     (and still, by far, not the most common MPI collective).
>
>   - One proposal:
>     - Use UD for short messages (except for peers that switch to
>       eager RDMA)
>     - Always use RC for long messages, potentially with connection
>       caching + fast IB connect (ibcm?)
>
>   - Another proposal: let OSU keep forging ahead with UD and see what
>     they come up with.  I.e., let them figure out whether UD is worth
>     it or not.
>
>   - End result: it's not 100% clear that UD is a "win" yet -- there
>     are many unanswered questions.
>
> - Make a new PML that is essentially "CM+matching": send entire
>   messages down to the lower layer instead of having the PML do the
>   fragmenting.
>
>   - Rationale:
>     - pretty simple PML
>     - allow the lower layer to do more optimizations based on full
>       knowledge of the specific network being used
>     - networks get CM-like benefits without having to "natively"
>       support shmem (because matching will still be done in the PML
>       and there will be a lower layer/component for shmem)
>     - [possibly] remove some stuff from the current ob1 code that is
>       not necessary for IB/OF (Gleb didn't think that this would be
>       useful; most of OB1 is there to support IB/OF)
>     - don't force other networks into the same model as IB/OF (i.e.,
>       when we want new things in IB/OF, we have to go change all the
>       other BTLs)
>       --> ^^ I forgot to mention this point on the call today
>     - if we go towards a combined RC+UD OF protocol, the current OB1
>       is not good at this because the BTL flags will have to "lie"
>       about whether a given endpoint is capable of RDMA or not.
>       --> Gleb mentioned that it doesn't matter what the PML thinks;
>           even if the PML tells the BTL to RDMA PUT/GET, the BTL can
>           emulate it if it isn't supported (e.g., if an endpoint
>           switches between RC and UD)
>
>   - Jeff sees this as a code re-org, not so much as a re-write.
>
>   - Gleb is skeptical of the value of this; it may be more valuable
>     if we go towards a blended UD+RC protocol, though.
>
> The phone bridge started kicking people off at this point (after we
> went 30+ minutes beyond the scheduled end time).  So no conclusions
> were reached.  This discussion probably needs to continue in e-mail,
> etc.
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
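For concreteness on the long-message alltoall estimate quoted above, a
rough worked example with illustrative numbers (these figures are not
from the call): with 8 processes per node, 4096 processes total, and
1 MiB exchanged per peer, each host would need roughly
8 * 4096 * 1 MiB = 32 GiB of buffer space just for the alltoall
payload, which is consistent with the guess that nobody runs
long-message alltoall at that scale.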