One question: there is a mention of a new PML that is essentially CM+matching.  Why is this not just another instance of CM?
Rich


On 11/26/07 7:54 PM, "Jeff Squyres" <jsquy...@cisco.com> wrote:

> OMPI OF Pow Wow Notes
> 26 Nov 2007
>
> ---------------------------------------------------------------------------
>
> Discussion of current / upcoming work:
>
> OCG (Chelsio):
> - Did a bunch of udapl work, but abandoned it.  Will commit it to a
>   tmp branch if anyone cares (likely not).
> - They have been directed to move to the verbs API; will be starting
>   on that next week.
>
> Cisco:
> - Likely to get more involved in PML/BTL issues since Galen + Brian
>   are now transferring out of these areas.
> - Race between Chelsio / Cisco as to who implements the RDMA CM
>   connect PC first (more on this below).  May involve some changes to
>   the connect PC interface.
> - Re-working libevent and progress engine stuff with George.
>
> LLNL:
> - Andrew Friedley is leaving LLNL in 3 weeks.
> - UD code is more or less functional, but working on reliability
>   stuff down in the BTL.  That part is not quite working yet.
> - When he leaves LLNL, the UD BTL may become unmaintained.
>
> IBM:
> - Has an interest in NUNAs.
> - May be interested in maintaining the UD BTL; worried about scale.
>
> Mellanox:
> - Just finished the first implementation of XRC.
> - Now working on QA issues with XRC: testing with multiple subnets,
>   different numbers of HCAs/ports on different hosts, etc.
>
> Sun:
> - Currently working full steam ahead on UDAPL.
> - Will likely join in on the openib BTL/etc. when Sun's verbs stack
>   is ready.
> - Will *hopefully* get early access to Sun's verbs stack for testing /
>   compatibility issues before the stack becomes final.
>
> ORNL:
> - Mostly working on non-PML/BTL issues these days.
> - Galen's advice: get the progress thread working for best IB overlap
>   and real application performance.
>
> Voltaire:
> - Working on XRC improvements.
> - Working on message coalescing.  Only sees benefit if you
>   drastically reduce the number of available buffers and credits --
>   i.e., be much more like the openib BTL before BSRQ (2 buffer sizes,
>   large and small, and very few small buffer credits).
>   <lots of discussion about message coalescing; this will be worth at
>   least an FAQ item to explain all the trade-offs.  There could be
>   non-artificial benefits for coalescing at scale because of limiting
>   the number of credits>
> - Moving HCA initialization stuff to a common area to share it with
>   collective components.
>
> ---------------------------------------------------------------------------
>
> Discussion of various "moving forward" proposals:
>
> - ORNL, Cisco, and Mellanox are discussing how to reduce the cost of
>   memory registration.  Currently running some benchmarks to figure
>   out where the bottlenecks are.  Cheap registration would *help*
>   (but not completely solve) overlap issues by reducing the number of
>   sync points -- e.g., always just do one big RDMA operation (vs. the
>   pipeline protocol).
>
> - Some discussion of a UD-based connect PC.  Gleb mentions that what
>   was proposed is effectively the same as the IBTA CM (i.e., ibcm).
>   Jeff will go investigate.
>
> - Gleb also mentions that the connect PC needs to be based on the
>   endpoint, not the entire module (for non-uniform hardware
>   networks).  Jeff took a to-do item to fix this.  Probably needs to
>   be done for v1.3.
>
> - To UD or not to UD?  Lots of discussion on this.
>
>   - Some data has been presented by OSU showing that UD drops don't
>     happen often.  Their tests were run in a large non-blocking
>     network.
>     Some in the group feel that in a busy blocking network, UD drops
>     will be [much] more common (there is at least some evidence that
>     in a busy non-blocking network, drops *are* rare).  This issue
>     affects how we design the recovery of UD drops: if drops are
>     rare, recovery can be arbitrarily expensive.  If drops are not
>     rare, recovery needs to be at least somewhat efficient.  If drops
>     are frequent, recovery needs to be cheap/fast/easy.
>
>   - Mellanox is investigating why ibv_rc_pingpong gives better
>     bandwidth than ibv_ud_pingpong (i.e., UD bandwidth is poor).
>
>   - Discussed the possibility of doing connection caching: only allow
>     so many RC connections at a time.  Maintain an LRU of RC
>     connections.  When you need to close one, also recycle (or free)
>     all of its posted buffers.
>
>   - Discussion of the MVAPICH technique for large UD messages:
>     "[receiver] zero copy UD".  Send a match header; the receiver
>     picks a UD QP from a ready pool and sends it back to the sender.
>     Fragments from the user's buffer are posted to that QP on the
>     receiver, so the sender sends straight into the receiver's target
>     buffer.  This scheme assumes no drops.  For OMPI, this scheme
>     also requires more complexity in our current multi-device
>     striping method: we'd want to stripe across large contiguous
>     portions of the message (vs. round-robining small fragments from
>     the message).
>
>   - One point specifically discussed: long message alltoall at scale
>     (i.e., large numbers of MPI processes).  Andrew Friedley is going
>     to ask around LLNL if anyone does this, but our guess is no,
>     because each host would need a *ton* of RAM to do this:
>     (num_procs_per_node * num_procs_total * length_of_buffer).  Our
>     suspicion is that alltoall for short messages is much more common
>     (and still, by far, not the most common MPI collective).
>
>   - One proposal:
>     - Use UD for short messages (except for peers that switch to
>       eager RDMA)
>     - Always use RC for long messages, potentially with connection
>       caching + fast IB connect (ibcm?)
>
>   - Another proposal: let OSU keep forging ahead with UD and see what
>     they come up with.  I.e., let them figure out whether UD is worth
>     it or not.
>
>   - End result: it's not 100% clear that UD is a "win" yet -- there
>     are many unanswered questions.
>
> - Make a new PML that is essentially "CM+matching": send entire
>   messages down to the lower layer instead of having the PML do the
>   fragmenting.
>
>   - Rationale:
>     - pretty simple PML
>     - allow the lower layer to do more optimizations based on full
>       knowledge of the specific network being used
>     - networks get CM-like benefits without having to "natively"
>       support shmem (because matching will still be done in the PML
>       and there will be a lower layer/component for shmem)
>     - [possibly] remove some stuff from the current ob1 code that is
>       not necessary for IB/OF (Gleb didn't think that this would be
>       useful; most of OB1 is there to support IB/OF)
>     - don't force other networks into the same model as IB/OF (i.e.,
>       when we want new things in IB/OF, we have to go change all the
>       other BTLs)
>       --> ^^ I forgot to mention this point on the call today
>     - if we go towards a combined RC+UD OF protocol, the current OB1
>       is not good at this because the BTL flags will have to "lie"
>       about whether a given endpoint is capable of RDMA or not.
>       --> Gleb mentioned that it doesn't matter what the PML thinks;
>           even if the PML tells the BTL to RDMA PUT/GET, the BTL can
>           emulate it if it isn't supported (e.g., if an endpoint
>           switches between RC and UD)
>
>   - Jeff sees this as a code re-org, not so much as a re-write.
>
>   - Gleb is skeptical of the value of this; it may be more valuable
>     if we go towards a blended UD+RC protocol, though.
>
> The phone bridge started kicking people off at this point (after we
> went 30+ minutes beyond the scheduled end time).  So no conclusions
> were reached.  This discussion probably needs to continue in e-mail,
> etc.
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
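For concreteness on the long-message alltoall estimate quoted above, a
rough worked example with illustrative numbers (these figures are not
from the call): with 8 processes per node, 4096 processes total, and
1 MiB exchanged per peer, each host would need roughly
8 * 4096 * 1 MiB = 32 GiB of buffer space just for the alltoall
payload, which is consistent with the guess that nobody runs
long-message alltoall at that scale.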