OMPI OF Pow Wow Notes
26 Nov 2007

---------------------------------------------------------------------------

Discussion of current / upcoming work:

OCG (Chelsio):
- Did a bunch of uDAPL work, but abandoned it.  Will commit it to a
  tmp branch if anyone cares (likely not).
- They have been directed to move to the verbs API; will be starting
  on that next week.

Cisco:
- Likely to get more involved in PML/BTL issues since Galen + Brian
  are now transferring out of these areas.
- Race between Chelsio / Cisco as to who implements the RDMA CM
  connect PC first (more on this below).  May involve some changes to
  the connect PC interface.
- Re-working libevent and progress engine stuff with George

LLNL:
- Andrew Friedley leaving LLNL in 3 weeks
- UD code is more or less functional, but working on reliability
  stuff down in the BTL.  That part is not quite working yet.
- When he leaves LLNL, UD BTL may become unmaintained.

IBM:
- Has an interest in NUNAs
- May be interested in maintaining the UD BTL; worried about scale.

Mellanox:
- Just finished first implementation of XRC
- Now working on QA issues with XRC: testing with multiple subnets,
  different numbers of HCAs/ports on different hosts, etc.

Sun:
- Currently working full steam ahead on uDAPL.
- Will likely join in openib BTL/etc. when Sun's verbs stack is ready.
- Will *hopefully* get early access to Sun's verbs stack for testing /
  compatibility issues before the stack becomes final.

ORNL:
- Mostly working on non-PML/BTL issues these days.
- Galen's advice: get progress thread working for best IB overlap and
  real application performance.

Voltaire:
- Working on XRC improvements
- Working on message coalescing.  Only sees a benefit if you
  drastically reduce the number of available buffers and credits --
  i.e., behave much more like the openib BTL before BSRQ (2 buffer
  sizes, large and small, with very few small-buffer credits).
  <lots of discussion about message coalescing; this will be worth at
  least an FAQ item to explain all the trade-offs.  There could be
  non-artificial benefits for coalescing at scale because of limiting
  the number of credits>
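The coalescing trade-off above can be sketched as follows.  This is a
hypothetical illustration, not OMPI code: when no small-buffer send
credits are available, messages queue up, and each returned credit
flushes as many queued messages as fit into a single buffer.
`CoalescingSender`, `BUF_SIZE`, and all sizes are invented for
illustration.

```python
BUF_SIZE = 256  # one "small" buffer, in bytes (assumed size)

class CoalescingSender:
    def __init__(self, credits):
        self.credits = credits      # available small-buffer send credits
        self.pending = []           # messages waiting for a credit
        self.wire = []              # buffers actually "sent"

    def send(self, msg):
        self.pending.append(msg)
        self._flush()

    def return_credit(self):
        # peer returned a credit; try to drain the queue
        self.credits += 1
        self._flush()

    def _flush(self):
        while self.credits > 0 and self.pending:
            buf, used = [], 0
            # pack as many queued messages as fit into one buffer
            while self.pending and used + len(self.pending[0]) <= BUF_SIZE:
                m = self.pending.pop(0)
                buf.append(m)
                used += len(m)
            self.wire.append(buf)
            self.credits -= 1
```

With many credits, every small message goes out alone and coalescing
never triggers -- which matches the observation that the benefit only
appears when the credit count is drastically reduced.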
- Moving HCA initialization code to a common area so that it can be
  shared with collective components.

---------------------------------------------------------------------------

Discussion of various "moving forward" proposals:

- ORNL, Cisco, Mellanox discussing how to reduce cost of memory
  registration.  Currently running some benchmarks to figure out where
  the bottlenecks are.  Cheap registration would *help* (but not
  completely solve) overlap issues by reducing the number of sync
  points -- e.g., always just do one big RDMA operation (vs. the
  pipeline protocol).
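To make the sync-point argument concrete, here is a back-of-the-
envelope sketch (assumed chunk and message sizes, not measured OMPI
behavior): the pipeline protocol pays roughly one registration /
completion sync point per chunk, while one big RDMA over a cheaply
registered buffer pays one in total.

```python
import math

CHUNK = 1 << 20  # assumed 1 MiB pipeline chunk size

def pipeline_sync_points(msg_len, chunk=CHUNK):
    # roughly one registration + one RDMA completion per chunk
    return math.ceil(msg_len / chunk)

def single_rdma_sync_points(msg_len):
    # register the whole buffer once, do one RDMA, one completion
    return 1

msg = 64 << 20  # a 64 MiB message
print(pipeline_sync_points(msg))     # 64
print(single_rdma_sync_points(msg))  # 1
```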

- Some discussion of a UD-based connect PC.  Gleb mentions that what
  was proposed is effectively the same as the IBTA CM (i.e., ibcm).
  Jeff will go investigate.

- Gleb also mentions that the connect PC needs to be based on the
  endpoint, not the entire module (for non-uniform hardware
  networks).  Jeff took a to-do item to fix.  Probably needs to be
  done for v1.3.

- To UD or not to UD?  Lots of discussion on this.

  - Some data has been presented by OSU showing that UD drops don't
    happen often.  Their tests were run in a large non-blocking
    network.  Some in the group feel that in a busy blocking network,
    UD drops will be [much] more common (there is at least some
    evidence that in a busy non-blocking network, drops *are* rare).
    This issue affects how we design the recovery of UD drops: if
    drops are rare, recovery can be arbitrarily expensive.  If drops
    are not rare, recovery needs to be at least somewhat efficient.
    If drops are frequent, recovery needs to be cheap/fast/easy.

  - Mellanox is investigating why ibv_rc_pingpong gives better
    bandwidth than ibv_ud_pingpong (i.e., UD bandwidth is poor).

  - Discussed the possibility of connection caching: allow only so
    many RC connections at a time.  Maintain an LRU of RC connections.
    When you need to close one, also recycle (or free) all of its
    posted buffers.
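    The caching idea above amounts to an LRU map with eviction.  A
    hypothetical sketch (all names and the buffer recycling are
    invented for illustration; a real implementation would tear down
    the RC QP and repost the recycled buffers):

```python
from collections import OrderedDict

class ConnCache:
    def __init__(self, max_conn):
        self.max_conn = max_conn
        self.conns = OrderedDict()   # peer -> list of posted buffers
        self.closed = []             # peers we had to evict

    def get(self, peer):
        if peer in self.conns:
            self.conns.move_to_end(peer)       # mark most recently used
            return self.conns[peer]
        if len(self.conns) >= self.max_conn:
            # close the least-recently-used connection and
            # recycle (here: free) all of its posted buffers
            lru_peer, bufs = self.conns.popitem(last=False)
            bufs.clear()
            self.closed.append(lru_peer)
        self.conns[peer] = [bytearray(256) for _ in range(4)]
        return self.conns[peer]
```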

  - Discussion of MVAPICH technique for large UD messages: "[receiver]
    zero copy UD".  Send a match header; receiver picks a UD QP from a
    ready pool and sends it back to the sender.  Fragments from the
    user's buffer are posted to that QP on the receiver, so the sender
    sends straight into the receiver's target buffer.  This scheme
    assumes no drops.  For OMPI, this scheme also requires more
    complexity from our current multi-device striping method: we'd
    want to stripe across large contiguous portions of the message
    (vs. round robining small fragments from the message).
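    A hedged sketch of that handshake (protocol shape only -- no real
    verbs calls, no drop handling, and all names and QP numbers are
    invented for illustration):

```python
class Receiver:
    def __init__(self):
        self.ready_qps = [10, 11, 12]   # pool of pre-created UD QP numbers
        self.posted = {}                # qp -> user buffer receiving frags

    def on_match_header(self, msg_len):
        qp = self.ready_qps.pop()             # pick a QP from the ready pool
        self.posted[qp] = bytearray(msg_len)  # "post" the user's buffer
        return qp                             # tell the sender which QP

    def on_fragment(self, qp, offset, data):
        # fragments land straight in the target buffer: zero copy
        self.posted[qp][offset:offset + len(data)] = data

class Sender:
    def send(self, recv, payload, frag=4):
        qp = recv.on_match_header(len(payload))   # 1. match header + reply
        for off in range(0, len(payload), frag):  # 2. stream fragments
            recv.on_fragment(qp, off, payload[off:off + frag])
```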

  - One point specifically discussed: long message alltoall at scale
    (i.e., large numbers of MPI processes).  Andrew Friedley is going
    to ask around LLNL if anyone does this, but our guess is no
    because each host would need a *ton* of RAM to do this:
    (num_procs_per_node * num_procs_total * length_of_buffer).  Our
    suspicion is that alltoall for short messages is much more common
    (and still, by far, not the most common MPI collective).
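    Plugging assumed (illustrative) 2007-era numbers into that formula
    shows why the guess is "no":

```python
# Worked example of the per-host RAM estimate above:
#   num_procs_per_node * num_procs_total * length_of_buffer
# All three figures are assumptions chosen for illustration.
num_procs_per_node = 8
num_procs_total = 4096          # a moderately large cluster
length_of_buffer = 1 << 20      # 1 MiB per-peer "long" message

bytes_per_host = num_procs_per_node * num_procs_total * length_of_buffer
print(bytes_per_host // (1 << 30), "GiB per host")  # 32 GiB per host
```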

  - One proposal:
    - Use UD for short messages (except for peers that switch to eager
      RDMA)
    - Always use RC for long messages, potentially with connection
      caching+fast IB connect (ibcm?)

  - Another proposal: let OSU keep forging ahead with UD and see what
    they come up with.  I.e., let them figure out if UD is worth it or
    not.

  - End result: it's not 100% clear that UD is a "win" yet -- there
    are many unanswered questions.

- Make a new PML that is essentially "CM+matching": send entire
  messages down to the lower layer instead of having the PML do the
  fragmenting:

  - Rationale:
    - pretty simple PML
    - allow lower layer to do more optimizations based on full
      knowledge of the specific network being used
    - networks get CM-like benefits without having to "natively"
      support shmem (because matching will still be done in the PML
      and there will be a lower layer/component for shmem)
    - [possibly] remove some stuff from current code in ob1 that is
      not necessary in IB/OF (Gleb didn't think that this would be
      useful; most of OB1 is there to support IB/OF)
    - not force other networks into the same model as IB/OF (i.e.,
      when we want new things in IB/OF, we have to go change all the
      other BTLs)
      --> ^^ I forgot to mention this point on the call today
    - if we go towards a combined RC+UD OF protocol, the current OB1
      is not good at this because the BTL flags will have to "lie"
      about whether a given endpoint is capable of RDMA or not.
      --> Gleb mentioned that it doesn't matter what the PML thinks;
          even if the PML tells the BTL to RDMA PUT/GET, the BTL can
          emulate it if it isn't supported (e.g., if an endpoint
          switches between RC and UD)

  - Jeff sees this as a code re-org, not so much as a re-write.

  - Gleb is skeptical on the value of this; it may be more valuable if
    we go towards a blended UD+RC protocol, though.

The phone bridge started kicking people off at this point (after we
went 30+ minutes beyond the scheduled end time).  So no conclusions
were reached.  This discussion probably needs to continue in e-mail,
etc.

--
Jeff Squyres
Cisco Systems
