P,128,256,128,16:S,1024,256,128,32:S,4096,256,128,32:S,65536,256,128,32
With these numbers, a fully-OFA-connected 512-process MPI_COMM_WORLD will consume a little over 65MB of buffering space per MPI process.
The 128-byte buffers are small enough that we can afford many more of them, but we want the repost value to be high enough (128) to allow at least "one wire full" of messages (plus some extra to account for credit processing, etc.), and we want aggressive credit ACKing (16) so that a lazy receiver is less likely to stall the sender (see the full explanation of these parameters below).
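For reference, here is one way to pass this string explicitly, using the standard mpirun --mca mechanism (just a sketch; the executable name is a placeholder, and -np 512 simply matches the example above):

  mpirun -np 512 \
      --mca btl_openib_receive_queues \
      "P,128,256,128,16:S,1024,256,128,32:S,4096,256,128,32:S,65536,256,128,32" \
      ./my_mpi_app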
For hardware that does not support SRQ (eHCA v1, iWARP, ...?), we propose the following values:
P,128,256,128,16:P,1024,32,16:P,4096,32,16:P,65536,32,16

For a fully-connected 512-process MPI_COMM_WORLD, this will consume ~385MB of buffering per MPI process. Of course, if you take off the 64k QP and simply make "long" messages shorter, you're down to 128MB per process, which is potentially a bit more manageable (note that we currently do not have a way to automatically change PML values based on the hardware found in the host). These values should probably be discussed in detail by the vendors who do not support SRQ so they can decide what they want.
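To make the "take off the 64k QP" variant concrete, the shortened string would simply drop the last queue (a sketch; whether 4096 remains the right cutoff for "long" messages is up to the vendors mentioned above):

  P,128,256,128,16:P,1024,32,16:P,4096,32,16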
It's still an open question how we should determine which of these two strings to use; it's likely to be something like this:
1. If the user specifies a string (via the MCA param), use it (see the example commands below).
2. Probe the HW at run time; if the hardware supports SRQ, use the SRQ string (eHCA v1 and v2 support the attribute -- I don't know if iWARP cards do...?).
3. Otherwise, use the non-SRQ string.

I've attached a spreadsheet for those who are interested in helping explore different parameter value sets. The top block is the 1 per-peer QP + 3 SRQ case; the bottom block is the 4 per-peer QP case. Feel free to modify it as you want; enjoy.
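Regarding step 1 above: the standard MCA-parameter mechanisms apply. The commands below are just a sketch (the ompi_info invocation assumes an Open MPI build with the openib BTL, and the environment-variable form uses the usual OMPI_MCA_<param> convention):

  # inspect the openib BTL parameters, including btl_openib_receive_queues
  ompi_info --param btl openib

  # override via the environment instead of the mpirun command line
  export OMPI_MCA_btl_openib_receive_queues="P,128,256,128,16:S,1024,256,128,32:S,4096,256,128,32:S,65536,256,128,32"
  mpirun -np 512 ./my_mpi_app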
To explain all these numbers, here's a first cut at a writeup of what they mean (think of this as preliminary FAQ fodder -- it's likely to be modified a bit before it hits the FAQ). Remember that this is OMPI trunk only (i.e., the 1.3 series -- not the 1.2 series).
=================================

btl_openib_receive_queues allows the specification of multiple receive queues for OpenFabrics networks. Each queue is designated by its type followed by a series of numeric parameters. Queues can be one of two types:

- Per-peer (P), meaning that each queue is dedicated to receiving messages from a single, specific peer MPI process. Buffers to receive incoming messages from the peer are guaranteed through explicit flow control by Open MPI (i.e., OpenFabrics network-level retransmissions due to "receiver not ready" (RNR) errors will never occur).

- Shared receive queue (S), meaning that a receive queue is shared among all sending MPI processes. Buffers to receive incoming messages from all peers are not necessarily guaranteed because no flow control is possible if fewer than (num_peers * num_buffers_each) buffers are available in the shared receive queue (which is typically a goal of using SRQ: providing fewer than N*M buffers). Shared receive queues can be faster than per-peer queues because of the lack of explicit flow control traffic, but OpenFabrics network-level retransmission errors can occur if multiple senders combine to overflow the shared receive queue's available receive buffers.

Per-peer queues are specified in the following form:

  P,<size>,<num_buffers>[,<low_watermark>[,<window_size>[,<reserved>]]]
- <size>: The size of the receive buffers to be posted to this queue (in bytes).

- <num_buffers>: The maximum number of buffers to post to this queue for incoming MPI message fragments.

- <low_watermark>: An optional parameter specifying the number of available buffers left on the queue before Open MPI will re-post buffers up to <num_buffers>. Note that as a latency reduction mechanism, Open MPI does not re-post a receive buffer as soon as it becomes available (because it is expensive to do so). Instead, Open MPI waits until several receive buffers become available and then posts them all at once. If not specified, <low_watermark> defaults to <num_buffers>/2.

- <window_size>: An optional parameter specifying the number of ACKs to accumulate before sending an explicit ACK control message back to a peer. ACKs are typically piggybacked on outgoing messages to a peer; they are grouped into explicit control messages only when there are no other outgoing messages to that peer. If not specified, <window_size> defaults to <low_watermark>/2.

- <reserved>: An optional parameter specifying the number of receive buffers to post to the queue that are specifically used for incoming ACK control messages (vs. incoming MPI messages). If not specified, <reserved> defaults to ((<num_buffers>*2)-1)/<window_size>. Note that control messages use their own flow control (separate from the flow control for MPI message fragments); explicit control messages are always ACKed via piggyback data on other messages to ensure that control messages will not trigger RNR errors.

For example:

  P,128,16,4

specifies a per-peer receive queue that initially posts 16 buffers, each of size 128 bytes. When there are 4 buffers left on the receive queue, Open MPI will re-post 12 buffers to the queue, restoring it to a total of 16 buffers available for incoming messages. Explicit ACK control messages will be sent back for every 2 incoming messages (if not already piggybacked on other outgoing messages), and 15 buffers are reserved for ACK control messages.

Shared queues are specified in the following form:

  S,<size>,<num_buffers>[,<low_watermark>[,<max_pending_sends>]]

- <size>: Same as for per-peer queues.

- <num_buffers>: Same as for per-peer queues.

- <low_watermark>: Same as for per-peer queues.

- <max_pending_sends>: An optional parameter that specifies the number of outstanding sends that are allowed on the queue at any given time. This provides a "good enough" flow control mechanism for some regular communication patterns. If not specified, <max_pending_sends> defaults to <low_watermark>/4.

For example:

  S,1024,256,128,32

specifies a shared receive queue that posts 256 buffers, each of size 1024 bytes. When there are 128 buffers left on the receive queue, Open MPI will re-post 128 buffers to the queue, restoring it to a total of 256 buffers available for incoming messages. A maximum of 32 not-yet-locally-completed sends are allowed to be pending to a peer at any given time.

Note that queues MUST be specified in ascending receive buffer size order. This requirement may be removed prior to the 1.3 release.

--
Jeff Squyres
Cisco Systems
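Putting it together: multiple queue specifications are joined with colons into a single btl_openib_receive_queues value (passed via mpirun --mca as shown earlier). As a sketch, combining the two example queues above (and respecting the ascending-size rule):

  P,128,16,4:S,1024,256,128,32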
[Attachment: openib-buffering.xls]