P,128,256,128,16:S,1024,256,128,32:S,4096,256,128,32:S,65536,256,128,32
With these numbers, a fully-OFA-connected 512-process MPI_COMM_WORLD will consume a little over 65MB of buffering space per MPI process.
The 128-byte buffers are small enough that we can afford many more of them, but we want the repost value to be high enough (128) to allow at least "one wire full" of messages (plus some extra to account for credit processing, etc.), and we want aggressive credit ACKing (16) so that a lazy receiver is less likely to stall the sender (see the full explanation of these parameters below).
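For reference, here is one way to pass this string explicitly, using the standard mpirun --mca mechanism (just a sketch; the executable name is a placeholder, and -np 512 simply matches the example above):

  mpirun -np 512 \
      --mca btl_openib_receive_queues \
      "P,128,256,128,16:S,1024,256,128,32:S,4096,256,128,32:S,65536,256,128,32" \
      ./my_mpi_app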
For hardware that does not support SRQ (eHCA v1, iWARP, ...?), we propose the following values:
P,128,256,128,16:P,1024,32,16:P,4096,32,16:P,65536,32,16

For a fully-connected 512-process MPI_COMM_WORLD, this will consume ~385MB of buffering per MPI process. Of course, if you take off the 64k QP and simply make "long" messages shorter, you're down to 128MB per process, which is potentially a bit more manageable (note that we currently do not have a way to automatically change PML values based on the hardware found in the host). These values should probably be discussed in detail by the vendors who do not support SRQ so they can decide what they want.
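To make the "take off the 64k QP" variant concrete, the shortened string would simply drop the last queue (a sketch; whether 4096 remains the right cutoff for "long" messages is up to the vendors mentioned above):

  P,128,256,128,16:P,1024,32,16:P,4096,32,16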
It's still an open question how we should determine which of these two strings to use; it's likely to be something like this:
1. If the user specifies a string (via the MCA param), use it (see the example commands below).
2. Probe the HW at run time; if the hardware supports SRQ, use the SRQ string (eHCA v1 and v2 support the attribute -- I don't know if iWARP cards do...?).
3. Otherwise, use the non-SRQ string.

I've attached a spreadsheet for those who are interested in helping explore different parameter value sets. The top block is the 1 per-peer QP + 3 SRQ case; the bottom block is the 4 per-peer QP case. Feel free to modify it as you want; enjoy.
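Regarding step 1 above: the standard MCA-parameter mechanisms apply. The commands below are just a sketch (the ompi_info invocation assumes an Open MPI build with the openib BTL, and the environment-variable form uses the usual OMPI_MCA_<param> convention):

  # inspect the openib BTL parameters, including btl_openib_receive_queues
  ompi_info --param btl openib

  # override via the environment instead of the mpirun command line
  export OMPI_MCA_btl_openib_receive_queues="P,128,256,128,16:S,1024,256,128,32:S,4096,256,128,32:S,65536,256,128,32"
  mpirun -np 512 ./my_mpi_app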
To explain all these numbers, here's a first cut at a writeup of what they mean (think of this as preliminary FAQ fodder -- it's likely to be modified a bit before it hits the FAQ). Remember that this is OMPI trunk only (i.e., the 1.3 series -- not the 1.2 series).
=================================

btl_openib_receive_queues allows the specification of multiple receive queues for OpenFabrics networks. Each queue is designated by its type followed by a series of numeric parameters. Queues can be one of two types:

- Per-peer (P), meaning that each queue is dedicated to receiving messages from a single, specific peer MPI process. Buffers to receive incoming messages from the peer are guaranteed through explicit flow control by Open MPI (i.e., OpenFabrics network-level retransmissions due to "receiver not ready" (RNR) errors will never occur).

- Shared receive queue (S), meaning that a receive queue is shared among all sending MPI processes. Buffers to receive incoming messages from all peers are not necessarily guaranteed because no flow control is possible if fewer than (num_peers * num_buffers_each) buffers are available in the shared receive queue (which is typically a goal of using SRQ: providing fewer than N*M buffers). Shared receive queues can be faster than per-peer queues because of the lack of explicit flow control traffic, but OpenFabrics network-level retransmission errors can occur if multiple senders combine to overflow the shared receive queue's available receive buffers.

Per-peer queues are specified in the following form:

  P,<size>,<num_buffers>[,<low_watermark>[,<window_size>[,<reserved>]]]
- <size>: The size of the receive buffers to be posted to this queue (in bytes).

- <num_buffers>: The maximum number of buffers to post to this queue for incoming MPI message fragments.

- <low_watermark>: An optional parameter specifying the number of available buffers left on the queue before Open MPI will re-post buffers up to <num_buffers>. Note that as a latency reduction mechanism, Open MPI does not re-post a receive buffer as soon as it becomes available (because it is expensive to do so). Instead, Open MPI waits until several receive buffers become available and then posts them all at once. If not specified, <low_watermark> defaults to <num_buffers>/2.

- <window_size>: An optional parameter specifying the number of ACKs to accumulate before sending an explicit ACK control message back to a peer. ACKs are typically piggybacked on outgoing messages to a peer; they are grouped into explicit control messages only when there are no other outgoing messages to that peer. If not specified, <window_size> defaults to <low_watermark>/2.

- <reserved>: An optional parameter specifying the number of receive buffers to post to the queue that are specifically used for incoming ACK control messages (vs. incoming MPI messages). If not specified, <reserved> defaults to ((<num_buffers>*2)-1)/<window_size>. Note that control messages use their own flow control (separate from the flow control for MPI message fragments); explicit control messages are always ACKed via piggyback data on other messages to ensure that control messages will not trigger RNR errors.

For example:

  P,128,16,4

specifies a per-peer receive queue that initially posts 16 buffers, each of size 128 bytes. When there are 4 buffers left on the receive queue, Open MPI will re-post 12 buffers to the queue, restoring it to a total of 16 buffers available for incoming messages. Explicit ACK control messages will be sent back for every 2 incoming messages (if not already piggybacked on other outgoing messages), and 15 buffers are reserved for ACK control messages.

Shared queues are specified in the following form:

  S,<size>,<num_buffers>[,<low_watermark>[,<max_pending_sends>]]

- <size>: Same as for per-peer queues.

- <num_buffers>: Same as for per-peer queues.

- <low_watermark>: Same as for per-peer queues.

- <max_pending_sends>: An optional parameter that specifies the number of outstanding sends that are allowed on the queue at any given time. This provides a "good enough" flow control mechanism for some regular communication patterns. If not specified, <max_pending_sends> defaults to <low_watermark>/4.

For example:

  S,1024,256,128,32

specifies a shared receive queue that posts 256 buffers, each of size 1024 bytes. When there are 128 buffers left on the receive queue, Open MPI will re-post 128 buffers to the queue, restoring it to a total of 256 buffers available for incoming messages. A maximum of 32 not-yet-locally-completed sends are allowed to be pending to a peer at any given time.

Note that queues MUST be specified in ascending receive buffer size order. This requirement may be removed prior to the 1.3 release.

--
Jeff Squyres
Cisco Systems
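Putting it together: multiple queue specifications are joined with colons into a single btl_openib_receive_queues value (passed via mpirun --mca as shown earlier). As a sketch, combining the two example queues above (and respecting the ascending-size rule):

  P,128,16,4:S,1024,256,128,32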
[Attachment: openib-buffering.xls]