Hi Gleb,
Gleb Natapov wrote:
> In the case of TCP, the kernel is kind enough to progress messages for
> you, but only if there is enough space in the kernel's internal buffers.
> If there is no space there, the TCP BTL will also buffer messages in
> userspace and will, eventually, have the same problem.
Occasional buffering to hide a flow-control issue is fine, assuming that
there is a mechanism to flush the buffer (below). However, you cannot
buffer everything, and it is just as fine to expose the back pressure
when the buffer space is exhausted, to show the application that there
is a sustained problem. In this case, it is reasonable to block the
application (i.e. the MPI request) while you cannot buffer the outgoing data.
The real problem is the progression of already buffered outgoing data,
not the buffering itself.
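
To make the distinction concrete, here is a rough sketch in C of a
bounded send queue that buffers while it can, reports back pressure once
it is full, and is drained by a separate flush call. Everything in it is
made up for illustration: fake_link, send_queue, sendq_send, sendq_flush
and the two size constants are hypothetical names, not Open MPI or MX
code, and the "link" is just a credit counter standing in for a real
sending resource.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SENDQ_SLOTS 4   /* hypothetical bound on buffered fragments */
#define FRAG_BYTES  64  /* hypothetical maximum fragment size */

typedef enum { SEND_DONE, SEND_BUFFERED, SEND_WOULD_BLOCK } send_status;

/* Stand-in for the wire: accepts a fragment only while it has credits. */
typedef struct {
    int credits;    /* fragments the link can accept right now */
    int delivered;  /* fragments handed to the link so far */
} fake_link;

typedef struct {
    char       data[SENDQ_SLOTS][FRAG_BYTES];
    size_t     len[SENDQ_SLOTS];
    int        head;   /* index of the oldest buffered fragment */
    int        count;  /* number of buffered fragments */
    fake_link *link;
} send_queue;

/* Returns 1 if the link took the fragment, 0 if it has no credit left. */
static int link_try_send(fake_link *l, const void *buf, size_t len)
{
    (void)buf;
    (void)len;
    if (l->credits == 0)
        return 0;
    l->credits--;
    l->delivered++;
    return 1;
}

void sendq_init(send_queue *q, fake_link *l)
{
    memset(q, 0, sizeof *q);
    q->link = l;
}

/* The flush mechanism: meant to be called by the transport itself when
 * the sending resource becomes available again. Returns the number of
 * buffered fragments pushed to the link. */
int sendq_flush(send_queue *q)
{
    int drained = 0;
    while (q->count > 0 &&
           link_try_send(q->link, q->data[q->head], q->len[q->head])) {
        q->head = (q->head + 1) % SENDQ_SLOTS;
        q->count--;
        drained++;
    }
    return drained;
}

send_status sendq_send(send_queue *q, const void *buf, size_t len)
{
    assert(len <= FRAG_BYTES);

    /* Go straight to the link only if nothing is queued, to keep order. */
    if (q->count == 0 && link_try_send(q->link, buf, len))
        return SEND_DONE;

    /* Buffer space exhausted: expose the back pressure to the caller. */
    if (q->count == SENDQ_SLOTS)
        return SEND_WOULD_BLOCK;

    /* Otherwise hide the short-term stall by buffering the fragment. */
    int tail = (q->head + q->count) % SENDQ_SLOTS;
    memcpy(q->data[tail], buf, len);
    q->len[tail] = len;
    q->count++;
    return SEND_BUFFERED;
}
```

In this sketch the caller only ever sees SEND_WOULD_BLOCK when the
stall is sustained, and nothing above the queue has to know that
sendq_flush exists. Who calls the flush, and when, is exactly the
open question.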
Here, the proposal is to allow the BTL to buffer, but to require the PML
to handle progress. That's broken, IMHO.
> To progress such outstanding messages, an additional thread is needed
> in userspace. Is this what MX does?
MX uses a user-level thread, but it's mainly for progressing the
higher-level protocol on the receive side. On the send side, for the
low-level protocol, it is easier to ask your driver either to wake you
up when the sending resource is available again (blocking on a CQ for
IB) or to take care of the sending itself.
<usual rant>
My overall problem with this proposal is that it is a race to the bottom,
based on the least capable BTL, functionality-wise. The PML already
imposes pipelining for large messages (with a few knobs, but still) when
most protocols in other BTLs already have their own. Now it's
flow-control progression (not MPI progression).
Can each BTL implement what is needed for its particular back-end instead
of bloating the upper layer?
</usual rant>
Patrick