Hi Gleb,

Gleb Natapov wrote:
> In the case of TCP, the kernel is kind enough to progress the message for
> you, but only if there was enough space in the kernel's internal buffers.
> If there was no space there, the TCP BTL will also buffer messages in
> userspace and will, eventually, have the same problem.

Occasional buffering to hide a flow-control issue is fine, assuming that there is a mechanism to flush the buffer (see below). However, you cannot buffer everything, and it is just as fine to expose the back pressure once the buffer space is exhausted, to show the application that there is a sustained problem. In that case, it is reasonable to block the application (i.e. the MPI request) for as long as you cannot buffer the outgoing data.

The real problem is the progression of outgoing data that has already been buffered, not the buffering itself.
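As a rough illustration of the distinction between buffering and progression, here is a minimal C sketch of a bounded send queue. It is not Open MPI or MX code: `send_queue_t`, `queue_push`, `queue_flush`, `QUEUE_SLOTS`, `MSG_MAX` and the credit-based `credit_try_send` stub are all hypothetical names chosen for the example. A full queue is reported to the caller as back pressure, and the flush routine is what something has to call when the sending resource comes back.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define QUEUE_SLOTS 4   /* hypothetical buffering budget */
#define MSG_MAX     64  /* hypothetical maximum message size */

typedef struct {
    char   data[QUEUE_SLOTS][MSG_MAX];
    size_t len[QUEUE_SLOTS];
    size_t head;   /* index of the oldest buffered message */
    size_t count;  /* number of buffered messages */
} send_queue_t;

/* Hypothetical transport hook: returns 1 if the message was accepted,
 * 0 if the sending resource is currently exhausted. */
typedef int (*try_send_fn)(const void *buf, size_t len, void *ctx);

void queue_init(send_queue_t *q)
{
    memset(q, 0, sizeof(*q));
}

/* Buffer one message. Returns 0 on success, or -1 when the queue is full,
 * in which case the caller has to expose the back pressure (for example
 * by leaving the request pending). */
int queue_push(send_queue_t *q, const void *buf, size_t len)
{
    assert(len <= MSG_MAX);
    if (q->count == QUEUE_SLOTS)
        return -1;
    size_t tail = (q->head + q->count) % QUEUE_SLOTS;
    memcpy(q->data[tail], buf, len);
    q->len[tail] = len;
    q->count++;
    return 0;
}

/* Drain buffered messages in order until the transport refuses one.
 * Returns the number of messages handed to the transport. */
size_t queue_flush(send_queue_t *q, try_send_fn try_send, void *ctx)
{
    size_t sent = 0;
    while (q->count > 0 &&
           try_send(q->data[q->head], q->len[q->head], ctx)) {
        q->head = (q->head + 1) % QUEUE_SLOTS;
        q->count--;
        sent++;
    }
    return sent;
}

/* Stand-in transport for the sketch: accepts one message per credit. */
int credit_try_send(const void *buf, size_t len, void *ctx)
{
    int *credits = ctx;
    (void)buf;
    (void)len;
    if (*credits == 0)
        return 0;
    (*credits)--;
    return 1;
}
```

The open question in this thread is who calls `queue_flush` and when, not whether `queue_push` is allowed to exist.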

Here, the proposal is to allow the BTL to buffer, but to require the PML to handle progress. That's broken, IMHO.

> To progress such outstanding messages, an additional thread is needed in
> userspace. Is this what MX does?

MX uses a user-level thread, but it's mainly for progressing the higher-level protocol on the receive side. On the send side, for the low-level protocol, it is easier to ask your driver either to wake you up when the sending resource is available again (blocking on a CQ for IB) or to take care of the sending itself.
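The "wake me up when the sending resource is available again" pattern can be sketched with plain POSIX sockets, used here only as a stand-in for MX or an IB completion queue. The helper names `set_nonblocking` and `send_all_wait` are hypothetical. Instead of buffering in userspace or spinning, the sender sleeps in `poll()` until the kernel reports the socket writable.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Put a descriptor in non-blocking mode. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Send the whole buffer on a non-blocking socket. When the kernel has no
 * room left, sleep in poll() until it signals that the socket is writable
 * again. Returns 0 on success, -1 on error. */
int send_all_wait(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = send(fd, p, len, 0);
        if (n > 0) {
            p   += n;
            len -= (size_t)n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;  /* interrupted by a signal: retry */
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            /* Sending resource exhausted: let the kernel wake us up. */
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
                return -1;
            continue;
        }
        return -1;  /* any other failure */
    }
    return 0;
}
```

With InfiniBand verbs the analogous wait would be on a completion channel rather than on `poll()` of a socket; that variant is not shown here.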

<usual rant>
My overall problem with this proposal is that it is a race to the bottom, driven by the least capable BTL, functionality-wise. The PML already imposes pipelining for large messages (with a few knobs, but still) when most protocols in other BTLs already have their own. Now it's flow-control progression (not MPI progression).

Can each BTL implement what is needed for its particular back-end instead of bloating the upper layer?
</usual rant>

Patrick
