The sm BTL used to have two mechanisms for dealing with congested
FIFOs. One was to grow the FIFOs. Another was to queue pending sends
locally (on the sender's side). I think the grow-FIFO mechanism was
typically invoked and the pending-send mechanism used only under extreme
circumstances (no more memory).

With the sm makeover of 1.3.2, we dropped the ability to grow FIFOs.
The code added complexity and there seemed to be no need to have two
mechanisms to deal with congested FIFOs. In ticket 1944, however, we
see that repeated collectives can produce hangs, and this seems to be
due to the pending-send code not adequately dealing with congested FIFOs.

Today, when a process tries to write to a remote FIFO and fails, it
queues the write as a pending send. The only condition under which it
retries pending sends is when it gets a fragment back from a remote process.
I think the logic must have been that the FIFO got congested because we
issued too many sends. Getting a fragment back indicates that the
remote process has made progress digesting those sends. In ticket 1944,
we see that a FIFO can also get congested from too many returning
fragments. Further, with shared FIFOs, a FIFO could become congested
due to the activity of a third-party process.

In sum, getting a fragment back from a remote process is a poor
indicator that it's time to retry pending sends.
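To make the stall concrete, here is a minimal sketch of the current retry policy as I understand it. The toy_* names, the two-slot FIFO, and the fixed-size pending queue are all made up for illustration; none of this is the actual sm BTL code.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_FIFO_SIZE   2   /* tiny on purpose, so congestion is easy to hit */
#define TOY_MAX_PENDING 8

/* Hypothetical stand-ins; these are not the sm BTL's data structures. */
typedef struct { int slots[TOY_FIFO_SIZE];   size_t head, count; } toy_fifo_t;
typedef struct { int items[TOY_MAX_PENDING]; size_t head, count; } toy_pending_t;

/* Try to write one item; returns false when the FIFO is full. */
static bool toy_fifo_write(toy_fifo_t *f, int item)
{
    if (f->count == TOY_FIFO_SIZE) return false;
    f->slots[(f->head + f->count) % TOY_FIFO_SIZE] = item;
    f->count++;
    return true;
}

/* Reader side: consume one item, which frees a slot. */
static bool toy_fifo_read(toy_fifo_t *f, int *item)
{
    if (f->count == 0) return false;
    *item = f->slots[f->head];
    f->head = (f->head + 1) % TOY_FIFO_SIZE;
    f->count--;
    return true;
}

/* Current behavior: a failed write is parked as a pending send.
 * Returns false only if the pending queue itself is full. */
static bool toy_send(toy_fifo_t *f, toy_pending_t *p, int item)
{
    if (toy_fifo_write(f, item)) return true;
    if (p->count == TOY_MAX_PENDING) return false;
    p->items[(p->head + p->count) % TOY_MAX_PENDING] = item;
    p->count++;
    return true;
}

/* The only retry point today: runs when a fragment comes back. */
static void toy_on_fragment_returned(toy_fifo_t *f, toy_pending_t *p)
{
    while (p->count > 0 && toy_fifo_write(f, p->items[p->head])) {
        p->head = (p->head + 1) % TOY_MAX_PENDING;
        p->count--;
    }
}
```

In this toy model, if the reader drains a slot but no fragment ever comes back, the pending send just sits there even though the FIFO has room. That is the gap the retry condition leaves open.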
Maybe the real way to know when to retry pending sends is just to check
if there's room on the FIFO.

So, I'll try modifying MCA_BTL_SM_FIFO_WRITE. It'll start by checking
if there are pending sends. If so, it'll retry them before performing
the requested write. This should also help preserve ordering a little
better. I'm guessing this will not hurt our message latency in any
meaningful way, but I'll measure it to confirm.
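Roughly, the write path I have in mind looks like the sketch below. It is written as a plain function over a toy two-slot FIFO rather than as the real MCA_BTL_SM_FIFO_WRITE macro, and every sk_* name is hypothetical, so treat it as an illustration of the idea and not as the patch.

```c
#include <stdbool.h>
#include <stddef.h>

#define SK_FIFO_SIZE   2
#define SK_MAX_PENDING 8

/* Hypothetical stand-ins; these are not the sm BTL's data structures. */
typedef struct { int slots[SK_FIFO_SIZE];   size_t head, count; } sk_fifo_t;
typedef struct { int items[SK_MAX_PENDING]; size_t head, count; } sk_pending_t;

/* Raw write attempt; returns false when the FIFO has no room. */
static bool sk_fifo_try_write(sk_fifo_t *f, int item)
{
    if (f->count == SK_FIFO_SIZE) return false;
    f->slots[(f->head + f->count) % SK_FIFO_SIZE] = item;
    f->count++;
    return true;
}

/* Reader side: consume one item, which frees a slot. */
static bool sk_fifo_read(sk_fifo_t *f, int *item)
{
    if (f->count == 0) return false;
    *item = f->slots[f->head];
    f->head = (f->head + 1) % SK_FIFO_SIZE;
    f->count--;
    return true;
}

/* Retry queued sends oldest-first until one fails or none remain. */
static void sk_retry_pending(sk_fifo_t *f, sk_pending_t *p)
{
    while (p->count > 0 && sk_fifo_try_write(f, p->items[p->head])) {
        p->head = (p->head + 1) % SK_MAX_PENDING;
        p->count--;
    }
}

/* Proposed write: drain pending sends first, then write or queue.
 * Returns false only if the pending queue itself is full. */
static bool sk_fifo_write(sk_fifo_t *f, sk_pending_t *p, int item)
{
    if (p->count > 0) sk_retry_pending(f, p);

    /* Write directly only when nothing older is still waiting,
     * so a new send never jumps ahead of a pending one. */
    if (p->count == 0 && sk_fifo_try_write(f, item)) return true;

    if (p->count == SK_MAX_PENDING) return false;
    p->items[(p->head + p->count) % SK_MAX_PENDING] = item;
    p->count++;
    return true;
}
```

Here the retry is driven by whether the FIFO has room, not by a fragment coming back. Note that in this sketch a pending send is only retried on the next write.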
Meanwhile, I wanted to check in with y'all for any guidance you might have.