Hi folks
As has been reported, we (LANL) have been investigating a series of
problems in the 1.3.2 release that continue to exist in the OMPI
trunk. The recent shared memory work has helped alleviate some of
these. However, while we had hoped that this work would solve
everything, several problems persist. We are continuing to
investigate, but since I have heard that others may be encountering
similar problems, I thought it might be helpful if I presented the
full situation to the extent possible.
We are investigating two major problems that definitely appear at
large scale, and occasionally even at job sizes as small as 32 procs.
The two problems may indeed be related, though that has not been
proven:
I. Lockup of the IPoIB communication system due to kernel buffer
overflow. We route all OOB messaging over the IPoIB network. The
resource manager uses a completely different Ethernet, but the Panasas
parallel file system also flows across IPoIB. OMPI's openib BTL is
active, along with the sm BTL, but *not* the TCP BTL.
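For context, the relevant wiring amounts to MCA settings along these
lines - this is an illustrative sketch, not a copy of our site config,
and the interface name (ib0) in particular is a stand-in:

# openmpi-mca-params.conf (illustrative)
# use the openib and sm BTLs; leave tcp out
btl = openib,sm,self
# route OOB/TCP traffic over the IPoIB interface
oob_tcp_if_include = ib0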
The job launches without any problems and runs for some period of
time. At some point, we suddenly receive error messages from the OOB
indicating that the connection retry limit has been exceeded. This
happens on a single node at a time, though which node it is appears
random (i.e., it is not the node where mpirun is executing, nor one
hosting any particular rank). All communication with that node
subsequently fails. The error message is reported by several
processes, indicating that ranks on several different nodes are
attempting to open a TCP connection to this node.
When we investigate, we find that the kernel buffers for the IPoIB TCP
stack are completely full and "wedged" - i.e., no communication can
occur. Thus, all connection requests are being rejected.
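For anyone who wants to check for the same condition on their own
nodes: a quick way to see whether the TCP stack is up against its
memory limit is to compare the "mem" figure on the TCP line of
/proc/net/sockstat against the thresholds in
/proc/sys/net/ipv4/tcp_mem (both are in pages). The little C sketch
below does just that comparison - nothing OMPI-specific, just an
illustration assuming a reasonably standard Linux /proc layout:

/* sketch: TCP socket memory in use vs. the kernel's tcp_mem limits */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char line[256];
    long pages = -1, low = -1, pressure = -1, high = -1;

    FILE *fp = fopen("/proc/net/sockstat", "r");
    if (!fp) { perror("sockstat"); return 1; }
    while (fgets(line, sizeof(line), fp)) {
        /* line looks like: "TCP: inuse N orphan N tw N alloc N mem N" */
        if (0 == strncmp(line, "TCP:", 4)) {
            char *p = strstr(line, "mem ");
            if (p) pages = strtol(p + 4, NULL, 10);
        }
    }
    fclose(fp);

    fp = fopen("/proc/sys/net/ipv4/tcp_mem", "r");
    if (!fp) { perror("tcp_mem"); return 1; }
    if (3 != fscanf(fp, "%ld %ld %ld", &low, &pressure, &high)) {
        fclose(fp);
        return 1;
    }
    fclose(fp);

    printf("tcp mem in use: %ld pages (pressure %ld, limit %ld)\n",
           pages, pressure, high);
    if (pages >= high)
        printf("TCP is at its memory limit\n");
    return 0;
}

On the node in question, that is essentially the "completely full and
wedged" state described above.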
At issue is the source of all these messages, and why they are
backing up rather than being sent. Are they coming from OMPI, from the
application, or from something in the system? Are they trying to go
somewhere that is overloaded, unable to receive, or...?
We don't know yet. What we do know is that the application is -not-
generating any stdout/stderr messages, nor transferring any stdin.
There could be connection handshakes flowing over the OOB in support
of the openib BTL, but those shouldn't be overwhelming. And there are
no Panasas operations in effect either, making this all rather
mysterious.
We have a locally developed tool (called Loba) that can track the
number of messages being sent by each rank, where they are intended to
go, and their size. Currently the tool only does this for MPI
communications, but I have asked that the tool be extended to cover
the OOB. This will tell us if the message overload is somehow flowing
through OMPI itself. I will report on those findings as they become
available, and file a ticket if we confirm that OMPI is the culprit.
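To give a flavor of what that counting looks like on the MPI side
(this is NOT Loba - just a minimal sketch of the generic PMPI-wrapper
technique for tallying sends by destination; Loba itself may well work
differently, and a real tool would also wrap Isend, collectives, etc.):

/* minimal PMPI sketch: count messages and bytes sent per destination.
 * For simplicity it assumes sends on MPI_COMM_WORLD and only wraps
 * MPI_Send - illustration only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static long *send_count = NULL;   /* messages sent to each rank */
static long *send_bytes = NULL;   /* bytes sent to each rank    */

int MPI_Init(int *argc, char ***argv)
{
    int rc = PMPI_Init(argc, argv);
    int nprocs;
    PMPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    send_count = calloc(nprocs, sizeof(long));
    send_bytes = calloc(nprocs, sizeof(long));
    return rc;
}

int MPI_Send(void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    int size;
    PMPI_Type_size(type, &size);
    send_count[dest] += 1;
    send_bytes[dest] += (long)count * size;
    return PMPI_Send(buf, count, type, dest, tag, comm);
}

int MPI_Finalize(void)
{
    int me, nprocs, i;
    PMPI_Comm_rank(MPI_COMM_WORLD, &me);
    PMPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    for (i = 0; i < nprocs; i++) {
        if (send_count[i] > 0)
            printf("rank %d -> %d: %ld msgs, %ld bytes\n",
                   me, i, send_count[i], send_bytes[i]);
    }
    free(send_count);
    free(send_bytes);
    return PMPI_Finalize();
}

The OOB, of course, sits below the MPI layer, so covering it requires
instrumentation inside OMPI rather than a PMPI shim - hence the
request to extend the tool.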
II. Loss of communication that causes the resource manager to believe
the node has failed. This smells somewhat similar to the above - an
application runs for a while, but then suddenly terminates because the
RM does not receive a required heartbeat from one or more nodes. Since
those heartbeats flow over TCP, our immediate thought was that this
problem was most likely caused by the same issue as above.
However, subsequent investigation doesn't appear to support that
hypothesis. What we find is:
1. one process on a node starts to run slowly, falling further and
further behind the others in collective operations. Now this sounds
like the shared memory problem, but we do -not- see memory usage build
up in this case. Instead, the process simply runs slowly, and the job
begins to slow down as a result.
2. at some point, the RM aborts the job after failing to receive the
heartbeat.
Further investigation reveals the following:
1. the process that runs slowly is ALWAYS located on the same core
that services the IB IRQ. Since we are using IB for the inter-node
communication, this makes some sense. Unfortunately, the IB driver
appears to "bind" that IRQ to a single core, so the pain isn't shared
- it impacts only that one process (see the sketch after this list for
one way to check the IRQ-to-core mapping).
2. the heartbeat doesn't utilize IPoIB, but instead flows over a
separate Ethernet. However, the RM daemon on the node is not getting
any CPU time, and thus cannot generate the heartbeat. The three
processes (this is a 4-core system) other than the slow one are
running at the typical 99% CPU usage, indicating they are polling hard
while waiting for a message to arrive. The "slow" process, however, is
dragging along at only ~10% CPU usage by the time of the crash -
it starts at 99%, then gradually drops as time progresses until
hitting the crash point. We do not currently know -why- it loses CPU
time.
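As an aside, for anyone wanting to reproduce the IRQ-to-core check
mentioned in #1: the mapping comes straight from /proc/interrupts plus
/proc/irq/<irq>/smp_affinity. The fragment below is a stripped-down
illustration, not our actual script - in particular, the "mlx4" string
it matches on is an assumption, so substitute whatever name your HCA
driver registers in /proc/interrupts:

/* print the CPU affinity mask of any IRQ whose /proc/interrupts line
 * mentions the IB driver ("mlx4" here is an assumption) */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char line[1024], path[64], mask[64];
    int irq;
    FILE *fp, *af;

    fp = fopen("/proc/interrupts", "r");
    if (!fp) { perror("interrupts"); return 1; }

    while (fgets(line, sizeof(line), fp)) {
        if (!strstr(line, "mlx4")) continue;   /* driver name: assumption */
        irq = atoi(line);                      /* line begins "  NN:"     */
        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
        af = fopen(path, "r");
        if (af && fgets(mask, sizeof(mask), af))
            printf("irq %3d  affinity mask %s", irq, mask);
        if (af) fclose(af);
    }
    fclose(fp);
    return 0;
}

A mask with a single bit set means everything that interrupt does
lands on one core - which is what we see.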
Our best current guess is that, for some strange reason, the IB IRQ
goes into hyper-mode and just fires like crazy. As a result, the
process that shares that core loses its ability to process messages,
and the RM daemon is blocked from running (why is still unknown) -
thus leading the RM to conclude the node has died and to reboot it. We
don't know whether this is being caused by the IB system getting
flooded with messages (perhaps as in scenario I above) or by something
else.
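One way to test the "fires like crazy" part of that theory is simply
to sample /proc/interrupts twice, a second apart, and look at the
delta - i.e., the interrupt rate. Again a minimal sketch, with the
same caveat that the driver name it matches is an assumption:

/* crude interrupt-rate check: sum the per-CPU counts on every
 * /proc/interrupts line matching DRIVER, twice, one second apart */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DRIVER "mlx4"   /* assumption - use your HCA driver's name */

static long long total_irqs(void)
{
    char line[1024];
    char *p, *end;
    long long total = 0;
    FILE *fp = fopen("/proc/interrupts", "r");
    if (!fp) return 0;
    while (fgets(line, sizeof(line), fp)) {
        if (!strstr(line, DRIVER)) continue;
        p = strchr(line, ':');
        if (!p) continue;
        p++;
        /* add up the per-CPU counters; stop at the non-numeric fields */
        for (;;) {
            long long v = strtoll(p, &end, 10);
            if (end == p) break;
            total += v;
            p = end;
        }
    }
    fclose(fp);
    return total;
}

int main(void)
{
    long long before = total_irqs();
    sleep(1);
    long long after = total_irqs();
    printf("%s interrupts in the last second: %lld\n",
           DRIVER, after - before);
    return 0;
}

If the theory is right, that rate on the affected node should be
dramatically higher than on a healthy one.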
We are continuing to investigate these problems. Any thoughts are
welcome - these have proven very, very hard to debug. Again, I offer
this information in case others out there are seeing similar problems,
in the hope that it might help you recognize them and that we might
share in a solution.
Thanks
Ralph