Hi folks

As has been reported, we (LANL) have been investigating a series of problems in the 1.3.2 release that continue to exist in the OMPI trunk. The recent shared memory work has helped alleviate some of these. However, while we had hoped that this work would solve everything, several problems persist. We are continuing to investigate, but since I have heard that others may be encountering similar problems, I thought it might be helpful if I presented the full situation to the extent possible.

We are investigating two major problems that definitely appear at large scale, and occasionally even at sizes as small as 32 procs. The problems may indeed be related, though that has not been proven:

I. lockup of the IPoIB communication system due to kernel buffer overflow. We route all OOB messaging over the IPoIB network. The resource manager uses a completely separate Ethernet, but traffic for the Panasas parallel file system also flows across IPoIB. OMPI's openib BTL is active, along with the sm BTL, but *not* the TCP BTL.

The job launches without any problems and runs for some period of time. At some point, we suddenly receive error messages from the OOB indicating that connection retries have been exceeded. The failure involves a single node, though which node it is appears random (i.e., it is not the node where mpirun is executing, nor the one hosting any specific rank). All communication with that node subsequently fails. The error message is reported by several processes, indicating that ranks on several different nodes are attempting to open a TCP connection to this node.

When we investigate, we find that the kernel buffers for the IPoIB TCP stack are completely full and "wedged" - i.e., no communication can occur. Thus, all connection requests are being rejected.
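
This isn't how we actually spotted it (our admins have their own tools), but for anyone who wants to check for the same condition on a node, a quick way is to scan /proc/net/tcp for sockets with backed-up send/receive queues. A minimal sketch only, assuming the standard Linux /proc/net/tcp layout - addresses and ports print in the kernel's raw hex form, and it only looks at IPv4 sockets:

/* wedged-sockets.c: list TCP sockets whose send/recv queues are non-empty.
 * Sketch only - field layout per the standard Linux /proc/net/tcp format.
 * Compile: gcc -o wedged-sockets wedged-sockets.c
 */
#include <stdio.h>

int main(void)
{
    char line[512];
    FILE *fp = fopen("/proc/net/tcp", "r");
    if (!fp) { perror("fopen /proc/net/tcp"); return 1; }

    /* skip the header line */
    if (!fgets(line, sizeof(line), fp)) { fclose(fp); return 1; }

    while (fgets(line, sizeof(line), fp)) {
        unsigned int sl, lport, rport, state;
        unsigned long txq, rxq;
        char laddr[64], raddr[64];

        /* fields: sl local_address rem_address st tx_queue:rx_queue ... */
        if (sscanf(line, "%u: %63[0-9A-Fa-f]:%x %63[0-9A-Fa-f]:%x %x %lx:%lx",
                   &sl, laddr, &lport, raddr, &rport,
                   &state, &txq, &rxq) != 8)
            continue;

        if (txq > 0 || rxq > 0)
            /* addresses are hex in kernel byte order; ports are decimal */
            printf("sock %u  local %s:%u  remote %s:%u  state %02x  "
                   "txq %lu  rxq %lu\n",
                   sl, laddr, lport, raddr, rport, state, txq, rxq);
    }
    fclose(fp);
    return 0;
}

On the wedged node, essentially every socket bound to the IPoIB interface shows up in that list and never drains.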

The open questions are where all these messages are coming from and why they were never sent. Are they coming from OMPI, from the application, or from something else in the system? Are they trying to reach a destination that is overloaded, unable to recv, or...?

We don't know yet. What we do know is that the application is -not- generating any stdout/stderr, nor transferring any stdin. There could be connection handshakes flowing over the OOB in support of openib, but those shouldn't be overwhelming. And there are no Panasas operations in effect either, making this all rather mysterious.

We have a locally developed tool (called Loba) that can track the number of messages being sent by each rank, where they are intended to go, and their size. Currently the tool only does this for MPI communications, but I have asked that it be extended to cover the OOB. This will tell us whether the message overload is somehow flowing through OMPI itself. I will report those findings as they become available, and file a ticket if we confirm that OMPI is the culprit.
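
Loba itself isn't something I can post, and the sketch below is *not* its implementation - it is just the usual PMPI trick for counting per-destination traffic, for anyone who wants a rough equivalent on their own system. It only covers blocking point-to-point sends and ignores collectives, nonblocking traffic, and communicator rank translation; prototypes match the pre-const MPI bindings of the 1.3 series:

/* msgcount.c: toy PMPI shim that counts MPI_Send traffic per destination
 * and dumps the totals at MPI_Finalize.  Link into the app or build as a
 * shared library and LD_PRELOAD it.  Sketch only.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static long *send_count = NULL;   /* messages sent to each peer rank */
static long *send_bytes = NULL;   /* bytes sent to each peer rank    */
static int   world_size = 0, world_rank = 0;

int MPI_Init(int *argc, char ***argv)
{
    int rc = PMPI_Init(argc, argv);
    PMPI_Comm_size(MPI_COMM_WORLD, &world_size);
    PMPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    send_count = calloc(world_size, sizeof(long));
    send_bytes = calloc(world_size, sizeof(long));
    return rc;
}

int MPI_Send(void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    int size;
    PMPI_Type_size(type, &size);
    /* note: dest is relative to 'comm'; a real tool would translate it
     * to a COMM_WORLD rank, but we skip that here */
    if (dest >= 0 && dest < world_size) {
        send_count[dest] += 1;
        send_bytes[dest] += (long)count * size;
    }
    return PMPI_Send(buf, count, type, dest, tag, comm);
}

int MPI_Finalize(void)
{
    int i;
    for (i = 0; i < world_size; i++)
        if (send_count[i] > 0)
            printf("[rank %d] -> rank %d: %ld msgs, %ld bytes\n",
                   world_rank, i, send_count[i], send_bytes[i]);
    free(send_count);
    free(send_bytes);
    return PMPI_Finalize();
}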


II. loss of communication that causes the resource manager to believe the node has failed. This smells somewhat similar to the above - an application runs for a while, but then suddenly terminates because the RM does not receive a required heartbeat from one or more nodes. Since those heartbeats flow over TCP, our immediate thought was that this problem was most likely caused by the same issue as above.

However, subsequent investigation doesn't appear to support that hypothesis. What we find is:

1. one process on a node starts to run slow, falling further and further behind the others in terms of collective operations. Now this sounds like the shared memory problem, but we do -not- see memory usage build up in this case. Instead, the process just runs slow, and the job begins to slow down as a result.

2. at some point, the RM aborts the job after failing to receive the heartbeat.

Subsequent investigation reveals the following:

1. the process that runs slow is ALWAYS located on the same core that services the IB IRQ. Since we are using IB for the inter-node communication, this makes some sense. Unfortunately, IB appears to "bind" that IRQ to a single core, so the pain isn't shared - it only impacts a single process.

2. the heartbeat doesn't utilize IPoIB, but instead flows over a separate Ethernet. However, the RM daemon on the node is not getting any cpu time, and thus cannot generate the heartbeat. The three processes (this is a 4-core system) other than the slow one are running at the typical 99% cpu usage, indicating they are polling hard while waiting for a message to arrive. The "slow" process, however, is dragging along at only ~10% cpu by the time we crash - it starts at 99%, then gradually drops as time progresses until we hit the crash point. We do not currently know -why- it loses cpu.

Our best current guess is that, for some strange reason, the IB IRQ goes into hyper-mode and just fires like crazy. As a result, the process that shares that core loses its ability to process messages, and the RM daemon is blocked from running (why is still unknown) - thus causing the RM to believe the node has died and reboot the system. We don't know if this is being caused by the IB system getting flooded with messages (perhaps as in scenario I above), or some other reason.
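
For anyone who wants to look for the same co-location on their own nodes: the per-core interrupt counts in /proc/interrupts (and the mask in /proc/irq/<n>/smp_affinity) show which core is absorbing the HCA's interrupts, and comparing that against where each rank is running is enough to spot it; running the check twice also shows how fast the counts are climbing. A rough sketch only - the driver names to match will vary by HCA, and this isn't the tooling we used:

/* irqcheck.c: report which core this process is on and dump the HCA-related
 * lines from /proc/interrupts so the per-core interrupt counts can be
 * compared against the rank-to-core mapping.  Sketch only.
 * Compile: gcc -o irqcheck irqcheck.c   (sched_getcpu needs glibc >= 2.6)
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char line[1024];
    FILE *fp;

    printf("pid %d currently running on core %d\n",
           (int)getpid(), sched_getcpu());

    fp = fopen("/proc/interrupts", "r");
    if (!fp) { perror("fopen /proc/interrupts"); return 1; }

    /* first line is the "CPU0 CPU1 ..." header; print it for reference */
    if (fgets(line, sizeof(line), fp))
        fputs(line, stdout);

    /* print any interrupt line that looks HCA-related; adjust the
     * substrings to whatever your driver reports */
    while (fgets(line, sizeof(line), fp)) {
        if (strstr(line, "mlx") || strstr(line, "mthca") || strstr(line, "ib_"))
            fputs(line, stdout);
    }
    fclose(fp);
    return 0;
}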


We are continuing to investigate these problems. Any thoughts are welcome - these have proven very, very hard to debug. Again, I offer this information in case others out there are seeing similar problems in the hope that this might help you recognize the problem, and that we might share in its solution.

Thanks
Ralph
