Hi folks
As has been reported, we (LANL) have been investigating a series of
problems in the 1.3.2 release that continue to exist in the OMPI
trunk. The recent shared memory work has helped alleviate some of
these. However, while we had hoped that this work would solve
everything, several problems persist. We are continuing to
investigate, but since I have heard that others may be encountering
similar problems, I thought it might be helpful if I presented the
full situation to the extent possible.
We are investigating two major problems that definitely appear at
large scale, and occasionally even at job sizes as small as 32 procs.
The two problems may indeed be related, though that has not been
proven:
I. Lockup of the IPoIB communication system due to kernel buffer
overflow. We route all OOB messaging over the IPoIB network. The
resource manager uses a completely different Ethernet, but the Panasas
parallel file system also flows across IPoIB. OMPI's openib BTL is
active, along with the sm BTL, but *not* the TCP BTL.
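For context, the relevant wiring amounts to MCA settings along these
lines - this is an illustrative sketch, not a copy of our site config,
and the interface name (ib0) in particular is a stand-in:

# openmpi-mca-params.conf (illustrative)
# use the openib and sm BTLs; leave tcp out
btl = openib,sm,self
# route OOB/TCP traffic over the IPoIB interface
oob_tcp_if_include = ib0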
The job launches without any problems and runs for some period of
time. At some point, we suddenly receive error messages from the OOB
indicating that the connection retry limit has been exceeded. This
happens on a single node at a time, though which node it is appears
random (i.e., it is not the node where mpirun is executing, nor one
hosting any particular rank). All communication with that node
subsequently fails. The error message is reported by several
processes, indicating that ranks on several different nodes are
attempting to open a TCP connection to this node.
When we investigate, we find that the kernel buffers for the IPoIB TCP
stack are completely full and "wedged" - i.e., no communication can
occur. Thus, all connection requests are being rejected.
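For anyone who wants to check for the same condition on their own
nodes: a quick way to see whether the TCP stack is up against its
memory limit is to compare the "mem" figure on the TCP line of
/proc/net/sockstat against the thresholds in
/proc/sys/net/ipv4/tcp_mem (both are in pages). The little C sketch
below does just that comparison - nothing OMPI-specific, just an
illustration assuming a reasonably standard Linux /proc layout:

/* sketch: TCP socket memory in use vs. the kernel's tcp_mem limits */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char line[256];
    long pages = -1, low = -1, pressure = -1, high = -1;

    FILE *fp = fopen("/proc/net/sockstat", "r");
    if (!fp) { perror("sockstat"); return 1; }
    while (fgets(line, sizeof(line), fp)) {
        /* line looks like: "TCP: inuse N orphan N tw N alloc N mem N" */
        if (0 == strncmp(line, "TCP:", 4)) {
            char *p = strstr(line, "mem ");
            if (p) pages = strtol(p + 4, NULL, 10);
        }
    }
    fclose(fp);

    fp = fopen("/proc/sys/net/ipv4/tcp_mem", "r");
    if (!fp) { perror("tcp_mem"); return 1; }
    if (3 != fscanf(fp, "%ld %ld %ld", &low, &pressure, &high)) {
        fclose(fp);
        return 1;
    }
    fclose(fp);

    printf("tcp mem in use: %ld pages (pressure %ld, limit %ld)\n",
           pages, pressure, high);
    if (pages >= high)
        printf("TCP is at its memory limit\n");
    return 0;
}

On the node in question, that is essentially the "completely full and
wedged" state described above.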
At issue is the source of all these messages, and why they are
backing up rather than being sent. Are they coming from OMPI, from the
application, or from something in the system? Are they trying to go
somewhere that is overloaded, unable to receive, or...?
We don't know yet. What we do know is that the application is -not-
generating any stdout/stderr messages, nor transferring any stdin.
There could be connection handshakes flowing over the OOB in support
of the openib BTL, but those shouldn't be overwhelming. And there are
no Panasas operations in effect either, making this all rather
mysterious.
We have a locally developed tool (called Loba) that can track the
number of messages being sent by each rank, where they are intended to
go, and their size. Currently the tool only does this for MPI
communications, but I have asked that the tool be extended to cover
the OOB. This will tell us if the message overload is somehow flowing
through OMPI itself. I will report on those findings as they become
available, and file a ticket if we confirm that OMPI is the culprit.
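To give a flavor of what that counting looks like on the MPI side
(this is NOT Loba - just a minimal sketch of the generic PMPI-wrapper
technique for tallying sends by destination; Loba itself may well work
differently, and a real tool would also wrap Isend, collectives, etc.):

/* minimal PMPI sketch: count messages and bytes sent per destination.
 * For simplicity it assumes sends on MPI_COMM_WORLD and only wraps
 * MPI_Send - illustration only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static long *send_count = NULL;   /* messages sent to each rank */
static long *send_bytes = NULL;   /* bytes sent to each rank    */

int MPI_Init(int *argc, char ***argv)
{
    int rc = PMPI_Init(argc, argv);
    int nprocs;
    PMPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    send_count = calloc(nprocs, sizeof(long));
    send_bytes = calloc(nprocs, sizeof(long));
    return rc;
}

int MPI_Send(void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    int size;
    PMPI_Type_size(type, &size);
    send_count[dest] += 1;
    send_bytes[dest] += (long)count * size;
    return PMPI_Send(buf, count, type, dest, tag, comm);
}

int MPI_Finalize(void)
{
    int me, nprocs, i;
    PMPI_Comm_rank(MPI_COMM_WORLD, &me);
    PMPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    for (i = 0; i < nprocs; i++) {
        if (send_count[i] > 0)
            printf("rank %d -> %d: %ld msgs, %ld bytes\n",
                   me, i, send_count[i], send_bytes[i]);
    }
    free(send_count);
    free(send_bytes);
    return PMPI_Finalize();
}

The OOB, of course, sits below the MPI layer, so covering it requires
instrumentation inside OMPI rather than a PMPI shim - hence the
request to extend the tool.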
II. Loss of communication that causes the resource manager to believe
the node has failed. This smells somewhat similar to the above - an
application runs for a while, but then suddenly terminates because the
RM does not receive a required heartbeat from one or more nodes. Since
those heartbeats flow over TCP, our immediate thought was that this
problem was most likely caused by the same issue as above.
However, subsequent investigation doesn't appear to support that
hypothesis. What we find is:
1. one process on a node starts to run slowly, falling further and
further behind the others in collective operations. Now this sounds
like the shared memory problem, but we do -not- see memory usage build
up in this case. Instead, the process simply runs slowly, and the job
begins to slow down as a result.
2. at some point, the RM aborts the job after failing to receive the
heartbeat.
Further investigation reveals the following:
1. the process that runs slowly is ALWAYS located on the same core
that services the IB IRQ. Since we are using IB for the inter-node
communication, this makes some sense. Unfortunately, the IB driver
appears to "bind" that IRQ to a single core, so the pain isn't shared
- it impacts only that one process (see the sketch after this list for
one way to check the IRQ-to-core mapping).
2. the heartbeat doesn't utilize IPoIB, but instead flows over a
separate Ethernet. However, the RM daemon on the node is not getting
any CPU time, and thus cannot generate the heartbeat. The three
processes (this is a 4-core system) other than the slow one are
running at the typical 99% CPU usage, indicating they are polling hard
while waiting for a message to arrive. The "slow" process, however, is
dragging along at only ~10% CPU usage by the time of the crash -
it starts at 99%, then gradually drops as time progresses until
hitting the crash point. We do not currently know -why- it loses CPU
time.
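As an aside, for anyone wanting to reproduce the IRQ-to-core check
mentioned in #1: the mapping comes straight from /proc/interrupts plus
/proc/irq/<irq>/smp_affinity. The fragment below is a stripped-down
illustration, not our actual script - in particular, the "mlx4" string
it matches on is an assumption, so substitute whatever name your HCA
driver registers in /proc/interrupts:

/* print the CPU affinity mask of any IRQ whose /proc/interrupts line
 * mentions the IB driver ("mlx4" here is an assumption) */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char line[1024], path[64], mask[64];
    int irq;
    FILE *fp, *af;

    fp = fopen("/proc/interrupts", "r");
    if (!fp) { perror("interrupts"); return 1; }

    while (fgets(line, sizeof(line), fp)) {
        if (!strstr(line, "mlx4")) continue;   /* driver name: assumption */
        irq = atoi(line);                      /* line begins "  NN:"     */
        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
        af = fopen(path, "r");
        if (af && fgets(mask, sizeof(mask), af))
            printf("irq %3d  affinity mask %s", irq, mask);
        if (af) fclose(af);
    }
    fclose(fp);
    return 0;
}

A mask with a single bit set means everything that interrupt does
lands on one core - which is what we see.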
Our best current guess is that, for some strange reason, the IB IRQ
goes into hyper-mode and just fires like crazy. As a result, the
process that shares that core loses its ability to process messages,
and the RM daemon is blocked from running (why is still unknown) -
thus leading the RM to conclude the node has died and to reboot it. We
don't know whether this is being caused by the IB system getting
flooded with messages (perhaps as in scenario I above) or by something
else.
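One way to test the "fires like crazy" part of that theory is simply
to sample /proc/interrupts twice, a second apart, and look at the
delta - i.e., the interrupt rate. Again a minimal sketch, with the
same caveat that the driver name it matches is an assumption:

/* crude interrupt-rate check: sum the per-CPU counts on every
 * /proc/interrupts line matching DRIVER, twice, one second apart */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DRIVER "mlx4"   /* assumption - use your HCA driver's name */

static long long total_irqs(void)
{
    char line[1024];
    char *p, *end;
    long long total = 0;
    FILE *fp = fopen("/proc/interrupts", "r");
    if (!fp) return 0;
    while (fgets(line, sizeof(line), fp)) {
        if (!strstr(line, DRIVER)) continue;
        p = strchr(line, ':');
        if (!p) continue;
        p++;
        /* add up the per-CPU counters; stop at the non-numeric fields */
        for (;;) {
            long long v = strtoll(p, &end, 10);
            if (end == p) break;
            total += v;
            p = end;
        }
    }
    fclose(fp);
    return total;
}

int main(void)
{
    long long before = total_irqs();
    sleep(1);
    long long after = total_irqs();
    printf("%s interrupts in the last second: %lld\n",
           DRIVER, after - before);
    return 0;
}

If the theory is right, that rate on the affected node should be
dramatically higher than on a healthy one.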
We are continuing to investigate these problems. Any thoughts are
welcome - these have proven very, very hard to debug. Again, I offer
this information in case others out there are seeing similar problems,
in the hope that it might help you recognize them and that we might
share in a solution.
Thanks
Ralph