Hi folks

As mentioned in today's telecon, we at LANL are continuing to see hangs when running even small jobs that involve shared memory in collective operations. This has been a topic of discussion before, but I bring it up again because (a) the problem is becoming epidemic across our application codes, and (b) repeated testing provides more info and (most importantly) confirms that this problem -does not- occur under 1.2.x - it is strictly a 1.3.2 problem (we haven't checked whether it also occurs in 1.3.0 or 1.3.1).
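To make the pattern concrete, here is roughly the shape of code that gets us into trouble. This is a hand-written sketch rather than an excerpt from any of our applications, and the buffer size and iteration count are purely illustrative:

  #include <stdio.h>
  #include <stdlib.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, nprocs, i;
      int count = 1024;      /* 4 KB per broadcast - illustrative only */
      int iters = 10000;     /* illustrative loop count */
      int *buf;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      buf = (int *) malloc(count * sizeof(int));

      /* tight loop over a collective - no other communication or I/O */
      for (i = 0; i < iters; i++) {
          if (rank == 0) {
              buf[0] = i;
          }
          MPI_Bcast(buf, count, MPI_INT, 0, MPI_COMM_WORLD);
      }

      if (rank == 0) {
          printf("completed %d broadcasts\n", iters);
      }

      free(buf);
      MPI_Finalize();
      return 0;
  }

Nothing about it is exotic - it is just a tight loop over a collective, with all procs on-node so the shared memory BTL is in play.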
The condition is triggered when the application performs a loop over collective operations such as MPI_Allgather, MPI_Reduce, and MPI_Bcast. This list is not intended to be exhaustive; it only represents the ones for which we have solid, repeatable data. The symptom is a "hanging" job, typically (but not always!) associated with fully-consumed memory. The loops do not have to move substantial amounts of memory (the Bcast loop hangs after moving a whole 32 MBytes, total), nor do they need high loop counts - they only have to repeatedly call the collective.

Disabling the shared memory BTL is enough to completely resolve the problem. However, that carries a performance penalty we would like to avoid if possible. Our current workaround is therefore to use the "sync" collective to occasionally insert an MPI_Barrier into the code "behind the scenes" - i.e., to add an MPI_Barrier call every N calls to the "problem" collectives. The argument in favor of this was that the hang is caused by memory consumed by "unexpected messages", principally because the root process in the collective runs slower than the other procs. The notion goes that the root process falls further and further behind, consuming ever more memory until it simply cannot progress. Adding the barrier forces the other procs to "hold" until the root process catches up, thereby relieving the memory backlog.

The sync collective has worked for us, but we are now finding a very disconcerting behavior - namely, that the precise value of N required to avoid hanging (a) is very, very sensitive, so the app can still hang after changing the value by small amounts, (b) fluctuates between runs on an unpredictable basis, and (c) can differ between collectives. These new problems surfaced this week when we found that a job that previously ran fine with one value of coll_sync_barrier_before suddenly hung when a loop over MPI_Bcast was added to the code. Further investigation found that the value of N required to make the new loop work is significantly different from the value that made Allgather work, forcing an exhaustive search for a "sweet spot" for N. Clearly, as codes grow in complexity, this simply is not going to work.

It seems to me that we have to begin investigating -why- the 1.3.2 code encounters this problem when the 1.2.x code does not. From our rough measurements there is some speed difference between the two releases, so perhaps we are now fast enough to create the problem - but I don't think we know enough yet to really claim this is true. At this time we really don't know -why- one process is running slow, or even whether it is -always- the root process that does so...nor have we confirmed (to my knowledge) that our original analysis of the problem is correct!

We would appreciate any help with this problem. I gathered from today's telecon that others are also encountering it, so perhaps there is enough general pain to stimulate a team effort to resolve it!

Thanks
Ralph
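P.S. For completeness, the application-level equivalent of what the sync collective is doing for us is just a barrier every N iterations of the loop. A sketch of that, using the same setup as the Bcast loop above (the function name and the barrier_interval parameter are mine, purely for illustration; in reality we set N via the coll_sync_barrier_before MCA parameter rather than touching application code):

  /* Same setup as the sketch above; only the loop changes. */
  static void bcast_loop_with_sync(int *buf, int count, int iters,
                                   int rank, int barrier_interval)
  {
      int i;
      for (i = 0; i < iters; i++) {
          if (rank == 0) {
              buf[0] = i;
          }
          MPI_Bcast(buf, count, MPI_INT, 0, MPI_COMM_WORLD);
          /* the "behind the scenes" barrier the sync component inserts,
             here every barrier_interval (i.e., N) calls */
          if ((i + 1) % barrier_interval == 0) {
              MPI_Barrier(MPI_COMM_WORLD);
          }
      }
  }

Disabling shared memory for a comparison run is just a matter of excluding the sm BTL (e.g., "-mca btl ^sm" on the mpirun command line).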