On Thu, Nov 06, 2014 at 04:29:44PM -0500, Joshua Ladd wrote:
> On Thursday, November 6, 2014, Nathan Hjelm <hje...@lanl.gov> wrote:
>
> > On Thu, Nov 06, 2014 at 04:06:23PM -0500, Joshua Ladd wrote:
> > > Nathan,
> > > Has this bug always been present in OpenIB or is this a recent
> > > addition? If this is regression, I would also be inclined to say
> > > that this is a
> >
> > The bug is as old as the message coalescing feature in the openib
> > btl. When the feature was added the openib btl no longer supported
> > calling btl_free on descriptors allocated by sendi (a serious bug).
> >
> > > blocker for 1.8.4. This is a SIGNIFICANT bug. Both Howard and I
> > > were quite surprised that all the while this code has been in use
> > > at LANL in production systems, this issue was never discovered.
> >
> > Don't know why it suddenly came up but in 1.8.1 we added a inline send
> > optimization to the MPI_Isend path. The optimization uses the btl_sendi
> > function to get the fragment on the wire without allocating a send
> > request. If this fails the btl fragment returned by sendi is released
> > with btl_free, a send request is allocated, and the normal send path is
> > used. The optimization was tested with the openib btl so I don't know
> > why this wasn't caught earlier. My guess is some other change may have
> > triggered it.
> >
> > We fixed the bug in 1.8.4 by totally disabling message coalescing. The
> > feature is meant to game the osu_mbw_mr test and does next to nothing
> > for real apps. Additionally, the inline send optimization makes the
> > feature less of a win with osu_mbw_mr anyway. We beat mvapich handily on
> > LANL systems without message coalescing.
>
> [josh] Can you point to the PR, Nathan? I didn't realize this was already
> addressed in the 1.8.4 prerelease. I would seek Howard's guidance as to
> whether this is an acceptable solution for LANL. Regardless of your
> opinion about the utility of MC, real decisions are made on the basis of
> those benchmarks, so I'm not entirely convinced of your argument
> here. OMPI, as we are all aware tends to be a target on the basis of
> these comparisons.

This was already discussed here. On LANL systems the message rates are
the same with and without the message coalescing "feature", so we are
turning it off and disabling it for 1.8.4.

As for the PR, it looks like Ralph has not merged it into 1.8.4 yet.

-Nathan
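
The inline-send fallback described in the quoted message (try sendi first, release any returned descriptor with the free call, then fall back to a send request) can be sketched roughly as follows. This is only a minimal illustration under stated assumptions: every type and function here (`mock_btl_t`, `mock_sendi`, `mock_free`, `mock_send_with_request`, `isend_sketch`) is a hypothetical stand-in, not Open MPI's actual BTL interface or signatures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins: none of these are Open MPI's real types. */
typedef struct mock_descriptor { size_t len; } mock_descriptor_t;

typedef struct mock_btl {
    bool wire_busy;          /* when true, the inline send cannot complete */
    int  live_descriptors;   /* descriptors handed out but not yet freed */
    int  requests_allocated; /* send requests created on the normal path */
} mock_btl_t;

typedef enum { SENT_INLINE, SENT_VIA_REQUEST, SEND_FAILED } send_path_t;

/* Mock of a sendi-style call: returns 0 if the fragment went out
 * immediately; otherwise returns nonzero and may hand back a
 * descriptor that the caller now owns. */
static int mock_sendi(mock_btl_t *btl, size_t len, mock_descriptor_t **des)
{
    *des = NULL;
    if (!btl->wire_busy) {
        return 0;
    }
    *des = malloc(sizeof **des);
    if (*des != NULL) {
        (*des)->len = len;
        btl->live_descriptors++;
    }
    return -1;
}

/* Mock of a btl_free-style call: releases a descriptor from mock_sendi. */
static void mock_free(mock_btl_t *btl, mock_descriptor_t *des)
{
    btl->live_descriptors--;
    free(des);
}

/* Mock of the normal path: allocate a send request and send through it. */
static int mock_send_with_request(mock_btl_t *btl, size_t len)
{
    (void)len;
    btl->requests_allocated++;
    return 0;
}

/* Try the inline path first; on failure, release the returned
 * descriptor before falling back to the request-based path. */
send_path_t isend_sketch(mock_btl_t *btl, size_t len)
{
    mock_descriptor_t *des = NULL;

    assert(btl != NULL);
    if (mock_sendi(btl, len, &des) == 0) {
        return SENT_INLINE;          /* no send request was needed */
    }
    if (des != NULL) {
        mock_free(btl, des);         /* the step a BTL must support */
    }
    return mock_send_with_request(btl, len) == 0 ? SENT_VIA_REQUEST
                                                 : SEND_FAILED;
}
```

In this sketch the fallback is only safe if the free call can handle a descriptor produced by the sendi call. Per the quoted message, that is the property the openib btl lost once message coalescing was added.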