On Sun, Mar 16, 2014 at 08:19:32AM -0700, Ralph Castain wrote:
>
> On Mar 15, 2014, at 10:19 PM, Hjelm, Nathan T <hje...@lanl.gov> wrote:
>
> > On Friday, March 14, 2014 8:48 PM, devel [devel-boun...@open-mpi.org] on
> > behalf of Ralph Castain [r...@open-mpi.org] wrote:
> >> To: Open MPI Developers
> >> Subject: [OMPI devel] 1.7.5 end-of-week status report
> >>
> >> Hi folks
> >>
> >> I have both good and bad news to report - first the good.
> >>
> >> OSHMEM now passes nearly all its tests on my Linux cluster (tcp). My hat
> >> is off to the Mellanox guys for getting this done, including getting our
> >> MTT repo tests complete.
> >>
> >> The MPI layer passes nearly all the IBM, Intel, and one-sided tests. Only
> >> a few failures.
> >>
> >> Now the bad. The coll/ml component continues to have problems, including
> >> segfaults, and I have discovered that the bcol and coll/ml code remains
> >> entangled (I thought it had been separated, but sadly not). I have
> >> therefore ompi_ignored coll/ml and bcol/ptpcoll.
> >
> > No need. I discovered a bug in my last coll/ml fix. It incorrectly handled
> > one of the possibly hierarchies. The bug is fixed in trunk and a CMR is
> > open for 1.7.5. In the future I will clean up this path but the fix should
> > have us working again.
>
> I'm glad you were able to patch it, but this still begs the question of what
> to do with coll/ml. It's disturbing that its existence alone was enough to
> break the Java bindings (and yes, I concede those aren't built by default or
> part of the MPI standard) without even traversing its code path, and we've
> had a lot of problems with errors when we do go thru it. More disturbing, you
> can't even cleanly no-build that component due to the unfortunate
> cross-linkage with bcol/ptpcoll, so we definitely need a note in NEWS to warn
> people they need to no-build both.

I thought ORNL had addressed the cross-linkage as well. I am sure they will
get a fix for that in the next couple of days.

> It's unclear to me how to handle this situation, so we'll need to discuss it
> at the telecon. At the very least, I think we need to ensure coll/ml is not
> the default for 1.7.5 as it doesn't appear to be ready for that role.

coll/ml is not the default. The issue here is that we have to generate and
parse the topology at collective select time. This will happen even if
coll/ml is not the highest priority collective component. I fixed the one
issue with parsing the topology and then an issue with that fix.

To be clear, the original issue only occurred on OSX with debug builds. This
is a setup LANL (and I am sure ORNL) doesn't test. I really didn't care about
the Java problem, but the fix was simple enough. It is easy to verify that
the code Jeff fixed was the only place in coll/ml where a large buffer was
put on the stack.

-Nathan