On Sun, Mar 16, 2014 at 08:19:32AM -0700, Ralph Castain wrote:
> 
> On Mar 15, 2014, at 10:19 PM, Hjelm, Nathan T <hje...@lanl.gov> wrote:
> 
> > On Friday, March 14, 2014 8:48 PM, devel [devel-boun...@open-mpi.org] on 
> > behalf of Ralph Castain [r...@open-mpi.org] wrote:
> >> To: Open MPI Developers
> >> Subject: [OMPI devel] 1.7.5 end-of-week status report
> >> 
> >> Hi folks
> >> 
> >> I have both good and bad news to report - first the good.
> >> 
> >> OSHMEM now passes nearly all its tests on my Linux cluster (tcp). My hat 
> >> is off to the Mellanox guys for getting this done, including getting our 
> >> MTT repo tests complete.
> >> 
> >> The MPI layer passes nearly all the IBM, Intel, and one-sided tests. Only 
> >> a few failures.
> >> 
> >> Now the bad. The coll/ml component continues to have problems, including 
> >> segfaults, and I have discovered that the bcol and coll/ml code remains 
> >> entangled (I thought it had been separated, but sadly not). I have 
> >> therefore ompi_ignored coll/ml and bcol/ptpcoll.
> > 
> > No need. I discovered a bug in my last coll/ml fix. It incorrectly handled 
> > one of the possible hierarchies. The bug is fixed in trunk and a CMR is 
> > open for 1.7.5. In the future I will clean up this path but the fix should 
> > have us working again.
> 
> I'm glad you were able to patch it, but this still begs the question of what 
> to do with coll/ml. It's disturbing that its existence alone was enough to 
> break the Java bindings (and yes, I concede those aren't built by default or 
> part of the MPI standard) without even traversing its code path, and we've 
> had a lot of problems with errors when we do go thru it. More disturbing, you 
> can't even cleanly no-build that component due to the unfortunate 
> cross-linkage with bcol/ptpcoll, so we definitely need a note in NEWS to warn 
> people they need to no-build both.

I thought ORNL had addressed the cross-linkage as well. I am sure they
will get a fix for that in the next couple of days.

> It's unclear to me how to handle this situation, so we'll need to discuss it 
> at the telecon. At the very least, I think we need to ensure coll/ml is not 
> the default for 1.7.5 as it doesn't appear to be ready for that role.

coll/ml is not the default. The issue here is that we have to generate
and parse the topology at collective select time. This will happen even
if coll/ml is not the highest priority collective component. I fixed the
one issue with parsing the topology and then an issue with that
fix. To be clear, the original issue only occurred on OS X with debug
builds. This is a setup LANL (and I am sure ORNL) doesn't test.
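
To illustrate why a component's setup can run even when it is not the
one chosen, here is a minimal sketch. It assumes selection queries every
available component before picking the highest priority. The names
(component_t, component_query, select_component) and the priorities are
hypothetical and are not Open MPI's actual MCA structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical component descriptor, for illustration only. */
typedef struct {
    const char *name;
    int priority;
    int query_calls;   /* counts how often the query hook ran */
} component_t;

/* Stand-in for a per-component query hook. Any setup work placed here
 * (for example, building and parsing a topology) runs for every
 * component that is queried, not only for the eventual winner. */
static int component_query(component_t *c)
{
    c->query_calls++;
    return c->priority;
}

/* Query every component, then return the one with the highest priority. */
static component_t *select_component(component_t *comps, size_t n)
{
    assert(comps != NULL);
    component_t *best = NULL;
    int best_prio = -1;
    for (size_t i = 0; i < n; i++) {
        int prio = component_query(&comps[i]);
        if (prio > best_prio) {
            best_prio = prio;
            best = &comps[i];
        }
    }
    return best;
}
```

In this sketch a bug in the query-time work of a low-priority component
still surfaces, because the hook runs before priorities are compared.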

I really didn't care about the Java problem but the fix was simple
enough. It is easy to verify that the code Jeff fixed was the only place
in coll/ml where a large buffer was put on the stack.
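
For reference, one common way to avoid a large stack buffer is to
allocate the scratch space on the heap. The sketch below shows that
general pattern only; it is not necessarily the change that was made in
coll/ml, and the function name and buffer use are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sums nbytes of scratch data held on the heap instead of in a large
 * automatic (stack) array. Returns -1 if the allocation fails. */
static long sum_with_heap_scratch(size_t nbytes)
{
    assert(nbytes > 0);

    unsigned char *scratch = malloc(nbytes);
    if (scratch == NULL) {
        return -1;              /* report failure instead of crashing */
    }

    memset(scratch, 1, nbytes); /* placeholder for the real work */

    long sum = 0;
    for (size_t i = 0; i < nbytes; i++) {
        sum += scratch[i];
    }

    free(scratch);              /* released on every successful path */
    return sum;
}
```

The trade-off is an allocation per call and an explicit failure path, in
exchange for not depending on the stack size of the calling thread.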

-Nathan
