> On Sep 11, 2015, at 10:00 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
> Ralph,
>
> at first glance, these errors look unrelated to PMIx.
> I noticed a bunch of bind() failures.
> based on your command line, I guess you are not running your job via a batch manager,
> and I would guess not all unix sockets are always cleaned up.

Yeah, the no-disconnect test was causing mpirun to segfault, which meant that the sockets weren’t cleaned up. So eventually I’d hit a case where they collided. Simply blowing away the session directory tree resolves the problem.

> (or this is an old bug and you did not manually clean your nodes when it was fixed)
>
> the neighbor_allgather_self failure is discussed at
> https://github.com/open-mpi/ompi/pull/790

Ah, indeed - thanks!

> I will have a look at the op related failure on Monday
> (looks like an MPI conformance issue unrelated to PMIx)

Again, thanks!

> Cheers,
>
> Gilles
>
> On Saturday, September 12, 2015, Ralph Castain <r...@open-mpi.org> wrote:
> Hi folks
>
> I’ve closed all the holes I can find in the PMIx integration, and things look
> pretty good overall. There are a handful of failures still being seen - most
> of them involving what appear to be unrelated code. I’m not entirely sure I
> understand the source of the errors, and could really use some help to
> determine (a) if these are in any way related to PMIx, and if so (b) how.
>
> The errors from my MTT run are here:
> http://mtt.open-mpi.org/index.php?do_redir=2256
>
> Any help diagnosing these problems would be greatly appreciated
> Ralph
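
For anyone following along, the stale-socket collision discussed in this thread can be reproduced outside Open MPI with a short sketch. This is only an illustration, not Open MPI or PMIx code: the directory prefix, the socket file name, and the helpers bind_unix() and demo() are all hypothetical. It assumes a POSIX system where bind() on an existing Unix-socket path fails with EADDRINUSE.

```python
import errno
import os
import socket
import tempfile


def bind_unix(path):
    """Bind a Unix-domain stream socket at `path` (hypothetical helper)."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.bind(path)
    except OSError:
        sock.close()
        raise
    return sock


def demo():
    # Stand-in for a session directory; the real layout is not modeled here.
    session_dir = tempfile.mkdtemp(prefix="session-")
    path = os.path.join(session_dir, "rendezvous.sock")

    # First "run": bind, then go away without unlinking, as a crashed
    # launcher would. Closing the socket does not remove the file.
    first = bind_unix(path)
    first.close()
    stale_exists = os.path.exists(path)

    # Second "run": bind() on the leftover path fails.
    try:
        bind_unix(path).close()
        collided = False
    except OSError as exc:
        collided = exc.errno == errno.EADDRINUSE

    # Removing the stale file (here just the one socket) lets bind() succeed.
    os.unlink(path)
    second = bind_unix(path)
    rebound = True
    second.close()

    # Clean up properly this time.
    os.unlink(path)
    os.rmdir(session_dir)
    return stale_exists, collided, rebound


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is only that the socket file outlives the process that created it unless something unlinks it, which is why removing the leftover files clears the bind() errors.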