I'm not sure, but I guess it's related to Gilles's ticket. As Ralph pointed out, it's quite a bad binding pattern, so checking for that condition and having coll/ml disqualify itself could be a practical solution as well.
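For what it's worth, below is a rough standalone sketch of the kind of check I have in mind, using hwloc directly rather than our locality flags (I'm using the hwloc 1.x HWLOC_OBJ_SOCKET name; newer hwloc calls it HWLOC_OBJ_PACKAGE). It just counts how many sockets the process's CPU binding intersects - the real coll/ml selection logic would of course look different, this is only to illustrate the condition:

/* Sketch only: count how many sockets the current process's CPU binding
 * intersects.  A coll/ml-style selection check could disqualify the
 * component when the count is greater than one.  This is plain hwloc,
 * not the actual Open MPI locality-flag code.
 */
#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_cpuset_t binding;
    int i, nsockets, spanned = 0;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    binding = hwloc_bitmap_alloc();
    /* Where is this process currently allowed to run? */
    if (hwloc_get_cpubind(topo, binding, HWLOC_CPUBIND_PROCESS) < 0) {
        fprintf(stderr, "could not query binding (treat as unbound)\n");
        hwloc_bitmap_free(binding);
        hwloc_topology_destroy(topo);
        return 1;
    }

    nsockets = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_SOCKET);
    for (i = 0; i < nsockets; i++) {
        hwloc_obj_t sobj = hwloc_get_obj_by_type(topo, HWLOC_OBJ_SOCKET, i);
        if (hwloc_bitmap_intersects(binding, sobj->cpuset)) {
            spanned++;
        }
    }

    if (spanned > 1) {
        printf("binding spans %d sockets -> a component could disqualify itself here\n", spanned);
    } else {
        printf("binding is contained in a single socket\n");
    }

    hwloc_bitmap_free(binding);
    hwloc_topology_destroy(topo);
    return 0;
}

If the count comes back greater than one, coll/ml could simply fail its selection/query step and let the other coll components take over. A proper locality flag like "bound-to-multiple-sockets" is probably still the cleaner long-term fix, as you discussed below.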
Tetsuya

> It is related, but it means that coll/ml has a higher degree of sensitivity to the binding pattern than what you reported (which was that coll/ml doesn't work with unbound processes). What we are now seeing is that coll/ml also doesn't work when processes are bound across sockets.
>
> Which means that Nathan's revised tests are going to have to cover a lot more corner cases. Our locality flags don't currently include "bound-to-multiple-sockets", and I'm not sure how he is going to easily resolve that case.
>
>
> On Jun 19, 2014, at 8:02 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>
>> Ralph and Tetsuya,
>>
>> is this related to the hang i reported at
>> http://www.open-mpi.org/community/lists/devel/2014/06/14975.php ?
>>
>> Nathan already replied he is working on a fix.
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On 2014/06/20 11:54, Ralph Castain wrote:
>>> My guess is that the coll/ml component may have problems with binding a single process across multiple cores like that - it might be that we'll have to have it check for that condition and disqualify itself. It is a particularly bad binding pattern, though, as shared memory gets completely messed up when you split that way.
>>>
>>