It is related, but it means that coll/ml is more sensitive to the binding pattern than what you reported (which was that coll/ml doesn't work with unbound processes). What we are now seeing is that coll/ml also doesn't work when processes are bound across sockets.
Which means that Nathan's revised tests are going to have to cover a lot more corner cases. Our locality flags don't currently include "bound-to-multiple-sockets", and I'm not sure how he is going to easily resolve that case.

On Jun 19, 2014, at 8:02 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:

> Ralph and Tetsuya,
>
> is this related to the hang i reported at
> http://www.open-mpi.org/community/lists/devel/2014/06/14975.php ?
>
> Nathan already replied he is working on a fix.
>
> Cheers,
>
> Gilles
>
>
> On 2014/06/20 11:54, Ralph Castain wrote:
>> My guess is that the coll/ml component may have problems with binding a
>> single process across multiple cores like that - it might be that we'll have
>> to have it check for that condition and disqualify itself. It is a
>> particularly bad binding pattern, though, as shared memory gets completely
>> messed up when you split that way.
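To make the "bound-to-multiple-sockets" case concrete, here is a minimal sketch of the kind of check a component could run before deciding to disqualify itself. It is only an illustration under stated assumptions: each socket and each process binding is modeled as a plain 64-bit CPU mask, and the names `bind_class_t` and `classify_binding` are hypothetical, not anything in the Open MPI tree. A real check would presumably work on hwloc cpusets rather than raw masks.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classification of a process binding (not an Open MPI type). */
typedef enum {
    BIND_NONE,          /* no CPUs set, or every CPU set: effectively unbound */
    BIND_SINGLE_SOCKET, /* all bound CPUs live in one socket */
    BIND_MULTI_SOCKET   /* bound CPUs are split across two or more sockets */
} bind_class_t;

/*
 * Classify a binding mask against per-socket CPU masks.
 * Assumption: bit i of a mask stands for logical CPU i, at most 64 CPUs.
 */
static bind_class_t classify_binding(uint64_t binding,
                                     const uint64_t *socket_masks,
                                     size_t nsockets)
{
    uint64_t all_cpus = 0;  /* union of every socket's CPUs */
    size_t touched = 0;     /* number of sockets the binding intersects */

    for (size_t i = 0; i < nsockets; i++) {
        all_cpus |= socket_masks[i];
        if (binding & socket_masks[i]) {
            touched++;
        }
    }

    /* An empty binding or one covering the whole machine is treated as unbound. */
    if (touched == 0 || (binding & all_cpus) == all_cpus) {
        return BIND_NONE;
    }
    return (touched > 1) ? BIND_MULTI_SOCKET : BIND_SINGLE_SOCKET;
}
```

With two sockets described as `0x0F` and `0xF0`, a binding of `0x18` (one CPU from each socket) comes back as `BIND_MULTI_SOCKET`, which is the pattern reported as failing here, while `0xFF` is treated as unbound. Whether a component should disqualify itself on both results or only one is a policy decision that this sketch does not settle.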