It is related, but it means that coll/ml is more sensitive to the binding 
pattern than what you reported (which was that coll/ml doesn't work with 
unbound processes). What we are now seeing is that coll/ml also doesn't work 
when a process is bound across sockets.

That means Nathan's revised tests are going to have to cover a lot more 
corner cases. Our locality flags don't currently include a 
"bound-to-multiple-sockets" case, and I'm not sure how he is going to easily 
resolve it.
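
To make the missing case concrete, here is a rough sketch of how a binding 
could be classified against the socket layout. It is only an illustration: 
the function names, the label strings, and the representation of the 
topology as plain sets of CPU ids are all assumptions made for this example, 
not Open MPI's locality flags or the coll/ml code.

```python
def sockets_spanned(bound_cpus, socket_cpus):
    """Return the ids of sockets holding at least one CPU of the binding.

    bound_cpus  -- set of CPU ids the process is bound to (hypothetical input)
    socket_cpus -- dict mapping socket id -> set of CPU ids on that socket
    """
    return [sid for sid, cpus in sorted(socket_cpus.items()) if cpus & bound_cpus]


def classify_binding(bound_cpus, socket_cpus):
    """Classify a binding; the labels are illustrative, not real flags."""
    all_cpus = set().union(*socket_cpus.values())
    # A binding covering every CPU on the node is treated as unbound.
    if bound_cpus >= all_cpus:
        return "unbound"
    if len(sockets_spanned(bound_cpus, socket_cpus)) > 1:
        return "bound-to-multiple-sockets"
    return "bound-within-socket"


# Two sockets with four CPUs each (made-up topology).
topology = {0: {0, 1, 2, 3}, 1: {4, 5, 6, 7}}
print(classify_binding({3, 4}, topology))
```

A component could use a check along these lines to disqualify itself when 
the binding straddles sockets, which is the condition discussed in the 
quoted message below.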


On Jun 19, 2014, at 8:02 PM, Gilles Gouaillardet 
<gilles.gouaillar...@iferc.org> wrote:

> Ralph and Tetsuya,
> 
> is this related to the hang I reported at
> http://www.open-mpi.org/community/lists/devel/2014/06/14975.php ?
> 
> Nathan already replied he is working on a fix.
> 
> Cheers,
> 
> Gilles
> 
> 
> On 2014/06/20 11:54, Ralph Castain wrote:
>> My guess is that the coll/ml component may have problems with binding a 
>> single process across multiple cores like that - it might be that we'll have 
>> to have it check for that condition and disqualify itself. It is a 
>> particularly bad binding pattern, though, as shared memory gets completely 
>> messed up when you split that way.
>> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/06/15033.php
