I'm not sure, but I guess it's related to Gilles's ticket.
It's quite a bad binding pattern, as Ralph pointed out, so
checking for that condition and disqualifying coll/ml could
be a practical solution as well.
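
For what it's worth, here is a rough sketch (untested, just to illustrate the
idea, not the actual coll/ml selection code) of how such a check might look
with hwloc: compare the process's binding cpuset against each socket's cpuset
and disqualify if it intersects more than one. The function name is made up,
and inside Open MPI the already-loaded topology (opal_hwloc_topology) would
presumably be used instead of loading a fresh one.

#include <stdbool.h>
#include <hwloc.h>

/* Sketch only: return true if the calling process's binding straddles
 * more than one socket (or we cannot tell), i.e. the cases where
 * coll/ml currently misbehaves. */
static bool binding_spans_multiple_sockets(hwloc_topology_t topo)
{
    hwloc_bitmap_t binding = hwloc_bitmap_alloc();
    int nsockets, i, hits = 0;

    /* Where is this process currently bound? */
    if (hwloc_get_cpubind(topo, binding, HWLOC_CPUBIND_PROCESS) < 0) {
        hwloc_bitmap_free(binding);
        return true;   /* can't tell -- treat as unsafe */
    }

    nsockets = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_SOCKET);
    for (i = 0; i < nsockets; i++) {
        hwloc_obj_t sock = hwloc_get_obj_by_type(topo, HWLOC_OBJ_SOCKET, i);
        if (hwloc_bitmap_intersects(binding, sock->cpuset)) {
            hits++;
        }
    }

    hwloc_bitmap_free(binding);

    /* hits > 1: bound across sockets; hits != 1 also catches the
     * unbound case, since an unbound cpuset intersects every socket. */
    return (hits != 1);
}

An unbound process would intersect every socket and so would also be caught
by this check, which seems consistent with the earlier report that coll/ml
doesn't work with unbound processes either.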

Tetsuya

> It is related, but it means that coll/ml has a higher degree of
> sensitivity to the binding pattern than what you reported (which was that
> coll/ml doesn't work with unbound processes). What we are now
> seeing is that coll/ml also doesn't work when processes are bound across
> sockets.
>
> Which means that Nathan's revised tests are going to have to cover a lot
> more corner cases. Our locality flags don't currently include
> "bound-to-multiple-sockets", and I'm not sure how he is going to
> easily resolve that case.
>
>
> On Jun 19, 2014, at 8:02 PM, Gilles Gouaillardet
<gilles.gouaillar...@iferc.org> wrote:
>
> > Ralph and Tetsuya,
> >
> > is this related to the hang i reported at
> > http://www.open-mpi.org/community/lists/devel/2014/06/14975.php ?
> >
> > Nathan already replied he is working on a fix.
> >
> > Cheers,
> >
> > Gilles
> >
> >
> > On 2014/06/20 11:54, Ralph Castain wrote:
> >> My guess is that the coll/ml component may have problems with binding
> >> a single process across multiple cores like that - it might be that we'll
> >> have to have it check for that condition and disqualify itself. It is a
> >> particularly bad binding pattern, though, as shared memory gets
> >> completely messed up when you split that way.
> >>
> >
