Re: [OMPI devel] race condition in grpcomm/rcd

2014-09-12 Thread Gilles Gouaillardet
Ralph,

On Fri, Sep 12, 2014 at 10:54 AM, Ralph Castain  wrote:

> The design is supposed to be that each node knows precisely how many
> daemons are involved in each collective, and who is going to talk to them.


ok, but the design does not ensure that things will happen in the right
order:
- enter the allgather
- receive data from the daemon at distance 1
- receive data from the daemon at distance 2
- and so on

with the current implementation, when 2 daemons are involved, if a daemon enters
the allgather after it has received data from its peer, then the MPI processes
local to this daemon will hang

with 4 nodes, it gets trickier:
0 enters the allgather and sends a message to 1
1 receives the message and sends to 2, but with data from 0 only
/* 1 did not enter the allgather, so its own data cannot be sent to 2 */

this issue did not occur before the persistent receive:
no receive was posted if the daemon had not entered the allgather
(a toy sketch of the exchange schedule follows below)
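
Here is that sketch: a toy C program (names and the exact pairing rule are
assumptions on my part, not the actual grpcomm/rcd code) that only prints who
sends its bucket to whom at each distance, following the trace above:

#include <stdio.h>

/* Toy illustration of a doubling-distance allgather schedule among
 * num_daemons, following the trace above: at each distance d every
 * daemon sends its bucket to (me + d) and expects one from (me - d).
 * This is only an illustration; the real rcd pairing may differ
 * (e.g. an XOR-based exchange). */
static void print_schedule(int num_daemons)
{
    for (int distance = 1; distance < num_daemons; distance <<= 1) {
        for (int me = 0; me < num_daemons; me++) {
            int to   = (me + distance) % num_daemons;
            int from = (me - distance + num_daemons) % num_daemons;
            printf("distance %d: daemon %d sends its bucket to %d, expects one from %d\n",
                   distance, me, to, from);
        }
    }
}

int main(void)
{
    /* with 4 daemons: at distance 1, 0 sends to 1 and 1 sends to 2 --
     * so if 1 has not entered the allgather yet, the bucket it forwards
     * to 2 only carries data from 0, as described above */
    print_schedule(4);
    return 0;
}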


The signature contains the info required to ensure the receiver knows which
> collective this message relates to, and just happens to also allow them to
> lookup the number of daemons involved (the base function takes care of that
> for them).
>
>
ok too, this issue was solved with the persistent receive

So there is no need for a "pending" list - if you receive a message about a
> collective you don't yet know about, you just put it on the ongoing
> collective list. You should only receive it if you are going to be involved
> - i.e., you have local procs that are going to participate. So you wait
> until your local procs participate, and then pass your collected bucket
> along.
>
ok, I did something similar
(i.e., pass all the available data)
some data might be passed twice, but that might not be an issue
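
As a rough sketch of the "no pending list" approach described above -- a
message carrying an unknown signature simply creates a tracker on the ongoing
collective list -- something like the following; the struct and function names
are hypothetical, not the actual grpcomm base code:

#include <stdlib.h>
#include <string.h>

/* Rough sketch: look up the collective tracker for a signature and
 * create one on the ongoing list if it is not known yet.  Names are
 * made up, not the actual grpcomm base API. */
typedef struct coll_tracker {
    char signature[64];          /* identifies the collective          */
    int ndaemons;                /* daemons involved in the collective */
    unsigned int nreported;      /* contributions received so far      */
    struct coll_tracker *next;
} coll_tracker_t;

static coll_tracker_t *ongoing_colls = NULL;

static coll_tracker_t *lookup_or_create_coll(const char *sig, int ndaemons)
{
    coll_tracker_t *coll;
    for (coll = ongoing_colls; NULL != coll; coll = coll->next) {
        if (0 == strcmp(coll->signature, sig)) {
            return coll;              /* already on the ongoing list */
        }
    }
    /* unknown signature: no pending list, just add it to the ongoing list */
    coll = calloc(1, sizeof(*coll));
    strncpy(coll->signature, sig, sizeof(coll->signature) - 1);
    coll->ndaemons = ndaemons;
    coll->next = ongoing_colls;
    ongoing_colls = coll;
    return coll;
}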


> I suspect the link to the local procs isn't being correctly dealt with,
> else you couldn't be hanging. Or the rcd isn't correctly passing incoming
> messages to the base functions to register the collective.
>
> I'll look at it over the weekend and can resolve it then.
>
>
The attached patch is an illustration of what I was trying to explain:
coll->nreported is used by rcd as a bitmask of the received messages
(bit 0 is for the local daemon, bit n for the daemon at distance n)
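
For clarity, this is roughly what using coll->nreported as a bitmask amounts
to; a toy sketch where the exact bit layout and names are my own assumptions,
not the patch itself:

#include <stdbool.h>
#include <stdio.h>

/* Toy sketch of completion tracking with a bitmask, following the
 * description above: bit 0 marks the local daemon's contribution and
 * bit d marks the bucket received from the daemon at distance d
 * (d = 1, 2, 4, ...).  Layout and names are assumptions. */
static bool coll_complete(unsigned int nreported, int num_daemons)
{
    unsigned int expected = 1u;              /* bit 0: local daemon       */
    for (int d = 1; d < num_daemons; d <<= 1) {
        expected |= 1u << d;                 /* bit d: peer at distance d */
    }
    return expected == (nreported & expected);
}

int main(void)
{
    unsigned int nreported = 0;
    nreported |= 1u << 0;   /* local procs entered the allgather        */
    nreported |= 1u << 1;   /* bucket received from the distance-1 peer */
    printf("complete with 4 daemons? %s\n",
           coll_complete(nreported, 4) ? "yes" : "no");  /* still missing distance 2 */
    return 0;
}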

I was still debugging a race condition:
if daemons 2 and 3 enter the allgather at the same time, they will send a
message to each other simultaneously and the RML fails to establish the
connection.  I could not determine whether this is linked to my changes...

Cheers,

Gilles

>
> On Sep 11, 2014, at 5:23 PM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
> > Ralph,
> >
> > you are right, this was definitely not the right fix (at least with 4
> > nodes or more)
> >
> > I finally understood what is going wrong here:
> > to make it simple, the allgather recursive doubling algo is not
> > implemented with MPI_Recv(...,peer,...)-like functions but with
> > MPI_Recv(...,MPI_ANY_SOURCE,...)-like functions,
> > and that makes things slightly more complicated.
> > right now:
> > - with two nodes: if node 1 is late, it gets stuck in the allgather
> > - with four nodes: if node 0 is first, then nodes 2 and 3, while node 1
> > is still late, then node 0 will likely leave the allgather even though
> > it did not receive anything from node 1
> > - and so on
> >
> > i think i can fix that from now
> >
> > Cheers,
> >
> > Gilles
> >
> > On 2014/09/11 23:47, Ralph Castain wrote:
> >> Yeah, that's not the right fix, I'm afraid. I've made the direct
> component the default again until I have time to dig into this deeper.
> >>
> >> On Sep 11, 2014, at 4:02 AM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
> >>
> >>> Ralph,
> >>>
> >>> the root cause is that when the second orted/mpirun runs rcd_finalize_coll,
> >>> it does not invoke pmix_server_release,
> >>> because allgather_stub was not previously invoked since the fence
> >>> was not yet entered.
> >>> /* in rcd_finalize_coll, coll->cbfunc is NULL */
> >>>
> >>> the attached patch is likely not the right fix, it was very lightly
> >>> tested, but so far, it works for me ...
> >>>
> >>> Cheers,
> >>>
> >>> Gilles
> >>>
> >>> On 2014/09/11 16:11, Gilles Gouaillardet wrote:
>  Ralph,
> 
>  things got worse indeed :-(
> 
>  now a simple hello world involving two hosts hangs in MPI_Init.
>  there is still a race condition: if task a calls fence long after task b,
>  then task b will never leave the fence
> 
>  I'll try to debug this ...
> 
>  Cheers,
> 
>  Gilles
> 
>  On 2014/09/11 2:36, Ralph Castain wrote:
> > I think I now have this fixed - let me know what you see.
> >
> >
> > On Sep 9, 2014, at 6:15 AM, Ralph Castain  wrote:
> >
> >> Yeah, that's not the correct fix. The right way to fix it is for
> all three components to have their own RML tag, and for each of them to
> establish a persistent receive. They then can use the signature to tell
> which c

Re: [OMPI devel] race condition in grpcomm/rcd

2014-09-12 Thread Ralph Castain

On Sep 12, 2014, at 5:45 AM, Gilles Gouaillardet 
 wrote:

> Ralph,
> 
> On Fri, Sep 12, 2014 at 10:54 AM, Ralph Castain  wrote:
> The design is supposed to be that each node knows precisely how many daemons 
> are involved in each collective, and who is going to talk to them.
> 
> ok, but the design does not ensure that things will happen in the right
> order:
> - enter the allgather
> - receive data from the daemon at distance 1
> - receive data from the daemon at distance 2
> - and so on
> 
> with the current implementation, when 2 daemons are involved, if a daemon enters
> the allgather after it has received data from its peer, then the MPI processes
> local to this daemon will hang
> 
> with 4 nodes, it gets trickier:
> 0 enters the allgather and sends a message to 1
> 1 receives the message and sends to 2, but with data from 0 only
> /* 1 did not enter the allgather, so its own data cannot be sent to 2 */

It's just a bug in the rcd logic, Gilles. I'll take a look and get it fixed - 
just don't have time right now

> 
> this issue did not occur before the persistent receive:
> no receive was posted if the daemon had not entered the allgather
> 
> 
> The signature contains the info required to ensure the receiver knows which 
> collective this message relates to, and just happens to also allow them to 
> lookup the number of daemons involved (the base function takes care of that 
> for them).
> 
>  
> ok too, this issue was solved with the persistent receive
> 
> So there is no need for a "pending" list - if you receive a message about a 
> collective you don't yet know about, you just put it on the ongoing 
> collective list. You should only receive it if you are going to be involved - 
> i.e., you have local procs that are going to participate. So you wait until 
> your local procs participate, and then pass your collected bucket along.
> 
> ok, I did something similar
> (i.e., pass all the available data)
> some data might be passed twice, but that might not be an issue
>  
> I suspect the link to the local procs isn't being correctly dealt with, else 
> you couldn't be hanging. Or the rcd isn't correctly passing incoming messages 
> to the base functions to register the collective.
> 
> I'll look at it over the weekend and can resolve it then.
> 
> 
> The attached patch is an illustration of what I was trying to explain:
> coll->nreported is used by rcd as a bitmask of the received messages
> (bit 0 is for the local daemon, bit n for the daemon at distance n)
> 
> I was still debugging a race condition:
> if daemons 2 and 3 enter the allgather at the same time, they will send a
> message to each other simultaneously and the RML fails to establish the
> connection.  I could not determine whether this is linked to my changes...
> 
> Cheers,
> 
> Gilles
> 
> On Sep 11, 2014, at 5:23 PM, Gilles Gouaillardet 
>  wrote:
> 
> > Ralph,
> >
> > you are right, this was definitely not the right fix (at least with 4
> > nodes or more)
> >
> > I finally understood what is going wrong here:
> > to make it simple, the allgather recursive doubling algo is not
> > implemented with MPI_Recv(...,peer,...)-like functions but with
> > MPI_Recv(...,MPI_ANY_SOURCE,...)-like functions,
> > and that makes things slightly more complicated.
> > right now:
> > - with two nodes: if node 1 is late, it gets stuck in the allgather
> > - with four nodes: if node 0 is first, then nodes 2 and 3, while node 1
> > is still late, then node 0 will likely leave the allgather even though
> > it did not receive anything from node 1
> > - and so on
> >
> > i think i can fix that from now
> >
> > Cheers,
> >
> > Gilles
> >
> > On 2014/09/11 23:47, Ralph Castain wrote:
> >> Yeah, that's not the right fix, I'm afraid. I've made the direct component 
> >> the default again until I have time to dig into this deeper.
> >>
> >> On Sep 11, 2014, at 4:02 AM, Gilles Gouaillardet 
> >>  wrote:
> >>
> >>> Ralph,
> >>>
> >>> the root cause is that when the second orted/mpirun runs rcd_finalize_coll,
> >>> it does not invoke pmix_server_release,
> >>> because allgather_stub was not previously invoked since the fence
> >>> was not yet entered.
> >>> /* in rcd_finalize_coll, coll->cbfunc is NULL */
> >>>
> >>> the attached patch is likely not the right fix, it was very lightly
> >>> tested, but so far, it works for me ...
> >>>
> >>> Cheers,
> >>>
> >>> Gilles
> >>>
> >>> On 2014/09/11 16:11, Gilles Gouaillardet wrote:
>  Ralph,
> 
>  things got worse indeed :-(
> 
>  now a simple hello world involving two hosts hangs in MPI_Init.
>  there is still a race condition: if task a calls fence long after task b,
>  then task b will never leave the fence
> 
>  I'll try to debug this ...
> 
>  Cheers,
> 
>  Gilles
> 
>  On 2014/09/11 2:36, Ralph Castain wrote:
> > I think I now have this fixed - let me know what you see.
> >
> >
> > On Sep 9, 2014, at 6:15 AM, Ralph Castain  wrote:
> >
> 

Re: [OMPI devel] race condition in grpcomm/rcd

2014-09-12 Thread Joshua Ladd
Let me know if Nadia can help here, Ralph.

Josh


On Fri, Sep 12, 2014 at 9:31 AM, Ralph Castain  wrote:

>
> On Sep 12, 2014, at 5:45 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
> Ralph,
>
> On Fri, Sep 12, 2014 at 10:54 AM, Ralph Castain  wrote:
>
>> The design is supposed to be that each node knows precisely how many
>> daemons are involved in each collective, and who is going to talk to them.
>
>
> ok, but the design does not ensure that things will happen in the right
> order:
> - enter the allgather
> - receive data from the daemon at distance 1
> - receive data from the daemon at distance 2
> - and so on
>
> with the current implementation, when 2 daemons are involved, if a daemon
> enters the allgather after it has received data from its peer, then the MPI
> processes local to this daemon will hang
>
> with 4 nodes, it gets trickier:
> 0 enters the allgather and sends a message to 1
> 1 receives the message and sends to 2, but with data from 0 only
> /* 1 did not enter the allgather, so its own data cannot be sent to 2 */
>
>
> It's just a bug in the rcd logic, Gilles. I'll take a look and get it
> fixed - just don't have time right now
>
>
> this issue did not occur before the persistent receive:
> no receive was posted if the daemon had not entered the allgather
>
>
> The signature contains the info required to ensure the receiver knows
>> which collective this message relates to, and just happens to also allow
>> them to lookup the number of daemons involved (the base function takes care
>> of that for them).
>>
>>
> ok too, this issue was solved with the persistent receive
>
> So there is no need for a "pending" list - if you receive a message about
>> a collective you don't yet know about, you just put it on the ongoing
>> collective list. You should only receive it if you are going to be involved
>> - i.e., you have local procs that are going to participate. So you wait
>> until your local procs participate, and then pass your collected bucket
>> along.
>>
> ok, I did something similar
> (i.e., pass all the available data)
> some data might be passed twice, but that might not be an issue
>
>
>> I suspect the link to the local procs isn't being correctly dealt with,
>> else you couldn't be hanging. Or the rcd isn't correctly passing incoming
>> messages to the base functions to register the collective.
>>
>> I'll look at it over the weekend and can resolve it then.
>>
>>
> The attached patch is an illustration of what I was trying to explain:
> coll->nreported is used by rcd as a bitmask of the received messages
> (bit 0 is for the local daemon, bit n for the daemon at distance n)
>
> I was still debugging a race condition:
> if daemons 2 and 3 enter the allgather at the same time, they will send a
> message to each other simultaneously and the RML fails to establish the
> connection.  I could not determine whether this is linked to my changes...
>
> Cheers,
>
> Gilles
>
>>
>> On Sep 11, 2014, at 5:23 PM, Gilles Gouaillardet <
>> gilles.gouaillar...@iferc.org> wrote:
>>
>> > Ralph,
>> >
>> > you are right, this was definitely not the right fix (at least with 4
>> > nodes or more)
>> >
>> > I finally understood what is going wrong here:
>> > to make it simple, the allgather recursive doubling algo is not
>> > implemented with MPI_Recv(...,peer,...)-like functions but with
>> > MPI_Recv(...,MPI_ANY_SOURCE,...)-like functions,
>> > and that makes things slightly more complicated.
>> > right now:
>> > - with two nodes: if node 1 is late, it gets stuck in the allgather
>> > - with four nodes: if node 0 is first, then nodes 2 and 3, while node 1
>> > is still late, then node 0 will likely leave the allgather even though
>> > it did not receive anything from node 1
>> > - and so on
>> >
>> > i think i can fix that from now
>> >
>> > Cheers,
>> >
>> > Gilles
>> >
>> > On 2014/09/11 23:47, Ralph Castain wrote:
>> >> Yeah, that's not the right fix, I'm afraid. I've made the direct
>> component the default again until I have time to dig into this deeper.
>> >>
>> >> On Sep 11, 2014, at 4:02 AM, Gilles Gouaillardet <
>> gilles.gouaillar...@iferc.org> wrote:
>> >>
>> >>> Ralph,
>> >>>
>> >>> the root cause is that when the second orted/mpirun runs rcd_finalize_coll,
>> >>> it does not invoke pmix_server_release,
>> >>> because allgather_stub was not previously invoked since the fence
>> >>> was not yet entered.
>> >>> /* in rcd_finalize_coll, coll->cbfunc is NULL */
>> >>>
>> >>> the attached patch is likely not the right fix, it was very lightly
>> >>> tested, but so far, it works for me ...
>> >>>
>> >>> Cheers,
>> >>>
>> >>> Gilles
>> >>>
>> >>> On 2014/09/11 16:11, Gilles Gouaillardet wrote:
>>  Ralph,
>> 
>>  things got worse indeed :-(
>> 
>>  now a simple hello world involving two hosts hangs in MPI_Init.
>>  there is still a race condition: if task a calls fence long after task b,
>>  then task b will never leave the fence
>> 
>>  i ll try to debug th

Re: [OMPI devel] Need to know your Github ID

2014-09-12 Thread Brad Benton
bbenton -> bbenton

On Wed, Sep 10, 2014 at 5:46 AM, Jeff Squyres (jsquyres)  wrote:

> As the next step of the planned migration to Github, I need to know:
>
> - Your Github ID (so that you can be added to the new OMPI git repo)
> - Your SVN ID (so that I can map SVN->Github IDs, and therefore map Trac
> tickets to appropriate owners)
>
> Here's the list of SVN IDs who have committed over the past year -- I'm
> guessing that most of these people will need Github IDs:
>
>  adrian
>  alekseys
>  alex
>  alinas
>  amikheev
>  bbenton
>  bosilca (done)
>  bouteill
>  brbarret
>  bwesarg
>  devendar
>  dgoodell (done)
>  edgar
>  eugene
>  ggouaillardet
>  hadi
>  hjelmn
>  hpcchris
>  hppritcha
>  igoru
>  jjhursey (done)
>  jladd
>  jroman
>  jsquyres (done)
>  jurenz
>  kliteyn
>  manjugv
>  miked (done)
>  mjbhaskar
>  mpiteam (done)
>  naughtont
>  osvegis
>  pasha
>  regrant
>  rfaucett
>  rhc (done)
>  rolfv (done)
>  samuel
>  shiqing
>  swise
>  tkordenbrock
>  vasily
>  vvenkates
>  vvenkatesan
>  yaeld
>  yosefe
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>


[OMPI devel] mpirun + aprun question

2014-09-12 Thread Pritchard Jr., Howard
Hi Folks,

So, I've got a testbed Cray system with no batch scheduler; I just use the native
ALPS both as the resource manager and as the job launcher for the orte daemons.

What I'm noticing is that the mpirun -host option (or any other mpirun-level
way of specifying the nodes to run the app on) is ignored.

In this sort of environment, ORTE needs to figure out how to build the
aprun -L list_of_nids argument, but apparently doesn't do that.

Is this intended behavior?

Example:

crayadm@buffy:~/hpp> mpirun -np 2 -N 1  --debug-daemons --host 
nid00022,nid00021 ./my_script.sh
plm:alps aprun -n 2 -N 1 -cc none orted -mca orte_debug_daemons 1 -mca 
orte_ess_jobid 337444864 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 -mca 
orte_hnp_uri 337444864.0;tcp://10.128.0.3:39190
Daemon [[5149,0],1] checking in as pid 7398 on host 20
Daemon [[5149,0],1] checking in as pid 6540 on host 21

What's happening is that alps is just doing its default thing of grabbing the 
first set of nodes it can, which on
my little machine starts at nid00020.

What I'd prefer to see with ORTE and ALPS is that ORTE always build the
explicit -L list_of_nids, so that a user can control where the orteds are
launched, just as one can when using aprun directly -- both in a
non-ALPS-managed environment and when a batch scheduler is managing things
and telling ALPS where to launch the job.

I had to use this -L feature a lot when debugging large customer system 
problems.
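
For illustration, what I'd expect to see generated for the example above is
something like (the exact nid formatting for -L is a guess on my part):

plm:alps aprun -n 2 -N 1 -cc none -L 21,22 orted -mca orte_debug_daemons 1 ...

i.e. the same aprun line as today, just with the -L list built from the --host
argument.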

Can I assume LANL owns the alps plm component?

Howard


-
Howard Pritchard
HPC-5
Los Alamos National Laboratory




Re: [OMPI devel] mpirun + aprun question

2014-09-12 Thread Ralph Castain
Odd - I'm pretty sure it does indeed build the -L argument...and indeed, it 
does:

for (nnode=0; nnode < map->nodes->size; nnode++) {
    if (NULL == (node = (orte_node_t*)opal_pointer_array_get_item(map->nodes, nnode))) {
        continue;
    }

    /* if the daemon already exists on this node, then
     * don't include it
     */
    if (ORTE_FLAG_TEST(node, ORTE_NODE_FLAG_DAEMON_LAUNCHED)) {
        continue;
    }

    /* otherwise, add it to the list of nodes upon which
     * we need to launch a daemon
     */
    opal_argv_append(&nodelist_argc, &nodelist_argv, node->name);
}
if (0 == opal_argv_count(nodelist_argv)) {
    orte_show_help("help-plm-alps.txt", "no-hosts-in-list", true);
    rc = ORTE_ERR_FAILED_TO_START;
    goto cleanup;
}
nodelist_flat = opal_argv_join(nodelist_argv, ',');
opal_argv_free(nodelist_argv);

/* if we are using all allocated nodes, then alps
 * doesn't need a nodelist
 */
if (map->num_new_daemons < orte_num_allocated_nodes) {
    opal_argv_append(&argc, &argv, "-L");
    opal_argv_append(&argc, &argv, nodelist_flat);
}


So maybe the --host option isn't working right for this environment? You could 
look at the setup_virtual_machine function in 
orte/mca/plm/base/plm_base_launch_support.c

Set "-mca plm_base_verbose 100 -mca ras_base_verbose 100" and it should tell 
you something about how it processed the allocation to define the VM.
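
For example, something like:

mpirun -np 2 -N 1 --debug-daemons --host nid00022,nid00021 \
    -mca plm_base_verbose 100 -mca ras_base_verbose 100 ./my_script.sh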

There is also some oddball stuff Nathan inserted to redefine node location - 
maybe that is getting confused when running on partial allocations? It's in the 
same file, in the orte_plm_base_daemon_callback routine. Could be that the 
daemons actually are running on the nodes you specified, but think they are 
somewhere else.


On Sep 12, 2014, at 11:13 AM, Pritchard Jr., Howard  wrote:

> Hi Folks,
>  
> So, I’ve got a testbed Cray system with no batch scheduler; I just use the
> native ALPS both as the resource manager and as the job launcher for the orte
> daemons.
>  
> What I’m noticing is that the mpirun –host option (or any other mpirun-level
> way of specifying the nodes to run the app on) is ignored.
>  
> In this sort of environment, ORTE needs to figure out how to build the
> aprun –L list_of_nids argument, but apparently doesn’t do that.
>  
> Is this intended behavior?
>  
> Example:
>  
> crayadm@buffy:~/hpp> mpirun -np 2 -N 1  --debug-daemons --host 
> nid00022,nid00021 ./my_script.sh
> plm:alps aprun -n 2 -N 1 -cc none orted -mca orte_debug_daemons 1 -mca 
> orte_ess_jobid 337444864 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 -mca 
> orte_hnp_uri 337444864.0;tcp://10.128.0.3:39190
> Daemon [[5149,0],1] checking in as pid 7398 on host 20
> Daemon [[5149,0],1] checking in as pid 6540 on host 21
>  
> What’s happening is that alps is just doing its default thing of grabbing the 
> first set of nodes it can, which on
> my little machine starts at nid00020.
>  
> What I’d prefer to see with ORTE and ALPS is that ORTE always build the
> explicit –L list_of_nids, so that a user can control where the orteds are
> launched, just as one can when using aprun directly -- both in a
> non-ALPS-managed environment and when a batch scheduler is managing things
> and telling ALPS where to launch the job.
>  
> I had to use this –L feature a lot when debugging large customer system 
> problems.
>  
> Can I assume LANL owns the alps plm component?
>  
> Howard
>  
>  
> -
> Howard Pritchard
> HPC-5
> Los Alamos National Laboratory
>  
>  