On Jun 6, 2014, at 12:50 PM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:

> Thanks for trying, Ralph.  Looks like my issue has to do with a coll ml 
> interaction.  If I exclude coll ml, then all my tests pass.  Do you know if 
> there is a bug filed for this issue?

There is a known issue with coll ml for intercomm_create - Nathan is working on 
a fix. It was reported by Gilles (yesterday?).
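
In the meantime, running your nightlies with coll ml excluded should be safe - 
that's what I did for intercomm_create below. Something like this should do it 
(the env-var form assumes the usual OMPI_MCA_ prefix, handy if your test harness 
can't add flags to every mpirun line):

mpirun -np 2 --mca coll ^ml ./spawn_multiple
export OMPI_MCA_coll=^ml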

> If so, then I can run my nightly tests with coll ml disabled and wait for the 
> bug to be fixed.
>  
> Also, where do simple_spawn and spawn_multiple live?

I have a copy/version in my orte/test/mpi directory that I use - that's where 
these came from. Note that I left coll ml "on" for those runs since they weren't 
having any trouble.
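
In case it helps you compare against the ibm/dynamic versions, the parent/child 
pattern simple_spawn exercises is roughly the sketch below - just the shape of it, 
not the actual test source (the tag and error handling are simplified for 
illustration; spawn_multiple does the same thing via MPI_Comm_spawn_multiple):

/* Rough sketch of the simple_spawn pattern - not the actual test source. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm parent, child;
    int rank, msg = 38;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (MPI_COMM_NULL == parent) {
        /* Parents collectively spawn two children, then rank 0 sends them a message */
        MPI_Comm_spawn("./simple_spawn", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &child, MPI_ERRCODES_IGNORE);
        if (0 == rank) {
            MPI_Send(&msg, 1, MPI_INT, 0, 1, child);
        }
        MPI_Comm_disconnect(&child);
    } else {
        /* Children: rank 0 receives the parent's message, then everyone disconnects */
        if (0 == rank) {
            MPI_Recv(&msg, 1, MPI_INT, 0, 1, parent, MPI_STATUS_IGNORE);
            printf("Child %d received msg: %d\n", rank, msg);
        }
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}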


>   I was running “spawn” and “spawn_multiple” from the ibm/dynamic test suite. 
> Your output for spawn_multiple looks different than mine.
>  
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Friday, June 06, 2014 3:19 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] Strange intercomm_create, spawn, spawn_multiple 
> hang on trunk
>  
> Works fine for me:
>  
> [rhc@bend001 mpi]$ mpirun -n 3 --host bend001 ./simple_spawn
> [pid 22777] starting up!
> [pid 22778] starting up!
> [pid 22779] starting up!
> 1 completed MPI_Init
> Parent [pid 22778] about to spawn!
> 2 completed MPI_Init
> Parent [pid 22779] about to spawn!
> 0 completed MPI_Init
> Parent [pid 22777] about to spawn!
> [pid 22783] starting up!
> [pid 22784] starting up!
> Parent done with spawn
> Parent sending message to child
> Parent done with spawn
> Parent done with spawn
> 0 completed MPI_Init
> Hello from the child 0 of 2 on host bend001 pid 22783
> Child 0 received msg: 38
> 1 completed MPI_Init
> Hello from the child 1 of 2 on host bend001 pid 22784
> Child 1 disconnected
> Parent disconnected
> Parent disconnected
> Parent disconnected
> Child 0 disconnected
> 22784: exiting
> 22778: exiting
> 22779: exiting
> 22777: exiting
> 22783: exiting
> [rhc@bend001 mpi]$ make spawn_multiple
> mpicc -g --openmpi:linkall    spawn_multiple.c   -o spawn_multiple
> [rhc@bend001 mpi]$ mpirun -n 3 --host bend001 ./spawn_multiple
> Parent [pid 22797] about to spawn!
> Parent [pid 22798] about to spawn!
> Parent [pid 22799] about to spawn!
> Parent done with spawn
> Parent done with spawn
> Parent sending message to children
> Parent done with spawn
> Hello from the child 0 of 2 on host bend001 pid 22803: argv[1] = foo
> Child 0 received msg: 38
> Hello from the child 1 of 2 on host bend001 pid 22804: argv[1] = bar
> Child 1 disconnected
> Parent disconnected
> Parent disconnected
> Parent disconnected
> Child 0 disconnected
> [rhc@bend001 mpi]$ mpirun -n 3 --host bend001 -mca coll ^ml ./intercomm_create
> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 3]
> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 4]
> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 5]
> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 3]
> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 4]
> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 5]
> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 3, 201, &inter) (0)
> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 3, 201, &inter) (0)
> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 3, 201, &inter) (0)
> b: intercomm_create (0)
> b: barrier on inter-comm - before
> b: barrier on inter-comm - after
> b: intercomm_create (0)
> b: barrier on inter-comm - before
> b: barrier on inter-comm - after
> c: intercomm_create (0)
> c: barrier on inter-comm - before
> c: barrier on inter-comm - after
> c: intercomm_create (0)
> c: barrier on inter-comm - before
> c: barrier on inter-comm - after
> a: intercomm_create (0)
> a: barrier on inter-comm - before
> a: barrier on inter-comm - after
> c: intercomm_create (0)
> c: barrier on inter-comm - before
> c: barrier on inter-comm - after
> a: intercomm_create (0)
> a: barrier on inter-comm - before
> a: barrier on inter-comm - after
> a: intercomm_create (0)
> a: barrier on inter-comm - before
> a: barrier on inter-comm - after
> b: intercomm_create (0)
> b: barrier on inter-comm - before
> b: barrier on inter-comm - after
> a: intercomm_merge(0) (0) [rank 2]
> c: intercomm_merge(0) (0) [rank 8]
> a: intercomm_merge(0) (0) [rank 0]
> a: intercomm_merge(0) (0) [rank 1]
> c: intercomm_merge(0) (0) [rank 7]
> b: intercomm_merge(1) (0) [rank 4]
> b: intercomm_merge(1) (0) [rank 5]
> c: intercomm_merge(0) (0) [rank 6]
> b: intercomm_merge(1) (0) [rank 3]
> a: barrier (0)
> b: barrier (0)
> c: barrier (0)
> a: barrier (0)
> c: barrier (0)
> b: barrier (0)
> a: barrier (0)
> c: barrier (0)
> b: barrier (0)
> dpm_base_disconnect_init: error -12 in isend to process 3
> dpm_base_disconnect_init: error -12 in isend to process 3
> dpm_base_disconnect_init: error -12 in isend to process 3
> dpm_base_disconnect_init: error -12 in isend to process 0
> dpm_base_disconnect_init: error -12 in isend to process 3
> dpm_base_disconnect_init: error -12 in isend to process 3
> dpm_base_disconnect_init: error -12 in isend to process 3
> dpm_base_disconnect_init: error -12 in isend to process 0
> dpm_base_disconnect_init: error -12 in isend to process 3
> dpm_base_disconnect_init: error -12 in isend to process 3
> dpm_base_disconnect_init: error -12 in isend to process 3
> dpm_base_disconnect_init: error -12 in isend to process 1
> dpm_base_disconnect_init: error -12 in isend to process 3
> [rhc@bend001 mpi]$ 
>  
>  
>  
> On Jun 6, 2014, at 11:26 AM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
> 
> 
> I am seeing an interesting failure on trunk.  intercomm_create, spawn, and 
> spawn_multiple from the IBM tests hang if I explicitly list the hostnames to 
> run on.  For example:
> 
> Good:
> $ mpirun -np 2 --mca btl self,sm,tcp spawn_multiple
> Parent: 0 of 2, drossetti-ivy0.nvidia.com (0 in init)
> Parent: 1 of 2, drossetti-ivy0.nvidia.com (0 in init)
> Child: 0 of 4, drossetti-ivy0.nvidia.com (this is job 1) (1 in init)
> Child: 1 of 4, drossetti-ivy0.nvidia.com (this is job 1) (1 in init)
> Child: 2 of 4, drossetti-ivy0.nvidia.com (this is job 2) (1 in init)
> Child: 3 of 4, drossetti-ivy0.nvidia.com (this is job 2) (1 in init)
> $ 
> 
> Bad:
> $ mpirun -np 2 --mca btl self,sm,tcp -host drossetti-ivy0,drossetti-ivy0 
> spawn_multiple
> Parent: 0 of 2, drossetti-ivy0.nvidia.com (1 in init)
> Parent: 1 of 2, drossetti-ivy0.nvidia.com (1 in init)
> Child: 0 of 4, drossetti-ivy0.nvidia.com (this is job 1) (1 in init)
> Child: 1 of 4, drossetti-ivy0.nvidia.com (this is job 1) (1 in init)
> Child: 2 of 4, drossetti-ivy0.nvidia.com (this is job 2) (1 in init)
> Child: 3 of 4, drossetti-ivy0.nvidia.com (this is job 2) (1 in init)
> [..and we are hung here...]
> 
> I see the exact same behavior for spawn and spawn_multiple.  Ralph, any 
> thoughts?  Open MPI 1.8 is fine.  I can provide more information if needed, 
> but I assume this is reproducible. 
> 
> Thanks,
> Rolf