On Jun 6, 2014, at 12:50 PM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
> Thanks for trying Ralph. Looks like my issue has to do with coll ml
> interaction. If I exclude coll ml, then all my tests pass. Do you know if
> there is a bug for this issue?

There is a known issue with coll ml for intercomm_create - Nathan is working on a fix. It was reported by Gilles (yesterday?)

> If so, then I can run my nightly tests with coll ml disabled and wait for the
> bug to be fixed.
>
> Also, where do simple_spawn and spawn_multiple live?

I have a copy/version in my orte/test/mpi directory that I use - that's where these came from. Note that I left coll ml "on" for those as they weren't having troubles.

> I was running “spawn” and “spawn_multiple” from the ibm/dynamic test suite.
> Your output for spawn_multiple looks different than mine.
>
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Friday, June 06, 2014 3:19 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] Strange intercomm_create, spawn, spawn_multiple hang on trunk
>
> Works fine for me:
>
> [rhc@bend001 mpi]$ mpirun -n 3 --host bend001 ./simple_spawn
> [pid 22777] starting up!
> [pid 22778] starting up!
> [pid 22779] starting up!
> 1 completed MPI_Init
> Parent [pid 22778] about to spawn!
> 2 completed MPI_Init
> Parent [pid 22779] about to spawn!
> 0 completed MPI_Init
> Parent [pid 22777] about to spawn!
> [pid 22783] starting up!
> [pid 22784] starting up!
> Parent done with spawn
> Parent sending message to child
> Parent done with spawn
> Parent done with spawn
> 0 completed MPI_Init
> Hello from the child 0 of 2 on host bend001 pid 22783
> Child 0 received msg: 38
> 1 completed MPI_Init
> Hello from the child 1 of 2 on host bend001 pid 22784
> Child 1 disconnected
> Parent disconnected
> Parent disconnected
> Parent disconnected
> Child 0 disconnected
> 22784: exiting
> 22778: exiting
> 22779: exiting
> 22777: exiting
> 22783: exiting
> [rhc@bend001 mpi]$ make spawn_multiple
> mpicc -g --openmpi:linkall spawn_multiple.c -o spawn_multiple
> [rhc@bend001 mpi]$ mpirun -n 3 --host bend001 ./spawn_multiple
> Parent [pid 22797] about to spawn!
> Parent [pid 22798] about to spawn!
> Parent [pid 22799] about to spawn!
> Parent done with spawn
> Parent done with spawn
> Parent sending message to children
> Parent done with spawn
> Hello from the child 0 of 2 on host bend001 pid 22803: argv[1] = foo
> Child 0 received msg: 38
> Hello from the child 1 of 2 on host bend001 pid 22804: argv[1] = bar
> Child 1 disconnected
> Parent disconnected
> Parent disconnected
> Parent disconnected
> Child 0 disconnected
> [rhc@bend001 mpi]$ mpirun -n 3 --host bend001 -mca coll ^ml ./intercomm_create
> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 3]
> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 4]
> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 5]
> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 3]
> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 4]
> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 5]
> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 3, 201, &inter) (0)
> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 3, 201, &inter) (0)
> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 3, 201, &inter) (0)
> b: intercomm_create (0)
> b: barrier on inter-comm - before
> b: barrier on inter-comm - after
> b: intercomm_create (0)
> b: barrier on inter-comm - before
> b: barrier on inter-comm - after
> c: intercomm_create (0)
> c: barrier on inter-comm - before
> c: barrier on inter-comm - after
> c: intercomm_create (0)
> c: barrier on inter-comm - before
> c: barrier on inter-comm - after
> a: intercomm_create (0)
> a: barrier on inter-comm - before
> a: barrier on inter-comm - after
> c: intercomm_create (0)
> c: barrier on inter-comm - before
> c: barrier on inter-comm - after
> a: intercomm_create (0)
> a: barrier on inter-comm - before
> a: barrier on inter-comm - after
> a: intercomm_create (0)
> a: barrier on inter-comm - before
> a: barrier on inter-comm - after
> b: intercomm_create (0)
> b: barrier on inter-comm - before
> b: barrier on inter-comm - after
> a: intercomm_merge(0) (0) [rank 2]
> c: intercomm_merge(0) (0) [rank 8]
> a: intercomm_merge(0) (0) [rank 0]
> a: intercomm_merge(0) (0) [rank 1]
> c: intercomm_merge(0) (0) [rank 7]
> b: intercomm_merge(1) (0) [rank 4]
> b: intercomm_merge(1) (0) [rank 5]
> c: intercomm_merge(0) (0) [rank 6]
> b: intercomm_merge(1) (0) [rank 3]
> a: barrier (0)
> b: barrier (0)
> c: barrier (0)
> a: barrier (0)
> c: barrier (0)
> b: barrier (0)
> a: barrier (0)
> c: barrier (0)
> b: barrier (0)
> dpm_base_disconnect_init: error -12 in isend to process 3
> dpm_base_disconnect_init: error -12 in isend to process 3
> dpm_base_disconnect_init: error -12 in isend to process 3
> dpm_base_disconnect_init: error -12 in isend to process 0
> dpm_base_disconnect_init: error -12 in isend to process 3
> dpm_base_disconnect_init: error -12 in isend to process 3
> dpm_base_disconnect_init: error -12 in isend to process 3
> dpm_base_disconnect_init: error -12 in isend to process 0
> dpm_base_disconnect_init: error -12 in isend to process 3
> dpm_base_disconnect_init: error -12 in isend to process 3
> dpm_base_disconnect_init: error -12 in isend to process 3
> dpm_base_disconnect_init: error -12 in isend to process 1
> dpm_base_disconnect_init: error -12 in isend to process 3
> [rhc@bend001 mpi]$
>
>
> On Jun 6, 2014, at 11:26 AM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
>
> I am seeing an interesting failure on trunk.
> intercomm_create, spawn, and spawn_multiple from the IBM tests hang if I
> explicitly list the hostnames to run on. For example:
>
> Good:
> $ mpirun -np 2 --mca btl self,sm,tcp spawn_multiple
> Parent: 0 of 2, drossetti-ivy0.nvidia.com (0 in init)
> Parent: 1 of 2, drossetti-ivy0.nvidia.com (0 in init)
> Child: 0 of 4, drossetti-ivy0.nvidia.com (this is job 1) (1 in init)
> Child: 1 of 4, drossetti-ivy0.nvidia.com (this is job 1) (1 in init)
> Child: 2 of 4, drossetti-ivy0.nvidia.com (this is job 2) (1 in init)
> Child: 3 of 4, drossetti-ivy0.nvidia.com (this is job 2) (1 in init)
> $
>
> Bad:
> $ mpirun -np 2 --mca btl self,sm,tcp -host drossetti-ivy0,drossetti-ivy0 spawn_multiple
> Parent: 0 of 2, drossetti-ivy0.nvidia.com (1 in init)
> Parent: 1 of 2, drossetti-ivy0.nvidia.com (1 in init)
> Child: 0 of 4, drossetti-ivy0.nvidia.com (this is job 1) (1 in init)
> Child: 1 of 4, drossetti-ivy0.nvidia.com (this is job 1) (1 in init)
> Child: 2 of 4, drossetti-ivy0.nvidia.com (this is job 2) (1 in init)
> Child: 3 of 4, drossetti-ivy0.nvidia.com (this is job 2) (1 in init)
> [..and we are hung here...]
>
> I see the exact same behavior for spawn and spawn_multiple. Ralph, any
> thoughts? Open MPI 1.8 is fine. I can provide more information if needed,
> but I assume this is reproducible.
>
> Thanks,
> Rolf
>
> -----------------------------------------------------------------------------------
> This email message is for the sole use of the intended recipient(s) and may contain
> confidential information. Any unauthorized review, use, disclosure or distribution
> is prohibited. If you are not the intended recipient, please contact the sender by
> reply email and destroy all copies of the original message.
> -----------------------------------------------------------------------------------
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/06/14990.php
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/06/14992.php
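For reference, the parent side of a spawn_multiple-style test boils down to a single collective call to MPI_Comm_spawn_multiple, some traffic over the resulting inter-communicator, and a disconnect. The sketch below is not the IBM or orte/test/mpi source; the child binary name ("./spawn_child"), the argv values, and the per-command process counts are assumptions for illustration only.

    /* spawn_multiple_sketch.c -- build with: mpicc -g spawn_multiple_sketch.c */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Comm children;
        char *cmds[2]     = { "./spawn_child", "./spawn_child" }; /* hypothetical child binary */
        char *argv1[]     = { "foo", NULL };
        char *argv2[]     = { "bar", NULL };
        char **argvs[2]   = { argv1, argv2 };
        int maxprocs[2]   = { 1, 1 };
        MPI_Info infos[2] = { MPI_INFO_NULL, MPI_INFO_NULL };
        int errcodes[2];

        MPI_Init(&argc, &argv);

        /* Collective over MPI_COMM_WORLD; rank 0 (the root) supplies the spawn
         * arguments.  On success "children" is an inter-communicator whose
         * remote group contains all of the spawned processes. */
        MPI_Comm_spawn_multiple(2, cmds, argvs, maxprocs, infos, 0,
                                MPI_COMM_WORLD, &children, errcodes);

        /* ... exchange messages with the children over the inter-communicator ... */

        MPI_Comm_disconnect(&children); /* the children disconnect their parent comm too */
        MPI_Finalize();
        return 0;
    }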
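The intercomm_create run quoted above walks through MPI_Intercomm_create, a barrier on the inter-communicator, MPI_Intercomm_merge, a barrier on the merged communicator, and then a disconnect phase, which is where the "dpm_base_disconnect_init: error -12 in isend" messages appear to originate. A reduced sketch of that sequence, using two halves of a single job rather than the three spawned groups (a, b, c) in the real test, looks roughly like the following; the tag 201 matches the log above, everything else is illustrative.

    /* intercomm_sketch.c -- run with at least two ranks, e.g. mpirun -np 4 ./intercomm_sketch */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm local, inter, merged;
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Split the job into two halves; each half is a local intra-communicator. */
        MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &local);

        /* Connect the halves: the local leader is rank 0 within each half, the
         * remote leader is the lowest MPI_COMM_WORLD rank of the other half. */
        int remote_leader = (rank % 2 == 0) ? 1 : 0;
        MPI_Intercomm_create(local, 0, MPI_COMM_WORLD, remote_leader, 201, &inter);

        MPI_Barrier(inter);                            /* "barrier on inter-comm" */
        MPI_Intercomm_merge(inter, rank % 2, &merged); /* "intercomm_merge(high)" */
        MPI_Barrier(merged);                           /* final "barrier" step    */

        MPI_Comm_free(&merged);
        MPI_Comm_disconnect(&inter); /* disconnect; MPI_Comm_free would also be legal here */
        MPI_Comm_free(&local);
        MPI_Finalize();
        return 0;
    }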