Dear all,

Thanks a lot for your efforts. I also downloaded the trunk to check whether it works for my case, and as of revision 29215 it does work for the original case I reported. Even so, I still see the following in the output. Does it mean anything?

[grsacc17][[13611,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create] [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13611,2],0]
On another topic relevant to my use case, I have a further problem to report. I am having trouble using the "add-host" info key with MPI_Comm_spawn() when Open MPI is compiled with support for the Torque resource manager. This problem is new in the 1.7 series; it worked perfectly up to 1.6.5.

Some background: I am working on implementing dynamic resource management facilities in the Torque/Maui batch system. Through a new tm call, an application can obtain additional resources for a running job, and I want to use MPI_Comm_spawn() to start new processes on those hosts. With my extended Torque/Maui batch system, I was able to use the "add-host" info argument to MPI_Comm_spawn() to spawn new processes on these hosts (a minimal sketch of this call pattern is appended at the end of this message, below the quoted thread). Since Open MPI and Torque refer to hosts through node IDs, I made sure that Open MPI uses the correct node IDs for the new hosts. Up to 1.6.5 this worked fine, except that, due to the Intercomm_merge problem, I could not run a real application to completion. That problem is now fixed in the trunk, but when I use the "add-host" info argument everything collapses after printing the following error:

[warn] opal_libevent2021_event_base_loop: reentrant invocation. Only one event_base_loop can run on each event_base at once.

Because of this, I am still not able to run my application. I also compiled Open MPI without any Torque/PBS support and used the "add-host" argument in the same way. Again, this worked perfectly in 1.6.5; in the 1.7 series it runs, but only after printing the following errors:

[grsacc17][[13731,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create] [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13731,2],0]
[grsacc17][[13731,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create] [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13731,2],0]

In short, with PBS/Torque support it fails, and without PBS/Torque support it runs after emitting the lines above. I would really appreciate some help with this, since I need these features to actually test my case, and (at least in my limited experience) no other MPI implementation seems friendly to such dynamic scenarios.

Thanks a lot!

Best,
Suraj

On Sep 20, 2013, at 4:58 PM, Jeff Squyres (jsquyres) wrote:

> Just to close my end of this loop: as of trunk r29213, it all works for me.
> Thanks!
>
>
> On Sep 18, 2013, at 12:52 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Thanks George - much appreciated
>>
>> On Sep 18, 2013, at 9:49 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>
>>> The test case was broken. I just pushed a fix.
>>>
>>> George.
>>>
>>> On Sep 18, 2013, at 16:49, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>> Hangs with any np > 1
>>>>
>>>> However, I'm not sure if that's an issue with the test vs the underlying implementation
>>>>
>>>> On Sep 18, 2013, at 7:40 AM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote:
>>>>
>>>>> Does it hang when you run with -np 4?
>>>>>
>>>>> Sent from my phone. No type good.
>>>>>
>>>>> On Sep 18, 2013, at 4:10 PM, "Ralph Castain" <r...@open-mpi.org> wrote:
>>>>>
>>>>>> Strange - it works fine for me on my Mac. However, I see one difference - I only run it with np=1
>>>>>>
>>>>>> On Sep 18, 2013, at 2:22 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>>>>
>>>>>>> On Sep 18, 2013, at 9:33 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>
>>>>>>>> 1. sm doesn't work between spawned processes. So you must have another network enabled.
>>>>>>>
>>>>>>> I know :-). I have tcp available as well (OMPI will abort if you only run with sm,self because the comm_spawn will fail with unreachable errors -- I just tested/proved this to myself).
>>>>>>>
>>>>>>>> 2. Don't use the test case attached to my email, I left an xterm based spawn and the debugging. It can't work without xterm support. Instead try using the test case from the trunk, the one committed by Ralph.
>>>>>>>
>>>>>>> I didn't see any "xterm" strings in there, but ok. :-) I ran with orte/test/mpi/intercomm_create.c, and that hangs for me as well:
>>>>>>>
>>>>>>> -----
>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 4]
>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 5]
>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 6]
>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 7]
>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 4]
>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 5]
>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 6]
>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 7]
>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>> [hang]
>>>>>>> -----
>>>>>>>
>>>>>>> Similarly, on my Mac, it hangs with no output:
>>>>>>>
>>>>>>> -----
>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>> [hang]
>>>>>>> -----
>>>>>>>
>>>>>>>> George.
>>>>>>>>
>>>>>>>> On Sep 18, 2013, at 07:53, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote:
>>>>>>>>
>>>>>>>>> George --
>>>>>>>>>
>>>>>>>>> When I build the SVN trunk (r29201) on 64 bit linux, your attached test case hangs:
>>>>>>>>>
>>>>>>>>> -----
>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 4]
>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 5]
>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 6]
>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 7]
>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 4]
>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 5]
>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 6]
>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 7]
>>>>>>>>> [hang]
>>>>>>>>> -----
>>>>>>>>>
>>>>>>>>> On my Mac, it hangs without printing anything:
>>>>>>>>>
>>>>>>>>> -----
>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>> [hang]
>>>>>>>>> -----
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sep 18, 2013, at 1:48 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>>
>>>>>>>>>> Here is a quick (and definitively not the cleanest) patch that addresses the MPI_Intercomm issue at the MPI level. It should be applied after removal of 29166.
>>>>>>>>>>
>>>>>>>>>> I also added the corrected test case stressing the corner cases by doing barriers at every inter-comm creation and doing a clean disconnect.
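P.S. For reference, here is a minimal sketch of the spawn/merge call pattern I am describing above. It is not my actual code: the host name "newnode01" and the child binary "./worker" are just placeholders, and in my real setup the host comes from the extended tm call. The spawned children are assumed to call MPI_Comm_get_parent() and the matching MPI_Intercomm_merge() with high = 1 on their side.

-----
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm intercomm, merged;
    MPI_Info info;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Tell the runtime about the newly granted host, then spawn on it.
       "newnode01" is a placeholder for the host obtained at run time. */
    MPI_Info_create(&info);
    MPI_Info_set(info, "add-host", "newnode01");

    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, info, 0,
                   MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

    /* Merge parents and children into a single intra-communicator;
       this is the MPI_Intercomm_merge step mentioned above. */
    MPI_Intercomm_merge(intercomm, 0, &merged);

    MPI_Comm_size(merged, &size);
    if (rank == 0)
        printf("merged communicator has %d processes\n", size);

    MPI_Comm_free(&merged);
    MPI_Comm_disconnect(&intercomm);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
-----

In my runs it is the spawn step where things go wrong: with Torque support compiled in I get the libevent "reentrant invocation" error, and without it the ompi_modex_recv warnings shown above are printed.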