Dear Ralph,

This is the output I get when I execute with the verbose option.
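The command line is essentially my usual invocation with "-mca plm_base_verbose 5" added (the application name below is just a placeholder):

    mpirun -mca plm_base_verbose 5 ./spawn_and_merge

And the output: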
[grsacc20:21012] [[23526,0],0] plm:base:receive processing msg
[grsacc20:21012] [[23526,0],0] plm:base:receive job launch command from [[23526,1],0]
[grsacc20:21012] [[23526,0],0] plm:base:receive adding hosts
[grsacc20:21012] [[23526,0],0] plm:base:receive calling spawn
[grsacc20:21012] [[23526,0],0] plm:base:receive done processing commands
[grsacc20:21012] [[23526,0],0] plm:base:setup_job
[grsacc20:21012] [[23526,0],0] plm:base:setup_vm
[grsacc20:21012] [[23526,0],0] plm:base:setup_vm add new daemon [[23526,0],2]
[grsacc20:21012] [[23526,0],0] plm:base:setup_vm assigning new daemon [[23526,0],2] to node grsacc17/1-4
[grsacc20:21012] [[23526,0],0] plm:base:setup_vm add new daemon [[23526,0],3]
[grsacc20:21012] [[23526,0],0] plm:base:setup_vm assigning new daemon [[23526,0],3] to node grsacc17/0-5
[grsacc20:21012] [[23526,0],0] plm:tm: launching vm
[grsacc20:21012] [[23526,0],0] plm:tm: final top-level argv: orted -mca ess tm -mca orte_ess_jobid 1541799936 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 4 -mca orte_hnp_uri "1541799936.0;tcp://192.168.222.20:49049" -mca plm_base_verbose 5
[warn] opal_libevent2021_event_base_loop: reentrant invocation. Only one event_base_loop can run on each event_base at once.
[grsacc20:21012] [[23526,0],0] plm:base:orted_cmd sending orted_exit commands
[grsacc20:21012] [[23526,0],0] plm:base:receive stop comm

Does this say anything to you? (For reference, I have put a trimmed-down sketch of how I call MPI_Comm_spawn() with the "add-host" key at the very bottom of this mail.)

Best,
Suraj

On Sep 22, 2013, at 9:45 PM, Ralph Castain wrote:

> I'll still need to look at the intercomm_create issue, but I just tested both
> the trunk and current 1.7.3 branch for "add-host" and both worked just fine.
> This was on my little test cluster which only has rsh available - no Torque.
>
> You might add "-mca plm_base_verbose 5" to your cmd line to get some debug
> output as to the problem.
>
>
> On Sep 21, 2013, at 5:48 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>>
>> On Sep 21, 2013, at 4:54 PM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com>
>> wrote:
>>
>>> Dear all,
>>>
>>> Really thanks a lot for your efforts. I too downloaded the trunk to check
>>> if it works for my case and as of revision 29215, it works for the original
>>> case I reported. Although it works, I still see the following in the
>>> output. Does it mean anything?
>>> [grsacc17][[13611,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13611,2],0]
>>
>> Yes - it means we don't quite have this right yet :-(
>>
>>>
>>> However, on another topic relevant to my use case, I have another problem
>>> to report. I am having problems using the "add-host" info to the
>>> MPI_Comm_spawn() when MPI is compiled with support for Torque resource
>>> manager. This problem is totally new in the 1.7 series and it worked
>>> perfectly until 1.6.5
>>>
>>> Basically, I am working on implementing dynamic resource management
>>> facilities in the Torque/Maui batch system. Through a new tm call, an
>>> application can get new resources for a job.
>>
>> FWIW: you'll find that we added an API to the orte RAS framework to support
>> precisely that operation. It allows an application to request that we
>> dynamically obtain additional resources during execution (e.g., as part of a
>> Comm_spawn call via an info_key). We originally implemented this with Slurm,
>> but you could add the calls into the Torque component as well if you like.
>>
>> This is in the trunk now - will come over to 1.7.4
>>
>>
>>> I want to use MPI_Comm_spawn() to spawn new processes in the new hosts.
>>> With my extended torque/maui batch system, I was able to perfectly use the
>>> "add-host" info argument to MPI_Comm_spawn() to spawn new processes on
>>> these hosts. Since MPI and Torque refer to the hosts through the nodeids, I
>>> made sure that OpenMPI uses the correct nodeid's for these new hosts.
>>> Until 1.6.5, this worked perfectly fine, except that due to the
>>> Intercomm_merge problem, I could not really run a real application to its
>>> completion.
>>>
>>> While this is now fixed in the trunk, I found that, however, when using the
>>> "add-host" info argument, everything collapses after printing out the
>>> following error.
>>>
>>> [warn] opal_libevent2021_event_base_loop: reentrant invocation. Only one
>>> event_base_loop can run on each event_base at once.
>>
>> I'll take a look - probably some stale code that hasn't been updated yet for
>> async ORTE operations
>>
>>>
>>> And due to this, I am still not really able to run my application! I also
>>> compiled the MPI without any Torque/PBS support and just used the
>>> "add-host" argument normally. Again, this worked perfectly in 1.6.5. But in
>>> the 1.7 series, it works but after printing out the following error.
>>>
>>> [grsacc17][[13731,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13731,2],0]
>>> [grsacc17][[13731,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13731,2],0]
>>
>> Yeah, the 1.7 series doesn't have the reentrant test in it - so we
>> "illegally" re-enter libevent. The error again means we don't have
>> Intercomm_create correct just yet.
>>
>> I'll see what I can do about this and get back to you
>>
>>>
>>> In short, with pbs/torque support, it fails and without pbs/torque support,
>>> it runs after spitting the above lines.
>>>
>>> I would really appreciate some help on this, since I need these features to
>>> actually test my case and (at least in my short experience) no other MPI
>>> implementation seem friendly to such dynamic scenarios.
>>>
>>> Thanks a lot!
>>>
>>> Best,
>>> Suraj
>>>
>>>
>>>
>>> On Sep 20, 2013, at 4:58 PM, Jeff Squyres (jsquyres) wrote:
>>>
>>>> Just to close my end of this loop: as of trunk r29213, it all works for
>>>> me. Thanks!
>>>>
>>>>
>>>> On Sep 18, 2013, at 12:52 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>>> Thanks George - much appreciated
>>>>>
>>>>> On Sep 18, 2013, at 9:49 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>
>>>>>> The test case was broken. I just pushed a fix.
>>>>>>
>>>>>> George.
>>>>>>
>>>>>> On Sep 18, 2013, at 16:49 , Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>
>>>>>>> Hangs with any np > 1
>>>>>>>
>>>>>>> However, I'm not sure if that's an issue with the test vs the
>>>>>>> underlying implementation
>>>>>>>
>>>>>>> On Sep 18, 2013, at 7:40 AM, "Jeff Squyres (jsquyres)"
>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>
>>>>>>>> Does it hang when you run with -np 4?
>>>>>>>>
>>>>>>>> Sent from my phone. No type good.
>>>>>>>>
>>>>>>>> On Sep 18, 2013, at 4:10 PM, "Ralph Castain" <r...@open-mpi.org> wrote:
>>>>>>>>
>>>>>>>>> Strange - it works fine for me on my Mac. However, I see one
>>>>>>>>> difference - I only run it with np=1
>>>>>>>>>
>>>>>>>>> On Sep 18, 2013, at 2:22 AM, Jeff Squyres (jsquyres)
>>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>>>
>>>>>>>>>> On Sep 18, 2013, at 9:33 AM, George Bosilca <bosi...@icl.utk.edu>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> 1. sm doesn't work between spawned processes. So you must have
>>>>>>>>>>> another network enabled.
>>>>>>>>>>
>>>>>>>>>> I know :-). I have tcp available as well (OMPI will abort if you
>>>>>>>>>> only run with sm,self because the comm_spawn will fail with
>>>>>>>>>> unreachable errors -- I just tested/proved this to myself).
>>>>>>>>>>
>>>>>>>>>>> 2. Don't use the test case attached to my email, I left an xterm
>>>>>>>>>>> based spawn and the debugging. It can't work without xterm support.
>>>>>>>>>>> Instead try using the test case from the trunk, the one committed
>>>>>>>>>>> by Ralph.
>>>>>>>>>>
>>>>>>>>>> I didn't see any "xterm" strings in there, but ok. :-) I ran with
>>>>>>>>>> orte/test/mpi/intercomm_create.c, and that hangs for me as well:
>>>>>>>>>>
>>>>>>>>>> -----
>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 4]
>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 5]
>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 6]
>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 7]
>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 4]
>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 5]
>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 6]
>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 7]
>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>> [hang]
>>>>>>>>>> -----
>>>>>>>>>>
>>>>>>>>>> Similarly, on my Mac, it hangs with no output:
>>>>>>>>>>
>>>>>>>>>> -----
>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>> [hang]
>>>>>>>>>> -----
>>>>>>>>>>
>>>>>>>>>>> George.
>>>>>>>>>>>
>>>>>>>>>>> On Sep 18, 2013, at 07:53 , "Jeff Squyres (jsquyres)"
>>>>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> George --
>>>>>>>>>>>>
>>>>>>>>>>>> When I build the SVN trunk (r29201) on 64 bit linux, your attached
>>>>>>>>>>>> test case hangs:
>>>>>>>>>>>>
>>>>>>>>>>>> -----
>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 4]
>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 5]
>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 6]
>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 7]
>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 4]
>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 5]
>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 6]
>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 7]
>>>>>>>>>>>> [hang]
>>>>>>>>>>>> -----
>>>>>>>>>>>>
>>>>>>>>>>>> On my Mac, it hangs without printing anything:
>>>>>>>>>>>>
>>>>>>>>>>>> -----
>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>> [hang]
>>>>>>>>>>>> -----
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sep 18, 2013, at 1:48 AM, George Bosilca <bosi...@icl.utk.edu>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Here is a quick (and definitively not the cleanest) patch that
>>>>>>>>>>>>> addresses the MPI_Intercomm issue at the MPI level. It should be
>>>>>>>>>>>>> applied after removal of 29166.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I also added the corrected test case stressing the corner cases
>>>>>>>>>>>>> by doing barriers at every inter-comm creation and doing a clean
>>>>>>>>>>>>> disconnect.
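P.S. In case it helps you reproduce the "add-host" problem: below is a trimmed-down sketch of the parent-side pattern I use. The host name, the worker binary and the process count are placeholders here (my real setup gets them from the extended Torque/Maui calls), and the worker side simply calls MPI_Comm_get_parent() and the matching MPI_Intercomm_merge() with high=1.

-----
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm children, merged;
    MPI_Info info;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Ask the runtime to extend the VM onto a host that was not part of
       the original allocation before launching the children. */
    MPI_Info_create(&info);
    MPI_Info_set(info, "add-host", "grsacc21");   /* placeholder host name */

    /* Collective over MPI_COMM_WORLD; only the root's command/info matter. */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 2, info, 0,
                   MPI_COMM_WORLD, &children, MPI_ERRCODES_IGNORE);

    /* Merge parents and children into one intra-communicator
       (parents ordered first because high=0 on this side). */
    MPI_Intercomm_merge(children, 0, &merged);

    printf("rank %d: spawn and merge completed\n", rank);

    MPI_Comm_free(&merged);
    MPI_Comm_disconnect(&children);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
-----

With Torque support compiled in, this is where everything collapses with the libevent re-entrancy warning; without Torque support it runs, but prints the ompi_modex_recv errors shown above.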