Hi Ralph,

I have always gotten this output from any MPI job run on our nodes. There seems to be a problem somewhere, but it has never stopped the applications from running. In any case, I ran the job again with only TCP, excluding InfiniBand, and I get the same output, except that this time the openib-related error is gone. Printing the log again:
[grsacc20:04578] [[6160,0],0] plm:base:receive processing msg
[grsacc20:04578] [[6160,0],0] plm:base:receive job launch command from [[6160,1],0]
[grsacc20:04578] [[6160,0],0] plm:base:receive adding hosts
[grsacc20:04578] [[6160,0],0] plm:base:receive calling spawn
[grsacc20:04578] [[6160,0],0] plm:base:receive done processing commands
[grsacc20:04578] [[6160,0],0] plm:base:setup_job
[grsacc20:04578] [[6160,0],0] plm:base:setup_vm
[grsacc20:04578] [[6160,0],0] plm:base:setup_vm add new daemon [[6160,0],2]
[grsacc20:04578] [[6160,0],0] plm:base:setup_vm assigning new daemon [[6160,0],2] to node grsacc18
[grsacc20:04578] [[6160,0],0] plm:tm: launching vm
[grsacc20:04578] [[6160,0],0] plm:tm: final top-level argv:
    orted -mca ess tm -mca orte_ess_jobid 403701760 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 3 -mca orte_hnp_uri "403701760.0;tcp://192.168.222.20:35163" -mca plm_base_verbose 5 -mca btl tcp,sm,self
[grsacc20:04578] [[6160,0],0] plm:tm: launching on node grsacc19
[grsacc20:04578] [[6160,0],0] plm:tm: executing:
    orted -mca ess tm -mca orte_ess_jobid 403701760 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 -mca orte_hnp_uri "403701760.0;tcp://192.168.222.20:35163" -mca plm_base_verbose 5 -mca btl tcp,sm,self
[grsacc20:04578] [[6160,0],0] plm:tm: launching on node grsacc18
[grsacc20:04578] [[6160,0],0] plm:tm: executing:
    orted -mca ess tm -mca orte_ess_jobid 403701760 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 -mca orte_hnp_uri "403701760.0;tcp://192.168.222.20:35163" -mca plm_base_verbose 5 -mca btl tcp,sm,self
[grsacc20:04578] [[6160,0],0] plm:tm:launch: finished spawning orteds
[grsacc19:28821] mca:base:select:( plm) Querying component [rsh]
[grsacc19:28821] [[6160,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
[grsacc19:28821] mca:base:select:( plm) Query of component [rsh] set priority to 10
[grsacc19:28821] mca:base:select:( plm) Selected component [rsh]
[grsacc19:28821] [[6160,0],1] plm:rsh_setup on agent ssh : rsh path NULL
[grsacc19:28821] [[6160,0],1] plm:base:receive start comm
[grsacc19:28821] [[6160,0],1] plm:base:receive stop comm
[grsacc18:16717] mca:base:select:( plm) Querying component [rsh]
[grsacc18:16717] [[6160,0],2] plm:rsh_lookup on agent ssh : rsh path NULL
[grsacc18:16717] mca:base:select:( plm) Query of component [rsh] set priority to 10
[grsacc18:16717] mca:base:select:( plm) Selected component [rsh]
[grsacc18:16717] [[6160,0],2] plm:rsh_setup on agent ssh : rsh path NULL
[grsacc18:16717] [[6160,0],2] plm:base:receive start comm
[grsacc20:04578] [[6160,0],0] plm:base:orted_report_launch from daemon [[6160,0],2]
[grsacc20:04578] [[6160,0],0] plm:base:orted_report_launch from daemon [[6160,0],2] on node grsacc18
[grsacc20:04578] [[6160,0],0] plm:base:orted_report_launch completed for daemon [[6160,0],2] at contact 403701760.2;tcp://192.168.222.18:44229
[grsacc20:04578] [[6160,0],0] plm:base:launch_apps for job [6160,2]
[grsacc20:04578] [[6160,0],0] plm:base:receive processing msg
[grsacc20:04578] [[6160,0],0] plm:base:receive update proc state command from [[6160,0],2]
[grsacc20:04578] [[6160,0],0] plm:base:receive got update_proc_state for job [6160,2]
[grsacc20:04578] [[6160,0],0] plm:base:receive got update_proc_state for vpid 0 state RUNNING exit_code 0
[grsacc20:04578] [[6160,0],0] plm:base:receive done processing commands
[grsacc20:04578] [[6160,0],0] plm:base:launch wiring up iof for job [6160,2]
[grsacc20:04578] [[6160,0],0] plm:base:receive processing msg
[grsacc20:04578] [[6160,0],0] plm:base:receive done processing commands
[grsacc20:04578] [[6160,0],0] plm:base:launch registered event
[grsacc20:04578] [[6160,0],0] plm:base:launch sending dyn release of job [6160,2] to [[6160,1],0]
[grsacc20:04578] [[6160,0],0] plm:base:orted_cmd sending orted_exit commands
[grsacc19:28815] [[6160,0],1] plm:base:receive stop comm
[grsacc20:04578] [[6160,0],0] plm:base:receive stop comm
-bash-4.1$ [grsacc18:16717] [[6160,0],2] plm:base:receive stop comm

Best,
Suraj

On Sep
24, 2013, at 3:24 PM, Ralph Castain wrote:

> Your output shows that it launched your apps, but they exited. The error is reported here, though it appears we aren't flushing the message out before exiting due to a race condition:
>
>> [grsacc20:04511] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
>
> Here is the full text:
>
> [no active ports found]
> WARNING: There is at least non-excluded one OpenFabrics device found, but there are no active ports detected (or Open MPI was unable to use them). This is most certainly not what you wanted. Check your cables, subnet manager configuration, etc. The openib BTL will be ignored for this job.
>
>   Local host: %s
>
> Looks like at least one node being used doesn't have an active Infiniband port on it?
>
>
> On Sep 24, 2013, at 6:11 AM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:
>
>> Hi Ralph,
>>
>> I tested it with the trunk r29228. I still have the following problem. Now it even spawns the daemon on the new node through Torque, but then suddenly quits. The following is the output. Can you please have a look?
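[For reference, the BTL restriction and help-message settings discussed in this thread are given as MCA parameters on the mpirun command line. A sketch only; the program name ./spawn_test is a placeholder, and the exact parameters are the ones quoted in the logs below:]

```shell
# Restrict Open MPI to the TCP, shared-memory, and self BTLs, so the
# openib BTL is never opened (this is what the "-mca btl tcp,sm,self"
# in the log above does):
mpirun -np 2 -mca btl tcp,sm,self -mca plm_base_verbose 5 ./spawn_test

# To see every copy of an aggregated help message instead of the
# "1 more process has sent help message" summary:
mpirun -np 2 -mca orte_base_help_aggregate 0 ./spawn_test
```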
>>
>> Thanks
>> Suraj
>>
>> [grsacc20:04511] [[6253,0],0] plm:base:receive processing msg
>> [grsacc20:04511] [[6253,0],0] plm:base:receive job launch command from [[6253,1],0]
>> [grsacc20:04511] [[6253,0],0] plm:base:receive adding hosts
>> [grsacc20:04511] [[6253,0],0] plm:base:receive calling spawn
>> [grsacc20:04511] [[6253,0],0] plm:base:receive done processing commands
>> [grsacc20:04511] [[6253,0],0] plm:base:setup_job
>> [grsacc20:04511] [[6253,0],0] plm:base:setup_vm
>> [grsacc20:04511] [[6253,0],0] plm:base:setup_vm add new daemon [[6253,0],2]
>> [grsacc20:04511] [[6253,0],0] plm:base:setup_vm assigning new daemon [[6253,0],2] to node grsacc18
>> [grsacc20:04511] [[6253,0],0] plm:tm: launching vm
>> [grsacc20:04511] [[6253,0],0] plm:tm: final top-level argv:
>>     orted -mca ess tm -mca orte_ess_jobid 409796608 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 3 -mca orte_hnp_uri "409796608.0;tcp://192.168.222.20:53097" -mca plm_base_verbose 6
>> [grsacc20:04511] [[6253,0],0] plm:tm: launching on node grsacc19
>> [grsacc20:04511] [[6253,0],0] plm:tm: executing:
>>     orted -mca ess tm -mca orte_ess_jobid 409796608 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 3 -mca orte_hnp_uri "409796608.0;tcp://192.168.222.20:53097" -mca plm_base_verbose 6
>> [grsacc20:04511] [[6253,0],0] plm:tm: launching on node grsacc18
>> [grsacc20:04511] [[6253,0],0] plm:tm: executing:
>>     orted -mca ess tm -mca orte_ess_jobid 409796608 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 -mca orte_hnp_uri "409796608.0;tcp://192.168.222.20:53097" -mca plm_base_verbose 6
>> [grsacc20:04511] [[6253,0],0] plm:tm:launch: finished spawning orteds
>> [grsacc19:28754] mca:base:select:( plm) Querying component [rsh]
>> [grsacc19:28754] [[6253,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
>> [grsacc19:28754] mca:base:select:( plm) Query of component [rsh] set priority to 10
>> [grsacc19:28754] mca:base:select:( plm) Selected component [rsh]
>> [grsacc19:28754] [[6253,0],1] plm:rsh_setup on agent ssh : rsh path NULL
>> [grsacc19:28754] [[6253,0],1] plm:base:receive start comm
>> [grsacc19:28754] [[6253,0],1] plm:base:receive stop comm
>> [grsacc18:16648] mca:base:select:( plm) Querying component [rsh]
>> [grsacc18:16648] [[6253,0],2] plm:rsh_lookup on agent ssh : rsh path NULL
>> [grsacc18:16648] mca:base:select:( plm) Query of component [rsh] set priority to 10
>> [grsacc18:16648] mca:base:select:( plm) Selected component [rsh]
>> [grsacc18:16648] [[6253,0],2] plm:rsh_setup on agent ssh : rsh path NULL
>> [grsacc18:16648] [[6253,0],2] plm:base:receive start comm
>> [grsacc20:04511] [[6253,0],0] plm:base:orted_report_launch from daemon [[6253,0],2]
>> [grsacc20:04511] [[6253,0],0] plm:base:orted_report_launch from daemon [[6253,0],2] on node grsacc18
>> [grsacc20:04511] [[6253,0],0] plm:base:orted_report_launch completed for daemon [[6253,0],2] at contact 409796608.2;tcp://192.168.222.18:47974
>> [grsacc20:04511] [[6253,0],0] plm:base:launch_apps for job [6253,2]
>> [grsacc20:04511] 1 more process has sent help message help-mpi-btl-openib.txt / no active ports found
>> [grsacc20:04511] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>> [grsacc20:04511] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
>> [grsacc20:04511] [[6253,0],0] plm:base:receive processing msg
>> [grsacc20:04511] [[6253,0],0] plm:base:receive update proc state command from [[6253,0],2]
>> [grsacc20:04511] [[6253,0],0] plm:base:receive got update_proc_state for job [6253,2]
>> [grsacc20:04511] [[6253,0],0] plm:base:receive got update_proc_state for vpid 0 state RUNNING exit_code 0
>> [grsacc20:04511] [[6253,0],0] plm:base:receive done processing commands
>> [grsacc20:04511] [[6253,0],0] plm:base:launch wiring up iof for job [6253,2]
>> [grsacc20:04511] [[6253,0],0] plm:base:receive processing msg
>> [grsacc20:04511] [[6253,0],0] plm:base:receive done processing commands
>> [grsacc20:04511] [[6253,0],0] plm:base:launch registered event
>> [grsacc20:04511] [[6253,0],0] plm:base:launch sending dyn release of job [6253,2] to [[6253,1],0]
>> [grsacc20:04511] [[6253,0],0] plm:base:orted_cmd sending orted_exit commands
>> [grsacc19:28747] [[6253,0],1] plm:base:receive stop comm
>> [grsacc20:04511] [[6253,0],0] plm:base:receive stop comm
>> -bash-4.1$ [grsacc18:16648] [[6253,0],2] plm:base:receive stop comm
>>
>>
>> On Sep 23, 2013, at 1:55 AM, Ralph Castain wrote:
>>
>>> Found a bug in the Torque support - we were trying to connect to the MOM again, which would hang (I imagine). I pushed a fix to the trunk (r29227) and scheduled it to come to 1.7.3 if you want to try it again.
>>>
>>>
>>> On Sep 22, 2013, at 4:21 PM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:
>>>
>>>> Dear Ralph,
>>>>
>>>> This is the output I get when I execute with the verbose option.
>>>>
>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive processing msg
>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive job launch command from [[23526,1],0]
>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive adding hosts
>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive calling spawn
>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive done processing commands
>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_job
>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm
>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm add new daemon [[23526,0],2]
>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm assigning new daemon [[23526,0],2] to node grsacc17/1-4
>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm add new daemon [[23526,0],3]
>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm assigning new daemon [[23526,0],3] to node grsacc17/0-5
>>>> [grsacc20:21012] [[23526,0],0] plm:tm: launching vm
>>>> [grsacc20:21012] [[23526,0],0] plm:tm: final top-level argv:
>>>>     orted -mca ess tm -mca orte_ess_jobid 1541799936 -mca orte_ess_vpid <template> -mca orte_ess_num_procs 4 -mca orte_hnp_uri "1541799936.0;tcp://192.168.222.20:49049" -mca plm_base_verbose 5
>>>> [warn] opal_libevent2021_event_base_loop: reentrant invocation. Only one event_base_loop can run on each event_base at once.
>>>> [grsacc20:21012] [[23526,0],0] plm:base:orted_cmd sending orted_exit commands
>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive stop comm
>>>>
>>>> Says something?
>>>>
>>>> Best,
>>>> Suraj
>>>>
>>>>
>>>> On Sep 22, 2013, at 9:45 PM, Ralph Castain wrote:
>>>>
>>>>> I'll still need to look at the intercomm_create issue, but I just tested both the trunk and the current 1.7.3 branch for "add-host" and both worked just fine. This was on my little test cluster, which only has rsh available - no Torque.
>>>>>
>>>>> You might add "-mca plm_base_verbose 5" to your cmd line to get some debug output as to the problem.
>>>>>
>>>>>
>>>>> On Sep 21, 2013, at 5:48 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>
>>>>>>
>>>>>> On Sep 21, 2013, at 4:54 PM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> wrote:
>>>>>>
>>>>>>> Dear all,
>>>>>>>
>>>>>>> Really, thanks a lot for your efforts. I too downloaded the trunk to check if it works for my case, and as of revision 29215 it works for the original case I reported. Although it works, I still see the following in the output. Does it mean anything?
>>>>>>>
>>>>>>> [grsacc17][[13611,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create] [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13611,2],0]
>>>>>>
>>>>>> Yes - it means we don't quite have this right yet :-(
>>>>>>
>>>>>>> However, on another topic relevant to my use case, I have another problem to report. I am having problems using the "add-host" info to MPI_Comm_spawn() when MPI is compiled with support for the Torque resource manager.
>>>>>>> This problem is totally new in the 1.7 series; it worked perfectly until 1.6.5.
>>>>>>>
>>>>>>> Basically, I am working on implementing dynamic resource management facilities in the Torque/Maui batch system. Through a new tm call, an application can get new resources for a job.
>>>>>>
>>>>>> FWIW: you'll find that we added an API to the orte RAS framework to support precisely that operation. It allows an application to request that we dynamically obtain additional resources during execution (e.g., as part of a Comm_spawn call via an info_key). We originally implemented this with Slurm, but you could add the calls into the Torque component as well if you like.
>>>>>>
>>>>>> This is in the trunk now - will come over to 1.7.4
>>>>>>
>>>>>>> I want to use MPI_Comm_spawn() to spawn new processes on the new hosts. With my extended Torque/Maui batch system, I was able to use the "add-host" info argument to MPI_Comm_spawn() perfectly well to spawn new processes on these hosts. Since MPI and Torque refer to the hosts through the nodeids, I made sure that Open MPI uses the correct nodeids for these new hosts.
>>>>>>> Until 1.6.5, this worked perfectly fine, except that due to the Intercomm_merge problem, I could not really run a real application to completion.
>>>>>>>
>>>>>>> While this is now fixed in the trunk, I found, however, that when using the "add-host" info argument, everything collapses after printing the following error:
>>>>>>>
>>>>>>> [warn] opal_libevent2021_event_base_loop: reentrant invocation. Only one event_base_loop can run on each event_base at once.
>>>>>>
>>>>>> I'll take a look - probably some stale code that hasn't been updated yet for async ORTE operations
>>>>>>
>>>>>>> And due to this, I am still not really able to run my application! I also compiled MPI without any Torque/PBS support and just used the "add-host" argument normally. Again, this worked perfectly in 1.6.5, but in the 1.7 series it works only after printing the following error:
>>>>>>>
>>>>>>> [grsacc17][[13731,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create] [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13731,2],0]
>>>>>>> [grsacc17][[13731,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create] [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13731,2],0]
>>>>>>
>>>>>> Yeah, the 1.7 series doesn't have the reentrant test in it - so we "illegally" re-enter libevent. The error again means we don't have Intercomm_create correct just yet.
>>>>>>
>>>>>> I'll see what I can do about this and get back to you
>>>>>>
>>>>>>> In short, with pbs/torque support it fails, and without pbs/torque support it runs after spitting out the above lines.
>>>>>>>
>>>>>>> I would really appreciate some help on this, since I need these features to actually test my case, and (at least in my short experience) no other MPI implementation seems friendly to such dynamic scenarios.
>>>>>>>
>>>>>>> Thanks a lot!
>>>>>>>
>>>>>>> Best,
>>>>>>> Suraj
>>>>>>>
>>>>>>>
>>>>>>> On Sep 20, 2013, at 4:58 PM, Jeff Squyres (jsquyres) wrote:
>>>>>>>
>>>>>>>> Just to close my end of this loop: as of trunk r29213, it all works for me. Thanks!
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sep 18, 2013, at 12:52 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>
>>>>>>>>> Thanks George - much appreciated
>>>>>>>>>
>>>>>>>>> On Sep 18, 2013, at 9:49 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>>
>>>>>>>>>> The test case was broken. I just pushed a fix.
>>>>>>>>>>
>>>>>>>>>> George.
>>>>>>>>>>
>>>>>>>>>> On Sep 18, 2013, at 16:49 , Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hangs with any np > 1
>>>>>>>>>>>
>>>>>>>>>>> However, I'm not sure if that's an issue with the test vs the underlying implementation
>>>>>>>>>>>
>>>>>>>>>>> On Sep 18, 2013, at 7:40 AM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Does it hang when you run with -np 4?
>>>>>>>>>>>>
>>>>>>>>>>>> Sent from my phone. No type good.
>>>>>>>>>>>>
>>>>>>>>>>>> On Sep 18, 2013, at 4:10 PM, "Ralph Castain" <r...@open-mpi.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Strange - it works fine for me on my Mac. However, I see one difference - I only run it with np=1
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sep 18, 2013, at 2:22 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sep 18, 2013, at 9:33 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. sm doesn't work between spawned processes. So you must have another network enabled.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I know :-). I have tcp available as well (OMPI will abort if you only run with sm,self because the comm_spawn will fail with unreachable errors -- I just tested/proved this to myself).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2. Don't use the test case attached to my email, I left an xterm based spawn and the debugging. It can't work without xterm support. Instead try using the test case from the trunk, the one committed by Ralph.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I didn't see any "xterm" strings in there, but ok. :-) I ran with orte/test/mpi/intercomm_create.c, and that hangs for me as well:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 4]
>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 5]
>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 6]
>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 7]
>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 4]
>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 5]
>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 6]
>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 7]
>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Similarly, on my Mac, it hangs with no output:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> George.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Sep 18, 2013, at 07:53 , "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> George --
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> When I build the SVN trunk (r29201) on 64 bit linux, your attached test case hangs:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 4]
>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 5]
>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 6]
>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) [rank 7]
>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 4]
>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 5]
>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 6]
>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) [rank 7]
>>>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On my Mac, it hangs without printing anything:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Sep 18, 2013, at 1:48 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Here is a quick (and definitively not the cleanest) patch that addresses the MPI_Intercomm issue at the MPI level. It should be applied after removal of 29166.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I also added the corrected test case stressing the corner cases by doing barriers at every inter-comm creation and doing a clean disconnect.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Jeff Squyres
>>>>>>>>>>>>>>>> jsquy...@cisco.com
>>>>>>>>>>>>>>>> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>> devel mailing list
>>>>>>>>>>>>>>>> de...@open-mpi.org
>>>>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel