Hi Ralph, 

I have always gotten this output from any MPI job that ran on our nodes. There 
seems to be a problem somewhere, but it has never stopped the applications from 
running. In any case, I have now rerun the job with only TCP, excluding 
InfiniBand, and I get the same output again, except that this time the error 
related to openib is gone. The log is printed below. 
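For reference, a command line along these lines produces such a run (the 
spawning application name here is a placeholder; the MCA flags match the orted 
argv visible in the log):

```shell
# TCP-only run: restrict the BTLs to tcp/sm/self so the openib
# component is never selected, and keep PLM debugging enabled.
# "./spawner" is a hypothetical application name.
mpirun --mca btl tcp,sm,self --mca plm_base_verbose 5 -np 1 ./spawner
```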

[grsacc20:04578] [[6160,0],0] plm:base:receive processing msg
[grsacc20:04578] [[6160,0],0] plm:base:receive job launch command from 
[[6160,1],0]
[grsacc20:04578] [[6160,0],0] plm:base:receive adding hosts
[grsacc20:04578] [[6160,0],0] plm:base:receive calling spawn
[grsacc20:04578] [[6160,0],0] plm:base:receive done processing commands
[grsacc20:04578] [[6160,0],0] plm:base:setup_job
[grsacc20:04578] [[6160,0],0] plm:base:setup_vm
[grsacc20:04578] [[6160,0],0] plm:base:setup_vm add new daemon [[6160,0],2]
[grsacc20:04578] [[6160,0],0] plm:base:setup_vm assigning new daemon 
[[6160,0],2] to node grsacc18
[grsacc20:04578] [[6160,0],0] plm:tm: launching vm
[grsacc20:04578] [[6160,0],0] plm:tm: final top-level argv:
        orted -mca ess tm -mca orte_ess_jobid 403701760 -mca orte_ess_vpid 
<template> -mca orte_ess_num_procs 3 -mca orte_hnp_uri 
"403701760.0;tcp://192.168.222.20:35163" -mca plm_base_verbose 5 -mca btl 
tcp,sm,self
[grsacc20:04578] [[6160,0],0] plm:tm: launching on node grsacc19
[grsacc20:04578] [[6160,0],0] plm:tm: executing:
        orted -mca ess tm -mca orte_ess_jobid 403701760 -mca orte_ess_vpid 1 
-mca orte_ess_num_procs 3 -mca orte_hnp_uri 
"403701760.0;tcp://192.168.222.20:35163" -mca plm_base_verbose 5 -mca btl 
tcp,sm,self
[grsacc20:04578] [[6160,0],0] plm:tm: launching on node grsacc18
[grsacc20:04578] [[6160,0],0] plm:tm: executing:
        orted -mca ess tm -mca orte_ess_jobid 403701760 -mca orte_ess_vpid 2 
-mca orte_ess_num_procs 3 -mca orte_hnp_uri 
"403701760.0;tcp://192.168.222.20:35163" -mca plm_base_verbose 5 -mca btl 
tcp,sm,self
[grsacc20:04578] [[6160,0],0] plm:tm:launch: finished spawning orteds
[grsacc19:28821] mca:base:select:(  plm) Querying component [rsh]
[grsacc19:28821] [[6160,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
[grsacc19:28821] mca:base:select:(  plm) Query of component [rsh] set priority 
to 10
[grsacc19:28821] mca:base:select:(  plm) Selected component [rsh]
[grsacc19:28821] [[6160,0],1] plm:rsh_setup on agent ssh : rsh path NULL
[grsacc19:28821] [[6160,0],1] plm:base:receive start comm
[grsacc19:28821] [[6160,0],1] plm:base:receive stop comm
[grsacc18:16717] mca:base:select:(  plm) Querying component [rsh]
[grsacc18:16717] [[6160,0],2] plm:rsh_lookup on agent ssh : rsh path NULL
[grsacc18:16717] mca:base:select:(  plm) Query of component [rsh] set priority 
to 10
[grsacc18:16717] mca:base:select:(  plm) Selected component [rsh]
[grsacc18:16717] [[6160,0],2] plm:rsh_setup on agent ssh : rsh path NULL
[grsacc18:16717] [[6160,0],2] plm:base:receive start comm
[grsacc20:04578] [[6160,0],0] plm:base:orted_report_launch from daemon 
[[6160,0],2]
[grsacc20:04578] [[6160,0],0] plm:base:orted_report_launch from daemon 
[[6160,0],2] on node grsacc18
[grsacc20:04578] [[6160,0],0] plm:base:orted_report_launch completed for daemon 
[[6160,0],2] at contact 403701760.2;tcp://192.168.222.18:44229
[grsacc20:04578] [[6160,0],0] plm:base:launch_apps for job [6160,2]
[grsacc20:04578] [[6160,0],0] plm:base:receive processing msg
[grsacc20:04578] [[6160,0],0] plm:base:receive update proc state command from 
[[6160,0],2]
[grsacc20:04578] [[6160,0],0] plm:base:receive got update_proc_state for job 
[6160,2]
[grsacc20:04578] [[6160,0],0] plm:base:receive got update_proc_state for vpid 0 
state RUNNING exit_code 0
[grsacc20:04578] [[6160,0],0] plm:base:receive done processing commands
[grsacc20:04578] [[6160,0],0] plm:base:launch wiring up iof for job [6160,2]
[grsacc20:04578] [[6160,0],0] plm:base:receive processing msg
[grsacc20:04578] [[6160,0],0] plm:base:receive done processing commands
[grsacc20:04578] [[6160,0],0] plm:base:launch registered event
[grsacc20:04578] [[6160,0],0] plm:base:launch sending dyn release of job 
[6160,2] to [[6160,1],0]
[grsacc20:04578] [[6160,0],0] plm:base:orted_cmd sending orted_exit commands
[grsacc19:28815] [[6160,0],1] plm:base:receive stop comm
[grsacc20:04578] [[6160,0],0] plm:base:receive stop comm
-bash-4.1$ [grsacc18:16717] [[6160,0],2] plm:base:receive stop comm

Best,
Suraj
On Sep 24, 2013, at 3:24 PM, Ralph Castain wrote:

> Your output shows that it launched your apps, but they exited. The error is 
> reported here, though it appears we aren't flushing the message out before 
> exiting due to a race condition:
> 
>> [grsacc20:04511] 1 more process has sent help message 
>> help-mpi-btl-openib.txt / no active ports found
> 
> Here is the full text:
> [no active ports found]
> WARNING: There is at least non-excluded one OpenFabrics device found,
> but there are no active ports detected (or Open MPI was unable to use
> them).  This is most certainly not what you wanted.  Check your
> cables, subnet manager configuration, etc.  The openib BTL will be
> ignored for this job.
> 
>  Local host: %s
> 
> Looks like at least one node being used doesn't have an active Infiniband 
> port on it?
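If a node genuinely has no active InfiniBand port, the warning can also be 
silenced by excluding the openib BTL explicitly. A minimal sketch, assuming a 
hypothetical application name:

```shell
# "^" negates the BTL list: run with every BTL except openib,
# so Open MPI never probes the OpenFabrics device.
# "./my_app" is a placeholder application name.
mpirun --mca btl ^openib -np 4 ./my_app
```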
> 
> 
> On Sep 24, 2013, at 6:11 AM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> 
> wrote:
> 
>> Hi Ralph,
>> 
>> I tested it with the trunk at r29228 and I still have the following problem. 
>> Now it even spawns the daemon on the new node through Torque, but then it 
>> suddenly quits. The output follows. Can you please have a look? 
>> 
>> Thanks
>> Suraj
>> 
>> [grsacc20:04511] [[6253,0],0] plm:base:receive processing msg
>> [grsacc20:04511] [[6253,0],0] plm:base:receive job launch command from 
>> [[6253,1],0]
>> [grsacc20:04511] [[6253,0],0] plm:base:receive adding hosts
>> [grsacc20:04511] [[6253,0],0] plm:base:receive calling spawn
>> [grsacc20:04511] [[6253,0],0] plm:base:receive done processing commands
>> [grsacc20:04511] [[6253,0],0] plm:base:setup_job
>> [grsacc20:04511] [[6253,0],0] plm:base:setup_vm
>> [grsacc20:04511] [[6253,0],0] plm:base:setup_vm add new daemon [[6253,0],2]
>> [grsacc20:04511] [[6253,0],0] plm:base:setup_vm assigning new daemon 
>> [[6253,0],2] to node grsacc18
>> [grsacc20:04511] [[6253,0],0] plm:tm: launching vm
>> [grsacc20:04511] [[6253,0],0] plm:tm: final top-level argv:
>>      orted -mca ess tm -mca orte_ess_jobid 409796608 -mca orte_ess_vpid 
>> <template> -mca orte_ess_num_procs 3 -mca orte_hnp_uri 
>> "409796608.0;tcp://192.168.222.20:53097" -mca plm_base_verbose 6
>> [grsacc20:04511] [[6253,0],0] plm:tm: launching on node grsacc19
>> [grsacc20:04511] [[6253,0],0] plm:tm: executing:
>>      orted -mca ess tm -mca orte_ess_jobid 409796608 -mca orte_ess_vpid 1 
>> -mca orte_ess_num_procs 3 -mca orte_hnp_uri 
>> "409796608.0;tcp://192.168.222.20:53097" -mca plm_base_verbose 6
>> [grsacc20:04511] [[6253,0],0] plm:tm: launching on node grsacc18
>> [grsacc20:04511] [[6253,0],0] plm:tm: executing:
>>      orted -mca ess tm -mca orte_ess_jobid 409796608 -mca orte_ess_vpid 2 
>> -mca orte_ess_num_procs 3 -mca orte_hnp_uri 
>> "409796608.0;tcp://192.168.222.20:53097" -mca plm_base_verbose 6
>> [grsacc20:04511] [[6253,0],0] plm:tm:launch: finished spawning orteds
>> [grsacc19:28754] mca:base:select:(  plm) Querying component [rsh]
>> [grsacc19:28754] [[6253,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
>> [grsacc19:28754] mca:base:select:(  plm) Query of component [rsh] set 
>> priority to 10
>> [grsacc19:28754] mca:base:select:(  plm) Selected component [rsh]
>> [grsacc19:28754] [[6253,0],1] plm:rsh_setup on agent ssh : rsh path NULL
>> [grsacc19:28754] [[6253,0],1] plm:base:receive start comm
>> [grsacc19:28754] [[6253,0],1] plm:base:receive stop comm
>> [grsacc18:16648] mca:base:select:(  plm) Querying component [rsh]
>> [grsacc18:16648] [[6253,0],2] plm:rsh_lookup on agent ssh : rsh path NULL
>> [grsacc18:16648] mca:base:select:(  plm) Query of component [rsh] set 
>> priority to 10
>> [grsacc18:16648] mca:base:select:(  plm) Selected component [rsh]
>> [grsacc18:16648] [[6253,0],2] plm:rsh_setup on agent ssh : rsh path NULL
>> [grsacc18:16648] [[6253,0],2] plm:base:receive start comm
>> [grsacc20:04511] [[6253,0],0] plm:base:orted_report_launch from daemon 
>> [[6253,0],2]
>> [grsacc20:04511] [[6253,0],0] plm:base:orted_report_launch from daemon 
>> [[6253,0],2] on node grsacc18
>> [grsacc20:04511] [[6253,0],0] plm:base:orted_report_launch completed for 
>> daemon [[6253,0],2] at contact 409796608.2;tcp://192.168.222.18:47974
>> [grsacc20:04511] [[6253,0],0] plm:base:launch_apps for job [6253,2]
>> [grsacc20:04511] 1 more process has sent help message 
>> help-mpi-btl-openib.txt / no active ports found
>> [grsacc20:04511] Set MCA parameter "orte_base_help_aggregate" to 0 to see 
>> all help / error messages
>> [grsacc20:04511] 1 more process has sent help message help-mpi-btl-base.txt 
>> / btl:no-nics
>> [grsacc20:04511] [[6253,0],0] plm:base:receive processing msg
>> [grsacc20:04511] [[6253,0],0] plm:base:receive update proc state command 
>> from [[6253,0],2]
>> [grsacc20:04511] [[6253,0],0] plm:base:receive got update_proc_state for job 
>> [6253,2]
>> [grsacc20:04511] [[6253,0],0] plm:base:receive got update_proc_state for 
>> vpid 0 state RUNNING exit_code 0
>> [grsacc20:04511] [[6253,0],0] plm:base:receive done processing commands
>> [grsacc20:04511] [[6253,0],0] plm:base:launch wiring up iof for job [6253,2]
>> [grsacc20:04511] [[6253,0],0] plm:base:receive processing msg
>> [grsacc20:04511] [[6253,0],0] plm:base:receive done processing commands
>> [grsacc20:04511] [[6253,0],0] plm:base:launch registered event
>> [grsacc20:04511] [[6253,0],0] plm:base:launch sending dyn release of job 
>> [6253,2] to [[6253,1],0]
>> [grsacc20:04511] [[6253,0],0] plm:base:orted_cmd sending orted_exit commands
>> [grsacc19:28747] [[6253,0],1] plm:base:receive stop comm
>> [grsacc20:04511] [[6253,0],0] plm:base:receive stop comm
>> -bash-4.1$ [grsacc18:16648] [[6253,0],2] plm:base:receive stop comm
>> 
>> 
>> 
>> 
>> On Sep 23, 2013, at 1:55 AM, Ralph Castain wrote:
>> 
>>> Found a bug in the Torque support - we were trying to connect to the MOM 
>>> again, which would hang (I imagine). I pushed a fix to the trunk (r29227) 
>>> and scheduled it to come to 1.7.3 if you want to try it again.
>>> 
>>> 
>>> On Sep 22, 2013, at 4:21 PM, Suraj Prabhakaran 
>>> <suraj.prabhaka...@gmail.com> wrote:
>>> 
>>>> Dear Ralph,
>>>> 
>>>> This is the output I get when I execute with the verbose option.
>>>> 
>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive processing msg
>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive job launch command from 
>>>> [[23526,1],0]
>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive adding hosts
>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive calling spawn
>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive done processing commands
>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_job
>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm
>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm add new daemon 
>>>> [[23526,0],2]
>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm assigning new daemon 
>>>> [[23526,0],2] to node grsacc17/1-4
>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm add new daemon 
>>>> [[23526,0],3]
>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm assigning new daemon 
>>>> [[23526,0],3] to node grsacc17/0-5
>>>> [grsacc20:21012] [[23526,0],0] plm:tm: launching vm
>>>> [grsacc20:21012] [[23526,0],0] plm:tm: final top-level argv:
>>>>    orted -mca ess tm -mca orte_ess_jobid 1541799936 -mca orte_ess_vpid 
>>>> <template> -mca orte_ess_num_procs 4 -mca orte_hnp_uri 
>>>> "1541799936.0;tcp://192.168.222.20:49049" -mca plm_base_verbose 5
>>>> [warn] opal_libevent2021_event_base_loop: reentrant invocation.  Only one 
>>>> event_base_loop can run on each event_base at once.
>>>> [grsacc20:21012] [[23526,0],0] plm:base:orted_cmd sending orted_exit 
>>>> commands
>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive stop comm
>>>> 
>>>> Does this say anything to you?
>>>> 
>>>> Best,
>>>> Suraj
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Sep 22, 2013, at 9:45 PM, Ralph Castain wrote:
>>>> 
>>>>> I'll still need to look at the intercomm_create issue, but I just tested 
>>>>> both the trunk and current 1.7.3 branch for "add-host" and both worked 
>>>>> just fine. This was on my little test cluster which only has rsh 
>>>>> available - no Torque.
>>>>> 
>>>>> You might add "-mca plm_base_verbose 5" to your cmd line to get some 
>>>>> debug output as to the problem.
>>>>> 
>>>>> 
>>>>> On Sep 21, 2013, at 5:48 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>> 
>>>>>> 
>>>>>> On Sep 21, 2013, at 4:54 PM, Suraj Prabhakaran 
>>>>>> <suraj.prabhaka...@gmail.com> wrote:
>>>>>> 
>>>>>>> Dear all,
>>>>>>> 
>>>>>>> Thanks a lot for your efforts. I also downloaded the trunk to check my 
>>>>>>> case, and as of revision 29215 it works for the original case I 
>>>>>>> reported. Although it works, I still see the following in the output. 
>>>>>>> Does it mean anything?
>>>>>>> [grsacc17][[13611,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>>>>>>  [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13611,2],0]
>>>>>> 
>>>>>> Yes - it means we don't quite have this right yet :-(
>>>>>> 
>>>>>>> 
>>>>>>> However, on another topic relevant to my use case, I have another 
>>>>>>> problem to report. I am having problems passing the "add-host" info key 
>>>>>>> to MPI_Comm_spawn() when Open MPI is compiled with support for the 
>>>>>>> Torque resource manager. This problem is new in the 1.7 series; it 
>>>>>>> worked perfectly up to 1.6.5. 
>>>>>>> 
>>>>>>> Basically, I am working on implementing dynamic resource management 
>>>>>>> facilities in the Torque/Maui batch system. Through a new tm call, an 
>>>>>>> application can get new resources for a job.
>>>>>> 
>>>>>> FWIW: you'll find that we added an API to the orte RAS framework to 
>>>>>> support precisely that operation. It allows an application to request 
>>>>>> that we dynamically obtain additional resources during execution (e.g., 
>>>>>> as part of a Comm_spawn call via an info_key). We originally implemented 
>>>>>> this with Slurm, but you could add the calls into the Torque component 
>>>>>> as well if you like.
>>>>>> 
>>>>>> This is in the trunk now - will come over to 1.7.4
>>>>>> 
>>>>>> 
>>>>>>> I want to use MPI_Comm_spawn() to spawn new processes on the new hosts. 
>>>>>>> With my extended Torque/Maui batch system, I was able to use the 
>>>>>>> "add-host" info argument to MPI_Comm_spawn() to spawn new processes on 
>>>>>>> these hosts. Since MPI and Torque refer to the hosts through node IDs, 
>>>>>>> I made sure that Open MPI uses the correct node IDs for these new 
>>>>>>> hosts. 
>>>>>>> Up to 1.6.5 this worked perfectly, except that due to the 
>>>>>>> Intercomm_merge problem I could not run a real application to 
>>>>>>> completion.
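The flow described above can be sketched in a few lines of C. This is a minimal 
sketch, not Suraj's actual test program: the hostname "grsacc18", the worker 
binary "./worker", and the process count are hypothetical placeholders.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Ask Open MPI to extend the job with a newly granted node
       before launching the children ("grsacc18" is a placeholder). */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "add-host", "grsacc18");

    /* Spawn two workers on the added host; "./worker" is hypothetical. */
    MPI_Comm intercomm;
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 2, info, 0,
                   MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

    /* Merge parent and children into one intra-communicator --
       the step affected by the Intercomm_merge problem mentioned above. */
    MPI_Comm merged;
    MPI_Intercomm_merge(intercomm, 0, &merged);

    MPI_Comm_free(&merged);
    MPI_Comm_free(&intercomm);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```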
>>>>>>> 
>>>>>>> While this is now fixed in the trunk, I found that when using the 
>>>>>>> "add-host" info argument, everything collapses after printing the 
>>>>>>> following error. 
>>>>>>> 
>>>>>>> [warn] opal_libevent2021_event_base_loop: reentrant invocation.  Only 
>>>>>>> one event_base_loop can run on each event_base at once.
>>>>>> 
>>>>>> I'll take a look - probably some stale code that hasn't been updated yet 
>>>>>> for async ORTE operations
>>>>>> 
>>>>>>> 
>>>>>>> Because of this, I am still not able to run my application! I also 
>>>>>>> compiled Open MPI without any Torque/PBS support and used the 
>>>>>>> "add-host" argument normally. Again, this worked perfectly in 1.6.5, 
>>>>>>> but in the 1.7 series it only works after printing the following 
>>>>>>> error.
>>>>>>> 
>>>>>>> [grsacc17][[13731,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>>>>>>  [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13731,2],0]
>>>>>>> [grsacc17][[13731,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>>>>>>  [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13731,2],0]
>>>>>> 
>>>>>> Yeah, the 1.7 series doesn't have the reentrant test in it - so we 
>>>>>> "illegally" re-enter libevent. The error again means we don't have 
>>>>>> Intercomm_create correct just yet.
>>>>>> 
>>>>>> I'll see what I can do about this and get back to you
>>>>>> 
>>>>>>> 
>>>>>>> In short, with PBS/Torque support it fails, and without PBS/Torque 
>>>>>>> support it runs after spitting out the above lines. 
>>>>>>> 
>>>>>>> I would really appreciate some help on this, since I need these 
>>>>>>> features to actually test my case, and (at least in my short 
>>>>>>> experience) no other MPI implementation seems friendly to such dynamic 
>>>>>>> scenarios. 
>>>>>>> 
>>>>>>> Thanks a lot!
>>>>>>> 
>>>>>>> Best,
>>>>>>> Suraj
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Sep 20, 2013, at 4:58 PM, Jeff Squyres (jsquyres) wrote:
>>>>>>> 
>>>>>>>> Just to close my end of this loop: as of trunk r29213, it all works 
>>>>>>>> for me.  Thanks!
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sep 18, 2013, at 12:52 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>> 
>>>>>>>>> Thanks George - much appreciated
>>>>>>>>> 
>>>>>>>>> On Sep 18, 2013, at 9:49 AM, George Bosilca <bosi...@icl.utk.edu> 
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> The test case was broken. I just pushed a fix.
>>>>>>>>>> 
>>>>>>>>>> George.
>>>>>>>>>> 
>>>>>>>>>> On Sep 18, 2013, at 16:49 , Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hangs with any np > 1
>>>>>>>>>>> 
>>>>>>>>>>> However, I'm not sure if that's an issue with the test vs the 
>>>>>>>>>>> underlying implementation
>>>>>>>>>>> 
>>>>>>>>>>> On Sep 18, 2013, at 7:40 AM, "Jeff Squyres (jsquyres)" 
>>>>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Does it hang when you run with -np 4?
>>>>>>>>>>>> 
>>>>>>>>>>>> Sent from my phone. No type good. 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Sep 18, 2013, at 4:10 PM, "Ralph Castain" <r...@open-mpi.org> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Strange - it works fine for me on my Mac. However, I see one 
>>>>>>>>>>>>> difference - I only run it with np=1
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Sep 18, 2013, at 2:22 AM, Jeff Squyres (jsquyres) 
>>>>>>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Sep 18, 2013, at 9:33 AM, George Bosilca 
>>>>>>>>>>>>>> <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 1. sm doesn't work between spawned processes. So you must have 
>>>>>>>>>>>>>>> another network enabled.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I know :-).  I have tcp available as well (OMPI will abort if 
>>>>>>>>>>>>>> you only run with sm,self because the comm_spawn will fail with 
>>>>>>>>>>>>>> unreachable errors -- I just tested/proved this to myself).
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 2. Don't use the test case attached to my email, I left an 
>>>>>>>>>>>>>>> xterm based spawn and the debugging. It can't work without 
>>>>>>>>>>>>>>> xterm support. Instead try using the test case from the trunk, 
>>>>>>>>>>>>>>> the one committed by Ralph.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I didn't see any "xterm" strings in there, but ok.  :-)  I ran 
>>>>>>>>>>>>>> with orte/test/mpi/intercomm_create.c, and that hangs for me as 
>>>>>>>>>>>>>> well:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>>>>>> &inter) [rank 4]
>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>>>>>> &inter) [rank 5]
>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>>>>>> &inter) [rank 6]
>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>>>>>> &inter) [rank 7]
>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>>> &inter) [rank 4]
>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>>> &inter) [rank 5]
>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>>> &inter) [rank 6]
>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>>> &inter) [rank 7]
>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) 
>>>>>>>>>>>>>> (0)
>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) 
>>>>>>>>>>>>>> (0)
>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) 
>>>>>>>>>>>>>> (0)
>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) 
>>>>>>>>>>>>>> (0)
>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Similarly, on my Mac, it hangs with no output:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> George.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Sep 18, 2013, at 07:53 , "Jeff Squyres (jsquyres)" 
>>>>>>>>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> George --
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> When I build the SVN trunk (r29201) on 64 bit linux, your 
>>>>>>>>>>>>>>>> attached test case hangs:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>>>>>>>> &inter) [rank 4]
>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>>>>>>>> &inter) [rank 5]
>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>>>>>>>> &inter) [rank 6]
>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>>>>>>>> &inter) [rank 7]
>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, 
>>>>>>>>>>>>>>>> &inter) (0)
>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, 
>>>>>>>>>>>>>>>> &inter) (0)
>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, 
>>>>>>>>>>>>>>>> &inter) (0)
>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, 
>>>>>>>>>>>>>>>> &inter) (0)
>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>>>>> &inter) [rank 4]
>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>>>>> &inter) [rank 5]
>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>>>>> &inter) [rank 6]
>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>>>>> &inter) [rank 7]
>>>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On my Mac, it hangs without printing anything:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create   
>>>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Sep 18, 2013, at 1:48 AM, George Bosilca 
>>>>>>>>>>>>>>>> <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Here is a quick (and definitively not the cleanest) patch 
>>>>>>>>>>>>>>>>> that addresses the MPI_Intercomm issue at the MPI level. It 
>>>>>>>>>>>>>>>>> should be applied after removal of 29166.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> I also added the corrected test case stressing the corner 
>>>>>>>>>>>>>>>>> cases by doing barriers at every inter-comm creation and 
>>>>>>>>>>>>>>>>> doing a clean disconnect.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>> Jeff Squyres
>>>>>>>>>>>>>>>> jsquy...@cisco.com
>>>>>>>>>>>>>>>> For corporate legal information go to: 
>>>>>>>>>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>> devel mailing list
>>>>>>>>>>>>>>>> de...@open-mpi.org
>>>>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 
