Afraid I don't see the problem offhand - can you add the following to your 
command line?

-mca state_base_verbose 10 -mca errmgr_base_verbose 10
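For example, appended to a typical invocation (here "-np 2" and "./my_app" are 
placeholders for your actual process count and binary):

```shell
# Placeholder invocation: substitute your real process count and application.
# The two -mca flags raise ORTE state-machine and error-manager verbosity.
mpirun -np 2 -mca state_base_verbose 10 -mca errmgr_base_verbose 10 ./my_app
```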

Thanks
Ralph

On Sep 24, 2013, at 6:35 AM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> 
wrote:

> Hi Ralph, 
> 
> I have always gotten this output from any MPI job that ran on our nodes. There 
> seems to be a problem somewhere, but it never stopped the applications from 
> running. Anyway, I ran it again now with only the tcp BTL, excluding 
> InfiniBand, and I get the same output again, except that this time the 
> openib-related error is gone. The log is printed below. 
> 
> [grsacc20:04578] [[6160,0],0] plm:base:receive processing msg
> [grsacc20:04578] [[6160,0],0] plm:base:receive job launch command from 
> [[6160,1],0]
> [grsacc20:04578] [[6160,0],0] plm:base:receive adding hosts
> [grsacc20:04578] [[6160,0],0] plm:base:receive calling spawn
> [grsacc20:04578] [[6160,0],0] plm:base:receive done processing commands
> [grsacc20:04578] [[6160,0],0] plm:base:setup_job
> [grsacc20:04578] [[6160,0],0] plm:base:setup_vm
> [grsacc20:04578] [[6160,0],0] plm:base:setup_vm add new daemon [[6160,0],2]
> [grsacc20:04578] [[6160,0],0] plm:base:setup_vm assigning new daemon 
> [[6160,0],2] to node grsacc18
> [grsacc20:04578] [[6160,0],0] plm:tm: launching vm
> [grsacc20:04578] [[6160,0],0] plm:tm: final top-level argv:
>       orted -mca ess tm -mca orte_ess_jobid 403701760 -mca orte_ess_vpid 
> <template> -mca orte_ess_num_procs 3 -mca orte_hnp_uri 
> "403701760.0;tcp://192.168.222.20:35163" -mca plm_base_verbose 5 -mca btl 
> tcp,sm,self
> [grsacc20:04578] [[6160,0],0] plm:tm: launching on node grsacc19
> [grsacc20:04578] [[6160,0],0] plm:tm: executing:
>       orted -mca ess tm -mca orte_ess_jobid 403701760 -mca orte_ess_vpid 1 
> -mca orte_ess_num_procs 3 -mca orte_hnp_uri 
> "403701760.0;tcp://192.168.222.20:35163" -mca plm_base_verbose 5 -mca btl 
> tcp,sm,self
> [grsacc20:04578] [[6160,0],0] plm:tm: launching on node grsacc18
> [grsacc20:04578] [[6160,0],0] plm:tm: executing:
>       orted -mca ess tm -mca orte_ess_jobid 403701760 -mca orte_ess_vpid 2 
> -mca orte_ess_num_procs 3 -mca orte_hnp_uri 
> "403701760.0;tcp://192.168.222.20:35163" -mca plm_base_verbose 5 -mca btl 
> tcp,sm,self
> [grsacc20:04578] [[6160,0],0] plm:tm:launch: finished spawning orteds
> [grsacc19:28821] mca:base:select:(  plm) Querying component [rsh]
> [grsacc19:28821] [[6160,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
> [grsacc19:28821] mca:base:select:(  plm) Query of component [rsh] set 
> priority to 10
> [grsacc19:28821] mca:base:select:(  plm) Selected component [rsh]
> [grsacc19:28821] [[6160,0],1] plm:rsh_setup on agent ssh : rsh path NULL
> [grsacc19:28821] [[6160,0],1] plm:base:receive start comm
> [grsacc19:28821] [[6160,0],1] plm:base:receive stop comm
> [grsacc18:16717] mca:base:select:(  plm) Querying component [rsh]
> [grsacc18:16717] [[6160,0],2] plm:rsh_lookup on agent ssh : rsh path NULL
> [grsacc18:16717] mca:base:select:(  plm) Query of component [rsh] set 
> priority to 10
> [grsacc18:16717] mca:base:select:(  plm) Selected component [rsh]
> [grsacc18:16717] [[6160,0],2] plm:rsh_setup on agent ssh : rsh path NULL
> [grsacc18:16717] [[6160,0],2] plm:base:receive start comm
> [grsacc20:04578] [[6160,0],0] plm:base:orted_report_launch from daemon 
> [[6160,0],2]
> [grsacc20:04578] [[6160,0],0] plm:base:orted_report_launch from daemon 
> [[6160,0],2] on node grsacc18
> [grsacc20:04578] [[6160,0],0] plm:base:orted_report_launch completed for 
> daemon [[6160,0],2] at contact 403701760.2;tcp://192.168.222.18:44229
> [grsacc20:04578] [[6160,0],0] plm:base:launch_apps for job [6160,2]
> [grsacc20:04578] [[6160,0],0] plm:base:receive processing msg
> [grsacc20:04578] [[6160,0],0] plm:base:receive update proc state command from 
> [[6160,0],2]
> [grsacc20:04578] [[6160,0],0] plm:base:receive got update_proc_state for job 
> [6160,2]
> [grsacc20:04578] [[6160,0],0] plm:base:receive got update_proc_state for vpid 
> 0 state RUNNING exit_code 0
> [grsacc20:04578] [[6160,0],0] plm:base:receive done processing commands
> [grsacc20:04578] [[6160,0],0] plm:base:launch wiring up iof for job [6160,2]
> [grsacc20:04578] [[6160,0],0] plm:base:receive processing msg
> [grsacc20:04578] [[6160,0],0] plm:base:receive done processing commands
> [grsacc20:04578] [[6160,0],0] plm:base:launch registered event
> [grsacc20:04578] [[6160,0],0] plm:base:launch sending dyn release of job 
> [6160,2] to [[6160,1],0]
> [grsacc20:04578] [[6160,0],0] plm:base:orted_cmd sending orted_exit commands
> [grsacc19:28815] [[6160,0],1] plm:base:receive stop comm
> [grsacc20:04578] [[6160,0],0] plm:base:receive stop comm
> -bash-4.1$ [grsacc18:16717] [[6160,0],2] plm:base:receive stop comm
> 
> Best,
> Suraj
> On Sep 24, 2013, at 3:24 PM, Ralph Castain wrote:
> 
>> Your output shows that it launched your apps, but they exited. The error is 
>> reported here, though it appears we aren't flushing the message out before 
>> exiting due to a race condition:
>> 
>>> [grsacc20:04511] 1 more process has sent help message 
>>> help-mpi-btl-openib.txt / no active ports found
>> 
>> Here is the full text:
>> [no active ports found]
>> WARNING: There is at least non-excluded one OpenFabrics device found,
>> but there are no active ports detected (or Open MPI was unable to use
>> them).  This is most certainly not what you wanted.  Check your
>> cables, subnet manager configuration, etc.  The openib BTL will be
>> ignored for this job.
>> 
>> Local host: %s
>> 
>> Looks like at least one node being used doesn't have an active InfiniBand 
>> port on it?
>> 
>> 
>> On Sep 24, 2013, at 6:11 AM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> 
>> wrote:
>> 
>>> Hi Ralph,
>>> 
>>> I tested it with the trunk r29228. I still have the following problem. Now 
>>> it even spawns the daemon on the new node through Torque, but then suddenly 
>>> quits. The following is the output. Can you please have a look? 
>>> 
>>> Thanks
>>> Suraj
>>> 
>>> [grsacc20:04511] [[6253,0],0] plm:base:receive processing msg
>>> [grsacc20:04511] [[6253,0],0] plm:base:receive job launch command from 
>>> [[6253,1],0]
>>> [grsacc20:04511] [[6253,0],0] plm:base:receive adding hosts
>>> [grsacc20:04511] [[6253,0],0] plm:base:receive calling spawn
>>> [grsacc20:04511] [[6253,0],0] plm:base:receive done processing commands
>>> [grsacc20:04511] [[6253,0],0] plm:base:setup_job
>>> [grsacc20:04511] [[6253,0],0] plm:base:setup_vm
>>> [grsacc20:04511] [[6253,0],0] plm:base:setup_vm add new daemon [[6253,0],2]
>>> [grsacc20:04511] [[6253,0],0] plm:base:setup_vm assigning new daemon 
>>> [[6253,0],2] to node grsacc18
>>> [grsacc20:04511] [[6253,0],0] plm:tm: launching vm
>>> [grsacc20:04511] [[6253,0],0] plm:tm: final top-level argv:
>>>     orted -mca ess tm -mca orte_ess_jobid 409796608 -mca orte_ess_vpid 
>>> <template> -mca orte_ess_num_procs 3 -mca orte_hnp_uri 
>>> "409796608.0;tcp://192.168.222.20:53097" -mca plm_base_verbose 6
>>> [grsacc20:04511] [[6253,0],0] plm:tm: launching on node grsacc19
>>> [grsacc20:04511] [[6253,0],0] plm:tm: executing:
>>>     orted -mca ess tm -mca orte_ess_jobid 409796608 -mca orte_ess_vpid 1 
>>> -mca orte_ess_num_procs 3 -mca orte_hnp_uri 
>>> "409796608.0;tcp://192.168.222.20:53097" -mca plm_base_verbose 6
>>> [grsacc20:04511] [[6253,0],0] plm:tm: launching on node grsacc18
>>> [grsacc20:04511] [[6253,0],0] plm:tm: executing:
>>>     orted -mca ess tm -mca orte_ess_jobid 409796608 -mca orte_ess_vpid 2 
>>> -mca orte_ess_num_procs 3 -mca orte_hnp_uri 
>>> "409796608.0;tcp://192.168.222.20:53097" -mca plm_base_verbose 6
>>> [grsacc20:04511] [[6253,0],0] plm:tm:launch: finished spawning orteds
>>> [grsacc19:28754] mca:base:select:(  plm) Querying component [rsh]
>>> [grsacc19:28754] [[6253,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
>>> [grsacc19:28754] mca:base:select:(  plm) Query of component [rsh] set 
>>> priority to 10
>>> [grsacc19:28754] mca:base:select:(  plm) Selected component [rsh]
>>> [grsacc19:28754] [[6253,0],1] plm:rsh_setup on agent ssh : rsh path NULL
>>> [grsacc19:28754] [[6253,0],1] plm:base:receive start comm
>>> [grsacc19:28754] [[6253,0],1] plm:base:receive stop comm
>>> [grsacc18:16648] mca:base:select:(  plm) Querying component [rsh]
>>> [grsacc18:16648] [[6253,0],2] plm:rsh_lookup on agent ssh : rsh path NULL
>>> [grsacc18:16648] mca:base:select:(  plm) Query of component [rsh] set 
>>> priority to 10
>>> [grsacc18:16648] mca:base:select:(  plm) Selected component [rsh]
>>> [grsacc18:16648] [[6253,0],2] plm:rsh_setup on agent ssh : rsh path NULL
>>> [grsacc18:16648] [[6253,0],2] plm:base:receive start comm
>>> [grsacc20:04511] [[6253,0],0] plm:base:orted_report_launch from daemon 
>>> [[6253,0],2]
>>> [grsacc20:04511] [[6253,0],0] plm:base:orted_report_launch from daemon 
>>> [[6253,0],2] on node grsacc18
>>> [grsacc20:04511] [[6253,0],0] plm:base:orted_report_launch completed for 
>>> daemon [[6253,0],2] at contact 409796608.2;tcp://192.168.222.18:47974
>>> [grsacc20:04511] [[6253,0],0] plm:base:launch_apps for job [6253,2]
>>> [grsacc20:04511] 1 more process has sent help message 
>>> help-mpi-btl-openib.txt / no active ports found
>>> [grsacc20:04511] Set MCA parameter "orte_base_help_aggregate" to 0 to see 
>>> all help / error messages
>>> [grsacc20:04511] 1 more process has sent help message help-mpi-btl-base.txt 
>>> / btl:no-nics
>>> [grsacc20:04511] [[6253,0],0] plm:base:receive processing msg
>>> [grsacc20:04511] [[6253,0],0] plm:base:receive update proc state command 
>>> from [[6253,0],2]
>>> [grsacc20:04511] [[6253,0],0] plm:base:receive got update_proc_state for 
>>> job [6253,2]
>>> [grsacc20:04511] [[6253,0],0] plm:base:receive got update_proc_state for 
>>> vpid 0 state RUNNING exit_code 0
>>> [grsacc20:04511] [[6253,0],0] plm:base:receive done processing commands
>>> [grsacc20:04511] [[6253,0],0] plm:base:launch wiring up iof for job [6253,2]
>>> [grsacc20:04511] [[6253,0],0] plm:base:receive processing msg
>>> [grsacc20:04511] [[6253,0],0] plm:base:receive done processing commands
>>> [grsacc20:04511] [[6253,0],0] plm:base:launch registered event
>>> [grsacc20:04511] [[6253,0],0] plm:base:launch sending dyn release of job 
>>> [6253,2] to [[6253,1],0]
>>> [grsacc20:04511] [[6253,0],0] plm:base:orted_cmd sending orted_exit commands
>>> [grsacc19:28747] [[6253,0],1] plm:base:receive stop comm
>>> [grsacc20:04511] [[6253,0],0] plm:base:receive stop comm
>>> -bash-4.1$ [grsacc18:16648] [[6253,0],2] plm:base:receive stop comm
>>> 
>>> 
>>> 
>>> 
>>> On Sep 23, 2013, at 1:55 AM, Ralph Castain wrote:
>>> 
>>>> Found a bug in the Torque support - we were trying to connect to the MOM 
>>>> again, which would hang (I imagine). I pushed a fix to the trunk (r29227) 
>>>> and scheduled it to come to 1.7.3 if you want to try it again.
>>>> 
>>>> 
>>>> On Sep 22, 2013, at 4:21 PM, Suraj Prabhakaran 
>>>> <suraj.prabhaka...@gmail.com> wrote:
>>>> 
>>>>> Dear Ralph,
>>>>> 
>>>>> This is the output I get when I execute with the verbose option.
>>>>> 
>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive processing msg
>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive job launch command from 
>>>>> [[23526,1],0]
>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive adding hosts
>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive calling spawn
>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive done processing commands
>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_job
>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm
>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm add new daemon 
>>>>> [[23526,0],2]
>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm assigning new daemon 
>>>>> [[23526,0],2] to node grsacc17/1-4
>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm add new daemon 
>>>>> [[23526,0],3]
>>>>> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm assigning new daemon 
>>>>> [[23526,0],3] to node grsacc17/0-5
>>>>> [grsacc20:21012] [[23526,0],0] plm:tm: launching vm
>>>>> [grsacc20:21012] [[23526,0],0] plm:tm: final top-level argv:
>>>>>   orted -mca ess tm -mca orte_ess_jobid 1541799936 -mca orte_ess_vpid 
>>>>> <template> -mca orte_ess_num_procs 4 -mca orte_hnp_uri 
>>>>> "1541799936.0;tcp://192.168.222.20:49049" -mca plm_base_verbose 5
>>>>> [warn] opal_libevent2021_event_base_loop: reentrant invocation.  Only one 
>>>>> event_base_loop can run on each event_base at once.
>>>>> [grsacc20:21012] [[23526,0],0] plm:base:orted_cmd sending orted_exit 
>>>>> commands
>>>>> [grsacc20:21012] [[23526,0],0] plm:base:receive stop comm
>>>>> 
>>>>> Says something?
>>>>> 
>>>>> Best,
>>>>> Suraj
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Sep 22, 2013, at 9:45 PM, Ralph Castain wrote:
>>>>> 
>>>>>> I'll still need to look at the intercomm_create issue, but I just tested 
>>>>>> both the trunk and current 1.7.3 branch for "add-host" and both worked 
>>>>>> just fine. This was on my little test cluster which only has rsh 
>>>>>> available - no Torque.
>>>>>> 
>>>>>> You might add "-mca plm_base_verbose 5" to your cmd line to get some 
>>>>>> debug output as to the problem.
>>>>>> 
>>>>>> 
>>>>>> On Sep 21, 2013, at 5:48 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> On Sep 21, 2013, at 4:54 PM, Suraj Prabhakaran 
>>>>>>> <suraj.prabhaka...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Dear all,
>>>>>>>> 
>>>>>>>> Really thanks a lot for your efforts. I too downloaded the trunk to 
>>>>>>>> check if it works for my case and as of revision 29215, it works for 
>>>>>>>> the original case I reported. Although it works, I still see the 
>>>>>>>> following in the output. Does it mean anything?
>>>>>>>> [grsacc17][[13611,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>>>>>>>  [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13611,2],0]
>>>>>>> 
>>>>>>> Yes - it means we don't quite have this right yet :-(
>>>>>>> 
>>>>>>>> 
>>>>>>>> However, on another topic relevant to my use case, I have another 
>>>>>>>> problem to report. I am having problems passing the "add-host" info 
>>>>>>>> key to MPI_Comm_spawn() when Open MPI is compiled with support for the 
>>>>>>>> Torque resource manager. This problem is new in the 1.7 series; it 
>>>>>>>> worked perfectly up to 1.6.5.
>>>>>>>> 
>>>>>>>> Basically, I am working on implementing dynamic resource management 
>>>>>>>> facilities in the Torque/Maui batch system. Through a new tm call, an 
>>>>>>>> application can get new resources for a job.
>>>>>>> 
>>>>>>> FWIW: you'll find that we added an API to the orte RAS framework to 
>>>>>>> support precisely that operation. It allows an application to request 
>>>>>>> that we dynamically obtain additional resources during execution (e.g., 
>>>>>>> as part of a Comm_spawn call via an info_key). We originally 
>>>>>>> implemented this with Slurm, but you could add the calls into the 
>>>>>>> Torque component as well if you like.
>>>>>>> 
>>>>>>> This is in the trunk now - will come over to 1.7.4
>>>>>>> 
>>>>>>> 
>>>>>>>> I want to use MPI_Comm_spawn() to spawn new processes on the new 
>>>>>>>> hosts. With my extended Torque/Maui batch system, I was able to use 
>>>>>>>> the "add-host" info argument to MPI_Comm_spawn() to spawn new 
>>>>>>>> processes on these hosts. Since MPI and Torque refer to the hosts 
>>>>>>>> through nodeids, I made sure that Open MPI uses the correct nodeids 
>>>>>>>> for these new hosts. 
>>>>>>>> Up to 1.6.5 this worked perfectly, except that due to the 
>>>>>>>> Intercomm_merge problem, I could not run a real application to its 
>>>>>>>> completion.
>>>>>>>> 
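[The spawn pattern described above, as a minimal hedged sketch; the host name 
"grsacc99" and worker binary "./worker" are hypothetical placeholders, and 
"add-host" is the Open MPI info key under discussion:]

```c
/* Minimal sketch of MPI_Comm_spawn with the Open MPI "add-host" info key.
 * "grsacc99" and "./worker" are hypothetical placeholders; running this
 * requires an MPI environment that accepts the added host. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm intercomm;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    /* Ask the runtime to extend the job's host list before spawning. */
    MPI_Info_set(info, "add-host", "grsacc99");

    /* Spawn one worker process; root rank 0 supplies the info object. */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, info, 0,
                   MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

    MPI_Info_free(&info);
    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}
```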
>>>>>>>> While this is now fixed in the trunk, I found that when using the 
>>>>>>>> "add-host" info argument, everything collapses after printing the 
>>>>>>>> following error. 
>>>>>>>> 
>>>>>>>> [warn] opal_libevent2021_event_base_loop: reentrant invocation.  Only 
>>>>>>>> one event_base_loop can run on each event_base at once.
>>>>>>> 
>>>>>>> I'll take a look - probably some stale code that hasn't been updated 
>>>>>>> yet for async ORTE operations
>>>>>>> 
>>>>>>>> 
>>>>>>>> And due to this, I am still not really able to run my application! I 
>>>>>>>> also compiled MPI without any Torque/PBS support and just used the 
>>>>>>>> "add-host" argument normally. Again, this worked perfectly in 1.6.5, 
>>>>>>>> but in the 1.7 series it works only after printing the following 
>>>>>>>> error.
>>>>>>>> 
>>>>>>>> [grsacc17][[13731,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>>>>>>>  [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13731,2],0]
>>>>>>>> [grsacc17][[13731,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>>>>>>>  [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13731,2],0]
>>>>>>> 
>>>>>>> Yeah, the 1.7 series doesn't have the reentrant test in it - so we 
>>>>>>> "illegally" re-enter libevent. The error again means we don't have 
>>>>>>> Intercomm_create correct just yet.
>>>>>>> 
>>>>>>> I'll see what I can do about this and get back to you
>>>>>>> 
>>>>>>>> 
>>>>>>>> In short, with PBS/Torque support it fails, and without PBS/Torque 
>>>>>>>> support it runs, but only after emitting the above lines. 
>>>>>>>> 
>>>>>>>> I would really appreciate some help on this, since I need these 
>>>>>>>> features to actually test my case, and (at least in my limited 
>>>>>>>> experience) no other MPI implementation seems friendly to such dynamic 
>>>>>>>> scenarios. 
>>>>>>>> 
>>>>>>>> Thanks a lot!
>>>>>>>> 
>>>>>>>> Best,
>>>>>>>> Suraj
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sep 20, 2013, at 4:58 PM, Jeff Squyres (jsquyres) wrote:
>>>>>>>> 
>>>>>>>>> Just to close my end of this loop: as of trunk r29213, it all works 
>>>>>>>>> for me.  Thanks!
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Sep 18, 2013, at 12:52 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>> 
>>>>>>>>>> Thanks George - much appreciated
>>>>>>>>>> 
>>>>>>>>>> On Sep 18, 2013, at 9:49 AM, George Bosilca <bosi...@icl.utk.edu> 
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> The test case was broken. I just pushed a fix.
>>>>>>>>>>> 
>>>>>>>>>>> George.
>>>>>>>>>>> 
>>>>>>>>>>> On Sep 18, 2013, at 16:49 , Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hangs with any np > 1
>>>>>>>>>>>> 
>>>>>>>>>>>> However, I'm not sure if that's an issue with the test vs the 
>>>>>>>>>>>> underlying implementation
>>>>>>>>>>>> 
>>>>>>>>>>>> On Sep 18, 2013, at 7:40 AM, "Jeff Squyres (jsquyres)" 
>>>>>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Does it hang when you run with -np 4?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Sent from my phone. No type good. 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Sep 18, 2013, at 4:10 PM, "Ralph Castain" <r...@open-mpi.org> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Strange - it works fine for me on my Mac. However, I see one 
>>>>>>>>>>>>>> difference - I only run it with np=1
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Sep 18, 2013, at 2:22 AM, Jeff Squyres (jsquyres) 
>>>>>>>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Sep 18, 2013, at 9:33 AM, George Bosilca 
>>>>>>>>>>>>>>> <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 1. sm doesn't work between spawned processes. So you must have 
>>>>>>>>>>>>>>>> another network enabled.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I know :-).  I have tcp available as well (OMPI will abort if 
>>>>>>>>>>>>>>> you only run with sm,self because the comm_spawn will fail with 
>>>>>>>>>>>>>>> unreachable errors -- I just tested/proved this to myself).
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 2. Don't use the test case attached to my email, I left an 
>>>>>>>>>>>>>>>> xterm based spawn and the debugging. It can't work without 
>>>>>>>>>>>>>>>> xterm support. Instead try using the test case from the trunk, 
>>>>>>>>>>>>>>>> the one committed by Ralph.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I didn't see any "xterm" strings in there, but ok.  :-)  I ran 
>>>>>>>>>>>>>>> with orte/test/mpi/intercomm_create.c, and that hangs for me as 
>>>>>>>>>>>>>>> well:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>>>>>>> &inter) [rank 4]
>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>>>>>>> &inter) [rank 5]
>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>>>>>>> &inter) [rank 6]
>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>>>>>>> &inter) [rank 7]
>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>>>> &inter) [rank 4]
>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>>>> &inter) [rank 5]
>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>>>> &inter) [rank 6]
>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>>>> &inter) [rank 7]
>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) 
>>>>>>>>>>>>>>> (0)
>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) 
>>>>>>>>>>>>>>> (0)
>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) 
>>>>>>>>>>>>>>> (0)
>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) 
>>>>>>>>>>>>>>> (0)
>>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Similarly, on my Mac, it hangs with no output:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> George.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Sep 18, 2013, at 07:53 , "Jeff Squyres (jsquyres)" 
>>>>>>>>>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> George --
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> When I build the SVN trunk (r29201) on 64 bit linux, your 
>>>>>>>>>>>>>>>>> attached test case hangs:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>>>>>>>>> &inter) [rank 4]
>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>>>>>>>>> &inter) [rank 5]
>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>>>>>>>>> &inter) [rank 6]
>>>>>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>>>>>>>>> &inter) [rank 7]
>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, 
>>>>>>>>>>>>>>>>> &inter) (0)
>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, 
>>>>>>>>>>>>>>>>> &inter) (0)
>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, 
>>>>>>>>>>>>>>>>> &inter) (0)
>>>>>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, 
>>>>>>>>>>>>>>>>> &inter) (0)
>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>>>>>> &inter) [rank 4]
>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>>>>>> &inter) [rank 5]
>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>>>>>> &inter) [rank 6]
>>>>>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>>>>>> &inter) [rank 7]
>>>>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On my Mac, it hangs without printing anything:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create   
>>>>>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>>>>>> -----
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Sep 18, 2013, at 1:48 AM, George Bosilca 
>>>>>>>>>>>>>>>>> <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Here is a quick (and definitively not the cleanest) patch 
>>>>>>>>>>>>>>>>>> that addresses the MPI_Intercomm issue at the MPI level. It 
>>>>>>>>>>>>>>>>>> should be applied after removal of 29166.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I also added the corrected test case stressing the corner 
>>>>>>>>>>>>>>>>>> cases by doing barriers at every inter-comm creation and 
>>>>>>>>>>>>>>>>>> doing a clean disconnect.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>>>>> Jeff Squyres
>>>>>>>>>>>>>>>>> jsquy...@cisco.com
>>>>>>>>>>>>>>>>> For corporate legal information go to: 
>>>>>>>>>>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>>>>> devel mailing list
>>>>>>>>>>>>>>>>> de...@open-mpi.org
>>>>>>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 
