Dear all,

Thanks a lot for your efforts. I also downloaded the trunk to check whether it 
works for my case, and as of revision 29215, it works for the original case I 
reported. Although it works, I still see the following in the output. Does it 
mean anything?
[grsacc17][[13611,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create] 
[btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13611,2],0]

However, on another topic relevant to my use case, I have a further problem to 
report. I am having trouble using the "add-host" info key with MPI_Comm_spawn() 
when Open MPI is compiled with support for the Torque resource manager. This 
problem is new in the 1.7 series; everything worked perfectly up through 1.6.5.

Basically, I am working on implementing dynamic resource management facilities 
in the Torque/Maui batch system. Through a new tm call, an application can 
obtain new resources for a running job. I want to use MPI_Comm_spawn() to spawn 
new processes on those new hosts. With my extended Torque/Maui batch system, I 
was able to use the "add-host" info argument to MPI_Comm_spawn() to spawn new 
processes on these hosts without any trouble. Since MPI and Torque refer to 
hosts through node IDs, I made sure that Open MPI uses the correct node IDs for 
the new hosts.
Up through 1.6.5 this worked perfectly fine, except that, due to the 
Intercomm_merge problem, I could not run a real application to completion.
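
For context, the core of my usage is essentially the following sketch; the 
host name "grsacc18" and the "./worker" binary are placeholders for the host 
returned by the extended tm call and for the spawned program:

-----
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm intercomm, merged;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    /* Ask Open MPI to extend the job with the newly granted host
       (placeholder name). */
    MPI_Info_set(info, "add-host", "grsacc18");

    /* Spawn one worker process on the added host. */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, info, 0,
                   MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

    /* Merge parent and child into a single intra-communicator;
       this is the step that failed before the trunk fix. */
    MPI_Intercomm_merge(intercomm, 0, &merged);

    MPI_Comm_free(&merged);
    MPI_Comm_free(&intercomm);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
-----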

While that is now fixed in the trunk, I found that when I use the "add-host" 
info argument, everything collapses after the following error is printed:

[warn] opal_libevent2021_event_base_loop: reentrant invocation.  Only one 
event_base_loop can run on each event_base at once.

Because of this, I am still not able to run my application. I also compiled 
Open MPI without any Torque/PBS support and used the "add-host" argument 
normally. Again, this worked perfectly in 1.6.5, but in the 1.7 series it only 
works after printing the following errors:

[grsacc17][[13731,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create] 
[btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13731,2],0]
[grsacc17][[13731,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create] 
[btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13731,2],0]

In short, with PBS/Torque support it fails, and without PBS/Torque support it 
runs, but only after spitting out the lines above.

I would really appreciate some help on this, since I need these features to 
actually test my case, and (at least in my short experience) no other MPI 
implementation seems friendly to such dynamic scenarios.

Thanks a lot!

Best,
Suraj



On Sep 20, 2013, at 4:58 PM, Jeff Squyres (jsquyres) wrote:

> Just to close my end of this loop: as of trunk r29213, it all works for me.  
> Thanks!
> 
> 
> On Sep 18, 2013, at 12:52 PM, Ralph Castain <r...@open-mpi.org> wrote:
> 
>> Thanks George - much appreciated
>> 
>> On Sep 18, 2013, at 9:49 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>> 
>>> The test case was broken. I just pushed a fix.
>>> 
>>> George.
>>> 
>>> On Sep 18, 2013, at 16:49 , Ralph Castain <r...@open-mpi.org> wrote:
>>> 
>>>> Hangs with any np > 1
>>>> 
>>>> However, I'm not sure whether that's an issue with the test or with the 
>>>> underlying implementation.
>>>> 
>>>> On Sep 18, 2013, at 7:40 AM, "Jeff Squyres (jsquyres)" 
>>>> <jsquy...@cisco.com> wrote:
>>>> 
>>>>> Does it hang when you run with -np 4?
>>>>> 
>>>>> Sent from my phone. No type good. 
>>>>> 
>>>>> On Sep 18, 2013, at 4:10 PM, "Ralph Castain" <r...@open-mpi.org> wrote:
>>>>> 
>>>>>> Strange - it works fine for me on my Mac. However, I see one difference 
>>>>>> - I only run it with np=1
>>>>>> 
>>>>>> On Sep 18, 2013, at 2:22 AM, Jeff Squyres (jsquyres) 
>>>>>> <jsquy...@cisco.com> wrote:
>>>>>> 
>>>>>>> On Sep 18, 2013, at 9:33 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>> 
>>>>>>>> 1. sm doesn't work between spawned processes. So you must have another 
>>>>>>>> network enabled.
>>>>>>> 
>>>>>>> I know :-).  I have tcp available as well (OMPI will abort if you only 
>>>>>>> run with sm,self because the comm_spawn will fail with unreachable 
>>>>>>> errors -- I just tested/proved this to myself).
>>>>>>> 
>>>>>>>> 2. Don't use the test case attached to my email, I left an xterm based 
>>>>>>>> spawn and the debugging. It can't work without xterm support. Instead 
>>>>>>>> try using the test case from the trunk, the one committed by Ralph.
>>>>>>> 
>>>>>>> I didn't see any "xterm" strings in there, but ok.  :-)  I ran with 
>>>>>>> orte/test/mpi/intercomm_create.c, and that hangs for me as well:
>>>>>>> 
>>>>>>> -----
>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) 
>>>>>>> [rank 4]
>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) 
>>>>>>> [rank 5]
>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) 
>>>>>>> [rank 6]
>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) 
>>>>>>> [rank 7]
>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) 
>>>>>>> [rank 4]
>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) 
>>>>>>> [rank 5]
>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) 
>>>>>>> [rank 6]
>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) 
>>>>>>> [rank 7]
>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>> [hang]
>>>>>>> -----
>>>>>>> 
>>>>>>> Similarly, on my Mac, it hangs with no output:
>>>>>>> 
>>>>>>> -----
>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>> [hang]
>>>>>>> -----
>>>>>>> 
>>>>>>>> George.
>>>>>>>> 
>>>>>>>> On Sep 18, 2013, at 07:53 , "Jeff Squyres (jsquyres)" 
>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>> 
>>>>>>>>> George --
>>>>>>>>> 
>>>>>>>>> When I build the SVN trunk (r29201) on 64 bit linux, your attached 
>>>>>>>>> test case hangs:
>>>>>>>>> 
>>>>>>>>> -----
>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) 
>>>>>>>>> [rank 4]
>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) 
>>>>>>>>> [rank 5]
>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) 
>>>>>>>>> [rank 6]
>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, &inter) 
>>>>>>>>> [rank 7]
>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) 
>>>>>>>>> [rank 4]
>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) 
>>>>>>>>> [rank 5]
>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) 
>>>>>>>>> [rank 6]
>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) 
>>>>>>>>> [rank 7]
>>>>>>>>> [hang]
>>>>>>>>> -----
>>>>>>>>> 
>>>>>>>>> On my Mac, it hangs without printing anything:
>>>>>>>>> 
>>>>>>>>> -----
>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create   
>>>>>>>>> [hang]
>>>>>>>>> -----
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Sep 18, 2013, at 1:48 AM, George Bosilca <bosi...@icl.utk.edu> 
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Here is a quick (and definitively not the cleanest) patch that 
>>>>>>>>>> addresses the MPI_Intercomm issue at the MPI level. It should be 
>>>>>>>>>> applied after removal of 29166.
>>>>>>>>>> 
>>>>>>>>>> I also added the corrected test case stressing the corner cases by 
>>>>>>>>>> doing barriers at every inter-comm creation and doing a clean 
>>>>>>>>>> disconnect.