Found a bug in the Torque support - we were trying to connect to the MOM again, 
which would hang (I imagine). I pushed a fix to the trunk (r29227) and 
scheduled it to come to 1.7.3 if you want to try it again.
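
For anyone curious: the hang is consistent with calling tm_init() a second time in 
the same process, since (as far as I know) the TM interface only supports one active 
connection to the MOM per process. The sketch below shows the kind of guard involved - 
it is purely illustrative (the function and flag names are made up, and it is not the 
actual r29227 change):

    #include <stdbool.h>
    #include <tm.h>                    /* Torque/PBS task-management API */

    /* Illustrative guard: attach to the MOM at most once per process. */
    static bool mom_connected = false;
    static struct tm_roots tm_root;

    static int connect_to_mom(void)
    {
        int rc;

        if (mom_connected) {
            /* Already attached - a second tm_init() is what appeared to hang. */
            return 0;
        }
        rc = tm_init(NULL, &tm_root);
        if (TM_SUCCESS != rc) {
            return rc;                 /* report the error instead of retrying */
        }
        mom_connected = true;
        return 0;
    }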


On Sep 22, 2013, at 4:21 PM, Suraj Prabhakaran <suraj.prabhaka...@gmail.com> 
wrote:

> Dear Ralph,
> 
> This is the output I get when I execute with the verbose option.
> 
> [grsacc20:21012] [[23526,0],0] plm:base:receive processing msg
> [grsacc20:21012] [[23526,0],0] plm:base:receive job launch command from 
> [[23526,1],0]
> [grsacc20:21012] [[23526,0],0] plm:base:receive adding hosts
> [grsacc20:21012] [[23526,0],0] plm:base:receive calling spawn
> [grsacc20:21012] [[23526,0],0] plm:base:receive done processing commands
> [grsacc20:21012] [[23526,0],0] plm:base:setup_job
> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm
> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm add new daemon [[23526,0],2]
> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm assigning new daemon 
> [[23526,0],2] to node grsacc17/1-4
> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm add new daemon [[23526,0],3]
> [grsacc20:21012] [[23526,0],0] plm:base:setup_vm assigning new daemon 
> [[23526,0],3] to node grsacc17/0-5
> [grsacc20:21012] [[23526,0],0] plm:tm: launching vm
> [grsacc20:21012] [[23526,0],0] plm:tm: final top-level argv:
>       orted -mca ess tm -mca orte_ess_jobid 1541799936 -mca orte_ess_vpid 
> <template> -mca orte_ess_num_procs 4 -mca orte_hnp_uri 
> "1541799936.0;tcp://192.168.222.20:49049" -mca plm_base_verbose 5
> [warn] opal_libevent2021_event_base_loop: reentrant invocation.  Only one 
> event_base_loop can run on each event_base at once.
> [grsacc20:21012] [[23526,0],0] plm:base:orted_cmd sending orted_exit commands
> [grsacc20:21012] [[23526,0],0] plm:base:receive stop comm
> 
> Does this tell you anything?
> 
> Best,
> Suraj
> 
> 
> 
> 
> On Sep 22, 2013, at 9:45 PM, Ralph Castain wrote:
> 
>> I'll still need to look at the intercomm_create issue, but I just tested 
>> both the trunk and current 1.7.3 branch for "add-host" and both worked just 
>> fine. This was on my little test cluster which only has rsh available - no 
>> Torque.
>> 
>> You might add "-mca plm_base_verbose 5" to your cmd line to get some debug 
>> output as to the problem.
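>> 
>> For example (the application name here is only a placeholder):
>> 
>>    mpirun -np 1 -mca plm_base_verbose 5 ./your_app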
>> 
>> 
>> On Sep 21, 2013, at 5:48 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> 
>>> 
>>> On Sep 21, 2013, at 4:54 PM, Suraj Prabhakaran 
>>> <suraj.prabhaka...@gmail.com> wrote:
>>> 
>>>> Dear all,
>>>> 
>>>> Thanks a lot for your efforts. I also downloaded the trunk to check 
>>>> whether it works for my case, and as of revision 29215 it works for the 
>>>> original case I reported. Although it works, I still see the following in 
>>>> the output. Does it mean anything?
>>>> [grsacc17][[13611,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create] 
>>>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13611,2],0]
>>> 
>>> Yes - it means we don't quite have this right yet :-(
>>> 
>>>> 
>>>> However, on another topic relevant to my use case, I have a further problem 
>>>> to report. I am having trouble using the "add-host" info key with 
>>>> MPI_Comm_spawn() when Open MPI is compiled with support for the Torque 
>>>> resource manager. This problem is new in the 1.7 series; it worked 
>>>> perfectly up to 1.6.5.
>>>> 
>>>> Basically, I am working on implementing dynamic resource management 
>>>> facilities in the Torque/Maui batch system. Through a new tm call, an 
>>>> application can get new resources for a job.
>>> 
>>> FWIW: you'll find that we added an API to the orte RAS framework to support 
>>> precisely that operation. It allows an application to request that we 
>>> dynamically obtain additional resources during execution (e.g., as part of 
>>> a Comm_spawn call via an info_key). We originally implemented this with 
>>> Slurm, but you could add the calls into the Torque component as well if you 
>>> like.
>>> 
>>> This is in the trunk now - will come over to 1.7.4
>>> 
>>> 
>>>> I want to use MPI_Comm_spawn() to spawn new processes on the new hosts. 
>>>> With my extended Torque/Maui batch system, I was able to use the 
>>>> "add-host" info argument to MPI_Comm_spawn() to spawn new processes on 
>>>> these hosts without any trouble. Since MPI and Torque refer to the hosts 
>>>> through their nodeids, I made sure that Open MPI uses the correct nodeids 
>>>> for these new hosts. Up to 1.6.5 this worked perfectly, except that due to 
>>>> the Intercomm_merge problem I could not run a real application to 
>>>> completion.
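>>>> 
>>>> For reference, the spawn pattern I am using looks roughly like this (the 
>>>> host name and the worker binary are placeholders):
>>>> 
>>>>    MPI_Info info;
>>>>    MPI_Comm intercomm;
>>>> 
>>>>    MPI_Info_create(&info);
>>>>    /* host newly granted by the batch system, e.g. grsacc17 */
>>>>    MPI_Info_set(info, "add-host", "grsacc17");
>>>> 
>>>>    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, info,
>>>>                   0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
>>>>    MPI_Info_free(&info);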
>>>> 
>>>> While this is now fixed in the trunk, I found that when using the 
>>>> "add-host" info argument, everything collapses after printing the 
>>>> following error.
>>>> 
>>>> [warn] opal_libevent2021_event_base_loop: reentrant invocation.  Only one 
>>>> event_base_loop can run on each event_base at once.
>>> 
>>> I'll take a look - probably some stale code that hasn't been updated yet 
>>> for async ORTE operations
>>> 
>>>> 
>>>> Because of this, I am still not able to run my application. I also 
>>>> compiled Open MPI without any Torque/PBS support and just used the 
>>>> "add-host" argument normally. Again, this worked perfectly in 1.6.5, but 
>>>> in the 1.7 series it only works after printing the following error.
>>>> 
>>>> [grsacc17][[13731,1],0][btl_openib_proc.c:157:mca_btl_openib_proc_create] 
>>>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13731,2],0]
>>>> [grsacc17][[13731,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create] 
>>>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[13731,2],0]
>>> 
>>> Yeah, the 1.7 series doesn't have the reentrant test in it - so we 
>>> "illegally" re-enter libevent. The error again means we don't have 
>>> Intercomm_create correct just yet.
>>> 
>>> I'll see what I can do about this and get back to you
>>> 
>>>> 
>>>> In short, with PBS/Torque support it fails, and without PBS/Torque support 
>>>> it runs, but only after emitting the lines above. 
>>>> 
>>>> I would really appreciate some help on this, since I need these features 
>>>> to actually test my case, and (at least in my limited experience) no other 
>>>> MPI implementation seems friendly to such dynamic scenarios. 
>>>> 
>>>> Thanks a lot!
>>>> 
>>>> Best,
>>>> Suraj
>>>> 
>>>> 
>>>> 
>>>> On Sep 20, 2013, at 4:58 PM, Jeff Squyres (jsquyres) wrote:
>>>> 
>>>>> Just to close my end of this loop: as of trunk r29213, it all works for 
>>>>> me.  Thanks!
>>>>> 
>>>>> 
>>>>> On Sep 18, 2013, at 12:52 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>> 
>>>>>> Thanks George - much appreciated
>>>>>> 
>>>>>> On Sep 18, 2013, at 9:49 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>> 
>>>>>>> The test case was broken. I just pushed a fix.
>>>>>>> 
>>>>>>> George.
>>>>>>> 
>>>>>>> On Sep 18, 2013, at 16:49 , Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>> 
>>>>>>>> Hangs with any np > 1
>>>>>>>> 
>>>>>>>> However, I'm not sure whether that's an issue with the test or with 
>>>>>>>> the underlying implementation.
>>>>>>>> 
>>>>>>>> On Sep 18, 2013, at 7:40 AM, "Jeff Squyres (jsquyres)" 
>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>> 
>>>>>>>>> Does it hang when you run with -np 4?
>>>>>>>>> 
>>>>>>>>> Sent from my phone. No type good. 
>>>>>>>>> 
>>>>>>>>> On Sep 18, 2013, at 4:10 PM, "Ralph Castain" <r...@open-mpi.org> 
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Strange - it works fine for me on my Mac. However, I see one 
>>>>>>>>>> difference - I only run it with np=1
>>>>>>>>>> 
>>>>>>>>>> On Sep 18, 2013, at 2:22 AM, Jeff Squyres (jsquyres) 
>>>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> On Sep 18, 2013, at 9:33 AM, George Bosilca <bosi...@icl.utk.edu> 
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> 1. sm doesn't work between spawned processes. So you must have 
>>>>>>>>>>>> another network enabled.
>>>>>>>>>>> 
>>>>>>>>>>> I know :-).  I have tcp available as well (OMPI will abort if you 
>>>>>>>>>>> only run with sm,self because the comm_spawn will fail with 
>>>>>>>>>>> unreachable errors -- I just tested/proved this to myself).
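>>>>>>>>>>> 
>>>>>>>>>>> In concrete terms, the check I mean was along these lines (a sketch, 
>>>>>>>>>>> not the exact command I typed):
>>>>>>>>>>> 
>>>>>>>>>>> -----
>>>>>>>>>>> ❯❯❯ mpirun -np 4 -mca btl sm,self intercomm_create
>>>>>>>>>>> [aborts: comm_spawn peers unreachable]
>>>>>>>>>>> -----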
>>>>>>>>>>> 
>>>>>>>>>>>> 2. Don't use the test case attached to my email; I left in an 
>>>>>>>>>>>> xterm-based spawn and the debugging code, so it can't work without 
>>>>>>>>>>>> xterm support. Instead, try the test case from the trunk, the one 
>>>>>>>>>>>> committed by Ralph.
>>>>>>>>>>> 
>>>>>>>>>>> I didn't see any "xterm" strings in there, but ok.  :-)  I ran with 
>>>>>>>>>>> orte/test/mpi/intercomm_create.c, and that hangs for me as well:
>>>>>>>>>>> 
>>>>>>>>>>> -----
>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>>> &inter) [rank 4]
>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>>> &inter) [rank 5]
>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>>> &inter) [rank 6]
>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>>> &inter) [rank 7]
>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) 
>>>>>>>>>>> [rank 4]
>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) 
>>>>>>>>>>> [rank 5]
>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) 
>>>>>>>>>>> [rank 6]
>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, &inter) 
>>>>>>>>>>> [rank 7]
>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) (0)
>>>>>>>>>>> [hang]
>>>>>>>>>>> -----
>>>>>>>>>>> 
>>>>>>>>>>> Similarly, on my Mac, it hangs with no output:
>>>>>>>>>>> 
>>>>>>>>>>> -----
>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>> [hang]
>>>>>>>>>>> -----
>>>>>>>>>>> 
>>>>>>>>>>>> George.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Sep 18, 2013, at 07:53 , "Jeff Squyres (jsquyres)" 
>>>>>>>>>>>> <jsquy...@cisco.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> George --
>>>>>>>>>>>>> 
>>>>>>>>>>>>> When I build the SVN trunk (r29201) on 64 bit linux, your 
>>>>>>>>>>>>> attached test case hangs:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -----
>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create
>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>>>>> &inter) [rank 4]
>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>>>>> &inter) [rank 5]
>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>>>>> &inter) [rank 6]
>>>>>>>>>>>>> b: MPI_Intercomm_create( intra, 0, intra, MPI_COMM_NULL, 201, 
>>>>>>>>>>>>> &inter) [rank 7]
>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) 
>>>>>>>>>>>>> (0)
>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) 
>>>>>>>>>>>>> (0)
>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) 
>>>>>>>>>>>>> (0)
>>>>>>>>>>>>> a: MPI_Intercomm_create( ab_intra, 0, ac_intra, 0, 201, &inter) 
>>>>>>>>>>>>> (0)
>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>> &inter) [rank 4]
>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>> &inter) [rank 5]
>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>> &inter) [rank 6]
>>>>>>>>>>>>> c: MPI_Intercomm_create( MPI_COMM_WORLD, 0, intra, 0, 201, 
>>>>>>>>>>>>> &inter) [rank 7]
>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>> -----
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On my Mac, it hangs without printing anything:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -----
>>>>>>>>>>>>> ❯❯❯ mpicc intercomm_create.c -o intercomm_create
>>>>>>>>>>>>> ❯❯❯ mpirun -np 4 intercomm_create   
>>>>>>>>>>>>> [hang]
>>>>>>>>>>>>> -----
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Sep 18, 2013, at 1:48 AM, George Bosilca <bosi...@icl.utk.edu> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Here is a quick (and definitely not the cleanest) patch that 
>>>>>>>>>>>>>> addresses the MPI_Intercomm issue at the MPI level. It should be 
>>>>>>>>>>>>>> applied after removing r29166.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I also added the corrected test case, which stresses the corner 
>>>>>>>>>>>>>> cases by doing barriers at every inter-comm creation and doing a 
>>>>>>>>>>>>>> clean disconnect.
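>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The pattern the test stresses is roughly the following (a sketch, 
>>>>>>>>>>>>>> not the committed code; local_comm and peer_comm stand for the 
>>>>>>>>>>>>>> intra-communicators built earlier in the test):
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>    MPI_Comm inter;
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>    MPI_Intercomm_create(local_comm, 0, peer_comm, 0, 201, &inter);
>>>>>>>>>>>>>>    MPI_Barrier(inter);           /* barrier at every inter-comm creation */
>>>>>>>>>>>>>>    /* ... exchange data over the inter-communicator ... */
>>>>>>>>>>>>>>    MPI_Comm_disconnect(&inter);  /* clean disconnect, not just a free */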
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
