I can't get it to fail, even with hostfile arguments. I'll try again in the 
morning.
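
For example, an invocation of the form "mpirun -np 2 --bynode --hostfile 
./myhosts ./hello_nodename" (the hostfile name here is only a placeholder) 
completes without a hang for me.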

On Nov 17, 2011, at 8:49 PM, George Bosilca wrote:

> Maybe the issue is caused by how the hostfile is specified. I used 
> orte_default_hostfile= in my mca-params.conf.
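> (i.e., a line of the form "orte_default_hostfile = /path/to/hostfile" in 
> $HOME/.openmpi/mca-params.conf - the path here is only illustrative.)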
> 
>  george.
> 
> On Nov 17, 2011, at 19:17 , Ralph Castain wrote:
> 
>> I'm still building on odin, but will check there again to see if I can 
>> replicate - perhaps something didn't get committed cleanly.
>> 
>>> 
>>> george.
>>> 
>>> On Nov 17, 2011, at 19:06 , Ralph Castain wrote:
>>> 
>>>> Hmmm...well, things seem to work just fine for me:
>>>> 
>>>> [rhc@odin ~/ompi-hwloc]$ mpirun -np 2 -bynode -mca plm rsh hostname
>>>> odin090.cs.indiana.edu
>>>> odin091.cs.indiana.edu
>>>> 
>>>> [rhc@odin mpi]$ mpirun -np 2 -bynode -mca plm rsh ./hello_nodename
>>>> Hello, World, I am 1 of 2 on host odin091.cs.indiana.edu from app number 0 
>>>> universe size 8
>>>> Hello, World, I am 0 of 2 on host odin090.cs.indiana.edu from app number 0 
>>>> universe size 8
>>>> 
>>>> 
>>>> I'll get a fresh checkout and see if I can replicate from that...
>>>> 
>>>> On Nov 17, 2011, at 7:42 PM, George Bosilca wrote:
>>>> 
>>>>> I guess I reached one of those corner cases that didn't get tested. I 
>>>>> can't start any apps (not even hostname) after this commit using the rsh 
>>>>> PLM (as soon as I add a hostfile). mpirun is blocked in an infinite loop 
>>>>> (after it has spawned the daemons) in orte_rmaps_base_compute_vpids. 
>>>>> Attaching with gdb indicates that cnt is never incremented, so mpirun is 
>>>>> stuck forever in the while loop at line 397.
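>>>>> 
>>>>> (For reference, an attach along these lines shows it - the pid and the 
>>>>> frame number are placeholders:
>>>>>   gdb -p <pid of mpirun>
>>>>>   (gdb) bt
>>>>>   (gdb) frame <N>    # select the orte_rmaps_base_compute_vpids frame
>>>>>   (gdb) print cnt    # value never changes between interrupts
>>>>> )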
>>>>> 
>>>>> I used "mpirun -np 2 --bynode ./tp_lb_ub_ng" to start my application, and 
>>>>> I have a machine file containing two nodes:
>>>>> 
>>>>> node01 slots=8
>>>>> node02 slots=8
>>>>> 
>>>>> In addition, CTRL+C seems to be broken …
>>>>> 
>>>>> george.
>>>>> 
>>>>> Begin forwarded message:
>>>>> 
>>>>>> Author: rhc
>>>>>> Date: 2011-11-14 22:40:11 EST (Mon, 14 Nov 2011)
>>>>>> New Revision: 25476
>>>>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/25476
>>>>>> 
>>>>>> Log:
>>>>>> At long last, the fabled revision to the affinity system has arrived. A 
>>>>>> more detailed explanation of how this all works will be presented here:
>>>>>> 
>>>>>> https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement
>>>>>> 
>>>>>> The wiki page is incomplete at the moment, but I hope to complete it 
>>>>>> over the next few days. I will provide updates on the devel list. As the 
>>>>>> wiki page states, the default and most commonly used options remain 
>>>>>> unchanged (except as noted below). New, esoteric and complex options 
>>>>>> have been added, but unless you are a true masochist, you are unlikely 
>>>>>> to use many of them beyond perhaps an initial curiosity-motivated 
>>>>>> experimentation.
>>>>>> 
>>>>>> In a nutshell, this commit revamps the map/rank/bind procedure to take 
>>>>>> into account topology info on the compute nodes. I have, for the most 
>>>>>> part, preserved the default behaviors, with three notable exceptions:
>>>>>> 
>>>>>> 1. I have at long last bowed my head in submission to the system admins 
>>>>>> of managed clusters. For years, they have complained about our default 
>>>>>> of allowing users to oversubscribe nodes - i.e., to run more processes 
>>>>>> on a node than allocated slots. Accordingly, I have modified the default 
>>>>>> behavior: if you are running off of hostfile/dash-host allocated nodes, 
>>>>>> then the default is to allow oversubscription. If you are running off of 
>>>>>> RM-allocated nodes, then the default is to NOT allow oversubscription 
>>>>>> (see the illustration after item 3). Flags to override these behaviors 
>>>>>> are provided, so this only affects the default behavior.
>>>>>> 
>>>>>> 2. Both cpus/rank and stride have been removed. Removal of the latter 
>>>>>> was demanded by those who didn't understand its purpose - and I agreed, 
>>>>>> as the users who requested it are no longer using it. The former was 
>>>>>> removed temporarily, pending implementation.
>>>>>> 
>>>>>> 3. vm launch is now the sole method for starting OMPI. It was just too 
>>>>>> darned hard to maintain multiple launch procedures - maybe someday, 
>>>>>> provided someone can demonstrate a reason to do so.
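>>>>>> 
>>>>>> To illustrate the default described in item 1 (node name, slot count, 
>>>>>> hostfile name and binary are purely hypothetical): given a hostfile 
>>>>>> "myhosts" containing
>>>>>> 
>>>>>>   node01 slots=2
>>>>>> 
>>>>>> the command "mpirun -np 4 --hostfile myhosts ./a.out" will still start 
>>>>>> all four processes on node01 by default, because the node came from a 
>>>>>> hostfile; the same 4-process run against an RM allocation of only 2 
>>>>>> slots would now be refused by default.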
>>>>>> 
>>>>>> As Jeff stated, it is impossible to fully test a change of this size. I 
>>>>>> have tested it on Linux and Mac, covering all the default and simple 
>>>>>> options, singletons, and comm_spawn. That said, I'm sure others will 
>>>>>> find problems, so I'll be watching MTT results until this stabilizes.