I can't get it to fail, even with hostfile arguments. I'll try again in the morning.
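For reference, the two ways of supplying the hostfile that this thread is comparing are, presumably, an MCA-level default versus an explicit command-line argument; the path below is only a placeholder:

    # in $HOME/.openmpi/mca-params.conf (what George reports using)
    orte_default_hostfile = /path/to/hostfile

    # roughly equivalent explicit form on the mpirun command line
    mpirun -np 2 --bynode --hostfile /path/to/hostfile ./tp_lb_ub_ng
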
On Nov 17, 2011, at 8:49 PM, George Bosilca wrote:

> Maybe the issue is generated by how the hostfile is specified. I used
> orte_default_hostfile= in my mca-params.conf.
>
> george.
>
> On Nov 17, 2011, at 19:17, Ralph Castain wrote:
>
>> I'm still building on odin, but will check there again to see if I can
>> replicate - perhaps something didn't get committed cleanly.
>>
>>>
>>> george.
>>>
>>> On Nov 17, 2011, at 19:06, Ralph Castain wrote:
>>>
>>>> Hmmm...well, things seem to work just fine for me:
>>>>
>>>> [rhc@odin ~/ompi-hwloc]$ mpirun -np 2 -bynode -mca plm rsh hostname
>>>> odin090.cs.indiana.edu
>>>> odin091.cs.indiana.edu
>>>>
>>>> [rhc@odin mpi]$ mpirun -np 2 -bynode -mca plm rsh ./hello_nodename
>>>> Hello, World, I am 1 of 2 on host odin091.cs.indiana.edu from app number 0 universe size 8
>>>> Hello, World, I am 0 of 2 on host odin090.cs.indiana.edu from app number 0 universe size 8
>>>>
>>>> I'll get a fresh checkout and see if I can replicate from that...
>>>>
>>>> On Nov 17, 2011, at 7:42 PM, George Bosilca wrote:
>>>>
>>>>> I guess I reached one of those corner cases that didn't get tested. I
>>>>> can't start any apps (not even a hostname) after this commit using the
>>>>> rsh PLM (as soon as I add a hostfile). The mpirun is blocked in an
>>>>> infinite loop (after it spawned the daemons) in
>>>>> orte_rmaps_base_compute_vpids. Attaching with gdb indicates that cnt is
>>>>> never incremented, so mpirun is stuck forever in the while loop at line
>>>>> 397. (A simplified sketch of this loop pattern appears after the quoted
>>>>> thread.)
>>>>>
>>>>> I used "mpirun -np 2 --bynode ./tp_lb_ub_ng" to start my application,
>>>>> and I have a machine file containing two nodes:
>>>>>
>>>>> node01 slots=8
>>>>> node02 slots=8
>>>>>
>>>>> In addition, CTRL+C seems to be broken …
>>>>>
>>>>> george.
>>>>>
>>>>> Begin forwarded message:
>>>>>
>>>>>> Author: rhc
>>>>>> Date: 2011-11-14 22:40:11 EST (Mon, 14 Nov 2011)
>>>>>> New Revision: 25476
>>>>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/25476
>>>>>>
>>>>>> Log:
>>>>>> At long last, the fabled revision to the affinity system has arrived. A
>>>>>> more detailed explanation of how this all works will be presented here:
>>>>>>
>>>>>> https://svn.open-mpi.org/trac/ompi/wiki/ProcessPlacement
>>>>>>
>>>>>> The wiki page is incomplete at the moment, but I hope to complete it
>>>>>> over the next few days. I will provide updates on the devel list. As the
>>>>>> wiki page states, the default and most commonly used options remain
>>>>>> unchanged (except as noted below). New, esoteric, and complex options
>>>>>> have been added, but unless you are a true masochist, you are unlikely
>>>>>> to use many of them beyond perhaps an initial curiosity-motivated
>>>>>> experimentation.
>>>>>>
>>>>>> In a nutshell, this commit revamps the map/rank/bind procedure to take
>>>>>> into account topology info on the compute nodes. I have, for the most
>>>>>> part, preserved the default behaviors, with three notable exceptions:
>>>>>>
>>>>>> 1. I have at long last bowed my head in submission to the system admins
>>>>>> of managed clusters. For years, they have complained about our default
>>>>>> of allowing users to oversubscribe nodes - i.e., to run more processes
>>>>>> on a node than allocated slots. Accordingly, I have modified the default
>>>>>> behavior: if you are running off of hostfile/dash-host allocated nodes,
>>>>>> then the default is to allow oversubscription. If you are running off of
>>>>>> RM-allocated nodes, then the default is to NOT allow oversubscription.
>>>>>> Flags to override these behaviors are provided, so this only affects
>>>>>> the default behavior.
>>>>>>
>>>>>> 2. Both cpus/rank and stride have been removed. The latter was demanded
>>>>>> by those who didn't understand the purpose behind it - and I agreed, as
>>>>>> the users who requested it are no longer using it. The former was
>>>>>> removed temporarily pending implementation.
>>>>>>
>>>>>> 3. vm launch is now the sole method for starting OMPI. It was just too
>>>>>> darned hard to maintain multiple launch procedures - maybe someday,
>>>>>> provided someone can demonstrate a reason to do so.
>>>>>>
>>>>>> As Jeff stated, it is impossible to fully test a change of this size. I
>>>>>> have tested it on Linux and Mac, covering all the default and simple
>>>>>> options, singletons, and comm_spawn. That said, I'm sure others will
>>>>>> find problems, so I'll be watching MTT results until this stabilizes.
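
Below is a minimal, self-contained sketch of the hang George describes in orte_rmaps_base_compute_vpids. It is not the actual Open MPI source; the data layout, the names, and the bail-out guard are hypothetical, and the only point is to show how a while loop whose exit condition depends on a counter (cnt) spins forever once no proc ever matches the node being filled:

    #include <stdio.h>
    #include <string.h>

    typedef struct { const char *node; int vpid; } proc_t;

    /* Assign vpids round-robin across nodes ("bynode" style).  Returns 0 on
     * success, -1 if a full cycle over the nodes makes no progress -- without
     * that guard, the while loop below never exits once cnt stops advancing,
     * which is the hang reported above. */
    static int assign_vpids(proc_t *procs, int nprocs,
                            const char **nodes, int nnodes)
    {
        int cnt = 0, vpid = 0, node_idx = 0, idle_passes = 0;

        while (cnt < nprocs) {
            int progressed = 0;
            for (int i = 0; i < nprocs; i++) {
                /* bynode: hand one vpid to the first unassigned proc that is
                 * recorded as living on the current node, then move on */
                if (procs[i].vpid < 0 &&
                    strcmp(procs[i].node, nodes[node_idx]) == 0) {
                    procs[i].vpid = vpid++;
                    cnt++;
                    progressed = 1;
                    break;
                }
            }
            node_idx = (node_idx + 1) % nnodes;

            /* If the node names recorded on the procs never match the node
             * list (say, hostfile names vs. resolved daemon names), cnt is
             * never incremented; bail out instead of spinning forever. */
            idle_passes = progressed ? 0 : idle_passes + 1;
            if (idle_passes == nnodes) {
                return -1;
            }
        }
        return 0;
    }

    int main(void)
    {
        const char *nodes[] = { "node01", "node02" };
        /* deliberately mismatched names: nothing ever matches, so the guard
         * fires instead of the loop hanging */
        proc_t procs[] = { { "node01.local", -1 }, { "node02.local", -1 } };

        if (assign_vpids(procs, 2, nodes, 2) != 0) {
            fprintf(stderr, "vpid assignment made no progress\n");
        }
        return 0;
    }

The point is only the failure mode: the loop's exit condition depends on cnt, and cnt advances only when a proc's recorded node matches the node being filled, so any mismatch in the node bookkeeping (for instance, one introduced by how the hostfile was specified) leaves mpirun in the loop forever.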
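On item 1 of the quoted commit message: in the Open MPI releases that grew out of this work, the per-job override is spelled --oversubscribe / --nooversubscribe on the mpirun command line. Whether those exact spellings were already present in r25476 is an assumption on my part; the commit message only says that override flags exist. For example:

    # explicitly allow more processes on a node than it has slots
    mpirun -np 16 --hostfile /path/to/hostfile --oversubscribe ./tp_lb_ub_ng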