I don't think your stated analysis is quite correct. First, the topology is -always- retrieved. The only question is whether or not we set the #slots equal to the number of detected cpus.

If the user specifies the #slots, then we respect that designation. If the user does not specify #slots, then we default to the number of detected cores (or HTs, if --use-hwthread-cpus was given).

--map-by socket means we will fill the first node, going round-robin across the sockets on that node. Once the first node is full, we will then move to the next node. If you want to map across -all- sockets in the round-robin, then you have to specify "--map-by socket:span". This tells us to treat all the sockets across all nodes as a single collection for mapping purposes.
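For example, with your two hosts and 4 slots forced on each (assuming a typical 2-socket box - I haven't run these on your machines, so take the placements as what the above rules should produce rather than verified output):

mpirun --host n0:4,n1:4 -np 4 --map-by socket hostname
  => all 4 procs land on n0 (round-robin across n0's sockets until its slots are exhausted)

mpirun --host n0:4,n1:4 -np 4 --map-by socket:span hostname
  => procs spread across both nodes, e.g. ranks 0-1 on n0 and ranks 2-3 on n1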
So it looks to me like things are working as designed on master - we can debate whether the legacy behavior on 2.x is what we want or not.

I honestly forget what we decided to do about assigning #slots when given a simple "-host foo" and no hostfile behind it, but Jeff created a Google-docs spreadsheet and posted it for comment that described all the use-cases. We verified that we do indeed conform to that specification, so you might want to go back and check it. I forget where it is located - I don't see it on the wiki page, so perhaps Jeff or someone can put it there?

Ralph

> On May 17, 2016, at 10:28 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
>
> Folks,
>
> I ran some more tests and found this.
>
> With both master and v2.x:
>
> mpirun --host n0:16,n1:16 -np 4 --tag-output hostname | sort
> [1,0]<stdout>:n0
> [1,1]<stdout>:n0
> [1,2]<stdout>:n0
> [1,3]<stdout>:n0
>
> and the same output with the --map-by socket option.
>
> Now, without specifying the number of slots per host, and with the
> --oversubscribe option (mandatory for v2.x):
>
> v2.x:
>
> mpirun --host n0,n1 -np 4 --tag-output --oversubscribe hostname | sort
> [1,0]<stdout>:n0
> [1,1]<stdout>:n0
> [1,2]<stdout>:n1
> [1,3]<stdout>:n1
>
> master:
>
> mpirun --host n0,n1 -np 4 --tag-output --oversubscribe hostname | sort
> [1,0]<stdout>:n0
> [1,1]<stdout>:n0
> [1,2]<stdout>:n0
> [1,3]<stdout>:n0
>
> There is no change if the --map-by socket option is used.
>
> My observation is that the hardware topology is not retrieved when the number
> of slots is specified (both v2.x and master). The default policy is --map-by
> slot, *and* the --map-by socket option seems ignored. Should we abort instead
> of ignoring this option?
>
> When the number of slots is not specified (and --oversubscribe is used), it
> seems the hardware topology is retrieved on v2.x, but not on master. Instead,
> master only retrieves the number of slots and uses them.
>
> From an end user point of view, the default mapping policy is --map-by socket
> on v2.x, and --map-by slot on master. --map-by socket seems ignored on master.
>
> I re-read previous discussions, and I do not think this level of detail was
> ever discussed.
>
> FWIW, the --map-by node option is correctly interpreted on both master and v2.x:
>
> mpirun --host n0,n1 -np 4 --tag-output --oversubscribe --map-by node hostname | sort
> [1,0]<stdout>:n0
> [1,1]<stdout>:n1
> [1,2]<stdout>:n0
> [1,3]<stdout>:n1
>
> Also, I can get the mapping I wished for/expected with --map-by ppr:2:node.
>
> Bottom line:
>
> 1) Should we abort if the number of slots is explicitly specified and
> --map-by socket (or the like) is requested?
>
> 2) In master only, when the number of slots per host is not specified, should
> we retrieve the hardware topology instead of the number of slots?
> If not, should we abort if --map-by socket is specified?
>
> If there is a consensus and changes are desired, I am fine with trying to
> implement them.
>
> Cheers,
>
> Gilles
>
> On 5/17/2016 11:01 AM, Gilles Gouaillardet wrote:
>> Folks,
>>
>> Currently, the default mapping policy on master is different from v2.x.
>>
>> My preliminary question is: when will the master mapping policy land in the
>> release branch? v2.0.0? v2.x? v3.0.0?
>>
>> Here are some commands and their output (both n0 and n1 have 16 cores each,
>> and mpirun runs on n0).
>>
>> First, let's force 2 slots per node via the --host parameter, and play with
>> the mapping:
>>
>> [gilles@n0 ~]$ mpirun --tag-output --host n0:2,n1:2 -np 4 hostname | sort
>> [1,0]<stdout>:n0
>> [1,1]<stdout>:n0
>> [1,2]<stdout>:n1
>> [1,3]<stdout>:n1
>> [gilles@n0 ~]$ mpirun --tag-output --host n0:2,n1:2 -np 4 --map-by socket hostname | sort
>> [1,0]<stdout>:n0
>> [1,1]<stdout>:n0
>> [1,2]<stdout>:n1
>> [1,3]<stdout>:n1
>>
>> /* so far so good, the default mapping is --map-by socket, and the mapping
>> looks correct to me */
>>
>> [gilles@n0 ~]$ mpirun --tag-output --host n0:2,n1:2 -np 4 --map-by node hostname | sort
>> [1,0]<stdout>:n0
>> [1,1]<stdout>:n1
>> [1,2]<stdout>:n0
>> [1,3]<stdout>:n1
>>
>> /* this mapping looks correct to me too */
>>
>> Now let's force 4 slots per node:
>>
>> [gilles@n0 ~]$ mpirun --tag-output --host n0:4,n1:4 -np 4 --map-by node hostname | sort
>> [1,0]<stdout>:n0
>> [1,1]<stdout>:n1
>> [1,2]<stdout>:n0
>> [1,3]<stdout>:n1
>>
>> /* same output as previously, looks correct to me */
>>
>> [gilles@n0 ~]$ mpirun --tag-output --host n0:4,n1:4 -np 4 --map-by socket hostname | sort
>> [1,0]<stdout>:n0
>> [1,1]<stdout>:n0
>> [1,2]<stdout>:n0
>> [1,3]<stdout>:n0
>>
>> /* all tasks run on n0, even though I explicitly requested --map-by socket;
>> that looks wrong to me */
>>
>> [gilles@n0 ~]$ mpirun --tag-output --host n0:4,n1:4 -np 4 hostname | sort
>> [1,0]<stdout>:n0
>> [1,1]<stdout>:n0
>> [1,2]<stdout>:n0
>> [1,3]<stdout>:n0
>>
>> /* same output as previously, which makes sense to me since the default
>> mapping policy is --map-by socket, but all tasks run on n0, which still
>> looks wrong to me */
>>
>> If I do not force the number of slots, I get the same output (16 cores are
>> detected on each node) regardless of the --map-by socket option. It seems
>> --map-by core is used, regardless of what we pass on the command line.
>>
>> In the last cases, is running all tasks on one node the intended behavior?
>> If yes, which mapping option can be used to run the first 2 tasks on the
>> first node, and the last 2 tasks on the second node?
>>
>> Cheers,
>>
>> Gilles
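(As an aside: the ppr mapping Gilles mentions above is probably the simplest way to get the "2 ranks per node, filling the first node first" layout he was after. A sketch on the same two hosts - the placement shown is what I'd expect from the ppr:2:node policy as he describes it, not output verified here:

mpirun --host n0,n1 -np 4 --tag-output --oversubscribe --map-by ppr:2:node hostname | sort
  => ranks 0-1 on n0, ranks 2-3 on n1)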