I don’t think your stated analysis is quite correct. First, the topology is 
-always- retrieved. The only question is whether or not we set the #slots equal 
to the number of detected cpus.

If the user specifies the #slots, then we respect that designation. If the user 
does not specify #slots, then we default to the number of detected cores (or 
HTs, if --use-hwthread-cpus was given).
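
To illustrate the two cases (hostfile contents and executable name here are 
hypothetical - this is just a sketch of the intent):

$ cat myhosts
n0 slots=4
n1 slots=4
$ mpirun --hostfile myhosts -np 8 ./a.out
(the 4 slots per node are respected as given)

$ cat myhosts2
n0
n1
$ mpirun --hostfile myhosts2 -np 8 ./a.out
(no slot counts given, so each node defaults to #slots = number of detected 
cores, or HTs if --use-hwthread-cpus was also passed)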

map-by socket means we will fill the first node, going round-robin across the 
sockets on that node. Once the first node is full, we will then move to the 
next node.
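
To make that concrete with Gilles' two 16-core nodes, something like

mpirun --host n0:16,n1:16 -np 4 --map-by socket --report-bindings hostname

should put ranks 0-3 all on n0, round-robined across n0's sockets, and n1 would 
not be touched until all 16 slots on n0 were consumed. (--report-bindings is 
just a convenient way to see where things landed - I have not re-run this exact 
command.)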

If you want to map across -all- sockets in the round-robin, then you have to 
specify “--map-by socket:span”. This tells us to treat all the sockets across 
all nodes as a single collection for mapping purposes.
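
In Gilles' 4-slots-per-node example, that would be something along the lines of

mpirun --tag-output --host n0:4,n1:4 -np 4 --map-by socket:span hostname | sort

which should land two procs on n0 and two on n1, since the sockets of both 
nodes form a single pool. (I'm writing this from memory, so please double-check 
the actual output.)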

So it looks to me like things are working as designed on master - we can debate 
if the legacy behavior on 2.x is what we want or not.

I honestly forget what we decided to do about assigning #slots when given a 
simple “-host foo” with no hostfile behind it, but Jeff created a Google-docs 
spreadsheet describing all the use-cases and posted it for comment. We 
verified that we do indeed conform to that specification, so you might want to 
go back and check it. I forget where it is located - I don’t see it on the wiki 
page, so perhaps Jeff or someone can put it there?

Ralph

> On May 17, 2016, at 10:28 PM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:
> 
> Folks,
> 
> 
> I ran some more tests and found this:
> 
> With both master and v2.x:
> 
> mpirun --host n0:16,n1:16 -np 4 --tag-output hostname | sort
> [1,0]<stdout>:n0
> [1,1]<stdout>:n0
> [1,2]<stdout>:n0
> [1,3]<stdout>:n0
> 
> and the same output with the --map-by socket option.
> 
> 
> Now, without specifying the number of slots per host, and with the 
> --oversubscribe option (mandatory for v2.x):
> 
> v2.x:
> 
> mpirun --host n0,n1 -np 4 --tag-output --oversubscribe hostname | sort
> [1,0]<stdout>:n0
> [1,1]<stdout>:n0
> [1,2]<stdout>:n1
> [1,3]<stdout>:n1
> 
> 
> master:
> 
> mpirun --host n0,n1 -np 4 --tag-output --oversubscribe hostname | sort
> [1,0]<stdout>:n0
> [1,1]<stdout>:n0
> [1,2]<stdout>:n0
> [1,3]<stdout>:n0
> 
> 
> No change if the --map-by socket option is used.
> 
> 
> My observation is that the hardware topology is not retrieved when the number 
> of slots is specified (both v2.x and master). The default policy is --map-by 
> slot, *and* the --map-by socket option seems to be ignored. Should we abort 
> instead of silently ignoring this option?
> 
> When the number of slots is not specified (and --oversubscribe is used), it 
> seems the hardware topology is retrieved on v2.x, but not on master. Instead, 
> master only retrieves the number of slots and uses them.
> 
> From an end user's point of view, the default mapping policy is --map-by socket 
> on v2.x and --map-by slot on master, and --map-by socket seems to be ignored on master.
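> 
> (One way to double-check which mapping was actually applied is to add the 
> --display-map option, e.g.
> 
> mpirun --display-map --host n0,n1 -np 4 --oversubscribe hostname
> 
> I did not systematically re-run all the cases above with it, so take this 
> only as a suggestion for verification.)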
> 
> 
> I re-read the previous discussions, and I do not think this level of detail was 
> ever discussed.
> 
> 
> FWIW, the --map-by node option is correctly interpreted on both master and v2.x:
> 
> mpirun --host n0,n1 -np 4 --tag-output --oversubscribe --map-by node hostname 
> | sort
> [1,0]<stdout>:n0
> [1,1]<stdout>:n1
> [1,2]<stdout>:n0
> [1,3]<stdout>:n1
> 
> Also, I can get the mapping I wanted/expected with --map-by ppr:2:node.
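> 
> For reference, the full command was along these lines (typed from memory, so 
> the exact options may differ slightly):
> 
> mpirun --host n0,n1 -np 4 --tag-output --oversubscribe --map-by ppr:2:node hostname | sort
> 
> which puts 2 tasks on n0 and 2 tasks on n1.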
> 
> 
> Bottom line:
> 
> 1) Should we abort if the number of slots is explicitly specified and a 
> --map-by socket (or similar) option is requested?
> 
> 2) On master only, when the number of slots per host is not specified, should 
> we retrieve the hardware topology instead of the number of slots? If not, 
> should we abort if --map-by socket is specified?
> 
> 
> If there is a consensus and changes are desired, I am fine with trying to 
> implement them.
> 
> 
> Cheers,
> 
> 
> Gilles
> 
> 
> On 5/17/2016 11:01 AM, Gilles Gouaillardet wrote:
>> Folks,
>> 
>> Currently, the default mapping policy on master is different from the one on v2.x.
>> 
>> My preliminary question is: when will the master mapping policy land in 
>> the release branch?
>> 
>> v2.0.0? v2.x? v3.0.0?
>> 
>> 
>> Here are some commands and their output (n0 and n1 each have 16 cores; 
>> mpirun runs on n0).
>> 
>> First, let's force 2 slots per node via the --host parameter and play with 
>> mapping:
>> 
>> [gilles@n0 ~]$ mpirun --tag-output --host n0:2,n1:2 -np 4 hostname  | sort
>> [1,0]<stdout>:n0
>> [1,1]<stdout>:n0
>> [1,2]<stdout>:n1
>> [1,3]<stdout>:n1
>> [gilles@n0 ~]$ mpirun --tag-output --host n0:2,n1:2 -np 4 --map-by socket 
>> hostname  | sort
>> [1,0]<stdout>:n0
>> [1,1]<stdout>:n0
>> [1,2]<stdout>:n1
>> [1,3]<stdout>:n1
>> 
>> /* so far so good, the default mapping is --map-by socket, and the mapping 
>> looks correct to me */
>> 
>> [gilles@n0 ~]$ mpirun --tag-output --host n0:2,n1:2 -np 4 --map-by node 
>> hostname  | sort
>> [1,0]<stdout>:n0
>> [1,1]<stdout>:n1
>> [1,2]<stdout>:n0
>> [1,3]<stdout>:n1
>> 
>> /* mapping looks correct to me too */
>> 
>> Now let's force 4 slots per node:
>> 
>> [gilles@n0 ~]$ mpirun --tag-output --host n0:4,n1:4 -np 4 --map-by node 
>> hostname  | sort
>> [1,0]<stdout>:n0
>> [1,1]<stdout>:n1
>> [1,2]<stdout>:n0
>> [1,3]<stdout>:n1
>> 
>> /* same output as previously, looks correct to me */
>> 
>> [gilles@n0 ~]$ mpirun --tag-output --host n0:4,n1:4 -np 4 --map-by socket 
>> hostname  | sort
>> [1,0]<stdout>:n0
>> [1,1]<stdout>:n0
>> [1,2]<stdout>:n0
>> [1,3]<stdout>:n0
>> 
>> /* all tasks run on n0, even though I explicitly requested --map-by socket; 
>> that looks wrong to me */
>> 
>> [gilles@n0 ~]$ mpirun --tag-output --host n0:4,n1:4 -np 4 hostname  | sort
>> [1,0]<stdout>:n0
>> [1,1]<stdout>:n0
>> [1,2]<stdout>:n0
>> [1,3]<stdout>:n0
>> 
>> /* same output as previously, which makes sense to me since the default 
>> mapping policy is --map-by socket, but all tasks run on n0, which still 
>> looks wrong to me */
>> 
>> 
>> If I do not force the number of slots, I get the same output (16 cores are 
>> detected on each node) regardless of the --map-by socket option.
>> 
>> It seems --map-by core is used, regardless of what we pass on the command line.
>> 
>> In the last cases, is running all tasks on one node the intended behavior?
>> 
>> If yes, which mapping option can be used to run the first 2 tasks on the 
>> first node and the last 2 tasks on the second node?
>> 
>> 
>> Cheers,
>> 
>> 
>> Gilles
>> 
> 
