Ralph's the authority on this one, but just to be sure: do all nodes have the
same topology? E.g., does adding "--hetero-nodes" to the mpirun command line
fix the problem?


> On Apr 20, 2015, at 9:29 AM, Elena Elkina <elena.elk...@itseez.com> wrote:
> 
> Hi guys,
> 
> I ran into an issue on our cluster related to the mapping & binding policies
> in 1.8.5.
> 
> The problem is that the --report-bindings output doesn't correspond to the
> locale. It looks like the mistake is in the output itself, because it just
> prints a serial core number even though that core can be on another socket.
> For example,
> 
> mpirun -np 2 --display-devel-map --report-bindings --map-by socket hostname
>  Data for JOB [43064,1] offset 0
> 
>  Mapper requested: NULL  Last mapper: round_robin  Mapping policy: BYSOCKET  Ranking policy: SOCKET
>  Binding policy: CORE  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
>       Num new daemons: 0      New daemon starting vpid INVALID
>       Num nodes: 1
> 
>  Data for node: clx-orion-001         Launch id: -1   State: 2
>       Daemon: [[43064,0],0]   Daemon launched: True
>       Num slots: 28   Slots in use: 2 Oversubscribed: FALSE
>       Num slots allocated: 28 Max slots: 0
>       Username on node: NULL
>       Num procs: 2    Next node_rank: 2
>       Data for proc: [[43064,1],0]
>               Pid: 0  Local rank: 0   Node rank: 0    App rank: 0
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 0-6,14-20       Bind location: 0        Binding: 0
>       Data for proc: [[43064,1],1]
>               Pid: 0  Local rank: 1   Node rank: 1    App rank: 1
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 7-13,21-27      Bind location: 7        Binding: 7
> [clx-orion-001:26951] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././.][./././././././././././././.]
> [clx-orion-001:26951] MCW rank 1 bound to socket 1[core 14[hwt 0]]: [./././././././././././././.][B/././././././././././././.]
> 
> The second process should be bound to core 7 (not core 14).
> 
> 
> Another example:
> mpirun -np 8 --display-devel-map --report-bindings --map-by core hostname
>  Data for JOB [43202,1] offset 0
> 
>  Mapper requested: NULL  Last mapper: round_robin  Mapping policy: BYCORE  Ranking policy: CORE
>  Binding policy: CORE  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
>       Num new daemons: 0      New daemon starting vpid INVALID
>       Num nodes: 1
> 
>  Data for node: clx-orion-001         Launch id: -1   State: 2
>       Daemon: [[43202,0],0]   Daemon launched: True
>       Num slots: 28   Slots in use: 8 Oversubscribed: FALSE
>       Num slots allocated: 28 Max slots: 0
>       Username on node: NULL
>       Num procs: 8    Next node_rank: 8
>       Data for proc: [[43202,1],0]
>               Pid: 0  Local rank: 0   Node rank: 0    App rank: 0
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 0       Bind location: 0        Binding: 0
>       Data for proc: [[43202,1],1]
>               Pid: 0  Local rank: 1   Node rank: 1    App rank: 1
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 1       Bind location: 1        Binding: 1
>       Data for proc: [[43202,1],2]
>               Pid: 0  Local rank: 2   Node rank: 2    App rank: 2
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 2       Bind location: 2        Binding: 2
>       Data for proc: [[43202,1],3]
>               Pid: 0  Local rank: 3   Node rank: 3    App rank: 3
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 3       Bind location: 3        Binding: 3
>       Data for proc: [[43202,1],4]
>               Pid: 0  Local rank: 4   Node rank: 4    App rank: 4
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 4       Bind location: 4        Binding: 4
>       Data for proc: [[43202,1],5]
>               Pid: 0  Local rank: 5   Node rank: 5    App rank: 5
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 5       Bind location: 5        Binding: 5
>       Data for proc: [[43202,1],6]
>               Pid: 0  Local rank: 6   Node rank: 6    App rank: 6
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 6       Bind location: 6        Binding: 6
>       Data for proc: [[43202,1],7]
>               Pid: 0  Local rank: 7   Node rank: 7    App rank: 7
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 14      Bind location: 14       Binding: 14
> [clx-orion-001:27069] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././.][./././././././././././././.]
> [clx-orion-001:27069] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././.][./././././././././././././.]
> [clx-orion-001:27069] MCW rank 2 bound to socket 0[core 2[hwt 0]]: [././B/././././././././././.][./././././././././././././.]
> [clx-orion-001:27069] MCW rank 3 bound to socket 0[core 3[hwt 0]]: [./././B/./././././././././.][./././././././././././././.]
> [clx-orion-001:27069] MCW rank 4 bound to socket 0[core 4[hwt 0]]: [././././B/././././././././.][./././././././././././././.]
> [clx-orion-001:27069] MCW rank 5 bound to socket 0[core 5[hwt 0]]: [./././././B/./././././././.][./././././././././././././.]
> [clx-orion-001:27069] MCW rank 6 bound to socket 0[core 6[hwt 0]]: [././././././B/././././././.][./././././././././././././.]
> [clx-orion-001:27069] MCW rank 7 bound to socket 0[core 7[hwt 0]]: [./././././././B/./././././.][./././././././././././././.]
> 
> Rank 7 should be bound to core 14 instead of core 7, since core 7 is on
> another socket.
> 
> Best regards,
> Elena
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/04/17273.php
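
For anyone comparing the two outputs quoted above by hand, here is a minimal
Python sketch. It is not part of Open MPI, and the helper names
expand_cpu_list and decode_mask are made up for illustration. It expands a
Locale list such as "7-13,21-27" and reports where each "B" sits in a
--report-bindings line, both as a position within its socket and as a serial
position across the node. It assumes the mask format shown in this thread (one
[...] group per socket, cores separated by "/") and makes no claim about which
numbering is the intended one.

```python
import re


def expand_cpu_list(spec):
    """Expand a list such as '7-13,21-27' into a sorted list of ints."""
    cpus = []
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        # A single id such as '14' has no '-', so hi is empty: reuse lo.
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)


def decode_mask(line):
    """Return (socket, core_in_socket, serial_core) for each bound core.

    Accepts either a bare mask or a full --report-bindings line; only the
    text after the last ': ' is treated as the mask.
    """
    mask = line.rsplit(": ", 1)[-1]
    hits = []
    serial = 0  # position counted across all sockets, left to right
    for socket, body in enumerate(re.findall(r"\[([^\[\]]*)\]", mask)):
        for core, threads in enumerate(body.split("/")):
            if "B" in threads:
                hits.append((socket, core, serial))
            serial += 1
    return hits


# Rank 1 from the first example in the report (values copied from the thread).
rank1_locale = "7-13,21-27"
rank1_line = (
    "[clx-orion-001:26951] MCW rank 1 bound to socket 1[core 14[hwt 0]]: "
    "[./././././././././././././.][B/././././././././././././.]"
)

print("Locale ids:", expand_cpu_list(rank1_locale))
for socket, core, serial in decode_mask(rank1_line):
    print(f"socket {socket}, position {core} in socket, serial position {serial}")
```

On the rank 1 lines from the first example, the mask decodes to socket 1,
position 0 within the socket, serial position 14, while the Locale list starts
at 7. That is the mismatch described in the report.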


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
