Ralph's the authority on this one, but just to be sure: do all the nodes have the same topology? E.g., does adding "--hetero-nodes" to the mpirun command line fix the problem?
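As a quick sanity check on the numbers in the report quoted below, here is a rough sketch that expands a "Locale" string from the --display-devel-map output and tests whether a given index falls inside it. The helper names (`expand_cpu_list`, `in_locale`) are made up for illustration and are not part of Open MPI. It also assumes the devel map and --report-bindings number cores the same way, which nothing in this thread establishes yet.

```python
def expand_cpu_list(spec):
    """Expand a list such as "0-6,14-20" into a set of integers."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def in_locale(locale, index):
    """Return True if index falls inside the Locale string (hypothetical helper)."""
    return index in expand_cpu_list(locale)


# Values copied from the first example in the report (rank 1, --map-by socket).
rank1_locale = "7-13,21-27"
print(in_locale(rank1_locale, 7))   # "Bind location" from --display-devel-map -> True
print(in_locale(rank1_locale, 14))  # core number from --report-bindings -> False
```

If the two outputs turn out to use different numbering schemes, a direct comparison like this is not meaningful, so treat a mismatch as a prompt to look closer rather than as proof of a bug.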
> On Apr 20, 2015, at 9:29 AM, Elena Elkina <elena.elk...@itseez.com> wrote:
>
> Hi guys,
>
> I faced with an issue on our cluster related to mapping & binding policies on
> 1.8.5.
>
> The matter is that --report-bindings output doesn't correspond to the locale.
> It looks like there is a mistake on the output itself, because it just puts
> serial core number while that core can be on another socket. For example,
>
> mpirun -np 2 --display-devel-map --report-bindings --map-by socket hostname
> Data for JOB [43064,1] offset 0
>
> Mapper requested: NULL Last mapper: round_robin Mapping policy: BYSOCKET Ranking policy: SOCKET
> Binding policy: CORE Cpu set: NULL PPR: NULL Cpus-per-rank: 1
> Num new daemons: 0 New daemon starting vpid INVALID
> Num nodes: 1
>
> Data for node: clx-orion-001 Launch id: -1 State: 2
> Daemon: [[43064,0],0] Daemon launched: True
> Num slots: 28 Slots in use: 2 Oversubscribed: FALSE
> Num slots allocated: 28 Max slots: 0
> Username on node: NULL
> Num procs: 2 Next node_rank: 2
> Data for proc: [[43064,1],0]
> Pid: 0 Local rank: 0 Node rank: 0 App rank: 0
> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0-6,14-20 Bind location: 0 Binding: 0
> Data for proc: [[43064,1],1]
> Pid: 0 Local rank: 1 Node rank: 1 App rank: 1
> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 7-13,21-27 Bind location: 7 Binding: 7
> [clx-orion-001:26951] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././.][./././././././././././././.]
> [clx-orion-001:26951] MCW rank 1 bound to socket 1[core 14[hwt 0]]: [./././././././././././././.][B/././././././././././././.]
>
> The second process should be bound at core 7 (not core 14).
>
> Another example:
> mpirun -np 8 --display-devel-map --report-bindings --map-by core hostname
> Data for JOB [43202,1] offset 0
>
> Mapper requested: NULL Last mapper: round_robin Mapping policy: BYCORE Ranking policy: CORE
> Binding policy: CORE Cpu set: NULL PPR: NULL Cpus-per-rank: 1
> Num new daemons: 0 New daemon starting vpid INVALID
> Num nodes: 1
>
> Data for node: clx-orion-001 Launch id: -1 State: 2
> Daemon: [[43202,0],0] Daemon launched: True
> Num slots: 28 Slots in use: 8 Oversubscribed: FALSE
> Num slots allocated: 28 Max slots: 0
> Username on node: NULL
> Num procs: 8 Next node_rank: 8
> Data for proc: [[43202,1],0]
> Pid: 0 Local rank: 0 Node rank: 0 App rank: 0
> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 0 Bind location: 0 Binding: 0
> Data for proc: [[43202,1],1]
> Pid: 0 Local rank: 1 Node rank: 1 App rank: 1
> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 1 Bind location: 1 Binding: 1
> Data for proc: [[43202,1],2]
> Pid: 0 Local rank: 2 Node rank: 2 App rank: 2
> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 2 Bind location: 2 Binding: 2
> Data for proc: [[43202,1],3]
> Pid: 0 Local rank: 3 Node rank: 3 App rank: 3
> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 3 Bind location: 3 Binding: 3
> Data for proc: [[43202,1],4]
> Pid: 0 Local rank: 4 Node rank: 4 App rank: 4
> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 4 Bind location: 4 Binding: 4
> Data for proc: [[43202,1],5]
> Pid: 0 Local rank: 5 Node rank: 5 App rank: 5
> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 5 Bind location: 5 Binding: 5
> Data for proc: [[43202,1],6]
> Pid: 0 Local rank: 6 Node rank: 6 App rank: 6
> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 6 Bind location: 6 Binding: 6
> Data for proc: [[43202,1],7]
> Pid: 0 Local rank: 7 Node rank: 7 App rank: 7
> State: INITIALIZED Restarts: 0 App_context: 0 Locale: 14 Bind location: 14 Binding: 14
> [clx-orion-001:27069] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././.][./././././././././././././.]
> [clx-orion-001:27069] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././.][./././././././././././././.]
> [clx-orion-001:27069] MCW rank 2 bound to socket 0[core 2[hwt 0]]: [././B/././././././././././.][./././././././././././././.]
> [clx-orion-001:27069] MCW rank 3 bound to socket 0[core 3[hwt 0]]: [./././B/./././././././././.][./././././././././././././.]
> [clx-orion-001:27069] MCW rank 4 bound to socket 0[core 4[hwt 0]]: [././././B/././././././././.][./././././././././././././.]
> [clx-orion-001:27069] MCW rank 5 bound to socket 0[core 5[hwt 0]]: [./././././B/./././././././.][./././././././././././././.]
> [clx-orion-001:27069] MCW rank 6 bound to socket 0[core 6[hwt 0]]: [././././././B/././././././.][./././././././././././././.]
> [clx-orion-001:27069] MCW rank 7 bound to socket 0[core 7[hwt 0]]: [./././././././B/./././././.][./././././././././././././.]
>
> Rank 7 should be bound at core 14 instead of core 7 since core 7 is at
> another socket.
>
> Best regards,
> Elena
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2015/04/17273.php

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/