Hmmm… well, it seems to be working fine for me in 1.8.4rc1 (I only have 12
cores on my humble machine). I can’t test any interaction with LSF, though
that shouldn’t be an issue:
$ mpirun -host bend001 -rf ./rankfile --report-bindings --display-devel-map hostname
Data for JOB [60677,1] offset 0
Mapper requested: NULL Last mapper: rank_file Mapping policy: BYUSER
Ranking policy: SLOT
Binding policy: CPUSET Cpu set: NULL PPR: NULL Cpus-per-rank: 1
Num new daemons: 0 New daemon starting vpid INVALID
Num nodes: 1
Data for node: bend001 Launch id: -1 State: 2
Daemon: [[60677,0],0] Daemon launched: True
Num slots: 12 Slots in use: 12 Oversubscribed: FALSE
Num slots allocated: 12 Max slots: 0
Username on node: NULL
Num procs: 12 Next node_rank: 12
Data for proc: [[60677,1],0]
Pid: 0 Local rank: 0 Node rank: 0 App rank: 0
State: INITIALIZED Restarts: 0 App_context: 0 Locale:
UNKNOWN Bind location: (null) Binding: 0,12
Data for proc: [[60677,1],1]
Pid: 0 Local rank: 1 Node rank: 1 App rank: 1
State: INITIALIZED Restarts: 0 App_context: 0 Locale:
UNKNOWN Bind location: (null) Binding: 8,20
Data for proc: [[60677,1],2]
Pid: 0 Local rank: 2 Node rank: 2 App rank: 2
State: INITIALIZED Restarts: 0 App_context: 0 Locale:
UNKNOWN Bind location: (null) Binding: 5,17
Data for proc: [[60677,1],3]
Pid: 0 Local rank: 3 Node rank: 3 App rank: 3
State: INITIALIZED Restarts: 0 App_context: 0 Locale:
UNKNOWN Bind location: (null) Binding: 9,21
Data for proc: [[60677,1],4]
Pid: 0 Local rank: 4 Node rank: 4 App rank: 4
State: INITIALIZED Restarts: 0 App_context: 0 Locale:
UNKNOWN Bind location: (null) Binding: 11,23
Data for proc: [[60677,1],5]
Pid: 0 Local rank: 5 Node rank: 5 App rank: 5
State: INITIALIZED Restarts: 0 App_context: 0 Locale:
UNKNOWN Bind location: (null) Binding: 7,19
Data for proc: [[60677,1],6]
Pid: 0 Local rank: 6 Node rank: 6 App rank: 6
State: INITIALIZED Restarts: 0 App_context: 0 Locale:
UNKNOWN Bind location: (null) Binding: 3,15
Data for proc: [[60677,1],7]
Pid: 0 Local rank: 7 Node rank: 7 App rank: 7
State: INITIALIZED Restarts: 0 App_context: 0 Locale:
UNKNOWN Bind location: (null) Binding: 6,18
Data for proc: [[60677,1],8]
Pid: 0 Local rank: 8 Node rank: 8 App rank: 8
State: INITIALIZED Restarts: 0 App_context: 0 Locale:
UNKNOWN Bind location: (null) Binding: 2,14
Data for proc: [[60677,1],9]
Pid: 0 Local rank: 9 Node rank: 9 App rank: 9
State: INITIALIZED Restarts: 0 App_context: 0 Locale:
UNKNOWN Bind location: (null) Binding: 4,16
Data for proc: [[60677,1],10]
Pid: 0 Local rank: 10 Node rank: 10 App rank: 10
State: INITIALIZED Restarts: 0 App_context: 0 Locale:
UNKNOWN Bind location: (null) Binding: 10,22
Data for proc: [[60677,1],11]
Pid: 0 Local rank: 11 Node rank: 11 App rank: 11
State: INITIALIZED Restarts: 0 App_context: 0 Locale:
UNKNOWN Bind location: (null) Binding: 1,13
[bend001:24667] MCW rank 1 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/..][../../../../../..]
[bend001:24667] MCW rank 2 bound to socket 1[core 8[hwt 0-1]]: [../../../../../..][../../BB/../../..]
[bend001:24667] MCW rank 3 bound to socket 1[core 10[hwt 0-1]]: [../../../../../..][../../../../BB/..]
[bend001:24667] MCW rank 4 bound to socket 1[core 11[hwt 0-1]]: [../../../../../..][../../../../../BB]
[bend001:24667] MCW rank 5 bound to socket 1[core 9[hwt 0-1]]: [../../../../../..][../../../BB/../..]
[bend001:24667] MCW rank 6 bound to socket 1[core 7[hwt 0-1]]: [../../../../../..][../BB/../../../..]
[bend001:24667] MCW rank 7 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../..][../../../../../..]
[bend001:24667] MCW rank 8 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../..][../../../../../..]
[bend001:24667] MCW rank 9 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../..][../../../../../..]
[bend001:24667] MCW rank 10 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB][../../../../../..]
[bend001:24667] MCW rank 11 bound to socket 1[core 6[hwt 0-1]]: [../../../../../..][BB/../../../../..]
[bend001:24667] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../..][../../../../../..]
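The two halves of that output can be lined up by rank with a few lines of Python. This is only a sketch: it assumes the "Binding:" field of the devel map lists OS cpu numbers and that the --report-bindings lines show hwloc's logical socket/core indices, which is a reading of the output rather than something the output states. The helper names (cpus_by_rank, core_by_rank, core_to_cpus) are made up for illustration, and only ranks 0, 1, 2 and 11 are included to keep it short.

```python
import re

# Excerpt of the devel map above (ranks 0, 1, 2 and 11 only).
DEVEL_MAP = """\
Pid: 0 Local rank: 0 Node rank: 0 App rank: 0
State: INITIALIZED Restarts: 0 App_context: 0 Locale:
UNKNOWN Bind location: (null) Binding: 0,12
Pid: 0 Local rank: 1 Node rank: 1 App rank: 1
State: INITIALIZED Restarts: 0 App_context: 0 Locale:
UNKNOWN Bind location: (null) Binding: 8,20
Pid: 0 Local rank: 2 Node rank: 2 App rank: 2
State: INITIALIZED Restarts: 0 App_context: 0 Locale:
UNKNOWN Bind location: (null) Binding: 5,17
Pid: 0 Local rank: 11 Node rank: 11 App rank: 11
State: INITIALIZED Restarts: 0 App_context: 0 Locale:
UNKNOWN Bind location: (null) Binding: 1,13
"""

# Matching --report-bindings lines for the same four ranks.
REPORT = """\
[bend001:24667] MCW rank 1 bound to socket 0[core 4[hwt 0-1]]:
[bend001:24667] MCW rank 2 bound to socket 1[core 8[hwt 0-1]]:
[bend001:24667] MCW rank 11 bound to socket 1[core 6[hwt 0-1]]:
[bend001:24667] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]:
"""

# re.S lets ".*?" span the line breaks between "App rank:" and "Binding:".
MAP_RE = re.compile(r"App rank:\s*(\d+).*?Binding:\s*([\d,]+)", re.S)
REPORT_RE = re.compile(r"MCW rank (\d+) bound to socket (\d+)\[core (\d+)\[")


def cpus_by_rank(text):
    """Map rank -> tuple of cpu numbers from the 'Binding:' field."""
    return {int(rank): tuple(int(c) for c in cpus.split(","))
            for rank, cpus in MAP_RE.findall(text)}


def core_by_rank(text):
    """Map rank -> (socket, core) from the --report-bindings lines."""
    return {int(rank): (int(sock), int(core))
            for rank, sock, core in REPORT_RE.findall(text)}


def core_to_cpus(devel_map, report):
    """Join both views on rank: (socket, core) -> cpu numbers."""
    cpus = cpus_by_rank(devel_map)
    cores = core_by_rank(report)
    return {cores[rank]: cpus[rank] for rank in sorted(cpus) if rank in cores}


table = core_to_cpus(DEVEL_MAP, REPORT)
for (sock, core), cpus in sorted(table.items()):
    print(f"socket {sock} core {core:2d} -> Binding {cpus}")
```

For these four ranks it pairs core 0 with 0,12, core 4 with 8,20, core 6 with 1,13 and core 8 with 5,17, so the core index and the "Binding:" numbers are clearly not the same numbering on this machine.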
Can you try with the latest nightly 1.8 tarball?
http://www.open-mpi.org/nightly/v1.8/
Note that it is also possible that hwloc isn’t correctly identifying the cores
here. Can you tell us something about the hardware? Do you have hardware
threads enabled?
I ask because the binding we report uses the cpu numbers as identified by
hwloc, which may not be the same as the numbering you expect from a hardware
vendor’s map. We are using logical processor assignments, not physical ones.
You can use the --report-bindings option to show the resulting map, as above.
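To check whether the numbers in the quoted message below are at least consistent with a renumbering (rather than mpirun ignoring the rankfile), the requested slots can be paired with the reported "Binding:" values. This is a rough sketch, not a diagnosis: the data is transcribed from the quoted rankfile and devel map, and guess() with its constants 12 and 4 is a hypothetical formula that merely happens to fit these 16 points. Nothing here says that is how the hardware or hwloc actually numbers the cores.

```python
import re

# Rankfile transcribed from the quoted message below.
RANKFILE = """\
rank 0=mach1 slot=0
rank 1=mach1 slot=4
rank 2=mach1 slot=8
rank 3=mach1 slot=12
rank 4=mach1 slot=16
rank 5=mach1 slot=20
rank 6=mach1 slot=24
rank 7=mach1 slot=28
rank 8=mach1 slot=32
rank 9=mach1 slot=36
rank 10=mach1 slot=40
rank 11=mach1 slot=44
rank 12=mach1 slot=1
rank 13=mach1 slot=5
rank 14=mach1 slot=9
rank 15=mach1 slot=13
"""

# "Binding:" field of the quoted devel map, indexed by rank.
REPORTED = [0, 16, 32, 1, 17, 33, 2, 18, 34, 3, 19, 35, 4, 20, 36, 5]

RANK_RE = re.compile(r"rank\s+(\d+)=(\S+)\s+slot=(\d+)")


def requested_slots(text):
    """Map rank -> slot number requested in the rankfile."""
    return {int(rank): int(slot)
            for rank, _host, slot in RANK_RE.findall(text)}


def guess(slot, cores_per_group=12, groups=4):
    """Hypothetical renumbering; an assumption fitted to this data only."""
    return groups * (slot % cores_per_group) + slot // cores_per_group


slots = requested_slots(RANKFILE)
pairs = [(slots[rank], REPORTED[rank]) for rank in sorted(slots)]

# A renumbering must never send two requested slots to the same cpu.
one_to_one = len({bound for _slot, bound in pairs}) == len(pairs)
fits_guess = all(guess(slot) == bound for slot, bound in pairs)

for rank, (slot, bound) in enumerate(pairs):
    print(f"rank {rank:2d}: slot={slot:2d} -> Binding {bound:2d}")
print("one-to-one:", one_to_one, "| fits guess():", fits_guess)
```

Every rank lands on a distinct cpu and all 16 pairs fit the guessed formula, which looks more like two different numbering schemes than a rankfile being ignored. The output of lstopo on that node would settle it.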
> On Nov 5, 2014, at 7:21 AM, [email protected] wrote:
>
> I am using openmpi v 1.8.3 and LSF 9.1.3.
>
> LSF creates a rankfile that looks like:
>
> RANK_FILE:
> ======================================================================
> rank 0=mach1 slot=0
> rank 1=mach1 slot=4
> rank 2=mach1 slot=8
> rank 3=mach1 slot=12
> rank 4=mach1 slot=16
> rank 5=mach1 slot=20
> rank 6=mach1 slot=24
> rank 7=mach1 slot=28
> rank 8=mach1 slot=32
> rank 9=mach1 slot=36
> rank 10=mach1 slot=40
> rank 11=mach1 slot=44
> rank 12=mach1 slot=1
> rank 13=mach1 slot=5
> rank 14=mach1 slot=9
> rank 15=mach1 slot=13
>
> which really are the cores I want to use, in order.
>
> I log on to this machine and type (all on one line):
>
> /apps/share/openmpi/1.8.3.I1217913/bin/mpirun \
> --mca orte_base_help_aggregate 0 \
> -v -display-devel-allocation \
> -display-devel-map \
> --rankfile RANK_FILE \
> --mca btl openib,tcp,sm,self \
> --x LD_LIBRARY_PATH \
> --np 16 \
> my_executable \
> -i model.i \
> -l model.o
>
> And I get the following on the screen:
>
> ====================== ALLOCATED NODES ======================
> mach1: slots=16 max_slots=0 slots_inuse=0 state=UP
> =================================================================
> Data for JOB [52387,1] offset 0
>
> Mapper requested: NULL Last mapper: rank_file Mapping policy: BYUSER
> Ranking policy: SLOT
> Binding policy: CPUSET Cpu set: NULL PPR: NULL Cpus-per-rank: 1
> Num new daemons: 0 New daemon starting vpid INVALID
> Num nodes: 1
>
> Data for node: mach1 Launch id: -1 State: 2
> Daemon: [[52387,0],0] Daemon launched: True
> Num slots: 16 Slots in use: 16 Oversubscribed: FALSE
> Num slots allocated: 16 Max slots: 0
> Username on node: NULL
> Num procs: 16 Next node_rank: 16
> Data for proc: [[52387,1],0]
> Pid: 0 Local rank: 0 Node rank: 0 App rank: 0
> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
> UNKNOWN Bind location: (null) Binding: 0
> Data for proc: [[52387,1],1]
> Pid: 0 Local rank: 1 Node rank: 1 App rank: 1
> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
> UNKNOWN Bind location: (null) Binding: 16
> Data for proc: [[52387,1],2]
> Pid: 0 Local rank: 2 Node rank: 2 App rank: 2
> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
> UNKNOWN Bind location: (null) Binding: 32
> Data for proc: [[52387,1],3]
> Pid: 0 Local rank: 3 Node rank: 3 App rank: 3
> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
> UNKNOWN Bind location: (null) Binding: 1
> Data for proc: [[52387,1],4]
> Pid: 0 Local rank: 4 Node rank: 4 App rank: 4
> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
> UNKNOWN Bind location: (null) Binding: 17
> Data for proc: [[52387,1],5]
> Pid: 0 Local rank: 5 Node rank: 5 App rank: 5
> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
> UNKNOWN Bind location: (null) Binding: 33
> Data for proc: [[52387,1],6]
> Pid: 0 Local rank: 6 Node rank: 6 App rank: 6
> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
> UNKNOWN Bind location: (null) Binding: 2
> Data for proc: [[52387,1],7]
> Pid: 0 Local rank: 7 Node rank: 7 App rank: 7
> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
> UNKNOWN Bind location: (null) Binding: 18
> Data for proc: [[52387,1],8]
> Pid: 0 Local rank: 8 Node rank: 8 App rank: 8
> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
> UNKNOWN Bind location: (null) Binding: 34
> Data for proc: [[52387,1],9]
> Pid: 0 Local rank: 9 Node rank: 9 App rank: 9
> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
> UNKNOWN Bind location: (null) Binding: 3
> Data for proc: [[52387,1],10]
> Pid: 0 Local rank: 10 Node rank: 10 App rank: 10
> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
> UNKNOWN Bind location: (null) Binding: 19
> Data for proc: [[52387,1],11]
> Pid: 0 Local rank: 11 Node rank: 11 App rank: 11
> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
> UNKNOWN Bind location: (null) Binding: 35
> Data for proc: [[52387,1],12]
> Pid: 0 Local rank: 12 Node rank: 12 App rank: 12
> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
> UNKNOWN Bind location: (null) Binding: 4
> Data for proc: [[52387,1],13]
> Pid: 0 Local rank: 13 Node rank: 13 App rank: 13
> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
> UNKNOWN Bind location: (null) Binding: 20
> Data for proc: [[52387,1],14]
> Pid: 0 Local rank: 14 Node rank: 14 App rank: 14
> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
> UNKNOWN Bind location: (null) Binding: 36
> Data for proc: [[52387,1],15]
> Pid: 0 Local rank: 15 Node rank: 15 App rank: 15
> State: INITIALIZED Restarts: 0 App_context: 0 Locale:
> UNKNOWN Bind location: (null) Binding: 5
>
> And a numa-map of the node shows:
>
> PID   COMMAND        CPUMASK  TOTAL   [ N0      N1      N2      N3      N4      N5      N6      N7     ]
> 31044 my_executable  0        443.3M  [ 443.3M  0       0       0       0       0       0       0      ]
> 31045 my_executable  16       459.7M  [ 459.7M  0       0       0       0       0       0       0      ]
> 31046 my_executable  32       435.0M  [ 0       435.0M  0       0       0       0       0       0      ]
> 31047 my_executable  1        468.8M  [ 0       0       468.8M  0       0       0       0       0      ]
> 31048 my_executable  17       493.2M  [ 0       0       493.2M  0       0       0       0       0      ]
> 31049 my_executable  33       498.0M  [ 0       0       0       498.0M  0       0       0       0      ]
> 31050 my_executable  2        501.2M  [ 0       0       0       0       501.2M  0       0       0      ]
> 31051 my_executable  18       502.4M  [ 0       0       0       0       502.4M  0       0       0      ]
> 31052 my_executable  34       500.5M  [ 0       0       0       0       0       500.5M  0       0      ]
> 31053 my_executable  3        515.6M  [ 0       0       0       0       0       0       515.6M  0      ]
> 31054 my_executable  19       508.1M  [ 0       0       0       0       0       0       508.1M  0      ]
> 31055 my_executable  35       503.9M  [ 0       0       0       0       0       0       0       503.9M ]
> 31056 my_executable  4        502.1M  [ 502.1M  0       0       0       0       0       0       0      ]
> 31057 my_executable  20       515.2M  [ 515.2M  0       0       0       0       0       0       0      ]
> 31058 my_executable  36       508.1M  [ 0       508.1M  0       0       0       0       0       0      ]
> 31059 my_executable  5        446.7M  [ 0       0       446.7M  0       0       0       0       0      ]
> --
>
> Why didn't mpirun honor the rankfile and put the processes on the correct
> cores in the proper order? It looks to me like mpirun doesn't like the
> rankfile...??
>
> Thanks for any help.
>
> Tom
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/11/16199.php