Hmmm…well, it seems to be working fine in 1.8.4rc1 (I only have 12 cores on my 
humble machine). However, I can’t test any interactions with LSF, though that 
shouldn’t be an issue:

$ mpirun -host bend001 -rf ./rankfile --report-bindings --display-devel-map hostname
 Data for JOB [60677,1] offset 0

 Mapper requested: NULL  Last mapper: rank_file  Mapping policy: BYUSER  
Ranking policy: SLOT
 Binding policy: CPUSET  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
        Num new daemons: 0      New daemon starting vpid INVALID
        Num nodes: 1

 Data for node: bend001         Launch id: -1   State: 2
        Daemon: [[60677,0],0]   Daemon launched: True
        Num slots: 12   Slots in use: 12        Oversubscribed: FALSE
        Num slots allocated: 12 Max slots: 0
        Username on node: NULL
        Num procs: 12   Next node_rank: 12
        Data for proc: [[60677,1],0]
                Pid: 0  Local rank: 0   Node rank: 0    App rank: 0
                State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
UNKNOWN Bind location: (null)   Binding: 0,12
        Data for proc: [[60677,1],1]
                Pid: 0  Local rank: 1   Node rank: 1    App rank: 1
                State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
UNKNOWN Bind location: (null)   Binding: 8,20
        Data for proc: [[60677,1],2]
                Pid: 0  Local rank: 2   Node rank: 2    App rank: 2
                State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
UNKNOWN Bind location: (null)   Binding: 5,17
        Data for proc: [[60677,1],3]
                Pid: 0  Local rank: 3   Node rank: 3    App rank: 3
                State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
UNKNOWN Bind location: (null)   Binding: 9,21
        Data for proc: [[60677,1],4]
                Pid: 0  Local rank: 4   Node rank: 4    App rank: 4
                State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
UNKNOWN Bind location: (null)   Binding: 11,23
        Data for proc: [[60677,1],5]
                Pid: 0  Local rank: 5   Node rank: 5    App rank: 5
                State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
UNKNOWN Bind location: (null)   Binding: 7,19
        Data for proc: [[60677,1],6]
                Pid: 0  Local rank: 6   Node rank: 6    App rank: 6
                State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
UNKNOWN Bind location: (null)   Binding: 3,15
        Data for proc: [[60677,1],7]
                Pid: 0  Local rank: 7   Node rank: 7    App rank: 7
                State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
UNKNOWN Bind location: (null)   Binding: 6,18
        Data for proc: [[60677,1],8]
                Pid: 0  Local rank: 8   Node rank: 8    App rank: 8
                State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
UNKNOWN Bind location: (null)   Binding: 2,14
        Data for proc: [[60677,1],9]
                Pid: 0  Local rank: 9   Node rank: 9    App rank: 9
                State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
UNKNOWN Bind location: (null)   Binding: 4,16
        Data for proc: [[60677,1],10]
                Pid: 0  Local rank: 10  Node rank: 10   App rank: 10
                State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
UNKNOWN Bind location: (null)   Binding: 10,22
        Data for proc: [[60677,1],11]
                Pid: 0  Local rank: 11  Node rank: 11   App rank: 11
                State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
UNKNOWN Bind location: (null)   Binding: 1,13
[bend001:24667] MCW rank 1 bound to socket 0[core 4[hwt 0-1]]: 
[../../../../BB/..][../../../../../..]
[bend001:24667] MCW rank 2 bound to socket 1[core 8[hwt 0-1]]: 
[../../../../../..][../../BB/../../..]
[bend001:24667] MCW rank 3 bound to socket 1[core 10[hwt 0-1]]: 
[../../../../../..][../../../../BB/..]
[bend001:24667] MCW rank 4 bound to socket 1[core 11[hwt 0-1]]: 
[../../../../../..][../../../../../BB]
[bend001:24667] MCW rank 5 bound to socket 1[core 9[hwt 0-1]]: 
[../../../../../..][../../../BB/../..]
[bend001:24667] MCW rank 6 bound to socket 1[core 7[hwt 0-1]]: 
[../../../../../..][../BB/../../../..]
[bend001:24667] MCW rank 7 bound to socket 0[core 3[hwt 0-1]]: 
[../../../BB/../..][../../../../../..]
[bend001:24667] MCW rank 8 bound to socket 0[core 1[hwt 0-1]]: 
[../BB/../../../..][../../../../../..]
[bend001:24667] MCW rank 9 bound to socket 0[core 2[hwt 0-1]]: 
[../../BB/../../..][../../../../../..]
[bend001:24667] MCW rank 10 bound to socket 0[core 5[hwt 0-1]]: 
[../../../../../BB][../../../../../..]
[bend001:24667] MCW rank 11 bound to socket 1[core 6[hwt 0-1]]: 
[../../../../../..][BB/../../../../..]
[bend001:24667] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: 
[BB/../../../../..][../../../../../..]
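
Since the report-bindings lines above come out in arbitrary rank order, here is a minimal Python sketch for sorting them by rank before comparing against a rankfile. The regular expression and the parse_bindings name are my own assumptions, fitted only to the single-core format shown above; other binding formats (for example ranks bound to several cores) would need a different pattern.

```python
import re

# Matches lines such as:
#   [bend001:24667] MCW rank 1 bound to socket 0[core 4[hwt 0-1]]:
# Assumption: one socket and one core per rank, as in the output above.
BINDING_RE = re.compile(
    r"MCW rank (\d+) bound to socket (\d+)\[core (\d+)\[hwt ([\d,-]+)\]\]"
)

def parse_bindings(text):
    """Return {rank: (socket, core, hwt)} parsed from --report-bindings text."""
    bindings = {}
    for match in BINDING_RE.finditer(text):
        rank, socket, core = (int(g) for g in match.groups()[:3])
        bindings[rank] = (socket, core, match.group(4))
    return bindings

# Three of the lines from the run above, copied verbatim.
sample = """\
[bend001:24667] MCW rank 1 bound to socket 0[core 4[hwt 0-1]]:
[../../../../BB/..][../../../../../..]
[bend001:24667] MCW rank 2 bound to socket 1[core 8[hwt 0-1]]:
[../../../../../..][../../BB/../../..]
[bend001:24667] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]:
[BB/../../../../..][../../../../../..]
"""

for rank, (socket, core, hwt) in sorted(parse_bindings(sample).items()):
    print(f"rank {rank}: socket {socket}, core {core}, hwt {hwt}")
```

The core numbers printed this way are the ones hwloc reports, so they are the ones to compare against the slot= values in the rankfile.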

Can you try with the latest nightly 1.8 tarball?

http://www.open-mpi.org/nightly/v1.8/

Note that it is also possible that hwloc isn’t correctly identifying the cores 
here. Can you tell us something about the hardware? Do you have hardware 
threads enabled?

I ask because the binding we report uses the cpu numbers as identified by
hwloc, which may not match what you are expecting from some hardware
vendor’s map. We are using logical processor assignments, not physical. You can
use the --report-bindings option to show the resulting map, as above.
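
As a rough check of the "two numbering schemes" idea, the Python sketch below takes the slot= values from the rankfile and the Binding: values from the devel map in the original post and tests whether one is a regular renumbering of the other. The formula in guessed_reported_index is purely hypothetical: I fitted it to the sixteen pairs in this thread, and it says nothing about which numbering is logical and which is physical, or about how the real hardware is laid out.

```python
# Rankfile slots and the "Binding:" values from the original post,
# listed in rank order (ranks 0..15).
slots    = [0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 1, 5, 9, 13]
bindings = [0, 16, 32, 1, 17, 33, 2, 18, 34, 3, 19, 35, 4, 20, 36, 5]

def guessed_reported_index(slot, stride=4, group=12):
    """Hypothetical renumbering fitted to the data above (an assumption,
    not taken from hwloc or from the hardware)."""
    return stride * (slot % group) + slot // group

# Collect every rank whose reported binding does not follow the guess.
mismatches = [
    (rank, slot, bound)
    for rank, (slot, bound) in enumerate(zip(slots, bindings))
    if guessed_reported_index(slot) != bound
]
print("mismatches:", mismatches)
```

On these sixteen pairs the list comes back empty, which is at least consistent with the rankfile having been honored under one numbering and reported under another. Confirming that would still need the actual topology of the machine.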



> On Nov 5, 2014, at 7:21 AM, twu...@goodyear.com wrote:
> 
> I am using openmpi v 1.8.3 and LSF 9.1.3.
> 
> LSF creates a rankfile that looks like:
> 
> RANK_FILE:
> ======================================================================
> rank 0=mach1 slot=0
> rank 1=mach1 slot=4
> rank 2=mach1 slot=8
> rank 3=mach1 slot=12
> rank 4=mach1 slot=16
> rank 5=mach1 slot=20
> rank 6=mach1 slot=24
> rank 7=mach1 slot=28
> rank 8=mach1 slot=32
> rank 9=mach1 slot=36
> rank 10=mach1 slot=40
> rank 11=mach1 slot=44
> rank 12=mach1 slot=1
> rank 13=mach1 slot=5
> rank 14=mach1 slot=9
> rank 15=mach1 slot=13
> 
> which really are the cores I want to use, in order. 
> 
> I log on to this machine and type (all on one line):
> 
> /apps/share/openmpi/1.8.3.I1217913/bin/mpirun \
>  --mca orte_base_help_aggregate 0 \
>  -v -display-devel-allocation \
>  -display-devel-map \
>  --rankfile RANK_FILE \
>  --mca btl openib,tcp,sm,self \
>  --x LD_LIBRARY_PATH \
>  --np 16 \
>  my_executable \
>  -i model.i \
>  -l model.o
> 
> And I get the following on the screen:
> 
> ======================   ALLOCATED NODES   ======================
>       mach1: slots=16 max_slots=0 slots_inuse=0 state=UP
> =================================================================
> Data for JOB [52387,1] offset 0
> 
> Mapper requested: NULL  Last mapper: rank_file  Mapping policy: BYUSER  
> Ranking policy: SLOT
> Binding policy: CPUSET  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
>       Num new daemons: 0      New daemon starting vpid INVALID
>       Num nodes: 1
> 
> Data for node: mach1          Launch id: -1   State: 2
>       Daemon: [[52387,0],0]   Daemon launched: True
>       Num slots: 16   Slots in use: 16        Oversubscribed: FALSE
>       Num slots allocated: 16 Max slots: 0
>       Username on node: NULL
>       Num procs: 16   Next node_rank: 16
>       Data for proc: [[52387,1],0]
>               Pid: 0  Local rank: 0   Node rank: 0    App rank: 0
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
> UNKNOWN Bind location: (null)   Binding: 0
>       Data for proc: [[52387,1],1]
>               Pid: 0  Local rank: 1   Node rank: 1    App rank: 1
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
> UNKNOWN Bind location: (null)   Binding: 16
>       Data for proc: [[52387,1],2]
>               Pid: 0  Local rank: 2   Node rank: 2    App rank: 2
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
> UNKNOWN Bind location: (null)   Binding: 32
>       Data for proc: [[52387,1],3]
>               Pid: 0  Local rank: 3   Node rank: 3    App rank: 3
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
> UNKNOWN Bind location: (null)   Binding: 1
>       Data for proc: [[52387,1],4]
>               Pid: 0  Local rank: 4   Node rank: 4    App rank: 4
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
> UNKNOWN Bind location: (null)   Binding: 17
>       Data for proc: [[52387,1],5]
>               Pid: 0  Local rank: 5   Node rank: 5    App rank: 5
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
> UNKNOWN Bind location: (null)   Binding: 33
>       Data for proc: [[52387,1],6]
>               Pid: 0  Local rank: 6   Node rank: 6    App rank: 6
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
> UNKNOWN Bind location: (null)   Binding: 2
>       Data for proc: [[52387,1],7]
>               Pid: 0  Local rank: 7   Node rank: 7    App rank: 7
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
> UNKNOWN Bind location: (null)   Binding: 18
>       Data for proc: [[52387,1],8]
>               Pid: 0  Local rank: 8   Node rank: 8    App rank: 8
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
> UNKNOWN Bind location: (null)   Binding: 34
>       Data for proc: [[52387,1],9]
>               Pid: 0  Local rank: 9   Node rank: 9    App rank: 9
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
> UNKNOWN Bind location: (null)   Binding: 3
>       Data for proc: [[52387,1],10]
>               Pid: 0  Local rank: 10  Node rank: 10   App rank: 10
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
> UNKNOWN Bind location: (null)   Binding: 19
>       Data for proc: [[52387,1],11]
>               Pid: 0  Local rank: 11  Node rank: 11   App rank: 11
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
> UNKNOWN Bind location: (null)   Binding: 35
>       Data for proc: [[52387,1],12]
>               Pid: 0  Local rank: 12  Node rank: 12   App rank: 12
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
> UNKNOWN Bind location: (null)   Binding: 4
>       Data for proc: [[52387,1],13]
>               Pid: 0  Local rank: 13  Node rank: 13   App rank: 13
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
> UNKNOWN Bind location: (null)   Binding: 20
>       Data for proc: [[52387,1],14]
>               Pid: 0  Local rank: 14  Node rank: 14   App rank: 14
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
> UNKNOWN Bind location: (null)   Binding: 36
>       Data for proc: [[52387,1],15]
>               Pid: 0  Local rank: 15  Node rank: 15   App rank: 15
>               State: INITIALIZED      Restarts: 0     App_context: 0  Locale: 
> UNKNOWN Bind location: (null)   Binding: 5
> 
> And a numa-map of the node shows:
> 
>   PID COMMAND       CPUMASK  TOTAL [     N0     N1     N2     N3     N4     N5     N6     N7 ]
> 31044 my_executable       0 443.3M [ 443.3M      0      0      0      0      0      0      0 ]
> 31045 my_executable      16 459.7M [ 459.7M      0      0      0      0      0      0      0 ]
> 31046 my_executable      32 435.0M [      0 435.0M      0      0      0      0      0      0 ]
> 31047 my_executable       1 468.8M [      0      0 468.8M      0      0      0      0      0 ]
> 31048 my_executable      17 493.2M [      0      0 493.2M      0      0      0      0      0 ]
> 31049 my_executable      33 498.0M [      0      0      0 498.0M      0      0      0      0 ]
> 31050 my_executable       2 501.2M [      0      0      0      0 501.2M      0      0      0 ]
> 31051 my_executable      18 502.4M [      0      0      0      0 502.4M      0      0      0 ]
> 31052 my_executable      34 500.5M [      0      0      0      0      0 500.5M      0      0 ]
> 31053 my_executable       3 515.6M [      0      0      0      0      0      0 515.6M      0 ]
> 31054 my_executable      19 508.1M [      0      0      0      0      0      0 508.1M      0 ]
> 31055 my_executable      35 503.9M [      0      0      0      0      0      0      0 503.9M ]
> 31056 my_executable       4 502.1M [ 502.1M      0      0      0      0      0      0      0 ]
> 31057 my_executable      20 515.2M [ 515.2M      0      0      0      0      0      0      0 ]
> 31058 my_executable      36 508.1M [      0 508.1M      0      0      0      0      0      0 ]
> 31059 my_executable       5 446.7M [      0      0 446.7M      0      0      0      0      0 ]
> -- 
> 
> Why didn't mpirun honor the rankfile and put the processes on the correct
> cores in the proper order?  It looks to me like mpirun doesn't like the
> rankfile...??
> 
> Thanks for any help.
> 
> Tom
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/11/16199.php
