Well, further investigation found this:

If I edit the rankfile and change it like this:


before:

rank 0=mach1 slot=0
rank 1=mach1 slot=4
rank 2=mach1 slot=8
rank 3=mach1 slot=12
rank 4=mach1 slot=16
rank 5=mach1 slot=20
rank 6=mach1 slot=24
rank 7=mach1 slot=28
rank 8=mach1 slot=32
rank 9=mach1 slot=36
rank 10=mach1 slot=40
rank 11=mach1 slot=44
rank 12=mach1 slot=1
rank 13=mach1 slot=5
rank 14=mach1 slot=9
rank 15=mach1 slot=13


after:

rank 0=mach1 slot=0
rank 1=mach1 slot=1
rank 2=mach1 slot=2
rank 3=mach1 slot=3
rank 4=mach1 slot=4
rank 5=mach1 slot=5
rank 6=mach1 slot=6
rank 7=mach1 slot=7
rank 8=mach1 slot=8
rank 9=mach1 slot=9
rank 10=mach1 slot=10
rank 11=mach1 slot=11
rank 12=mach1 slot=12
rank 13=mach1 slot=13
rank 14=mach1 slot=14
rank 15=mach1 slot=15


It does what I expect:

  PID COMMAND        CPUMASK   TOTAL [     N0     N1     N2     N3     N4     N5     N6     N7 ]
12192 my_executable        0  472.0M [ 472.0M      0      0      0      0      0      0      0 ]
12193 my_executable        4  358.0M [ 358.0M      0      0      0      0      0      0      0 ]
12194 my_executable        8  450.4M [ 450.4M      0      0      0      0      0      0      0 ]
12195 my_executable       12  439.1M [ 439.1M      0      0      0      0      0      0      0 ]
12196 my_executable       16  392.1M [ 392.1M      0      0      0      0      0      0      0 ]
12197 my_executable       20  420.6M [ 420.6M      0      0      0      0      0      0      0 ]
12198 my_executable       24  414.9M [      0 414.9M      0      0      0      0      0      0 ]
12199 my_executable       28  388.9M [      0 388.9M      0      0      0      0      0      0 ]
12200 my_executable       32  452.7M [      0 452.7M      0      0      0      0      0      0 ]
12201 my_executable       36  438.9M [      0 438.9M      0      0      0      0      0      0 ]
12202 my_executable       40  369.3M [      0 369.3M      0      0      0      0      0      0 ]
12203 my_executable       44  440.5M [      0 440.5M      0      0      0      0      0      0 ]
12204 my_executable        1  447.7M [      0      0 447.7M      0      0      0      0      0 ]
12205 my_executable        5  367.1M [      0      0 367.1M      0      0      0      0      0 ]
12206 my_executable        9  426.5M [      0      0 426.5M      0      0      0      0      0 ]
12207 my_executable       13  414.2M [      0      0 414.2M      0      0      0      0      0 ]
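
(In case anyone wants to reproduce the sequential file, a trivial shell loop emits it; mach1 and ranks 0-15 are just my case:)

# write "rank N=mach1 slot=N" for N = 0..15 into the rankfile
for i in $(seq 0 15); do
  echo "rank $i=mach1 slot=$i"
done > RANK_FILE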


We use hwloc 1.4 to generate a layout of the cores etc.


So either LSF created the wrong rankfile (via my config errors, most likely) or 
mpirun can't deal with that rankfile.


I can try the nightly tarball as well.  The hardware is a 48-core AMD box: 4 sockets, 2 NUMA nodes per socket, 6 cores per NUMA node.
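
To double-check what hwloc actually sees on this box, I can dump the topology both ways (a sketch with the stock hwloc utilities; the flag spellings are from newer lstopo man pages and may differ in hwloc 1.4):

lstopo --of console --logical --no-io     # topology labeled with hwloc's logical indexes
lstopo --of console --physical --no-io    # same topology, labeled with OS/physical numbers

Comparing the two outputs should show whether the slot numbers LSF emitted are physical IDs that hwloc renumbers.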


thanks

tom





________________________________
From: devel <devel-boun...@open-mpi.org> on behalf of Ralph Castain <rhc.open...@gmail.com>
Sent: Wednesday, November 5, 2014 4:27 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] mpirun does not honor rankfile

Hmmm…well, it seems to be working fine in 1.8.4rc1 (I only have 12 cores on my 
humble machine). However, I can’t test any interactions with LSF, though that 
shouldn’t be an issue:

$ mpirun -host bend001 -rf ./rankfile --report-bindings --display-devel-map hostname
 Data for JOB [60677,1] offset 0

 Mapper requested: NULL  Last mapper: rank_file  Mapping policy: BYUSER  Ranking policy: SLOT
 Binding policy: CPUSET  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
  Num new daemons: 0 New daemon starting vpid INVALID
  Num nodes: 1

 Data for node: bend001 Launch id: -1 State: 2
  Daemon: [[60677,0],0] Daemon launched: True
  Num slots: 12 Slots in use: 12 Oversubscribed: FALSE
  Num slots allocated: 12 Max slots: 0
  Username on node: NULL
  Num procs: 12 Next node_rank: 12
  Data for proc: [[60677,1],0]
  Pid: 0 Local rank: 0 Node rank: 0 App rank: 0
  State: INITIALIZED Restarts: 0 App_context: 0 Locale: UNKNOWN Bind location: (null) Binding: 0,12
  Data for proc: [[60677,1],1]
  Pid: 0 Local rank: 1 Node rank: 1 App rank: 1
  State: INITIALIZED Restarts: 0 App_context: 0 Locale: UNKNOWN Bind location: (null) Binding: 8,20
  Data for proc: [[60677,1],2]
  Pid: 0 Local rank: 2 Node rank: 2 App rank: 2
  State: INITIALIZED Restarts: 0 App_context: 0 Locale: UNKNOWN Bind location: (null) Binding: 5,17
  Data for proc: [[60677,1],3]
  Pid: 0 Local rank: 3 Node rank: 3 App rank: 3
  State: INITIALIZED Restarts: 0 App_context: 0 Locale: UNKNOWN Bind location: (null) Binding: 9,21
  Data for proc: [[60677,1],4]
  Pid: 0 Local rank: 4 Node rank: 4 App rank: 4
  State: INITIALIZED Restarts: 0 App_context: 0 Locale: UNKNOWN Bind location: (null) Binding: 11,23
  Data for proc: [[60677,1],5]
  Pid: 0 Local rank: 5 Node rank: 5 App rank: 5
  State: INITIALIZED Restarts: 0 App_context: 0 Locale: UNKNOWN Bind location: (null) Binding: 7,19
  Data for proc: [[60677,1],6]
  Pid: 0 Local rank: 6 Node rank: 6 App rank: 6
  State: INITIALIZED Restarts: 0 App_context: 0 Locale: UNKNOWN Bind location: (null) Binding: 3,15
  Data for proc: [[60677,1],7]
  Pid: 0 Local rank: 7 Node rank: 7 App rank: 7
  State: INITIALIZED Restarts: 0 App_context: 0 Locale: UNKNOWN Bind location: (null) Binding: 6,18
  Data for proc: [[60677,1],8]
  Pid: 0 Local rank: 8 Node rank: 8 App rank: 8
  State: INITIALIZED Restarts: 0 App_context: 0 Locale: UNKNOWN Bind location: (null) Binding: 2,14
  Data for proc: [[60677,1],9]
  Pid: 0 Local rank: 9 Node rank: 9 App rank: 9
  State: INITIALIZED Restarts: 0 App_context: 0 Locale: UNKNOWN Bind location: (null) Binding: 4,16
  Data for proc: [[60677,1],10]
  Pid: 0 Local rank: 10 Node rank: 10 App rank: 10
  State: INITIALIZED Restarts: 0 App_context: 0 Locale: UNKNOWN Bind location: (null) Binding: 10,22
  Data for proc: [[60677,1],11]
  Pid: 0 Local rank: 11 Node rank: 11 App rank: 11
  State: INITIALIZED Restarts: 0 App_context: 0 Locale: UNKNOWN Bind location: (null) Binding: 1,13
[bend001:24667] MCW rank 1 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/..][../../../../../..]
[bend001:24667] MCW rank 2 bound to socket 1[core 8[hwt 0-1]]: [../../../../../..][../../BB/../../..]
[bend001:24667] MCW rank 3 bound to socket 1[core 10[hwt 0-1]]: [../../../../../..][../../../../BB/..]
[bend001:24667] MCW rank 4 bound to socket 1[core 11[hwt 0-1]]: [../../../../../..][../../../../../BB]
[bend001:24667] MCW rank 5 bound to socket 1[core 9[hwt 0-1]]: [../../../../../..][../../../BB/../..]
[bend001:24667] MCW rank 6 bound to socket 1[core 7[hwt 0-1]]: [../../../../../..][../BB/../../../..]
[bend001:24667] MCW rank 7 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../..][../../../../../..]
[bend001:24667] MCW rank 8 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../..][../../../../../..]
[bend001:24667] MCW rank 9 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../..][../../../../../..]
[bend001:24667] MCW rank 10 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB][../../../../../..]
[bend001:24667] MCW rank 11 bound to socket 1[core 6[hwt 0-1]]: [../../../../../..][BB/../../../../..]
[bend001:24667] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../..][../../../../../..]

Can you try with the latest nightly 1.8 tarball?

http://www.open-mpi.org/nightly/v1.8/

Note that it is also possible that hwloc isn’t correctly identifying the cores 
here. Can you tell us something about the hardware? Do you have hardware 
threads enabled?

I ask because the binding we report uses the CPU numbers as identified by hwloc, which may not match the numbering in some hardware vendor's map. We are using logical processor assignments, not physical ones. You can use the --report-bindings option to show the resulting map, as above.
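
If you want to translate between the two numberings yourself, the hwloc-calc utility can do it; a sketch (the flag spellings are from recent hwloc releases and may differ in older ones):

hwloc-calc --physical-input --logical-output --intersect pu pu:16   # OS/physical proc 16 -> hwloc logical index
hwloc-calc --logical-input --physical-output --intersect pu pu:16   # hwloc logical PU 16 -> OS/physical proc

That would tell you what a given slot number in your rankfile actually maps to on your machine.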



On Nov 5, 2014, at 7:21 AM, twu...@goodyear.com wrote:

I am using Open MPI v1.8.3 and LSF 9.1.3.

LSF creates a rankfile that looks like:

RANK_FILE:
======================================================================
rank 0=mach1 slot=0
rank 1=mach1 slot=4
rank 2=mach1 slot=8
rank 3=mach1 slot=12
rank 4=mach1 slot=16
rank 5=mach1 slot=20
rank 6=mach1 slot=24
rank 7=mach1 slot=28
rank 8=mach1 slot=32
rank 9=mach1 slot=36
rank 10=mach1 slot=40
rank 11=mach1 slot=44
rank 12=mach1 slot=1
rank 13=mach1 slot=5
rank 14=mach1 slot=9
rank 15=mach1 slot=13

which really are the cores I want to use, in order.

I log on to this machine and type (one command; shown here with line continuations):

/apps/share/openmpi/1.8.3.I1217913/bin/mpirun \
 --mca orte_base_help_aggregate 0 \
 -v -display-devel-allocation \
 -display-devel-map \
 --rankfile RANK_FILE \
 --mca btl openib,tcp,sm,self \
 -x LD_LIBRARY_PATH \
 --np 16 \
 my_executable \
 -i model.i \
 -l model.o

And I get the following on the screen:

======================   ALLOCATED NODES   ======================
mach1: slots=16 max_slots=0 slots_inuse=0 state=UP
=================================================================
Data for JOB [52387,1] offset 0

Mapper requested: NULL  Last mapper: rank_file  Mapping policy: BYUSER  Ranking policy: SLOT
Binding policy: CPUSET  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
Num new daemons: 0 New daemon starting vpid INVALID
Num nodes: 1

Data for node: mach1 Launch id: -1 State: 2
Daemon: [[52387,0],0] Daemon launched: True
Num slots: 16 Slots in use: 16 Oversubscribed: FALSE
Num slots allocated: 16 Max slots: 0
Username on node: NULL
Num procs: 16 Next node_rank: 16
Data for proc: [[52387,1],0]
Pid: 0 Local rank: 0 Node rank: 0 App rank: 0
State: INITIALIZED Restarts: 0 App_context: 0 Locale: UNKNOWN Bind location: (null) Binding: 0
Data for proc: [[52387,1],1]
Pid: 0 Local rank: 1 Node rank: 1 App rank: 1
State: INITIALIZED Restarts: 0 App_context: 0 Locale: UNKNOWN Bind location: (null) Binding: 16
Data for proc: [[52387,1],2]
Pid: 0 Local rank: 2 Node rank: 2 App rank: 2
State: INITIALIZED Restarts: 0 App_context: 0 Locale: UNKNOWN Bind location: (null) Binding: 32
Data for proc: [[52387,1],3]
Pid: 0 Local rank: 3 Node rank: 3 App rank: 3
State: INITIALIZED Restarts: 0 App_context: 0 Locale: UNKNOWN Bind location: (null) Binding: 1
Data for proc: [[52387,1],4]
Pid: 0 Local rank: 4 Node rank: 4 App rank: 4
State: INITIALIZED Restarts: 0 App_context: 0 Locale: UNKNOWN Bind location: (null) Binding: 17
Data for proc: [[52387,1],5]
Pid: 0 Local rank: 5 Node rank: 5 App rank: 5
State: INITIALIZED Restarts: 0 App_context: 0 Locale: UNKNOWN Bind location: (null) Binding: 33
Data for proc: [[52387,1],6]
Pid: 0 Local rank: 6 Node rank: 6 App rank: 6
State: INITIALIZED Restarts: 0 App_context: 0 Locale: UNKNOWN Bind location: (null) Binding: 2
Data for proc: [[52387,1],7]
Pid: 0 Local rank: 7 Node rank: 7 App rank: 7
State: INITIALIZED Restarts: 0 App_context: 0 Locale: UNKNOWN Bind location: (null) Binding: 18
Data for proc: [[52387,1],8]
Pid: 0 Local rank: 8 Node rank: 8 App rank: 8
State: INITIALIZED Restarts: 0 App_context: 0 Locale: UNKNOWN Bind location: (null) Binding: 34
Data for proc: [[52387,1],9]
Pid: 0 Local rank: 9 Node rank: 9 App rank: 9
State: INITIALIZED Restarts: 0 App_context: 0 Locale: UNKNOWN Bind location: (null) Binding: 3
Data for proc: [[52387,1],10]
Pid: 0 Local rank: 10 Node rank: 10 App rank: 10
State: INITIALIZED Restarts: 0 App_context: 0 Locale: UNKNOWN Bind location: (null) Binding: 19
Data for proc: [[52387,1],11]
Pid: 0 Local rank: 11 Node rank: 11 App rank: 11
State: INITIALIZED Restarts: 0 App_context: 0 Locale: UNKNOWN Bind location: (null) Binding: 35
Data for proc: [[52387,1],12]
Pid: 0 Local rank: 12 Node rank: 12 App rank: 12
State: INITIALIZED Restarts: 0 App_context: 0 Locale: UNKNOWN Bind location: (null) Binding: 4
Data for proc: [[52387,1],13]
Pid: 0 Local rank: 13 Node rank: 13 App rank: 13
State: INITIALIZED Restarts: 0 App_context: 0 Locale: UNKNOWN Bind location: (null) Binding: 20
Data for proc: [[52387,1],14]
Pid: 0 Local rank: 14 Node rank: 14 App rank: 14
State: INITIALIZED Restarts: 0 App_context: 0 Locale: UNKNOWN Bind location: (null) Binding: 36
Data for proc: [[52387,1],15]
Pid: 0 Local rank: 15 Node rank: 15 App rank: 15
State: INITIALIZED Restarts: 0 App_context: 0 Locale: UNKNOWN Bind location: (null) Binding: 5

And a numa-map of the node shows:

  PID COMMAND        CPUMASK   TOTAL [     N0     N1     N2     N3     N4     N5     N6     N7 ]
31044 my_executable        0  443.3M [ 443.3M      0      0      0      0      0      0      0 ]
31045 my_executable       16  459.7M [ 459.7M      0      0      0      0      0      0      0 ]
31046 my_executable       32  435.0M [      0 435.0M      0      0      0      0      0      0 ]
31047 my_executable        1  468.8M [      0      0 468.8M      0      0      0      0      0 ]
31048 my_executable       17  493.2M [      0      0 493.2M      0      0      0      0      0 ]
31049 my_executable       33  498.0M [      0      0      0 498.0M      0      0      0      0 ]
31050 my_executable        2  501.2M [      0      0      0      0 501.2M      0      0      0 ]
31051 my_executable       18  502.4M [      0      0      0      0 502.4M      0      0      0 ]
31052 my_executable       34  500.5M [      0      0      0      0      0 500.5M      0      0 ]
31053 my_executable        3  515.6M [      0      0      0      0      0      0 515.6M      0 ]
31054 my_executable       19  508.1M [      0      0      0      0      0      0 508.1M      0 ]
31055 my_executable       35  503.9M [      0      0      0      0      0      0      0 503.9M ]
31056 my_executable        4  502.1M [ 502.1M      0      0      0      0      0      0      0 ]
31057 my_executable       20  515.2M [ 515.2M      0      0      0      0      0      0      0 ]
31058 my_executable       36  508.1M [      0 508.1M      0      0      0      0      0      0 ]
31059 my_executable        5  446.7M [      0      0 446.7M      0      0      0      0      0 ]
--

Why didn't mpirun honor the rankfile and put the processes on the correct cores in the proper order? It looks to me like mpirun doesn't like the rankfile...?

Thanks for any help.

Tom