Hmmm….and those are, of course, intended to be physical core numbers. I wonder how they are numbering them? The OS index won’t be unique, which is what is causing us trouble, so they must have some way of translating them to provide a unique number.
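
The per-node translation mentioned below can be derived from hwloc rather than hard-coded. A minimal sketch, assuming hwloc's C API is available (compile with roughly gcc translate.c -lhwloc) and run on the node in question, since the mapping is node-specific:

    #include <stdio.h>
    #include <hwloc.h>

    /* Print, for every core hwloc sees, its logical index (the numbering
     * Open MPI 1.7+ expects in a rankfile) next to its OS/physical index
     * (the numbering LSF is emitting in the rankfiles below). */
    int main(void)
    {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
        for (int i = 0; i < ncores; i++) {
            hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
            printf("logical core %2u -> physical (OS) core %2u\n",
                   core->logical_index, core->os_index);
        }

        hwloc_topology_destroy(topo);
        return 0;
    }

On the 48-core node described later in the thread, this should report logical cores 0-11 mapping to physical cores 0,4,8,...,44 (socket 0), 12-23 mapping to 1,5,9,...,45 (socket 1), and so on - which is exactly the translation the LSF-generated rankfile needs.
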
> On Nov 10, 2014, at 10:42 AM, Tom Wurgler <twu...@goodyear.com> wrote:
>
> LSF gives this, for example, over which we (LSF users) have no control:
>
> rank 0=mach1 slot=0
> rank 1=mach1 slot=4
> rank 2=mach1 slot=8
> rank 3=mach1 slot=12
> rank 4=mach1 slot=16
> rank 5=mach1 slot=20
> rank 6=mach1 slot=24
> rank 7=mach1 slot=28
> rank 8=mach1 slot=32
> rank 9=mach1 slot=36
> rank 10=mach1 slot=40
> rank 11=mach1 slot=44
> rank 12=mach1 slot=1
> rank 13=mach1 slot=5
> rank 14=mach1 slot=9
> rank 15=mach1 slot=13
>
> I have also filed a service ticket with LSF to see if they can change to logical numbering, etc.
>
> In the meantime we have written a translator, but it is cluster-specific (actually node-specific) and should not be called a solution. Running lstopo across the whole cluster found 2 nodes giving logical numbering and the rest giving physical, which is interesting in itself. Those 2 nodes have a newer BIOS level. Still investigating this...
>
> thanks
> tom
>
> Tom Wurgler
> Application Systems Principal
> The Goodyear Tire & Rubber Company
> 200 Innovation Way, Akron, OH 44316
> phone: 330.796.1656
> twu...@goodyear.com
>
> From: devel <devel-boun...@open-mpi.org> on behalf of Ralph Castain <r...@open-mpi.org>
> Sent: Monday, November 10, 2014 1:16 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] mpirun does not honor rankfile
>
> I've been taking a look at this, and I believe I can get something implemented shortly. However, one problem I've encountered is that physical core indexes are NOT unique on many systems, e.g., x86 when hyperthreads are enabled. So you would have to specify socket:core in order to get a unique location. Alternatively, when hyperthreads are enabled, the physical hyperthread number is unique.
>
> My question, therefore, is whether or not this is going to work for you. I don't know what LSF is giving you - can you provide a socket:core pair, or a physical hyperthread number?
>
>
>> On Nov 6, 2014, at 8:34 AM, Ralph Castain <rhc.open...@gmail.com> wrote:
>>
>> IIRC, you prefix the core number with a P to indicate physical.
>>
>> I'll see what I can do about getting the physical notation re-implemented - I just can't promise when that will happen.
>>
>>
>>> On Nov 6, 2014, at 8:30 AM, Tom Wurgler <twu...@goodyear.com> wrote:
>>>
>>> Well, unless we can get LSF to use physical numbering, we are dead in the water without a translator of some sort.
>>>
>>> We are trying to figure out how we can automate the translation in the meantime, but we have a mix of clusters and the mapping is different between them.
>>>
>>> We use Open MPI 1.6.4 daily (all of this current testing has been with 1.8.3). The 1.8.1 man page for mpirun states:
>>>
>>> "Starting with Open MPI v1.7, all socket/core slot locations are specified as logical indexes (the Open MPI v1.6 series used physical indexes)."
>>>
>>> But testing with rankfiles under 1.6.4, it behaves like 1.8.3, i.e. it uses logical indexes. Is there maybe a switch in 1.6.4 to use physical indexes? I am not seeing it in the mpirun --help...
>>> thanks
>>>
>>> From: devel <devel-boun...@open-mpi.org> on behalf of Ralph Castain <rhc.open...@gmail.com>
>>> Sent: Thursday, November 6, 2014 11:08 AM
>>> To: Open MPI Developers
>>> Subject: Re: [OMPI devel] mpirun does not honor rankfile
>>>
>>> Ugh….we used to have a switch for that purpose, but it became hard to manage the code. I could reimplement it at some point, but it won't be in the immediate future.
>>>
>>> I gather the issue is that the system tools report physical numbering, and so you have to mentally translate to create the rankfile? Or is there an automated script you run to do the translation?
>>>
>>> In other words, is it possible to simplify the translation in the interim? Or is this a show-stopper for you?
>>>
>>>
>>>> On Nov 6, 2014, at 7:21 AM, Tom Wurgler <twu...@goodyear.com> wrote:
>>>>
>>>> So we ran lstopo with the "--logical" arg and the output showed the core numbering as 0,1,2,3...47 instead of 0,4,8,12, etc.
>>>>
>>>> The "multiplying by 4" you speak of falls apart when you get to the second socket, as its physical numbers are 1,5,9,13... while its logical numbers are 12,13,14,15...
>>>>
>>>> So the question is: can we get mpirun to honor the physical numbering?
>>>>
>>>> thanks!
>>>> tom
>>>>
>>>> From: devel <devel-boun...@open-mpi.org> on behalf of Ralph Castain <rhc.open...@gmail.com>
>>>> Sent: Wednesday, November 5, 2014 6:30 PM
>>>> To: Open MPI Developers
>>>> Subject: Re: [OMPI devel] mpirun does not honor rankfile
>>>>
>>>> I suspect the issue may be with physical vs. logical numbering. As I said, we use logical numbering in the rankfile, not physical. So I'm not entirely sure how to translate the cpumask in your final table into the numbering shown in your rankfile listings. Is the cpumask showing a physical core number?
>>>>
>>>> I ask because it sure looks like the logical numbering we use is getting multiplied by 4 to become the cpumask you show. If they logically number their cores by socket (i.e., core 0 is the first core in the first socket, core 1 is the first core in the second socket, etc.), then that would explain the output.
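
A related option: the mpirun man page also documents a slot=<socket>:<core> rankfile form, which avoids the single-number ambiguity entirely. Assuming that form behaves in 1.8.3 as documented (the core index is relative to its socket, so the result is worth confirming with --report-bindings), the first few entries for the node discussed here might look like:

    rank 0=mach1 slot=0:0
    rank 1=mach1 slot=0:1
    rank 2=mach1 slot=0:2
    rank 3=mach1 slot=1:0
    rank 4=mach1 slot=1:1
    rank 5=mach1 slot=1:2

That would also line up with Ralph's socket:core suggestion above for making physical locations unique.
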
>>>>> On Nov 5, 2014, at 2:23 PM, Tom Wurgler <twu...@goodyear.com> wrote:
>>>>>
>>>>> Well, further investigation found this:
>>>>>
>>>>> If I edit the rankfile and change it like this:
>>>>>
>>>>> before:
>>>>> rank 0=mach1 slot=0
>>>>> rank 1=mach1 slot=4
>>>>> rank 2=mach1 slot=8
>>>>> rank 3=mach1 slot=12
>>>>> rank 4=mach1 slot=16
>>>>> rank 5=mach1 slot=20
>>>>> rank 6=mach1 slot=24
>>>>> rank 7=mach1 slot=28
>>>>> rank 8=mach1 slot=32
>>>>> rank 9=mach1 slot=36
>>>>> rank 10=mach1 slot=40
>>>>> rank 11=mach1 slot=44
>>>>> rank 12=mach1 slot=1
>>>>> rank 13=mach1 slot=5
>>>>> rank 14=mach1 slot=9
>>>>> rank 15=mach1 slot=13
>>>>>
>>>>> after:
>>>>> rank 0=mach1 slot=0
>>>>> rank 1=mach1 slot=1
>>>>> rank 2=mach1 slot=2
>>>>> rank 3=mach1 slot=3
>>>>> rank 4=mach1 slot=4
>>>>> rank 5=mach1 slot=5
>>>>> rank 6=mach1 slot=6
>>>>> rank 7=mach1 slot=7
>>>>> rank 8=mach1 slot=8
>>>>> rank 9=mach1 slot=9
>>>>> rank 10=mach1 slot=10
>>>>> rank 11=mach1 slot=11
>>>>> rank 12=mach1 slot=12
>>>>> rank 13=mach1 slot=13
>>>>> rank 14=mach1 slot=14
>>>>> rank 15=mach1 slot=15
>>>>>
>>>>> it does what I expect:
>>>>>
>>>>> PID    COMMAND        CPUMASK  TOTAL    [  N0      N1      N2      N3   N4   N5   N6   N7 ]
>>>>> 12192  my_executable  0        472.0M   [ 472.0M   0       0       0    0    0    0    0  ]
>>>>> 12193  my_executable  4        358.0M   [ 358.0M   0       0       0    0    0    0    0  ]
>>>>> 12194  my_executable  8        450.4M   [ 450.4M   0       0       0    0    0    0    0  ]
>>>>> 12195  my_executable  12       439.1M   [ 439.1M   0       0       0    0    0    0    0  ]
>>>>> 12196  my_executable  16       392.1M   [ 392.1M   0       0       0    0    0    0    0  ]
>>>>> 12197  my_executable  20       420.6M   [ 420.6M   0       0       0    0    0    0    0  ]
>>>>> 12198  my_executable  24       414.9M   [ 0       414.9M   0       0    0    0    0    0  ]
>>>>> 12199  my_executable  28       388.9M   [ 0       388.9M   0       0    0    0    0    0  ]
>>>>> 12200  my_executable  32       452.7M   [ 0       452.7M   0       0    0    0    0    0  ]
>>>>> 12201  my_executable  36       438.9M   [ 0       438.9M   0       0    0    0    0    0  ]
>>>>> 12202  my_executable  40       369.3M   [ 0       369.3M   0       0    0    0    0    0  ]
>>>>> 12203  my_executable  44       440.5M   [ 0       440.5M   0       0    0    0    0    0  ]
>>>>> 12204  my_executable  1        447.7M   [ 0        0      447.7M   0    0    0    0    0  ]
>>>>> 12205  my_executable  5        367.1M   [ 0        0      367.1M   0    0    0    0    0  ]
>>>>> 12206  my_executable  9        426.5M   [ 0        0      426.5M   0    0    0    0    0  ]
>>>>> 12207  my_executable  13       414.2M   [ 0        0      414.2M   0    0    0    0    0  ]
>>>>>
>>>>> We use hwloc 1.4 to generate a layout of the cores, etc.
>>>>>
>>>>> So either LSF created the wrong rankfile (via my config errors, most likely) or mpirun can't deal with that rankfile.
>>>>>
>>>>> I can try the nightly tarball as well. The hardware is a 48-core AMD box: 4 sockets, 2 NUMA nodes per socket with 6 cores each.
>>>>>
>>>>> thanks
>>>>> tom
>>>>>
>>>>> From: devel <devel-boun...@open-mpi.org> on behalf of Ralph Castain <rhc.open...@gmail.com>
>>>>> Sent: Wednesday, November 5, 2014 4:27 PM
>>>>> To: Open MPI Developers
>>>>> Subject: Re: [OMPI devel] mpirun does not honor rankfile
>>>>>
>>>>> Hmmm…well, it seems to be working fine in 1.8.4rc1 (I only have 12 cores on my humble machine).
>>>>> However, I can't test any interactions with LSF, though that shouldn't be an issue:
>>>>>
>>>>> $ mpirun -host bend001 -rf ./rankfile --report-bindings --display-devel-map hostname
>>>>> Data for JOB [60677,1] offset 0
>>>>>
>>>>> Mapper requested: NULL  Last mapper: rank_file  Mapping policy: BYUSER  Ranking policy: SLOT
>>>>> Binding policy: CPUSET  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
>>>>> Num new daemons: 0  New daemon starting vpid INVALID
>>>>> Num nodes: 1
>>>>>
>>>>> Data for node: bend001  Launch id: -1  State: 2
>>>>> Daemon: [[60677,0],0]  Daemon launched: True
>>>>> Num slots: 12  Slots in use: 12  Oversubscribed: FALSE
>>>>> Num slots allocated: 12  Max slots: 0
>>>>> Username on node: NULL
>>>>> Num procs: 12  Next node_rank: 12
>>>>> Data for proc: [[60677,1],0]   Pid: 0  Local rank: 0   Node rank: 0   App rank: 0   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 0,12
>>>>> Data for proc: [[60677,1],1]   Pid: 0  Local rank: 1   Node rank: 1   App rank: 1   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 8,20
>>>>> Data for proc: [[60677,1],2]   Pid: 0  Local rank: 2   Node rank: 2   App rank: 2   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 5,17
>>>>> Data for proc: [[60677,1],3]   Pid: 0  Local rank: 3   Node rank: 3   App rank: 3   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 9,21
>>>>> Data for proc: [[60677,1],4]   Pid: 0  Local rank: 4   Node rank: 4   App rank: 4   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 11,23
>>>>> Data for proc: [[60677,1],5]   Pid: 0  Local rank: 5   Node rank: 5   App rank: 5   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 7,19
>>>>> Data for proc: [[60677,1],6]   Pid: 0  Local rank: 6   Node rank: 6   App rank: 6   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 3,15
>>>>> Data for proc: [[60677,1],7]   Pid: 0  Local rank: 7   Node rank: 7   App rank: 7   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 6,18
>>>>> Data for proc: [[60677,1],8]   Pid: 0  Local rank: 8   Node rank: 8   App rank: 8   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 2,14
>>>>> Data for proc: [[60677,1],9]   Pid: 0  Local rank: 9   Node rank: 9   App rank: 9   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 4,16
>>>>> Data for proc: [[60677,1],10]  Pid: 0  Local rank: 10  Node rank: 10  App rank: 10  State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 10,22
>>>>> Data for proc: [[60677,1],11]  Pid: 0  Local rank: 11  Node rank: 11  App rank: 11  State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 1,13
>>>>> [bend001:24667] MCW rank 1 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/..][../../../../../..]
>>>>> [bend001:24667] MCW rank 2 bound to socket 1[core 8[hwt 0-1]]: [../../../../../..][../../BB/../../..]
>>>>> [bend001:24667] MCW rank 3 bound to socket 1[core 10[hwt 0-1]]: [../../../../../..][../../../../BB/..]
>>>>> [bend001:24667] MCW rank 4 bound to socket 1[core 11[hwt 0-1]]: [../../../../../..][../../../../../BB]
>>>>> [bend001:24667] MCW rank 5 bound to socket 1[core 9[hwt 0-1]]: [../../../../../..][../../../BB/../..]
>>>>> [bend001:24667] MCW rank 6 bound to socket 1[core 7[hwt 0-1]]: [../../../../../..][../BB/../../../..]
>>>>> [bend001:24667] MCW rank 7 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../..][../../../../../..]
>>>>> [bend001:24667] MCW rank 8 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../..][../../../../../..]
>>>>> [bend001:24667] MCW rank 9 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../..][../../../../../..]
>>>>> [bend001:24667] MCW rank 10 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB][../../../../../..]
>>>>> [bend001:24667] MCW rank 11 bound to socket 1[core 6[hwt 0-1]]: [../../../../../..][BB/../../../../..]
>>>>> [bend001:24667] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../..][../../../../../..]
>>>>>
>>>>> Can you try with the latest nightly 1.8 tarball?
>>>>>
>>>>> http://www.open-mpi.org/nightly/v1.8/
>>>>>
>>>>> Note that it is also possible that hwloc isn't correctly identifying the cores here. Can you tell us something about the hardware? Do you have hardware threads enabled?
>>>>>
>>>>> I ask because the bindings we report use the cpu numbers as identified by hwloc - which may not be the same as what you are expecting from some hardware vendor's map. We are using logical processor assignments, not physical. You can use the --report-bindings option to show the resulting map, as above.
>>>>>
>>>>>
>>>>>> On Nov 5, 2014, at 7:21 AM, twu...@goodyear.com wrote:
>>>>>>
>>>>>> I am using Open MPI 1.8.3 and LSF 9.1.3.
>>>>>>
>>>>>> LSF creates a rankfile that looks like this:
>>>>>>
>>>>>> RANK_FILE:
>>>>>> ======================================================================
>>>>>> rank 0=mach1 slot=0
>>>>>> rank 1=mach1 slot=4
>>>>>> rank 2=mach1 slot=8
>>>>>> rank 3=mach1 slot=12
>>>>>> rank 4=mach1 slot=16
>>>>>> rank 5=mach1 slot=20
>>>>>> rank 6=mach1 slot=24
>>>>>> rank 7=mach1 slot=28
>>>>>> rank 8=mach1 slot=32
>>>>>> rank 9=mach1 slot=36
>>>>>> rank 10=mach1 slot=40
>>>>>> rank 11=mach1 slot=44
>>>>>> rank 12=mach1 slot=1
>>>>>> rank 13=mach1 slot=5
>>>>>> rank 14=mach1 slot=9
>>>>>> rank 15=mach1 slot=13
>>>>>>
>>>>>> which really are the cores I want to use, in order.
>>>>>>
>>>>>> I log on to this machine and type (all on one line):
>>>>>>
>>>>>> /apps/share/openmpi/1.8.3.I1217913/bin/mpirun \
>>>>>>     --mca orte_base_help_aggregate 0 \
>>>>>>     -v -display-devel-allocation \
>>>>>>     -display-devel-map \
>>>>>>     --rankfile RANK_FILE \
>>>>>>     --mca btl openib,tcp,sm,self \
>>>>>>     --x LD_LIBRARY_PATH \
>>>>>>     --np 16 \
>>>>>>     my_executable \
>>>>>>     -i model.i \
>>>>>>     -l model.o
>>>>>>
>>>>>> And I get the following on the screen:
>>>>>>
>>>>>> ======================   ALLOCATED NODES   ======================
>>>>>> mach1: slots=16 max_slots=0 slots_inuse=0 state=UP
>>>>>> =================================================================
>>>>>> Data for JOB [52387,1] offset 0
>>>>>>
>>>>>> Mapper requested: NULL  Last mapper: rank_file  Mapping policy: BYUSER  Ranking policy: SLOT
>>>>>> Binding policy: CPUSET  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
>>>>>> Num new daemons: 0  New daemon starting vpid INVALID
>>>>>> Num nodes: 1
>>>>>>
>>>>>> Data for node: mach1  Launch id: -1  State: 2
>>>>>> Daemon: [[52387,0],0]  Daemon launched: True
>>>>>> Num slots: 16  Slots in use: 16  Oversubscribed: FALSE
>>>>>> Num slots allocated: 16  Max slots: 0
>>>>>> Username on node: NULL
>>>>>> Num procs: 16  Next node_rank: 16
>>>>>> Data for proc: [[52387,1],0]   Pid: 0  Local rank: 0   Node rank: 0   App rank: 0   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 0
>>>>>> Data for proc: [[52387,1],1]   Pid: 0  Local rank: 1   Node rank: 1   App rank: 1   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 16
>>>>>> Data for proc: [[52387,1],2]   Pid: 0  Local rank: 2   Node rank: 2   App rank: 2   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 32
>>>>>> Data for proc: [[52387,1],3]   Pid: 0  Local rank: 3   Node rank: 3   App rank: 3   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 1
>>>>>> Data for proc: [[52387,1],4]   Pid: 0  Local rank: 4   Node rank: 4   App rank: 4   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 17
>>>>>> Data for proc: [[52387,1],5]   Pid: 0  Local rank: 5   Node rank: 5   App rank: 5   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 33
>>>>>> Data for proc: [[52387,1],6]   Pid: 0  Local rank: 6   Node rank: 6   App rank: 6   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 2
>>>>>> Data for proc: [[52387,1],7]   Pid: 0  Local rank: 7   Node rank: 7   App rank: 7   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 18
>>>>>> Data for proc: [[52387,1],8]   Pid: 0  Local rank: 8   Node rank: 8   App rank: 8   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 34
>>>>>> Data for proc: [[52387,1],9]   Pid: 0  Local rank: 9   Node rank: 9   App rank: 9   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 3
>>>>>> Data for proc: [[52387,1],10]  Pid: 0  Local rank: 10  Node rank: 10  App rank: 10  State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 19
>>>>>> Data for proc: [[52387,1],11]  Pid: 0  Local rank: 11  Node rank: 11  App rank: 11  State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 35
>>>>>> Data for proc: [[52387,1],12]  Pid: 0  Local rank: 12  Node rank: 12  App rank: 12  State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 4
>>>>>> Data for proc: [[52387,1],13]  Pid: 0  Local rank: 13  Node rank: 13  App rank: 13  State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 20
>>>>>> Data for proc: [[52387,1],14]  Pid: 0  Local rank: 14  Node rank: 14  App rank: 14  State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 36
>>>>>> Data for proc: [[52387,1],15]  Pid: 0  Local rank: 15  Node rank: 15  App rank: 15  State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 5
>>>>>>
>>>>>> And a numa-map of the node shows:
>>>>>>
>>>>>> PID    COMMAND        CPUMASK  TOTAL    [  N0      N1      N2      N3      N4      N5      N6      N7     ]
>>>>>> 31044  my_executable  0        443.3M   [ 443.3M   0       0       0       0       0       0       0      ]
>>>>>> 31045  my_executable  16       459.7M   [ 459.7M   0       0       0       0       0       0       0      ]
>>>>>> 31046  my_executable  32       435.0M   [ 0       435.0M   0       0       0       0       0       0      ]
>>>>>> 31047  my_executable  1        468.8M   [ 0        0      468.8M   0       0       0       0       0      ]
>>>>>> 31048  my_executable  17       493.2M   [ 0        0      493.2M   0       0       0       0       0      ]
>>>>>> 31049  my_executable  33       498.0M   [ 0        0       0      498.0M   0       0       0       0      ]
>>>>>> 31050  my_executable  2        501.2M   [ 0        0       0       0      501.2M   0       0       0      ]
>>>>>> 31051  my_executable  18       502.4M   [ 0        0       0       0      502.4M   0       0       0      ]
>>>>>> 31052  my_executable  34       500.5M   [ 0        0       0       0       0      500.5M   0       0      ]
>>>>>> 31053  my_executable  3        515.6M   [ 0        0       0       0       0       0      515.6M   0      ]
>>>>>> 31054  my_executable  19       508.1M   [ 0        0       0       0       0       0      508.1M   0      ]
>>>>>> 31055  my_executable  35       503.9M   [ 0        0       0       0       0       0       0      503.9M  ]
>>>>>> 31056  my_executable  4        502.1M   [ 502.1M   0       0       0       0       0       0       0      ]
>>>>>> 31057  my_executable  20       515.2M   [ 515.2M   0       0       0       0       0       0       0      ]
>>>>>> 31058  my_executable  36       508.1M   [ 0       508.1M   0       0       0       0       0       0      ]
>>>>>> 31059  my_executable  5        446.7M   [ 0        0      446.7M   0       0       0       0       0      ]
>>>>>> --
>>>>>>
>>>>>> Why didn't mpirun honor the rankfile and put the processes on the correct cores in the proper order? It looks to me like mpirun doesn't like the rankfile...??
>>>>>>
>>>>>> Thanks for any help.
>>>>>>
>>>>>> Tom
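
To close the loop on the interim workaround mentioned at the top of the thread: a node-local rankfile translator can be reduced to a short hwloc program. This is a rough sketch, not the translator actually deployed; it assumes input lines of exactly the "rank N=host slot=P" form shown above and does minimal error handling:

    #include <stdio.h>
    #include <hwloc.h>

    /* Read a rankfile on stdin whose slot numbers are physical (OS) core
     * indexes, as LSF emitted above, and rewrite each slot as the hwloc
     * logical core index that Open MPI 1.7+ expects. */
    int main(void)
    {
        hwloc_topology_t topo;
        char line[256];

        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        while (fgets(line, sizeof(line), stdin)) {
            int rank, phys;
            char host[128];

            if (sscanf(line, "rank %d=%127[^ ] slot=%d", &rank, host, &phys) == 3) {
                int found = 0;
                int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
                for (int i = 0; i < ncores; i++) {
                    hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
                    if (core->os_index == (unsigned)phys) {
                        /* Re-emit the slot using the logical index */
                        printf("rank %d=%s slot=%u\n", rank, host, core->logical_index);
                        found = 1;
                        break;
                    }
                }
                if (!found)
                    fprintf(stderr, "no core with physical index %d on this node\n", phys);
            } else {
                fputs(line, stdout);  /* pass through anything that isn't a rank line */
            }
        }

        hwloc_topology_destroy(topo);
        return 0;
    }

It has to run on (or against the topology of) the node named in the rankfile, e.g. ./translate < RANK_FILE > RANK_FILE.logical, since - as the two newer-BIOS nodes mentioned above show - the physical numbering can differ from node to node.
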