Hmmm….and those are, of course, intended to be physical core numbers. I wonder how they are numbering them? The OS index won’t be unique, which is what is causing us trouble, so they must have some way of translating them to provide a unique number.
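
The per-node translation mentioned below can be derived from hwloc rather than hard-coded. A minimal sketch, assuming hwloc's C API is available (compile with roughly gcc translate.c -lhwloc) and run on the node in question, since the mapping is node-specific:

    #include <stdio.h>
    #include <hwloc.h>

    /* Print, for every core hwloc sees, its logical index (the numbering
     * Open MPI 1.7+ expects in a rankfile) next to its OS/physical index
     * (the numbering LSF is emitting in the rankfiles below). */
    int main(void)
    {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
        for (int i = 0; i < ncores; i++) {
            hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
            printf("logical core %2u -> physical (OS) core %2u\n",
                   core->logical_index, core->os_index);
        }

        hwloc_topology_destroy(topo);
        return 0;
    }

On the 48-core node described later in the thread, this should report logical cores 0-11 mapping to physical cores 0,4,8,...,44 (socket 0), 12-23 mapping to 1,5,9,...,45 (socket 1), and so on - which is exactly the translation the LSF-generated rankfile needs.
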
> On Nov 10, 2014, at 10:42 AM, Tom Wurgler <twu...@goodyear.com> wrote:
>
> LSF gives this, for example, over which we (LSF users) have no control:
>
> rank 0=mach1 slot=0
> rank 1=mach1 slot=4
> rank 2=mach1 slot=8
> rank 3=mach1 slot=12
> rank 4=mach1 slot=16
> rank 5=mach1 slot=20
> rank 6=mach1 slot=24
> rank 7=mach1 slot=28
> rank 8=mach1 slot=32
> rank 9=mach1 slot=36
> rank 10=mach1 slot=40
> rank 11=mach1 slot=44
> rank 12=mach1 slot=1
> rank 13=mach1 slot=5
> rank 14=mach1 slot=9
> rank 15=mach1 slot=13
>
> I have also filed a service ticket with LSF to see if they can change to logical numbering, etc.
>
> In the meantime we have written a translator, but it is cluster-specific (actually node-specific) and should not be called a solution. Running lstopo across the whole cluster found 2 nodes giving logical numbering and the rest giving physical, which is interesting in itself. Those 2 nodes have a newer BIOS level. Still investigating this...
>
> thanks
> tom
>
> Tom Wurgler
> Application Systems Principal
> The Goodyear Tire & Rubber Company
> 200 Innovation Way, Akron, OH 44316
> phone: 330.796.1656
> twu...@goodyear.com
>
> From: devel <devel-boun...@open-mpi.org> on behalf of Ralph Castain <r...@open-mpi.org>
> Sent: Monday, November 10, 2014 1:16 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] mpirun does not honor rankfile
>
> I've been taking a look at this, and I believe I can get something implemented shortly. However, one problem I've encountered is that physical core indexes are NOT unique on many systems, e.g., x86 when hyperthreads are enabled. So you would have to specify socket:core in order to get a unique location. Alternatively, when hyperthreads are enabled, the physical hyperthread number is unique.
>
> My question, therefore, is whether or not this is going to work for you. I don't know what LSF is giving you - can you provide a socket:core pair, or a physical hyperthread number?
>
>
>> On Nov 6, 2014, at 8:34 AM, Ralph Castain <rhc.open...@gmail.com> wrote:
>>
>> IIRC, you prefix the core number with a P to indicate physical.
>>
>> I'll see what I can do about getting the physical notation re-implemented - I just can't promise when that will happen.
>>
>>
>>> On Nov 6, 2014, at 8:30 AM, Tom Wurgler <twu...@goodyear.com> wrote:
>>>
>>> Well, unless we can get LSF to use physical numbering, we are dead in the water without a translator of some sort.
>>>
>>> We are trying to figure out how we can automate the translation in the meantime, but we have a mix of clusters and the mapping is different between them.
>>>
>>> We use Open MPI 1.6.4 daily (all of this current testing has been with 1.8.3). The 1.8.1 man page for mpirun states:
>>>
>>> "Starting with Open MPI v1.7, all socket/core slot locations are specified as logical indexes (the Open MPI v1.6 series used physical indexes)."
>>>
>>> But testing with rankfiles under 1.6.4, it behaves like 1.8.3, i.e. it uses logical indexes. Is there maybe a switch in 1.6.4 to use physical indexes? I am not seeing it in the mpirun --help...
>>> thanks
>>>
>>> From: devel <devel-boun...@open-mpi.org> on behalf of Ralph Castain <rhc.open...@gmail.com>
>>> Sent: Thursday, November 6, 2014 11:08 AM
>>> To: Open MPI Developers
>>> Subject: Re: [OMPI devel] mpirun does not honor rankfile
>>>
>>> Ugh….we used to have a switch for that purpose, but it became hard to manage the code. I could reimplement it at some point, but it won't be in the immediate future.
>>>
>>> I gather the issue is that the system tools report physical numbering, and so you have to mentally translate to create the rankfile? Or is there an automated script you run to do the translation?
>>>
>>> In other words, is it possible to simplify the translation in the interim? Or is this a show-stopper for you?
>>>
>>>
>>>> On Nov 6, 2014, at 7:21 AM, Tom Wurgler <twu...@goodyear.com> wrote:
>>>>
>>>> So we ran lstopo with the "--logical" arg and the output showed the core numbering as 0,1,2,3...47 instead of 0,4,8,12, etc.
>>>>
>>>> The "multiplying by 4" you speak of falls apart when you get to the second socket, as its physical numbers are 1,5,9,13... while its logical numbers are 12,13,14,15...
>>>>
>>>> So the question is: can we get mpirun to honor the physical numbering?
>>>>
>>>> thanks!
>>>> tom
>>>>
>>>> From: devel <devel-boun...@open-mpi.org> on behalf of Ralph Castain <rhc.open...@gmail.com>
>>>> Sent: Wednesday, November 5, 2014 6:30 PM
>>>> To: Open MPI Developers
>>>> Subject: Re: [OMPI devel] mpirun does not honor rankfile
>>>>
>>>> I suspect the issue may be with physical vs. logical numbering. As I said, we use logical numbering in the rankfile, not physical. So I'm not entirely sure how to translate the cpumask in your final table into the numbering shown in your rankfile listings. Is the cpumask showing a physical core number?
>>>>
>>>> I ask because it sure looks like the logical numbering we use is getting multiplied by 4 to become the cpumask you show. If they logically number their cores by socket (i.e., core 0 is the first core in the first socket, core 1 is the first core in the second socket, etc.), then that would explain the output.
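
A related option: the mpirun man page also documents a slot=<socket>:<core> rankfile form, which avoids the single-number ambiguity entirely. Assuming that form behaves in 1.8.3 as documented (the core index is relative to its socket, so the result is worth confirming with --report-bindings), the first few entries for the node discussed here might look like:

    rank 0=mach1 slot=0:0
    rank 1=mach1 slot=0:1
    rank 2=mach1 slot=0:2
    rank 3=mach1 slot=1:0
    rank 4=mach1 slot=1:1
    rank 5=mach1 slot=1:2

That would also line up with Ralph's socket:core suggestion above for making physical locations unique.
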
>>>>> On Nov 5, 2014, at 2:23 PM, Tom Wurgler <twu...@goodyear.com> wrote:
>>>>>
>>>>> Well, further investigation found this:
>>>>>
>>>>> If I edit the rankfile and change it like this:
>>>>>
>>>>> before:
>>>>> rank 0=mach1 slot=0
>>>>> rank 1=mach1 slot=4
>>>>> rank 2=mach1 slot=8
>>>>> rank 3=mach1 slot=12
>>>>> rank 4=mach1 slot=16
>>>>> rank 5=mach1 slot=20
>>>>> rank 6=mach1 slot=24
>>>>> rank 7=mach1 slot=28
>>>>> rank 8=mach1 slot=32
>>>>> rank 9=mach1 slot=36
>>>>> rank 10=mach1 slot=40
>>>>> rank 11=mach1 slot=44
>>>>> rank 12=mach1 slot=1
>>>>> rank 13=mach1 slot=5
>>>>> rank 14=mach1 slot=9
>>>>> rank 15=mach1 slot=13
>>>>>
>>>>> after:
>>>>> rank 0=mach1 slot=0
>>>>> rank 1=mach1 slot=1
>>>>> rank 2=mach1 slot=2
>>>>> rank 3=mach1 slot=3
>>>>> rank 4=mach1 slot=4
>>>>> rank 5=mach1 slot=5
>>>>> rank 6=mach1 slot=6
>>>>> rank 7=mach1 slot=7
>>>>> rank 8=mach1 slot=8
>>>>> rank 9=mach1 slot=9
>>>>> rank 10=mach1 slot=10
>>>>> rank 11=mach1 slot=11
>>>>> rank 12=mach1 slot=12
>>>>> rank 13=mach1 slot=13
>>>>> rank 14=mach1 slot=14
>>>>> rank 15=mach1 slot=15
>>>>>
>>>>> it does what I expect:
>>>>>
>>>>> PID    COMMAND        CPUMASK  TOTAL    [  N0      N1      N2      N3   N4   N5   N6   N7 ]
>>>>> 12192  my_executable  0        472.0M   [ 472.0M   0       0       0    0    0    0    0  ]
>>>>> 12193  my_executable  4        358.0M   [ 358.0M   0       0       0    0    0    0    0  ]
>>>>> 12194  my_executable  8        450.4M   [ 450.4M   0       0       0    0    0    0    0  ]
>>>>> 12195  my_executable  12       439.1M   [ 439.1M   0       0       0    0    0    0    0  ]
>>>>> 12196  my_executable  16       392.1M   [ 392.1M   0       0       0    0    0    0    0  ]
>>>>> 12197  my_executable  20       420.6M   [ 420.6M   0       0       0    0    0    0    0  ]
>>>>> 12198  my_executable  24       414.9M   [ 0       414.9M   0       0    0    0    0    0  ]
>>>>> 12199  my_executable  28       388.9M   [ 0       388.9M   0       0    0    0    0    0  ]
>>>>> 12200  my_executable  32       452.7M   [ 0       452.7M   0       0    0    0    0    0  ]
>>>>> 12201  my_executable  36       438.9M   [ 0       438.9M   0       0    0    0    0    0  ]
>>>>> 12202  my_executable  40       369.3M   [ 0       369.3M   0       0    0    0    0    0  ]
>>>>> 12203  my_executable  44       440.5M   [ 0       440.5M   0       0    0    0    0    0  ]
>>>>> 12204  my_executable  1        447.7M   [ 0        0      447.7M   0    0    0    0    0  ]
>>>>> 12205  my_executable  5        367.1M   [ 0        0      367.1M   0    0    0    0    0  ]
>>>>> 12206  my_executable  9        426.5M   [ 0        0      426.5M   0    0    0    0    0  ]
>>>>> 12207  my_executable  13       414.2M   [ 0        0      414.2M   0    0    0    0    0  ]
>>>>>
>>>>> We use hwloc 1.4 to generate a layout of the cores, etc.
>>>>>
>>>>> So either LSF created the wrong rankfile (via my config errors, most likely) or mpirun can't deal with that rankfile.
>>>>>
>>>>> I can try the nightly tarball as well. The hardware is a 48-core AMD box: 4 sockets, 2 NUMA nodes per socket with 6 cores each.
>>>>>
>>>>> thanks
>>>>> tom
>>>>>
>>>>> From: devel <devel-boun...@open-mpi.org> on behalf of Ralph Castain <rhc.open...@gmail.com>
>>>>> Sent: Wednesday, November 5, 2014 4:27 PM
>>>>> To: Open MPI Developers
>>>>> Subject: Re: [OMPI devel] mpirun does not honor rankfile
>>>>>
>>>>> Hmmm…well, it seems to be working fine in 1.8.4rc1 (I only have 12 cores on my humble machine).
>>>>> However, I can't test any interactions with LSF, though that shouldn't be an issue:
>>>>>
>>>>> $ mpirun -host bend001 -rf ./rankfile --report-bindings --display-devel-map hostname
>>>>> Data for JOB [60677,1] offset 0
>>>>>
>>>>> Mapper requested: NULL  Last mapper: rank_file  Mapping policy: BYUSER  Ranking policy: SLOT
>>>>> Binding policy: CPUSET  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
>>>>> Num new daemons: 0  New daemon starting vpid INVALID
>>>>> Num nodes: 1
>>>>>
>>>>> Data for node: bend001  Launch id: -1  State: 2
>>>>> Daemon: [[60677,0],0]  Daemon launched: True
>>>>> Num slots: 12  Slots in use: 12  Oversubscribed: FALSE
>>>>> Num slots allocated: 12  Max slots: 0
>>>>> Username on node: NULL
>>>>> Num procs: 12  Next node_rank: 12
>>>>> Data for proc: [[60677,1],0]   Pid: 0  Local rank: 0   Node rank: 0   App rank: 0   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 0,12
>>>>> Data for proc: [[60677,1],1]   Pid: 0  Local rank: 1   Node rank: 1   App rank: 1   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 8,20
>>>>> Data for proc: [[60677,1],2]   Pid: 0  Local rank: 2   Node rank: 2   App rank: 2   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 5,17
>>>>> Data for proc: [[60677,1],3]   Pid: 0  Local rank: 3   Node rank: 3   App rank: 3   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 9,21
>>>>> Data for proc: [[60677,1],4]   Pid: 0  Local rank: 4   Node rank: 4   App rank: 4   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 11,23
>>>>> Data for proc: [[60677,1],5]   Pid: 0  Local rank: 5   Node rank: 5   App rank: 5   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 7,19
>>>>> Data for proc: [[60677,1],6]   Pid: 0  Local rank: 6   Node rank: 6   App rank: 6   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 3,15
>>>>> Data for proc: [[60677,1],7]   Pid: 0  Local rank: 7   Node rank: 7   App rank: 7   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 6,18
>>>>> Data for proc: [[60677,1],8]   Pid: 0  Local rank: 8   Node rank: 8   App rank: 8   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 2,14
>>>>> Data for proc: [[60677,1],9]   Pid: 0  Local rank: 9   Node rank: 9   App rank: 9   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 4,16
>>>>> Data for proc: [[60677,1],10]  Pid: 0  Local rank: 10  Node rank: 10  App rank: 10  State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 10,22
>>>>> Data for proc: [[60677,1],11]  Pid: 0  Local rank: 11  Node rank: 11  App rank: 11  State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 1,13
>>>>> [bend001:24667] MCW rank 1 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/..][../../../../../..]
>>>>> [bend001:24667] MCW rank 2 bound to socket 1[core 8[hwt 0-1]]: [../../../../../..][../../BB/../../..]
>>>>> [bend001:24667] MCW rank 3 bound to socket 1[core 10[hwt 0-1]]: [../../../../../..][../../../../BB/..]
>>>>> [bend001:24667] MCW rank 4 bound to socket 1[core 11[hwt 0-1]]: [../../../../../..][../../../../../BB]
>>>>> [bend001:24667] MCW rank 5 bound to socket 1[core 9[hwt 0-1]]: [../../../../../..][../../../BB/../..]
>>>>> [bend001:24667] MCW rank 6 bound to socket 1[core 7[hwt 0-1]]: [../../../../../..][../BB/../../../..]
>>>>> [bend001:24667] MCW rank 7 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../..][../../../../../..]
>>>>> [bend001:24667] MCW rank 8 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../..][../../../../../..]
>>>>> [bend001:24667] MCW rank 9 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../..][../../../../../..]
>>>>> [bend001:24667] MCW rank 10 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB][../../../../../..]
>>>>> [bend001:24667] MCW rank 11 bound to socket 1[core 6[hwt 0-1]]: [../../../../../..][BB/../../../../..]
>>>>> [bend001:24667] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../..][../../../../../..]
>>>>>
>>>>> Can you try with the latest nightly 1.8 tarball?
>>>>>
>>>>> http://www.open-mpi.org/nightly/v1.8/
>>>>>
>>>>> Note that it is also possible that hwloc isn't correctly identifying the cores here. Can you tell us something about the hardware? Do you have hardware threads enabled?
>>>>>
>>>>> I ask because the bindings we report use the cpu numbers as identified by hwloc - which may not be the same as what you are expecting from some hardware vendor's map. We are using logical processor assignments, not physical. You can use the --report-bindings option to show the resulting map, as above.
>>>>>
>>>>>
>>>>>> On Nov 5, 2014, at 7:21 AM, twu...@goodyear.com wrote:
>>>>>>
>>>>>> I am using Open MPI 1.8.3 and LSF 9.1.3.
>>>>>>
>>>>>> LSF creates a rankfile that looks like this:
>>>>>>
>>>>>> RANK_FILE:
>>>>>> ======================================================================
>>>>>> rank 0=mach1 slot=0
>>>>>> rank 1=mach1 slot=4
>>>>>> rank 2=mach1 slot=8
>>>>>> rank 3=mach1 slot=12
>>>>>> rank 4=mach1 slot=16
>>>>>> rank 5=mach1 slot=20
>>>>>> rank 6=mach1 slot=24
>>>>>> rank 7=mach1 slot=28
>>>>>> rank 8=mach1 slot=32
>>>>>> rank 9=mach1 slot=36
>>>>>> rank 10=mach1 slot=40
>>>>>> rank 11=mach1 slot=44
>>>>>> rank 12=mach1 slot=1
>>>>>> rank 13=mach1 slot=5
>>>>>> rank 14=mach1 slot=9
>>>>>> rank 15=mach1 slot=13
>>>>>>
>>>>>> which really are the cores I want to use, in order.
>>>>>>
>>>>>> I log on to this machine and type (all on one line):
>>>>>>
>>>>>> /apps/share/openmpi/1.8.3.I1217913/bin/mpirun \
>>>>>>     --mca orte_base_help_aggregate 0 \
>>>>>>     -v -display-devel-allocation \
>>>>>>     -display-devel-map \
>>>>>>     --rankfile RANK_FILE \
>>>>>>     --mca btl openib,tcp,sm,self \
>>>>>>     --x LD_LIBRARY_PATH \
>>>>>>     --np 16 \
>>>>>>     my_executable \
>>>>>>     -i model.i \
>>>>>>     -l model.o
>>>>>>
>>>>>> And I get the following on the screen:
>>>>>>
>>>>>> ======================   ALLOCATED NODES   ======================
>>>>>> mach1: slots=16 max_slots=0 slots_inuse=0 state=UP
>>>>>> =================================================================
>>>>>> Data for JOB [52387,1] offset 0
>>>>>>
>>>>>> Mapper requested: NULL  Last mapper: rank_file  Mapping policy: BYUSER  Ranking policy: SLOT
>>>>>> Binding policy: CPUSET  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
>>>>>> Num new daemons: 0  New daemon starting vpid INVALID
>>>>>> Num nodes: 1
>>>>>>
>>>>>> Data for node: mach1  Launch id: -1  State: 2
>>>>>> Daemon: [[52387,0],0]  Daemon launched: True
>>>>>> Num slots: 16  Slots in use: 16  Oversubscribed: FALSE
>>>>>> Num slots allocated: 16  Max slots: 0
>>>>>> Username on node: NULL
>>>>>> Num procs: 16  Next node_rank: 16
>>>>>> Data for proc: [[52387,1],0]   Pid: 0  Local rank: 0   Node rank: 0   App rank: 0   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 0
>>>>>> Data for proc: [[52387,1],1]   Pid: 0  Local rank: 1   Node rank: 1   App rank: 1   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 16
>>>>>> Data for proc: [[52387,1],2]   Pid: 0  Local rank: 2   Node rank: 2   App rank: 2   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 32
>>>>>> Data for proc: [[52387,1],3]   Pid: 0  Local rank: 3   Node rank: 3   App rank: 3   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 1
>>>>>> Data for proc: [[52387,1],4]   Pid: 0  Local rank: 4   Node rank: 4   App rank: 4   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 17
>>>>>> Data for proc: [[52387,1],5]   Pid: 0  Local rank: 5   Node rank: 5   App rank: 5   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 33
>>>>>> Data for proc: [[52387,1],6]   Pid: 0  Local rank: 6   Node rank: 6   App rank: 6   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 2
>>>>>> Data for proc: [[52387,1],7]   Pid: 0  Local rank: 7   Node rank: 7   App rank: 7   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 18
>>>>>> Data for proc: [[52387,1],8]   Pid: 0  Local rank: 8   Node rank: 8   App rank: 8   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 34
>>>>>> Data for proc: [[52387,1],9]   Pid: 0  Local rank: 9   Node rank: 9   App rank: 9   State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 3
>>>>>> Data for proc: [[52387,1],10]  Pid: 0  Local rank: 10  Node rank: 10  App rank: 10  State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 19
>>>>>> Data for proc: [[52387,1],11]  Pid: 0  Local rank: 11  Node rank: 11  App rank: 11  State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 35
>>>>>> Data for proc: [[52387,1],12]  Pid: 0  Local rank: 12  Node rank: 12  App rank: 12  State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 4
>>>>>> Data for proc: [[52387,1],13]  Pid: 0  Local rank: 13  Node rank: 13  App rank: 13  State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 20
>>>>>> Data for proc: [[52387,1],14]  Pid: 0  Local rank: 14  Node rank: 14  App rank: 14  State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 36
>>>>>> Data for proc: [[52387,1],15]  Pid: 0  Local rank: 15  Node rank: 15  App rank: 15  State: INITIALIZED  Restarts: 0  App_context: 0  Locale: UNKNOWN  Bind location: (null)  Binding: 5
>>>>>>
>>>>>> And a numa-map of the node shows:
>>>>>>
>>>>>> PID    COMMAND        CPUMASK  TOTAL    [  N0      N1      N2      N3      N4      N5      N6      N7     ]
>>>>>> 31044  my_executable  0        443.3M   [ 443.3M   0       0       0       0       0       0       0      ]
>>>>>> 31045  my_executable  16       459.7M   [ 459.7M   0       0       0       0       0       0       0      ]
>>>>>> 31046  my_executable  32       435.0M   [ 0       435.0M   0       0       0       0       0       0      ]
>>>>>> 31047  my_executable  1        468.8M   [ 0        0      468.8M   0       0       0       0       0      ]
>>>>>> 31048  my_executable  17       493.2M   [ 0        0      493.2M   0       0       0       0       0      ]
>>>>>> 31049  my_executable  33       498.0M   [ 0        0       0      498.0M   0       0       0       0      ]
>>>>>> 31050  my_executable  2        501.2M   [ 0        0       0       0      501.2M   0       0       0      ]
>>>>>> 31051  my_executable  18       502.4M   [ 0        0       0       0      502.4M   0       0       0      ]
>>>>>> 31052  my_executable  34       500.5M   [ 0        0       0       0       0      500.5M   0       0      ]
>>>>>> 31053  my_executable  3        515.6M   [ 0        0       0       0       0       0      515.6M   0      ]
>>>>>> 31054  my_executable  19       508.1M   [ 0        0       0       0       0       0      508.1M   0      ]
>>>>>> 31055  my_executable  35       503.9M   [ 0        0       0       0       0       0       0      503.9M  ]
>>>>>> 31056  my_executable  4        502.1M   [ 502.1M   0       0       0       0       0       0       0      ]
>>>>>> 31057  my_executable  20       515.2M   [ 515.2M   0       0       0       0       0       0       0      ]
>>>>>> 31058  my_executable  36       508.1M   [ 0       508.1M   0       0       0       0       0       0      ]
>>>>>> 31059  my_executable  5        446.7M   [ 0        0      446.7M   0       0       0       0       0      ]
>>>>>> --
>>>>>>
>>>>>> Why didn't mpirun honor the rankfile and put the processes on the correct cores in the proper order? It looks to me like mpirun doesn't like the rankfile...??
>>>>>>
>>>>>> Thanks for any help.
>>>>>>
>>>>>> Tom
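
To close the loop on the interim workaround mentioned at the top of the thread: a node-local rankfile translator can be reduced to a short hwloc program. This is a rough sketch, not the translator actually deployed; it assumes input lines of exactly the "rank N=host slot=P" form shown above and does minimal error handling:

    #include <stdio.h>
    #include <hwloc.h>

    /* Read a rankfile on stdin whose slot numbers are physical (OS) core
     * indexes, as LSF emitted above, and rewrite each slot as the hwloc
     * logical core index that Open MPI 1.7+ expects. */
    int main(void)
    {
        hwloc_topology_t topo;
        char line[256];

        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        while (fgets(line, sizeof(line), stdin)) {
            int rank, phys;
            char host[128];

            if (sscanf(line, "rank %d=%127[^ ] slot=%d", &rank, host, &phys) == 3) {
                int found = 0;
                int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
                for (int i = 0; i < ncores; i++) {
                    hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
                    if (core->os_index == (unsigned)phys) {
                        /* Re-emit the slot using the logical index */
                        printf("rank %d=%s slot=%u\n", rank, host, core->logical_index);
                        found = 1;
                        break;
                    }
                }
                if (!found)
                    fprintf(stderr, "no core with physical index %d on this node\n", phys);
            } else {
                fputs(line, stdout);  /* pass through anything that isn't a rank line */
            }
        }

        hwloc_topology_destroy(topo);
        return 0;
    }

It has to run on (or against the topology of) the node named in the rankfile, e.g. ./translate < RANK_FILE > RANK_FILE.logical, since - as the two newer-BIOS nodes mentioned above show - the physical numbering can differ from node to node.
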