Thanks, Jeff. I think Devendar and I are observing the same issue; we're talking about the same cluster. And I agree with Ralph that it must just be a printout error, since the latency test shows that the actual binding seems to be correct.
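In case it is useful for the comparison, here is a small sanity check one could run inside each rank (just a minimal sketch, not something from our tests; it assumes the hwloc headers and library are available on the nodes, and the file name check_bind.c is only an example). Each rank asks hwloc for its actual binding and prints both the OS/physical PU numbers (P#) and hwloc's logical indexes (L#), so the result can be compared directly with the --report-bindings line:

/* check_bind.c: print each rank's actual binding, physical and logical. */
#include <stdio.h>
#include <mpi.h>
#include <hwloc.h>

int main(int argc, char **argv)
{
    int rank;
    unsigned os_idx;
    hwloc_topology_t topo;
    hwloc_cpuset_t set;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Build the hwloc view of this node and query our current binding. */
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);
    set = hwloc_bitmap_alloc();

    if (hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_PROCESS) == 0) {
        /* Walk every PU we are bound to; the bitmap holds OS (physical) indexes. */
        hwloc_bitmap_foreach_begin(os_idx, set) {
            hwloc_obj_t pu = hwloc_get_pu_obj_by_os_index(topo, os_idx);
            hwloc_obj_t core = pu ?
                hwloc_get_ancestor_obj_by_type(topo, HWLOC_OBJ_CORE, pu) : NULL;
            printf("rank %d: PU P#%u = PU L#%d, core L#%d\n",
                   rank, os_idx,
                   pu   ? (int) pu->logical_index   : -1,
                   core ? (int) core->logical_index : -1);
        } hwloc_bitmap_foreach_end();
    } else {
        printf("rank %d: not bound (or binding query failed)\n", rank);
    }

    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topo);
    MPI_Finalize();
    return 0;
}

Building it with something like "mpicc check_bind.c -o check_bind -lhwloc" and launching it with the same mpirun options should show whether rank 1 really lands on socket 1 (physical PUs 7-13 and 21-27 on these nodes), independently of what the binding printout claims.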
Best regards,
Elena

On Tue, Apr 21, 2015 at 6:17 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> +1
>
> Devendar, you seem to be reporting a different issue than Elena...?  FWIW:
> Open MPI has always used logical CPU numbering.  As far as I can tell from
> your output, it looks like Open MPI did the Right Thing with your examples.
>
> Elena's example seemed to show conflicting cpu numbering -- where OMPI
> said it would bind a process and then where it actually bound it.  Ralph
> mentioned to me that he would look at this as soon as he could; he thinks
> it might just be an error in the printf output (and that the binding is
> actually occurring in the right location).
>
>
> > On Apr 20, 2015, at 9:48 PM, tmish...@jcity.maeda.co.jp wrote:
> >
> > Hi Devendar,
> >
> > As far as I know, the report-bindings option shows the logical
> > CPU order. You, on the other hand, are talking about the physical
> > one, I guess.
> >
> > Regards,
> > Tetsuya Mishima
> >
> > On 2015/04/21 9:04:37, "devel" wrote in "Re: [OMPI devel] binding output error":
> >> HT is not enabled. All nodes have the same topology. This is reproducible
> >> even on a single node.
> >>
> >> I ran osu_latency to see whether rank 1 really is mapped to the other
> >> socket with --map-by socket. The mapping looks correct according to the
> >> latency test.
> >>
> >> $mpirun -np 2 -report-bindings -map-by socket /hpc/local/benchmarks/hpc-stack-icc/install/ompi-mellanox-v1.8/tests/osu-micro-benchmarks-4.4.1/osu_latency
> >>
> >> [clx-orion-001:10084] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././.][./././././././././././././.]
> >> [clx-orion-001:10084] MCW rank 1 bound to socket 1[core 14[hwt 0]]: [./././././././././././././.][B/././././././././././././.]
> >>
> >> # OSU MPI Latency Test v4.4.1
> >> # Size          Latency (us)
> >> 0                       0.50
> >> 1                       0.50
> >> 2                       0.50
> >> 4                       0.49
> >>
> >> $mpirun -np 2 -report-bindings -cpu-set 1,7 /hpc/local/benchmarks/hpc-stack-icc/install/ompi-mellanox-v1.8/tests/osu-micro-benchmarks-4.4.1/osu_latency
> >>
> >> [clx-orion-001:10155] MCW rank 0 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././.][./././././././././././././.]
> >> [clx-orion-001:10155] MCW rank 1 bound to socket 0[core 7[hwt 0]]: [./././././././B/./././././.][./././././././././././././.]
> >>
> >> # OSU MPI Latency Test v4.4.1
> >> # Size          Latency (us)
> >> 0                       0.23
> >> 1                       0.24
> >> 2                       0.23
> >> 4                       0.22
> >> 8                       0.23
> >>
> >> Both hwloc and /proc/cpuinfo indicate the following CPU numbering:
> >>
> >> socket 0 cpus: 0 1 2 3 4 5 6 14 15 16 17 18 19 20
> >> socket 1 cpus: 7 8 9 10 11 12 13 21 22 23 24 25 26 27
> >>
> >> $hwloc-info -f
> >> Machine (256GB)
> >>   NUMANode L#0 (P#0 128GB) + Socket L#0 + L3 L#0 (35MB)
> >>     L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
> >>     L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#1)
> >>     L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#2)
> >>     L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#3)
> >>     L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#4)
> >>     L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#5)
> >>     L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#6)
> >>     L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#14)
> >>     L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8 + PU L#8 (P#15)
> >>     L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9 + PU L#9 (P#16)
> >>     L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10 + PU L#10 (P#17)
> >>     L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11 + PU L#11 (P#18)
> >>     L2 L#12 (256KB) + L1 L#12 (32KB) + Core L#12 + PU L#12 (P#19)
> >>     L2 L#13 (256KB) + L1 L#13 (32KB) + Core L#13 + PU L#13 (P#20)
> >>   NUMANode L#1 (P#1 128GB) + Socket L#1 + L3 L#1 (35MB)
> >>     L2 L#14 (256KB) + L1 L#14 (32KB) + Core L#14 + PU L#14 (P#7)
> >>     L2 L#15 (256KB) + L1 L#15 (32KB) + Core L#15 + PU L#15 (P#8)
> >>     L2 L#16 (256KB) + L1 L#16 (32KB) + Core L#16 + PU L#16 (P#9)
> >>     L2 L#17 (256KB) + L1 L#17 (32KB) + Core L#17 + PU L#17 (P#10)
> >>     L2 L#18 (256KB) + L1 L#18 (32KB) + Core L#18 + PU L#18 (P#11)
> >>     L2 L#19 (256KB) + L1 L#19 (32KB) + Core L#19 + PU L#19 (P#12)
> >>     L2 L#20 (256KB) + L1 L#20 (32KB) + Core L#20 + PU L#20 (P#13)
> >>     L2 L#21 (256KB) + L1 L#21 (32KB) + Core L#21 + PU L#21 (P#21)
> >>     L2 L#22 (256KB) + L1 L#22 (32KB) + Core L#22 + PU L#22 (P#22)
> >>     L2 L#23 (256KB) + L1 L#23 (32KB) + Core L#23 + PU L#23 (P#23)
> >>     L2 L#24 (256KB) + L1 L#24 (32KB) + Core L#24 + PU L#24 (P#24)
> >>     L2 L#25 (256KB) + L1 L#25 (32KB) + Core L#25 + PU L#25 (P#25)
> >>     L2 L#26 (256KB) + L1 L#26 (32KB) + Core L#26 + PU L#26 (P#26)
> >>     L2 L#27 (256KB) + L1 L#27 (32KB) + Core L#27 + PU L#27 (P#27)
> >>
> >> So, does --report-bindings show yet another level of logical CPU numbering?
> >>
> >> -Devendar
> >>
> >> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
> >> Sent: Monday, April 20, 2015 3:52 PM
> >> To: Open MPI Developers
> >> Subject: Re: [OMPI devel] binding output error
> >>
> >> Also, was this with HTs enabled? I'm wondering if the print code is
> >> incorrectly computing the core because it isn't correctly accounting for
> >> HT cpus.
> >>
> >> On Mon, Apr 20, 2015 at 3:49 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> >>
> >> Ralph's the authority on this one, but just to be sure: are all nodes the
> >> same topology?  E.g., does adding "--hetero-nodes" to the mpirun command
> >> line fix the problem?
> >>>
> >>> On Apr 20, 2015, at 9:29 AM, Elena Elkina <elena.elk...@itseez.com> wrote:
> >>>
> >>> Hi guys,
> >>>
> >>> I ran into an issue on our cluster related to the mapping & binding
> >>> policies in 1.8.5.
> >>>
> >>> The problem is that the --report-bindings output doesn't correspond to
> >>> the locale. It looks like a mistake in the output itself, because it just
> >>> prints the sequential core number, while that core can actually be on
> >>> another socket. For example,
> >>>
> >>> mpirun -np 2 --display-devel-map --report-bindings --map-by socket hostname
> >>> Data for JOB [43064,1] offset 0
> >>>
> >>> Mapper requested: NULL  Last mapper: round_robin  Mapping policy: BYSOCKET  Ranking policy: SOCKET
> >>> Binding policy: CORE  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
> >>> Num new daemons: 0  New daemon starting vpid INVALID
> >>> Num nodes: 1
> >>>
> >>> Data for node: clx-orion-001  Launch id: -1  State: 2
> >>>   Daemon: [[43064,0],0]  Daemon launched: True
> >>>   Num slots: 28  Slots in use: 2  Oversubscribed: FALSE
> >>>   Num slots allocated: 28  Max slots: 0
> >>>   Username on node: NULL
> >>>   Num procs: 2  Next node_rank: 2
> >>>   Data for proc: [[43064,1],0]
> >>>     Pid: 0  Local rank: 0  Node rank: 0  App rank: 0
> >>>     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: 0-6,14-20  Bind location: 0  Binding: 0
> >>>   Data for proc: [[43064,1],1]
> >>>     Pid: 0  Local rank: 1  Node rank: 1  App rank: 1
> >>>     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: 7-13,21-27  Bind location: 7  Binding: 7
> >>> [clx-orion-001:26951] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././.][./././././././././././././.]
> >>> [clx-orion-001:26951] MCW rank 1 bound to socket 1[core 14[hwt 0]]: [./././././././././././././.][B/././././././././././././.]
> >>>
> >>> The second process should be bound to core 7 (not core 14).
> >>>
> >>> Another example:
> >>> mpirun -np 8 --display-devel-map --report-bindings --map-by core hostname
> >>> Data for JOB [43202,1] offset 0
> >>>
> >>> Mapper requested: NULL  Last mapper: round_robin  Mapping policy: BYCORE  Ranking policy: CORE
> >>> Binding policy: CORE  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
> >>> Num new daemons: 0  New daemon starting vpid INVALID
> >>> Num nodes: 1
> >>>
> >>> Data for node: clx-orion-001  Launch id: -1  State: 2
> >>>   Daemon: [[43202,0],0]  Daemon launched: True
> >>>   Num slots: 28  Slots in use: 8  Oversubscribed: FALSE
> >>>   Num slots allocated: 28  Max slots: 0
> >>>   Username on node: NULL
> >>>   Num procs: 8  Next node_rank: 8
> >>>   Data for proc: [[43202,1],0]
> >>>     Pid: 0  Local rank: 0  Node rank: 0  App rank: 0
> >>>     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: 0  Bind location: 0  Binding: 0
> >>>   Data for proc: [[43202,1],1]
> >>>     Pid: 0  Local rank: 1  Node rank: 1  App rank: 1
> >>>     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: 1  Bind location: 1  Binding: 1
> >>>   Data for proc: [[43202,1],2]
> >>>     Pid: 0  Local rank: 2  Node rank: 2  App rank: 2
> >>>     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: 2  Bind location: 2  Binding: 2
> >>>   Data for proc: [[43202,1],3]
> >>>     Pid: 0  Local rank: 3  Node rank: 3  App rank: 3
> >>>     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: 3  Bind location: 3  Binding: 3
> >>>   Data for proc: [[43202,1],4]
> >>>     Pid: 0  Local rank: 4  Node rank: 4  App rank: 4
> >>>     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: 4  Bind location: 4  Binding: 4
> >>>   Data for proc: [[43202,1],5]
> >>>     Pid: 0  Local rank: 5  Node rank: 5  App rank: 5
> >>>     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: 5  Bind location: 5  Binding: 5
> >>>   Data for proc: [[43202,1],6]
> >>>     Pid: 0  Local rank: 6  Node rank: 6  App rank: 6
> >>>     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: 6  Bind location: 6  Binding: 6
> >>>   Data for proc: [[43202,1],7]
> >>>     Pid: 0  Local rank: 7  Node rank: 7  App rank: 7
> >>>     State: INITIALIZED  Restarts: 0  App_context: 0  Locale: 14  Bind location: 14  Binding: 14
> >>> [clx-orion-001:27069] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././.][./././././././././././././.]
> >>> [clx-orion-001:27069] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././.][./././././././././././././.]
> >>> [clx-orion-001:27069] MCW rank 2 bound to socket 0[core 2[hwt 0]]: [././B/././././././././././.][./././././././././././././.]
> >>> [clx-orion-001:27069] MCW rank 3 bound to socket 0[core 3[hwt 0]]: [./././B/./././././././././.][./././././././././././././.]
> >>> [clx-orion-001:27069] MCW rank 4 bound to socket 0[core 4[hwt 0]]: [././././B/././././././././.][./././././././././././././.]
> >>> [clx-orion-001:27069] MCW rank 5 bound to socket 0[core 5[hwt 0]]: [./././././B/./././././././.][./././././././././././././.]
> >>> [clx-orion-001:27069] MCW rank 6 bound to socket 0[core 6[hwt 0]]: [././././././B/././././././.][./././././././././././././.]
> >>> [clx-orion-001:27069] MCW rank 7 bound to socket 0[core 7[hwt 0]]: [./././././././B/./././././.][./././././././././././././.]
> >>>
> >>> Rank 7 should be bound to core 14 instead of core 7, since core 7 is on
> >>> another socket.
> >>>
> >>> Best regards,
> >>> Elena
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/