Thanks, Jeff. I think Devendar and I are observing the same issue; we're
talking about the same cluster. And I agree with Ralph that it must just be
a printout error, since the latency test shows that the actual binding
appears to be correct.
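
For completeness, here is a minimal sketch of how each rank could print its
actual binding straight from the OS, independent of what --report-bindings
prints (assuming hwloc headers are installed; the file and binary names are
just placeholders):

/* check_bind.c - print each rank's actual CPU binding as seen by the OS.
 * Build (placeholder command): mpicc check_bind.c -o check_bind -lhwloc
 */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <hwloc.h>

int main(int argc, char **argv)
{
    int rank;
    hwloc_topology_t topo;
    hwloc_bitmap_t set;
    char *str = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* Ask the OS which PUs this process is bound to; the bitmap uses
     * physical (P#) indexes, i.e. the same numbers /proc/cpuinfo shows. */
    set = hwloc_bitmap_alloc();
    hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_PROCESS);
    hwloc_bitmap_list_asprintf(&str, set);
    printf("rank %d bound to physical PU(s) %s\n", rank, str);

    free(str);
    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topo);
    MPI_Finalize();
    return 0;
}

Launched with the same "mpirun -np 2 -map-by socket" options, this should
settle whether rank 1 really lands on physical PU 7 (socket 1), as the
latency numbers suggest, regardless of what the printout says.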

Best regards,
Elena


On Tue, Apr 21, 2015 at 6:17 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com
> wrote:

> +1
>
> Devendar, you seem to be reporting a different issue than Elena...?  FWIW:
> Open MPI has always used logical CPU numbering.  As far as I can tell from
> your output, it looks like Open MPI did the Right Thing with your examples.
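>
> (A quick way to cross-check the two numberings is hwloc-calc; the exact
> flag combination below is from memory, so treat it as a sketch rather than
> gospel:
>
>   $ hwloc-calc --li --po -I pu core:7
>
> which, on the topology Devendar posted, should print 14, i.e. logical core
> 7 sits on physical PU 14.)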
>
> Elena's example seemed to show conflicting cpu numbering -- where OMPI
> said it would bind a process and then where it actually bound it.  Ralph
> mentioned to me that he would look at this as soon as he could; he thinks
> it might just be an error in the printf output (and that the binding is
> actually occurring in the right location).
>
>
>
> > On Apr 20, 2015, at 9:48 PM, tmish...@jcity.maeda.co.jp wrote:
> >
> > Hi Devendar,
> >
> > As far as I know, the report-bindings option shows the logical
> > CPU order. You, on the other hand, are talking about the physical one,
> > I guess.
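> >
> > (If it helps, comparing "lstopo -l" and "lstopo -p" shows the two
> > numberings side by side: the L# indexes are what --report-bindings
> > prints, while the P# indexes are what /proc/cpuinfo reports.)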
> >
> > Regards,
> > Tetsuya Mishima
> >
> > On 2015/04/21 9:04:37, "devel" wrote in "Re: [OMPI devel] binding output
> > error":
> >> HT is not enabled.  All nodes are the same topo.  This is reproducible
> >> even on a single node.
> >>
> >>
> >>
> >> I ran osu_latency with --map-by socket to see whether it really is mapped
> >> to the other socket or not.  It looks like the mapping is correct as per
> >> the latency test.
> >>
> >>
> >>
> >> $mpirun -np 2 -report-bindings -map-by socket /hpc/local/benchmarks/hpc-stack-icc/install/ompi-mellanox-v1.8/tests/osu-micro-benchmarks-4.4.1/osu_latency
> >>
> >>
> >> [clx-orion-001:10084] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
> > [B/././././././././././././.][./././././././././././././.]
> >>
> >> [clx-orion-001:10084] MCW rank 1 bound to socket 1[core 14[hwt 0]]:
> > [./././././././././././././.][B/././././././././././././.]
> >>
> >> # OSU MPI Latency Test v4.4.1
> >>
> >> # Size          Latency (us)
> >>
> >> 0                       0.50
> >>
> >> 1                       0.50
> >>
> >> 2                       0.50
> >>
> >> 4                       0.49
> >>
> >>
> >>
> >>
> >>
> >> $mpirun -np 2 -report-bindings -cpu-set 1,7 /hpc/local/benchmarks/hpc-stack-icc/install/ompi-mellanox-v1.8/tests/osu-micro-benchmarks-4.4.1/osu_latency
> >>
> >>
> >> [clx-orion-001:10155] MCW rank 0 bound to socket 0[core 1[hwt 0]]:
> > [./B/./././././././././././.][./././././././././././././.]
> >>
> >> [clx-orion-001:10155] MCW rank 1 bound to socket 0[core 7[hwt 0]]:
> > [./././././././B/./././././.][./././././././././././././.]
> >>
> >> # OSU MPI Latency Test v4.4.1
> >>
> >> # Size          Latency (us)
> >>
> >> 0                       0.23
> >>
> >> 1                       0.24
> >>
> >> 2                       0.23
> >>
> >> 4                       0.22
> >>
> >> 8                       0.23
> >>
> >>
> >>
> >> Both hwloc and /proc/cpuinfo indicate the following CPU numbering:
> >>
> >> socket 0 cpus: 0 1 2 3 4 5 6 14 15 16 17 18 19 20
> >>
> >> socket 1 cpus: 7 8 9 10 11 12 13 21 22 23 24 25 26 27
> >>
> >>
> >>
> >> $hwloc-info -f
> >>
> >> Machine (256GB)
> >>
> >>   NUMANode L#0 (P#0 128GB) + Socket L#0 + L3 L#0 (35MB)
> >>
> >>     L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
> >>
> >>     L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#1)
> >>
> >>     L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#2)
> >>
> >>     L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#3)
> >>
> >>     L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#4)
> >>
> >>     L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#5)
> >>
> >>     L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#6)
> >>
> >>     L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#14)
> >>
> >>     L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8 + PU L#8 (P#15)
> >>
> >>     L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9 + PU L#9 (P#16)
> >>
> >>     L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10 + PU L#10 (P#17)
> >>
> >>     L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11 + PU L#11 (P#18)
> >>
> >>     L2 L#12 (256KB) + L1 L#12 (32KB) + Core L#12 + PU L#12 (P#19)
> >>
> >>     L2 L#13 (256KB) + L1 L#13 (32KB) + Core L#13 + PU L#13 (P#20)
> >>
> >>   NUMANode L#1 (P#1 128GB) + Socket L#1 + L3 L#1 (35MB)
> >>
> >>     L2 L#14 (256KB) + L1 L#14 (32KB) + Core L#14 + PU L#14 (P#7)
> >>
> >>     L2 L#15 (256KB) + L1 L#15 (32KB) + Core L#15 + PU L#15 (P#8)
> >>
> >>     L2 L#16 (256KB) + L1 L#16 (32KB) + Core L#16 + PU L#16 (P#9)
> >>
> >>     L2 L#17 (256KB) + L1 L#17 (32KB) + Core L#17 + PU L#17 (P#10)
> >>
> >>     L2 L#18 (256KB) + L1 L#18 (32KB) + Core L#18 + PU L#18 (P#11)
> >>
> >>     L2 L#19 (256KB) + L1 L#19 (32KB) + Core L#19 + PU L#19 (P#12)
> >>
> >>     L2 L#20 (256KB) + L1 L#20 (32KB) + Core L#20 + PU L#20 (P#13)
> >>
> >>     L2 L#21 (256KB) + L1 L#21 (32KB) + Core L#21 + PU L#21 (P#21)
> >>
> >>     L2 L#22 (256KB) + L1 L#22 (32KB) + Core L#22 + PU L#22 (P#22)
> >>
> >>     L2 L#23 (256KB) + L1 L#23 (32KB) + Core L#23 + PU L#23 (P#23)
> >>
> >>     L2 L#24 (256KB) + L1 L#24 (32KB) + Core L#24 + PU L#24 (P#24)
> >>
> >>     L2 L#25 (256KB) + L1 L#25 (32KB) + Core L#25 + PU L#25 (P#25)
> >>
> >>     L2 L#26 (256KB) + L1 L#26 (32KB) + Core L#26 + PU L#26 (P#26)
> >>
> >>     L2 L#27 (256KB) + L1 L#27 (32KB) + Core L#27 + PU L#27 (P#27)
> >>
> >>
> >>
> >>
> >>
> >> So, does --report-bindings show one more level of logical CPU
> >> numbering?
> >>
> >>
> >>
> >>
> >>
> >> -Devendar
> >>
> >>
> >>
> >>
> >>
> >> From:devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph
> Castain
> >> Sent: Monday, April 20, 2015 3:52 PM
> >> To: Open MPI Developers
> >> Subject: Re: [OMPI devel] binding output error
> >>
> >>
> >>
> >> Also, was this with HTs enabled? I'm wondering if the print code is
> >> incorrectly computing the core because it isn't correctly accounting for
> >> HT cpus.
> >>
> >>
> >>
> >>
> >>
> >> On Mon, Apr 20, 2015 at 3:49 PM, Jeff Squyres (jsquyres)
> > <jsquy...@cisco.com> wrote:
> >>
> >> Ralph's the authority on this one, but just to be sure: are all nodes
> the
> > same topology? E.g., does adding "--hetero-nodes" to the mpirun command
> > line fix the problem?
> >>
> >>
> >>
> >>> On Apr 20, 2015, at 9:29 AM, Elena Elkina <elena.elk...@itseez.com>
> > wrote:
> >>>
> >>> Hi guys,
> >>>
> >>> I ran into an issue on our cluster related to the mapping & binding
> >>> policies in 1.8.5.
> >>>
> >>> The problem is that the --report-bindings output doesn't correspond to
> >>> the locale. It looks like there is a mistake in the output itself,
> >>> because it just prints a sequential core number even though that core
> >>> can be on another socket. For example,
> >>>
> >>> mpirun -np 2 --display-devel-map --report-bindings --map-by socket
> > hostname
> >>>   Data for JOB [43064,1] offset 0
> >>>
> >>>   Mapper requested: NULL  Last mapper: round_robin  Mapping policy:
> > BYSOCKET  Ranking policy: SOCKET
> >>>   Binding policy: CORE  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
> >>>        Num new daemons: 0      New daemon starting vpid INVALID
> >>>        Num nodes: 1
> >>>
> >>>   Data for node: clx-orion-001         Launch id: -1   State: 2
> >>>        Daemon: [[43064,0],0]   Daemon launched: True
> >>>        Num slots: 28   Slots in use: 2 Oversubscribed: FALSE
> >>>        Num slots allocated: 28 Max slots: 0
> >>>        Username on node: NULL
> >>>        Num procs: 2    Next node_rank: 2
> >>>        Data for proc: [[43064,1],0]
> >>>                Pid: 0  Local rank: 0   Node rank: 0    App rank: 0
> >>>                State: INITIALIZED      Restarts: 0     App_context: 0
> > Locale: 0-6,14-20       Bind location: 0        Binding: 0
> >>>        Data for proc: [[43064,1],1]
> >>>                Pid: 0  Local rank: 1   Node rank: 1    App rank: 1
> >>>                State: INITIALIZED      Restarts: 0     App_context: 0
> > Locale: 7-13,21-27      Bind location: 7        Binding: 7
> >>> [clx-orion-001:26951] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
> > [B/././././././././././././.][./././././././././././././.]
> >>> [clx-orion-001:26951] MCW rank 1 bound to socket 1[core 14[hwt 0]]:
> > [./././././././././././././.][B/././././././././././././.]
> >>>
> >>> The second process should be bound to core 7 (not core 14).
> >>>
> >>>
> >>> Another example:
> >>> mpirun -np 8 --display-devel-map --report-bindings --map-by core
> > hostname
> >>>   Data for JOB [43202,1] offset 0
> >>>
> >>>   Mapper requested: NULL  Last mapper: round_robin  Mapping policy:
> > BYCORE  Ranking policy: CORE
> >>>   Binding policy: CORE  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
> >>>        Num new daemons: 0      New daemon starting vpid INVALID
> >>>        Num nodes: 1
> >>>
> >>>   Data for node: clx-orion-001         Launch id: -1   State: 2
> >>>        Daemon: [[43202,0],0]   Daemon launched: True
> >>>        Num slots: 28   Slots in use: 8 Oversubscribed: FALSE
> >>>        Num slots allocated: 28 Max slots: 0
> >>>        Username on node: NULL
> >>>        Num procs: 8    Next node_rank: 8
> >>>        Data for proc: [[43202,1],0]
> >>>                Pid: 0  Local rank: 0   Node rank: 0    App rank: 0
> >>>                State: INITIALIZED      Restarts: 0     App_context: 0
> > Locale: 0       Bind location: 0        Binding: 0
> >>>        Data for proc: [[43202,1],1]
> >>>                Pid: 0  Local rank: 1   Node rank: 1    App rank: 1
> >>>                State: INITIALIZED      Restarts: 0     App_context: 0
> > Locale: 1       Bind location: 1        Binding: 1
> >>>        Data for proc: [[43202,1],2]
> >>>                Pid: 0  Local rank: 2   Node rank: 2    App rank: 2
> >>>                State: INITIALIZED      Restarts: 0     App_context: 0
> > Locale: 2       Bind location: 2        Binding: 2
> >>>        Data for proc: [[43202,1],3]
> >>>                Pid: 0  Local rank: 3   Node rank: 3    App rank: 3
> >>>                State: INITIALIZED      Restarts: 0     App_context: 0
> > Locale: 3       Bind location: 3        Binding: 3
> >>>        Data for proc: [[43202,1],4]
> >>>                Pid: 0  Local rank: 4   Node rank: 4    App rank: 4
> >>>                State: INITIALIZED      Restarts: 0     App_context: 0
> > Locale: 4       Bind location: 4        Binding: 4
> >>>        Data for proc: [[43202,1],5]
> >>>                Pid: 0  Local rank: 5   Node rank: 5    App rank: 5
> >>>                State: INITIALIZED      Restarts: 0     App_context: 0
> > Locale: 5       Bind location: 5        Binding: 5
> >>>        Data for proc: [[43202,1],6]
> >>>                Pid: 0  Local rank: 6   Node rank: 6    App rank: 6
> >>>                State: INITIALIZED      Restarts: 0     App_context: 0
> > Locale: 6       Bind location: 6        Binding: 6
> >>>        Data for proc: [[43202,1],7]
> >>>                Pid: 0  Local rank: 7   Node rank: 7    App rank: 7
> >>>                State: INITIALIZED      Restarts: 0     App_context: 0
> > Locale: 14      Bind location: 14       Binding: 14
> >>> [clx-orion-001:27069] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
> > [B/././././././././././././.][./././././././././././././.]
> >>> [clx-orion-001:27069] MCW rank 1 bound to socket 0[core 1[hwt 0]]:
> > [./B/./././././././././././.][./././././././././././././.]
> >>> [clx-orion-001:27069] MCW rank 2 bound to socket 0[core 2[hwt 0]]:
> > [././B/././././././././././.][./././././././././././././.]
> >>> [clx-orion-001:27069] MCW rank 3 bound to socket 0[core 3[hwt 0]]:
> > [./././B/./././././././././.][./././././././././././././.]
> >>> [clx-orion-001:27069] MCW rank 4 bound to socket 0[core 4[hwt 0]]:
> > [././././B/././././././././.][./././././././././././././.]
> >>> [clx-orion-001:27069] MCW rank 5 bound to socket 0[core 5[hwt 0]]:
> > [./././././B/./././././././.][./././././././././././././.]
> >>> [clx-orion-001:27069] MCW rank 6 bound to socket 0[core 6[hwt 0]]:
> > [././././././B/././././././.][./././././././././././././.]
> >>> [clx-orion-001:27069] MCW rank 7 bound to socket 0[core 7[hwt 0]]:
> > [./././././././B/./././././.][./././././././././././././.]
> >>>
> >>> Rank 7 should be bound to core 14 instead of core 7, since core 7 is on
> >>> another socket.
> >>>
> >>> Best regards,
> >>> Elena
> >>>
> >>>
> >>
> >>> _______________________________________________
> >>> devel mailing list
> >>> de...@open-mpi.org
> >>> Subscription:  http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>> Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2015/04/17273.php
> >>
> >>
> >> --
> >> Jeff Squyres
> >> jsquy...@cisco.com
> >> For corporate legal information go to:
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> >>
> >> _______________________________________________
> >> devel mailing list
> >> de...@open-mpi.org
> >> Subscription:  http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2015/04/17282.php
> >>
> >>  _______________________________________________
> >> devel mailing list
> >> de...@open-mpi.org
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> Link to this post:
> >> http://www.open-mpi.org/community/lists/devel/2015/04/17287.php
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/04/17291.php
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/04/17295.php
