Okay, can you add --display-devel-map --mca rmaps_base_verbose 10 to your cmd line?
It sounds like there is something about that topo that is bothering the mapper.
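For instance, applied to your two-process core-binding case below, the full invocation would look something like this (just a sketch; the extra options only turn on mapper diagnostics and should not change the run itself):

$ mpirun -np 2 --mca btl vader,self --bind-to core --report-bindings \
    --display-devel-map --mca rmaps_base_verbose 10 true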
> On Sep 2, 2016, at 9:31 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> Thanks Gilles, that's a very useful trick. The bindings reported by ORTE are in sync with the ones reported by the OS.
>
> $ mpirun -np 2 --tag-output --bind-to core --report-bindings grep Cpus_allowed_list /proc/self/status
> [1,0]<stderr>:[arc00:90813] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 4[hwt 0]]: [B./../../../B./../../../../..][../../../../../../../../../..]
> [1,1]<stderr>:[arc00:90813] MCW rank 1 bound to socket 1[core 10[hwt 0]], socket 1[core 14[hwt 0]]: [../../../../../../../../../..][B./../../../B./../../../../..]
> [1,0]<stdout>:Cpus_allowed_list: 0,8
> [1,1]<stdout>:Cpus_allowed_list: 1,9
>
> George.
>
>
> On Sat, Sep 3, 2016 at 12:27 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
> George,
>
> I cannot help much with this, I am afraid.
>
> My best bet would be to rebuild Open MPI with --enable-debug and an external recent hwloc (IIRC hwloc v2 cannot be used in Open MPI yet).
>
> You might also want to try
> mpirun --tag-output --bind-to xxx --report-bindings grep Cpus_allowed_list /proc/self/status
>
> so you can confirm both Open MPI and /proc/self/status report the same thing.
>
> Hope this helps a bit ...
>
> Gilles
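> (Concretely, a sketch of such a rebuild -- the hwloc install prefix here is hypothetical, and Open MPI's configure accepts --with-hwloc=DIR to point at an external installation:
>
> $ ./configure --enable-debug --with-hwloc=/path/to/hwloc-1.11 && make && make install
>
> )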
> George Bosilca <bosi...@icl.utk.edu> wrote:
> While investigating the ongoing issue with the OMPI messaging layer, I ran into some trouble with process binding. I read the documentation, but I still find this puzzling.
>
> Disclaimer: all experiments were done with current master (9c496f7) compiled in optimized mode. The hardware: a single node with a 20-core Xeon E5-2650 v3 (hwloc-ls output is at the end of this email).
>
> First and foremost, trying to bind to NUMA nodes was a sure way to get a segfault:
>
> $ mpirun -np 2 --mca btl vader,self --bind-to numa --report-bindings true
> --------------------------------------------------------------------------
> No objects of the specified type were found on at least one node:
>
>   Type: NUMANode
>   Node: arc00
>
> The map cannot be done as specified.
> --------------------------------------------------------------------------
> [dancer:32162] *** Process received signal ***
> [dancer:32162] Signal: Segmentation fault (11)
> [dancer:32162] Signal code: Address not mapped (1)
> [dancer:32162] Failing at address: 0x3c
> [dancer:32162] [ 0] /lib64/libpthread.so.0[0x3126a0f7e0]
> [dancer:32162] [ 1] /home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(+0x560e0)[0x7f9075bc60e0]
> [dancer:32162] [ 2] /home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(orte_grpcomm_API_xcast+0x84)[0x7f9075bc6f54]
> [dancer:32162] [ 3] /home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(orte_plm_base_orted_exit+0x1a8)[0x7f9075bd9308]
> [dancer:32162] [ 4] /home/bosilca/opt/trunk/fast/lib/openmpi/mca_plm_rsh.so(+0x384e)[0x7f907361284e]
> [dancer:32162] [ 5] /home/bosilca/opt/trunk/fast/lib/libopen-rte.so.0(orte_state_base_check_all_complete+0x324)[0x7f9075bedca4]
> [dancer:32162] [ 6] /home/bosilca/opt/trunk/fast/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x53c)[0x7f90758eafec]
> [dancer:32162] [ 7] mpirun[0x401251]
> [dancer:32162] [ 8] mpirun[0x400e24]
> [dancer:32162] [ 9] /lib64/libc.so.6(__libc_start_main+0xfd)[0x312621ed1d]
> [dancer:32162] [10] mpirun[0x400d49]
> [dancer:32162] *** End of error message ***
> Segmentation fault
>
> As you can see in the hwloc output below, there are 2 NUMA nodes on the node, and hwloc correctly identifies them, which makes OMPI's error message confusing. In any case, we should not segfault, but report a more meaningful error message.
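> (As a cross-check that hwloc itself really exposes both NUMA nodes -- a sketch with lstopo, which is the same tool as hwloc-ls; the output is reconstructed from the full dump at the end of this email:
>
> $ lstopo --only NUMANode
> NUMANode L#0 (P#0 31GB)
> NUMANode L#1 (P#1 31GB)
>
> )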
> Binding to slot (I got this from the man page for 2.0) is apparently not supported anymore. Reminder: we should update the manpage accordingly.
>
> Trying to bind to core looks better; the application at least starts. Unfortunately, the reported bindings (or at least my understanding of these bindings) are troubling. Assuming that the way we report the bindings is correct, why is each of my processes assigned to 2 cores far apart?
>
> $ mpirun -np 2 --mca btl vader,self --bind-to core --report-bindings true
> [arc00:39350] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
> [arc00:39350] MCW rank 1 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
>
> Maybe because I only used the binding option? Adding the mapping to the mix (the --map-by option) seems hopeless; the binding remains unchanged for 2 processes.
>
> $ mpirun -np 2 --mca btl vader,self --bind-to core --report-bindings true
> [arc00:40401] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
> [arc00:40401] MCW rank 1 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
>
> At this point I really wondered what was going on. To clarify, I tried to launch 3 processes on the node. Bummer! The reported binding shows that one of my processes got assigned to cores on different sockets.
>
> $ mpirun -np 3 --mca btl vader,self --bind-to core --report-bindings true
> [arc00:40311] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
> [arc00:40311] MCW rank 2 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
> [arc00:40311] MCW rank 1 bound to socket 0[core 4[hwt 0]], socket 1[core 12[hwt 0]]: [../../../../B./../../../../..][../../B./../../../../../../..]
>
> Why is rank 1 on core 4 and rank 2 on core 1? Maybe specifying the mapping will help. Will I get a more sensible binding (as suggested by our online documentation and the man pages)?
>
> $ mpirun -np 3 --mca btl vader,self --bind-to core --map-by core --report-bindings true
> [arc00:40254] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
> [arc00:40254] MCW rank 1 bound to socket 0[core 1[hwt 0]], socket 0[core 9[hwt 0]]: [../B./../../../../../../../B.][../../../../../../../../../..]
> [arc00:40254] MCW rank 2 bound to socket 0[core 2[hwt 0]], socket 1[core 10[hwt 0]]: [../../B./../../../../../../..][B./../../../../../../../../..]
>
> There is a difference: the logical rank of the processes is now respected, but one of my processes is still bound to 2 cores on different sockets, and these cores differ from the case where the mapping was not specified.
>
> Trying to bind to sockets, I got an even more confusing outcome. So I went the hard way: what can go wrong if I manually define the binding via a rankfile? Fail! My processes continue to report unsettling bindings (there is some relationship with my rank file, but most of the issues I reported above still remain).
>
> $ more rankfile
> rank 0=arc00 slot=0
> rank 1=arc00 slot=2
> $ mpirun -np 2 --mca btl vader,self -rf rankfile --report-bindings true
> [arc00:40718] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket 0[core 8[hwt 0]]: [B./../../../../../../../B./..][../../../../../../../../../..]
> [arc00:40718] MCW rank 1 bound to socket 0[core 2[hwt 0]], socket 1[core 10[hwt 0]]: [../../B./../../../../../../..][B./../../../../../../../../..]
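> (For completeness, the man page also documents a socket:core form for rankfile slots; a sketch of what I would have expected to pin each rank to a single core -- untested:
>
> rank 0=arc00 slot=0:0
> rank 1=arc00 slot=0:2
>
> )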
> At this point I got pretty much completely confused about how OMPI binding works. I'm counting on a good samaritan to explain how this works.
>
> Thanks,
> George.
>
> PS: the rankfile feature of using relative hostnames (+n?) seems to be busted, as the example from the man page leads to the following complaint:
>
> --------------------------------------------------------------------------
> A relative host was specified, but no prior allocation has been made.
> Thus, there is no way to determine the proper host to be used.
>
> hostfile entry: +n0
>
> Please see the orte_hosts man page for further information.
> --------------------------------------------------------------------------
>
>
> $ hwloc-ls
> Machine (63GB)
>   NUMANode L#0 (P#0 31GB)
>     Socket L#0 + L3 L#0 (25MB)
>       L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
>         PU L#0 (P#0)
>         PU L#1 (P#20)
>       L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
>         PU L#2 (P#1)
>         PU L#3 (P#21)
>       L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
>         PU L#4 (P#2)
>         PU L#5 (P#22)
>       L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
>         PU L#6 (P#3)
>         PU L#7 (P#23)
>       L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
>         PU L#8 (P#4)
>         PU L#9 (P#24)
>       L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
>         PU L#10 (P#5)
>         PU L#11 (P#25)
>       L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
>         PU L#12 (P#6)
>         PU L#13 (P#26)
>       L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
>         PU L#14 (P#7)
>         PU L#15 (P#27)
>       L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
>         PU L#16 (P#8)
>         PU L#17 (P#28)
>       L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
>         PU L#18 (P#9)
>         PU L#19 (P#29)
>   NUMANode L#1 (P#1 31GB)
>     Socket L#1 + L3 L#1 (25MB)
>       L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
>         PU L#20 (P#10)
>         PU L#21 (P#30)
>       L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
>         PU L#22 (P#11)
>         PU L#23 (P#31)
>       L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
>         PU L#24 (P#12)
>         PU L#25 (P#32)
>       L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
>         PU L#26 (P#13)
>         PU L#27 (P#33)
>       L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
>         PU L#28 (P#14)
>         PU L#29 (P#34)
>       L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
>         PU L#30 (P#15)
>         PU L#31 (P#35)
>       L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
>         PU L#32 (P#16)
>         PU L#33 (P#36)
>       L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
>         PU L#34 (P#17)
>         PU L#35 (P#37)
>       L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
>         PU L#36 (P#18)
>         PU L#37 (P#38)
>       L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
>         PU L#38 (P#19)
>         PU L#39 (P#39)

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel