The topology of the virtual node is a bit unusual; I am reproducing a similar setup with Linux cgroups. I have already found some problems there. I have no idea whether they are related to yours; we'll see once I have some patches.
Brice

On 01/02/2012 21:07, Paul H. Hargrove wrote:
> Responses interspersed w/ your questions, below.
> -Paul
>
> On 2/1/2012 6:13 AM, Brice Goglin wrote:
>> Can you run hwloc-gather-topology and send the resulting tarball and
>> output?
>
> Attached.
>
>> We've seen some powerpc machines where the old kernel didn't say much
>> about the topology, so your 8 cores with 4 threads appeared as 32 things
>> without much detail about their organization. I assume you can't
>> upgrade the kernel. Which kernel is this?
>
> I am told the VM spans 1 socket of 8 cores, and each core has 4 threads.
> /proc/cpuinfo doesn't show any "structure".
> So, when lstopo reports the machine as (8 sockets X 1 core X 4
> threads), it was probably as close as it could be w/o the "missing"
> information. [note that I MISreported lstopo's output as (8 sockets X
> 4 cores) in my previous email].
>
> I am a guest on this machine and can't change the kernel nor add
> accounts.
>> $ uname -a
>> Linux biou2.rice.edu 2.6.32-131.6.1.el6.ppc64 #1 SMP Tue Sep 13
>> 15:16:45 CDT 2011 ppc64 ppc64 ppc64 GNU/Linux
> Which isn't really all that old.
>
>
>> Yes the virtual node thing could also make things more wrong. What kind
>> of "virtualization" is this?
>
>
> I don't know for certain, but would guess they are using the stuff
> described in Chapter 3 of the pdf I gave the URL for.
> I don't think RHEL6 has any other virtualization support for PPC.
>
>> Thanks
>> Brice
>>
>>
>> On 01/02/2012 04:29, Paul H. Hargrove wrote:
>>> This node is an IBM "Power 750 Express server", described in detail at
>>> http://www.redbooks.ibm.com/redpapers/pdfs/redp4638.pdf
>>>
>>> Notably it is a quad-socket chassis which can take 6-core or 8-core
>>> processors.
>>> However, lstopo is reporting 8 sockets of 4 cores each.
>>> This discrepancy led me to recall the following from an email sent to
>>> me by a colleague:
>>>> A surprise
>>>> to me is that the login nodes provide the appearance of having 32
>>>> CPUs, but those are in fact only 8 cores with 4 hyper-threads,
>>>> and they are in fact VMs on top of one socket of a compute node.
>>> So, I am not really certain what I should expect lstopo to report.
>>> I suppose it is accurately reporting to me the virtual node's
>>> configuration.
>>>
>>> I bring this up because it may very well be related to the assertion
>>> failures.
>>> My guess here is that some part of hwloc has seen past the
>>> "virtual" to see the "physical" and the assertion failure reflects the
>>> resulting inconsistency. But that is just a guess. Let me know how I
>>> might help debug this failure.
>>>
>>> -Paul
>>>
>>> On 1/31/2012 7:12 PM, Paul H. Hargrove wrote:
>>>> The problem I reported below also exists in hwloc-1.4.1.
>>>> Additionally, I can reproduce the SEGVs with xlc which Chris Samuel
>>>> reported in
>>>>
>>>> http://www.open-mpi.org/community/lists/hwloc-devel/2012/01/2738.php
>>>>
>>>> -Paul
>>>>
>>>> On 1/31/2012 5:56 PM, Paul H. Hargrove wrote:
>>>>> When running "make check" in hwloc-1.3.1 on a Linux/POWER7 system I
>>>>> see:
>>>>>> lt-linux-libnuma:
>>>>>> /users/phh1/OMPI/hwloc-1.3.1-linux-ppc64-gcc//hwloc-1.3.1/tests/linux-libnuma.c:53:
>>>>>>
>>>>>> main: Assertion `hwloc_bitmap_isequal(set, set2)' failed.
>>>>>> /bin/sh: line 5: 21415 Aborted ${dir}$tst
>>>>>> FAIL: linux-libnuma
>>>>> I've reproduced that failure with 4 different compilers (3 gcc's and
>>>>> an xlc).
>>>>> The xlc-built hwloc-1.3.1 also fails an additional test:
>>>>>> lt-glibc-sched:
>>>>>> /users/phh1/OMPI/hwloc-1.3.1-linux-ppc64-xlc-11.1//hwloc-1.3.1/tests/glibc-sched.c:43:
>>>>>>
>>>>>> main: Assertion `!err' failed.
>>>>>> /bin/sh: line 5: 7077 Aborted ${dir}$tst
>>>>>> FAIL: glibc-sched
>>>>>
>>>>> The contents of /proc/cpuinfo are:
>>>>>> processor : 0
>>>>>> cpu : POWER7 (architected), altivec supported
>>>>>> clock : 3550.000000MHz
>>>>>> revision : 2.0 (pvr 003f 0200)
>>>>>>
>>>>>> [30 more of the same]
>>>>>>
>>>>>> processor : 31
>>>>>> cpu : POWER7 (architected), altivec supported
>>>>>> clock : 3550.000000MHz
>>>>>> revision : 2.0 (pvr 003f 0200)
>>>>>>
>>>>>> timebase : 512000000
>>>>>> platform : pSeries
>>>>>> model : IBM,8233-E8B
>>>>>> machine : CHRP IBM,8233-E8B
>>>>> Let me know of any other h/w or s/w info I can report.
>>>>>
>>>>> -Paul
>>>>>