On 28/07/2010 16:21, Bernd Kallies wrote:
> We just got one SGI UltraViolet rack, containing 48 NUMA nodes with one
> Octocore Nehalem each, SMT switched on. Essentially the machine is a big
> shared-memory machine, similar to what SGI had with their Itanium-based
> Altix 4700.
>
> OS is SLES11 (2.6.32.12-0.7.1.1381.1.PTF-default x86_64). I used
> hwloc-1.0.2 compiled with gcc.
>
> The lstopo output looks a bit strange. The full output of lstopo is
> attached. It begins with
>
> Machine (1534GB)
>   Group4 #0 (1022GB)
>     Group3 #0 (510GB)
>       Group2 #0 (254GB)
>         Group1 #0 (126GB)
>           Group0 #0 (62GB)
>             NUMANode #0 (phys=0 30GB) + Socket #0 + L3 #0 (24MB)
>               L2 #0 (256KB) + L1 #0 (32KB) + Core #0
>                 PU #0 (phys=0)
>                 PU #1 (phys=384)
>               L2 #1 (256KB) + L1 #1 (32KB) + Core #1
>                 PU #2 (phys=1)
>                 PU #3 (phys=385)
>               L2 #2 (256KB) + L1 #2 (32KB) + Core #2
>               ...
>             NUMANode #1 (phys=1 32GB) + Socket #1 + L3 #1 (24MB)
>               L2 #8 (256KB) + L1 #8 (32KB) + Core #8
>                 PU #16 (phys=8)
>                 PU #17 (phys=392)
>               L2 #9 (256KB) + L1 #9 (32KB) + Core #9
>               ...
>
> The output essentially says that there are 48 NUMA nodes with 8 cores
> each. Each NUMA node contains 32 GB memory except the 1st one, which
> contains 30 GB. Two NUMA nodes are grouped together as "Group0". Two
> "Group0" are grouped together as "Group1" and so on. There are three
> "Group3" objects, the 1st one contains 16 NUMA nodes with 510 GB, the
> remaining two contain 16 NUMA nodes with 512 GB each. Up to here the
> topology is understandable. I'm wondering about "Group4", which
> contains the three "Group3" objects. lstopo should print "1534GB"
> instead of "1022GB". There is only one "Group4" object, and there are no
> other direct children of the root object.
>

Indeed, there's something wrong. Can you send the output of
tests/linux/gather_topology.sh so that I can try to debug this from here?

> Moreover, when running applications that use the hwloc API, and call
> functions like hwloc_get_next_obj_by_depth or hwloc_get_obj_by_depth,
> then calling hwloc_topology_destroy or even free() on some
> self-allocated memory, then the app fails at this stage with
>
> *** glibc detected *** a.out: double free or corruption (out).
> or
> *** glibc detected *** a.out: free(): invalid next size (fast):
>

Can you send an example as well?

Thanks,
Brice
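
For reference, the memory totals quoted in the report can be checked with a few lines of arithmetic. The sketch below is only an illustration: it does not use hwloc, the variable and function names are made up for this example, and the sizes are the rounded GB values from the quoted lstopo output. It sums the three "Group3" sizes and lists which subsets of them add up to the 1022GB printed for "Group4".

```python
from itertools import combinations

# Memory (in GB) of the three Group3 objects, as given in the report.
# These are the rounded values printed by lstopo, not exact byte counts.
group3_gb = [510, 512, 512]

# Value that lstopo printed for Group4 #0 according to the report.
reported_group4_gb = 1022


def total_memory(children_gb):
    """Return the summed memory of a list of child objects."""
    return sum(children_gb)


# Total over all three children; this is what Group4 would be expected to show.
expected_group4_gb = total_memory(group3_gb)

# Index tuples of the Group3 subsets whose sizes add up to the reported value.
matching_subsets = [
    subset
    for size in range(1, len(group3_gb) + 1)
    for subset in combinations(range(len(group3_gb)), size)
    if total_memory([group3_gb[i] for i in subset]) == reported_group4_gb
]

print("expected Group4 total:", expected_group4_gb, "GB")
print("reported Group4 total:", reported_group4_gb, "GB")
print("Group3 subsets matching the reported value:", matching_subsets)
```

The expected total matches the 1534GB shown for the Machine object, while the reported 1022GB equals the first Group3 plus one of the 512GB ones. That is only a numerical observation and not a diagnosis of the underlying problem.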