Closing this for the curious. I took a walk to the datacenter, pulled this server and a neighbor to compare against a known-good server, and discovered that two DIMMs were installed in the wrong sockets. Correcting that resolved the missing NUMA nodes.
Thanks,

jbh

On Wed, May 30, 2012 at 11:27 PM, John Hanks <john.ha...@usu.edu> wrote:
> Brice,
>
> Thanks for the advice, I may have gotten lucky. During POST it clearly
> shows 4 nodes, Node 0, Node 1, Node 2 and Node 3, with nodes 0 and 3
> marked N/A. Have sent a screenshot of that to HP.
>
> jbh
>
> On Wed, May 30, 2012 at 1:26 PM, Brice Goglin <brice.gog...@inria.fr> wrote:
>> We don't need any other info on the hwloc side. And we thank you for
>> testing the big hwloc warning code :)
>>
>> For HP:
>> * If you're lucky, the BIOS may talk about the number of NUMA nodes
>> (either in the usual messages during boot, or in the BIOS configuration
>> menu). If it says 2 on the broken node instead of 4 on other nodes,
>> you have something easy to tell HP.
>> * Otherwise we'll have to dig in the SRAT ACPI info. "dmesg | grep SRAT"
>> should talk about some "PXM" properties, which are basically NUMA
>> localities. You should see PXM 1 and 2 on the broken node, and PXM 0, 1,
>> 2 and 3 on the other ones. SRAT comes from ACPI; if SRAT is broken, the
>> hardware/firmware is buggy.
>>
>> Brice
>>
>> On 30/05/2012 21:06, John Hanks wrote:
>>> I updated the BIOS and still got the error on this host, then I did
>>> what I should have done in the first place and checked another
>>> physically identical host. Of the 4 nodes I have that are the same,
>>> only this one exhibits the error. At this point I'm blaming a hardware
>>> problem. If there's any benefit to hwloc for me to send additional
>>> debugging information I am happy to, otherwise I'm going to try to
>>> figure out what to say to HP to get this node fixed.
>>>
>>> Thanks,
>>>
>>> jbh
>>>
>>> On Wed, May 30, 2012 at 9:27 AM, John Hanks <john.ha...@usu.edu> wrote:
>>>> I recently inherited these machines and would bet small amounts of
>>>> hard currency they have never seen a BIOS update since birth. I'll
>>>> figure out how to update the BIOS and let you know if the error
>>>> persists.
>>>>
>>>> Thanks,
>>>>
>>>> jbh
>>>>
>>>> On Wed, May 30, 2012 at 9:24 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
>>>>> On May 30, 2012, at 11:22 AM, Samuel Thibault wrote:
>>>>>
>>>>>> i.e. the kernel reports that socket 0 is completely in node 1, while
>>>>>> socket 1 is half in node 1 and half in node 2. Do you have more
>>>>>> information about what the machine actually contains socket- and
>>>>>> NUMA-wise? The Dell website is not really helpful, it talks about 4-16
>>>>>> cores for the DL165 G7, while you have 24.
>>>>>
>>>>> How old is your Dell BIOS firmware? You might need to update it.
>>>>>
>>>>> --
>>>>> Jeff Squyres
>>>>> jsquy...@cisco.com
>>>>> For corporate legal information go to:
>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>
>>>>> _______________________________________________
>>>>> hwloc-users mailing list
>>>>> hwloc-us...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
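
For anyone landing here with the same symptom, the SRAT check Brice describes in the quoted trail could be scripted roughly as follows. This is only a sketch: the helper names pxm_ids and missing_pxms are made up for illustration, the sample lines are illustrative rather than captured from the machine in this thread, and the exact wording of SRAT messages varies between kernel versions, so the regular expression may need adjusting.

```python
import re

# Matches the proximity-domain id in kernel SRAT messages, e.g. "PXM 2".
# Assumption: the kernel prints the id as a decimal number after "PXM".
PXM_RE = re.compile(r"\bPXM\s+(\d+)\b")


def pxm_ids(dmesg_text):
    """Return the set of PXM ids mentioned on lines containing 'SRAT'."""
    ids = set()
    for line in dmesg_text.splitlines():
        if "SRAT" in line:
            ids.update(int(m) for m in PXM_RE.findall(line))
    return ids


def missing_pxms(dmesg_text, expected):
    """Return the expected PXM ids that never appear in the SRAT lines."""
    return sorted(set(expected) - pxm_ids(dmesg_text))


# Illustrative input only, shaped like the output of "dmesg | grep SRAT".
SAMPLE_BROKEN = """\
SRAT: PXM 1 -> APIC 0x10 -> Node 0
SRAT: PXM 2 -> APIC 0x20 -> Node 1
SRAT: Node 0 PXM 1 [mem 0x00000000-0x7fffffff]
SRAT: Node 1 PXM 2 [mem 0x80000000-0xffffffff]
"""

if __name__ == "__main__":
    # A healthy 4-node host would be expected to report PXM 0 through 3.
    print(missing_pxms(SAMPLE_BROKEN, expected=range(4)))
```

On a real host the text would come from dmesg itself (for example saved to a file on each machine), and the result from the suspect node can then be compared with the result from a known-good one.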