Hi Guy,

After reading more code related to bug 6745357, I found there may be a
better way to fix it. In uts/i86pc/vm/vm_machdep.c, all "mnoderanges"
related logic assumes that entries in the mnoderanges array are arranged
in ascending order of physical memory address, but nothing currently
enforces that assumption. A quick fix is to add logic that guarantees
the ordering. Below is a small patch that does this by keeping
mnoderanges in ascending order when they are created in
mnode_range_setup().

========================================================
diff -r fd335a2c3bc4 usr/src/uts/i86pc/vm/vm_machdep.c
--- a/usr/src/uts/i86pc/vm/vm_machdep.c	Wed Mar 18 00:36:41 2009 +0800
+++ b/usr/src/uts/i86pc/vm/vm_machdep.c	Wed Mar 18 12:17:57 2009 +0800
@@ -1250,10 +1250,26 @@
 mnode_range_setup(mnoderange_t *mnoderanges)
 {
 	int	mnode, mri;
+	int	i, max_mnodes = 0;
+	int	mnodes[MAX_MEM_NODES];
 
 	for (mnode = 0; mnode < max_mem_nodes; mnode++) {
 		if (mem_node_config[mnode].exists == 0)
 			continue;
 
+		for (i = max_mnodes; i > 0; i--) {
+			if (mem_node_config[mnode].physbase >
+			    mem_node_config[mnodes[i - 1]].physbase) {
+				break;
+			} else {
+				mnodes[i] = mnodes[i - 1];
+			}
+		}
+		mnodes[i] = mnode;
+		max_mnodes++;
+	}
+
+	for (i = 0; i < max_mnodes; i++) {
+		mnode = mnodes[i];
 		mri = nranges - 1;
========================================================
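For reference, here is a small stand-alone sketch of the same
insertion-sort idea that can be compiled and run in user space. It is
not the kernel code: the mem_node_config table below is a simplified
stand-in and the physbase values are made up for illustration.

--------------------------------------------------------
/*
 * Stand-alone sketch only, not the kernel code: mem_node_config below
 * is a simplified stand-in and the physbase values are invented.
 */
#include <stdio.h>

#define	MAX_MEM_NODES	4

static struct {
	int		exists;
	unsigned long	physbase;	/* first pfn of the node */
} mem_node_config[MAX_MEM_NODES] = {
	{ 1, 0x80000 },		/* node 0 starts above node 1: the bad layout */
	{ 1, 0x00000 },
	{ 0, 0 },		/* unpopulated node, skipped */
	{ 1, 0x100000 },
};

int
main(void)
{
	int	mnodes[MAX_MEM_NODES];
	int	mnode, i, max_mnodes = 0;

	/* Insertion sort of the existing mnode ids by ascending physbase. */
	for (mnode = 0; mnode < MAX_MEM_NODES; mnode++) {
		if (mem_node_config[mnode].exists == 0)
			continue;
		for (i = max_mnodes; i > 0; i--) {
			if (mem_node_config[mnode].physbase >
			    mem_node_config[mnodes[i - 1]].physbase)
				break;
			mnodes[i] = mnodes[i - 1];	/* shift bigger entry up */
		}
		mnodes[i] = mnode;
		max_mnodes++;
	}

	/* Walk the nodes in physical address order, as the patch does. */
	for (i = 0; i < max_mnodes; i++)
		printf("mnode %d physbase 0x%lx\n", mnodes[i],
		    mem_node_config[mnodes[i]].physbase);

	return (0);
}
--------------------------------------------------------

With the example layout it prints mnode 1, then mnode 0, then mnode 3,
i.e. the node ids ordered by ascending physbase even though the ids
themselves are not in that order, which is what the second loop in the
patch relies on.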
The above patch may work for current platforms, but it still has issues
supporting memory migration and hotplug. To really make things right,
the mem_node_config related logic in vm_machdep.c should be cleaned up.

I delayed sending the patch out by one day to find a machine to verify
it. On my test machine the patch works correctly and solves 6745357 and
related bugs. Any comments?

Guy <> wrote:
> Hello Gerry,
>
> About your former post :
>
>> The patch is still based on the assumption that a memory node with a
>> bigger node id will have a higher memory address. That assumption is
>> true for most current platforms, but things change fast and it may
>> become broken on future platforms.
>
> This patch proposal addresses a problem for a given set of servers.
> It doesn't address an RFE for future evolution of ACPI and/or future
> NUMA architectures.
>
> nevada is broken since build 88 on these platforms. s10 is broken
> since u6 and will stay broken until u8 (at least).
> I do think it's worth fixing this particular problem, even if we know
> a problem could arise on future platforms or ACPI specs. But the
> latter could be addressed by a separate RFE. It's less urgent since
> these cases do not exist yet.
>
>> 1) According to the ACPI spec, there's no guarantee that domain ids
>> will be contiguous starting from 0. On a NUMA platform with
>> unpopulated sockets, there may be domains existing in the SLIT/SRAT
>> but disabled/unused.
>
> I think this situation is already addressed by the current code
> (the "exists" property of the different objects).
>
>> According to my understanding, Gavin's patch should fix a design
>> defect in the x86 lgrp implementation.
>
> The fix referred to by Gavin in this thread doesn't work.
>
> According to Kit Chow in a discussion we had by email with Jonathan
> Chew:
>
>>>> mnode 0 contains a higher physical address range than mnode 1.
>>>> This breaks various assumptions made by software that deals with
>>>> physical memory. Very likely the reason for the panic...
>>>>
>>>> Jonathan, is this ordering problem caused by what you had
>>>> previously described to me (something like srat index info
>>>> starting at 1 instead of 0 and you grabbed the info from index 2
>>>> first because 2%2 = 0)?
>>>
>>> Yes. If possible, I want to confirm what you suspect and make sure
>>> that we really find the root cause, because there seem to be a
>>> bunch of issues associated with 6745357 and none of them seem to
>>> have been root caused (or at least they aren't documented very
>>> well).
>>>
>>> Is there some way to tell, based on where the kernel died and by
>>> examining the relevant data structures, what's going on and to
>>> pinpoint the root cause?
>>>
>> mem_node_config and mnoderanges referenced below have a range of
>> 0x80000-f57f5 in slot 0. This is bad and needs to be addressed first
>> and foremost even if there could be other issues. The one assertion
>> that I saw about the calculation of a pfn not matching its mnode is
>> very very likely because of the ordering problem.
>>
>> Kit
>
> which led me to think that changing the code to support situations
> where mnodes are not in ascending order should be addressed in a
> separate RFE.
>
> Thank you
>
> Best regards
>
> Guy

Liu Jiang (Gerry)
OpenSolaris, OTC, SSG, Intel