Hi Guy,
        After reading more code related to bug 6745357, I found there
may be a better way to fix it.
        In uts/i86pc/vm/vm_machdep.c, all of the "mnoderanges"-related
logic assumes that the entries in the mnoderanges array are arranged in
ascending order of physical memory address, but nothing currently
enforces that assumption. A quick fix is to add logic that guarantees
it. Below is a small patch that keeps mnoderanges in ascending order
when they are created in mnode_range_setup().
 
========================================================
diff -r fd335a2c3bc4 usr/src/uts/i86pc/vm/vm_machdep.c
--- a/usr/src/uts/i86pc/vm/vm_machdep.c Wed Mar 18 00:36:41 2009 +0800
+++ b/usr/src/uts/i86pc/vm/vm_machdep.c Wed Mar 18 12:17:57 2009 +0800
@@ -1250,10 +1250,26 @@
 mnode_range_setup(mnoderange_t *mnoderanges)
 {
        int     mnode, mri;
+       int     i, max_mnodes = 0;
+       int     mnodes[MAX_MEM_NODES];

        for (mnode = 0; mnode < max_mem_nodes; mnode++) {
                if (mem_node_config[mnode].exists == 0)
                        continue;
+               for (i = max_mnodes; i > 0; i--) {
+                       if (mem_node_config[mnode].physbase >
+                           mem_node_config[mnodes[i - 1]].physbase) {
+                               break;
+                       } else {
+                               mnodes[i] = mnodes[i - 1];
+                       }
+               }
+               mnodes[i] = mnode;
+               max_mnodes++;
+       }
+
+       for (i = 0; i < max_mnodes; i++) {
+               mnode = mnodes[i];

                mri = nranges - 1;

========================================================
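
        For illustration only, here is a minimal user-space sketch of the
same insertion-sort idea. The fake_mem_node_config structure, its fields,
and the sample values below are simplified stand-ins for the real kernel
structures, just to show the technique:

========================================================
/*
 * Sketch: sort existing memory nodes by physical base address.
 * fake_mem_node_t is a simplified stand-in for mem_node_config[].
 */
#include <stdio.h>
#include <stdint.h>

#define	MAX_MEM_NODES	8

typedef struct {
	int		exists;		/* node is populated */
	uint64_t	physbase;	/* physical base address (PFN) */
} fake_mem_node_t;

/* Example where node ids are NOT in ascending address order. */
static fake_mem_node_t fake_mem_node_config[MAX_MEM_NODES] = {
	{ 1, 0x80000 },			/* node 0: higher address range */
	{ 1, 0x00000 },			/* node 1: lower address range */
};

int
main(void)
{
	int	mnodes[MAX_MEM_NODES];
	int	max_mnodes = 0;
	int	mnode, i;

	/* Insertion sort of existing nodes by physbase, as in the patch. */
	for (mnode = 0; mnode < MAX_MEM_NODES; mnode++) {
		if (fake_mem_node_config[mnode].exists == 0)
			continue;
		for (i = max_mnodes; i > 0; i--) {
			if (fake_mem_node_config[mnode].physbase >
			    fake_mem_node_config[mnodes[i - 1]].physbase)
				break;
			mnodes[i] = mnodes[i - 1];
		}
		mnodes[i] = mnode;
		max_mnodes++;
	}

	/* Walk the nodes in ascending physical-address order. */
	for (i = 0; i < max_mnodes; i++)
		(void) printf("mnode %d physbase 0x%llx\n", mnodes[i],
		    (unsigned long long)fake_mem_node_config[mnodes[i]].physbase);

	return (0);
}
========================================================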
        The vm_machdep.c patch above may work for current platforms,
but it still has issues supporting memory migration and hotplug. To
really make things right, the mem_node_config-related logic in
vm_machdep.c should be cleaned up.
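
        Until such a cleanup happens, a debug-time check could at least
make the ordering assumption explicit. The sketch below is hypothetical:
fake_mnoderange_t and check_mnoderanges_sorted() are illustrative
stand-ins, not existing kernel code:

========================================================
/*
 * Hypothetical check that range entries are in ascending order of
 * physical address.  fake_mnoderange_t is a simplified stand-in for
 * the kernel's mnoderange_t.
 */
#include <assert.h>
#include <stdint.h>

typedef struct {
	uint64_t	mnr_pfnlo;	/* lowest PFN in this range */
	uint64_t	mnr_pfnhi;	/* highest PFN in this range */
} fake_mnoderange_t;

static void
check_mnoderanges_sorted(const fake_mnoderange_t *ranges, int nranges)
{
	int i;

	for (i = 1; i < nranges; i++) {
		/* Each range must start above where the previous one ends. */
		assert(ranges[i].mnr_pfnlo > ranges[i - 1].mnr_pfnhi);
	}
}

int
main(void)
{
	/* Sample ranges, including the 0x80000-0xf57f5 range seen in the bug. */
	fake_mnoderange_t ranges[] = {
		{ 0x00000, 0x7ffff },
		{ 0x80000, 0xf57f5 },
	};

	check_mnoderanges_sorted(ranges, 2);
	return (0);
}
========================================================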

        I delayed sending out the patch by one day so that I could find
a machine to verify it. On my test machine, the patch works correctly
and solves 6745357 and related bugs.

        Any comments?

Guy <> wrote:
> Hello Gerry,
> 
> About your earlier post:
> 
>> The patch is still based on the assumption that a memory node with a
>> bigger node id will also have a higher memory address. That
>> assumption is true for most current platforms, but things change
>> fast and it may break on future platforms.
> 
> This patch proposal addresses a problem for a given set of servers.
> It is not an RFE for future evolutions of ACPI and/or future NUMA
> architectures.
> 
> nevada has been broken since build 88 on these platforms. s10 has
> been broken since u6 and will stay broken until u8 (at least).
> I do think it's worth fixing this particular problem, even if we know
> a problem could arise with future platforms or ACPI specs. The latter
> could be addressed by a separate RFE; it's less urgent since those
> cases do not exist yet.
> 
>> 1) According to the ACPI spec, there's no guarantee that domain ids
>> will be contiguous starting from 0. On a NUMA platform with
>> unpopulated sockets, there may be domains present in the SLIT/SRAT
>> but disabled/unused.
> 
> I think this situation is already handled by the current code
> (the "exists" property of the various objects).
> 
>> According to my understanding, Gavin's patch should fix a design
>> defect in the x86 lgrp implementation.
> 
> The fix referred to by Gavin in this thread doesn't work.
> 
> According to Kit Chow, in an email discussion we had with Jonathan
> Chew:
> 
>>>> mnode 0 contains a higher physical address range than mnode 1. This
>>>> breaks various assumptions made by software that deals with physical
>>>> memory, and is very likely the reason for the panic...
>>>> 
>>>> Jonathan, is this ordering problem caused by what you had
>>>> previously described to me (something like the SRAT index info
>>>> starting at 1 instead of 0, so you grabbed the info from index 2
>>>> first because 2 % 2 = 0)?
>>> 
>>> Yes.  If possible, I want to confirm what you suspect and make sure
>>> that we really find the root cause because there seems to be a bunch
>>> of issues associated with 6745357 and none of them seem to have been
>>> root caused (or at least they aren't documented very well).
>>> 
>>> Is there some way to tell based on where the kernel died and
>>> examining the relevant data structures to determine what's going on
>>> and pinpoint the root cause? 
>>> 
>> The mem_node_config and mnoderanges referenced below have a range of
>> 0x80000-f57f5 in slot 0. This is bad and needs to be addressed first
>> and foremost, even if there could be other issues. The one assertion
>> I saw, about the calculation of a pfn not matching its mnode, is very
>> likely due to the ordering problem.
>> 
>> Kit
> 
> which leads me to think that changing the code to support situations
> where mnodes are not in ascending order should be addressed in a
> separate RFE.
> 
> Thank you
> 
> Best regards
> 
> Guy

Liu Jiang (Gerry)
OpenSolaris, OTC, SSG, Intel

Attachment: vm_machdep.diff
