Hi Guy,
        I have reviewed your patch for bug 6745357 and feel that it is a
workaround rather than a real fix.
        The patch is still based on the assumption that a memory node with a
bigger node id has a higher memory address range. That assumption is
true for most current platforms, but things change fast and it may be
broken on future platforms. For example, the following cases/factors
may break the assumption:
        1) According to the ACPI spec, there is no guarantee that domain ids
are contiguous starting from 0. On a NUMA platform with an unpopulated
socket, there may be domains present in the SLIT/SRAT that are disabled or
unused (a sketch illustrating this follows the list).
        2) If, in the future, x86 platforms support memory renaming/migration,
the same physical memory address range could be migrated from one domain
to another at run time.
        3) The assumption is introduced by the x86-specific implementation;
the lgrp subsystem itself has no such requirement.
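
For illustration only, here is a minimal user-level sketch (the names are
made up for this example and are not the actual lgrp code) of a
domain-to-node lookup that does not depend on domain ids being dense or
starting from 0:

#include <stdio.h>

#define MAX_NODES       4

/* hypothetical per-node record keyed by the SRAT proximity domain id */
struct node_domain {
        int     exists;
        int     prox_domain;
};

static struct node_domain nodes[MAX_NODES];

/*
 * Resolve a proximity domain to a node slot by searching for its id, so
 * sparse or non-zero-based domain ids (e.g. from an unpopulated socket)
 * still map to the right slot without any modulo or ordering assumption.
 */
static int
domain_to_node(int domain)
{
        int i;

        for (i = 0; i < MAX_NODES; i++) {
                if (nodes[i].exists && nodes[i].prox_domain == domain)
                        return (i);
        }
        return (-1);            /* unknown domain */
}

int
main(void)
{
        /* only domains 1 and 3 are populated (sparse, not starting at 0) */
        nodes[0].exists = 1; nodes[0].prox_domain = 1;
        nodes[1].exists = 1; nodes[1].prox_domain = 3;

        printf("domain 1 -> node %d\n", domain_to_node(1));
        printf("domain 3 -> node %d\n", domain_to_node(3));
        printf("domain 2 -> node %d\n", domain_to_node(2));
        return (0);
}
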
        So it would be better to figure out the root cause of the bug and fix 
it.
According to my understanding, Gavin's patch should fix a design defect
in the x86 lgrp implementation. With Gavin's patch there is an assertion
failure, which may mean there are still other unrevealed
design/implementation defects.
        So how about root-causing it?
        Thanks!

Guy <> wrote:
> Hello
> 
> After working with Jonathan Chew on the subject (by email),
> I discovered that the problem comes from the incorrect initialization
> of the lgrp kernel-internal objects
> when the ACPI SRAT table is used to populate them (this
> applies to my HP ProLiant system with at least 2 Opteron processors).
> 
> On my server, in the ACPI SRAT table, the proximity domain ids start
> at 1 (not 0).
> When this happens, the lgrp_plat_node_memory records are swapped.
> For example:
> 
> lgrp_plat_node_memory[]
>     {
>         start = 0x80000
>         end = 0xfffff
>         exists = 0x1
>         prox_domain = 0x2
>     }
>     {
>         start = 0
>         end = 0x7ffff
>         exists = 0x1
>         prox_domain = 0x1
>     }
> 
> and mem_node_config[]:
>     {
>         exists = 0x1
>         physbase = 0x80000
>         physmax = 0xf57f5
>     }
>     {
>         exists = 0x1
>         physbase = 0
>         physmax = 0x7ffff
>     }
> 
> 
> When the SRAT table is not processed (which was the case before
> snv_88),
> these objects are initialized from PCI registers and give:
> lgrp_plat_node_memory[]
>     {
>         start = 0
>         end = 0x7ffff
>         exists = 0x1
>         prox_domain = 0x1
>     }
>     {
>         start = 0x80000
>         end = 0xfffff
>         exists = 0x1
>         prox_domain = 0x2
>     }
> and:
> mem_node_config[]:
>     {
>         exists = 0x1
>         physbase = 0
>         physmax = 0x7ffff
>     }
>     {
>         exists = 0x1
>         physbase = 0x80000
>         physmax = 0xf57f5
>     }
> With this configuration, the system works properly.
> 
> 
> The problem is that access to this table is done with a hash function,
> which is the proximity domain id modulo the number of nodes in the system.
> On my system, since the proximity domains start at 1:
> 1 % 2 = 1
> 2 % 2 = 0
> So it reverses the order of the nodes in the table.
> 
> I tried the following fix in lgrpplat.c:
> I changed the macro:
> NODE_DOMAIN_HASH(domain, node_cnt)    ((domain) % node_cnt)
> into:
> NODE_DOMAIN_HASH(domain, node_cnt, first_domain)    ((domain -
> first_domain) % node_cnt)
> 
> (full patch in attached file)
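> 
> To make the effect concrete, here is a minimal standalone sketch (the
> _OLD/_NEW macro names are only for this illustration and are not the
> actual lgrpplat.c code) of how the two hash variants map the proximity
> domains on my system (domains 1 and 2, node_cnt = 2):
> 
> #include <stdio.h>
> 
> /* original hash: assumes proximity domains start at 0 */
> #define NODE_DOMAIN_HASH_OLD(domain, node_cnt) \
>         ((domain) % (node_cnt))
> 
> /* modified hash: rebased on the first (lowest) domain id */
> #define NODE_DOMAIN_HASH_NEW(domain, node_cnt, first_domain) \
>         (((domain) - (first_domain)) % (node_cnt))
> 
> int
> main(void)
> {
>         int node_cnt = 2;
>         int first_domain = 1;   /* on my server the domains start at 1 */
>         int domain;
> 
>         for (domain = 1; domain <= 2; domain++) {
>                 printf("domain %d -> old slot %d, new slot %d\n", domain,
>                     NODE_DOMAIN_HASH_OLD(domain, node_cnt),
>                     NODE_DOMAIN_HASH_NEW(domain, node_cnt, first_domain));
>         }
>         return (0);
> }
> 
> which prints:
>         domain 1 -> old slot 1, new slot 0
>         domain 2 -> old slot 0, new slot 1
> i.e. the original hash swaps the two slots, while the adjusted one keeps
> them in the same order as the PCI-register initialization above.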
> 
> With this corrected code, the lgrp objects are initialized in the
> correct way (as when they are initialized from PCI registers).
> 
> I don't think this fix has side effects outside lgrpplat.c, since the
> changes are internal to this file and no interface is modified.
> 
> If you have a system impacted by this bug, can you test this fix? (I
> can provide a bfu binary archive if you need it.)
> 
> Thank you
> 
> Guy

Liu Jiang (Gerry)
OpenSolaris, OTC, SSG, Intel
_______________________________________________
opensolaris-code mailing list
opensolaris-code@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/opensolaris-code
