On Mon, 27 Jul 2020 10:49:08 -0500
Babu Moger <babu.mo...@amd.com> wrote:

> > -----Original Message-----
> > From: Igor Mammedov <imamm...@redhat.com>
> > Sent: Friday, July 24, 2020 12:05 PM
> > To: Moger, Babu <babu.mo...@amd.com>
> > Cc: qemu-devel@nongnu.org; pbonz...@redhat.com; ehabk...@redhat.com;
> > r...@twiddle.net
> > Subject: Re: [PATCH v2 1/3] hw/i386: Initialize topo_ids from
> > CpuInstanceProperties
> > 
> > On Mon, 13 Jul 2020 14:30:29 -0500
> > Babu Moger <babu.mo...@amd.com> wrote:
> >   
> > > > -----Original Message-----
> > > > From: Igor Mammedov <imamm...@redhat.com>
> > > > Sent: Monday, July 13, 2020 12:32 PM
> > > > To: Moger, Babu <babu.mo...@amd.com>
> > > > Cc: pbonz...@redhat.com; r...@twiddle.net; ehabk...@redhat.com; qemu-
> > > > de...@nongnu.org
> > > > Subject: Re: [PATCH v2 1/3] hw/i386: Initialize topo_ids from
> > > > CpuInstanceProperties
> > > >
> > > > On Mon, 13 Jul 2020 11:43:33 -0500
> > > > Babu Moger <babu.mo...@amd.com> wrote:
> > > >  
> > > > > On 7/13/20 11:17 AM, Igor Mammedov wrote:  
> > > > > > On Mon, 13 Jul 2020 10:02:22 -0500 Babu Moger
> > > > > > <babu.mo...@amd.com> wrote:
> > > > > >  
> > > > > >>> -----Original Message-----
> > > > > >>> From: Igor Mammedov <imamm...@redhat.com>
> > > > > >>> Sent: Monday, July 13, 2020 4:08 AM
> > > > > >>> To: Moger, Babu <babu.mo...@amd.com>
> > > > > >>> Cc: pbonz...@redhat.com; r...@twiddle.net; ehabk...@redhat.com;
> > > > > >>> qemu- de...@nongnu.org
> > > > > >>> Subject: Re: [PATCH v2 1/3] hw/i386: Initialize topo_ids from
> > > > > >>> CpuInstanceProperties  
> > > > > > [...]  
> > > > > >>>> +
> > > > > >>>> +/*
> > > > > >>>> + * Initialize topo_ids from CpuInstanceProperties
> > > > > >>>> + * node_id in CpuInstanceProperties (or in CPU device) is a
> > > > > >>>> + * sequential number, but while building the topology we need
> > > > > >>>> + * to separate it for each socket (mod nodes_per_pkg).
> > > > > >>> could you clarify a bit more on why this is necessary?  
> > > > > >>
> > > > > >> If you have two sockets and 4 numa nodes, node_id in
> > > > > >> CpuInstanceProperties will be numbered sequentially as 0, 1, 2, 3.
> > > > > >> But in EPYC topology, it will be 0, 1, 0, 1 (basically node_id mod
> > > > > >> the number of nodes per socket).  
> > > > > >
> > > > > > I'm confused, let's suppose we have 2 EPYC sockets with 2 nodes
> > > > > > per socket, so the APIC id would be composed like:
> > > > > >
> > > > > >  1st socket
> > > > > >    pkg_id(0) | node_id(0)
> > > > > >    pkg_id(0) | node_id(1)
> > > > > >
> > > > > >  2nd socket
> > > > > >    pkg_id(1) | node_id(0)
> > > > > >    pkg_id(1) | node_id(1)
> > > > > >
> > > > > > if that's the case, then EPYC's node_id here doesn't look like a
> > > > > > NUMA node in the sense it's usually used (above config would
> > > > > > have 4 different memory controllers => 4 conventional NUMA nodes).  
> > > > >
> > > > > EPYC model uses a combination of socket id and node id to identify
> > > > > the numa nodes. So, it internally uses all the information.  
> > > >
> > > > well with above values, EPYC's node_id doesn't look like it's
> > > > specifying a machine numa node, but rather a node index within
> > > > single socket. In which case, it doesn't make much sense calling it
> > > > NUMA node_id, it's rather some index within a socket. (it starts
> > > > looking like terminology is all mixed up)
> > > >
> > > > If you have access to a multi-socket EPYC machine, can you dump and
> > > > post here its apic ids, pls?  
> > >
> > > Here is the output from my EPYC machine with 2 sockets and 8 nodes in
> > > total (SMT disabled). The cpus 0-31 are in socket 0 and cpus 32-63 in
> > > socket 1.
> > >
> > > # lscpu
> > > Architecture:        x86_64
> > > CPU op-mode(s):      32-bit, 64-bit
> > > Byte Order:          Little Endian
> > > CPU(s):              64
> > > On-line CPU(s) list: 0-63
> > > Thread(s) per core:  1
> > > Core(s) per socket:  32
> > > Socket(s):           2
> > > NUMA node(s):        8
> > > Vendor ID:           AuthenticAMD
> > > CPU family:          23
> > > Model:               1
> > > Model name:          AMD Eng Sample: 1S1901A4VIHF5_30/19_N
> > > Stepping:            2
> > > CPU MHz:             2379.233
> > > CPU max MHz:         1900.0000
> > > CPU min MHz:         1200.0000
> > > BogoMIPS:            3792.81
> > > Virtualization:      AMD-V
> > > L1d cache:           32K
> > > L1i cache:           64K
> > > L2 cache:            512K
> > > L3 cache:            8192K
> > > NUMA node0 CPU(s):   0-7
> > > NUMA node1 CPU(s):   8-15
> > > NUMA node2 CPU(s):   16-23
> > > NUMA node3 CPU(s):   24-31
> > > NUMA node4 CPU(s):   32-39
> > > NUMA node5 CPU(s):   40-47
> > > NUMA node6 CPU(s):   48-55
> > > NUMA node7 CPU(s):   56-63
> > >
> > > Here is the output of #cpuid  -l 0x8000001e  -r  
> > 
> > 
> > (1)  
> > > You may want to refer to
> > > https://www.amd.com/system/files/TechDocs/54945_3.03_ppr_ZP_B2_pub.zip
> > > (section 2.1.12.2.1.3 ApicId Enumeration Requirements).
> > > Note that this is a general guideline. We tried to generalize it in qemu
> > > as much as possible. It is a bit complex.  
> > 
> > 
> >   
> > > CPU 0:
> > >    0x8000001e 0x00: eax=0x00000000 ebx=0x00000100 ecx=0x00000300
> > > edx=0x00000000  
> > [...]  
> > > CPU 63:
> > >    0x8000001e 0x00: eax=0x0000007e ebx=0x0000011f ecx=0x00000307
> > > edx=0x00000000
> > >  
> > > >  
> > > > >  
> > > > > >
> > > > > > I wonder if linux guest actually uses the node_id encoded in apic
> > > > > > id for configuring/checking numa structures, or if it just uses
> > > > > > whatever the ACPI SRAT table provides.
> > > > > >  
> > > > > >>>> + */
> > > > > >>>> +static inline void x86_init_topo_ids(X86CPUTopoInfo *topo_info,
> > > > > >>>> +                                     CpuInstanceProperties props,
> > > > > >>>> +                                     X86CPUTopoIDs *topo_ids)
> > > > > >>>> +{
> > > > > >>>> +    topo_ids->smt_id = props.has_thread_id ? props.thread_id : 0;
> > > > > >>>> +    topo_ids->core_id = props.has_core_id ? props.core_id : 0;
> > > > > >>>> +    topo_ids->die_id = props.has_die_id ? props.die_id : 0;
> > > > > >>>> +    topo_ids->node_id = props.has_node_id ?
> > > > > >>>> +                        props.node_id %
> > > > > >>>> +                        MAX(topo_info->nodes_per_pkg, 1) : 0;  
> > 
> > It looks like I was wrong pushing the system-wide NUMA node-id into the
> > APIC ID (the chosen naming is a bit confusing). Per [1] mentioned above,
> > EPYC's node-id is:
> > 
> > • ApicId[6]   = Socket ID.
> > • ApicId[5:4] = Node ID.
> > • ApicId[3]   = Logical CCX L3 complex ID
> > • ApicId[2:0] = (SMT) ? {LogicalCoreID[1:0],ThreadId}
> >                       : {1'b0,LogicalCoreID[1:0]}
> > 
> > which can hold only the values 0-3, and is defined as:
> > 
> > "A node, is an integrated circuit device that includes one to 8 cores (one
> > or two Core Complexes)."
> > 
> > The spec also mentions it indirectly as die-id if one looks at
> > CPUID_Fn8000001E_EBX [Core Identifiers] (Core::X86::Cpuid::CoreId) ...
> >   CoreId = ({2'b0, DieId[1:0], LogicalComplexId[0], LogicalThreadId[2:0]}
> >             >> SMT)
> > 
> > and in
> > (2)
> > CPUID_Fn8000001E_ECX [Node Identifiers] (Core::X86::Cpuid::NodeId) ...
> >   {5'b00000,1'b[SOCKET_ID],2'b[DIE_ID]}
> > 
> > The question is why we did not reuse topo_ids->die_id instead of adding
> > the confusing topo_ids->node_id in the first place?  
> 
> Initially, I thought about it. But Intel uses die_id differently than AMD,
> so I did not want to complicate things.
> If we take that route then we need to re-arrange the numa code, as we need
> the numa information to calculate the die id. So, I did not want to mix
> things up.
> 
> > 
> > Also, looking at the APIC ID and SRAT table provided here,
> > CPUID_Fn8000001E_ECX corresponds to the NUMA node id (i.e. what -numa in
> > QEMU is used for) and the Node ID embedded into ApicId[5:4] is basically
> > the die-id.
> > 
> > The difference between the die-id implemented in QEMU and EPYC's die id
> > (topo_ids->node_id) is that the former doesn't require a numa config
> > (maybe it should, but that ship's already sailed) and gets the number of
> > dies from the '-smp dies=X' CLI option, while for EPYC we calculate
> > topo_ids->node_id implicitly from the number of numa nodes and sockets,
> > which implicitly requires that the machine 'must' be configured with
> > -numa options.
> > 
> > Maybe we should drop this implicit calculation along with
> > topo_ids->node_id and reuse '-smp dies=X', plus extra checks for EPYC to
> > ask for -numa if there is more than 1 die. And if we need to be really
> > strict, add checks for the limit of dies/cores within a socket/die
> > according to the spec [2], so the encoded APIC ID and CPUID_8000001E
> > match the spec.
> 
> There will be complications when the user configures both die_id and
> numa_id. It will complicate things further. I will have to look closely at
> the code to see if it is feasible.

it's worth a try.
Conceptually, die_id in intel/amd is the same. Most likely intel has a
dedicated memory controller on each die, so it still should form a NUMA node.
But that aspect probably was ignored while implementing it in QEMU, so the
burden of configuring QEMU right is on the user's shoulders (there are no
checks whatsoever that a cpu belonging to a specific die is on the right NUMA
node).

What AMD has implemented on top of that in CPU hw is to expose the NUMA node
id in CPUID_8000001E. I don't know why it was done, as usually it's the ACPI
tables that describe the relations between nodes, so for the OS this info is
almost useless (I'd guess it's faster to use CPUID instead of fetching a
per-cpu variable, but that's pretty much it from the OS point of view)

> 
> > 
> > 
> >   
> > > > > >>>> +    topo_ids->pkg_id = props.has_socket_id ? props.socket_id : 0;
> > > > > >>>> +}
> > > > > >>>>  /*
> > > > > >>>>   * Make APIC ID for the CPU 'cpu_index'
> > > > > >>>>   *
> > > > > >>>>  
> > > > > >>  
> > > > > >  
> > > > >  
> > >
> > >  
> 

