Hi Eli,

Sorry for top-posting. Just a quick note to say I had a good conversation on Monday about this with Sean Mooney. I think we have some ideas on how to model all of these resources in the new placement/resource providers schema.

Are you at the PTG? If so, would be great to meet up to discuss...

Best,
-jay

On 02/21/2017 05:38 AM, Qiao, Liyong wrote:
Hi folks:



Seeking community input on an initial design for Intel Resource Director
Technology (RDT), in particular Cache Allocation Technology (CAT), in
OpenStack Nova, to protect workloads from co-resident noisy neighbors and
ensure quality of service (QoS).



1. What is Cache Allocation Technology (CAT)?

Intel's RDT (Resource Director Technology) [1] is an umbrella of *hardware*
support to facilitate the monitoring and reservation of shared resources
such as cache, memory and network bandwidth towards obtaining quality of
service. RDT enables fine-grained control of resources, which is
particularly valuable in cloud environments to meet Service Level
Agreements while increasing resource utilization through sharing. CAT is
the part of RDT that concerns itself with reserving a portion of the last
level cache for a process or group of processes, with further fine-grained
control over how much is used for code versus data. Consider a single
processor composed of 4 cores and its cache hierarchy: the L1 cache is
split into instruction and data caches, the L2 cache is next in speed to
L1, and both L1 and L2 are per core. The Last Level Cache (LLC) is shared
among all cores. With CAT on currently available hardware, the LLC can be
partitioned on a per process (virtual machine, container, or normal
application) or process group basis.



Libvirt and OpenStack [2] already support monitoring cache occupancy (CMT),
memory bandwidth usage local to a processor socket (MBM_local) and total
memory bandwidth usage across all processor sockets (MBM_total) for a
process or process group.
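
For reference, a minimal sketch of reading those existing counters through
the libvirt Python bindings (assumptions: the guest's domain XML already
enables the cmt, mbml and mbmt <perf> events, and the domain name
'demo-guest' is only an example):

# Minimal sketch: read the existing CMT/MBM counters via libvirt's
# bulk-stats API. Assumes the domain 'demo-guest' (hypothetical name)
# already has the cmt/mbml/mbmt <perf> events enabled in its XML.
import libvirt

conn = libvirt.openReadOnly('qemu:///system')
dom = conn.lookupByName('demo-guest')
stats = conn.domainListGetStats([dom], libvirt.VIR_DOMAIN_STATS_PERF)
for _, record in stats:
    print('cache occupancy (bytes):', record.get('perf.cmt'))
    print('local memory bandwidth (bytes):', record.get('perf.mbml'))
    print('total memory bandwidth (bytes):', record.get('perf.mbmt'))
conn.close()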




2. How CAT works

To learn more about CAT please refer to the Intel Software Developer's
Manual, volume 3B, chapters 17.16 and 17.17 [3]. Linux kernel support is
expected in release 4.10 and is documented at [4].


3. Libvirt Interface


Libvirt support for CAT is underway, with the patch series currently at revision 7.



Interface changes in libvirt:



3.1 The capabilities XML has been extended to reveal cache information



<cache>
  <bank id='0' type='l3' size='56320' unit='KiB' cpus='0-21,44-65'>
    <control min='2816' reserved='2816' unit='KiB' scope='L3'/>
  </bank>
  <bank id='1' type='l3' size='56320' unit='KiB' cpus='22-43,66-87'>
    <control min='2816' reserved='2816' unit='KiB' scope='L3'/>
  </bank>
</cache>



The new `cache` XML element shows that the host has two *banks* of *type*
l3, i.e. Last Level Cache (LLC), one per processor socket. Each bank's
*size* is 56320 KiB, and the *cpus* attribute indicates the physical CPUs
associated with it: '0-21,44-65' for bank 0 and '22-43,66-87' for bank 1.



The *control* tag shows that the bank can be allocated within scope L3,
with a minimum possible allocation of 2816 KiB and 2816 KiB currently
reserved.



If the host has CDP (Code and Data Prioritization) enabled, the L3 cache is
divided into code (L3CODE) and data (L3DATA) partitions.



The control tag will be extended to:

...
 <control min='2816' reserved='2816' unit='KiB' scope='L3CODE'/>
 <control min='2816' reserved='2816' unit='KiB' scope='L3DATA'/>
...



The L3CODE and L3DATA scopes show that cache can be allocated for code and
data usage respectively; the two share the same amount of L3 cache.
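
To illustrate, a minimal sketch of how a client such as Nova could parse
the proposed capabilities output to discover cache banks (the element and
attribute names simply mirror the example above; the schema is still under
review upstream):

# Sketch: parse the proposed <cache> capabilities element to discover
# cache banks. The element/attribute names mirror the example in this mail.
import xml.etree.ElementTree as ET

CAPS_SNIPPET = """
<cache>
  <bank id='0' type='l3' size='56320' unit='KiB' cpus='0-21,44-65'>
    <control min='2816' reserved='2816' unit='KiB' scope='L3'/>
  </bank>
  <bank id='1' type='l3' size='56320' unit='KiB' cpus='22-43,66-87'>
    <control min='2816' reserved='2816' unit='KiB' scope='L3'/>
  </bank>
</cache>
"""

def parse_cache_banks(xml_text):
    banks = []
    for bank in ET.fromstring(xml_text).findall('bank'):
        banks.append({
            'id': int(bank.get('id')),
            'type': bank.get('type'),
            'size_kib': int(bank.get('size')),
            'cpus': bank.get('cpus'),
            'controls': [c.attrib for c in bank.findall('control')],
        })
    return banks

for b in parse_cache_banks(CAPS_SNIPPET):
    print(b)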



3.2 Domain XML extended to include a new cachetune element



<cputune>
  <vcpupin vcpu='0' cpuset='0'/>
  <vcpupin vcpu='1' cpuset='1'/>
  <vcpupin vcpu='2' cpuset='22'/>
  <vcpupin vcpu='3' cpuset='23'/>
  <cachetune id='0' host_id='0' type='l3' size='2816' unit='KiB' vcpus='0,1'/>
  <cachetune id='1' host_id='1' type='l3' size='2816' unit='KiB' vcpus='2,3'/>
  ...
</cputune>



This means the guest will have vcpus 0 and 1 running on the host's socket
0 with 2816 KiB of cache exclusively allocated to them, and vcpus 2 and 3
running on the host's socket 1, also with 2816 KiB of cache exclusively
allocated to them.



Here we need to make sure vcpus 0 and 1 are pinned to pCPUs of socket 0,
per the capabilities entry
 <bank id='0' type='l3' size='56320' unit='KiB' cpus='0-21,44-65'>,
and vcpus 2 and 3 are pinned to pCPUs of socket 1, per
 <bank id='1' type='l3' size='56320' unit='KiB' cpus='22-43,66-87'>.
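
A small sketch of that consistency check, i.e. that every vCPU named in a
cachetune entry is pinned to a pCPU belonging to the corresponding bank's
cpus range (the helper name and data layout are only for illustration):

# Sketch: check that each vCPU in a cachetune entry is pinned to a pCPU
# that belongs to the corresponding bank's 'cpus' range. The data below
# mirrors the examples above.
def expand_cpus(spec):
    """Expand a range string such as '0-21,44-65' into a set of ints."""
    cpus = set()
    for part in spec.split(','):
        lo, _, hi = part.partition('-')
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

bank_cpus = {0: expand_cpus('0-21,44-65'), 1: expand_cpus('22-43,66-87')}
vcpupin = {0: 0, 1: 1, 2: 22, 3: 23}          # vcpu -> pinned pcpu
cachetune = [{'host_id': 0, 'vcpus': [0, 1]},  # mirrors the domain XML
             {'host_id': 1, 'vcpus': [2, 3]}]

for entry in cachetune:
    for vcpu in entry['vcpus']:
        assert vcpupin[vcpu] in bank_cpus[entry['host_id']], (
            'vcpu %d is not pinned to socket %d' % (vcpu, entry['host_id']))
print('cachetune entries consistent with vcpu pinning')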



3.3 Libvirt workflow for CAT



 1. Create the qemu process and get its PIDs.
 2. Define a new resource control domain, also known as a Class of Service
    (CLOS), under /sys/fs/resctrl, and set the desired Cache Bit Mask (CBM)
    specified in the libvirt domain XML, in addition to updating the
    default schemata of the host. A minimal sketch of this sequence follows
    below.
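
A minimal sketch of that resctrl sequence, following the kernel interface
documented in [4] (the group name, PID and cache bit mask values are
illustrative only):

# Sketch of the resctrl sequence from [4]: create a CLOS group, move the
# qemu PID into it, and write a schemata line restricting the L3 CBM.
# Group name, PID and mask values below are illustrative only.
import os

RESCTRL = '/sys/fs/resctrl'
group = os.path.join(RESCTRL, 'instance-00000001')   # hypothetical CLOS name

os.makedirs(group, exist_ok=True)

# Move the qemu process into the new class of service.
with open(os.path.join(group, 'tasks'), 'w') as f:
    f.write('12345')                                  # qemu PID (example)

# Restrict this group to 4 cache ways on cache id 0 and leave cache id 1
# unrestricted (mask widths depend on the host's cbm_mask; examples only).
with open(os.path.join(group, 'schemata'), 'w') as f:
    f.write('L3:0=000f;1=fffff\n')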



4. Proposed Nova Changes



 1. Get host capabilities from libvirt and extend the compute node fields.
 2. Add a new scheduler filter and weigher to help schedule hosts for the
    requested guest (a hypothetical filter skeleton is sketched at the end
    of this section).
 3. Extend the flavor (and image metadata) extra spec fields:



We need to specify NUMA settings for NUMA hosts if we want to enable CAT;
see [5] to learn more about NUMA placement.

In a flavor, we can have:



vcpus=8
mem=4
hw:numa_nodes=2            // number of NUMA nodes to expose to the guest
hw:numa_cpus.0=0,1,2,3,4,5
hw:numa_cpus.1=6,7
hw:numa_mem.0=3072
hw:numa_mem.1=1024

// newly added in this proposal
hw:cache_banks=2           // cache banks to be allocated to the guest (can
                           // be fewer than the number of NUMA nodes)
hw:cache_type.0=l3         // cache bank type: l3, or l3 code + data (l3_c+d)
hw:cache_type.1=l3_c+d
hw:cache_vcpus.0=0,1       // vcpu list on this cache bank, may be empty
hw:cache_vcpus.1=6,7
hw:cache_l3.0=2816         // cache size in KiB
hw:cache_l3_code.1=2816
hw:cache_l3_data.1=2816



Here, the user can be clear about which vcpus will benefit from cache
allocation. A cache bank should work together with a NUMA cell: cache is
allocated on a physical CPU socket, but the cache bank itself is a logical
concept. A cache bank allocates cache for a vcpu list, and each vcpu list
should be grouped within a single NUMA cell.
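
To make that constraint concrete, a small sketch that checks each
hw:cache_vcpus.N list against the guest's hw:numa_cpus.N lists (a
hypothetical validation helper, not part of the proposal itself):

# Sketch: check that every hw:cache_vcpus.N list falls entirely within one
# of the guest's NUMA cells, as required above. Hypothetical helper only.
def parse_cpu_list(spec):
    """'0,1,2' -> {0, 1, 2}; an empty spec means no vCPUs."""
    return {int(c) for c in spec.split(',') if c.strip()} if spec else set()

extra_specs = {
    'hw:numa_cpus.0': '0,1,2,3,4,5', 'hw:numa_cpus.1': '6,7',
    'hw:cache_vcpus.0': '0,1',       'hw:cache_vcpus.1': '6,7',
}

numa_cells = {k.rsplit('.', 1)[1]: parse_cpu_list(v)
              for k, v in extra_specs.items()
              if k.startswith('hw:numa_cpus.')}

for key, value in extra_specs.items():
    if not key.startswith('hw:cache_vcpus.'):
        continue
    vcpus = parse_cpu_list(value)
    if not any(vcpus <= cell for cell in numa_cells.values()):
        raise ValueError('%s=%s spans more than one NUMA cell' % (key, value))
print('cache bank vCPU lists are consistent with the NUMA topology')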



In addition, the <cachetune> element in the libvirt domain XML is modified
accordingly; see 3.2 for details.



This will allocate 2 cache banks from the host's cache banks and associate
vcpus with them.

In the example, the guest will have vcpus 0 and 1 running on socket 0 of
the host with 2816 KiB of cache for exclusive use, and vcpus 6 and 7
running on socket 1 of the host with 2816 KiB of L3 code cache and 2816 KiB
of L3 data cache allocated.
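
The mapping from these extra specs to the <cachetune> entries of section
3.2 could look roughly as follows (a sketch only; the function name and the
assumption that cache bank N maps to host bank N are illustrative):

# Sketch: translate the proposed hw:cache_* extra specs of the example into
# the <cachetune> entries of section 3.2. Function name and the bank-N-to-
# host-bank-N mapping are assumptions for illustration.
def cachetune_entries(extra_specs):
    entries = []
    for bank in range(int(extra_specs.get('hw:cache_banks', 0))):
        vcpus = extra_specs.get('hw:cache_vcpus.%d' % bank, '')
        if extra_specs.get('hw:cache_type.%d' % bank) == 'l3_c+d':
            entries.append({'id': bank, 'type': 'l3code', 'vcpus': vcpus,
                            'size': extra_specs['hw:cache_l3_code.%d' % bank]})
            entries.append({'id': bank, 'type': 'l3data', 'vcpus': vcpus,
                            'size': extra_specs['hw:cache_l3_data.%d' % bank]})
        else:
            entries.append({'id': bank, 'type': 'l3', 'vcpus': vcpus,
                            'size': extra_specs['hw:cache_l3.%d' % bank]})
    return entries

specs = {'hw:cache_banks': '2',
         'hw:cache_type.0': 'l3',      'hw:cache_vcpus.0': '0,1',
         'hw:cache_l3.0': '2816',
         'hw:cache_type.1': 'l3_c+d',  'hw:cache_vcpus.1': '6,7',
         'hw:cache_l3_code.1': '2816', 'hw:cache_l3_data.1': '2816'}

for entry in cachetune_entries(specs):
    print(entry)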



If a NUMA Cell were to contain multiple CPU sockets (this is rare), then
we will adjust NUMA vCPU placement policy, to ensure that vCPUs and the
cache allocated to them are all co-located on the same socket.



  * We can define fewer cache banks than NUMA cells on a multi-NUMA-cell
    node.
  * No cache_vcpus parameter needs to be specified if no reservation is
    desired.



NOTE: the cache allocation for a guest is in isolated/exclusive mode.
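
As an illustration of the scheduler change in step 2 above, a hypothetical
filter skeleton (the class name, the cache_banks field on the host state,
and the way the request is read are all assumptions for illustration):

# Hypothetical skeleton for the new scheduler filter mentioned in step 2.
# The class name, the 'cache_banks' field on HostState and the extra-spec
# key are assumptions for illustration only.
from nova.scheduler import filters


class CacheBankFilter(filters.BaseHostFilter):
    """Reject hosts that cannot satisfy the requested cache allocation."""

    def host_passes(self, host_state, spec_obj):
        extra_specs = spec_obj.flavor.extra_specs
        requested = int(extra_specs.get('hw:cache_banks', 0))
        if not requested:
            return True
        # 'cache_banks' would be a new field reported by the compute node.
        available = getattr(host_state, 'cache_banks', [])
        return len(available) >= requested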



References



[1]
http://www.intel.com/content/www/us/en/architecture-and-technology/resource-director-technology.html

[2] https://blueprints.launchpad.net/nova/+spec/support-perf-event

[3]
http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html

[4]
https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/tree/Documentation/x86/intel_rdt_ui.txt?h=x86/cache


[5]
https://specs.openstack.org/openstack/nova-specs/specs/juno/implemented/virt-driver-numa-placement.html






Best Regards



Eli Qiao (乔立勇), OpenStack Core team, OTC Intel.

--





__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

