[libvirt] RFC for support Intel RDT/CAT in libvirt

Qiao, Liyong Wed, 21 Dec 2016 01:53:05 -0800

Hi folks,

I would like to start a discussion about how to support a new CPU feature in 
libvirt. CAT support is not fully merged into the Linux kernel yet; the target 
release is 4.10, and all patches have been merged into the linux tip branch, so 
there should be no further interface/design changes.


## Background

Intel RDT is a toolkit for resource QoS on CPU resources such as LLC (L3) cache 
and memory bandwidth. These fine-grained resource control features are very 
useful in a cloud environment that runs lots of noisy instances.
Libvirt already supports CMT/MBMT/MBML, but these are for resource usage 
monitoring only. I propose supporting CAT to control a VM’s L3 cache quota.



## CAT interface in kernel

In the kernel, a new resource control interface has been introduced under 
/sys/fs/resctrl. For more information, refer to intel_rdt_ui:
https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/tree/Documentation/x86/intel_rdt_ui.txt?h=x86/cache


The kernel requires a schemata to be provided for the L3 cache before a task is 
added to a new partition. This interface is too detailed for a virtual machine 
user, so I propose to let libvirt manage schemata on the host.
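
As a rough sketch of what "managing schemata on the host" could mean in 
practice, the snippet below lists the filesystem operations for one partition: 
create the group directory, write its schemata, then add each pid to tasks. 
The function names plan_partition and apply_ops are hypothetical (not libvirt 
or kernel APIs), and writing one pid per write is a conservative assumption.

```python
import os


def plan_partition(name, schemata, pids, root="/sys/fs/resctrl"):
    """Return the ordered operations needed to set up one resctrl partition.

    Hypothetical helper: it only plans the steps, it does not touch the host.
    """
    group = os.path.join(root, name)
    ops = [("mkdir", group, None)]
    # The schemata is written before any task is moved into the group.
    ops.append(("write", os.path.join(group, "schemata"), schemata + "\n"))
    # Assumption: add tasks one pid per write.
    for pid in pids:
        ops.append(("write", os.path.join(group, "tasks"), f"{pid}\n"))
    return ops


def apply_ops(ops):
    """Execute planned operations (needs root and a mounted resctrl)."""
    for kind, path, data in ops:
        if kind == "mkdir":
            os.mkdir(path)
        else:
            with open(path, "w") as f:
                f.write(data)


if __name__ == "__main__":
    for op in plan_partition("n-20371", "L3:0=fffff", [4242, 4243]):
        print(op)
```

Separating the plan from its execution keeps the ordering easy to review and 
to test without a CAT-capable host.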




## What will libvirt do?



### Questions:

To enable CAT support in libvirt, we need to think about the following 
questions:




  1.  Only set CAT when a VM has CPU pinning, because L3 cache is a per-CPU-socket 
resource. On a host which has 2 CPU sockets, each CPU socket has its own cache, 
which cannot be shared with the other socket.
  2.  Which cache allocation policy should be used? The options look like:
      a.  A VM has its own dedicated L3 cache and can also share other L3 cache.
      b.  A VM can only use the cache allocated to it.
      c.  There are some pre-defined policies and priorities for a VM, like 
COB [1].
  3.  Should some L3 cache be reserved for the host’s system usage? (related 
to 2)
  4.  What is the unit for L3 cache allocation? (related to 2)

### Proposed Changes

XML domain user interface changes:

Option 1: explicitly specify cache allocation for a VM

1. Work with NUMA node

Some cloud orchestration software uses NUMA + vcpu pinning together, so we can 
enable CAT support on top of the NUMA infrastructure.

Expose how much L3 cache a VM wants to reserve, and require that the L3 cache 
be bound to a specific CPU socket, just like what we did for NUMA nodes.

This is a domain XML example generated by OpenStack Nova for allocating 
LLC (L3 cache) when booting a new VM:

<domain>
…
 <cputune>
   <vcpupin vcpu='0' cpuset='19'/>
   <vcpupin vcpu='1' cpuset='63'/>
   <vcpupin vcpu='2' cpuset='83'/>
   <vcpupin vcpu='3' cpuset='39'/>
   <vcpupin vcpu='4' cpuset='40'/>
   <vcpupin vcpu='5' cpuset='84'/>
   <emulatorpin cpuset='19,39-40,63,83-84'/>
 </cputune>
...
 <cpu mode='host-model'>
   <model fallback='allow'/>
   <topology sockets='3' cores='1' threads='2'/>
   <numa>
     <cell id='0' cpus='0-1' memory='2097152' l3cache='1408' unit='KiB'/>
     <cell id='1' cpus='2-5' memory='4194304' l3cache='5632' unit='KiB'/>
   </numa>
 </cpu>
...
</domain>

Refer to [http://libvirt.org/formatdomain.html#elementsCPUTuning]



So finally we can calculate how much L3 cache needs to be allocated for a VM on 
each CPU socket (cell).
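
To illustrate that calculation, here is a minimal sketch that reads the 
proposed l3cache attribute from each NUMA cell and collects the host CPUs its 
vcpus are pinned to. The l3cache attribute is the proposal from this RFC, not 
an existing libvirt attribute; the helper names are hypothetical, and the CPU 
list parser deliberately ignores the '^' exclusion syntax.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example domain XML above (elided parts left out).
DOMAIN_XML = """
<domain>
 <cputune>
   <vcpupin vcpu='0' cpuset='19'/>
   <vcpupin vcpu='1' cpuset='63'/>
   <vcpupin vcpu='2' cpuset='83'/>
   <vcpupin vcpu='3' cpuset='39'/>
   <vcpupin vcpu='4' cpuset='40'/>
   <vcpupin vcpu='5' cpuset='84'/>
 </cputune>
 <cpu mode='host-model'>
   <numa>
     <cell id='0' cpus='0-1' memory='2097152' l3cache='1408' unit='KiB'/>
     <cell id='1' cpus='2-5' memory='4194304' l3cache='5632' unit='KiB'/>
   </numa>
 </cpu>
</domain>
"""


def parse_cpu_list(spec):
    """Expand a CPU list such as '0-1,4' into a sorted list of ints."""
    cpus = set()
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)


def l3_per_cell(domain_xml):
    """Map guest cell id -> requested L3 (KiB) and pinned host CPUs."""
    root = ET.fromstring(domain_xml)
    pin = {int(p.get("vcpu")): parse_cpu_list(p.get("cpuset"))
           for p in root.iter("vcpupin")}
    result = {}
    for cell in root.iter("cell"):
        vcpus = parse_cpu_list(cell.get("cpus"))
        host_cpus = sorted({c for v in vcpus for c in pin.get(v, [])})
        result[int(cell.get("id"))] = {
            "l3cache_kib": int(cell.get("l3cache")),
            "host_cpus": host_cpus,
        }
    return result


if __name__ == "__main__":
    print(l3_per_cell(DOMAIN_XML))
```

Mapping the resulting host CPUs to a physical socket is left out here; on 
Linux that information is available from the host CPU topology in sysfs.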

2. Work with vcpu pin

Setting the NUMA part aside, the CAT setting should be related to the CPU core 
setting: we can apply a CAT policy if the VM has CPU pinning configured (only 
then will the VM not be scheduled onto another CPU socket).

The CPU socket on which to allocate cache can be calculated just as in 1.

We may need to enable both 1 and 2.

There are several possible policies for cache allocation.

Let’s take some examples:

For an Intel E5 v4 2699 (single socket), there is 55M of L3 cache on the chip. 
The default L3 schemata is L3:0=fffff, which means 20 bits are used to control 
the L3 cache; each bit represents 2.75M, which is the minimal allocation unit 
on this host.
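
The arithmetic can be sketched as follows, assuming the cache size and mask 
length are already known (55M = 56320 KiB and 20 bits on this host). The 
function names are hypothetical; requests are rounded up to whole bits.

```python
def unit_kib(total_kib, cbm_len):
    """Cache covered by one bit of the capacity bitmask, in KiB."""
    return total_kib / cbm_len


def bits_for_request(request_kib, total_kib, cbm_len):
    """Number of mask bits needed to cover request_kib, rounded up."""
    # Integer ceiling division avoids floating-point rounding surprises.
    bits = -(-request_kib * cbm_len // total_kib)
    if not 1 <= bits <= cbm_len:
        raise ValueError("request does not fit in the cache bitmask")
    return bits


def mask_hex(bits, shift, cbm_len):
    """Hex string for `bits` contiguous bits starting at bit `shift`."""
    width = (cbm_len + 3) // 4
    return format(((1 << bits) - 1) << shift, f"0{width}x")


if __name__ == "__main__":
    total, length = 55 * 1024, 20
    print(unit_kib(total, length))                     # 2816.0 KiB = 2.75M
    print(bits_for_request(11 * 1024, total, length))  # 4
    print(mask_hex(4, 16, length))                     # f0000
```

Note that a request smaller than one unit, such as the 1408 KiB cell in the XML 
example, still consumes a full bit (2816 KiB) on this host.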
The allocation policy could be one of 3 policies:

1.      One priority VM:
A priority (important) VM can be allocated a dedicated amount of L3 cache 
(let’s say 2.75 * 4 = 11M), and it can also reach the remaining 44M of cache, 
which is shared with other processes and VMs on the same host.
To do that, we need to create a new ‘Partition’, n-20371:

root@s2600wt:/sys/fs/resctrl# ls
cpus  info  n-20371  schemata  tasks

Inside the n-20371 directory:

root@s2600wt:/sys/fs/resctrl# ls n-20371/
cpus  schemata  tasks

The schemata content will be L3:0=fffff
The tasks content will be the pids of that VM

Along with that, we need to change the default schemata of the system:

root@s2600wt:/sys/fs/resctrl# cat schemata
L3:0=ffff # cannot use the highest 4 bits; only tasks in n-20371 can reach them

In this design, we can only have 1 priority VM.

Let’s change it a bit to have 2 VMs.

The schemata of the 2 VMs could be:

1.      L3:0=ffff0 # cannot use the 4 low bits (11M of L3 cache)
2.      L3:0=0ffff # cannot use the 4 high bits (11M of L3 cache)

The default schemata changes to:

L3:0=0fff0 # default system processes can only use the middle 33M of L3 cache
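
The masks in this policy follow one rule: each VM gets everything except the 
other VMs’ dedicated bits, and the default group gets everything except all 
dedicated bits. A minimal sketch of that rule (hypothetical function names, 
cache id 0 hard-coded as in the examples above):

```python
def is_contiguous(mask):
    """True if the set bits of mask form a single unbroken run."""
    if mask == 0:
        return False
    filled = mask + (mask & -mask)   # adding the lowest set bit clears one run
    return filled & (filled - 1) == 0


def shared_schemata(cbm_len, dedicated):
    """dedicated: dict of VM name -> bitmask reserved for that VM only."""
    full = (1 << cbm_len) - 1
    width = (cbm_len + 3) // 4
    reserved = 0
    for mask in dedicated.values():
        if mask & reserved:
            raise ValueError("dedicated regions overlap")
        reserved |= mask

    def line(mask):
        if not is_contiguous(mask):
            raise ValueError("resulting mask is not contiguous")
        return "L3:0=" + format(mask, f"0{width}x")

    # A VM may use every bit except those dedicated to other VMs.
    vms = {name: line(full & ~(reserved & ~mask))
           for name, mask in dedicated.items()}
    default = line(full & ~reserved)
    return vms, default


if __name__ == "__main__":
    print(shared_schemata(20, {"vm1": 0xF0000, "vm2": 0x0000F}))
```

The contiguity check matters because CAT bitmasks must be a contiguous run of 
bits; with dedicated regions only at the two ends of the mask, this scheme 
cannot give a third VM its own dedicated region.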

2.      Isolated L3 cache, dedicated allocation for each VM (if required)

A VM can only use the cache allocated to it.

For example:
VM 1 requires 11M
Its schemata will be L3:0=f0000
VM 2 requires 11M
Its schemata will be L3:0=f000

And the default schemata will be L3:0=fff

In this case, we can create multiple VMs, each of which has its own dedicated 
L3 cache.
The disadvantage is that the allocated cache may not be shared efficiently.
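
A minimal sketch of this isolated policy, assuming regions are carved from the 
high end of the mask and whatever is left becomes the default schemata. The 
function name is hypothetical; masks are zero-padded, so 0f000 and 00fff are 
the same values as f000 and fff above.

```python
def isolated_schemata(cbm_len, requests):
    """requests: list of (vm_name, bits). Returns (vm masks, default mask)."""
    width = (cbm_len + 3) // 4
    top = cbm_len            # bits at positions >= top are already handed out
    masks = {}
    for name, bits in requests:
        # Always leave at least one bit for the default group.
        if bits < 1 or bits > top - 1:
            raise ValueError(f"cannot allocate {bits} bits for {name}")
        top -= bits
        masks[name] = format(((1 << bits) - 1) << top, f"0{width}x")
    default = format((1 << top) - 1, f"0{width}x")
    return masks, default


if __name__ == "__main__":
    # Two VMs asking for 11M each, i.e. 4 bits of 2.75M on this host.
    print(isolated_schemata(20, [("vm1", 4), ("vm2", 4)]))
```

Because every region is a single contiguous run and none overlap, the number 
of VMs is limited only by the mask length and by how much cache is kept for 
the default group.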

3.      Isolated L3 cache, shared allocation for each VM (if required by user)

In this case, we put some VMs (those considered noisy neighbors) into a 
‘Partition’ and restrict them to using only the cache allocated to that 
partition. By doing this, other higher-priority VMs can be ensured enough L3 
cache.

Then we need to decide how much cache the noisy group should have, and put all 
of their pids in that partition’s tasks file.



Option 2: set cache priority and apply policies

Don’t specify a cache amount at all; only define a cache usage priority when 
defining a VM domain XML.

Cache priority decides how much L3 cache the VM can use on a host; it is not a 
quantized value, so the user doesn’t need to think about how much cache the VM 
should have when defining a domain XML.

Libvirt will decide the cache allocation based on the priority defined for the 
VM and the policies in use.

The disadvantage is that cache capacity may differ between hosts, so the same 
VM domain XML may get a different cache allocation on a different host.

## Support CAT in libvirt itself or leverage other software

COB

COB is the Intel Cache Orchestrator Balancer. Please refer to 
http://clrgitlab.intel.com/rdt/cob/tree/master

COB supports some pre-defined policies; it monitors cpu/cache/cache misses and 
does cache allocation based on the policy in use.

If COB supports monitoring specified processes (VM processes) and accepts a 
defined priority, it would be good to reuse it.




In the end, the question is which of these to support:

  *   Fine-grained LLC control, letting the user specify the cache allocation
  *   Pre-defined policies, with the user specifying an LLC allocation priority

References

[1] COB: http://clrgitlab.intel.com/rdt/cob/tree/master
[2] CAT intro: 
https://software.intel.com/en-us/articles/software-enabling-for-cache-allocation-technology
[3] Kernel intel_rdt_ui: 
https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/tree/Documentation/x86/intel_rdt_ui.txt?h=x86/cache




Best Regards

Eli Qiao(乔立勇）OpenStack Core team OTC Intel.
