On 06/27/2017 01:44 AM, Sahid Orentino Ferdjaoui wrote:
On Mon, Jun 26, 2017 at 10:19:12AM +0200, Henning Schild wrote:
Am Sun, 25 Jun 2017 10:09:10 +0200
schrieb Sahid Orentino Ferdjaoui <sferd...@redhat.com>:

On Fri, Jun 23, 2017 at 10:34:26AM -0600, Chris Friesen wrote:
On 06/23/2017 09:35 AM, Henning Schild wrote:
Am Fri, 23 Jun 2017 11:11:10 +0200
schrieb Sahid Orentino Ferdjaoui <sferd...@redhat.com>:

In the Linux RT context, and as you mentioned, the non-RT vCPU can
acquire some guest kernel lock, then be pre-empted by the emulator
thread while holding this lock. This situation blocks the RT vCPUs
from doing their work. That is why we have implemented [2].
For DPDK I don't think we have such problems because it's
running in userland.

So in the DPDK context I think we could have a mask like we have
for RT, and basically have vCPU0 handle the best-effort work
(emulator threads, SSH...). I think that is the current
pattern used by DPDK users.

DPDK is just a library, and one can imagine an application that has
cross-core communication/synchronisation needs where the emulator
slowing down vCPU0 will also slow down vCPU1. Your DPDK application
would have to know which of its cores did not get a full pCPU.

I am not sure what the DPDK example is doing in this discussion;
would that not just be cpu_policy=dedicated? I guess the normal
behaviour of dedicated is that emulators and I/O happily share
pCPUs with vCPUs, and you are looking for a way to restrict
emulators/I/O to a subset of pCPUs because you can live with some
of them not running at 100%.

Yes.  A typical DPDK-using VM might look something like this:

vCPU0: non-realtime, housekeeping and I/O, handles all virtual
interrupts and "normal" linux stuff, emulator runs on same pCPU
vCPU1: realtime, runs in tight loop in userspace processing packets
vCPU2: realtime, runs in tight loop in userspace processing packets
vCPU3: realtime, runs in tight loop in userspace processing packets

In this context, vCPUs 1-3 don't really ever enter the kernel, and
we've offloaded as much kernel work as possible from them onto
vCPU0.  This works pretty well with the current system.
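
As a rough sketch, and assuming only the flavor options that already
exist today (not any new proposal), that layout corresponds to extra
specs along the lines of:

    hw:cpu_policy=dedicated
    hw:cpu_realtime=yes
    hw:cpu_realtime_mask=^0   # vCPU0 stays non-realtime for housekeeping/I/O

where, as far as I understand the current realtime code, the emulator
threads end up sharing the pCPU of the non-realtime vCPU0.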

For RT we have to isolate the emulator threads to an additional
pCPU per guest or, as you are suggesting, to a set of pCPUs for
all the running guests.

I think we should introduce a new option:

    - hw:cpu_emulator_threads_mask=^1

If set in nova.conf, that mask will be applied to the set of all
host CPUs (vcpu_pin_set) to basically pack the emulator threads
of all VMs running there (useful in the RT context).
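
Just to illustrate one possible reading of that (the nova.conf option
name and semantics below are hypothetical, nothing like this exists
yet), an RT compute node might end up with:

    vcpu_pin_set = 2-7
    cpu_emulator_threads_mask = ^2

meaning pCPU 2 is subtracted from vCPU placement and instead hosts the
packed emulator/IO threads of all RT guests on that node, leaving 3-7
for realtime vCPUs.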

That would allow modelling exactly what we need.
In nova.conf we are talking about absolute, known values; there is no
need for a mask, and a set is much easier to read. Also, using the
same name does not sound like a good idea.
And the name vcpu_pin_set clearly suggests what kind of load runs
there; if it takes a mask, it should be called pin_set.

I agree with Henning.

In nova.conf we should just use a set, something like
"rt_emulator_vcpu_pin_set" which would be used for running the
emulator/io threads of *only* realtime instances.

I don't agree with you: we have a set of pCPUs and we want to
subtract some of them for the emulator threads. We need a mask. The
only set we need is the one that selects which pCPUs Nova can use
(vcpu_pin_set).

At that point it does not really matter whether it is a set or a mask.
They can both express the same thing, and a set is easier to
read/configure. With the same argument you could say that vcpu_pin_set
should be a mask over the host's pCPUs.
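
For what it's worth, if I remember the option syntax right, today's
vcpu_pin_set already accepts both spellings; on an 8-core host these
two should be equivalent:

    vcpu_pin_set = 2-7
    vcpu_pin_set = 0-7,^0,^1

so the set-vs-mask question is about readability rather than
expressiveness.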

As I said before: vcpu_pin_set should be renamed because all sorts of
threads are put there (pcpu_pin_set?). But that would be a bigger change
and should be discussed as a separate issue.

So far we have talked about a compute node used for realtime only doing
realtime. In that case vcpu_pin_set + emulator_io_mask would work. If
you want to run regular VMs on the same host, you can run a second
nova, like we do, as sketched below.
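
As a sketch of that split (the values are made up, the point is just
the disjoint pin sets), you end up with two nova-compute services on
the host, each with its own config:

    # nova.conf of the RT compute service
    vcpu_pin_set = 4-15

    # nova.conf of the regular compute service
    vcpu_pin_set = 2-3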

We could also use vcpu_pin_set + rt_vcpu_pin_set(/mask). I think that
would allow modelling all cases in just one nova. With everything in
one nova, you could potentially repurpose RT CPUs for best-effort work
and back. Some day in the future ...
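
Again purely as a sketch with made-up names and values, that could
look like:

    vcpu_pin_set = 2-15      # everything Nova may use
    rt_vcpu_pin_set = 4-15   # subset reserved for realtime vCPUs

with 2-3 left for emulator/IO threads and best-effort guests.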

That is not something we should allow, or at least advertise. A
compute node can't run both RT and non-RT guests, because the node
should run an RT kernel. We can't guarantee RT if both are on the
same node.

A compute node with an RT OS could run RT and non-RT guests at the same time just fine. In a small cloud (think hyperconverged with maybe two nodes total) it's not viable to dedicate an entire node to just RT loads.

I'd personally rather see nova able to handle a mix of RT and non-RT than need to run multiple nova instances on the same node and figure out an up-front split of resources between RT nova and non-RT nova. Better to allow nova to dynamically allocate resources as needed.

Chris
