Re: [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE

Anup Patel Wed, 09 Oct 2013 08:18:36 -0700

On Wed, Oct 9, 2013 at 8:40 PM, Anup Patel <[email protected]> wrote:
> On Wed, Oct 9, 2013 at 8:29 PM, Marc Zyngier <[email protected]> wrote:
>> On 09/10/13 15:50, Anup Patel wrote:
>>> On Wed, Oct 9, 2013 at 7:48 PM, Marc Zyngier <[email protected]> wrote:
>>>> On 09/10/13 14:26, Gleb Natapov wrote:
>>>>> On Wed, Oct 09, 2013 at 03:09:54PM +0200, Alexander Graf wrote:
>>>>>>
>>>>>> On 07.10.2013, at 18:53, Gleb Natapov <[email protected]> wrote:
>>>>>>
>>>>>>> On Mon, Oct 07, 2013 at 06:30:04PM +0200, Alexander Graf wrote:
>>>>>>>>
>>>>>>>> On 07.10.2013, at 18:16, Marc Zyngier <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> On 07/10/13 17:04, Alexander Graf wrote:
>>>>>>>>>>
>>>>>>>>>> On 07.10.2013, at 17:40, Marc Zyngier <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> On an (even slightly) oversubscribed system, spinlocks are quickly
>>>>>>>>>>> becoming a bottleneck, as some vcpus are spinning, waiting for a
>>>>>>>>>>> lock to be released, while the vcpu holding the lock may not be
>>>>>>>>>>> running at all.
>>>>>>>>>>>
>>>>>>>>>>> This creates contention, and the observed slowdown is 40x for
>>>>>>>>>>> hackbench. No, this isn't a typo.
>>>>>>>>>>>
>>>>>>>>>>> The solution is to trap blocking WFEs and tell KVM that we're now
>>>>>>>>>>> spinning. This ensures that other vcpus will get a scheduling boost,
>>>>>>>>>>> allowing the lock to be released more quickly.
>>>>>>>>>>>
>>>>>>>>>>> From a performance point of view: hackbench 1 process 1000
>>>>>>>>>>>
>>>>>>>>>>> 2xA15 host (baseline):  1.843s
>>>>>>>>>>>
>>>>>>>>>>> 2xA15 guest w/o patch:  2.083s
>>>>>>>>>>> 4xA15 guest w/o patch:  80.212s
>>>>>>>>>>>
>>>>>>>>>>> 2xA15 guest w/ patch:   2.072s
>>>>>>>>>>> 4xA15 guest w/ patch:   3.202s
>>>>>>>>>>
>>>>>>>>>> I'm confused. You got from 2.083s when not exiting on spin locks to
>>>>>>>>>> 2.072 when exiting on _every_ spin lock that didn't immediately
>>>>>>>>>> succeed. I would've expected the second number to be worse rather than
>>>>>>>>>> better. I assume it's within jitter, I'm still puzzled why you don't
>>>>>>>>>> see any significant drop in performance.
>>>>>>>>>
>>>>>>>>> The key is in the ARM ARM:
>>>>>>>>>
>>>>>>>>> B1.14.9: "When HCR.TWE is set to 1, and the processor is in a Non-secure
>>>>>>>>> mode other than Hyp mode, execution of a WFE instruction generates a Hyp
>>>>>>>>> Trap exception if, ignoring the value of the HCR.TWE bit, conditions
>>>>>>>>> permit the processor to suspend execution."
>>>>>>>>>
>>>>>>>>> So, on a non-overcommitted system, you rarely hit a blocking spinlock,
>>>>>>>>> hence not trapping. Otherwise, performance would go down the drain very
>>>>>>>>> quickly.
>>>>>>>>
>>>>>>>> Well, it's the same as pause/loop exiting on x86, but there we have 
>>>>>>>> special hardware features to only ever exit after n number of 
>>>>>>>> turnarounds. I wonder why we have those when we could just as easily 
>>>>>>>> exit on every blocking path.
>>>>>>>>
>>>>>>> It will hurt performance if vcpu that holds the lock is running.
>>>>>>
>>>>>> Apparently not so on ARM. At least that's what Marc's numbers are 
>>>>>> showing. I'm not sure what exactly that means. Basically his logic is 
>>>>>> "if we spin, the holder must have been preempted". And it seems to work 
>>>>>> out surprisingly well.
>>>>
>>>> Yes. I basically assume that contention should be rare, and that ending
>>>> up in a *blocking* WFE is a sign that we're in thrashing mode already
>>>> (no event is pending).
>>>>
>>>>>>
>>>>> For uncontended locks it makes sense. We need to recheck whether the
>>>>> x86 assumption is still true there, but the x86 lock is a ticket lock,
>>>>> which has not only the lock holder preemption problem but also the lock
>>>>> waiter preemption problem, which makes the overcommit problem even worse.
>>>>
>>>> Locks are ticketing on ARM as well. But there is one key difference here
>>>> with x86 (or at least what I understand of it, which is very close to
>>>> none): We only trap if we would have blocked anyway. In our case, it is
>>>> almost always better to give up the CPU to someone else rather than
>>>> waiting for some event to take the CPU out of sleep.
>>>
>>> Benefits of "Yield CPU when vcpu executes a WFE" seem to depend on:
>>> 1. How spin lock is implemented in Guest OS? We cannot assume
>>>    that underlying Guest OS is always Linux.
>>> 2. How bad/good is spin
>>
>> We do *not* spin. We *sleep*. So instead of taking a nap on a physical
>> CPU (which is slightly less than useful), we go and run some real
>> workload. If your guest OS is executing WFE (I'm not implying a lock
>> here), *and* that WFE is blocking, then I maintain it will be a gain in
>> the vast majority of the cases.
>
> What if VCPU A was about to release the lock and VCPU B tries to grab the
> same lock? In this case VCPU B gets yielded due to the WFE, causing an
> unnecessary delay for VCPU B in acquiring the lock. This situation can
> happen quite often because spin locks are generally used for protecting
> very small portions of code.


It would be interesting to see what hackbench numbers you get if you
don't restrict all Guest VCPUs to the same Host CPU. Let's say a Guest
with 8 VCPUs running on a Host (with > 2 CPUs).

>
>>
>>> It will be good if we can enable/disable "Yield CPU when vcpu executes a WFE
>>
>> Not until someone has shown me a (real) workload when this is actually
>> detrimental.
>
> The gains from "Yield CPU when vcpu executes a WFE" are not significant,
> and we don't see consistent improvement across multiple runs. Please
> look at the numbers you reported for multiple runs. Given this, it makes
> more sense to have a Kconfig option for this.
>
> --Anup
>
>>
>>         M.
>> --
>> Jazz is not dead. It just smells funny...
>>
