On Wed, Oct 9, 2013 at 8:40 PM, Anup Patel <[email protected]> wrote:
> On Wed, Oct 9, 2013 at 8:29 PM, Marc Zyngier <[email protected]> wrote:
>> On 09/10/13 15:50, Anup Patel wrote:
>>> On Wed, Oct 9, 2013 at 7:48 PM, Marc Zyngier <[email protected]> wrote:
>>>> On 09/10/13 14:26, Gleb Natapov wrote:
>>>>> On Wed, Oct 09, 2013 at 03:09:54PM +0200, Alexander Graf wrote:
>>>>>>
>>>>>> On 07.10.2013, at 18:53, Gleb Natapov <[email protected]> wrote:
>>>>>>
>>>>>>> On Mon, Oct 07, 2013 at 06:30:04PM +0200, Alexander Graf wrote:
>>>>>>>>
>>>>>>>> On 07.10.2013, at 18:16, Marc Zyngier <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> On 07/10/13 17:04, Alexander Graf wrote:
>>>>>>>>>>
>>>>>>>>>> On 07.10.2013, at 17:40, Marc Zyngier <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> On an (even slightly) oversubscribed system, spinlocks are quickly
>>>>>>>>>>> becoming a bottleneck, as some vcpus are spinning, waiting for a
>>>>>>>>>>> lock to be released, while the vcpu holding the lock may not be
>>>>>>>>>>> running at all.
>>>>>>>>>>>
>>>>>>>>>>> This creates contention, and the observed slowdown is 40x for
>>>>>>>>>>> hackbench. No, this isn't a typo.
>>>>>>>>>>>
>>>>>>>>>>> The solution is to trap blocking WFEs and tell KVM that we're now
>>>>>>>>>>> spinning. This ensures that other vpus will get a scheduling boost,
>>>>>>>>>>> allowing the lock to be released more quickly.
>>>>>>>>>>>
>>>>>>>>>>> From a performance point of view: hackbench 1 process 1000
>>>>>>>>>>>
>>>>>>>>>>> 2xA15 host (baseline): 1.843s
>>>>>>>>>>>
>>>>>>>>>>> 2xA15 guest w/o patch: 2.083s
>>>>>>>>>>> 4xA15 guest w/o patch: 80.212s
>>>>>>>>>>>
>>>>>>>>>>> 2xA15 guest w/ patch: 2.072s
>>>>>>>>>>> 4xA15 guest w/ patch: 3.202s
>>>>>>>>>>
>>>>>>>>>> I'm confused. You got from 2.083s when not exiting on spin locks to
>>>>>>>>>> 2.072 when exiting on _every_ spin lock that didn't immediately
>>>>>>>>>> succeed. I would've expected to second number to be worse rather than
>>>>>>>>>> better. I assume it's within jitter, I'm still puzzled why you don't
>>>>>>>>>> see any significant drop in performance.
>>>>>>>>>
>>>>>>>>> The key is in the ARM ARM:
>>>>>>>>>
>>>>>>>>> B1.14.9: "When HCR.TWE is set to 1, and the processor is in a
>>>>>>>>> Non-secure mode other than Hyp mode, execution of a WFE instruction
>>>>>>>>> generates a Hyp Trap exception if, ignoring the value of the HCR.TWE
>>>>>>>>> bit, conditions permit the processor to suspend execution."
>>>>>>>>>
>>>>>>>>> So, on a non-overcommitted system, you rarely hit a blocking spinlock,
>>>>>>>>> hence not trapping. Otherwise, performance would go down the drain
>>>>>>>>> very quickly.
>>>>>>>>
>>>>>>>> Well, it's the same as pause/loop exiting on x86, but there we have
>>>>>>>> special hardware features to only ever exit after n number of
>>>>>>>> turnarounds. I wonder why we have those when we could just as easily
>>>>>>>> exit on every blocking path.
>>>>>>>>
>>>>>>> It will hurt performance if vcpu that holds the lock is running.
>>>>>>
>>>>>> Apparently not so on ARM. At least that's what Marc's numbers are
>>>>>> showing. I'm not sure what exactly that means. Basically his logic is
>>>>>> "if we spin, the holder must have been preempted". And it seems to work
>>>>>> out surprisingly well.
>>>>
>>>> Yes. I basically assume that contention should be rare, and that ending
>>>> up in a *blocking* WFE is a sign that we're in thrashing mode already
>>>> (no event is pending).
>>>>
>>>>>>
>>>>> For not contended locks it make sense. We need to recheck if x86
>>>>> assumption is still true there, but x86 lock is ticketing which
>>>>> has not only lock holder preemption, but also lock waiter
>>>>> preemption problem which make overcommit problem even worse.
>>>>
>>>> Locks are ticketing on ARM as well. But there is one key difference here
>>>> with x86 (or at least what I understand of it, which is very close to
>>>> none): We only trap if we would have blocked anyway. In our case, it is
>>>> almost always better to give up the CPU to someone else rather than
>>>> waiting for some event to take the CPU out of sleep.
>>>
>>> Benefits of "Yield CPU when vcpu executes a WFE" seems to depend on:
>>> 1. How spin lock is implemented in Guest OS?
>>>    we cannot assume
>>>    that underlying Guest OS is always Linux.
>>> 2. How bad/good is spin
>>
>> We do *not* spin. We *sleep*. So instead of taking a nap on a physical
>> CPU (which is slightly less than useful), we go and run some real
>> workload. If your guest OS is executing WFE (I'm not implying a lock
>> here), *and* that WFE is blocking, then I maintain it will be a gain in
>> the vast majority of the cases.
>
> What if VCPU A was about to release lock and VCPU B tries to grab
> same lock. In this case VCPU B gets Yielded due to WFE causing
> unnecessary delay for VCPU B in acquiring lock. This situation can
> happen quite often because spin locks are generally used for protecting
> very small portion of code.

It will be interesting to see what hackbench numbers you get if you don't
restrict all guest VCPUs to the same host CPU. Let's say a guest with 8 VCPUs
running on a host with more than 2 CPUs.

>
>>
>>> It will be good if we can enable/disable "Yield CPU when vcpu executes a WFE
>>
>> Not until someone has shown me a (real) workload when this is actually
>> detrimental.
>
> The gains by "Yield CPU when vcpu executes a WFE" are not-significant
> and we dont have consistent improvement when tried multiple times. Please
> look at number you reported for multiple runs. Due to this fact it makes
> more sense to have Kconfig option for this.
>
> --Anup
>
>>
>> M.
>> --
>> Jazz is not dead. It just smells funny...
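
For reference, the trap condition quoted above from the ARM ARM (B1.14.9) can be modelled roughly as below. This is only a toy sketch in Python, not KVM or kernel code: VcpuModel, WfeOutcome and execute_wfe are made-up names for illustration, and the event register is reduced to a single boolean. It assumes the usual WFE behaviour where a pending event is consumed and the instruction returns immediately.

```python
from dataclasses import dataclass
from enum import Enum


class WfeOutcome(Enum):
    """What happens when a guest vcpu executes WFE (hypothetical model)."""
    CONTINUE = "continue"        # an event was pending, WFE returns at once
    SLEEP = "sleep"              # the core suspends until an event arrives
    TRAP_TO_HYP = "trap_to_hyp"  # Hyp Trap exception, the hypervisor decides


@dataclass
class VcpuModel:
    event_pending: bool  # stands in for the per-CPU event register
    hcr_twe: bool        # stands in for the HCR.TWE bit


def execute_wfe(vcpu: VcpuModel) -> WfeOutcome:
    """Model of the quoted rule: trap only if the WFE would have blocked."""
    if vcpu.event_pending:
        # Conditions do not permit suspension: consume the event, no trap,
        # regardless of HCR.TWE.
        vcpu.event_pending = False
        return WfeOutcome.CONTINUE
    if vcpu.hcr_twe:
        # The WFE would have suspended the core, so it traps instead.
        return WfeOutcome.TRAP_TO_HYP
    return WfeOutcome.SLEEP
```

In this model a vcpu with an event pending never reaches the hypervisor, which matches the point made in the thread that only a *blocking* WFE traps. What the hypervisor does after the trap (yielding, boosting another vcpu) is deliberately left out.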
