Re: [PATCH 14/23] userfaultfd: wake pending userfaults
On Thu, May 14, 2015 at 07:31:11PM +0200, Andrea Arcangeli wrote:
> @@ -255,21 +259,23 @@ int handle_userfault(struct vm_area_struct *vma,
> 				unsigned long address,
>  	 * through poll/read().
>  	 */
>  	__add_wait_queue(&ctx->fault_wqh, &uwq.wq);
> -	for (;;) {
> -		set_current_state(TASK_KILLABLE);
> -		if (!uwq.pending || ACCESS_ONCE(ctx->released) ||
> -		    fatal_signal_pending(current))
> -			break;
> -		spin_unlock(&ctx->fault_wqh.lock);
> +	set_current_state(TASK_KILLABLE);
> +	spin_unlock(&ctx->fault_wqh.lock);
>
> +	if (likely(!ACCESS_ONCE(ctx->released) &&
> +		   !fatal_signal_pending(current))) {
>  		wake_up_poll(&ctx->fd_wqh, POLLIN);
>  		schedule();
> +		ret |= VM_FAULT_MAJOR;
> +	}

So what happens here if schedule() spontaneously wakes for no reason?

I'm not sure enough of userfaultfd semantics to say if that would be
bad, but the code looks suspiciously like it relies on schedule() not
to do that; which is wrong.

> +	__set_current_state(TASK_RUNNING);
> +	/* see finish_wait() comment for why list_empty_careful() */
> +	if (!list_empty_careful(&uwq.wq.task_list)) {
>  		spin_lock(&ctx->fault_wqh.lock);
> +		list_del_init(&uwq.wq.task_list);
> +		spin_unlock(&ctx->fault_wqh.lock);
>  	}
> -	__remove_wait_queue(&ctx->fault_wqh, &uwq.wq);
> -	__set_current_state(TASK_RUNNING);
> -	spin_unlock(&ctx->fault_wqh.lock);
>
>  	/*
>  	 * ctx may go away after this if the userfault pseudo fd is
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 14/23] userfaultfd: wake pending userfaults
On Thu, Oct 22, 2015 at 03:20:15PM +0200, Andrea Arcangeli wrote:
> If schedule spontaneously wakes up a task in TASK_KILLABLE state that
> would be a bug in the scheduler in my view. Luckily there doesn't seem
> to be such a bug, or at least we never experienced it.

Well, there will be a wakeup, just not the one you were hoping for.

We have code that does:

	@cond = true;
	get_task_struct(p);
	queue(p)

	/* random wait somewhere */
	for (;;) {
		prepare_to_wait();
		if (@cond)
			break;
		...
		handle_userfault()
		  ...
		  schedule();
	...

	dequeue(p)
	wake_up_process(p) ---> wakeup without userfault wakeup

These races are (extremely) rare, but they do exist. Therefore one
must never assume schedule() will not spuriously wake because of these
things.

Also, see:

  lkml.kernel.org/r/CA+55aFwHkOo+YGWKYROmce1-H_uG3KfEUmCkJUerTj=ojy2...@mail.gmail.com
Re: [PATCH 14/23] userfaultfd: wake pending userfaults
On Thu, Oct 22, 2015 at 04:18:31PM +0200, Andrea Arcangeli wrote:
> The risk of memory corruption is still zero no matter what happens
> here, in the extremely rare case the app will get a SIGBUS or a

That might still upset people, SIGBUS isn't something an app can
really recover from.

> I'm not exactly sure why we allow VM_FAULT_RETRY only once currently
> so I'm tempted to drop FAULT_FLAG_TRIED entirely.

I think to ensure we make forward progress.

> I've no real preference on how to tweak the page fault code to be able
> to return VM_FAULT_RETRY indefinitely and I would aim for the smallest
> change possible, so if you've suggestions now it's good time.

Indefinitely is such a long time, we should try and finish computation
before the computer dies etc. :-) Yes, yes.. I know, extremely
unlikely etc. Still, guarantees are good.

In any case, I'm not really too bothered how you fix it, just figured
I'd let you know.
Re: [PATCH] kvm: fix waitqueue_active without memory barrier in virt/kvm/async_pf.c
On Fri, Oct 09, 2015 at 10:45:32AM +0200, Paolo Bonzini wrote:
> So you need another smp_mb() after prepare_to_wait(). I'm not sure
> if it's needed also for your original tty report, but I think it is
> for https://lkml.org/lkml/2015/10/8/989 ("mei: fix waitqueue_active
> without memory barrier in mei drivers").
>
> I wonder if it makes sense to introduce smp_mb__before_spin_lock()
> and smp_mb__after_spin_unlock(). On x86 the former could be a
> simple compiler barrier, and on s390 both of them could. But that
> should be a separate patch.

Not having actually read or thought about the issue at hand, it's
perfectly valid to pair an smp_mb() with either spin_lock() or
spin_unlock().

IOW. MB <-> {ACQUIRE, RELEASE} is a valid pairing.
Re: [PATCH v2] kvm: fix waitqueue_active without memory barrier in virt/kvm/async_pf.c
On Fri, Oct 09, 2015 at 12:21:55PM +, Kosuke Tatsukawa wrote:
> + * Memory barrier is required here to make sure change to
> + * vcpu->async_pf.done is visible from other CPUs. This memory
> + * barrier pairs with prepare_to_wait's set_current_state()

That is not how memory barriers work; they don't 'make visible'. They
simply ensure order between operations.

	X = Y = 0

	CPU0		CPU1

	[S] X = 1	[S] Y = 1
	     MB		     MB
	[L] y = Y	[L] x = X

	assert(x || y)

Issuing the memory barrier does not mean the store is visible, it
merely means that the load _must_ happen after the store (in the above
scenario).

This gives the guarantee that x and y cannot both be 0: either one
being 0 means the other CPU had not yet executed its load, and must
therefore observe your store.

Nothing more, nothing less. So your comment is misleading at best.
Re: [PATCH v2 1/2] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops
On Mon, Sep 21, 2015 at 09:36:15AM -0700, Linus Torvalds wrote:
> On Mon, Sep 21, 2015 at 1:46 AM, Ingo Molnar wrote:
> >
> > Linus, what's your preference?
>
> So quite frankly, is there any reason we don't just implement
> native_read_msr() as just
>
>	unsigned long long native_read_msr(unsigned int msr)
>	{
>		int err;
>		unsigned long long val;
>
>		val = native_read_msr_safe(msr, &err);
>		WARN_ON_ONCE(err);
>		return val;
>	}
>
> Note: no inline, no nothing. Just put it in arch/x86/lib/msr.c, and be
> done with it. I don't see the downside.
>
> How many msr reads are so critical that the function call
> overhead would matter? Get rid of the inline version of the _safe()
> thing too, and put that thing there too.

There are a few in the perf code, and esp. on cores without a stack
engine the call overhead is noticeable.

Also note that the perf MSRs are generally optimized MSRs and less
slow (we cannot say fast, they're still MSRs) than regular MSRs.
Re: [PATCH] sched: access local runqueue directly in single_task_running
On Fri, Sep 18, 2015 at 01:26:53PM +0200, Paolo Bonzini wrote:
> > +++ b/kernel/sched/core.c
> > @@ -2614,13 +2614,13 @@ unsigned long nr_running(void)
> >
> >  /*
> >   * Check if only the current task is running on the cpu.
> > + *
> > + * Caution: result is subject to time-of-check-to-time-of-use race,
> > + * every caller is responsible to set up additional fences if necessary.
>
> Let's expand it a bit more:
>
>  * Caution: this function does not check that the caller has disabled
>  * preemption, thus the result might have a time-of-check-to-time-of-use
>  * race. The caller is responsible to use this correctly, for example:
>  *
>  * - use it from a non-preemptable section
>  *
>  * - use it from a thread that is bound to a single CPU
>  *
>  * - use it in a loop where each iteration takes very little time
>  *   (e.g. a polling loop)
>
> I'll include it in my pull request.

In which case:

Acked-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Re: single_task_running() vs. preemption warnings (was Re: [PATCH] kvm: fix preemption warnings in kvm_vcpu_block)
On Thu, Sep 17, 2015 at 01:32:55PM -0700, Tim Chen wrote:
> I have no objection to change single_task_running to use
> raw_smp_processor_id. The worker in mcryptd is bound to the cpu so it
> has no migration/preemption issue. So it shouldn't care which
> smp_processor_id version is being used. Yes, please add a comment to
> alert the user of this caveat should you change single_task_running.

We actually have raw_rq() for that, and the whole if thing looks
rather superfluous.

So something like the below, except with a suitable comment on and
tested etc.. ;-)

---
 kernel/sched/core.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6ab415aa15c4..f39c0498e284 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2666,10 +2666,7 @@ unsigned long nr_running(void)
  */
 bool single_task_running(void)
 {
-	if (cpu_rq(smp_processor_id())->nr_running == 1)
-		return true;
-	else
-		return false;
+	return raw_rq()->nr_running == 1;
 }
 EXPORT_SYMBOL(single_task_running);
Re: [PATCH 0/3] x86/paravirt: Fix baremetal paravirt MSR ops
On Wed, Sep 16, 2015 at 04:33:11PM -0700, Andy Lutomirski wrote:
> Setting CONFIG_PARAVIRT=y has an unintended side effect: it silently
> turns all rdmsr and wrmsr operations into the safe variants without
> any checks that the operations actually succeed.
>
> This is IMO awful: it papers over bugs. In particular, KVM guests
> might be unwittingly depending on this behavior because
> CONFIG_KVM_GUEST currently depends on CONFIG_PARAVIRT. I'm not
> aware of any such problems, but applying this series would be a good
> way to shake them out.
>
> Fix it so that the MSR operations work the same on CONFIG_PARAVIRT=n
> and CONFIG_PARAVIRT=y as long as Xen isn't being used. The Xen
> maintainers are welcome to make a similar change on top of this.
>
> Since there's plenty of time before the next merge window, I think
> we should apply and fix anything that breaks.
>
> Doing this is probably a prerequisite to sanely decoupling
> CONFIG_KVM_GUEST and CONFIG_PARAVIRT, which would probably make
> Arjan and the rest of the Clear Containers people happy :)

So I actually like this, although by Ingo's argument it's a tad risky.

But the far greater problem I have with the whole virt thing is that
you cannot use rdmsr_safe() to probe if an MSR exists at all because,
as you told me, these virt thingies return 0 for all 'unknown' MSRs
instead of faulting.
Re: [PATCH 0/3] x86/paravirt: Fix baremetal paravirt MSR ops
On Thu, Sep 17, 2015 at 01:40:30PM +0200, Paolo Bonzini wrote:
> On 17/09/2015 10:58, Peter Zijlstra wrote:
> > But the far greater problem I have with the whole virt thing is that
> > you cannot use rdmsr_safe() to probe if an MSR exists at all because,
> > as you told me, these virt thingies return 0 for all 'unknown' MSRs
> > instead of faulting.
>
> At least for KVM, that behavior is opt-in (the ignore_msrs parameter)
> and no distro that I know enables it by default.

Ah, that would be good news. Andy earlier argued I could not rely on
rdmsr_safe() faulting on unknown MSRs. If practically we can, there's
some code I can simplify :-)
Re: [PATCH 0/3] x86/paravirt: Fix baremetal paravirt MSR ops
On Thu, Sep 17, 2015 at 08:17:18AM -0700, Andy Lutomirski wrote:
> > Ah, that would be good news. Andy earlier argued I could not rely on
> > rdmsr_safe() faulting on unknown MSRs. If practically we can, there's
> > some code I can simplify :-)
>
> I was talking about QEMU TCG, not KVM.

Just for my education, TCG is the thing without _any_ hardware assist?
The thing you fear to use because it cannot boot a kernel this side of
tomorrow etc.. ?
Re: [PATCH] Fixes: 805de8f43c20 (atomic: Replace atomic_{set,clear}_mask() usage)
On Wed, Sep 16, 2015 at 09:13:50AM -0400, Jason J. Herne wrote:
> The offending commit accidentally replaces an atomic_clear with an
> atomic_or instead of an atomic_andnot in kvm_s390_vcpu_request_handled.
> The symptom is that kvm guests on s390 hang on startup.
> This patch simply replaces the incorrect atomic_or with atomic_andnot.
>
> Signed-off-by: Jason J. Herne <jjhe...@linux.vnet.ibm.com>

Urgh, sorry about that.

Acked-by: Peter Zijlstra (Intel) <pet...@infradead.org>
Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm
On Tue, Sep 15, 2015 at 02:05:14PM +0200, Christian Borntraeger wrote:
> Tejun,
>
> commit d59cfc09c32a2ae31f1c3bc2983a0cd79afb3f14 ("sched, cgroup:
> replace signal_struct->group_rwsem with a global percpu_rwsem") causes
> some noticeable hickups when starting several kvm guests (which
> libvirt will move into cgroups - each vcpu thread and each i/o
> thread). When you now start lots of guests in parallel on a bigger
> system (32 CPUs with 2-way smt in my case) the system is so busy that
> systemd runs into several timeouts like "Did not receive a reply.
> Possible causes include: the remote application did not send a reply,
> the message bus security policy blocked the reply, the reply timeout
> expired, or the network connection was broken."
>
> The problem seems to be that the newly used percpu_rwsem does a
> rcu_synchronize_sched_expedited for all write downs/ups.

Can you try:

  git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git dev.2015.09.11ab

Those include Oleg's rework of the percpu rwsem, which should
hopefully improve things somewhat.

But yes, pounding a global lock on a big machine will always suck.
Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu->wq
On Fri, Aug 07, 2015 at 12:57:38PM +0200, Peter Zijlstra wrote:
> > > +void __finish_swait(struct swait_queue_head *q, struct swait_queue *wait)
> >
> > this one has no users the __ suggests that it is locked edition. Maybe
> > it is for the completions…
>
> Yeah, who knows, I certainly do not anymore ;-)

On that, we cannot convert completions to swait. Because swait
wake_all must not happen from IRQ context, and complete_all()
typically is used from just that.
Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu->wq
On Wed, Feb 25, 2015 at 10:02:50PM +0100, Sebastian Andrzej Siewior wrote:
> > +static inline int swait_active(struct swait_queue_head *q)
> > +{
> > +	return !list_empty(&q->task_list);
>
> In RT there was a smp_mb() which you dropped and I assume you had
> reasons for it.

Yeah, RT didn't have a reason for the smp_mb() -- any barrier without
a comment is a bug :-)

Also waitqueue_active(), its counterpart, does not have a barrier
there either. Nor did I see any reason for that mb to be there.

> I assumed that one can perform list_empty_careful() without a lock if
> the items were removed with list_del_init(). But since nothing in -RT
> blew up so far I guess this here is legal, too :)

Nobody will come and arrest us for software bugs -- yet ;-)

> > +/*
> > + * The thing about the wake_up_state() return value; I think we can
> > + * ignore it.
> > + *
> > + * If for some reason it would return 0, that means the previously
> > + * waiting task is already running, so it will observe condition
> > + * true (or has already).
> > + */
> > +void swake_up_locked(struct swait_queue_head *q)
> > +{
> > +	struct swait_queue *curr;
> > +
> > +	list_for_each_entry(curr, &q->task_list, task_list) {
> > +		wake_up_process(curr->task);
>
> okay. So since we limit everything to TASK_NORMAL which has to sleep
> while on the list there is no need to check if we actually woke up
> someone.

Partly that, also that I don't see how that return value is meaningful
in the first place. If it were to return false, the task was/is
already running and it will observe whatever condition we just
satisfied to allow waking things up.

So either way around, we'll get (at least) 1 task running.

> > +		list_del_init(&curr->task_list);
> > +		break;
> > +	}
> > +}
> > +EXPORT_SYMBOL(swake_up_locked);
>
> > +void swake_up(struct swait_queue_head *q)
> > +{
> > +	unsigned long flags;
> > +
> > +	if (!swait_active(q))
> > +		return;
> > +
> > +	raw_spin_lock_irqsave(&q->lock, flags);
> > +	__swake_up_locked(q);
>
> I think this should have been swake_up_locked() instead since
> __swake_up_locked() isn't part of this patch.
> Just a nitpick: later there is __prepare_to_swait() and
> __finish_swait() which have the __ prefix instead of a _locked
> suffix. Not sure what is better for a public API but maybe one way
> would be good.

Yeah, I suppose that's true ;-)

> > +	raw_spin_unlock_irqrestore(&q->lock, flags);
> > +}
> > +EXPORT_SYMBOL(swake_up);
> > +
> > +/*
> > + * Does not allow usage from IRQ disabled, since we must be able to
> > + * release IRQs to guarantee bounded hold time.
> > + */
> > +void swake_up_all(struct swait_queue_head *q)
> > +{
> > +	struct swait_queue *curr, *next;
> > +	LIST_HEAD(tmp);
>
> WARN_ON(irqs_disabled()) ?

Lockdep should already catch that by virtue of using unconditional
_irq spinlock primitives.

> > +	if (!swait_active(q))
> > +		return;
> > +
> > +	raw_spin_lock_irq(&q->lock);
> > +	list_splice_init(&q->task_list, &tmp);
> > +	while (!list_empty(&tmp)) {
> > +		curr = list_first_entry(&tmp, typeof(*curr), task_list);
> > +
> > +		wake_up_state(curr->task, state);
> > +		list_del_init(&curr->task_list);
>
> So because the task may timeout and remove itself from the list at
> anytime you need to hold the lock during wakeup and the removal from
> the list.

Indeed.

> > +
> > +		if (list_empty(&tmp))
> > +			break;
> > +
> > +		raw_spin_unlock_irq(&q->lock);
>
> and you drop the lock after each iteration in case there is an IRQ
> pending or the task that has been just woken up has a higher priority
> than the current task and needs to get on the CPU.
> Not sure if this case matters:
> - _this_ task (wake_all) prio 120
> - first task in queue prio 10, RR
> - second task in queue prio 9, RR

Why complicate things? Better to not assume anything and just do the
simple correct thing.

> The *old* behavior would put the second task before the first task on
> the CPU. The *new* behaviour puts the first task on the CPU after
> dropping the lock. The second task (that has a higher priority but
> nobody knows) has to wait until the first one is done (and anything
> else that might have been woken up in the meantime with a higher prio
> than 120).

Irrelevant, we _must_ drop the lock in order to maintain bounded
behaviour.

> > +		raw_spin_lock_irq(&q->lock);
> > +	}
> > +	raw_spin_unlock_irq(&q->lock);
> > +}
> > +EXPORT_SYMBOL(swake_up_all);
>
> > +void __finish_swait(struct swait_queue_head *q, struct swait_queue *wait)
>
> this one has no users the __ suggests that it is locked edition. Maybe
> it is for the completions…

Yeah, who knows, I certainly do not anymore ;-)
Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu->wq
On Fri, Aug 07, 2015 at 09:41:31AM -0700, Christoph Hellwig wrote:
> On Fri, Aug 07, 2015 at 01:14:15PM +0200, Peter Zijlstra wrote:
> > On that, we cannot convert completions to swait. Because swait
> > wake_all must not happen from IRQ context, and complete_all()
> > typically is used from just that.
>
> If swait queues aren't usable from IRQ context they will be fairly
> useless. What's the problem with making them irq safe?

It's just the swait_wake_all() that is not.

The entire purpose of them was to have something that allows bounded
execution (RT and all).

Since you can have unbounded numbers of tasks waiting on a waitqueue
(well, reality has bounds of course, like total memory available
etc..) a wake_all() can end up being many many wake_up_process()
calls. We've had this be a problem in RT.

So the proposed swait_wake_all() requires being called from task
context, such that it can drop the lock (and IRQ disable) after every
wakeup, and thereby guarantee that higher priority things will not
experience undue latencies.
Re: Should KVM_GUEST stop depending on PARAVIRT?
On Wed, Jul 29, 2015 at 11:24:09AM +0200, Paolo Bonzini wrote:
> On 29/07/2015 11:19, Peter Zijlstra wrote:
> > On Tue, Jul 28, 2015 at 05:23:22PM -0700, Andy Lutomirski wrote:
> > > PeterZ, can we fix this for real instead of relying on
> > > CONFIG_PARAVIRT=y accidentally turning all msr accesses into safe
> > > accesses? We have the CPUID hypervisor bit, but I still don't
> > > fully understand the problem.
> >
> > So a whole bunch of PMU drivers already probe the MSRs at init time
> > using a combination of rdmsrl_safe() and wrmsrl_safe() to see if:
> >
> >  1) its there at all
> >  2) if its there, it 'works' like it ought to
> >
> > See for example: arch/x86/kernel/cpu/perf_event.c:check_hw_exists().
>
> Great. Of course it's even better if you can probe it with CPUID
> (e.g. aperfmperf in intel_pstate.c), then there's no need for the
> _safe variant.

Reliable CPUID enumeration would be nice indeed.

Note that CPUID isn't ideal either; for instance the CPUID HT bit
doesn't indicate if the CPU has hyperthreading enabled, just that it
is able to report extended topology information.

Determining if a CPU has hyperthreading enabled is extremely
difficult, and basically boils down to: init all cpus, see if there
are siblings anywhere. And then try and distinguish between the case
where HT is disabled and someone hot-unplugged all siblings :-)

> > As to the 'bug' at hand, I really don't see the problem. The RAPL
> > driver says there's no valid RAPL domains, this is true, there
> > aren't any on the virtual machine, so what?
>
> Well, people have complained about it because it's KERN_ERR. Do you
> think it is okay to downgrade this (perhaps not even just on VMs) to
> info?

Ah, do people really have nothing better to do? ;-) Seems like a petty
complaint.

Sure, it seems our check_hw_exists() does the same thing:

	printk("%sFailed to access perfctr msr (MSR %x is %Lx)\n",
		boot_cpu_has(X86_FEATURE_HYPERVISOR) ? KERN_INFO : KERN_ERR,
		reg, val_new);

so ERR for real hardware, INFO for hypervisor thingies.
Re: Should KVM_GUEST stop depending on PARAVIRT?
On Tue, Jul 28, 2015 at 05:23:22PM -0700, Andy Lutomirski wrote:
> PeterZ, can we fix this for real instead of relying on
> CONFIG_PARAVIRT=y accidentally turning all msr accesses into safe
> accesses? We have the CPUID hypervisor bit, but I still don't fully
> understand the problem.

So a whole bunch of PMU drivers already probe the MSRs at init time
using a combination of rdmsrl_safe() and wrmsrl_safe() to see if:

 1) its there at all
 2) if its there, it 'works' like it ought to

See for example: arch/x86/kernel/cpu/perf_event.c:check_hw_exists().

I have no problem using the _safe() methods to probe MSRs in init
routines and failing the init of the driver etc.

What I do object to is sprinkling _safe() all over the driver itself
and creating horrid error paths all over (yes, people have proposed
crazy things like that).

Another example is the LBR probing in
arch/x86/kernel/cpu/perf_event_intel.c:intel_pmu_init(). This is the
one people wanted to use _safe() crud for all throughout the driver,
instead of simply disabling the LBR at init time.

As to the 'bug' at hand, I really don't see the problem. The RAPL
driver says there's no valid RAPL domains, this is true, there aren't
any on the virtual machine, so what?
Re: [PATCH v2 3/3] sched/preempt: fix cond_resched_lock() and cond_resched_softirq()
On Wed, Jul 15, 2015 at 03:52:34PM +0300, Konstantin Khlebnikov wrote:
> On 15.07.2015 15:16, Eric Dumazet wrote:
> > On Wed, 2015-07-15 at 12:52 +0300, Konstantin Khlebnikov wrote:
> > > These functions check should_resched() before unlocking
> > > spinlock/bh-enable: preempt_count always non-zero =>
> > > should_resched() always returns false. cond_resched_lock() worked
> > > iff spin_needbreak is set.
> >
> > Interesting, this definitely used to work (linux-3.11)
> >
> > Any idea of which commit broke things ?
>
> Searching... done
>
> This one: bdb43806589096ac4272fe1307e789846ac08d7c in v3.13
>
> before
>
> -static inline int should_resched(void)
> -{
> -	return need_resched() && !(preempt_count() & PREEMPT_ACTIVE);
> -}
>
> after
>
> +static __always_inline bool should_resched(void)
> +{
> +	return unlikely(!*preempt_count_ptr());
> +}

Argh, indeed!

So,

Fixes: bdb438065890 ("sched: Extract the basic add/sub preempt_count modifiers")

Thanks!
Re: [PATCH] signals: Generate warning when flush_signals() is called from non-kthread context
On Sat, May 02, 2015 at 09:27:49AM -0700, Linus Torvalds wrote:
> On Sat, May 2, 2015 at 1:30 AM, NeilBrown <ne...@suse.de> wrote:
> >
> > All the calls in md.c are in a kernel thread so safe, but I'd rather
> > have an explicit "uninterruptible, but no load-average" wait
>
> Hmm. Our task state is a bitmask anyway, so we could probably just add a
>
>	#define __TASK_NOLOAD 16
>
> (and move the EXIT_xyz defines *away* from the list that is actually
> the task state), and teach our load average thing to not count those
> kinds of waits.
>
> Then you could just use TASK_UNINTERRUPTIBLE | __TASK_NOLOAD to make
> processes not count towards the load.
>
> Or - probably preferably - we could really clean things up, and make
> things much more like the bitmask it *should* be, and have explicit
> bits for:
>
>  - SLEEPING/STOPPED/EXITING (why not running?)
>  - LOADAVG (accounted towards load)
>  - WAKESIG (ie interruptible)
>  - WAKEKILL (this we already have)
>
> and just make the rule be that we use __TASK_xyz for the actual
> individual bits, and TASK_xyz for the helper combinations. So then
> we'd have
>
>	#define TASK_UNINTERRUPTIBLE (__TASK_SLEEPING | __TASK_LOADAVG)
>	#define TASK_INTERRUPTIBLE   (__TASK_SLEEPING | __TASK_WAKESIG)
>	#define TASK_KILLABLE        (__TASK_SLEEPING | __TASK_WAKEKILL)
>	#define TASK_NOLOADAVG       (__TASK_SLEEPING)
>
> which is almost certainly how this *should* have been done, but isn't,
> because of historical use.
>
> Cleaning up like that *should* be fairly simple, but I'd be a bit
> nervous about getting all the state comparisons right (we have an
> unholy mix of "check this bit" and "check this whole state", and we'd
> need to make sure we get those cases all right).
>
> Ingo, what do you think? This is mostly a scheduler interface issue..

Hehe, a little something like this:

  https://lkml.org/lkml/2013/11/12/710

Lemme go clean that up and finish it :-)
Re: Non-exiting rdpmc on KVM guests?
On Tue, Apr 21, 2015 at 02:10:53PM -0700, Andy Lutomirski wrote:
> One question is whether we care if we leak unrelated counters to the
> guest. (We already leak them to unrelated user tasks, so this is
> hopefully not a big deal. OTOH, the API is different for guests.)

Good question indeed. I really do not know.

> Another question is whether it's even worth trying to optimize this.
> I think I just ran into a bunch of people who think virt pmu stuff is
> important, but we'll have to see if they follow up with the effort of
> actually doing the work.

My only concern is that they'll not make a mess of things ;-)
Re: Non-exiting rdpmc on KVM guests?
On Mon, Apr 20, 2015 at 05:47:23PM -0700, Andy Lutomirski wrote:
> I just wrote a little perf self-monitoring tool that uses rdpmc to
> count cycles. Performance sucks under KVM (VMX). How hard would it
> be to avoid rdpmc exits in cases where the host and guest pmu
> configurations are compatible as seen by rdpmc?

I'm mostly ignorant of how the PMU counter offsets and such work. I'm
not sure how all the VMX stuff works. How is rdpmc seen? Is it at all
possible to not trap that?

But as it stands there is not even a hint from the emulated pmu events
on which counters they 'ought' to run, so the chance of them landing
the same way the guest thinks they ought to is rather small.

'someone' would have to go look at the event scheduling and see how to
implement it; my first suggestion would be to set hwc->idx to the
desired number and hope x86_schedule_events() takes the fast path.

Of course, the moment there's host events along with the guest events
the whole thing comes tumbling down, but there's nothing much one can
do about that.
Re: Non-exiting rdpmc on KVM guests?
On Tue, Apr 21, 2015 at 06:32:54PM +0200, Paolo Bonzini wrote:
> However, if you take into account that RDPMC can also be used to read
> an inactive counter, and that multiple guests fight for the same host
> counters, it's even harder to ensure that the guest counter indices
> match those on the host.

That doesn't make sense; only a single vcpu task will ever run at any
one time. There cannot be inter-guest counter contention. Only the
host can compete for the same resources.
Re: [GIT PULL] First batch of KVM changes for 4.1
On Fri, Apr 17, 2015 at 10:52:38AM +0200, Peter Zijlstra wrote:
> On Fri, Apr 10, 2015 at 05:01:29PM +0200, Paolo Bonzini wrote:
> >  include/linux/sched.h |  8 +
> >  kernel/sched/core.c   | 15 +
>
> Can you please not puke over the scheduler without Acks from at least
> one maintainer?
>
> I complained about this very thing two years ago:
>
>   http://marc.info/?l=linux-kernel&m=137345253916751
>
> And now it magically re-appears. WTF!

And I really don't understand _why_ you need that extra callback in
the first place.

You already have preempt notifiers; just track if you came in on
another cpu than you went out on and voila!
Re: [GIT PULL] First batch of KVM changes for 4.1
On Fri, Apr 10, 2015 at 05:01:29PM +0200, Paolo Bonzini wrote:
>  include/linux/sched.h |  8 +
>  kernel/sched/core.c   | 15 +

Can you please not puke over the scheduler without Acks from at least
one maintainer?

I complained about this very thing two years ago:

  http://marc.info/?l=linux-kernel&m=137345253916751

And now it magically re-appears. WTF!
Re: [GIT PULL] First batch of KVM changes for 4.1
On Fri, Apr 17, 2015 at 12:38:07PM +0200, Paolo Bonzini wrote:
> On 17/04/2015 12:36, Peter Zijlstra wrote:
> > Now you make everybody pay for your crap, x86-64 paravirt or not. Keep
> > the cost by those who need it.
> >
> > Please take it out, ASAP.
>
> I'll just implement the static key.

Can you first show that:

preempt_out:
	int cpu = smp_processor_id();
	if (vcpu->cpu != cpu)
		vcpu->cpu = cpu;

preempt_in:
	int cpu = smp_processor_id();
	if (unlikely(vcpu->cpu != cpu))
		do_vcpu_migration_callback(cpu);

is actually a measurable performance hit and we actually _need_ the migration callback?

Also, it looks like you already do exactly this for other things; look at:

  kvm_sched_in()
    kvm_arch_vcpu_load()
      if (unlikely(vcpu->cpu != cpu) ... )

So no, I don't believe for one second you need this.
Re: [GIT PULL] First batch of KVM changes for 4.1
On Fri, Apr 17, 2015 at 02:46:57PM +0200, Paolo Bonzini wrote:
> On 17/04/2015 12:55, Peter Zijlstra wrote:
> > Also, it looks like you already do exactly this for other things; look at:
> >
> >   kvm_sched_in()
> >     kvm_arch_vcpu_load()
> >       if (unlikely(vcpu->cpu != cpu) ... )
> >
> > So no, I don't believe for one second you need this.
>
> You're missing that this snippet is running in the host, while this
> patch is concerned with the guest (paravirt).

This doesn't make sense; but it brings us back to where we were last time. There is _0_ justification for this in the patches; that alone is grounds enough to reject it.

Why should the guest task care about the physical cpu of the vcpu; that's a layering fail if ever there was one.

Furthermore, the only thing that migration handler seems to do is increment a variable that is not actually used in that file.

> And frankly, I think the static key is snake oil. The cost of task
> migration in terms of cache misses and TLB misses is in no way
> comparable to the cost of filling in a structure on the stack,
> dereferencing the head of the notifiers list and seeing that it's NULL.

The path this notifier is called from has nothing to do with those costs. And the fact you're inflicting these costs on _everyone_ for a single x86_64-paravirt case is insane.

I've had enough of this; the below goes into sched/urgent and you can come back with sane patches if and when you're ready.

---
 arch/x86/kernel/pvclock.c | 24 ------------------------
 include/linux/sched.h     |  8 --------
 kernel/sched/core.c       | 15 ---------------
 3 files changed, 47 deletions(-)

diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index e5ecd20e72dd..82f116de3835 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -160,27 +160,6 @@ struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu)
 }
 
 #ifdef CONFIG_X86_64
-static int pvclock_task_migrate(struct notifier_block *nb, unsigned long l,
-			void *v)
-{
-	struct task_migration_notifier *mn = v;
-	struct pvclock_vsyscall_time_info *pvti;
-
-	pvti = pvclock_get_vsyscall_user_time_info(mn->from_cpu);
-
-	/* this is NULL when pvclock vsyscall is not initialized */
-	if (unlikely(pvti == NULL))
-		return NOTIFY_DONE;
-
-	pvti->migrate_count++;
-
-	return NOTIFY_DONE;
-}
-
-static struct notifier_block pvclock_migrate = {
-	.notifier_call = pvclock_task_migrate,
-};
-
 /*
  * Initialize the generic pvclock vsyscall state. This will allocate
  * a/some page(s) for the per-vcpu pvclock information, set up a
@@ -202,9 +181,6 @@ int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i,
 			     PAGE_KERNEL_VVAR);
 	}
 
-
-	register_task_migration_notifier(&pvclock_migrate);
-
 	return 0;
 }
 #endif

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3f3308824fa4..0eabab99e126 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -176,14 +176,6 @@ extern void get_iowait_load(unsigned long *nr_waiters, unsigned long *load);
 extern void calc_global_load(unsigned long ticks);
 extern void update_cpu_load_nohz(void);
 
-/* Notifier for when a task gets migrated to a new CPU */
-struct task_migration_notifier {
-	struct task_struct *task;
-	int from_cpu;
-	int to_cpu;
-};
-extern void register_task_migration_notifier(struct notifier_block *n);
-
 extern unsigned long get_parent_ip(unsigned long addr);
 
 extern void dump_cpu_task(int cpu);

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1d0bc4fe266d..dbfc93d40292 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1013,13 +1013,6 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
 		rq_clock_skip_update(rq, true);
 }
 
-static ATOMIC_NOTIFIER_HEAD(task_migration_notifier);
-
-void register_task_migration_notifier(struct notifier_block *n)
-{
-	atomic_notifier_chain_register(&task_migration_notifier, n);
-}
-
 #ifdef CONFIG_SMP
 void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 {
@@ -1050,18 +1043,10 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	trace_sched_migrate_task(p, new_cpu);
 
 	if (task_cpu(p) != new_cpu) {
-		struct task_migration_notifier tmn;
-
 		if (p->sched_class->migrate_task_rq)
 			p->sched_class->migrate_task_rq(p, new_cpu);
 		p->se.nr_migrations++;
 		perf_sw_event_sched(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 0);
-
-		tmn.task = p;
-		tmn.from_cpu = task_cpu(p);
-		tmn.to_cpu = new_cpu;
-
-		atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn);
 	}
 
 	__set_task_cpu(p, new_cpu);
Re: [GIT PULL] First batch of KVM changes for 4.1
On Fri, Apr 17, 2015 at 03:38:58PM +0200, Paolo Bonzini wrote:
> > The path this notifier is called from has nothing to do with those costs.
>
> How not? The task is going to incur those costs, it's not like half a
> dozen extra instruction make any difference. But anyway...

It's attributed to the entity doing the migration, which can be the wakeup path or a softirq. And we very much do care about the wakeup path.

> ... that's a valid objection. Please look at the patch below.

Still a NAK on that; distros have no choice but to enable that CONFIG option because people might want to run KVM. CONFIG options are pointless if they end up being mandatory.
Re: [GIT PULL] First batch of KVM changes for 4.1
On Fri, Apr 17, 2015 at 12:09:49PM +0200, Paolo Bonzini wrote:
> On 17/04/2015 11:17, Peter Zijlstra wrote:
> > On Fri, Apr 17, 2015 at 10:52:38AM +0200, Peter Zijlstra wrote:
> > > On Fri, Apr 10, 2015 at 05:01:29PM +0200, Paolo Bonzini wrote:
> > > >  include/linux/sched.h |  8 +
> > > >  kernel/sched/core.c   | 15 +
> > >
> > > Can you please not puke over the scheduler without Acks from at least
> > > one maintainer?
>
> Sorry, this was done while I was not handling the KVM tree.
>
> At the very least the commit message should have included the original
> hashes of the commit and the revert. This way one could have found the
> original Acks:
>
>     commit 582b336ec2c0f0076f5650a029fcc9abd4a906f7
>     Author: Marcelo Tosatti <mtosa...@redhat.com>
>     Date:   Tue Nov 27 23:28:54 2012 -0200
>
>         sched: add notifier for cross-cpu migrations
>
>         Originally from Jeremy Fitzhardinge.
>
>         Acked-by: Ingo Molnar <mi...@redhat.com>
>         Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>

Still not a good reason to sneak it back in now, after I got it taken out. There was a reason it was removed; prior Acks (esp. 2 year old ones) do not count one whit _NOW_.

Also, Ingo later agreed that it was a mistake:

  http://marc.info/?l=linux-kernel&m=137346715521978&w=2

which is an effective retraction of whatever Ack that was. It was crap code then and it's crap code now.

> > > I complained about this very thing two years ago:
> > >
> > >   http://marc.info/?l=linux-kernel&m=137345253916751
> > >
> > > And now it magically re-appears. WTF!
> >
> > And I really don't understand _why_ you need that extra callback in the
> > first place. You already have preempt notifiers; just track if you came
> > in on another cpu than you went out on, and voila!
>
> Then you pay for _all_ preemptions of _all_ processes in the guest,
> instead of the hopefully rare ones that do a CPU migration.

Now you make everybody pay for your crap, x86-64 paravirt or not. Keep the cost by those who need it.

Please take it out, ASAP.
Re: [PATCH v15 09/15] pvqspinlock: Implement simple paravirt support for the qspinlock
On Thu, Apr 09, 2015 at 05:41:44PM -0400, Waiman Long wrote:
> > > +void __init __pv_init_lock_hash(void)
> > > +{
> > > +	int pv_hash_size = 4 * num_possible_cpus();
> > > +
> > > +	if (pv_hash_size < (1U << LFSR_MIN_BITS))
> > > +		pv_hash_size = (1U << LFSR_MIN_BITS);
> > > +	/*
> > > +	 * Allocate space from bootmem which should be page-size aligned
> > > +	 * and hence cacheline aligned.
> > > +	 */
> > > +	pv_lock_hash = alloc_large_system_hash("PV qspinlock",
> > > +					       sizeof(struct pv_hash_bucket),
> > > +					       pv_hash_size, 0, HASH_EARLY,
> > > +					       &pv_lock_hash_bits, NULL,
> > > +					       pv_hash_size, pv_hash_size);
> >
> > 	pv_taps = lfsr_taps(pv_lock_hash_bits);
>
> I don't understand what you meant here.

Let me explain (even though I propose taking all the LFSR stuff out). pv_lock_hash_bits is a runtime variable, so the compiler cannot compile-time evaluate the forest of if statements required to compute the taps value. Therefore it's best to compute the taps _once_, and this seems like a good place to do so.

> > > +			goto done;
> > > +		}
> > > +	}
> > > +
> > > +	hash = lfsr(hash, pv_lock_hash_bits, 0);
> >
> > Since pv_lock_hash_bits is a variable, you end up running through that
> > massive if() forest to find the corresponding tap every single time. It
> > cannot compile-time optimize it.
>
> The minimum bits size is now 8. So unless the system has more than 64
> vCPUs, it will get the right value in the first if statement.

Still, no reason not to pre-compute the taps value; it's simple enough.
Re: [PATCH v15 09/15] pvqspinlock: Implement simple paravirt support for the qspinlock
On Thu, Apr 09, 2015 at 05:41:44PM -0400, Waiman Long wrote:
> > > +__visible void __pv_queue_spin_unlock(struct qspinlock *lock)
> > > +{
> > > +	struct __qspinlock *l = (void *)lock;
> > > +	struct pv_node *node;
> > > +
> > > +	if (likely(cmpxchg(&l->locked, _Q_LOCKED_VAL, 0) == _Q_LOCKED_VAL))
> > > +		return;
> > > +
> > > +	/*
> > > +	 * The queue head has been halted. Need to locate it and wake it up.
> > > +	 */
> > > +	node = pv_hash_find(lock);
> > > +	smp_store_release(&l->locked, 0);

Ah yes, clever that.

> > > +	/*
> > > +	 * At this point the memory pointed at by lock can be freed/reused,
> > > +	 * however we can still use the PV node to kick the CPU.
> > > +	 */
> > > +	if (READ_ONCE(node->state) == vcpu_halted)
> > > +		pv_kick(node->cpu);
> > > +}
> > > +PV_CALLEE_SAVE_REGS_THUNK(__pv_queue_spin_unlock);
> >
> > However I feel the PV_CALLEE_SAVE_REGS_THUNK thing belongs in the x86
> > code.
>
> That is why I originally put my version of the qspinlock_paravirt.h
> header file under arch/x86/include/asm. Maybe we should move it back
> there. Putting the thunk in arch/x86/kernel/kvm.c didn't work when you
> consider that the Xen code also needs that.

Well, the function is 'generic' and belongs here I think. It's just the PV_CALLEE_SAVE_REGS_THUNK thing that is arch specific. Should it have lived in arch/x86/kernel/paravirt-spinlocks.c instead?
Re: [PATCH v15 09/15] pvqspinlock: Implement simple paravirt support for the qspinlock
On Thu, Apr 09, 2015 at 05:41:44PM -0400, Waiman Long wrote:
> > > +static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
> > > +{
> > > +	struct __qspinlock *l = (void *)lock;
> > > +	struct qspinlock **lp = NULL;
> > > +	struct pv_node *pn = (struct pv_node *)node;
> > > +	int slow_set = false;
> > > +	int loop;
> > > +
> > > +	for (;;) {
> > > +		for (loop = SPIN_THRESHOLD; loop; loop--) {
> > > +			if (!READ_ONCE(l->locked))
> > > +				return;
> > > +
> > > +			cpu_relax();
> > > +		}
> > > +
> > > +		WRITE_ONCE(pn->state, vcpu_halted);
> > > +		if (!lp)
> > > +			lp = pv_hash(lock, pn);
> > > +		/*
> > > +		 * lp must be set before setting _Q_SLOW_VAL
> > > +		 *
> > > +		 * [S] lp = lock                [RmW] l = l->locked = 0
> > > +		 *     MB                             MB
> > > +		 * [S] l->locked = _Q_SLOW_VAL  [L]   lp
> > > +		 *
> > > +		 * Matches the cmpxchg() in pv_queue_spin_unlock().
> > > +		 */
> > > +		if (!slow_set &&
> > > +		    !cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL)) {
> > > +			/*
> > > +			 * The lock is free and _Q_SLOW_VAL has never been
> > > +			 * set. Need to clear the hash bucket before getting
> > > +			 * the lock.
> > > +			 */
> > > +			WRITE_ONCE(*lp, NULL);
> > > +			return;
> > > +		} else if (slow_set && !READ_ONCE(l->locked))
> > > +			return;
> > > +		slow_set = true;
> >
> > I'm somewhat puzzled by the slow_set thing; what is wrong with the thing
> > I had, namely:
> >
> > 	if (!lp) {
> > 		lp = pv_hash(lock, pn);
> >
> > 		/*
> > 		 * comment
> > 		 */
> > 		lv = cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL);
> > 		if (lv != _Q_LOCKED_VAL) {
> > 			/* we're woken, unhash and return */
> > 			WRITE_ONCE(*lp, NULL);
> > 			return;
> > 		}
> > 	}
> >
> > > +
> > > +		pv_wait(&l->locked, _Q_SLOW_VAL);
> >
> > If we get a spurious wakeup (due to device interrupts or random kick)
> > we'll loop around but ->locked will remain _Q_SLOW_VAL.
>
> The purpose of the slow_set flag is not about the lock value. It is to
> make sure that pv_hash_find() will always find a match. Consider the
> following scenario:
>
> cpu1			cpu2			cpu3
> ----			----			----
> pv_wait
> spurious wakeup
> loop l->locked
>
> 			read _Q_SLOW_VAL
> 			pv_hash_find()
> 			unlock
>
> 						pv_hash() <- same entry
>
> cmpxchg(l->locked)
> clear hash (?)
>
> Here, the entry for cpu3 will be removed, leading to panic when
> pv_hash_find() cannot find the entry. So the hash entry can only be
> removed if the other cpu has no chance to see _Q_SLOW_VAL.

Still confused. Afaict that cannot happen with my code. We only do the cmpxchg(.., _Q_SLOW_VAL) _once_.

Only on the first time around that loop will we hash the lock and set the slow flag. And cpu3 cannot come in on the same entry because we've not yet released the lock when we find and unhash.
Re: [PATCH v15 16/16] unfair qspinlock: a queue based unfair lock
On Thu, Apr 09, 2015 at 09:16:24AM -0400, Rik van Riel wrote:
> On 04/09/2015 03:01 AM, Peter Zijlstra wrote:
> > On Wed, Apr 08, 2015 at 02:32:19PM -0400, Waiman Long wrote:
> > > For a virtual guest with the qspinlock patch, a simple unfair byte lock
> > > will be used if PV spinlock is not configured in or the hypervisor
> > > isn't either KVM or Xen. The byte lock works fine with small guest of
> > > just a few vCPUs. On a much larger guest, however, byte lock can have
> > > serious performance problem.
> >
> > Who cares?
>
> There are some people out there running guests with dozens of vCPUs. If
> the code exists to make those setups run better, is there a good reason
> not to use it?

Well, use paravirt; !paravirt stuff sucks performance wise anyhow.

The question really is: is the added complexity worth the maintenance burden. And I'm just not convinced !paravirt virt is a performance critical target.

> Having said that, only KVM and Xen seem to support very large guests,
> and PV spinlock is available there. I believe both VMware and Hyperv
> have a 32 VCPU limit, anyway.

Don't we have Hyperv paravirt drivers? They could add support for paravirt spinlocks too.
Re: [PATCH v15 09/15] pvqspinlock: Implement simple paravirt support for the qspinlock
On Thu, Apr 09, 2015 at 08:13:27PM +0200, Peter Zijlstra wrote:
> On Mon, Apr 06, 2015 at 10:55:44PM -0400, Waiman Long wrote:
> > +#define PV_HB_PER_LINE	(SMP_CACHE_BYTES / sizeof(struct pv_hash_bucket))
> > +static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node)
> > +{
> > +	unsigned long init_hash, hash = hash_ptr(lock, pv_lock_hash_bits);
> > +	struct pv_hash_bucket *hb, *end;
> > +
> > +	if (!hash)
> > +		hash = 1;
> > +
> > +	init_hash = hash;
> > +	hb = &pv_lock_hash[hash_align(hash)];
> > +	for (;;) {
> > +		for (end = hb + PV_HB_PER_LINE; hb < end; hb++) {
> > +			if (!cmpxchg(&hb->lock, NULL, lock)) {
> > +				WRITE_ONCE(hb->node, node);
> > +				/*
> > +				 * We haven't set the _Q_SLOW_VAL yet. So
> > +				 * the order of writing doesn't matter.
> > +				 */
> > +				smp_wmb(); /* matches rmb from pv_hash_find */
> > +				goto done;
> > +			}
> > +		}
> > +
> > +		hash = lfsr(hash, pv_lock_hash_bits, 0);
>
> Since pv_lock_hash_bits is a variable, you end up running through that
> massive if() forest to find the corresponding tap every single time. It
> cannot compile-time optimize it.

Hence:

	hash = lfsr(hash, pv_taps);

(I don't get the bits argument to the lfsr.)

In any case, like I said before, I think we should try a linear probe sequence first; the lfsr was over-engineering from my side.

> > +		hb = &pv_lock_hash[hash_align(hash)];

So one thing this does -- and one of the reasons I figured I should ditch the LFSR instead of fixing it -- is that you end up scanning each bucket HB_PER_LINE times. The 'fix' would be to LFSR on cachelines instead of HBs, but then you're stuck with the 0-th cacheline.

> > +		BUG_ON(hash == init_hash);
> > +	}
> > +
> > +done:
> > +	return &hb->lock;
> > +}
Re: [PATCH v15 09/15] pvqspinlock: Implement simple paravirt support for the qspinlock
On Mon, Apr 06, 2015 at 10:55:44PM -0400, Waiman Long wrote:
> +++ b/kernel/locking/qspinlock_paravirt.h
> @@ -0,0 +1,321 @@
> +#ifndef _GEN_PV_LOCK_SLOWPATH
> +#error "do not include this file"
> +#endif
> +
> +/*
> + * Implement paravirt qspinlocks; the general idea is to halt the vcpus instead
> + * of spinning them.
> + *
> + * This relies on the architecture to provide two paravirt hypercalls:
> + *
> + *   pv_wait(u8 *ptr, u8 val) -- suspends the vcpu if *ptr == val
> + *   pv_kick(cpu)             -- wakes a suspended vcpu
> + *
> + * Using these we implement __pv_queue_spin_lock_slowpath() and
> + * __pv_queue_spin_unlock() to replace native_queue_spin_lock_slowpath() and
> + * native_queue_spin_unlock().
> + */
> +
> +#define _Q_SLOW_VAL	(3U << _Q_LOCKED_OFFSET)
> +
> +enum vcpu_state {
> +	vcpu_running = 0,
> +	vcpu_halted,
> +};
> +
> +struct pv_node {
> +	struct mcs_spinlock	mcs;
> +	struct mcs_spinlock	__res[3];
> +
> +	int			cpu;
> +	u8			state;
> +};
> +
> +/*
> + * Hash table using open addressing with an LFSR probe sequence.
> + *
> + * Since we should not be holding locks from NMI context (very rare indeed) the
> + * max load factor is 0.75, which is around the point where open addressing
> + * breaks down.
> + *
> + * Instead of probing just the immediate bucket we probe all buckets in the
> + * same cacheline.
> + *
> + * http://en.wikipedia.org/wiki/Hash_table#Open_addressing
> + *
> + * Dynamically allocate a hash table big enough to hold at least 4X the
> + * number of possible cpus in the system. Allocation is done on page
> + * granularity. So the minimum number of hash buckets should be at least
> + * 256 to fully utilize a 4k page.
> + */
> +#define LFSR_MIN_BITS	8
> +#define LFSR_MAX_BITS	(2 + NR_CPUS_BITS)
> +#if LFSR_MAX_BITS < LFSR_MIN_BITS
> +#undef  LFSR_MAX_BITS
> +#define LFSR_MAX_BITS	LFSR_MIN_BITS
> +#endif
> +
> +struct pv_hash_bucket {
> +	struct qspinlock *lock;
> +	struct pv_node   *node;
> +};
> +#define PV_HB_PER_LINE	(SMP_CACHE_BYTES / sizeof(struct pv_hash_bucket))
> +#define HB_RESERVED	((struct qspinlock *)1)

This is unused.

> +
> +static struct pv_hash_bucket *pv_lock_hash;
> +static unsigned int pv_lock_hash_bits __read_mostly;

static unsigned int pv_taps __read_mostly;

> +
> +#include <linux/hash.h>
> +#include <linux/lfsr.h>
> +#include <linux/bootmem.h>
> +
> +/*
> + * Allocate memory for the PV qspinlock hash buckets
> + *
> + * This function should be called from the paravirt spinlock initialization
> + * routine.
> + */
> +void __init __pv_init_lock_hash(void)
> +{
> +	int pv_hash_size = 4 * num_possible_cpus();
> +
> +	if (pv_hash_size < (1U << LFSR_MIN_BITS))
> +		pv_hash_size = (1U << LFSR_MIN_BITS);
> +	/*
> +	 * Allocate space from bootmem which should be page-size aligned
> +	 * and hence cacheline aligned.
> +	 */
> +	pv_lock_hash = alloc_large_system_hash("PV qspinlock",
> +					       sizeof(struct pv_hash_bucket),
> +					       pv_hash_size, 0, HASH_EARLY,
> +					       &pv_lock_hash_bits, NULL,
> +					       pv_hash_size, pv_hash_size);

	pv_taps = lfsr_taps(pv_lock_hash_bits);

> +}
> +
> +static inline u32 hash_align(u32 hash)
> +{
> +	return hash & ~(PV_HB_PER_LINE - 1);
> +}
> +
> +static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node)
> +{
> +	unsigned long init_hash, hash = hash_ptr(lock, pv_lock_hash_bits);
> +	struct pv_hash_bucket *hb, *end;
> +
> +	if (!hash)
> +		hash = 1;
> +
> +	init_hash = hash;
> +	hb = &pv_lock_hash[hash_align(hash)];
> +	for (;;) {
> +		for (end = hb + PV_HB_PER_LINE; hb < end; hb++) {
> +			if (!cmpxchg(&hb->lock, NULL, lock)) {
> +				WRITE_ONCE(hb->node, node);
> +				/*
> +				 * We haven't set the _Q_SLOW_VAL yet. So
> +				 * the order of writing doesn't matter.
> +				 */
> +				smp_wmb(); /* matches rmb from pv_hash_find */

This doesn't make sense. Both sites do ->lock first and ->node second. No amount of ordering can 'fix' that.

I think we can safely remove this wmb and the rmb below, because the required ordering is already provided by setting/observing l->locked == SLOW.

> +				goto done;
> +			}
> +		}
> +
> +		hash = lfsr(hash, pv_lock_hash_bits, 0);

Since pv_lock_hash_bits is a variable, you end up running through that massive if() forest to find the corresponding tap every single time. It cannot compile-time optimize it.

Hence:

	hash = lfsr(hash, pv_taps);

(I don't get the bits argument to the lfsr.)

In any case, like I said before, I think we should try a linear probe sequence first; the lfsr was over-engineering from my side.
Re: [PATCH v15 13/15] pvqspinlock: Only kick CPU at unlock time
On Mon, Apr 06, 2015 at 10:55:48PM -0400, Waiman Long wrote:
> @@ -219,24 +236,30 @@ static void pv_wait_node(struct mcs_spinlock *node)
>  }
>  
>  /*
> + * Called after setting next->locked = 1 & lock acquired.
> + * Check if the the CPU has been halted. If so, set the _Q_SLOW_VAL flag
> + * and put an entry into the lock hash table to be waken up at unlock time.
>   */
> -static void pv_kick_node(struct mcs_spinlock *node)
> +static void pv_scan_next(struct qspinlock *lock, struct mcs_spinlock *node)

I'm not too sure about that name change..

>  {
>  	struct pv_node *pn = (struct pv_node *)node;
> +	struct __qspinlock *l = (void *)lock;
>  
>  	/*
> +	 * Transition CPU state: halted => hashed
> +	 * Quit if the transition failed.
>  	 */
> +	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
> +		return;
> +
> +	/*
> +	 * Put the lock into the hash table & set the _Q_SLOW_VAL in the lock.
> +	 * As this is the same CPU that will check the _Q_SLOW_VAL value and
> +	 * the hash table later on at unlock time, no atomic instruction is
> +	 * needed.
> +	 */
> +	WRITE_ONCE(l->locked, _Q_SLOW_VAL);
> +	(void)pv_hash(lock, pn);
>  }

This is broken. The unlock path relies on:

	pv_hash()
	   MB
	l->locked = SLOW

such that when it observes SLOW, it must then also observe a consistent bucket.

The above can have us do pv_hash_find() _before_ we actually hash the lock, which will result in us triggering that BUG_ON() in there.
Re: [PATCH v15 13/15] pvqspinlock: Only kick CPU at unlock time
On Thu, Apr 09, 2015 at 09:57:21PM +0200, Peter Zijlstra wrote:
> On Mon, Apr 06, 2015 at 10:55:48PM -0400, Waiman Long wrote:
> > @@ -219,24 +236,30 @@ static void pv_wait_node(struct mcs_spinlock *node)
> >  }
> >  
> >  /*
> > + * Called after setting next->locked = 1 & lock acquired.
> > + * Check if the the CPU has been halted. If so, set the _Q_SLOW_VAL flag
> > + * and put an entry into the lock hash table to be waken up at unlock time.
> >   */
> > -static void pv_kick_node(struct mcs_spinlock *node)
> > +static void pv_scan_next(struct qspinlock *lock, struct mcs_spinlock *node)
>
> I'm not too sure about that name change..
>
> >  {
> >  	struct pv_node *pn = (struct pv_node *)node;
> > +	struct __qspinlock *l = (void *)lock;
> >  
> >  	/*
> > +	 * Transition CPU state: halted => hashed
> > +	 * Quit if the transition failed.
> >  	 */
> > +	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
> > +		return;
> > +
> > +	/*
> > +	 * Put the lock into the hash table & set the _Q_SLOW_VAL in the lock.
> > +	 * As this is the same CPU that will check the _Q_SLOW_VAL value and
> > +	 * the hash table later on at unlock time, no atomic instruction is
> > +	 * needed.
> > +	 */
> > +	WRITE_ONCE(l->locked, _Q_SLOW_VAL);
> > +	(void)pv_hash(lock, pn);
> >  }
>
> This is broken. The unlock path relies on:
>
> 	pv_hash()
> 	   MB
> 	l->locked = SLOW
>
> such that when it observes SLOW, it must then also observe a consistent
> bucket.
>
> The above can have us do pv_hash_find() _before_ we actually hash the
> lock, which will result in us triggering that BUG_ON() in there.

Urgh, clearly it's late and I cannot read. The comment explains it.
Re: [PATCH v15 16/16] unfair qspinlock: a queue based unfair lock
On Wed, Apr 08, 2015 at 02:32:19PM -0400, Waiman Long wrote:
> For a virtual guest with the qspinlock patch, a simple unfair byte lock
> will be used if PV spinlock is not configured in or the hypervisor
> isn't either KVM or Xen. The byte lock works fine with small guest of
> just a few vCPUs. On a much larger guest, however, byte lock can have
> serious performance problem.

Who cares?
Re: [PATCH 8/9] qspinlock: Generic paravirt support
On Thu, Apr 02, 2015 at 09:48:34PM +0200, Peter Zijlstra wrote:
> @@ -158,20 +257,20 @@ static void pv_wait_head(struct qspinloc
>  void __pv_queue_spin_unlock(struct qspinlock *lock)
>  {
>  	struct __qspinlock *l = (void *)lock;
> +	struct pv_hash_bucket *hb;
>  
>  	if (xchg(&l->locked, 0) != _Q_SLOW_VAL)
>  		return;
>  
>  	/*
>  	 * At this point the memory pointed at by lock can be freed/reused,
> +	 * however we can still use the pointer value to search in our hash
> +	 * table.
>  	 *
> +	 * Also, if we observe _Q_SLOW_VAL we _must_ now observe 'our' hash
> +	 * bucket. See pv_wait_head().
>  	 */
> +	hb = pv_hash_find(lock);
> +	pv_kick(hb->cpu);
> +	WRITE_ONCE(hb->lock, NULL); /* unhash */
>  }

So I _think_ I found a problem with this approach :/

If, as per the above, the lock does indeed get freed, it can get re-allocated and stuck in the hash table (again) before we get the lookup and unhash it.

I'll have to ponder that a bit more.
Re: [PATCH 8/9] qspinlock: Generic paravirt support
On Thu, Apr 02, 2015 at 07:20:57PM +0200, Peter Zijlstra wrote:
> pv_wait_head():
>
> 	pv_hash()
> 	/* MB as per cmpxchg */
> 	cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL);
>
> VS
>
> __pv_queue_spin_unlock():
>
> 	if (xchg(&l->locked, 0) != _Q_SLOW_VAL)
> 		return;
>
> 	/* MB as per xchg */
> 	pv_hash_find(lock);

Something like so.. compile tested only.

I took out the LFSR because that was likely over-engineering from my side :-)

--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -2,6 +2,8 @@
 #error "do not include this file"
 #endif
 
+#include <linux/hash.h>
+
 /*
  * Implement paravirt qspinlocks; the general idea is to halt the vcpus instead
  * of spinning them.
@@ -107,7 +109,84 @@ static void pv_kick_node(struct mcs_spin
 		pv_kick(pn->cpu);
 }
 
-static DEFINE_PER_CPU(struct qspinlock *, __pv_lock_wait);
+/*
+ * Hash table using open addressing with a linear probe sequence.
+ *
+ * Since we should not be holding locks from NMI context (very rare indeed) the
+ * max load factor is 0.75, which is around the point where open addressing
+ * breaks down.
+ *
+ * Instead of probing just the immediate bucket we probe all buckets in the
+ * same cacheline.
+ *
+ * http://en.wikipedia.org/wiki/Hash_table#Open_addressing
+ *
+ */
+
+struct pv_hash_bucket {
+	struct qspinlock *lock;
+	int cpu;
+};
+
+/*
+ * XXX dynamic allocate using nr_cpu_ids instead...
+ */
+#define PV_LOCK_HASH_BITS	(2 + NR_CPUS_BITS)
+
+#if PV_LOCK_HASH_BITS < 6
+#undef PV_LOCK_HASH_BITS
+#define PB_LOCK_HASH_BITS	6
+#endif
+
+#define PV_LOCK_HASH_SIZE	(1 << PV_LOCK_HASH_BITS)
+
+static struct pv_hash_bucket __pv_lock_hash[PV_LOCK_HASH_SIZE] ____cacheline_aligned;
+
+#define PV_HB_PER_LINE	(SMP_CACHE_BYTES / sizeof(struct pv_hash_bucket))
+
+static inline u32 hash_align(u32 hash)
+{
+	return hash & ~(PV_HB_PER_LINE - 1);
+}
+
+#define for_each_hash_bucket(hb, off, hash)					\
+	for (hash = hash_align(hash), off = 0, hb = &__pv_lock_hash[hash + off];\
+	     off < PV_LOCK_HASH_SIZE;						\
+	     off++, hb = &__pv_lock_hash[(hash + off) % PV_LOCK_HASH_SIZE])
+
+static struct pv_hash_bucket *pv_hash_insert(struct qspinlock *lock)
+{
+	u32 offset, hash = hash_ptr(lock, PV_LOCK_HASH_BITS);
+	struct pv_hash_bucket *hb;
+
+	for_each_hash_bucket(hb, offset, hash) {
+		if (!cmpxchg(&hb->lock, NULL, lock)) {
+			WRITE_ONCE(hb->cpu, smp_processor_id());
+			return hb;
+		}
+	}
+
+	/*
+	 * Hard assumes there is an empty bucket somewhere.
+	 */
+	BUG();
+}
+
+static struct pv_hash_bucket *pv_hash_find(struct qspinlock *lock)
+{
+	u32 offset, hash = hash_ptr(lock, PV_LOCK_HASH_BITS);
+	struct pv_hash_bucket *hb;
+
+	for_each_hash_bucket(hb, offset, hash) {
+		if (READ_ONCE(hb->lock) == lock)
+			return hb;
+	}
+
+	/*
+	 * Hard assumes we _WILL_ find a match.
+	 */
+	BUG();
+}
 
 /*
  * Wait for l->locked to become clear; halt the vcpu after a short spin.
@@ -116,7 +195,9 @@ static DEFINE_PER_CPU(struct qspinlock *
 static void pv_wait_head(struct qspinlock *lock)
 {
 	struct __qspinlock *l = (void *)lock;
+	struct pv_hash_bucket *hb = NULL;
 	int loop;
+	u8 o;
 
 	for (;;) {
 		for (loop = SPIN_THRESHOLD; loop; loop--) {
@@ -126,29 +207,47 @@ static void pv_wait_head(struct qspinloc
 			cpu_relax();
 		}
 
-		this_cpu_write(__pv_lock_wait, lock);
-		/*
-		 * __pv_lock_wait must be set before setting _Q_SLOW_VAL
-		 *
-		 * [S] __pv_lock_wait = lock    [RmW] l = l->locked = 0
-		 *     MB                             MB
-		 * [S] l->locked = _Q_SLOW_VAL  [L]   __pv_lock_wait
-		 *
-		 * Matches the xchg() in pv_queue_spin_unlock().
-		 */
-		if (!cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL))
-			goto done;
+		if (!hb) {
+			hb = pv_hash_insert(lock);
+			/*
+			 * hb must be set before setting _Q_SLOW_VAL
+			 *
+			 * [S] hb <- lock               [RmW] l = l->locked = 0
+			 *     MB                             MB
+			 * [RmW] l->locked ?= _Q_SLOW_VAL  [L] hb
+			 *
+			 * Matches the xchg() in pv_queue_spin_unlock().
+			 */
+			o = cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL);
+			if (!o) {
+				/*
+				 * The lock got
Re: [PATCH 8/9] qspinlock: Generic paravirt support
On Thu, Apr 02, 2015 at 12:28:30PM -0400, Waiman Long wrote: On 04/01/2015 05:03 PM, Peter Zijlstra wrote: On Wed, Apr 01, 2015 at 03:58:58PM -0400, Waiman Long wrote: On 04/01/2015 02:48 PM, Peter Zijlstra wrote: I am sorry that I don't quite get what you mean here. My point is that in the hashing step, a cpu will need to scan an empty bucket to put the lock in. In the interim, an previously used bucket before the empty one may get freed. In the lookup step for that lock, the scanning will stop because of an empty bucket in front of the target one. Right, that's broken. So we need to do something else to limit the lookup, because without that break, a lookup that needs to iterate the entire array in order to determine -ENOENT, which is expensive. So my alternative proposal is that IFF we can guarantee that every lookup will succeed -- the entry we're looking for is always there, we don't need the break on empty but can probe until we find the entry. This will be bound in cost to the same number if probes we required for insertion and avoids the full array scan. Now I think we can indeed do this, if as said earlier we do not clear the bucket on insert if the cmpxchg succeeds, in that case the unlock will observe _Q_SLOW_VAL and do the lookup, the lookup will then find the entry. And we then need the unlock to clear the entry. _Q_SLOW_VAL Does that explain this? Or should I try again with code? OK, I got your proposal now. However, there is still the issue that setting the _Q_SLOW_VAL flag and the hash bucket are not atomic wrt each other. So? They're strictly ordered, that's sufficient. We first hash the lock, then we set _Q_SLOW_VAL. There's a full memory barrier between them. It is possible a CPU has set the _Q_SLOW_VAL flag but not yet filling in the hash bucket while another one is trying to look for it. Nope. 
The unlock side does an xchg() on the locked value first; xchg also implies a full barrier, so that guarantees that if we observe _Q_SLOW_VAL we must also observe the hash bucket with the lock value. So we need to have some kind of synchronization mechanism to let the lookup CPU know when is a good time to look up. No, it's all already ordered and working.

pv_wait_head():

	pv_hash();
	/* MB as per cmpxchg */
	cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL);

VS

__pv_queue_spin_unlock():

	if (xchg(&l->locked, 0) != _Q_SLOW_VAL)
		return;

	/* MB as per xchg */
	pv_hash_find(lock);

-- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
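The ordering argument above can be sketched in user space with C11 atomics. This is a hypothetical sequential rendering, not the kernel code; the names `waiter_publish` and `unlocker_sees_entry` are invented for illustration. The waiter publishes its hash entry before the full-barrier cmpxchg on the lock byte, and the unlocker's xchg is likewise fully ordered, so observing `_Q_SLOW_VAL` implies the hash entry is visible:

```c
#include <stdatomic.h>

#define _Q_LOCKED_VAL	1
#define _Q_SLOW_VAL	3

static _Atomic int locked = _Q_LOCKED_VAL;
static _Atomic(void *) hash_entry;	/* stands in for the hash bucket */

static void waiter_publish(void *lock)
{
	atomic_store(&hash_entry, lock);	/* pv_hash(): entry first...    */
	int old = _Q_LOCKED_VAL;		/* ...then the full-barrier RmW */
	atomic_compare_exchange_strong(&locked, &old, _Q_SLOW_VAL);
}

static int unlocker_sees_entry(void *lock)
{
	if (atomic_exchange(&locked, 0) != _Q_SLOW_VAL)
		return -1;			/* fast path: no hashed waiter  */
	/* slow path: _Q_SLOW_VAL observed => the entry must be visible */
	return atomic_load(&hash_entry) == lock;
}
```

In a single-threaded run the protocol trivially holds; the point of the barriers is that the same visibility guarantee carries over to the concurrent case.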
Re: [PATCH 8/9] qspinlock: Generic paravirt support
On Wed, Apr 01, 2015 at 12:20:30PM -0400, Waiman Long wrote: After more careful reading, I think the assumption that the presence of an unused bucket means there is no match is not true. Consider the scenario:

1. cpu 0 puts lock1 into hb[0]
2. cpu 1 puts lock2 into hb[1]
3. cpu 2 clears hb[0]
4. cpu 3 looks for lock2 and doesn't find it

Hmm, yes. The only way I can see that being true is if we assume entries are never taken out again. The wikipedia page could use some clarification here, this is not clear. At this point, I am thinking of going back to your previous idea of passing the queue head information down the queue. Having to scan the entire array for a lookup sure sucks, but the wait loops involved in the other idea can get us into the exact predicament we were trying to get out of, because their forward progress depends on other CPUs. Hohumm.. time to think more I think ;-)
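The four-step scenario can be reproduced with a toy open-addressed table. Linear probing is used here purely for illustration (the patch under discussion probes cachelines with an LFSR), and none of this is the kernel code:

```c
/* Toy open-addressed table; keys are nonzero ints, 0 marks empty. */
#define TSIZE 8

static int slot(int key, int i) { return (key + i) % TSIZE; }

static int insert(int *tab, int key)
{
	int i, s;

	for (i = 0; i < TSIZE; i++) {
		s = slot(key, i);
		if (tab[s] == 0) {
			tab[s] = key;
			return s;
		}
	}
	return -1;
}

/* Lookup that stops at the first empty bucket -- the strategy whose
 * assumption the scenario above violates once entries can be removed. */
static int lookup_break_on_empty(const int *tab, int key)
{
	int i, s;

	for (i = 0; i < TSIZE; i++) {
		s = slot(key, i);
		if (tab[s] == 0)
			return -1;	/* empty => assume not present */
		if (tab[s] == key)
			return s;
	}
	return -1;
}

/* Reproduce the 4-step race: returns 1 when a present key is missed. */
static int demo_lost_lookup(void)
{
	int tab[TSIZE] = { 0 };

	insert(tab, 9);			/* "lock1" lands in bucket 1      */
	insert(tab, 17);		/* "lock2" collides into bucket 2 */
	tab[slot(9, 0)] = 0;		/* "lock1" is removed again       */
	return lookup_break_on_empty(tab, 17) < 0;	/* lock2 lost     */
}
```

Key 17 is still in the table, but the empty bucket in front of it terminates the probe early.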
Re: [PATCH 8/9] qspinlock: Generic paravirt support
On Wed, Apr 01, 2015 at 07:42:39PM +0200, Peter Zijlstra wrote: Hohumm.. time to think more I think ;-) So bear with me, I've not really pondered this well so it could be full of holes (again). After the cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL) succeeds, the spin_unlock() must do the hash lookup, right? We can make the lookup unhash. If the cmpxchg() fails the unlock will not do the lookup and we must unhash. The idea being that the result is that any lookup is guaranteed to find an entry, which reduces our worst case lookup cost to whatever the worst case insertion cost was.
Re: [PATCH 8/9] qspinlock: Generic paravirt support
On Wed, Apr 01, 2015 at 02:54:45PM -0400, Waiman Long wrote: On 04/01/2015 02:17 PM, Peter Zijlstra wrote: On Wed, Apr 01, 2015 at 07:42:39PM +0200, Peter Zijlstra wrote: Hohumm.. time to think more I think ;-) So bear with me, I've not really pondered this well so it could be full of holes (again). After the cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL) succeeds, the spin_unlock() must do the hash lookup, right? We can make the lookup unhash. If the cmpxchg() fails the unlock will not do the lookup and we must unhash. The idea being that the result is that any lookup is guaranteed to find an entry, which reduces our worst case lookup cost to whatever the worst case insertion cost was. I think it doesn't matter who did the unhashing. Multiple independent locks can be hashed to the same value. Since they can be unhashed independently, there is no way to know whether you have checked all the possible buckets. Oh, but the crux is that you guarantee a lookup will find an entry; it will never need to iterate the entire array.
Re: [PATCH 8/9] qspinlock: Generic paravirt support
On Wed, Apr 01, 2015 at 07:12:23PM +0200, Peter Zijlstra wrote: On Wed, Apr 01, 2015 at 12:20:30PM -0400, Waiman Long wrote: After more careful reading, I think the assumption that the presence of an unused bucket means there is no match is not true. Consider the scenario:

1. cpu 0 puts lock1 into hb[0]
2. cpu 1 puts lock2 into hb[1]
3. cpu 2 clears hb[0]
4. cpu 3 looks for lock2 and doesn't find it

Hmm, yes. The only way I can see that being true is if we assume entries are never taken out again. The wikipedia page could use some clarification here, this is not clear. At this point, I am thinking of going back to your previous idea of passing the queue head information down the queue. Having to scan the entire array for a lookup sure sucks, but the wait loops involved in the other idea can get us into the exact predicament we were trying to get out of, because their forward progress depends on other CPUs. Hohumm.. time to think more I think ;-) So bear with me, I've not really pondered this well so it could be full of holes (again). After the cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL) succeeds, the spin_unlock() must do the hash lookup, right? We can make the lookup unhash. If the cmpxchg() fails the unlock will not do the lookup and we must unhash.
Re: [PATCH 8/9] qspinlock: Generic paravirt support
On Wed, Apr 01, 2015 at 03:58:58PM -0400, Waiman Long wrote: On 04/01/2015 02:48 PM, Peter Zijlstra wrote: I am sorry that I don't quite get what you mean here. My point is that in the hashing step, a cpu will need to scan for an empty bucket to put the lock in. In the interim, a previously used bucket before the empty one may get freed. In the lookup step for that lock, the scanning will stop because of an empty bucket in front of the target one. Right, that's broken. So we need to do something else to limit the lookup, because without that break a lookup needs to iterate the entire array in order to determine -ENOENT, which is expensive. So my alternative proposal is that IFF we can guarantee that every lookup will succeed -- the entry we're looking for is always there -- we don't need the break on empty but can probe until we find the entry. This will be bound in cost to the same number of probes we required for insertion and avoids the full array scan. Now I think we can indeed do this: if, as said earlier, we do not clear the bucket on insert if the cmpxchg succeeds, the unlock will observe _Q_SLOW_VAL and do the lookup, the lookup will then find the entry, and we then need the unlock to clear the entry. Does that explain this? Or should I try again with code?
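Under the every-lookup-succeeds guarantee described above, the lookup can drop the break-on-empty and simply probe until it finds the entry, bounded by the insertion's probe count. A toy sketch (linear probing for illustration only, not the kernel code):

```c
/* Toy open-addressed table; keys are nonzero ints, 0 marks empty. */
#define TSIZE 8

static int slot(int key, int i) { return (key + i) % TSIZE; }

static int insert(int *tab, int key)
{
	int i, s;

	for (i = 0; i < TSIZE; i++) {
		s = slot(key, i);
		if (tab[s] == 0) {
			tab[s] = key;
			return s;
		}
	}
	return -1;
}

/* Probe until found; safe only under the entry-is-present guarantee. */
static int lookup_until_found(const int *tab, int key)
{
	int i, s;

	for (i = 0; i < TSIZE; i++) {
		s = slot(key, i);
		if (tab[s] == key)
			return s;
	}
	return -1;
}

/* Unlike break-on-empty, this still finds key 17 after 9 is removed. */
static int demo_still_found(void)
{
	int tab[TSIZE] = { 0 };

	insert(tab, 9);			/* bucket 1            */
	insert(tab, 17);		/* collides, bucket 2  */
	tab[slot(9, 0)] = 0;		/* 9 is removed again  */
	return lookup_until_found(tab, 17);
}
```

The earlier removal no longer matters because an intervening empty bucket does not terminate the probe.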
Re: [PATCH 0/9] qspinlock stuff -v15
On Mon, Mar 30, 2015 at 12:25:12PM -0400, Waiman Long wrote: I did it differently in my PV portion of the qspinlock patch. Instead of just waking up the CPU, the new lock holder will check if the new queue head has been halted. If so, it will set the slowpath flag for the halted queue head in the lock so as to wake it up at unlock time. This should eliminate your concern of doing twice as many VMEXITs in an overcommitted scenario. We can still do that on top of all this, right? As you might have realized, I'm a fan of gradual complexity :-)
Re: [PATCH 0/9] qspinlock stuff -v15
On Wed, Mar 25, 2015 at 03:47:39PM -0400, Konrad Rzeszutek Wilk wrote: Ah nice. That could be spun out as a separate patch to optimize the existing ticket locks I presume. Yes, I suppose we can do something similar for the ticket code and patch in the right increment. We'd need to restructure the code a bit, but it's not fundamentally impossible. We could equally apply the head hashing to the current ticket implementation and avoid the current bitmap iteration. Now with the old pv ticketlock code a vCPU would only go to sleep once and be woken up when it was its turn. With this new code it is woken up twice (and twice it goes to sleep). In an overcommit scenario this would imply that we will have at least twice as many VMEXITs as with the previous code. An astute observation, I had not considered that. I presume when you did benchmarking this did not even register? Though I wonder if it would if you ran the benchmark for a week or so. You presume I benchmarked :-) I managed to boot something virt and run hackbench in it. I wouldn't know a representative virt setup if I ran into it. The thing is, we want this qspinlock for real hardware because it's faster and I really want to avoid having to carry two spinlock implementations -- although I suppose that if we really really have to we could.
Re: [PATCH 9/9] qspinlock,x86,kvm: Implement KVM support for paravirt qspinlock
On Wed, Mar 18, 2015 at 10:45:55PM -0400, Waiman Long wrote: On 03/16/2015 09:16 AM, Peter Zijlstra wrote: I do have some concern about this call site patching mechanism as the modification is not atomic. The spin_unlock() calls are in many places in the kernel. There is a possibility that a thread is calling a certain spin_unlock call site while it is being patched by another one with the alternative() function call. So far, I don't see any problem with bare metal where paravirt_patch_insns() is used to patch it to the move instruction. However, in a virtual guest environment where paravirt_patch_call() was used, there were situations where the system panicked because of a page fault on some invalid memory in the kthread. If you look at paravirt_patch_call(), you will see:

	b->opcode = 0xe8; /* call */
	b->delta = delta;

If another CPU reads the instruction at the call site at the right moment, it will get the modified call instruction, but not the new delta value. It will then jump to a random location. I believe that was causing the system panic that I saw. So I think it is kind of risky to use it here unless we can guarantee that call site patching is atomic wrt other CPUs. Just look at where the patching is done:

	init/main.c:start_kernel()
	  check_bugs()
	    alternative_instructions()
	      apply_paravirt()

We're UP and not holding any locks, disable IRQs (see text_poke_early()) and have NMIs 'disabled'.
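The torn-write hazard being described can be visualized with a hypothetical user-space rendering of the 5-byte call rewrite. The struct name is invented here and the layout merely mirrors the opcode/delta pair quoted above; the point is that the rewrite is a multi-byte store, so it is only safe before SMP bringup:

```c
#include <stdint.h>
#include <string.h>

/* Sketch of an x86 "call rel32" site: opcode 0xe8 plus a 32-bit
 * relative displacement. The two fields are written separately, so a
 * CPU executing through the site mid-patch could fetch the new opcode
 * with a stale delta and jump to a random location. */
struct branch_sketch {
	uint8_t opcode;
	int32_t delta;
} __attribute__((packed));

static void patch_call(uint8_t *site, int32_t delta)
{
	struct branch_sketch b;

	b.opcode = 0xe8;		/* call rel32 */
	b.delta = delta;
	memcpy(site, &b, sizeof(b));	/* 5 bytes: not one atomic store */
}
```

Patching at `start_kernel()` time sidesteps the problem entirely because no other CPU can be executing the site.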
Re: [PATCH 8/9] qspinlock: Generic paravirt support
On Wed, Mar 18, 2015 at 04:50:37PM -0400, Waiman Long wrote:

> +	this_cpu_write(__pv_lock_wait, lock);

We may run into the same problem of needing to have 4 queue nodes per CPU. If an interrupt happens just after the write and before the actual wait and it goes through the same sequence, it will overwrite the __pv_lock_wait[] entry. So we may have lost a wakeup. That is why the pvticket lock code did that just before the actual wait with interrupts disabled. We probably couldn't disable interrupts here. So we may need to move the actual write to the KVM and Xen code if we keep the current logic.

Hmm indeed. So I didn't actually mean to keep this code, but yes I missed that.

> +	/*
> +	 * At this point the memory pointed at by lock can be freed/reused,
> +	 * however we can still use the pointer value to search in our cpu
> +	 * array.
> +	 *
> +	 * XXX: get rid of this loop
> +	 */
> +	for_each_possible_cpu(cpu) {
> +		if (per_cpu(__pv_lock_wait, cpu) == lock)
> +			pv_kick(cpu);
> +	}
> +}

I do want to get rid of this loop too. On average, we have to scan about half the number of CPUs available. So it isn't that different performance-wise compared with my original idea of following the list from tail to head. And how about your idea of propagating the current head down the linked list?

Yeah, so I was half done with that patch when I fell asleep on the flight home. Didn't get around to 'fixing' it. It applies and builds but doesn't actually work. _However_ I'm not sure I actually like it much. The problem I have with it are these wait loops; they can generate the same problem we're trying to avoid. So I was now thinking of hashing the lock pointer; let me go and quickly put something together.
---
 kernel/locking/qspinlock.c          |   8 +-
 kernel/locking/qspinlock_paravirt.h | 124 ++++++++++++++++++++++------
 2 files changed, 101 insertions(+), 31 deletions(-)

--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -246,10 +246,10 @@ static __always_inline void set_locked(s
  */
 static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
+static __always_inline void __pv_wait_node(u32 old, struct mcs_spinlock *node) { }
 static __always_inline void __pv_kick_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_wait_head(struct qspinlock *lock) { }
+static __always_inline void __pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node) { }
 
 #define pv_enabled()	false
@@ -399,7 +399,7 @@ void queue_spin_lock_slowpath(struct qsp
 		prev = decode_tail(old);
 		WRITE_ONCE(prev->next, node);
 
-		pv_wait_node(node);
+		pv_wait_node(old, node);
 		arch_mcs_spin_lock_contended(&node->locked);
 	}
@@ -414,7 +414,7 @@ void queue_spin_lock_slowpath(struct qsp
 	 * sequentiality; this is because the set_locked() function below
 	 * does not imply a full barrier.
 	 */
-	pv_wait_head(lock);
+	pv_wait_head(lock, node);
 	while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_PENDING_MASK)
 		cpu_relax();
 
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -29,8 +29,14 @@
 struct pv_node {
 	int		cpu;
 	u8		state;
+	struct pv_node	*head;
 };
 
+static inline struct pv_node *pv_decode_tail(u32 tail)
+{
+	return (struct pv_node *)decode_tail(tail);
+}
+
 /*
  * Initialize the PV part of the mcs_spinlock node.
  */
@@ -42,17 +48,49 @@ static void pv_init_node(struct mcs_spin
 	pn->cpu = smp_processor_id();
 	pn->state = vcpu_running;
+	pn->head = NULL;
+}
+
+static void pv_propagate_head(struct pv_node *pn, struct pv_node *ppn)
+{
+	/*
+	 * When we race against the first waiter or ourselves we have to wait
+	 * until the previous link updates its head pointer before we can
+	 * propagate it.
+	 */
+	while (!READ_ONCE(ppn->head)) {
+		/*
+		 * queue_spin_lock_slowpath could have been waiting for the
+		 * node->next store above before setting node->locked.
+		 */
+		if (ppn->mcs.locked)
+			return;
+
+		cpu_relax();
+	}
+	/*
+	 * If we interleave such that the store from pv_set_head() happens
+	 * between our load and store we could have over-written the new head
+	 * value.
+	 *
+	 * Therefore use cmpxchg to only propagate the old head value if our
+	 * head value is unmodified.
+	 */
+	(void)cmpxchg(&pn->head, NULL, READ_ONCE(ppn->head));
 }
 
 /*
  * Wait for node->locked to become true, halt the vcpu after a short spin.
  * pv_kick_node() is used to wake the
Re: [PATCH 8/9] qspinlock: Generic paravirt support
On Thu, Mar 19, 2015 at 11:12:42AM +0100, Peter Zijlstra wrote: So I was now thinking of hashing the lock pointer; let me go and quickly put something together. A little something like so; ideally we'd allocate the hashtable since NR_CPUS is kinda bloated, but it shows the idea I think. And while this has loops in (the rehashing thing) their fwd progress does not depend on other CPUs. And I suspect that for the typical lock contention scenarios its unlikely we ever really get into long rehashing chains. --- include/linux/lfsr.h| 49 kernel/locking/qspinlock_paravirt.h | 143 2 files changed, 178 insertions(+), 14 deletions(-) --- /dev/null +++ b/include/linux/lfsr.h @@ -0,0 +1,49 @@ +#ifndef _LINUX_LFSR_H +#define _LINUX_LFSR_H + +/* + * Simple Binary Galois Linear Feedback Shift Register + * + * http://en.wikipedia.org/wiki/Linear_feedback_shift_register + * + */ + +extern void __lfsr_needs_more_taps(void); + +static __always_inline u32 lfsr_taps(int bits) +{ + if (bits == 1) return 0x0001; + if (bits == 2) return 0x0001; + if (bits == 3) return 0x0003; + if (bits == 4) return 0x0009; + if (bits == 5) return 0x0012; + if (bits == 6) return 0x0021; + if (bits == 7) return 0x0041; + if (bits == 8) return 0x008E; + if (bits == 9) return 0x0108; + if (bits == 10) return 0x0204; + if (bits == 11) return 0x0402; + if (bits == 12) return 0x0829; + if (bits == 13) return 0x100D; + if (bits == 14) return 0x2015; + + /* +* For more taps see: +* http://users.ece.cmu.edu/~koopman/lfsr/index.html +*/ + __lfsr_needs_more_taps(); + + return 0; +} + +static inline u32 lfsr(u32 val, int bits) +{ + u32 bit = val 1; + + val = 1; + if (bit) + val ^= lfsr_taps(bits); + return val; +} + +#endif /* _LINUX_LFSR_H */ --- a/kernel/locking/qspinlock_paravirt.h +++ b/kernel/locking/qspinlock_paravirt.h @@ -2,6 +2,9 @@ #error do not include this file #endif +#include linux/hash.h +#include linux/lfsr.h + /* * Implement paravirt qspinlocks; the general idea is to halt the vcpus instead * of 
spinning them. @@ -107,7 +110,120 @@ static void pv_kick_node(struct mcs_spin pv_kick(pn-cpu); } -static DEFINE_PER_CPU(struct qspinlock *, __pv_lock_wait); +/* + * Hash table using open addressing with an LFSR probe sequence. + * + * Since we should not be holding locks from NMI context (very rare indeed) the + * max load factor is 0.75, which is around the point where open addressing + * breaks down. + * + * Instead of probing just the immediate bucket we probe all buckets in the + * same cacheline. + * + * http://en.wikipedia.org/wiki/Hash_table#Open_addressing + * + */ + +#define HB_RESERVED((struct qspinlock *)1) + +struct pv_hash_bucket { + struct qspinlock *lock; + int cpu; +}; + +/* + * XXX dynamic allocate using nr_cpu_ids instead... + */ +#define PV_LOCK_HASH_BITS (2 + NR_CPUS_BITS) + +#if PV_LOCK_HASH_BITS 6 +#undef PV_LOCK_HASH_BITS +#define PB_LOCK_HASH_BITS 6 +#endif + +#define PV_LOCK_HASH_SIZE (1 PV_LOCK_HASH_BITS) + +static struct pv_hash_bucket __pv_lock_hash[PV_LOCK_HASH_SIZE] cacheline_aligned; + +#define PV_HB_PER_LINE (SMP_CACHE_BYTES / sizeof(struct pv_hash_bucket)) + +static inline u32 hash_align(u32 hash) +{ + return hash ~(PV_HB_PER_LINE - 1); +} + +static struct qspinlock **pv_hash(struct qspinlock *lock) +{ + u32 hash = hash_ptr(lock, PV_LOCK_HASH_BITS); + struct pv_hash_bucket *hb, *end; + + if (!hash) + hash = 1; + + hb = __pv_lock_hash[hash_align(hash)]; + for (;;) { + for (end = hb + PV_HB_PER_LINE; hb end; hb++) { + if (cmpxchg(hb-lock, NULL, HB_RESERVED)) { + WRITE_ONCE(hb-cpu, smp_processor_id()); + /* +* Since we must read lock first and cpu +* second, we must write cpu first and lock +* second, therefore use HB_RESERVE to mark an +* entry in use before writing the values. +* +* This can cause hb_hash_find() to not find a +* cpu even though _Q_SLOW_VAL, this is not a +* problem since we re-check l-locked before +* going to sleep and the unlock will have +* cleared l-locked already. 
+*/ + smp_wmb(); /* matches rmb from pv_hash_find */ + WRITE_ONCE(hb-lock, lock); + goto done
Re: [PATCH 8/9] qspinlock: Generic paravirt support
On Thu, Mar 19, 2015 at 01:25:36PM +0100, Peter Zijlstra wrote:

> +static struct qspinlock **pv_hash(struct qspinlock *lock)
> +{
> +	u32 hash = hash_ptr(lock, PV_LOCK_HASH_BITS);
> +	struct pv_hash_bucket *hb, *end;
> +
> +	if (!hash)
> +		hash = 1;
> +
> +	hb = &__pv_lock_hash[hash_align(hash)];
> +	for (;;) {
> +		for (end = hb + PV_HB_PER_LINE; hb < end; hb++) {
> +			if (cmpxchg(&hb->lock, NULL, HB_RESERVED)) {

That should be: !cmpxchg(), bit disturbing that that booted.

> +				WRITE_ONCE(hb->cpu, smp_processor_id());
> +				/*
> +				 * Since we must read lock first and cpu
> +				 * second, we must write cpu first and lock
> +				 * second, therefore use HB_RESERVE to mark an
> +				 * entry in use before writing the values.
> +				 *
> +				 * This can cause hb_hash_find() to not find a
> +				 * cpu even though _Q_SLOW_VAL, this is not a
> +				 * problem since we re-check l->locked before
> +				 * going to sleep and the unlock will have
> +				 * cleared l->locked already.
> +				 */
> +				smp_wmb(); /* matches rmb from pv_hash_find */
> +				WRITE_ONCE(hb->lock, lock);
> +				goto done;
> +			}
> +		}
> +
> +		hash = lfsr(hash, PV_LOCK_HASH_BITS);
> +		hb = &__pv_lock_hash[hash_align(hash)];
> +	}
> +
> +done:
> +	return &hb->lock;
> +}
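The LFSR makes a usable probe sequence because a maximal-length LFSR cycles through every nonzero value of its width exactly once before repeating, so rehashing eventually visits the whole table. A user-space rendering of the patch's lfsr() can verify the period; only a few widths from the taps table are reproduced here:

```c
#include <stdint.h>

/* Galois LFSR as in the patch; taps table truncated for illustration. */
static uint32_t lfsr_taps(int bits)
{
	switch (bits) {
	case 4:  return 0x0009;
	case 6:  return 0x0021;
	case 10: return 0x0204;
	default: return 0;	/* more taps in the original table */
	}
}

static uint32_t lfsr(uint32_t val, int bits)
{
	uint32_t bit = val & 1;

	val >>= 1;
	if (bit)
		val ^= lfsr_taps(bits);
	return val;
}

/* Count steps until the sequence returns to the seed. */
static uint32_t lfsr_period(int bits)
{
	uint32_t seed = 1, val = lfsr(seed, bits), n = 1;

	while (val != seed) {
		val = lfsr(val, bits);
		n++;
	}
	return n;
}
```

A period of 2^bits - 1 means every nonzero hash value is generated, which is what guarantees pv_hash() terminates even on a nearly full table.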
Re: [Xen-devel] [PATCH 0/9] qspinlock stuff -v15
On Thu, Mar 19, 2015 at 06:01:34PM +, David Vrabel wrote: This seems to work for me, but I've not got time to give it a more thorough testing. You can fold this into your series. Thanks! There doesn't seem to be a way to disable QUEUE_SPINLOCKS when supported by the arch, is this intentional? If so, the existing ticketlock code could go. Yeah, it's left as a rudiment such that if we find issues with the qspinlock code we can 'revert' with a trivial patch. If no issues show up we can rip out all the old code in a subsequent release.
[PATCH 4/9] qspinlock: Extract out code snippets for the next patch
From: Waiman Long waiman.l...@hp.com This is a preparatory patch that extracts out the following 2 code snippets to prepare for the next performance optimization patch. 1) the logic for the exchange of new and previous tail code words into a new xchg_tail() function. 2) the logic for clearing the pending bit and setting the locked bit into a new clear_pending_set_locked() function. This patch also simplifies the trylock operation before queuing by calling queue_spin_trylock() directly. Cc: Ingo Molnar mi...@redhat.com Cc: David Vrabel david.vra...@citrix.com Cc: Oleg Nesterov o...@redhat.com Cc: Scott J Norton scott.nor...@hp.com Cc: Paolo Bonzini paolo.bonz...@gmail.com Cc: Douglas Hatch doug.ha...@hp.com Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com Cc: Boris Ostrovsky boris.ostrov...@oracle.com Cc: Paul E. McKenney paul...@linux.vnet.ibm.com Cc: Linus Torvalds torva...@linux-foundation.org Cc: Thomas Gleixner t...@linutronix.de Cc: H. Peter Anvin h...@zytor.com Cc: Rik van Riel r...@redhat.com Cc: Raghavendra K T raghavendra...@linux.vnet.ibm.com Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org Link: http://lkml.kernel.org/r/1421784755-21945-5-git-send-email-waiman.l...@hp.com --- include/asm-generic/qspinlock_types.h |2 kernel/locking/qspinlock.c| 91 ++ 2 files changed, 62 insertions(+), 31 deletions(-) --- a/include/asm-generic/qspinlock_types.h +++ b/include/asm-generic/qspinlock_types.h @@ -58,6 +58,8 @@ typedef struct qspinlock { #define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET) #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) +#define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK) + #define _Q_LOCKED_VAL (1U _Q_LOCKED_OFFSET) #define _Q_PENDING_VAL (1U _Q_PENDING_OFFSET) --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -97,6 +97,54 @@ static inline struct mcs_spinlock *decod #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) /** + * clear_pending_set_locked - 
take ownership and clear the pending bit. + * @lock: Pointer to queue spinlock structure + * @val : Current value of the queue spinlock 32-bit word + * + * *,1,0 - *,0,1 + */ +static __always_inline void +clear_pending_set_locked(struct qspinlock *lock, u32 val) +{ + u32 new, old; + + for (;;) { + new = (val ~_Q_PENDING_MASK) | _Q_LOCKED_VAL; + + old = atomic_cmpxchg(lock-val, val, new); + if (old == val) + break; + + val = old; + } +} + +/** + * xchg_tail - Put in the new queue tail code word retrieve previous one + * @lock : Pointer to queue spinlock structure + * @tail : The new queue tail code word + * Return: The previous queue tail code word + * + * xchg(lock, tail) + * + * p,*,* - n,*,* ; prev = xchg(lock, node) + */ +static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail) +{ + u32 old, new, val = atomic_read(lock-val); + + for (;;) { + new = (val _Q_LOCKED_PENDING_MASK) | tail; + old = atomic_cmpxchg(lock-val, val, new); + if (old == val) + break; + + val = old; + } + return old; +} + +/** * queue_spin_lock_slowpath - acquire the queue spinlock * @lock: Pointer to queue spinlock structure * @val: Current value of the queue spinlock 32-bit word @@ -178,15 +226,7 @@ void queue_spin_lock_slowpath(struct qsp * * *,1,0 - *,0,1 */ - for (;;) { - new = (val ~_Q_PENDING_MASK) | _Q_LOCKED_VAL; - - old = atomic_cmpxchg(lock-val, val, new); - if (old == val) - break; - - val = old; - } + clear_pending_set_locked(lock, val); return; /* @@ -203,37 +243,26 @@ void queue_spin_lock_slowpath(struct qsp node-next = NULL; /* -* We have already touched the queueing cacheline; don't bother with -* pending stuff. -* -* trylock || xchg(lock, node) -* -* 0,0,0 - 0,0,1 ; no tail, not locked - no tail, locked. -* p,y,x - n,y,x ; tail was p - tail is n; preserving locked. +* We touched a (possibly) cold cacheline in the per-cpu queue node; +* attempt the trylock once more in the hope someone let go while we +* weren't watching. 
*/ - for (;;) { - new = _Q_LOCKED_VAL; - if (val) - new = tail | (val _Q_LOCKED_PENDING_MASK); - - old = atomic_cmpxchg(lock-val, val, new); - if (old == val) - break; - - val = old; - } + if (queue_spin_trylock(lock)) + goto release; /* -* we won the trylock; forget about queueing. +* We
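The two helpers extracted by this patch can be exercised stand-alone. Below is a hypothetical user-space sketch using C11 atomics in place of the kernel's atomic_cmpxchg(), with a simplified layout (locked/pending in the low bits, the tail code placed at bit 16 -- the kernel's offsets differ depending on configuration):

```c
#include <stdatomic.h>
#include <stdint.h>

#define _Q_LOCKED_VAL		1U
#define _Q_LOCKED_MASK		0xffU
#define _Q_PENDING_MASK		0x100U
#define _Q_LOCKED_PENDING_MASK	(_Q_LOCKED_MASK | _Q_PENDING_MASK)
#define _Q_TAIL_OFFSET		16

/* xchg_tail(): install a new tail code, preserving locked + pending. */
static uint32_t xchg_tail(_Atomic uint32_t *lock, uint32_t tail)
{
	uint32_t old = atomic_load(lock);

	/* CAS loop; on failure 'old' is refreshed with the current value */
	while (!atomic_compare_exchange_weak(lock, &old,
			(old & _Q_LOCKED_PENDING_MASK) | tail))
		;
	return old;
}

/* clear_pending_set_locked(): *,1,0 -> *,0,1 */
static void clear_pending_set_locked(_Atomic uint32_t *lock)
{
	uint32_t old = atomic_load(lock);

	while (!atomic_compare_exchange_weak(lock, &old,
			(old & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL))
		;
}
```

Both are plain compare-and-swap retry loops; the next patch in the series shows how a wider pending byte lets them collapse into single atomic operations.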
[PATCH 5/9] qspinlock: Optimize for smaller NR_CPUS
From: Peter Zijlstra pet...@infradead.org When we allow for a max NR_CPUS < 2^14 we can optimize the pending wait-acquire and the xchg_tail() operations. By growing the pending bit to a byte, we reduce the tail to 16bit. This means we can use xchg16 for the tail part and do away with all the repeated cmpxchg() operations. This in turn allows us to unconditionally acquire; the locked state as observed by the wait loops cannot change. And because both locked and pending are now a full byte we can use simple stores for the state transition, obviating one atomic operation entirely. This optimization is needed to make the qspinlock achieve performance parity with ticket spinlock at light load. All this is horribly broken on Alpha pre EV56 (and any other arch that cannot do single-copy atomic byte stores). Cc: Ingo Molnar mi...@redhat.com Cc: David Vrabel david.vra...@citrix.com Cc: Oleg Nesterov o...@redhat.com Cc: Scott J Norton scott.nor...@hp.com Cc: Paolo Bonzini paolo.bonz...@gmail.com Cc: Douglas Hatch doug.ha...@hp.com Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com Cc: Boris Ostrovsky boris.ostrov...@oracle.com Cc: Paul E. McKenney paul...@linux.vnet.ibm.com Cc: Linus Torvalds torva...@linux-foundation.org Cc: Thomas Gleixner t...@linutronix.de Cc: H.
Peter Anvin h...@zytor.com Cc: Rik van Riel r...@redhat.com Cc: Raghavendra K T raghavendra...@linux.vnet.ibm.com Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org Link: http://lkml.kernel.org/r/1421784755-21945-6-git-send-email-waiman.l...@hp.com --- include/asm-generic/qspinlock_types.h | 13 ++ kernel/locking/qspinlock.c| 71 +- 2 files changed, 83 insertions(+), 1 deletion(-) --- a/include/asm-generic/qspinlock_types.h +++ b/include/asm-generic/qspinlock_types.h @@ -35,6 +35,14 @@ typedef struct qspinlock { /* * Bitfields in the atomic value: * + * When NR_CPUS 16K + * 0- 7: locked byte + * 8: pending + * 9-15: not used + * 16-17: tail index + * 18-31: tail cpu (+1) + * + * When NR_CPUS = 16K * 0- 7: locked byte * 8: pending * 9-10: tail index @@ -47,7 +55,11 @@ typedef struct qspinlock { #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED) #define _Q_PENDING_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS) +#if CONFIG_NR_CPUS (1U 14) +#define _Q_PENDING_BITS8 +#else #define _Q_PENDING_BITS1 +#endif #define _Q_PENDING_MASK_Q_SET_MASK(PENDING) #define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS) @@ -58,6 +70,7 @@ typedef struct qspinlock { #define _Q_TAIL_CPU_BITS (32 - _Q_TAIL_CPU_OFFSET) #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) +#define _Q_TAIL_OFFSET _Q_TAIL_IDX_OFFSET #define _Q_TAIL_MASK (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK) #define _Q_LOCKED_VAL (1U _Q_LOCKED_OFFSET) --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -24,6 +24,7 @@ #include linux/percpu.h #include linux/hardirq.h #include linux/mutex.h +#include asm/byteorder.h #include asm/qspinlock.h /* @@ -56,6 +57,10 @@ * node; whereby avoiding the need to carry a node from lock to unlock, and * preserving existing lock API. This also makes the unlock code simpler and * faster. + * + * N.B. The current implementation only supports architectures that allow + * atomic operations on smaller 8-bit and 16-bit data types. 
+ * */ #include mcs_spinlock.h @@ -96,6 +101,64 @@ static inline struct mcs_spinlock *decod #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) +/* + * By using the whole 2nd least significant byte for the pending bit, we + * can allow better optimization of the lock acquisition for the pending + * bit holder. + */ +#if _Q_PENDING_BITS == 8 + +struct __qspinlock { + union { + atomic_t val; + struct { +#ifdef __LITTLE_ENDIAN + u16 locked_pending; + u16 tail; +#else + u16 tail; + u16 locked_pending; +#endif + }; + }; +}; + +/** + * clear_pending_set_locked - take ownership and clear the pending bit. + * @lock: Pointer to queue spinlock structure + * @val : Current value of the queue spinlock 32-bit word + * + * *,1,0 - *,0,1 + * + * Lock stealing is not allowed if this function is used. + */ +static __always_inline void +clear_pending_set_locked(struct qspinlock *lock, u32 val) +{ + struct __qspinlock *l = (void *)lock; + + WRITE_ONCE(l-locked_pending, _Q_LOCKED_VAL); +} + +/* + * xchg_tail - Put in the new queue tail code word retrieve previous one + * @lock : Pointer to queue spinlock structure + * @tail : The new queue tail code word + * Return: The previous queue tail code word + * + * xchg(lock, tail) + * + * p,*,* - n,*,* ; prev = xchg(lock, node) + */ +static __always_inline u32
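The effect of the __qspinlock overlay can be demonstrated outside the kernel. This sketch assumes a little-endian build and C11 anonymous unions (the big-endian variant swaps the two 16-bit halves, as in the patch): with an 8-bit pending field, locked and pending together form the low 16 bits, so one plain store performs the *,1,0 -> *,0,1 transition without a cmpxchg loop:

```c
#include <stdint.h>

/* Sketch of the _Q_PENDING_BITS == 8 overlay on a little-endian host:
 * locked byte + pending byte in the low 16 bits, tail in the high 16. */
struct qspinlock_sketch {
	union {
		uint32_t val;
		struct {
			uint16_t locked_pending;
			uint16_t tail;
		};
	};
};

#define _Q_LOCKED_VAL	1U
#define _Q_PENDING_VAL	(1U << 8)

/* One 16-bit store replaces the whole cmpxchg loop: *,1,0 -> *,0,1 */
static void clear_pending_set_locked8(struct qspinlock_sketch *l)
{
	l->locked_pending = (uint16_t)_Q_LOCKED_VAL;
}
```

The tail half is untouched by the store, which is exactly why the 16-bit split is safe for this transition.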
[PATCH 2/9] qspinlock, x86: Enable x86-64 to use queue spinlock
From: Waiman Long waiman.l...@hp.com This patch makes the necessary changes at the x86 architecture specific layer to enable the use of queue spinlock for x86-64. As x86-32 machines are typically not multi-socket, the benefit of queue spinlock may not be apparent, so queue spinlock is not enabled there. Currently, there are some incompatibilities between the para-virtualized spinlock code (which hard-codes the use of ticket spinlock) and the queue spinlock. Therefore, the use of queue spinlock is disabled when the para-virtualized spinlock is enabled. The arch/x86/include/asm/qspinlock.h header file includes some x86 specific optimization which will make the queue spinlock code perform better than the generic implementation. Cc: Ingo Molnar mi...@redhat.com Cc: David Vrabel david.vra...@citrix.com Cc: Oleg Nesterov o...@redhat.com Cc: Scott J Norton scott.nor...@hp.com Cc: Paolo Bonzini paolo.bonz...@gmail.com Cc: Douglas Hatch doug.ha...@hp.com Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com Cc: Boris Ostrovsky boris.ostrov...@oracle.com Cc: Paul E. McKenney paul...@linux.vnet.ibm.com Cc: Linus Torvalds torva...@linux-foundation.org Cc: Thomas Gleixner t...@linutronix.de Cc: H.
Peter Anvin h...@zytor.com Cc: Rik van Riel r...@redhat.com Cc: Raghavendra K T raghavendra...@linux.vnet.ibm.com Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org Link: http://lkml.kernel.org/r/1421784755-21945-3-git-send-email-waiman.l...@hp.com --- arch/x86/Kconfig |1 + arch/x86/include/asm/qspinlock.h | 20 arch/x86/include/asm/spinlock.h |5 + arch/x86/include/asm/spinlock_types.h |4 4 files changed, 30 insertions(+) create mode 100644 arch/x86/include/asm/qspinlock.h --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -125,6 +125,7 @@ config X86 select MODULES_USE_ELF_RELA if X86_64 select CLONE_BACKWARDS if X86_32 select ARCH_USE_BUILTIN_BSWAP + select ARCH_USE_QUEUE_SPINLOCK select ARCH_USE_QUEUE_RWLOCK select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION select OLD_SIGACTION if X86_32 --- /dev/null +++ b/arch/x86/include/asm/qspinlock.h @@ -0,0 +1,20 @@ +#ifndef _ASM_X86_QSPINLOCK_H +#define _ASM_X86_QSPINLOCK_H + +#include asm-generic/qspinlock_types.h + +#definequeue_spin_unlock queue_spin_unlock +/** + * queue_spin_unlock - release a queue spinlock + * @lock : Pointer to queue spinlock structure + * + * An smp_store_release() on the least-significant byte. 
+ */ +static inline void queue_spin_unlock(struct qspinlock *lock) +{ + smp_store_release((u8 *)lock, 0); +} + +#include asm-generic/qspinlock.h + +#endif /* _ASM_X86_QSPINLOCK_H */ --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -42,6 +42,10 @@ extern struct static_key paravirt_ticketlocks_enabled; static __always_inline bool static_key_false(struct static_key *key); +#ifdef CONFIG_QUEUE_SPINLOCK +#include asm/qspinlock.h +#else + #ifdef CONFIG_PARAVIRT_SPINLOCKS static inline void __ticket_enter_slowpath(arch_spinlock_t *lock) @@ -196,6 +200,7 @@ static inline void arch_spin_unlock_wait cpu_relax(); } } +#endif /* CONFIG_QUEUE_SPINLOCK */ /* * Read-write spinlocks, allowing multiple readers --- a/arch/x86/include/asm/spinlock_types.h +++ b/arch/x86/include/asm/spinlock_types.h @@ -23,6 +23,9 @@ typedef u32 __ticketpair_t; #define TICKET_SHIFT (sizeof(__ticket_t) * 8) +#ifdef CONFIG_QUEUE_SPINLOCK +#include asm-generic/qspinlock_types.h +#else typedef struct arch_spinlock { union { __ticketpair_t head_tail; @@ -33,6 +36,7 @@ typedef struct arch_spinlock { } arch_spinlock_t; #define __ARCH_SPIN_LOCK_UNLOCKED { { 0 } } +#endif /* CONFIG_QUEUE_SPINLOCK */ #include asm-generic/qrwlock_types.h -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 6/9] qspinlock: Use a simple write to grab the lock
From: Waiman Long waiman.l...@hp.com Currently, atomic_cmpxchg() is used to get the lock. However, this is not really necessary if there is more than one task in the queue and the queue head doesn't need to reset the tail code. For that case, a simple write to set the lock bit is enough as the queue head will be the only one eligible to get the lock as long as it checks that both the lock and pending bits are not set. The current pending bit waiting code will ensure that the bit will not be set as soon as the tail code in the lock is set. With that change, there is some slight improvement in the performance of the queue spinlock in the 5M loop micro-benchmark run on a 4-socket Westmere-EX machine as shown in the tables below.

[Standalone/Embedded - same node]
  # of tasks    Before patch    After patch    %Change
  3             2324/2321       2248/2265      -3%/-2%
  4             2890/2896       2819/2831      -2%/-2%
  5             3611/3595       3522/3512      -2%/-2%
  6             4281/4276       4173/4160      -3%/-3%
  7             5018/5001       4875/4861      -3%/-3%
  8             5759/5750       5563/5568      -3%/-3%

[Standalone/Embedded - different nodes]
  # of tasks    Before patch    After patch    %Change
  3             12242/12237     12087/12093    -1%/-1%
  4             10688/10696     10507/10521    -2%/-2%

It was also found that this change produced a much bigger performance improvement in the newer IvyBridge-EX chip and essentially closed the performance gap between the ticket spinlock and queue spinlock. The disk workload of the AIM7 benchmark was run on a 4-socket Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users on a 3.14 based kernel. The results of the test runs were:

AIM7 XFS Disk Test
  kernel        JPM        Real Time    Sys Time    Usr Time
  ticketlock    5678233    3.17         96.61       5.81
  qspinlock     5750799    3.13         94.83       5.97

AIM7 EXT4 Disk Test
  kernel        JPM        Real Time    Sys Time    Usr Time
  ticketlock    1114551    16.15        509.72      7.11
  qspinlock     2184466    8.24         232.99      6.01

The ext4 filesystem run had a much higher spinlock contention than the xfs filesystem run.
The ebizzy -m test was also run with the following results:

  kernel        records/s    Real Time    Sys Time    Usr Time
  ticketlock    2075         10.00        216.35      3.49
  qspinlock     3023         10.00        198.20      4.80

Cc: H. Peter Anvin h...@zytor.com Cc: David Vrabel david.vra...@citrix.com Cc: Oleg Nesterov o...@redhat.com Cc: Scott J Norton scott.nor...@hp.com Cc: Paolo Bonzini paolo.bonz...@gmail.com Cc: Douglas Hatch doug.ha...@hp.com Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com Cc: Boris Ostrovsky boris.ostrov...@oracle.com Cc: Paul E. McKenney paul...@linux.vnet.ibm.com Cc: Rik van Riel r...@redhat.com Cc: Linus Torvalds torva...@linux-foundation.org Cc: Raghavendra K T raghavendra...@linux.vnet.ibm.com Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@redhat.com Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org Link: http://lkml.kernel.org/r/1421784755-21945-7-git-send-email-waiman.l...@hp.com
---
 kernel/locking/qspinlock.c | 61 +
 1 file changed, 45 insertions(+), 16 deletions(-)

--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -105,24 +105,33 @@ static inline struct mcs_spinlock *decod
  * By using the whole 2nd least significant byte for the pending bit, we
  * can allow better optimization of the lock acquisition for the pending
  * bit holder.
+ *
+ * This internal structure is also used by the set_locked function which
+ * is not restricted to _Q_PENDING_BITS == 8.
  */
-#if _Q_PENDING_BITS == 8
-
 struct __qspinlock {
	union {
		atomic_t val;
-		struct {
 #ifdef __LITTLE_ENDIAN
+		u8	locked;
+		struct {
			u16	locked_pending;
			u16	tail;
+		};
 #else
+		struct {
			u16	tail;
			u16	locked_pending;
-#endif
		};
+		struct {
+			u8	reserved[3];
+			u8	locked;
+		};
+#endif
	};
 };

+#if _Q_PENDING_BITS == 8
 /**
  * clear_pending_set_locked - take ownership and clear the pending bit.
  * @lock: Pointer to queue
[PATCH 7/9] qspinlock: Revert to test-and-set on hypervisors
From: Peter Zijlstra pet...@infradead.org When we detect a hypervisor (!paravirt, see qspinlock paravirt support patches), revert to a simple test-and-set lock to avoid the horrors of queue preemption. Cc: Ingo Molnar mi...@redhat.com Cc: David Vrabel david.vra...@citrix.com Cc: Oleg Nesterov o...@redhat.com Cc: Scott J Norton scott.nor...@hp.com Cc: Paolo Bonzini paolo.bonz...@gmail.com Cc: Douglas Hatch doug.ha...@hp.com Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com Cc: Boris Ostrovsky boris.ostrov...@oracle.com Cc: Paul E. McKenney paul...@linux.vnet.ibm.com Cc: Linus Torvalds torva...@linux-foundation.org Cc: Thomas Gleixner t...@linutronix.de Cc: H. Peter Anvin h...@zytor.com Cc: Rik van Riel r...@redhat.com Cc: Raghavendra K T raghavendra...@linux.vnet.ibm.com Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org Link: http://lkml.kernel.org/r/1421784755-21945-8-git-send-email-waiman.l...@hp.com --- arch/x86/include/asm/qspinlock.h | 14 ++ include/asm-generic/qspinlock.h |7 +++ kernel/locking/qspinlock.c |3 +++ 3 files changed, 24 insertions(+) --- a/arch/x86/include/asm/qspinlock.h +++ b/arch/x86/include/asm/qspinlock.h @@ -1,6 +1,7 @@ #ifndef _ASM_X86_QSPINLOCK_H #define _ASM_X86_QSPINLOCK_H +#include asm/cpufeature.h #include asm-generic/qspinlock_types.h #definequeue_spin_unlock queue_spin_unlock @@ -15,6 +16,19 @@ static inline void queue_spin_unlock(str smp_store_release((u8 *)lock, 0); } +#define virt_queue_spin_lock virt_queue_spin_lock + +static inline bool virt_queue_spin_lock(struct qspinlock *lock) +{ + if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) + return false; + + while (atomic_cmpxchg(lock-val, 0, _Q_LOCKED_VAL) != 0) + cpu_relax(); + + return true; +} + #include asm-generic/qspinlock.h #endif /* _ASM_X86_QSPINLOCK_H */ --- a/include/asm-generic/qspinlock.h +++ b/include/asm-generic/qspinlock.h @@ -111,6 +111,13 @@ static inline void queue_spin_unlock_wai cpu_relax(); } +#ifndef 
virt_queue_spin_lock +static __always_inline bool virt_queue_spin_lock(struct qspinlock *lock) +{ + return false; +} +#endif + /* * Initializier */ --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -259,6 +259,9 @@ void queue_spin_lock_slowpath(struct qsp BUILD_BUG_ON(CONFIG_NR_CPUS = (1U _Q_TAIL_CPU_BITS)); + if (virt_queue_spin_lock(lock)) + return; + /* * wait for in-progress pending-locked hand-overs * -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/9] qspinlock: Add pending bit
From: Peter Zijlstra pet...@infradead.org Because the qspinlock needs to touch a second cacheline (the per-cpu mcs_nodes[]), add a pending bit and allow a single in-word spinner before we punt to the second cacheline. It is possible to observe the pending bit without the locked bit when the last owner has just released but the pending owner has not yet taken ownership. In this case we would normally queue -- because the pending bit is already taken. However, in this case the pending bit is guaranteed to be released 'soon', therefore wait for it and avoid queueing. Cc: Ingo Molnar mi...@redhat.com Cc: David Vrabel david.vra...@citrix.com Cc: Oleg Nesterov o...@redhat.com Cc: Scott J Norton scott.nor...@hp.com Cc: Paolo Bonzini paolo.bonz...@gmail.com Cc: Douglas Hatch doug.ha...@hp.com Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com Cc: Boris Ostrovsky boris.ostrov...@oracle.com Cc: Paul E. McKenney paul...@linux.vnet.ibm.com Cc: Linus Torvalds torva...@linux-foundation.org Cc: Thomas Gleixner t...@linutronix.de Cc: H.
Peter Anvin h...@zytor.com Cc: Rik van Riel r...@redhat.com Cc: Raghavendra K T raghavendra...@linux.vnet.ibm.com Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org Link: http://lkml.kernel.org/r/1421784755-21945-4-git-send-email-waiman.l...@hp.com --- include/asm-generic/qspinlock_types.h | 12 ++- kernel/locking/qspinlock.c| 119 -- 2 files changed, 107 insertions(+), 24 deletions(-) --- a/include/asm-generic/qspinlock_types.h +++ b/include/asm-generic/qspinlock_types.h @@ -36,8 +36,9 @@ typedef struct qspinlock { * Bitfields in the atomic value: * * 0- 7: locked byte - * 8- 9: tail index - * 10-31: tail cpu (+1) + * 8: pending + * 9-10: tail index + * 11-31: tail cpu (+1) */ #define_Q_SET_MASK(type) (((1U _Q_ ## type ## _BITS) - 1)\ _Q_ ## type ## _OFFSET) @@ -45,7 +46,11 @@ typedef struct qspinlock { #define _Q_LOCKED_BITS 8 #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED) -#define _Q_TAIL_IDX_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS) +#define _Q_PENDING_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS) +#define _Q_PENDING_BITS1 +#define _Q_PENDING_MASK_Q_SET_MASK(PENDING) + +#define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS) #define _Q_TAIL_IDX_BITS 2 #define _Q_TAIL_IDX_MASK _Q_SET_MASK(TAIL_IDX) @@ -54,5 +59,6 @@ typedef struct qspinlock { #define _Q_TAIL_CPU_MASK _Q_SET_MASK(TAIL_CPU) #define _Q_LOCKED_VAL (1U _Q_LOCKED_OFFSET) +#define _Q_PENDING_VAL (1U _Q_PENDING_OFFSET) #endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */ --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -94,24 +94,28 @@ static inline struct mcs_spinlock *decod return per_cpu_ptr(mcs_nodes[idx], cpu); } +#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK) + /** * queue_spin_lock_slowpath - acquire the queue spinlock * @lock: Pointer to queue spinlock structure * @val: Current value of the queue spinlock 32-bit word * - * (queue tail, lock value) - * - * fast :slow : unlock - *: : - * uncontended (0,0) --:-- 
(0,1) :-- (*,0) - *: | ^./ : - *: v \ | : - * uncontended:(n,x) --+-- (n,0) | : - * queue: | ^--' | : - *: v | : - * contended :(*,x) --+-- (*,0) - (*,1) ---' : - * queue: ^--' : + * (queue tail, pending bit, lock value) * + * fast :slow :unlock + * : : + * uncontended (0,0,0) -:-- (0,0,1) --:-- (*,*,0) + * : | ^.--. / : + * : v \ \| : + * pending :(0,1,1) +-- (0,1,0) \ | : + * : | ^--' | | : + * : v | | : + * uncontended :(n,x,y) +-- (n,0,0) --' | : + * queue : | ^--' | : + * : v | : + * contended :(*,x,y) +-- (*,0,0) --- (*,0,1) -' : + * queue : ^--' : */ void queue_spin_lock_slowpath(struct
[PATCH 8/9] qspinlock: Generic paravirt support
Implement simple paravirt support for the qspinlock. Provide a separate (second) version of the spin_lock_slowpath for paravirt along with a special unlock path. The second slowpath is generated by adding a few pv hooks to the normal slowpath, but where those will compile away for the native case, they expand into special wait/wake code for the pv version. The actual MCS queue can use extra storage in the mcs_nodes[] array to keep track of state and therefore uses directed wakeups. The head contender has no such storage available and reverts to the per-cpu lock entry similar to the current kvm code. We can do a single enrty because any nesting will wake the vcpu and cause the lower loop to retry. Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org --- include/asm-generic/qspinlock.h |3 kernel/locking/qspinlock.c | 69 +- kernel/locking/qspinlock_paravirt.h | 177 3 files changed, 248 insertions(+), 1 deletion(-) --- a/include/asm-generic/qspinlock.h +++ b/include/asm-generic/qspinlock.h @@ -118,6 +118,9 @@ static __always_inline bool virt_queue_s } #endif +extern void __pv_queue_spin_lock_slowpath(struct qspinlock *lock, u32 val); +extern void __pv_queue_spin_unlock(struct qspinlock *lock); + /* * Initializier */ --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -18,6 +18,9 @@ * Authors: Waiman Long waiman.l...@hp.com * Peter Zijlstra pet...@infradead.org */ + +#ifndef _GEN_PV_LOCK_SLOWPATH + #include linux/smp.h #include linux/bug.h #include linux/cpumask.h @@ -65,13 +68,21 @@ #include mcs_spinlock.h +#ifdef CONFIG_PARAVIRT_SPINLOCKS +#define MAX_NODES 8 +#else +#define MAX_NODES 4 +#endif + /* * Per-CPU queue node structures; we can never have more than 4 nested * contexts: task, softirq, hardirq, nmi. * * Exactly fits one 64-byte cacheline on a 64-bit architecture. + * + * PV doubles the storage and uses the second cacheline for PV state. 
*/ -static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]); +static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[MAX_NODES]); /* * We must be able to distinguish between no-tail and the tail at 0:0, @@ -230,6 +241,32 @@ static __always_inline void set_locked(s WRITE_ONCE(l-locked, _Q_LOCKED_VAL); } + +/* + * Generate the native code for queue_spin_unlock_slowpath(); provide NOPs for + * all the PV callbacks. + */ + +static __always_inline void __pv_init_node(struct mcs_spinlock *node) { } +static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { } +static __always_inline void __pv_kick_node(struct mcs_spinlock *node) { } + +static __always_inline void __pv_wait_head(struct qspinlock *lock) { } + +#define pv_enabled() false + +#define pv_init_node __pv_init_node +#define pv_wait_node __pv_wait_node +#define pv_kick_node __pv_kick_node + +#define pv_wait_head __pv_wait_head + +#ifdef CONFIG_PARAVIRT_SPINLOCKS +#define queue_spin_lock_slowpath native_queue_spin_lock_slowpath +#endif + +#endif /* _GEN_PV_LOCK_SLOWPATH */ + /** * queue_spin_lock_slowpath - acquire the queue spinlock * @lock: Pointer to queue spinlock structure @@ -259,6 +296,9 @@ void queue_spin_lock_slowpath(struct qsp BUILD_BUG_ON(CONFIG_NR_CPUS = (1U _Q_TAIL_CPU_BITS)); + if (pv_enabled()) + goto queue; + if (virt_queue_spin_lock(lock)) return; @@ -335,6 +375,7 @@ void queue_spin_lock_slowpath(struct qsp node += idx; node-locked = 0; node-next = NULL; + pv_init_node(node); /* * We touched a (possibly) cold cacheline in the per-cpu queue node; @@ -360,6 +401,7 @@ void queue_spin_lock_slowpath(struct qsp prev = decode_tail(old); WRITE_ONCE(prev-next, node); + pv_wait_node(node); arch_mcs_spin_lock_contended(node-locked); } @@ -374,6 +416,7 @@ void queue_spin_lock_slowpath(struct qsp * sequentiality; this is because the set_locked() function below * does not imply a full barrier. 
*/ + pv_wait_head(lock); while ((val = smp_load_acquire(lock-val.counter)) _Q_LOCKED_PENDING_MASK) cpu_relax(); @@ -406,6 +449,7 @@ void queue_spin_lock_slowpath(struct qsp cpu_relax(); arch_mcs_spin_unlock_contended(next-locked); + pv_kick_node(next); release: /* @@ -414,3 +458,26 @@ void queue_spin_lock_slowpath(struct qsp this_cpu_dec(mcs_nodes[0].count); } EXPORT_SYMBOL(queue_spin_lock_slowpath); + +/* + * Generate the paravirt code for queue_spin_unlock_slowpath(). + */ +#if !defined(_GEN_PV_LOCK_SLOWPATH) defined(CONFIG_PARAVIRT_SPINLOCKS) +#define _GEN_PV_LOCK_SLOWPATH + +#undef pv_enabled +#define pv_enabled() true + +#undef pv_init_node +#undef pv_wait_node +#undef
[PATCH 1/9] qspinlock: A simple generic 4-byte queue spinlock
From: Waiman Long waiman.l...@hp.com This patch introduces a new generic queue spinlock implementation that can serve as an alternative to the default ticket spinlock. Compared with the ticket spinlock, this queue spinlock should be almost as fair as the ticket spinlock. It has about the same speed in single-thread and it can be much faster in high contention situations especially when the spinlock is embedded within the data structure to be protected. Only in light to moderate contention where the average queue depth is around 1-3 will this queue spinlock be potentially a bit slower due to the higher slowpath overhead. This queue spinlock is especially suited to NUMA machines with a large number of cores as the chance of spinlock contention is much higher in those machines. The cost of contention is also higher because of slower inter-node memory traffic. Due to the fact that spinlocks are acquired with preemption disabled, the process will not be migrated to another CPU while it is trying to get a spinlock. Ignoring interrupt handling, a CPU can only be contending in one spinlock at any one time. Counting soft IRQ, hard IRQ and NMI, a CPU can only have a maximum of 4 concurrent lock waiting activities. By allocating a set of per-cpu queue nodes and using them to form a waiting queue, we can encode the queue node address into a much smaller 24-bit size (including CPU number and queue node index) leaving one byte for the lock. Please note that the queue node is only needed when waiting for the lock. Once the lock is acquired, the queue node can be released to be used later. Cc: H. Peter Anvin h...@zytor.com Cc: David Vrabel david.vra...@citrix.com Cc: Oleg Nesterov o...@redhat.com Cc: Scott J Norton scott.nor...@hp.com Cc: Paolo Bonzini paolo.bonz...@gmail.com Cc: Douglas Hatch doug.ha...@hp.com Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com Cc: Boris Ostrovsky boris.ostrov...@oracle.com Cc: Paul E.
McKenney paul...@linux.vnet.ibm.com Cc: Rik van Riel r...@redhat.com Cc: Linus Torvalds torva...@linux-foundation.org Cc: Raghavendra K T raghavendra...@linux.vnet.ibm.com Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@redhat.com Signed-off-by: Waiman Long waiman.l...@hp.com Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org Link: http://lkml.kernel.org/r/1421784755-21945-2-git-send-email-waiman.l...@hp.com --- include/asm-generic/qspinlock.h | 132 + include/asm-generic/qspinlock_types.h | 58 + kernel/Kconfig.locks |7 + kernel/locking/Makefile |1 kernel/locking/mcs_spinlock.h |1 kernel/locking/qspinlock.c| 209 ++ 6 files changed, 408 insertions(+) create mode 100644 include/asm-generic/qspinlock.h create mode 100644 include/asm-generic/qspinlock_types.h create mode 100644 kernel/locking/qspinlock.c --- /dev/null +++ b/include/asm-generic/qspinlock.h @@ -0,0 +1,132 @@ +/* + * Queue spinlock + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P. + * + * Authors: Waiman Long waiman.l...@hp.com + */ +#ifndef __ASM_GENERIC_QSPINLOCK_H +#define __ASM_GENERIC_QSPINLOCK_H + +#include asm-generic/qspinlock_types.h + +/** + * queue_spin_is_locked - is the spinlock locked? 
+ * @lock: Pointer to queue spinlock structure + * Return: 1 if it is locked, 0 otherwise + */ +static __always_inline int queue_spin_is_locked(struct qspinlock *lock) +{ + return atomic_read(lock-val); +} + +/** + * queue_spin_value_unlocked - is the spinlock structure unlocked? + * @lock: queue spinlock structure + * Return: 1 if it is unlocked, 0 otherwise + * + * N.B. Whenever there are tasks waiting for the lock, it is considered + * locked wrt the lockref code to avoid lock stealing by the lockref + * code and change things underneath the lock. This also allows some + * optimizations to be applied without conflict with lockref. + */ +static __always_inline int queue_spin_value_unlocked(struct qspinlock lock) +{ + return !atomic_read(lock.val); +} + +/** + * queue_spin_is_contended - check if the lock is contended + * @lock : Pointer to queue spinlock structure + * Return: 1 if lock contended, 0 otherwise + */ +static __always_inline int queue_spin_is_contended(struct qspinlock *lock) +{ + return atomic_read(lock-val) ~_Q_LOCKED_MASK; +} +/** + * queue_spin_trylock - try to acquire the queue spinlock + * @lock : Pointer
[PATCH 9/9] qspinlock,x86,kvm: Implement KVM support for paravirt qspinlock
Implement the paravirt qspinlock for x86-kvm. We use the regular paravirt call patching to switch between: native_queue_spin_lock_slowpath() __pv_queue_spin_lock_slowpath() native_queue_spin_unlock()__pv_queue_spin_unlock() We use a callee saved call for the unlock function which reduces the i-cache footprint and allows 'inlining' of SPIN_UNLOCK functions again. We further optimize the unlock path by patching the direct call with a movb $0,%arg1 if we are indeed using the native unlock code. This makes the unlock code almost as fast as the !PARAVIRT case. This significantly lowers the overhead of having CONFIG_PARAVIRT_SPINLOCKS enabled, even for native code. Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org --- arch/x86/Kconfig |2 - arch/x86/include/asm/paravirt.h | 28 - arch/x86/include/asm/paravirt_types.h | 10 +++ arch/x86/include/asm/qspinlock.h | 22 - arch/x86/kernel/kvm.c | 44 ++ arch/x86/kernel/paravirt-spinlocks.c | 24 +- arch/x86/kernel/paravirt_patch_32.c | 22 + arch/x86/kernel/paravirt_patch_64.c | 22 + kernel/Kconfig.locks |2 - 9 files changed, 163 insertions(+), 13 deletions(-) --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -692,7 +692,7 @@ config PARAVIRT_DEBUG config PARAVIRT_SPINLOCKS bool Paravirtualization layer for spinlocks depends on PARAVIRT SMP - select UNINLINE_SPIN_UNLOCK + select UNINLINE_SPIN_UNLOCK if !QUEUE_SPINLOCK ---help--- Paravirtualized spinlocks allow a pvops backend to replace the spinlock implementation with something virtualization-friendly --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -712,6 +712,30 @@ static inline void __set_fixmap(unsigned #if defined(CONFIG_SMP) defined(CONFIG_PARAVIRT_SPINLOCKS) +#ifdef CONFIG_QUEUE_SPINLOCK + +static __always_inline void pv_queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) +{ + PVOP_VCALL2(pv_lock_ops.queue_spin_lock_slowpath, lock, val); +} + +static __always_inline void pv_queue_spin_unlock(struct qspinlock *lock) +{ + 
PVOP_VCALLEE1(pv_lock_ops.queue_spin_unlock, lock); +} + +static __always_inline void pv_wait(u8 *ptr, u8 val) +{ + PVOP_VCALL2(pv_lock_ops.wait, ptr, val); +} + +static __always_inline void pv_kick(int cpu) +{ + PVOP_VCALL1(pv_lock_ops.kick, cpu); +} + +#else /* !CONFIG_QUEUE_SPINLOCK */ + static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock, __ticket_t ticket) { @@ -724,7 +748,9 @@ static __always_inline void __ticket_unl PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket); } -#endif +#endif /* CONFIG_QUEUE_SPINLOCK */ + +#endif /* SMP PARAVIRT_SPINLOCKS */ #ifdef CONFIG_X86_32 #define PV_SAVE_REGS pushl %ecx; pushl %edx; --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -333,9 +333,19 @@ struct arch_spinlock; typedef u16 __ticket_t; #endif +struct qspinlock; + struct pv_lock_ops { +#ifdef CONFIG_QUEUE_SPINLOCK + void (*queue_spin_lock_slowpath)(struct qspinlock *lock, u32 val); + struct paravirt_callee_save queue_spin_unlock; + + void (*wait)(u8 *ptr, u8 val); + void (*kick)(int cpu); +#else /* !CONFIG_QUEUE_SPINLOCK */ struct paravirt_callee_save lock_spinning; void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket); +#endif /* !CONFIG_QUEUE_SPINLOCK */ }; /* This contains all the paravirt structures: we get a convenient --- a/arch/x86/include/asm/qspinlock.h +++ b/arch/x86/include/asm/qspinlock.h @@ -3,6 +3,7 @@ #include asm/cpufeature.h #include asm-generic/qspinlock_types.h +#include asm/paravirt.h #definequeue_spin_unlock queue_spin_unlock /** @@ -11,11 +12,30 @@ * * An smp_store_release() on the least-significant byte. 
*/ -static inline void queue_spin_unlock(struct qspinlock *lock) +static inline void native_queue_spin_unlock(struct qspinlock *lock) { smp_store_release((u8 *)lock, 0); } +#ifdef CONFIG_PARAVIRT_SPINLOCKS +extern void native_queue_spin_lock_slowpath(struct qspinlock *lock, u32 val); + +static inline void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val) +{ + pv_queue_spin_lock_slowpath(lock, val); +} + +static inline void queue_spin_unlock(struct qspinlock *lock) +{ + pv_queue_spin_unlock(lock); +} +#else +static inline void queue_spin_unlock(struct qspinlock *lock) +{ + native_queue_spin_unlock(lock); +} +#endif + #define virt_queue_spin_lock virt_queue_spin_lock static inline bool virt_queue_spin_lock(struct qspinlock *lock) --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -584,6 +584,41 @@ static void kvm_kick_cpu(int cpu
[PATCH 0/9] qspinlock stuff -v15
Hi Waiman, As promised; here is the paravirt stuff I did during the trip to BOS last week. All the !paravirt patches are more or less the same as before (the only real change is the copyright lines in the first patch). The paravirt stuff is 'simple' and KVM only -- the Xen code was a little more convoluted and I've no real way to test that but it should be straightforward to make work. I ran this using the virtme tool (thanks Andy) on my laptop with a 4x overcommit on vcpus (16 vcpus as compared to the 4 my laptop actually has) and it both booted and survived a hackbench run (perf bench sched messaging -g 20 -l 5000). So while the paravirt code isn't the most optimal code ever conceived it does work. Also, the paravirt patching includes replacing the call with movb $0, %arg1 for the native case, which should greatly reduce the cost of having CONFIG_PARAVIRT_SPINLOCKS enabled on actual hardware. I feel that if someone were to do a Xen patch we can go ahead and merge this stuff (finally!). These patches do not implement the paravirt spinlock debug stats currently implemented (separately) by KVM and Xen, but that should not be too hard to do on top and in the 'generic' code -- no reason to duplicate all that. Of course; once this lands people can look at improving the paravirt nonsense.
Re: virtio balloon: do not call blocking ops when !TASK_RUNNING
On Thu, Feb 26, 2015 at 09:30:31AM +0100, Michael S. Tsirkin wrote: On Thu, Feb 26, 2015 at 11:50:42AM +1030, Rusty Russell wrote: Thomas Huth th...@linux.vnet.ibm.com writes: Hi all, with the recent kernel 3.19, I get a kernel warning when I start my KVM guest on s390 with virtio balloon enabled: The deeper problem is that virtio_ccw_get_config just silently fails on OOM. Neither get_config nor set_config are expected to fail. Cornelia, I think ccw and config_area should be allocated inside vcdev. You could either use pointers, or simply allocate vcdev with GFP_DMA. This would avoid the kmalloc inside these calls. Thanks, Rusty. But it won't solve the problem of nested sleepers with ccw: ATM it invokes ccw_io_helper to execute commands, and that one calls wait_event to wait for an interrupt. Might be fixable but I think my patch looks like a safer solution for 4.0/3.19, no? I've no idea what your patch was since I'm not subscribed to any of the lists this discussion is had on. But you can annotate the warning away; _however_ along with the annotation there needs to be a big comment explaining why it's safe to do so. Typically that involves talking about how it's actually rare for the call to sleep. So occasional sleeps inside a wait_event() are ok-ish, we'll just get to go around once more. But once you consistently sleep inside a wait_event() things go a bit funny. So for instance; if in ccw_io_helper() we expect wait_event(..., !doing_io()) to be (mostly) true on first go, then we'll never get into __wait_event() and ->state won't actually be mucked about with. The thing to avoid is not actually sleeping (much) but setting TASK_RUNNING and turning the entire thing into a giant poll loop.
Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu->wq
On Tue, Feb 17, 2015 at 06:44:19PM +0100, Sebastian Andrzej Siewior wrote: * Peter Zijlstra | 2015-01-21 16:07:16 [+0100]: On Tue, Jan 20, 2015 at 01:16:13PM -0500, Steven Rostedt wrote: I'm actually wondering if we should just nuke the _interruptible() version of swait. As it should only be all interruptible or all not interruptible, that the swait_wake() should just do the wake up regardless. In which case, swait_wake() is good enough. No need to have different versions where people may think do something special. Peter? Yeah, I think the lastest thing I have sitting here on my disk only has the swake_up() which does TASK_NORMAL, no choice there. what is the swait status in terms of mainline? This sounds like it beeing worked on. I could take the series but then I would drop it again if the mainline implementation changes… Well, 'worked' on might be putting too much on it, its one of the many many 'spare' time things that never get attention unless people bug me ;-) The below is my current patch, I've not actually tried it yet, I (think I) had one patch doing some conversions but I'm having trouble locating it. Mostly-Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org --- include/linux/swait.h | 172 ++ kernel/sched/swait.c | 122 +++ 2 files changed, 294 insertions(+) --- /dev/null +++ b/include/linux/swait.h @@ -0,0 +1,172 @@ +#ifndef _LINUX_SWAIT_H +#define _LINUX_SWAIT_H + +#include linux/list.h +#include linux/stddef.h +#include linux/spinlock.h +#include asm/current.h + +/* + * Simple wait queues + * + * While these are very similar to the other/complex wait queues (wait.h) the + * most important difference is that the simple waitqueue allows for + * deterministic behaviour -- IOW it has strictly bounded IRQ and lock hold + * times. 
+ * + * In order to make this so, we had to drop a fair number of features of the + * other waitqueue code; notably: + * + * - mixing INTERRUPTIBLE and UNINTERRUPTIBLE sleeps on the same waitqueue; + *all wakeups are TASK_NORMAL in order to avoid O(n) lookups for the right + *sleeper state. + * + * - the exclusive mode; because this requires preserving the list order + *and this is hard. + * + * - custom wake functions; because you cannot give any guarantees about + *random code. + * + * As a side effect of this; the data structures are slimmer. + * + * One would recommend using this wait queue where possible. + */ + +struct task_struct; + +struct swait_queue_head { + raw_spinlock_t lock; + struct list_headtask_list; +}; + +struct swait_queue { + struct task_struct *task; + struct list_headtask_list; +}; + +#define __SWAITQUEUE_INITIALIZER(name) { \ + .task = current, \ + .task_list = LIST_HEAD_INIT((name).task_list), \ +} + +#define DECLARE_SWAITQUEUE(name) \ + struct swait_queue name = __SWAITQUEUE_INITIALIZER(name) + +#define __SWAIT_QUEUE_HEAD_INITIALIZER(name) { \ + .lock = __RAW_SPIN_LOCK_UNLOCKED(name.lock), \ + .task_list = LIST_HEAD_INIT((name).task_list), \ +} + +#define DECLARE_SWAIT_QUEUE_HEAD(name) \ + struct swait_queue_head name = __SWAIT_QUEUE_HEAD_INITIALIZER(name) + +extern void __init_swait_queue_head(struct swait_queue_head *q, const char *name, + struct lock_class_key *key); + +#define init_swait_queue_head(q) \ + do {\ + static struct lock_class_key __key; \ + __init_swait_queue_head((q), #q, __key); \ + } while (0) + +#ifdef CONFIG_LOCKDEP +# define __SWAIT_QUEUE_HEAD_INIT_ONSTACK(name) \ + ({ init_swait_queue_head(name); name; }) +# define DECLARE_SWAIT_QUEUE_HEAD_ONSTACK(name)\ + struct swait_queue_head name = __SWAIT_QUEUE_HEAD_INIT_ONSTACK(name) +#else +# define DECLARE_SWAIT_QUEUE_HEAD_ONSTACK(name)\ + DECLARE_SWAIT_QUEUE_HEAD(name) +#endif + +static inline int swait_active(struct swait_queue_head *q) +{ + return 
!list_empty(&q->task_list);
+}
+
+extern void swake_up(struct swait_queue_head *q);
+extern void swake_up_all(struct swait_queue_head *q);
+extern void swake_up_locked(struct swait_queue_head *q);
+
+extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait);
+extern void prepare_to_swait(struct swait_queue_head *q, struct swait_queue *wait, int state);
+extern long prepare_to_swait_event(struct swait_queue_head *q, struct swait_queue *wait, int state);
+
+extern void __finish_swait(struct
Re: [PATCH V3] x86 spinlock: Fix memory corruption on completing completions
On Thu, Feb 12, 2015 at 05:17:27PM +0530, Raghavendra K T wrote:
> Paravirt spinlock clears slowpath flag after doing unlock. As explained by Linus currently it does:
>
> 	prev = *lock;
> 	add_smp(&lock->tickets.head, TICKET_LOCK_INC);
>
> 	/* add_smp() is a full mb() */
>
> 	if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
> 		__ticket_unlock_slowpath(lock, prev);
>
> which is *exactly* the kind of thing you cannot do with spinlocks, because after you've done the add_smp() and released the spinlock for the fast-path, you can't access the spinlock any more. Exactly because a fast-path lock might come in, and release the whole data structure.
>
> Linus suggested that we should not do any writes to lock after unlock(), and we can move slowpath clearing to fastpath lock.
>
> So this patch implements the fix with:
> 1. Moving slowpath flag to head (Oleg).
> 2. use xadd to avoid read/write after unlock that checks the need for unlock_kick (Linus).

Maybe spend a few more words explaining these things; something like:

"Unlocked locks don't care about the slowpath flag; therefore we can keep it set after the last unlock, as long as we clear it again on the first (try)lock -- this removes the write after unlock. By moving the slowpath flag from the tail to the head ticket we avoid the need to access both the head and tail tickets on unlock. We can further avoid the need for a read-after-release by using xadd; the prev head value will include the slowpath flag and indicate if we need to do PV kicking of suspended spinners -- on modern chips xadd isn't (much) more expensive than an add + load."

It's 'obvious' now, but maybe not so much after we've all not looked at this for a few months.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions
On Mon, Feb 09, 2015 at 03:04:22PM +0530, Raghavendra K T wrote:
> So we have 3 choices,
> 1. xadd
> 2. continue with current approach.
> 3. a read before unlock and also after that.

For the truly paranoid we have probe_kernel_address(), suppose the lock was in module space and the module just got unloaded under us.
Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu->wq
On Tue, Jan 20, 2015 at 01:16:13PM -0500, Steven Rostedt wrote:
> I'm actually wondering if we should just nuke the _interruptible() version of swait. As it should only be all interruptible or all not interruptible, the swait_wake() should just do the wake up regardless. In which case, swait_wake() is good enough. No need to have different versions where people may think they do something special. Peter?

Yeah, I think the latest thing I have sitting here on my disk only has the swake_up() which does TASK_NORMAL, no choice there.
Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu->wq
On Fri, Jan 16, 2015 at 11:48:46AM -0500, Steven Rostedt wrote:
> > I notice everywhere you have a swait_wake_interruptible() but here. Is there a reason why?
> IIRC, Peter wants to make swait wakeup usage homogeneous. That is, you either sleep in an interruptible state, or you don't. You can't mix and match it. Peter is that what you plan?

Yes, this is required for the single wakeup case to be bounded. If you have both INTERRUPTIBLE and UNINTERRUPTIBLE waiters on a list, and you were allowed to do INTERRUPTIBLE/UNINTERRUPTIBLE wakeups, the wakeup would have to do O(n) iteration to find a matching waiter.

Seeing that the entire purpose of simple wait queues was to retain RT properties, that's entirely retarded ;-)
Re: [PATCH v13 09/11] pvqspinlock, x86: Add para-virtualization support
On Wed, Oct 29, 2014 at 04:19:09PM -0400, Waiman Long wrote:
> arch/x86/include/asm/pvqspinlock.h | 411 +

I do wonder why all this needs to live in x86..

> +#ifdef CONFIG_QUEUE_SPINLOCK
> +
> +static __always_inline void pv_kick_cpu(int cpu)
> +{
> +	PVOP_VCALLEE1(pv_lock_ops.kick_cpu, cpu);
> +}
> +
> +static __always_inline void pv_lockwait(u8 *lockbyte)
> +{
> +	PVOP_VCALLEE1(pv_lock_ops.lockwait, lockbyte);
> +}
> +
> +static __always_inline void pv_lockstat(enum pv_lock_stats type)
> +{
> +	PVOP_VCALLEE1(pv_lock_ops.lockstat, type);
> +}

Why are any of these PV ops? They're only called from other pv_*() functions. What's the point of pv ops you only call from pv code?

> +/*
> + * Queue Spinlock Para-Virtualization (PV) Support
> + *
> + * The PV support code for queue spinlock is roughly the same as that
> + * of the ticket spinlock.

Relative comments are bad, esp. since we'll make the ticket code go away if this works, at which point this is a reference into a black hole.

> Each CPU waiting for the lock will spin until it
> + * reaches a threshold. When that happens, it will put itself to a halt state
> + * so that the hypervisor can reuse the CPU cycles in some other guests as
> + * well as returning other hold-up CPUs faster.

> +/**
> + * queue_spin_lock - acquire a queue spinlock
> + * @lock: Pointer to queue spinlock structure
> + *
> + * N.B. INLINE_SPIN_LOCK should not be enabled when PARAVIRT_SPINLOCK is on.

One should write a compile time fail for that, not a comment.

> + */
> +static __always_inline void queue_spin_lock(struct qspinlock *lock)
> +{
> +	u32 val;
> +
> +	val = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL);
> +	if (likely(val == 0))
> +		return;
> +	if (static_key_false(&paravirt_spinlocks_enabled))
> +		pv_queue_spin_lock_slowpath(lock, val);
> +	else
> +		queue_spin_lock_slowpath(lock, val);
> +}

No, this is just vile.. _that_ is what we have PV ops for. And at that point it's the same function it was before the PV stuff, so that whole inline thing is then gone.
> +extern void queue_spin_unlock_slowpath(struct qspinlock *lock);
> +
>  /**
>   * queue_spin_unlock - release a queue spinlock
>   * @lock : Pointer to queue spinlock structure
>   *
>   * An effective smp_store_release() on the least-significant byte.
> + *
> + * Inlining of the unlock function is disabled when CONFIG_PARAVIRT_SPINLOCKS
> + * is defined. So _raw_spin_unlock() will be the only call site that will
> + * have to be patched.

again if you hard rely on the not inlining make a build fail not a comment.

> + */
>  static inline void queue_spin_unlock(struct qspinlock *lock)
>  {
>  	barrier();
> +	if (!static_key_false(&paravirt_spinlocks_enabled)) {
> +		native_spin_unlock(lock);
> +		return;
> +	}
> +	/*
> +	 * Need to atomically clear the lock byte to avoid racing with
> +	 * queue head waiter trying to set _QLOCK_LOCKED_SLOWPATH.
> +	 */
> +	if (unlikely(cmpxchg((u8 *)lock, _Q_LOCKED_VAL, 0) != _Q_LOCKED_VAL))
> +		queue_spin_unlock_slowpath(lock);
> +}

Idem, that static key stuff is wrong, use PV ops to switch between unlock paths.

> @@ -354,7 +394,7 @@ queue:
>  	 * if there was a previous node; link it and wait until reaching the
>  	 * head of the waitqueue.
>  	 */
> -	if (old & _Q_TAIL_MASK) {
> +	if (!pv_link_and_wait_node(old, node) && (old & _Q_TAIL_MASK)) {
>  		prev = decode_tail(old);
>  		ACCESS_ONCE(prev->next) = node;
> @@ -369,9 +409,11 @@ queue:
>  	 *
>  	 * *,x,y -> *,0,0
>  	 */
> -	while ((val = smp_load_acquire(&lock->val.counter)) &
> -	       _Q_LOCKED_PENDING_MASK)
> +	val = pv_wait_head(lock, node);
> +	while (val & _Q_LOCKED_PENDING_MASK) {
>  		cpu_relax();
> +		val = smp_load_acquire(&lock->val.counter);
> +	}
>
>  	/*
>  	 * claim the lock:

Please make the pv_*() calls return void and reduce to NOPs. This keeps the logic invariant of the pv stuff.
Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support
On Mon, Oct 27, 2014 at 01:15:53PM -0400, Waiman Long wrote:
> On 10/24/2014 06:04 PM, Peter Zijlstra wrote:
> > On Fri, Oct 24, 2014 at 04:53:27PM -0400, Waiman Long wrote:
> > > The additional register pressure may just cause a few more register moves which should be negligible in the overall performance. The additional icache pressure, however, may have some impact on performance. I was trying to balance the performance of the pv and non-pv versions so that we won't penalize the pv code too much for a bit more performance in the non-pv code. Doing it your way will add a lot of function calls and register saving/restoring to the pv code.
> > If people care about performance they should not be using virt crap :-) I only really care about bare metal.
> Yes, I am aware of that. However, the whole point of doing PV spinlock is to improve performance in a virtual guest.

Anything that avoids the lock holder preemption nonsense is a _massive_ win for them, a few function calls should not even register on that scale.

> +#ifdef _GEN_PV_LOCK_SLOWPATH
> +static void pv_queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> +#else
>  void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> +#endif

If you have two functions you might as well use the PV stuff to patch in the right function call at the usage sites and avoid:

> +	if (pv_enabled()) {
> +		pv_queue_spin_lock_slowpath(lock, val);
> +		return;
> +	}

this altogether.
> 	this_cpu_dec(mcs_nodes[0].count);
>  }
>  EXPORT_SYMBOL(queue_spin_lock_slowpath);
> +
> +#if !defined(_GEN_PV_LOCK_SLOWPATH) && defined(CONFIG_PARAVIRT_SPINLOCKS)
> +/*
> + * Generate the PV version of the queue_spin_lock_slowpath function
> + */
> +#undef pv_init_node
> +#undef pv_wait_check
> +#undef pv_link_and_wait_node
> +#undef pv_wait_head
> +#undef EXPORT_SYMBOL
> +#undef in_pv_code
> +
> +#define _GEN_PV_LOCK_SLOWPATH
> +#define EXPORT_SYMBOL(x)
> +#define in_pv_code	return_true
> +#define pv_enabled	return_false
> +
> +#include "qspinlock.c"
> +
> +#endif

That's properly disgusting :-) But a lot better than actually duplicating everything I suppose.
Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support
On Mon, Oct 27, 2014 at 01:38:20PM -0400, Waiman Long wrote:
> On 10/24/2014 04:54 AM, Peter Zijlstra wrote:
> > On Thu, Oct 16, 2014 at 02:10:38PM -0400, Waiman Long wrote:
> > > Since enabling paravirt spinlock will disable unlock function inlining, a jump label can be added to the unlock function without adding patch sites all over the kernel.
> > But you don't have to. My patches allowed for the inline to remain, again reducing the overhead of enabling PV spinlocks while running on a real machine. Look at:
> >
> >   http://lkml.kernel.org/r/20140615130154.213923...@chello.nl
> >
> > In particular this hunk:
> >
> > Index: linux-2.6/arch/x86/kernel/paravirt_patch_64.c
> > ===
> > --- linux-2.6.orig/arch/x86/kernel/paravirt_patch_64.c
> > +++ linux-2.6/arch/x86/kernel/paravirt_patch_64.c
> > @@ -22,6 +22,10 @@ DEF_NATIVE(pv_cpu_ops, swapgs, "swapgs")
> >  DEF_NATIVE(, mov32, "mov %edi, %eax");
> >  DEF_NATIVE(, mov64, "mov %rdi, %rax");
> > +#if defined(CONFIG_PARAVIRT_SPINLOCKS) && defined(CONFIG_QUEUE_SPINLOCK)
> > +DEF_NATIVE(pv_lock_ops, queue_unlock, "movb $0, (%rdi)");
> > +#endif
> > +
> >  unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
> >  {
> >  	return paravirt_patch_insns(insnbuf, len,
> > @@ -61,6 +65,9 @@ unsigned native_patch(u8 type, u16 clobb
> >  		PATCH_SITE(pv_cpu_ops, clts);
> >  		PATCH_SITE(pv_mmu_ops, flush_tlb_single);
> >  		PATCH_SITE(pv_cpu_ops, wbinvd);
> > +#if defined(CONFIG_PARAVIRT_SPINLOCKS) && defined(CONFIG_QUEUE_SPINLOCK)
> > +		PATCH_SITE(pv_lock_ops, queue_unlock);
> > +#endif
> >  	patch_site:
> >  		ret = paravirt_patch_insns(ibuf, len, start, end);
> >
> > That makes sure to overwrite the callee-saved call to the pv_lock_ops::queue_unlock with the immediate asm "movb $0, (%rdi)". Therefore you can retain the inlined unlock with hardly (there might be some NOP padding) any overhead at all. On PV it reverts to a callee saved function call.
> My concern is that spin_unlock() can be called in many places, including loadable kernel modules. Can the paravirt_patch_ident_32() function patch all of them in reasonable time? How about a kernel module loaded later at run time?
modules should be fine, see arch/x86/kernel/module.c:module_finalize() -> apply_paravirt(). Also note that the 'default' text is an indirect call into the paravirt ops table which routes to the 'right' function, so even if the text patching would be 'late' calls would 'work' as expected, just slower.
Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support
On Thu, Oct 16, 2014 at 02:10:38PM -0400, Waiman Long wrote:
> +static inline void pv_init_node(struct mcs_spinlock *node)
> +{
> +	struct pv_qnode *pn = (struct pv_qnode *)node;
> +
> +	BUILD_BUG_ON(sizeof(struct pv_qnode) > 5*sizeof(struct mcs_spinlock));
> +
> +	if (!pv_enabled())
> +		return;
> +
> +	pn->cpustate = PV_CPU_ACTIVE;
> +	pn->mayhalt  = false;
> +	pn->mycpu    = smp_processor_id();
> +	pn->head     = PV_INVALID_HEAD;
> +}
> @@ -333,6 +393,7 @@ queue:
>  	node += idx;
>  	node->locked = 0;
>  	node->next = NULL;
> +	pv_init_node(node);
>
>  	/*
>  	 * We touched a (possibly) cold cacheline in the per-cpu queue node;

So even if !pv_enabled() the compiler will still have to emit the code for that inline, which will generate additional register pressure, icache pressure and lovely stuff like that. The patch I had used pv-ops for these things that would turn into NOPs in the regular case and callee-saved function calls for the PV case. That still does not entirely eliminate cost, but does reduce it significantly. Please consider using that.
Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support
On Thu, Oct 16, 2014 at 02:10:38PM -0400, Waiman Long wrote:
> Since enabling paravirt spinlock will disable unlock function inlining, a jump label can be added to the unlock function without adding patch sites all over the kernel.

But you don't have to. My patches allowed for the inline to remain, again reducing the overhead of enabling PV spinlocks while running on a real machine. Look at:

  http://lkml.kernel.org/r/20140615130154.213923...@chello.nl

In particular this hunk:

Index: linux-2.6/arch/x86/kernel/paravirt_patch_64.c
===
--- linux-2.6.orig/arch/x86/kernel/paravirt_patch_64.c
+++ linux-2.6/arch/x86/kernel/paravirt_patch_64.c
@@ -22,6 +22,10 @@ DEF_NATIVE(pv_cpu_ops, swapgs, "swapgs")
 DEF_NATIVE(, mov32, "mov %edi, %eax");
 DEF_NATIVE(, mov64, "mov %rdi, %rax");
+#if defined(CONFIG_PARAVIRT_SPINLOCKS) && defined(CONFIG_QUEUE_SPINLOCK)
+DEF_NATIVE(pv_lock_ops, queue_unlock, "movb $0, (%rdi)");
+#endif
+
 unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
 {
 	return paravirt_patch_insns(insnbuf, len,
@@ -61,6 +65,9 @@ unsigned native_patch(u8 type, u16 clobb
 		PATCH_SITE(pv_cpu_ops, clts);
 		PATCH_SITE(pv_mmu_ops, flush_tlb_single);
 		PATCH_SITE(pv_cpu_ops, wbinvd);
+#if defined(CONFIG_PARAVIRT_SPINLOCKS) && defined(CONFIG_QUEUE_SPINLOCK)
+		PATCH_SITE(pv_lock_ops, queue_unlock);
+#endif
 	patch_site:
 		ret = paravirt_patch_insns(ibuf, len, start, end);

That makes sure to overwrite the callee-saved call to the pv_lock_ops::queue_unlock with the immediate asm "movb $0, (%rdi)". Therefore you can retain the inlined unlock with hardly (there might be some NOP padding) any overhead at all. On PV it reverts to a callee saved function call.
Re: [PATCH v12 00/11] qspinlock: a 4-byte queue spinlock with PV support
On Thu, Oct 16, 2014 at 02:10:29PM -0400, Waiman Long wrote:
> v11->v12:
>  - Based on PeterZ's version of the qspinlock patch (https://lkml.org/lkml/2014/6/15/63).
>  - Incorporated many of the review comments from Konrad Wilk and Paolo Bonzini.
>  - The pvqspinlock code is largely from my previous version with PeterZ's way of going from queue tail to head and his idea of using callee saved calls to KVM and XEN codes.

Thanks for taking the time to refresh this.. I would prefer you use a little more of the PV techniques I outlined in my latest PV patch to further reduce the overhead of PV enabled kernels on real hardware. This is an important use case, because distro kernels will have to enable PV support while the majority of their installations will be on physical hardware. Other than that I see no reason not to move this forward.
Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support
On Fri, Oct 24, 2014 at 04:53:27PM -0400, Waiman Long wrote:
> The additional register pressure may just cause a few more register moves which should be negligible in the overall performance. The additional icache pressure, however, may have some impact on performance. I was trying to balance the performance of the pv and non-pv versions so that we won't penalize the pv code too much for a bit more performance in the non-pv code. Doing it your way will add a lot of function calls and register saving/restoring to the pv code.

If people care about performance they should not be using virt crap :-) I only really care about bare metal.
Re: [PATCH 1/4] mm: gup: add FOLL_TRIED
On Wed, Oct 01, 2014 at 10:56:34AM +0200, Andrea Arcangeli wrote:
> From: Andres Lagar-Cavilla andre...@google.com

This needs a changelog.

> Reviewed-by: Radim Krčmář rkrc...@redhat.com
> Signed-off-by: Andres Lagar-Cavilla andre...@google.com
> Signed-off-by: Andrea Arcangeli aarca...@redhat.com
> ---
>  include/linux/mm.h | 1 +
>  mm/gup.c | 4
>  2 files changed, 5 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8981cc8..0f4196a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1985,6 +1985,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
>  #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
>  #define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
>  #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
> +#define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
>
>  typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr, void *data);
> diff --git a/mm/gup.c b/mm/gup.c
> index 91d044b..af7ea3e 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -281,6 +281,10 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
>  		fault_flags |= FAULT_FLAG_ALLOW_RETRY;
>  	if (*flags & FOLL_NOWAIT)
>  		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
> +	if (*flags & FOLL_TRIED) {
> +		VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
> +		fault_flags |= FAULT_FLAG_TRIED;
> +	}
>
>  	ret = handle_mm_fault(mm, vma, address, fault_flags);
>  	if (ret & VM_FAULT_ERROR) {
Re: [PATCH 2/4] mm: gup: add get_user_pages_locked and get_user_pages_unlocked
On Wed, Oct 01, 2014 at 10:56:35AM +0200, Andrea Arcangeli wrote:
> +static inline long __get_user_pages_locked(struct task_struct *tsk,
> +					   struct mm_struct *mm,
> +					   unsigned long start,
> +					   unsigned long nr_pages,
> +					   int write, int force,
> +					   struct page **pages,
> +					   struct vm_area_struct **vmas,
> +					   int *locked,
> +					   bool notify_drop)

You might want to consider __always_inline to make sure it does indeed get inlined and constant propagation works for @locked and @notify_drop.
Re: [PATCH 2/4] mm: gup: add get_user_pages_locked and get_user_pages_unlocked
On Wed, Oct 01, 2014 at 10:56:35AM +0200, Andrea Arcangeli wrote:
> +static inline long __get_user_pages_locked(struct task_struct *tsk,
> +					   struct mm_struct *mm,
> +					   unsigned long start,
> +					   unsigned long nr_pages,
> +					   int write, int force,
> +					   struct page **pages,
> +					   struct vm_area_struct **vmas,
> +					   int *locked,
> +					   bool notify_drop)
> +{
> +	if (notify_drop && lock_dropped && *locked) {
> +		/*
> +		 * We must let the caller know we temporarily dropped the lock
> +		 * and so the critical section protected by it was lost.
> +		 */
> +		up_read(&mm->mmap_sem);
> +		*locked = 0;
> +	}
> +	return pages_done;
> +}
> +long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
> +			   unsigned long start, unsigned long nr_pages,
> +			   int write, int force, struct page **pages,
> +			   int *locked)
> +{
> +	return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
> +				       pages, NULL, locked, true);
> +}
> +long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
> +			     unsigned long start, unsigned long nr_pages,
> +			     int write, int force, struct page **pages)
> +{
> +	long ret;
> +	int locked = 1;
> +	down_read(&mm->mmap_sem);
> +	ret = __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
> +				      pages, NULL, &locked, false);
> +	if (locked)
> +		up_read(&mm->mmap_sem);
> +	return ret;
> +}
>  long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  		unsigned long start, unsigned long nr_pages, int write,
>  		int force, struct page **pages, struct vm_area_struct **vmas)
>  {
> +	return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
> +				       pages, vmas, NULL, false);
>  }

I'm wondering about that notify_drop parameter, what's the added benefit? If you look at these 3 callers we can do away with it, since in the second caller, where we have locked but !notify_drop, we seem to do the exact same thing afterwards anyway.
Re: [PATCH 3/4] mm: gup: use get_user_pages_fast and get_user_pages_unlocked
On Wed, Oct 01, 2014 at 10:56:36AM +0200, Andrea Arcangeli wrote:
> Just an optimization.

Does it make sense to split the thing in two? One where you apply _unlocked and then one where you apply _fast?
Re: RFC: get_user_pages_locked|unlocked to leverage VM_FAULT_RETRY
On Thu, Oct 02, 2014 at 02:31:17PM +0200, Andrea Arcangeli wrote:
> On Wed, Oct 01, 2014 at 05:36:11PM +0200, Peter Zijlstra wrote:
> > For all these and the other _fast() users, is there an actual limit to the nr_pages passed in? Because we used to have the 64 pages limit from DIO, but without that we get rather long IRQ-off latencies.
> Ok, I would tend to think this is an issue to solve in the gup_fast implementation, I wouldn't blame or modify the callers for it. I don't think there's anything that prevents gup_fast to enable irqs after a certain number of pages have been taken, nop; and disable the irqs again.

Agreed, I once upon a time had a patch set converting the 2 (x86 and powerpc) gup_fast implementations at the time, but somehow that never got anywhere. Just saying we should probably do that before we add callers with unlimited nr_pages.
Re: RFC: get_user_pages_locked|unlocked to leverage VM_FAULT_RETRY
On Thu, Oct 02, 2014 at 02:50:52PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 02, 2014 at 02:31:17PM +0200, Andrea Arcangeli wrote:
> > On Wed, Oct 01, 2014 at 05:36:11PM +0200, Peter Zijlstra wrote:
> > > For all these and the other _fast() users, is there an actual limit to the nr_pages passed in? Because we used to have the 64 pages limit from DIO, but without that we get rather long IRQ-off latencies.
> > Ok, I would tend to think this is an issue to solve in the gup_fast implementation, I wouldn't blame or modify the callers for it. I don't think there's anything that prevents gup_fast to enable irqs after a certain number of pages have been taken, nop; and disable the irqs again.
> Agreed, I once upon a time had a patch set converting the 2 (x86 and powerpc) gup_fast implementations at the time, but somehow that never got anywhere. Just saying we should probably do that before we add callers with unlimited nr_pages.

https://lkml.org/lkml/2009/6/24/457

Clearly there's more work these days. Many more archs grew a gup.c
Re: RFC: get_user_pages_locked|unlocked to leverage VM_FAULT_RETRY
On Fri, Sep 26, 2014 at 07:25:35PM +0200, Andrea Arcangeli wrote:
> diff --git a/drivers/dma/iovlock.c b/drivers/dma/iovlock.c
> index bb48a57..12ea7c3 100644
> --- a/drivers/dma/iovlock.c
> +++ b/drivers/dma/iovlock.c
> @@ -95,17 +95,11 @@ struct dma_pinned_list *dma_pin_iovec_pages(struct iovec *iov, size_t len)
>  		pages += page_list->nr_pages;
>
>  		/* pin pages down */
> -		down_read(&current->mm->mmap_sem);
> -		ret = get_user_pages(
> -			current,
> -			current->mm,
> +		ret = get_user_pages_fast(
>  			(unsigned long) iov[i].iov_base,
>  			page_list->nr_pages,
>  			1,	/* write */
> -			0,	/* force */
> -			page_list->pages,
> -			NULL);
> -		up_read(&current->mm->mmap_sem);
> +			page_list->pages);
>
>  		if (ret != page_list->nr_pages)
>  			goto unpin;
> --- a/drivers/misc/sgi-gru/grufault.c
> +++ b/drivers/misc/sgi-gru/grufault.c
> @@ -198,8 +198,7 @@ static int non_atomic_pte_lookup(struct vm_area_struct *vma,
>  #else
>  	*pageshift = PAGE_SHIFT;
>  #endif
> -	if (get_user_pages
> -	    (current, current->mm, vaddr, 1, write, 0, &page, NULL) <= 0)
> +	if (get_user_pages_fast(vaddr, 1, write, &page) <= 0)
>  		return -EFAULT;
>  	*paddr = page_to_phys(page);
>  	put_page(page);
> diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
> index aff9689..c89dcfa 100644
> --- a/drivers/scsi/st.c
> +++ b/drivers/scsi/st.c
> @@ -4536,18 +4536,12 @@ static int sgl_map_user_pages(struct st_buffer *STbp,
>  		return -ENOMEM;
>
>          /* Try to fault in all of the necessary pages */
> -	down_read(&current->mm->mmap_sem);
>          /* rw==READ means read from drive, write into memory area */
> -	res = get_user_pages(
> -		current,
> -		current->mm,
> +	res = get_user_pages_fast(
>  		uaddr,
>  		nr_pages,
>  		rw == READ,
> -		0, /* don't force */
> -		pages,
> -		NULL);
> -	up_read(&current->mm->mmap_sem);
> +		pages);
>
>  	/* Errors and no page mapped should return here */
>  	if (res < nr_pages)

For all these and the other _fast() users, is there an actual limit to the nr_pages passed in? Because we used to have the 64 pages limit from DIO, but without that we get rather long IRQ-off latencies.
Re: [RESEND PATCH 1/3] x86: Adding structs to reflect cpuid fields
On Wed, Sep 17, 2014 at 03:54:12PM +0300, Nadav Amit wrote:
> Adding structs that reflect various cpuid fields in x86 architecture. Structs were added only for functions that are not pure bitmaps.
>
> Signed-off-by: Nadav Amit na...@cs.technion.ac.il
> ---
>  arch/x86/include/asm/cpuid_def.h | 163 +++
>  1 file changed, 163 insertions(+)
>  create mode 100644 arch/x86/include/asm/cpuid_def.h
>
> diff --git a/arch/x86/include/asm/cpuid_def.h b/arch/x86/include/asm/cpuid_def.h
> new file mode 100644
> index 000..0cf43ba
> --- /dev/null
> +++ b/arch/x86/include/asm/cpuid_def.h
> @@ -0,0 +1,163 @@
> +#ifndef ARCH_X86_KVM_CPUID_DEF_H
> +#define ARCH_X86_KVM_CPUID_DEF_H

Stale name? It's not exclusively used for KVM and the header name itself doesn't include KVM either.
Re: [PATCH 0/3] x86: structs for cpuid info in x86
On Wed, Sep 17, 2014 at 02:37:10PM +0200, Ingo Molnar wrote:
> If hpa, tglx or Linus objects I'll yield to that objection though. Opinions, objections?

They generally look fine to me. I appreciate the bitfields for readability. I often use the same when having to deal with hardware bitfields. See for example the cpuid10_a?x unions in asm/perf_event.h
Re: [PATCH V6 1/2] perf ignore LBR and extra_rsp
On Mon, Jul 14, 2014 at 12:25:56PM -0700, kan.li...@intel.com wrote:

Close enough.

> +#define EVENT_EXTRA_REG(e, ms, m, vm, i, a) {	\
> +	.event = (e),			\
> +	.msr = (ms),			\
> +	.config_mask = (m),		\
> +	.valid_mask = (vm),		\
> +	.idx = EXTRA_REG_##i,		\
> +	.extra_msr_access = (a),	\
> 	}
>
>  #define INTEL_EVENT_EXTRA_REG(event, msr, vm, idx)	\
> -	EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT, vm, idx)
> +	EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT, vm, idx, true)
>
>  #define INTEL_UEVENT_EXTRA_REG(event, msr, vm, idx)	\
>  	EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT | \
> -			ARCH_PERFMON_EVENTSEL_UMASK, vm, idx)
> +			ARCH_PERFMON_EVENTSEL_UMASK, vm, idx, true)
>
>  #define INTEL_UEVENT_PEBS_LDLAT_EXTRA_REG(c) \
>  	INTEL_UEVENT_EXTRA_REG(c, \
> @@ -318,7 +320,7 @@ struct extra_reg {
>  			       0xffff, \
>  			       LDLAT)
>
> -#define EVENT_EXTRA_END EVENT_EXTRA_REG(0, 0, 0, 0, RSP_0)
> +#define EVENT_EXTRA_END EVENT_EXTRA_REG(0, 0, 0, 0, RSP_0, false)
>
> diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.h b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
> index 90236f0..d860e26 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel_uncore.h
> +++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
> @@ -366,11 +366,11 @@ NHMEX_M_PMON_CTL_FLAG_MODE)
>  #define MBOX_INC_SEL_EXTAR_REG(c, r) \
>  	EVENT_EXTRA_REG(MBOX_INC_SEL(c), NHMEX_M0_MSR_PMU_##r, \
> -			MBOX_INC_SEL_MASK, (u64)-1, NHMEX_M_##r)
> +			MBOX_INC_SEL_MASK, (u64)-1, NHMEX_M_##r, true)
>  #define MBOX_SET_FLAG_SEL_EXTRA_REG(c, r) \
>  	EVENT_EXTRA_REG(MBOX_SET_FLAG_SEL(c), NHMEX_M0_MSR_PMU_##r, \
>  			MBOX_SET_FLAG_SEL_MASK, \
> -			(u64)-1, NHMEX_M_##r)
> +			(u64)-1, NHMEX_M_##r, true)
>
>  /* NHM-EX Rbox */
>  #define NHMEX_R_MSR_GLOBAL_CTL	0xe00

Since nobody ever treats EVENT_EXTRA_END as an actual event, the value of .extra_msr_access is irrelevant, this leaves the only 'possible' value 'true' and we can delete all those changes. Which, combined with a few whitespace cleanups, gives the below patch.
---
Subject: x86, perf: Protect LBR and extra_regs against KVM lying
From: Kan Liang kan.li...@intel.com
Date: Mon, 14 Jul 2014 12:25:56 -0700

With -cpu host, KVM reports LBR and extra_regs support, if the host has support. When the guest perf driver tries to access LBR or extra_regs MSRs, it #GPs on all MSR accesses, since KVM doesn't handle LBR and extra_regs support. So check the related MSR access rights once at initialization time to avoid the erroneous accesses at runtime.

For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL = y (for the host kernel), and CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n (for the guest kernel). Start the guest with -cpu host.
Run perf record with --branch-any or --branch-filter in the guest to trigger the LBR #GP.
Run perf stat offcore events (e.g. LLC-loads/LLC-load-misses ...) in the guest to trigger the offcore_rsp #GP.

Cc: a...@firstfloor.org
Signed-off-by: Kan Liang kan.li...@intel.com
Signed-off-by: Peter Zijlstra pet...@infradead.org
Link: http://lkml.kernel.org/r/1405365957-20202-1-git-send-email-kan.li...@intel.com
---
 arch/x86/kernel/cpu/perf_event.c       |  3 +
 arch/x86/kernel/cpu/perf_event.h       | 12 +++---
 arch/x86/kernel/cpu/perf_event_intel.c | 66 -
 3 files changed, 75 insertions(+), 6 deletions(-)

--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -118,6 +118,9 @@ static int x86_pmu_extra_regs(u64 config
 			continue;
 		if (event->attr.config1 & ~er->valid_mask)
 			return -EINVAL;
+		/* Check if the extra msrs can be safely accessed */
+		if (!er->extra_msr_access)
+			return -ENXIO;

 		reg->idx = er->idx;
 		reg->config = event->attr.config1;
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -295,14 +295,16 @@ struct extra_reg {
 	u64		config_mask;
 	u64		valid_mask;
 	int		idx;	/* per_xxx->regs[] reg index */
+	bool		extra_msr_access;
 };

 #define EVENT_EXTRA_REG(e, ms, m, vm, i) {	\
-	.event = (e),		\
-	.msr = (ms),		\
+	.event = (e),			\
+	.msr = (ms),			\
.config_mask = (m), \ + .valid_mask = (vm), \ + .idx = EXTRA_REG_##i
Re: [PATCH V6 1/2] perf ignore LBR and extra_rsp
On Tue, Jul 15, 2014 at 03:32:36PM +, Liang, Kan wrote:
> > Since nobody ever treats EVENT_EXTRA_END as an actual event, the value of .extra_msr_access is irrelevant; this leaves 'true' as the only possible value and we can delete all those changes.
>
> Right.
>
> > Which, combined with a few whitespace cleanups, gives the below patch.
>
> Thanks. Your change is perfect. Do I need to resubmit the patch to the mailing list?

Nope, got it queued.
Re: [PATCH V5 1/2] perf ignore LBR and extra_regs
On Thu, Jul 10, 2014 at 03:59:43AM -0700, kan.li...@intel.com wrote:
> From: Kan Liang kan.li...@intel.com
>
> x86, perf: Protect LBR and extra_regs against KVM lying
>
> With -cpu host, KVM reports LBR and extra_regs support, if the host has support. When the guest perf driver tries to access LBR or extra_regs MSRs, it #GPs on all MSR accesses, since KVM doesn't handle LBR and extra_regs support. So check the related MSR access rights once at initialization time to avoid the erroneous accesses at runtime.
>
> For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL=y (for the host kernel), and CONFIG_PARAVIRT=n and CONFIG_KVM_GUEST=n (for the guest kernel). Start the guest with -cpu host. Run perf record with --branch-any or --branch-filter in the guest to trigger the LBR #GP. Run perf stat on offcore events (e.g. LLC-loads/LLC-load-misses ...) in the guest to trigger the offcore_rsp #GP.

This is still not properly wrapped at 78 chars.

> Signed-off-by: Kan Liang kan.li...@intel.com
>
> V2: Move the check code to initialization time.
> V3: Add flag for each extra register. Check all LBR MSRs at initialization time.
> V4: Remove lbr_msr_access. For LBR msr, simply set lbr_nr to 0 if check_msr failed. Disable all extra msrs in creation places if check_msr failed.
> V5: Fix check_msr broken. Don't check any more MSRs after the first fail. Return error when checking fail to stop creating the event. Remove the checking code path which never get

These things should go below the --- so they get thrown away when applying the patch; it's of no relevance once applied.
> ---
>  arch/x86/kernel/cpu/perf_event.c       |  3 +++
>  arch/x86/kernel/cpu/perf_event.h       | 45 ++
>  arch/x86/kernel/cpu/perf_event_intel.c | 38 +++-
>  3 files changed, 85 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 2bdfbff..a7c5e4b 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -118,6 +118,9 @@ static int x86_pmu_extra_regs(u64 config, struct perf_event *event)
>  			continue;
>  		if (event->attr.config1 & ~er->valid_mask)
>  			return -EINVAL;
> +		/* Check if the extra msrs can be safely accessed */
> +		if (!x86_pmu.extra_msr_access[er->idx])
> +			return -EFAULT;

This is not a correct usage of -EFAULT. Event creation did not fail because we took a fault dereferencing a user provided pointer. Possibly ENXIO is appropriate.

>  		reg->idx = er->idx;
>  		reg->config = event->attr.config1;
>
> diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
> index 3b2f9bd..992c678 100644
> --- a/arch/x86/kernel/cpu/perf_event.h
> +++ b/arch/x86/kernel/cpu/perf_event.h
> @@ -464,6 +464,12 @@ struct x86_pmu {
>  	 */
>  	struct extra_reg *extra_regs;
>  	unsigned int er_flags;
> +	/*
> +	 * EXTRA REG MSR can be accessed
> +	 * The extra registers are completely unrelated to each other.
> +	 * So it needs a flag for each extra register.
> +	 */
> +	bool	extra_msr_access[EXTRA_REG_MAX];

So why not in struct extra_reg again? You didn't give a straight answer there.

> +/*
> + * Under certain circumstances, access certain MSR may cause #GP.
> + * The function tests if the input MSR can be safely accessed.
> + */
> +static inline bool check_msr(unsigned long msr)
> +{

This reads like a generic function;

> +	u64 val_old, val_new, val_tmp;
> +
> +	/*
> +	 * Read the current value, change it and read it back to see if it
> +	 * matches, this is needed to detect certain hardware emulators
> +	 * (qemu/kvm) that don't trap on the MSR access and always return 0s.
> +	 */
> +	if (rdmsrl_safe(msr, &val_old))
> +		goto msr_fail;
> +	/*
> +	 * Only change it slightly,
> +	 * since the higher bits of some MSRs cannot be updated by wrmsrl.
> +	 * E.g. MSR_LBR_TOS
> +	 */
> +	val_tmp = val_old ^ 0x3UL;

but this is not generally true; not all MSRs can write the 2 LSB, can they? One option would be to extend the function with a u64 mask.

> +	if (wrmsrl_safe(msr, val_tmp) ||
> +	    rdmsrl_safe(msr, &val_new))
> +		goto msr_fail;
> +
> +	if (val_new != val_tmp)
> +		goto msr_fail;
> +
> +	/*
> +	 * Here it's sure that the MSR can be safely accessed.
> +	 * Restore the old value and return.
> +	 */
> +	wrmsrl(msr, val_old);
> +
> +	return true;
> +
> +msr_fail:
> +	return false;
> +}

Also, by now this function is far too large to be inline and in a header.

> +	/*
> +	 * Access LBR MSR may cause #GP under certain circumstances.
> +	 * E.g. KVM doesn't support LBR MSR
> +	 * Check all LBR MSRs here.
> +	 * Disable LBR access if any LBR MSRs can not be accessed.
> +	 */
> +	if (x86_pmu.lbr_nr) {
> +		if
Re: [PATCH V5 1/2] perf ignore LBR and extra_regs
On Thu, Jul 10, 2014 at 03:59:43AM -0700, kan.li...@intel.com wrote:
> +/*
> + * Under certain circumstances, access certain MSR may cause #GP.
> + * The function tests if the input MSR can be safely accessed.
> + */
> +static inline bool check_msr(unsigned long msr)
> +{
> +	u64 val_old, val_new, val_tmp;
> +
> +	/*
> +	 * Read the current value, change it and read it back to see if it
> +	 * matches, this is needed to detect certain hardware emulators
> +	 * (qemu/kvm) that don't trap on the MSR access and always return 0s.
> +	 */
> +	if (rdmsrl_safe(msr, &val_old))
> +		goto msr_fail;
> +	/*
> +	 * Only change it slightly,
> +	 * since the higher bits of some MSRs cannot be updated by wrmsrl.
> +	 * E.g. MSR_LBR_TOS
> +	 */
> +	val_tmp = val_old ^ 0x3UL;
> +	if (wrmsrl_safe(msr, val_tmp) ||
> +	    rdmsrl_safe(msr, &val_new))
> +		goto msr_fail;
> +
> +	if (val_new != val_tmp)
> +		goto msr_fail;
> +
> +	/*
> +	 * Here it's sure that the MSR can be safely accessed.
> +	 * Restore the old value and return.
> +	 */
> +	wrmsrl(msr, val_old);
> +
> +	return true;
> +
> +msr_fail:
> +	return false;
> +}

I don't think we need the msr_fail thing; there's no state to clean up, you can return false at all places you now have goto.
Re: [PATCH V5 1/2] perf ignore LBR and extra_regs
On Mon, Jul 14, 2014 at 01:55:03PM +0200, Paolo Bonzini wrote:
> On 10/07/2014 12:59, kan.li...@intel.com wrote:
> > From: Kan Liang kan.li...@intel.com
> >
> > x86, perf: Protect LBR and extra_regs against KVM lying
> >
> > With -cpu host, KVM reports LBR and extra_regs support, if the host has support. When the guest perf driver tries to access LBR or extra_regs MSRs, it #GPs on all MSR accesses, since KVM doesn't handle LBR and extra_regs support. So check the related MSR access rights once at initialization time to avoid the erroneous accesses at runtime. For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL=y (for the host kernel), and CONFIG_PARAVIRT=n and CONFIG_KVM_GUEST=n (for the guest kernel).
>
> I'm not sure this is a useful patch. This is #GP'ing just because of a limitation in the PMU; just compile the kernel with CONFIG_PARAVIRT

How's that going to help? If you run kvm -host the VM is lying through its teeth and the kernel is going to assume all those MSRs present; PARAVIRT isn't going to help with this.

> , or split the "rdmsr is always rdmsr_safe" behavior out of CONFIG_PARAVIRT and into a new Kconfig symbol.

That's not useful either, because none of these code paths are going to check the return value.

> In fact there's no reason why LBR cannot be virtualized (though it does need support from the processor), and it may even be possible to support OFFCORE_RSP_X in the KVM virtual PMU.

But it's not, so something needs to be done, right?
Re: [PATCH V5 1/2] perf ignore LBR and extra_regs
On Mon, Jul 14, 2014 at 02:40:33PM +0200, Paolo Bonzini wrote:
> Hmmm, I thought rdmsr_safe was going to return zero, but it just returns whatever happened to be in edx:eax, which maybe should also be fixed. Kan Liang, what happens if CONFIG_PARAVIRT=y? Do you get garbage or just no events reported?

Either way, it's not functioning according to 'spec'. Therefore we should 'disable' things.

> > > In fact there's no reason why LBR cannot be virtualized (though it does need support from the processor), and it may even be possible to support OFFCORE_RSP_X in the KVM virtual PMU.
> >
> > But it's not, so something needs to be done, right?
>
> A first thing that could be done, is to have a way for the kernel to detect absence of LBR

Which is exactly what this patch does, no?

> , for example an all-1s setting of the LBR format field of IA32_PERF_CAPABILITIES. If Intel can tell us all 1s will never be used, we can have KVM set the field that way. The kernel then should disable LBR.

So what's wrong with testing if the MSRs that provide LBR actually work or not? That doesn't require any magic co-operation of the host/vm machinery and is therefore far more robust.
Re: [PATCH V5 1/2] perf ignore LBR and extra_regs
so once more; and then I'm going to route your emails to /dev/null: wrap text at 78 chars.

On Mon, Jul 14, 2014 at 02:28:36PM +, Liang, Kan wrote:
> > > +++ b/arch/x86/kernel/cpu/perf_event.h
> > > @@ -464,6 +464,12 @@ struct x86_pmu {
> > >  	 */
> > >  	struct extra_reg *extra_regs;
> > >  	unsigned int er_flags;
> > > +	/*
> > > +	 * EXTRA REG MSR can be accessed
> > > +	 * The extra registers are completely unrelated to each other.
> > > +	 * So it needs a flag for each extra register.
> > > +	 */
> > > +	bool	extra_msr_access[EXTRA_REG_MAX];
> >
> > So why not in struct extra_reg again? You didn't give a straight answer there.
>
> I think I did in the email. You mentioned that there's still (only) 4 empty bytes at the tail of extra_reg itself. However, the extra_reg_type may be extended in the near future. So that may not be a reason to move to extra_reg.

Well, you can always grow. Also be explicit; 'may be' is an empty statement.

> Furthermore, if we move extra_msr_access to extra_reg, I guess we have to modify all the related macros (i.e. EVENT_EXTRA_REG) to initialize the new items. That could be a big change.

Nah, trivial stuff :-)

> On the other side, in the x86_pmu structure, there are extra_regs related items defined under the comment "Extra registers for events". And the bit holes are enough for current usage and future extension. So I guess x86_pmu should be a good place to store the availability of the reg.

It just doesn't make sense to me to have multiple arrays of the same thing.
Re: [PATCH V4 1/2] perf ignore LBR and extra_regs.
/me reminds you of 78 char text wrap.

On Wed, Jul 09, 2014 at 07:32:09PM +, Liang, Kan wrote:
> > Sure; but what I meant was, check_msr() is broken when run on such a kernel. You need to fix check_msr() to return failure on these 'ignored' MSRs; after all they don't function as expected, they're effectively broken.
>
> The function is designed to check if the MSRs can be safely accessed (no #GP). It cannot guarantee the correctness of the MSRs. If KVM applied patch 2 and the guest applied patch 1, from the guest's perspective the MSRs can be accessed (no #GP triggered). So returning true is expected. It should not be broken.

You're not understanding. I know you wrote that function to do that. I'm saying that's wrong.

Look at check_hw_exists(); it explicitly checks for fake MSRs and reports them broken. These fake MSRs _ARE_ broken, they do not behave as expected. Not crashing is not the right consideration here; we're interested in higher order correct behaviour.

> The only unexpected thing for the guest is that the counting/sampling result for LBR/extra reg is always 0. But the patch is a short term fix to stop things from crashing. I think it should be acceptable.

Patch 2 is fine; patch 1, in particular your check_msr() routine, is not.
Re: [PATCH V4 1/2] perf ignore LBR and extra_regs.
On Tue, Jul 08, 2014 at 09:49:40AM -0700, kan.li...@intel.com wrote:
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 2bdfbff..f0e8022 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -118,6 +118,9 @@ static int x86_pmu_extra_regs(u64 config, struct perf_event *event)
>  			continue;
>  		if (event->attr.config1 & ~er->valid_mask)
>  			return -EINVAL;
> +		/* Check if the extra msrs can be safely accessed */
> +		if (!x86_pmu.extra_msr_access[er->idx])
> +			continue;
>  		reg->idx = er->idx;
>  		reg->config = event->attr.config1;

You should return an error here; doing continue will eventually make it return 0, which is no-error, and things will happily create the event and malfunction.
Re: [PATCH V4 1/2] perf ignore LBR and extra_regs.
On Tue, Jul 08, 2014 at 09:49:40AM -0700, kan.li...@intel.com wrote:
> diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
> index 2bdfbff..f0e8022 100644
> --- a/arch/x86/kernel/cpu/perf_event.c
> +++ b/arch/x86/kernel/cpu/perf_event.c
> @@ -118,6 +118,9 @@ static int x86_pmu_extra_regs(u64 config, struct perf_event *event)
>  			continue;
>  		if (event->attr.config1 & ~er->valid_mask)
>  			return -EINVAL;
> +		/* Check if the extra msrs can be safely accessed */
> +		if (!x86_pmu.extra_msr_access[er->idx])
> +			continue;

If you fail here;

>  		reg->idx = er->idx;
>  		reg->config = event->attr.config1;
>
> diff --git a/arch/x86/kernel/cpu/perf_event_intel.c b/arch/x86/kernel/cpu/perf_event_intel.c
> index adb02aa..2be44be 100644
> --- a/arch/x86/kernel/cpu/perf_event_intel.c
> +++ b/arch/x86/kernel/cpu/perf_event_intel.c
> @@ -1471,11 +1471,14 @@ static void intel_fixup_er(struct perf_event *event, int idx)
>  {
>  	event->hw.extra_reg.idx = idx;
>
> -	if (idx == EXTRA_REG_RSP_0) {
> +	/* The extra_reg doesn't update if msrs cannot be accessed safely */
> +	if ((idx == EXTRA_REG_RSP_0) &&
> +	    x86_pmu.extra_msr_access[EXTRA_REG_RSP_0]) {
>  		event->hw.config &= ~INTEL_ARCH_EVENT_MASK;
>  		event->hw.config |= x86_pmu.extra_regs[EXTRA_REG_RSP_0].event;
>  		event->hw.extra_reg.reg = MSR_OFFCORE_RSP_0;
> -	} else if (idx == EXTRA_REG_RSP_1) {
> +	} else if ((idx == EXTRA_REG_RSP_1) &&
> +		   x86_pmu.extra_msr_access[EXTRA_REG_RSP_1]) {
>  		event->hw.config &= ~INTEL_ARCH_EVENT_MASK;
>  		event->hw.config |= x86_pmu.extra_regs[EXTRA_REG_RSP_1].event;
>  		event->hw.extra_reg.reg = MSR_OFFCORE_RSP_1;

You should never get here..
Re: [PATCH V4 1/2] perf ignore LBR and extra_regs.
On Tue, Jul 08, 2014 at 09:49:40AM -0700, kan.li...@intel.com wrote:
> +	/*
> +	 * Access LBR MSR may cause #GP under certain circumstances.
> +	 * E.g. KVM doesn't support LBR MSR
> +	 * Check all LBR MSRs here.
> +	 * Disable LBR access if any LBR MSRs can not be accessed.
> +	 */
> +	if (x86_pmu.lbr_nr) {
> +		access &= check_msr(x86_pmu.lbr_tos);
> +		for (i = 0; i < x86_pmu.lbr_nr; i++) {
> +			access &= check_msr(x86_pmu.lbr_from + i);
> +			access &= check_msr(x86_pmu.lbr_to + i);
> +		}
> +		if (!access)
> +			x86_pmu.lbr_nr = 0;
> +	}

You _could_ write that as:

	if (x86_pmu.lbr_nr) {
		if (!check_msr(x86_pmu.lbr_tos))
			x86_pmu.lbr_nr = 0;
		for (i = 0; i < x86_pmu.lbr_nr; i++) {
			if (!(check_msr(x86_pmu.lbr_from + i) &&
			      check_msr(x86_pmu.lbr_to + i)))
				x86_pmu.lbr_nr = 0;
		}
	}

There is no point in checking any more MSRs after the first fail after all.
Re: [PATCH V4 1/2] perf ignore LBR and extra_regs.
On Tue, Jul 08, 2014 at 09:49:40AM -0700, kan.li...@intel.com wrote:
> --- a/arch/x86/kernel/cpu/perf_event.h
> +++ b/arch/x86/kernel/cpu/perf_event.h
> @@ -464,6 +464,12 @@ struct x86_pmu {
>  	 */
>  	struct extra_reg *extra_regs;
>  	unsigned int er_flags;
> +	/*
> +	 * EXTRA REG MSR can be accessed
> +	 * The extra registers are completely unrelated to each other.
> +	 * So it needs a flag for each extra register.
> +	 */
> +	bool	extra_msr_access[EXTRA_REG_MAX];
>
>  	/*
>  	 * Intel host/guest support (KVM)

# pahole -C extra_reg arch/x86/kernel/cpu/perf_event_intel.o
struct extra_reg {
	unsigned int               event;                /*     0     4 */
	unsigned int               msr;                  /*     4     4 */
	u64                        config_mask;          /*     8     8 */
	u64                        valid_mask;           /*    16     8 */
	int                        idx;                  /*    24     4 */

	/* size: 32, cachelines: 1, members: 5 */
	/* padding: 4 */
	/* last cacheline: 32 bytes */
};

There's still 4 empty bytes at the tail of extra_reg itself; would it make sense to store the availability of the reg in there? After all, the place we use it (x86_pmu_extra_regs) already has the pointer to the structure.
Re: [PATCH V4 1/2] perf ignore LBR and extra_regs.
On Tue, Jul 08, 2014 at 09:49:40AM -0700, kan.li...@intel.com wrote:
> +/*
> + * Under certain circumstances, access certain MSR may cause #GP.
> + * The function tests if the input MSR can be safely accessed.
> + */
> +static inline bool check_msr(unsigned long msr)
> +{
> +	u64 value;
> +
> +	if (rdmsrl_safe(msr, &value) < 0)
> +		return false;
> +	if (wrmsrl_safe(msr, value) < 0)
> +		return false;
> +	return true;
> +}

What does this thing return after patch 2? Does the write still fault or will KVM silently take writes too?
Re: [PATCH V4 1/2] perf ignore LBR and extra_regs.
On Wed, Jul 09, 2014 at 02:32:28PM +, Liang, Kan wrote:
> > On Tue, Jul 08, 2014 at 09:49:40AM -0700, kan.li...@intel.com wrote:
> > > +/*
> > > + * Under certain circumstances, access certain MSR may cause #GP.
> > > + * The function tests if the input MSR can be safely accessed.
> > > + */
> > > +static inline bool check_msr(unsigned long msr)
> > > +{
> > > +	u64 value;
> > > +
> > > +	if (rdmsrl_safe(msr, &value) < 0)
> > > +		return false;
> > > +	if (wrmsrl_safe(msr, value) < 0)
> > > +		return false;
> > > +	return true;
> > > +}
> >
> > What does this thing return after patch 2? Does the write still fault or will KVM silently take writes too?
>
> If applying patch 2, the function will return true. The KVM just simply ignores the reads/writes.

OK, then that's broken too. We want a function to return false for any malfunctioning MSR; ignoring writes and returning 0s is not proper functioning.
Re: [PATCH V4 1/2] perf ignore LBR and extra_regs.
On Wed, Jul 09, 2014 at 03:43:45PM +, Liang, Kan wrote:
> > -----Original Message-----
> > From: Peter Zijlstra [mailto:pet...@infradead.org]
> > Sent: Wednesday, July 09, 2014 10:58 AM
> > To: Liang, Kan
> > Cc: a...@firstfloor.org; linux-ker...@vger.kernel.org; kvm@vger.kernel.org
> > Subject: Re: [PATCH V4 1/2] perf ignore LBR and extra_regs.
> >
> > On Wed, Jul 09, 2014 at 02:32:28PM +, Liang, Kan wrote:
> > > > On Tue, Jul 08, 2014 at 09:49:40AM -0700, kan.li...@intel.com wrote:
> > > > > +/*
> > > > > + * Under certain circumstances, access certain MSR may cause #GP.
> > > > > + * The function tests if the input MSR can be safely accessed.
> > > > > + */
> > > > > +static inline bool check_msr(unsigned long msr)
> > > > > +{
> > > > > +	u64 value;
> > > > > +
> > > > > +	if (rdmsrl_safe(msr, &value) < 0)
> > > > > +		return false;
> > > > > +	if (wrmsrl_safe(msr, value) < 0)
> > > > > +		return false;
> > > > > +	return true;
> > > > > +}
> > > >
> > > > What does this thing return after patch 2? Does the write still fault or will KVM silently take writes too?
> > >
> > > If applying patch 2, the function will return true. The KVM just simply ignores the reads/writes.
> >
> > OK, then that's broken too. We want a function to return false for any malfunctioning MSR; ignoring writes and returning 0s is not proper functioning.
>
> The patch 2 is to handle the case that the administrator can only patch the host. Don't need to force the user to upgrade their guest to fix the crash. And ignore the annoying unhandled KVM messages.

Sure; but what I meant was, check_msr() is broken when run on such a kernel. You need to fix check_msr() to return failure on these 'ignored' MSRs; after all they don't function as expected, they're effectively broken.