Re: [PATCH 14/23] userfaultfd: wake pending userfaults

2015-10-22 Thread Peter Zijlstra
On Thu, May 14, 2015 at 07:31:11PM +0200, Andrea Arcangeli wrote:
> @@ -255,21 +259,23 @@ int handle_userfault(struct vm_area_struct *vma, 
> unsigned long address,
>* through poll/read().
>*/
>   __add_wait_queue(&ctx->fault_wqh, &uwq.wq);
> - for (;;) {
> - set_current_state(TASK_KILLABLE);
> - if (!uwq.pending || ACCESS_ONCE(ctx->released) ||
> - fatal_signal_pending(current))
> - break;
> - spin_unlock(&ctx->fault_wqh.lock);
> + set_current_state(TASK_KILLABLE);
> + spin_unlock(&ctx->fault_wqh.lock);
>  
> + if (likely(!ACCESS_ONCE(ctx->released) &&
> +!fatal_signal_pending(current))) {
>   wake_up_poll(&ctx->fd_wqh, POLLIN);
>   schedule();
> + ret |= VM_FAULT_MAJOR;
> + }

So what happens here if schedule() spontaneously wakes for no reason?

I'm not sure enough of userfaultfd semantics to say if that would be
bad, but the code looks suspiciously like it relies on schedule() not to
do that; which is wrong.

> + __set_current_state(TASK_RUNNING);
> + /* see finish_wait() comment for why list_empty_careful() */
> + if (!list_empty_careful(&uwq.wq.task_list)) {
>   spin_lock(&ctx->fault_wqh.lock);
> + list_del_init(&uwq.wq.task_list);
> + spin_unlock(&ctx->fault_wqh.lock);
>   }
> - __remove_wait_queue(&ctx->fault_wqh, &uwq.wq);
> - __set_current_state(TASK_RUNNING);
> - spin_unlock(&ctx->fault_wqh.lock);
>  
>   /*
>* ctx may go away after this if the userfault pseudo fd is


Re: [PATCH 14/23] userfaultfd: wake pending userfaults

2015-10-22 Thread Peter Zijlstra
On Thu, Oct 22, 2015 at 03:20:15PM +0200, Andrea Arcangeli wrote:

> If schedule spontaneously wakes up a task in TASK_KILLABLE state that
> would be a bug in the scheduler in my view. Luckily there doesn't seem
> to be such a bug, or at least we never experienced it.

Well, there will be a wakeup, just not the one you were hoping for.

We have code that does:

@cond = true;
get_task_struct(p);
queue(p)

/* random wait somewhere */
for (;;) {
prepare_to_wait();
if (@cond)
  break;

...

handle_userfault()
  ...
  schedule();
...

dequeue(p)
wake_up_process(p) ---> wakeup without userfault wakeup


These races are (extremely) rare, but they do exist. Therefore one must
never assume schedule() will not spuriously wake because of these
things.
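
For reference, the canonical pattern that tolerates such spurious wakeups is to
re-check the condition in a loop; a minimal sketch (where "done" stands in for
whatever condition the waiter actually needs, not the userfaultfd code itself):

	for (;;) {
		set_current_state(TASK_KILLABLE);
		if (READ_ONCE(done) || fatal_signal_pending(current))
			break;
		schedule();	/* may return early; the loop re-checks */
	}
	__set_current_state(TASK_RUNNING);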

Also, see:

lkml.kernel.org/r/CA+55aFwHkOo+YGWKYROmce1-H_uG3KfEUmCkJUerTj=ojy2...@mail.gmail.com



Re: [PATCH 14/23] userfaultfd: wake pending userfaults

2015-10-22 Thread Peter Zijlstra
On Thu, Oct 22, 2015 at 04:18:31PM +0200, Andrea Arcangeli wrote:

> The risk of memory corruption is still zero no matter what happens
> here, in the extremely rare case the app will get a SIGBUS or a

That might still upset people, SIGBUS isn't something an app can really
recover from.

> I'm not exactly sure why we allow VM_FAULT_RETRY only once currently
> so I'm tempted to drop FAULT_FLAG_TRIED entirely.

I think to ensure we make forward progress.

> I've no real preference on how to tweak the page fault code to be able
> to return VM_FAULT_RETRY indefinitely and I would aim for the smallest
> change possible, so if you've suggestions now it's good time.

Indefinitely is such a long time, we should try and finish
computation before the computer dies etc. :-)

Yes, yes.. I know, extremely unlikely etc. Still guarantees are good.


In any case, I'm not really too bothered how you fix it, just figured
I'd let you know.


Re: [PATCH] kvm: fix waitqueue_active without memory barrier in virt/kvm/async_pf.c

2015-10-09 Thread Peter Zijlstra
On Fri, Oct 09, 2015 at 10:45:32AM +0200, Paolo Bonzini wrote:
> So you need another smp_mb() after prepare_to_wait().  I'm not sure
> if it's needed also for your original tty report, but I think it is
> for https://lkml.org/lkml/2015/10/8/989 ("mei: fix waitqueue_active
> without memory barrier in mei drivers").
> 
> I wonder if it makes sense to introduce smp_mb__before_spin_lock()
> and smp_mb__after_spin_unlock().  On x86 the former could be a
> simple compiler barrier, and on s390 both of them could.  But that
> should be a separate patch.

Not having actually read or thought about the issue at hand, it's
perfectly valid to pair an smp_mb() with either spin_lock() or
spin_unlock().

IOW. MB <-> {ACQUIRE, RELEASE} is a valid pairing.
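
For the waitqueue_active() case under discussion, the usual pairing looks like
this sketch (wq, wait and cond are placeholder names, not the driver's actual
identifiers):

	/* waker */
	cond = true;
	smp_mb();		/* pairs with set_current_state() in the sleeper */
	if (waitqueue_active(&wq))
		wake_up(&wq);

	/* sleeper */
	prepare_to_wait(&wq, &wait, TASK_UNINTERRUPTIBLE);	/* implies a full barrier */
	if (!cond)
		schedule();
	finish_wait(&wq, &wait);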


Re: [PATCH v2] kvm: fix waitqueue_active without memory barrier in virt/kvm/async_pf.c

2015-10-09 Thread Peter Zijlstra
On Fri, Oct 09, 2015 at 12:21:55PM +, Kosuke Tatsukawa wrote:

> +  * Memory barrier is required here to make sure change to
> +  * vcpu->async_pf.done is visible from other CPUs.  This memory
> +  * barrier pairs with prepare_to_wait's set_current_state()

That is not how memory barriers work; they don't 'make visible'. They
simply ensure order between operations.

  X = Y = 0

CPU0CPU1

[S] X=1 [S] Y=1
 MB  MB
[L] y=Y [L] x=X

  assert(x || y)

Issuing the memory barrier does not mean the store is visible; it
merely means that the load _must_ happen after the store (in the above
scenario).

This gives a guarantee that x and y cannot both be 0, because either
being 0 means the other has not yet executed and must therefore observe
your store.

Nothing more, nothing less.

So your comment is misleading at best.


Re: [PATCH v2 1/2] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops

2015-09-30 Thread Peter Zijlstra
On Mon, Sep 21, 2015 at 09:36:15AM -0700, Linus Torvalds wrote:
> On Mon, Sep 21, 2015 at 1:46 AM, Ingo Molnar  wrote:
> >
> > Linus, what's your preference?
> 
> So quite frankly, is there any reason we don't just implement
> native_read_msr() as just
> 
>unsigned long long native_read_msr(unsigned int msr)
>{
>   int err;
>   unsigned long long val;
> 
>   val = native_read_msr_safe(msr, &err);
>   WARN_ON_ONCE(err);
>   return val;
>}
> 
> Note: no inline, no nothing. Just put it in arch/x86/lib/msr.c, and be
> done with it. I don't see the downside.
> 
> How many msr reads are so critical that the function call
> overhead would matter? Get rid of the inline version of the _safe()
> thing too, and put that thing there too.

There are a few in the perf code, and esp. on cores without a stack
engine the call overhead is noticeable. Also note that the perf MSRs are
generally optimized MSRs and less slow (we cannot say fast, they're
still MSRs) than regular MSRs.


Re: [PATCH] sched: access local runqueue directly in single_task_running

2015-09-18 Thread Peter Zijlstra
On Fri, Sep 18, 2015 at 01:26:53PM +0200, Paolo Bonzini wrote:
> > +++ b/kernel/sched/core.c
> > @@ -2614,13 +2614,13 @@ unsigned long nr_running(void)
> >  
> >  /*
> >   * Check if only the current task is running on the cpu.
> > + *
> > + * Caution: result is subject to a time-of-check-to-time-of-use race;
> > + * every caller is responsible for setting up additional fences if necessary.
> 
> Let's expand it a bit more:
> 
>  * Caution: this function does not check that the caller has disabled
>  * preemption, thus the result might have a time-of-check-to-time-of-use
>  * race.  The caller is responsible to use this correctly, for example:
>  *
>  * - use it from a non-preemptable section
>  *
>  * - use it from a thread that is bound to a single CPU
>  *
>  * - use it in a loop where each iteration takes very little time
>  *   (e.g. a polling loop)
>  */
> 
> I'll include it in my pull request.

In which case:

Acked-by: Peter Zijlstra (Intel) <pet...@infradead.org>


Re: single_task_running() vs. preemption warnings (was Re: [PATCH] kvm: fix preemption warnings in kvm_vcpu_block)

2015-09-18 Thread Peter Zijlstra
On Thu, Sep 17, 2015 at 01:32:55PM -0700, Tim Chen wrote:
> I have no objection to change single_task_running to use
> raw_smp_processor_id.  The worker in mcryptd is bound to
> the cpu so it has no migration/preemption issue.  So it shouldn't care
> which smp_processor_id version is being used.  Yes, please add a comment
> to alert the user of this caveat should you change single_task_running.


We actually have raw_rq() for that, and the whole if thing looks rather
superfluous. So something like the below, except with a suitable comment
on and tested etc.. ;-)

---
 kernel/sched/core.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 6ab415aa15c4..f39c0498e284 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2666,10 +2666,7 @@ unsigned long nr_running(void)
  */
 bool single_task_running(void)
 {
-   if (cpu_rq(smp_processor_id())->nr_running == 1)
-   return true;
-   else
-   return false;
+   return raw_rq()->nr_running == 1;
 }
 EXPORT_SYMBOL(single_task_running);
 


Re: [PATCH 0/3] x86/paravirt: Fix baremetal paravirt MSR ops

2015-09-17 Thread Peter Zijlstra
On Wed, Sep 16, 2015 at 04:33:11PM -0700, Andy Lutomirski wrote:
> Setting CONFIG_PARAVIRT=y has an unintended side effect: it silently
> turns all rdmsr and wrmsr operations into the safe variants without
> any checks that the operations actually succeed.
> 
> This is IMO awful: it papers over bugs.  In particular, KVM guests
> might be unwittingly depending on this behavior because
> CONFIG_KVM_GUEST currently depends on CONFIG_PARAVIRT.  I'm not
> aware of any such problems, but applying this series would be a good
> way to shake them out.
> 
> Fix it so that the MSR operations work the same on CONFIG_PARAVIRT=n
> and CONFIG_PARAVIRT=y as long as Xen isn't being used.  The Xen
> maintainers are welcome to make a similar change on top of this.
> 
> Since there's plenty of time before the next merge window, I think
> we should apply and fix anything that breaks.
> 
> Doing this is probably a prerequisite to sanely decoupling
> CONFIG_KVM_GUEST and CONFIG_PARAVIRT, which would probably make
> Arjan and the rest of the Clear Containers people happy :)

So I actually like this, although by Ingo's argument, it's a tad risky.

But the far greater problem I have with the whole virt thing is that
you cannot use rdmsr_safe() to probe if an MSR exists at all because, as
you told me, these virt thingies return 0 for all 'unknown' MSRs instead
of faulting.


Re: [PATCH 0/3] x86/paravirt: Fix baremetal paravirt MSR ops

2015-09-17 Thread Peter Zijlstra
On Thu, Sep 17, 2015 at 01:40:30PM +0200, Paolo Bonzini wrote:
> 
> 
> On 17/09/2015 10:58, Peter Zijlstra wrote:
> > But the far greater problem I have with the whole virt thing is that
> > you cannot use rdmsr_safe() to probe if an MSR exists at all because, as
> > you told me, these virt thingies return 0 for all 'unknown' MSRs instead
> > of faulting.
> 
> At least for KVM, that behavior is opt-in (the ignore_msrs parameter)
> and no distro that I know enables it by default.

Ah, that would be good news. Andy earlier argued I could not rely on
rdmsr_safe() faulting on unknown MSRs. If practically we can there's
some code I can simplify :-)


Re: [PATCH 0/3] x86/paravirt: Fix baremetal paravirt MSR ops

2015-09-17 Thread Peter Zijlstra
On Thu, Sep 17, 2015 at 08:17:18AM -0700, Andy Lutomirski wrote:

> > Ah, that would be good news. Andy earlier argued I could not rely on
> > rdmsr_safe() faulting on unknown MSRs. If practically we can there's
> > some code I can simplify :-)
> 
> I was taking about QEMU TCG, not KVM.

Just for my education, TCG is the thing without _any_ hardware assist?
The thing you fear to use because it cannot boot a kernel this side of
tomorrow etc.. ?


Re: [PATCH] Fixes: 805de8f43c20 (atomic: Replace atomic_{set,clear}_mask() usage)

2015-09-16 Thread Peter Zijlstra
On Wed, Sep 16, 2015 at 09:13:50AM -0400, Jason J. Herne wrote:
> The offending commit accidentally replaces an atomic_clear with an
> atomic_or instead of an atomic_andnot in kvm_s390_vcpu_request_handled.
> The symptom is that kvm guests on s390 hang on startup.
> This patch simply replaces the incorrect atomic_or with atomic_andnot
> 
> Signed-off-by: Jason J. Herne <jjhe...@linux.vnet.ibm.com>

Urgh, sorry about that.

Acked-by: Peter Zijlstra (Intel) <pet...@infradead.org>


Re: [4.2] commit d59cfc09c32 (sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem) causes regression for libvirt/kvm

2015-09-15 Thread Peter Zijlstra
On Tue, Sep 15, 2015 at 02:05:14PM +0200, Christian Borntraeger wrote:
> Tejun,
> 
> 
> commit d59cfc09c32a2ae31f1c3bc2983a0cd79afb3f14 (sched, cgroup: replace 
> signal_struct->group_rwsem with a global percpu_rwsem) causes some noticeable
> hiccups when starting several kvm guests (which libvirt will move into cgroups
> - each vcpu thread and each i/o thread)
> When you now start lots of guests in parallel on a bigger system (32 CPUs with
> 2-way SMT in my case) the system is so busy that systemd runs into several 
> timeouts
> like "Did not receive a reply. Possible causes include: the remote 
> application did
> not send a reply, the message bus security policy blocked the reply, the reply
> timeout expired, or the network connection was broken."
> 
> The problem seems to be that the newly used percpu_rwsem does a
> rcu_synchronize_sched_expedited for all write downs/ups.

Can you try:

git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git 
dev.2015.09.11ab

those include Oleg's rework of the percpu rwsem which should hopefully
improve things somewhat.

But yes, pounding a global lock on a big machine will always suck.


Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu-wq

2015-08-07 Thread Peter Zijlstra
On Fri, Aug 07, 2015 at 12:57:38PM +0200, Peter Zijlstra wrote:

  +void __finish_swait(struct swait_queue_head *q, struct swait_queue *wait)

  this one has no users; the __ suggests that it is the locked edition. Maybe
  it is for the completions…
 
 Yeah, who knows, I certainly do not anymore ;-)

On that, we cannot convert completions to swait. Because swait wake_all
must not happen from IRQ context, and complete_all() typically is used
from just that.


Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu-wq

2015-08-07 Thread Peter Zijlstra
On Wed, Feb 25, 2015 at 10:02:50PM +0100, Sebastian Andrzej Siewior wrote:

 +static inline int swait_active(struct swait_queue_head *q)
 +{
 +return !list_empty(&q->task_list);
 
 In RT there was a smp_mb() which you dropped and I assume you had
 reasons for it.

Yeah, RT didn't have a reason for the smp_mb() -- any barrier without a
comment is a bug :-)

Also waitqueue_active(), its counterpart, does not have a barrier there
either.

Nor did I see any reason for that mb to be there.

 I assumed that one can perform list_empty_careful()
 without a lock if the items were removed with list_del_init(). But since
 nothing in -RT blow up so far I guess this here is legal, too :)

Nobody will come and arrest us for software bugs -- yet ;-)

 +/*
 + * The thing about the wake_up_state() return value; I think we can ignore 
 it.
 + *
 + * If for some reason it would return 0, that means the previously waiting
 + * task is already running, so it will observe condition true (or has 
 already).
 + */
 +void swake_up_locked(struct swait_queue_head *q)
 +{
 +struct swait_queue *curr;
 +
 +list_for_each_entry(curr, &q->task_list, task_list) {
 +wake_up_process(curr->task);
 
 okay. So since we limit everything to TASK_NORMAL which has to sleep
 while on the list there is no need to check if we actually woke up
 someone.

Partly that, also that I don't see how that return value is meaningful
in the first place.

If it were to return false, the task was/is already running and it will
observe whatever condition we just satisfied to allow waking things up.

So either way around, we'll get (at least) 1 task running.

 +list_del_init(&curr->task_list);
 +break;
 +}
 +}
 +EXPORT_SYMBOL(swake_up_locked);



 +void swake_up(struct swait_queue_head *q)
 +{
 +unsigned long flags;
 +
 +if (!swait_active(q))
 +return;
 +
 +raw_spin_lock_irqsave(&q->lock, flags);
 +__swake_up_locked(q);
 
 I thing this should have been swake_up_locked() instead since
 __swake_up_locked() isn't part of this patch.
 
 Just a nitpick: later there is __prepare_to_swait() and __finish_swait()
 which have the __ prefix instead of a _locked suffix. Not sure what is
 better for a public API but maybe one way would be good.

Yeah, I suppose that's true ;-)

 +raw_spin_unlock_irqrestore(&q->lock, flags);
 +}
 +EXPORT_SYMBOL(swake_up);
 +
 +/*
 + * Does not allow usage from IRQ disabled, since we must be able to
 + * release IRQs to guarantee bounded hold time.
 + */
 +void swake_up_all(struct swait_queue_head *q)
 +{
 +struct swait_queue *curr, *next;
 +LIST_HEAD(tmp);
 
 WARN_ON(irqs_disabled()) ?

Lockdep should already catch that by virtue of using unconditional _irq
spinlock primitives.

 +if (!swait_active(q))
 +return;
 +
 +raw_spin_lock_irq(&q->lock);
 +list_splice_init(&q->task_list, &tmp);
 +while (!list_empty(&tmp)) {
 +curr = list_first_entry(&tmp, typeof(curr), task_list);
 +
 +wake_up_state(curr->task, state);
 +list_del_init(&curr->task_list);
 
 So because the task may timeout and remove itself from the list at
 anytime you need to hold the lock during wakeup and the removal from the
 list

Indeed.

 +
 +if (list_empty(&tmp))
 +break;
 +
 +raw_spin_unlock_irq(&q->lock);
 
 and you drop the lock after each iteration in case there is an IRQ 
 pending or the task, that has been just woken up, has a higher priority
 than the current task and needs to get on the CPU.

 Not sure if this case matters:
 - _this_ task (wake_all) prio 120
 - first task in queue prio 10, RR
 - second task in queue prio 9, RR

Why complicate things? Better to not assume anything and just do the
simple correct thing.

 the *old* behavior would put the second task before the first task on
 CPU. The *new* behaviour puts the first task on the CPU after dropping
 the lock. The second task (that has a higher priority but nobody knows)
 has to wait until the first one is done (and anything else that might
 been woken up in the meantime with a higher prio than 120).

Irrelevant, we _must_ drop the lock in order to maintain bounded
behaviour.

 +raw_spin_lock_irq(&q->lock);
 +}
 +raw_spin_unlock_irq(&q->lock);
 +}
 +EXPORT_SYMBOL(swake_up_all);

 +void __finish_swait(struct swait_queue_head *q, struct swait_queue *wait)
 this one has no users; the __ suggests that it is the locked edition. Maybe
 it is for the completions…

Yeah, who knows, I certainly do not anymore ;-)


Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu-wq

2015-08-07 Thread Peter Zijlstra
On Fri, Aug 07, 2015 at 09:41:31AM -0700, Christoph Hellwig wrote:
 On Fri, Aug 07, 2015 at 01:14:15PM +0200, Peter Zijlstra wrote:
  On that, we cannot convert completions to swait. Because swait wake_all
  must not happen from IRQ context, and complete_all() typically is used
  from just that.
 
 If swait queues aren't useable from IRQ context they will be fairly
 useless.  What's the problem with making them irq safe?

Its just the swait_wake_all() that is not. The entire purpose of them
was to have something that allows bounded execution (RT and all).

Since you can have unbounded numbers of tasks waiting on a waitqueue
(well, reality has bounds of course, like total memory available etc..)
a wake_all() can end up being many many wake_process() calls.

We've had this be a problem in RT.

So the proposed swait_wake_all() requires being called from task
context, such that it can drop the lock (and IRQ disable) after every
wakeup, and thereby guarantee that higher priority things will not
experience undue latencies.


Re: Should KVM_GUEST stop depending on PARAVIRT?

2015-07-29 Thread Peter Zijlstra
On Wed, Jul 29, 2015 at 11:24:09AM +0200, Paolo Bonzini wrote:
 
 
 On 29/07/2015 11:19, Peter Zijlstra wrote:
  On Tue, Jul 28, 2015 at 05:23:22PM -0700, Andy Lutomirski wrote:
  PeterZ, can we fix this for real instead of relying on
  CONFIG_PARAVIRT=y accidentally turning all msr accesses into safe
  accesses?  We have the CPUID hypervisor bit, but I still don't fully
  understand the problem.
  
  So a whole bunch of PMU drivers already probe the MSRs at init time
  using a combination of rdmsrl_safe() and wrmsrl_safe() to see if:
  
   1) its there at all
   2) if its there, it 'works' like it ought to
  
  See for example: arch/x86/kernel/cpu/perf_event.c:check_hw_exists().
 
 Great.  Of course it's even better if you can probe it with CPUID (e.g.
 aperfmperf in intel_pstate.c), then there's no need for the _safe variant.

Reliable CPUID enumeration would be nice indeed. Note that CPUID isn't
ideal either, for instance the CPUID HT bit doesn't indicate if the CPU
has hyperthreading enabled, just that it is able to report extended
topology information.

Determining if a CPU has hyperthreading enabled is extremely difficult,
and basically boils down to init all cpus, see if there are siblings
anywhere.

And then try and distinguish between the case where HT is disabled and
someone hot-unplugged all siblings :-)
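
A sketch of that "see if there are siblings anywhere" check (assuming the
topology_sibling_cpumask() helper; older kernels call it
topology_thread_cpumask()):

	static bool any_smt_sibling(void)
	{
		int cpu;

		for_each_online_cpu(cpu) {
			if (cpumask_weight(topology_sibling_cpumask(cpu)) > 1)
				return true;	/* this cpu shares a core with a sibling */
		}
		return false;
	}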

  As to the 'bug' at hand, I really don't see the problem. The RAPL driver
  says there's no valid RAPL domains, this is true, there aren't any on
  the virtual machine, so what?
 
 Well, people have complained about it because it's KERN_ERR.  Do you
 think it is okay to downgrade this (perhaps not even just on VMs) to info?

Ah, do people really have nothing better to do? ;-) Seems like a petty
complaint.

Sure, it seems our check_hw_exists() does the same thing:

printk("%sFailed to access perfctr msr (MSR %x is %Lx)\n",
boot_cpu_has(X86_FEATURE_HYPERVISOR) ? KERN_INFO : KERN_ERR,
reg, val_new);

so ERR for real hardware, INFO for hypervisor thingies.


Re: Should KVM_GUEST stop depending on PARAVIRT?

2015-07-29 Thread Peter Zijlstra
On Tue, Jul 28, 2015 at 05:23:22PM -0700, Andy Lutomirski wrote:
 
 PeterZ, can we fix this for real instead of relying on
 CONFIG_PARAVIRT=y accidentally turning all msr accesses into safe
 accesses?  We have the CPUID hypervisor bit, but I still don't fully
 understand the problem.

So a whole bunch of PMU drivers already probe the MSRs at init time
using a combination of rdmsrl_safe() and wrmsrl_safe() to see if:

 1) its there at all
 2) if its there, it 'works' like it ought to

See for example: arch/x86/kernel/cpu/perf_event.c:check_hw_exists().

I have no problem using the _safe() methods to probe MSRs in init
routines and failing the init of the driver etc.
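
Concretely, that init-time probe amounts to something like the following
sketch (not the actual perf code; MSR_IA32_PERFCTR0 is just an example MSR):

	u64 val;

	if (rdmsrl_safe(MSR_IA32_PERFCTR0, &val)) {
		pr_info("perfctr MSR not accessible, not loading driver\n");
		return -ENODEV;
	}
	/* from here on the driver can use plain rdmsrl()/wrmsrl() */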

What I do object to is sprinkling _safe() all over the driver itself and
creating horrid error paths all over (yes, people have proposed crazy
things like that).

Another example is the LBR probing in
arch/x86/kernel/cpu/perf_event_intel.c:intel_pmu_init(). This is the one
people wanted to use _safe() crud for all throughout the driver, instead
of simply disabling the LBR at init time.


As to the 'bug' at hand, I really don't see the problem. The RAPL driver
says there's no valid RAPL domains, this is true, there aren't any on
the virtual machine, so what?




Re: [PATCH v2 3/3] sched/preempt: fix cond_resched_lock() and cond_resched_softirq()

2015-07-15 Thread Peter Zijlstra
On Wed, Jul 15, 2015 at 03:52:34PM +0300, Konstantin Khlebnikov wrote:
 On 15.07.2015 15:16, Eric Dumazet wrote:
 On Wed, 2015-07-15 at 12:52 +0300, Konstantin Khlebnikov wrote:
 These functions check should_resched() before unlocking spinlock/bh-enable:
> preempt_count always non-zero => should_resched() always returns false.
 cond_resched_lock() worked iff spin_needbreak is set.
 
 Interesting, this definitely used to work (linux-3.11)
 
 Any idea of which commit broke things ?
 
 
 Searching... done
 
 This one: bdb43806589096ac4272fe1307e789846ac08d7c in v3.13
 
 before
 
 -static inline int should_resched(void)
 -{
 -   return need_resched() && !(preempt_count() & PREEMPT_ACTIVE);
 -}
 
 after
 
 +static __always_inline bool should_resched(void)
 +{
 +   return unlikely(!*preempt_count_ptr());
 +}

Argh, indeed!

 
 So,
 
 Fixes: bdb438065890 (sched: Extract the basic add/sub preempt_count
 modifiers)

Thanks!
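
One way to restore the old behaviour is to let the caller pass in the preempt
count it expects at that point (e.g. the spinlock offset for
cond_resched_lock()); a sketch of that idea, not necessarily the exact patch
that ended up upstream:

	static __always_inline bool should_resched(int preempt_offset)
	{
		/* resched only when the remaining preempt count is exactly
		 * what the caller says it holds (0, PREEMPT_LOCK_OFFSET, ...) */
		return unlikely(preempt_count() == preempt_offset &&
				tif_need_resched());
	}

	/* cond_resched_lock() would then check should_resched(PREEMPT_LOCK_OFFSET). */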


Re: [PATCH] signals: Generate warning when flush_signals() is called from non-kthread context

2015-05-07 Thread Peter Zijlstra
On Sat, May 02, 2015 at 09:27:49AM -0700, Linus Torvalds wrote:
 On Sat, May 2, 2015 at 1:30 AM, NeilBrown ne...@suse.de wrote:
 
  All the calls in md.c are in a kernel thread so safe, but I'd rather have an
  explicit uninterruptible, but no load-average wait
 
 Hmm. Our task state is a bitmask anyway, so we could probably just add a
 
 #define __TASK_NOLOAD  16
 
 (and move the EXIT_xyz defines *away* from the list that is actually
 the task state), and teach our load average thing to not count those
 kinds of waits. Then you could just use
 
 TASK_UNINTERRUPTIBLE | __TASK_NOLOAD
 
 to make processes not count towards the load.
 
 Or - probably preferably - we could really clean things up, and make
 things much more like the bitmask it *should* be, and have explicit
 bits for
 
  - SLEEPING/STOPPED/EXITING (why not running?)
  - LOADAVG (accounted towards load)
  - WAKESIG (ie interruptible)
  - WAKEKILL (this we already have)
 
 and just make the rule be that we use __TASK_xyz for the actual
 individual bits, and TASK_xyz for the helper combinations. So then
 we'd have
 
#define TASK_UNINTERRUPTIBLE (__TASK_SLEEPING | __TASK_LOADAVG)
#define TASK_INTERRUPTIBLE (__TASK_SLEEPING | __TASK_WAKESIG)
#define TASK_KILLABLE (__TASK_SLEEPING | __TASK_WAKEKILL)
#define TASK_NOLOADAVG (__TASK_SLEEPING)
 
 which is almost certainly how this *should* have been done, but isn't,
 because of historical use.
 
 Cleaning up like that *should* be fairly simple, but I'd be a bit
 nervous about getting all the state comparisons right (we have an
 unholy mix of check this bit and check this whole state, and we'd
 need to make sure we get those cases all right).
 
 Ingo, what do you think? This is mostly a scheduler interface issue..

Hehe, a little something like this:

  https://lkml.org/lkml/2013/11/12/710

Lemme go clean that up and finish it :-)


Re: Non-exiting rdpmc on KVM guests?

2015-04-22 Thread Peter Zijlstra
On Tue, Apr 21, 2015 at 02:10:53PM -0700, Andy Lutomirski wrote:

 One question is whether we care if we leak unrelated counters to the
 guest.  (We already leak them to unrelated user tasks, so this is
 hopefully not a big deal.  OTOH, the API is different for guests.)

Good question indeed. I really do not know.

 Another question is whether it's even worth trying to optimize this.

I think I just ran into a bunch of people who think virt pmu stuff is
important, but we'll have to see if they follow up with the effort of
actually doing the work.

My only concern is that they'll not make a mess of things ;-)


Re: Non-exiting rdpmc on KVM guests?

2015-04-21 Thread Peter Zijlstra
On Mon, Apr 20, 2015 at 05:47:23PM -0700, Andy Lutomirski wrote:
 I just wrote a little perf self-monitoring tool that uses rdpmc to
 count cycles.  Performance sucks under KVM (VMX).
 
 How hard would it be to avoid rdpmc exits in cases where the host and
 guest pmu configurations are compatible as seen by rdpmc?  I'm mostly
 ignorant of how the PMU counter offsets and such work.

I'm not sure how all the VMX stuff works. How is rdpmc seen? Is it at
all possible to not trap that?

But as it stands there is not even a hint from the emulated pmu events
on which counters they 'ought' to run, so the chance of them landing the
same way the guest thinks they ought to is rather small.

'someone' would have to go look at the event scheduling and see how to
implement it, my first suggestion would be to set hwc->idx to the
desired number and hope x86_schedule_events() takes the fast path.

Of course, the moment there's host events along with the guest events
the whole thing comes tumbling down, but there's nothing much one can do
about that.




Re: Non-exiting rdpmc on KVM guests?

2015-04-21 Thread Peter Zijlstra
On Tue, Apr 21, 2015 at 06:32:54PM +0200, Paolo Bonzini wrote:

 However, if you take into account that RDPMC can also be used
 to read an inactive counter, and that multiple guests fight for the
 same host counters, it's even harder to ensure that the guest counter
 indices match those on the host.

That doesn't make sense, only a single vcpu task will ever run at any
one time.

There cannot be inter guest counter contention.  Only the host can
compete for the same resources.


Re: [GIT PULL] First batch of KVM changes for 4.1

2015-04-17 Thread Peter Zijlstra
On Fri, Apr 17, 2015 at 10:52:38AM +0200, Peter Zijlstra wrote:
 On Fri, Apr 10, 2015 at 05:01:29PM +0200, Paolo Bonzini wrote:
   include/linux/sched.h  |8 +
   kernel/sched/core.c|   15 +
 
 Can you please not puke over the scheduler without Acks from at least
 one maintainer?
 
 I complained about this very thing two years ago:
 
  http://marc.info/?l=linux-kernel&m=137345253916751
 
 And now it magically re-appears WTF!

And I really don't understand _why_ you need that extra callback in the
first place. You already have preempt notifiers, just track if you came
in on another cpu than you went out on and voila!


Re: [GIT PULL] First batch of KVM changes for 4.1

2015-04-17 Thread Peter Zijlstra
On Fri, Apr 10, 2015 at 05:01:29PM +0200, Paolo Bonzini wrote:
  include/linux/sched.h  |8 +
  kernel/sched/core.c|   15 +

Can you please not puke over the scheduler without Acks from at least
one maintainer?

I complained about this very thing two years ago:

  http://marc.info/?l=linux-kernel&m=137345253916751

And now it magically re-appears WTF!



Re: [GIT PULL] First batch of KVM changes for 4.1

2015-04-17 Thread Peter Zijlstra
On Fri, Apr 17, 2015 at 12:38:07PM +0200, Paolo Bonzini wrote:
 
 
 On 17/04/2015 12:36, Peter Zijlstra wrote:
  Now you make everybody pay for your crap, x86-64 paravirt or not. Keep
  the cost by those who need it.
  
  Please take it out, ASAP.
 
 I'll just implement the static key.

Can you first show that:

preempt_out:
int cpu = smp_processor_id();
if (vcpu->cpu != cpu)
vcpu->cpu = cpu;

preempt_in:
int cpu = smp_processor_id();
if (unlikely(vcpu->cpu != cpu))
do_vcpu_migration_callback(cpu);

Is actually a measurable performance hit and we actually _need_ the
migration callback?
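
Spelled out with the existing preempt notifier hooks, that suggestion is
roughly the following sketch (host side only; last_cpu and the migration
callback are made-up names for illustration, not existing KVM fields):

	static void vcpu_sched_out(struct preempt_notifier *pn, struct task_struct *next)
	{
		struct kvm_vcpu *vcpu = container_of(pn, struct kvm_vcpu, preempt_notifier);

		vcpu->last_cpu = smp_processor_id();	/* remember the outgoing CPU */
	}

	static void vcpu_sched_in(struct preempt_notifier *pn, int cpu)
	{
		struct kvm_vcpu *vcpu = container_of(pn, struct kvm_vcpu, preempt_notifier);

		if (unlikely(vcpu->last_cpu != cpu))
			vcpu_migration_callback(vcpu, cpu);	/* hypothetical */
	}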

Also, it looks like you already do exactly this for other things, look
at:

kvm_sched_in()
  kvm_arch_vcpu_load()
if (unlikely(vcpu->cpu != cpu) ... )

So no, I don't believe for one second you need this.


Re: [GIT PULL] First batch of KVM changes for 4.1

2015-04-17 Thread Peter Zijlstra
On Fri, Apr 17, 2015 at 02:46:57PM +0200, Paolo Bonzini wrote:
 On 17/04/2015 12:55, Peter Zijlstra wrote:
  Also, it looks like you already do exactly this for other things, look
  at:
  
  kvm_sched_in()
kvm_arch_vcpu_load()
  if (unlikely(vcpu->cpu != cpu) ... )
  
  So no, I don't believe for one second you need this.
 
 You're missing that this snippet is running in the host, while this 
 patch is concerned with the guest (paravirt).

This doesn't make sense; but it brings us back to where we were last
time. There is _0_ justification for this in the patches, that alone is
grounds enough to reject it.

Why should the guest task care about the physical cpu of the vcpu;
that's a layering fail if ever there was one.

Furthermore, the only thing that migration handler seems to do is
increment a variable that is not actually used in that file.

 And frankly, I think the static key is snake oil.  The cost of task 
 migration in terms of cache misses and TLB misses is in no way 
 comparable to the cost of filling in a structure on the stack, 
 dereferencing the head of the notifiers list and seeing that it's NULL.

The path this notifier is called from has nothing to do with those
costs. And the fact you're inflicting these costs on _everyone_ for a
single x86_64-paravirt case is insane.

I've had enough of this, the below goes into sched/urgent and you can
come back with sane patches if and when you're ready.

---
 arch/x86/kernel/pvclock.c | 24 
 include/linux/sched.h |  8 
 kernel/sched/core.c   | 15 ---
 3 files changed, 47 deletions(-)

diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index e5ecd20e72dd..82f116de3835 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -160,27 +160,6 @@ struct pvclock_vcpu_time_info 
*pvclock_get_vsyscall_time_info(int cpu)
 }
 
 #ifdef CONFIG_X86_64
-static int pvclock_task_migrate(struct notifier_block *nb, unsigned long l,
-   void *v)
-{
-   struct task_migration_notifier *mn = v;
-   struct pvclock_vsyscall_time_info *pvti;
-
-   pvti = pvclock_get_vsyscall_user_time_info(mn-from_cpu);
-
-   /* this is NULL when pvclock vsyscall is not initialized */
-   if (unlikely(pvti == NULL))
-   return NOTIFY_DONE;
-
-   pvti->migrate_count++;
-
-   return NOTIFY_DONE;
-}
-
-static struct notifier_block pvclock_migrate = {
-   .notifier_call = pvclock_task_migrate,
-};
-
 /*
  * Initialize the generic pvclock vsyscall state.  This will allocate
  * a/some page(s) for the per-vcpu pvclock information, set up a
@@ -202,9 +181,6 @@ int __init pvclock_init_vsyscall(struct 
pvclock_vsyscall_time_info *i,
 PAGE_KERNEL_VVAR);
}
 
-
-   register_task_migration_notifier(&pvclock_migrate);
-
return 0;
 }
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3f3308824fa4..0eabab99e126 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -176,14 +176,6 @@ extern void get_iowait_load(unsigned long *nr_waiters, 
unsigned long *load);
 extern void calc_global_load(unsigned long ticks);
 extern void update_cpu_load_nohz(void);
 
-/* Notifier for when a task gets migrated to a new CPU */
-struct task_migration_notifier {
-   struct task_struct *task;
-   int from_cpu;
-   int to_cpu;
-};
-extern void register_task_migration_notifier(struct notifier_block *n);
-
 extern unsigned long get_parent_ip(unsigned long addr);
 
 extern void dump_cpu_task(int cpu);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1d0bc4fe266d..dbfc93d40292 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1013,13 +1013,6 @@ void check_preempt_curr(struct rq *rq, struct 
task_struct *p, int flags)
rq_clock_skip_update(rq, true);
 }
 
-static ATOMIC_NOTIFIER_HEAD(task_migration_notifier);
-
-void register_task_migration_notifier(struct notifier_block *n)
-{
-   atomic_notifier_chain_register(&task_migration_notifier, n);
-}
-
 #ifdef CONFIG_SMP
 void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 {
@@ -1050,18 +1043,10 @@ void set_task_cpu(struct task_struct *p, unsigned int 
new_cpu)
trace_sched_migrate_task(p, new_cpu);
 
if (task_cpu(p) != new_cpu) {
-   struct task_migration_notifier tmn;
-
if (p->sched_class->migrate_task_rq)
p->sched_class->migrate_task_rq(p, new_cpu);
p->se.nr_migrations++;
perf_sw_event_sched(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 0);
-
-   tmn.task = p;
-   tmn.from_cpu = task_cpu(p);
-   tmn.to_cpu = new_cpu;
-
-   atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn);
}
 
__set_task_cpu(p, new_cpu);

Re: [GIT PULL] First batch of KVM changes for 4.1

2015-04-17 Thread Peter Zijlstra
On Fri, Apr 17, 2015 at 03:38:58PM +0200, Paolo Bonzini wrote:
  The path this notifier is called from has nothing to do with those
  costs.
 
 How not?  The task is going to incur those costs, it's not like half
 a dozen extra instruction make any difference.  But anyway...

Its attributed to the entity doing the migration, which can be the
wakeup path or a softirq. And we very much do care about the wakeup
path.

 ... that's a valid objection.  Please look at the patch below.

Still a NAK on that, distros have no choice but to enable that CONFIG
option because people might want to run KVM.

CONFIG options are pointless if they end up being mandatory.


Re: [GIT PULL] First batch of KVM changes for 4.1

2015-04-17 Thread Peter Zijlstra
On Fri, Apr 17, 2015 at 12:09:49PM +0200, Paolo Bonzini wrote:
 
 
 On 17/04/2015 11:17, Peter Zijlstra wrote:
  On Fri, Apr 17, 2015 at 10:52:38AM +0200, Peter Zijlstra wrote:
  On Fri, Apr 10, 2015 at 05:01:29PM +0200, Paolo Bonzini wrote:
   include/linux/sched.h  |8 +
   kernel/sched/core.c|   15 +
 
  Can you please not puke over the scheduler without Acks from at least
  one maintainer?
 
 Sorry, this was done while I was not handling the KVM tree.  At the very
 least the commit message should have included the original hashes of the
 commit and the revert.  This way one could have found the original Acks:
 
 commit 582b336ec2c0f0076f5650a029fcc9abd4a906f7
 Author: Marcelo Tosatti mtosa...@redhat.com
 Date:   Tue Nov 27 23:28:54 2012 -0200
 
 sched: add notifier for cross-cpu migrations
 
 Originally from Jeremy Fitzhardinge.
 
 Acked-by: Ingo Molnar mi...@redhat.com
 Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Still not a good reason to sneak it back it now, after I got it taken
out. There was a reason it was removed, prior acks (esp. 2 year old
ones) do not count one whit _NOW_.

Also, Ingo later agreed that is was a mistake,

  http://marc.info/?l=linux-kernel&m=137346715521978&w=2

which is an effective retract of whatever ACK that was.

It was crap code then and its crap code now.

  I complained about this very thing two years ago:
 
http://marc.info/?l=linux-kernel&m=137345253916751
 
  And now it magically re-appears WTF!
  
  And I really don't understand _why_ you need that extra callback in the
  first place. You already have preempt notifiers, just track if you came
  in on another cpu than you went out on and voila!
 
 Then you pay for _all_ preemptions of _all_ processes in the guest,
 instead of the hopefully rare ones that do a CPU migration.

Now you make everybody pay for your crap, x86-64 paravirt or not. Keep
the cost by those who need it.

Please take it out, ASAP.


Re: [PATCH v15 09/15] pvqspinlock: Implement simple paravirt support for the qspinlock

2015-04-13 Thread Peter Zijlstra
On Thu, Apr 09, 2015 at 05:41:44PM -0400, Waiman Long wrote:
 +void __init __pv_init_lock_hash(void)
 +{
 +   int pv_hash_size = 4 * num_possible_cpus();
 +
 +   if (pv_hash_size < (1U << LFSR_MIN_BITS))
 +   pv_hash_size = (1U << LFSR_MIN_BITS);
 +   /*
 +* Allocate space from bootmem which should be page-size aligned
 +* and hence cacheline aligned.
 +*/
 +   pv_lock_hash = alloc_large_system_hash("PV qspinlock",
 +  sizeof(struct pv_hash_bucket),
 +  pv_hash_size, 0, HASH_EARLY,
 +   &pv_lock_hash_bits, NULL,
 +  pv_hash_size, pv_hash_size);
  pv_taps = lfsr_taps(pv_lock_hash_bits);
 
 
 I don't understand what you meant here.

Let me explain (even though I propose taking all the LFSR stuff out).

pv_lock_hash_bit is a runtime variable, therefore it cannot compile time
evaluate the forest of if statements required to compute the taps value.

Therefore its best to compute the taps _once_, and this seems like a
good place to do so.

 +   goto done;
 +   }
 +   }
 +
 +   hash = lfsr(hash, pv_lock_hash_bits, 0);
 Since pv_lock_hash_bits is a variable, you end up running through that
 massive if() forest to find the corresponding tap every single time. It
 cannot compile-time optimize it.
 
 The minimum bits size is now 8. So unless the system has more than 64 vCPUs,
 it will get the right value in the first if statement.

Still, no reason to not pre-compute the taps value, its simple enough.
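
That is, reusing the lfsr_taps()/lfsr() helpers proposed in this series, the
suggestion amounts to a sketch like:

	static unsigned int pv_taps __read_mostly;

	void __init __pv_init_lock_hash(void)
	{
		/* ... existing hash allocation ... */
		pv_taps = lfsr_taps(pv_lock_hash_bits);	/* computed once */
	}

	/* and in pv_hash(): */
	hash = lfsr(hash, pv_taps);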




Re: [PATCH v15 09/15] pvqspinlock: Implement simple paravirt support for the qspinlock

2015-04-13 Thread Peter Zijlstra
On Thu, Apr 09, 2015 at 05:41:44PM -0400, Waiman Long wrote:
 +__visible void __pv_queue_spin_unlock(struct qspinlock *lock)
 +{
 +   struct __qspinlock *l = (void *)lock;
 +   struct pv_node *node;
 +
 +   if (likely(cmpxchg(&l->locked, _Q_LOCKED_VAL, 0) == _Q_LOCKED_VAL))
 +   return;
 +
 +   /*
 +* The queue head has been halted. Need to locate it and wake it up.
 +*/
 +   node = pv_hash_find(lock);
 +   smp_store_release(&l->locked, 0);
 Ah yes, clever that.
 
 +   /*
 +* At this point the memory pointed at by lock can be freed/reused,
 +* however we can still use the PV node to kick the CPU.
 +*/
 +   if (READ_ONCE(node->state) == vcpu_halted)
 +   pv_kick(node->cpu);
 +}
 +PV_CALLEE_SAVE_REGS_THUNK(__pv_queue_spin_unlock);
 However I feel the PV_CALLEE_SAVE_REGS_THUNK thing belongs in the x86
 code.
 
 That is why I originally put my version of the qspinlock_paravirt.h header
 file under arch/x86/include/asm. Maybe we should move it back there. Putting
 the thunk in arch/x86/kernel/kvm.c didn't work when you consider that the
 Xen code also need that.

Well the function is 'generic' and belongs here I think. It's just the
PV_CALLEE_SAVE_REGS_THUNK thing that is arch specific. Should it live in
arch/x86/kernel/paravirt-spinlocks.c instead?


Re: [PATCH v15 09/15] pvqspinlock: Implement simple paravirt support for the qspinlock

2015-04-13 Thread Peter Zijlstra
On Thu, Apr 09, 2015 at 05:41:44PM -0400, Waiman Long wrote:

 +static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
 +{
 +   struct __qspinlock *l = (void *)lock;
 +   struct qspinlock **lp = NULL;
 +   struct pv_node *pn = (struct pv_node *)node;
 +   int slow_set = false;
 +   int loop;
 +
 +   for (;;) {
 +   for (loop = SPIN_THRESHOLD; loop; loop--) {
 +   if (!READ_ONCE(l->locked))
 +   return;
 +
 +   cpu_relax();
 +   }
 +
 +   WRITE_ONCE(pn->state, vcpu_halted);
 +   if (!lp)
 +   lp = pv_hash(lock, pn);
 +   /*
 +* lp must be set before setting _Q_SLOW_VAL
 +*
 +* [S] lp = lock                [RmW] l = l->locked = 0
 +*     MB                             MB
 +* [S] l->locked = _Q_SLOW_VAL  [L]   lp
 +*
 +* Matches the cmpxchg() in pv_queue_spin_unlock().
 +*/
 +   if (!slow_set &&
 +       !cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL)) {
 +   /*
 +* The lock is free and _Q_SLOW_VAL has never been
 +* set. Need to clear the hash bucket before getting
 +* the lock.
 +*/
 +   WRITE_ONCE(*lp, NULL);
 +   return;
 +   } else if (slow_set && !READ_ONCE(l->locked))
 +   return;
 +   slow_set = true;

 I'm somewhat puzzled by the slow_set thing; what is wrong with the thing
 I had, namely:
 
  if (!lp) {
  lp = pv_hash(lock, pn);
 
  /*
   * comment
   */
 lv = cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL);
  if (lv != _Q_LOCKED_VAL) {
  /* we're woken, unhash and return */
  WRITE_ONCE(*lp, NULL);
  return;
  }
  }
 +
 +   pv_wait(&l->locked, _Q_SLOW_VAL);
 
 If we get a spurious wakeup (due to device interrupts or random kick)
 we'll loop around but ->locked will remain _Q_SLOW_VAL.
 
 The purpose of the slow_set flag is not about the lock value. It is to make
 sure that pv_hash_find() will always find a match. Consider the following
 scenario:
 
 cpu1cpu2cpu3
 
 pv_wait
 spurious wakeup
 loop l->locked
 
 read _Q_SLOW_VAL
 pv_hash_find()
 unlock
 
 pv_hash() - same entry
 
 cmpxchg(l->locked)
 clear hash (?)
 
 Here, the entry for cpu3 will be removed leading to panic when
 pv_hash_find() cannot find the entry. So the hash entry can only be
 removed if the other cpu has no chance to see _Q_SLOW_VAL.

Still confused. Afaict that cannot happen with my code. We only do the
cmpxchg(, _Q_SLOW_VAL) _once_.

Only on the first time around that loop will we hash the lock and set
the slow flag. And cpu3 cannot come in on the same entry because we've
not yet released the lock when we find and unhash.




Re: [PATCH v15 16/16] unfair qspinlock: a queue based unfair lock

2015-04-09 Thread Peter Zijlstra
On Thu, Apr 09, 2015 at 09:16:24AM -0400, Rik van Riel wrote:
 On 04/09/2015 03:01 AM, Peter Zijlstra wrote:
  On Wed, Apr 08, 2015 at 02:32:19PM -0400, Waiman Long wrote:
  For a virtual guest with the qspinlock patch, a simple unfair byte lock
  will be used if PV spinlock is not configured in or the hypervisor
  isn't either KVM or Xen. The byte lock works fine with small guest
  of just a few vCPUs. On a much larger guest, however, byte lock can
  have serious performance problem.
  
  Who cares?
 
 There are some people out there running guests with dozens
 of vCPUs. If the code exists to make those setups run better,
 is there a good reason not to use it?

Well use paravirt, !paravirt stuff sucks performance wise anyhow.

The question really is: is the added complexity worth the maintenance
burden. And I'm just not convinced !paravirt virt is a performance
critical target.

 Having said that, only KVM and Xen seem to support very
 large guests, and PV spinlock is available there.
 
 I believe both VMware and Hyperv have a 32 VCPU limit, anyway.

Don't we have Hyperv paravirt drivers? They could add support for
paravirt spinlocks too.




Re: [PATCH v15 09/15] pvqspinlock: Implement simple paravirt support for the qspinlock

2015-04-09 Thread Peter Zijlstra
On Thu, Apr 09, 2015 at 08:13:27PM +0200, Peter Zijlstra wrote:
 On Mon, Apr 06, 2015 at 10:55:44PM -0400, Waiman Long wrote:
  +#define PV_HB_PER_LINE (SMP_CACHE_BYTES / sizeof(struct 
  pv_hash_bucket))

  +static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node 
  *node)
  +{
  +   unsigned long init_hash, hash = hash_ptr(lock, pv_lock_hash_bits);
  +   struct pv_hash_bucket *hb, *end;
  +
  +   if (!hash)
  +   hash = 1;
  +
  +   init_hash = hash;
 +   hb = &pv_lock_hash[hash_align(hash)];
  +   for (;;) {
 +   for (end = hb + PV_HB_PER_LINE; hb < end; hb++) {
 +   if (!cmpxchg(&hb->lock, NULL, lock)) {
 +   WRITE_ONCE(hb->node, node);
  +   /*
  +* We haven't set the _Q_SLOW_VAL yet. So
  +* the order of writing doesn't matter.
  +*/
  +   smp_wmb(); /* matches rmb from pv_hash_find */
  +   goto done;
  +   }
  +   }
  +
  +   hash = lfsr(hash, pv_lock_hash_bits, 0);
 
 Since pv_lock_hash_bits is a variable, you end up running through that
 massive if() forest to find the corresponding tap every single time. It
 cannot compile-time optimize it.
 
 Hence:
   hash = lfsr(hash, pv_taps);
 
 (I don't get the bits argument to the lfsr).
 
 In any case, like I said before, I think we should try a linear probe
 sequence first, the lfsr was over engineering from my side.
 
 +   hb = &pv_lock_hash[hash_align(hash)];

So one thing this does -- and one of the reasons I figured I should
ditch the LFSR instead of fixing it -- is that you end up scanning each
bucket HB_PER_LINE times.

The 'fix' would be to LFSR on cachelines instead of HBs but then you're
stuck with the 0-th cacheline.

  +   BUG_ON(hash == init_hash);
  +   }
  +
  +done:
 +   return &hb->lock;
  +}


Re: [PATCH v15 09/15] pvqspinlock: Implement simple paravirt support for the qspinlock

2015-04-09 Thread Peter Zijlstra
On Mon, Apr 06, 2015 at 10:55:44PM -0400, Waiman Long wrote:
 +++ b/kernel/locking/qspinlock_paravirt.h
 @@ -0,0 +1,321 @@
 +#ifndef _GEN_PV_LOCK_SLOWPATH
 +#error do not include this file
 +#endif
 +
 +/*
 + * Implement paravirt qspinlocks; the general idea is to halt the vcpus 
 instead
 + * of spinning them.
 + *
 + * This relies on the architecture to provide two paravirt hypercalls:
 + *
 + *   pv_wait(u8 *ptr, u8 val) -- suspends the vcpu if *ptr == val
 + *   pv_kick(cpu) -- wakes a suspended vcpu
 + *
 + * Using these we implement __pv_queue_spin_lock_slowpath() and
 + * __pv_queue_spin_unlock() to replace native_queue_spin_lock_slowpath() and
 + * native_queue_spin_unlock().
 + */
 +
 +#define _Q_SLOW_VAL  (3U << _Q_LOCKED_OFFSET)
 +
 +enum vcpu_state {
 + vcpu_running = 0,
 + vcpu_halted,
 +};
 +
 +struct pv_node {
 + struct mcs_spinlock mcs;
 + struct mcs_spinlock __res[3];
 +
 + int cpu;
 + u8  state;
 +};
 +
 +/*
 + * Hash table using open addressing with an LFSR probe sequence.
 + *
 + * Since we should not be holding locks from NMI context (very rare indeed) 
 the
 + * max load factor is 0.75, which is around the point where open addressing
 + * breaks down.
 + *
 + * Instead of probing just the immediate bucket we probe all buckets in the
 + * same cacheline.
 + *
 + * http://en.wikipedia.org/wiki/Hash_table#Open_addressing
 + *
 + * Dynamically allocate a hash table big enough to hold at least 4X the
 + * number of possible cpus in the system. Allocation is done on page
 + * granularity. So the minimum number of hash buckets should be at least
 + * 256 to fully utilize a 4k page.
 + */
 +#define LFSR_MIN_BITS	8
 +#define LFSR_MAX_BITS	(2 + NR_CPUS_BITS)
 +#if LFSR_MAX_BITS < LFSR_MIN_BITS
 +#undef  LFSR_MAX_BITS
 +#define LFSR_MAX_BITS	LFSR_MIN_BITS
 +#endif
 +
 +struct pv_hash_bucket {
 + struct qspinlock *lock;
 + struct pv_node   *node;
 +};
 +#define PV_HB_PER_LINE   (SMP_CACHE_BYTES / sizeof(struct 
 pv_hash_bucket))
 +#define HB_RESERVED  ((struct qspinlock *)1)

This is unused.

 +
 +static struct pv_hash_bucket *pv_lock_hash;
 +static unsigned int pv_lock_hash_bits __read_mostly;

static unsigned int pv_taps __read_mostly;

 +
 +#include <linux/hash.h>
 +#include <linux/lfsr.h>
 +#include <linux/bootmem.h>
 +
 +/*
 + * Allocate memory for the PV qspinlock hash buckets
 + *
 + * This function should be called from the paravirt spinlock initialization
 + * routine.
 + */
 +void __init __pv_init_lock_hash(void)
 +{
 + int pv_hash_size = 4 * num_possible_cpus();
 +
 + if (pv_hash_size < (1U << LFSR_MIN_BITS))
 + pv_hash_size = (1U << LFSR_MIN_BITS);
 + /*
 +  * Allocate space from bootmem which should be page-size aligned
 +  * and hence cacheline aligned.
 +  */
 + pv_lock_hash = alloc_large_system_hash("PV qspinlock",
 +sizeof(struct pv_hash_bucket),
 +pv_hash_size, 0, HASH_EARLY,
 +&pv_lock_hash_bits, NULL,
 +pv_hash_size, pv_hash_size);

pv_taps = lfsr_taps(pv_lock_hash_bits);

 +}
 +
 +static inline u32 hash_align(u32 hash)
 +{
 + return hash & ~(PV_HB_PER_LINE - 1);
 +}
 +
 +static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node 
 *node)
 +{
 + unsigned long init_hash, hash = hash_ptr(lock, pv_lock_hash_bits);
 + struct pv_hash_bucket *hb, *end;
 +
 + if (!hash)
 + hash = 1;
 +
 + init_hash = hash;
 + hb = &pv_lock_hash[hash_align(hash)];
 + for (;;) {
 + for (end = hb + PV_HB_PER_LINE; hb < end; hb++) {
 + if (!cmpxchg(&hb->lock, NULL, lock)) {
 + WRITE_ONCE(hb->node, node);
 + /*
 +  * We haven't set the _Q_SLOW_VAL yet. So
 +  * the order of writing doesn't matter.
 +  */
 + smp_wmb(); /* matches rmb from pv_hash_find */

This doesn't make sense. Both sites do ->lock first and ->node second.
No amount of ordering can 'fix' that.

I think we can safely remove this wmb and the rmb below, because the
required ordering is already provided by setting/observing l-locked ==
SLOW.

 + goto done;
 + }
 + }
 +
 + hash = lfsr(hash, pv_lock_hash_bits, 0);

Since pv_lock_hash_bits is a variable, you end up running through that
massive if() forest to find the corresponding tap every single time. It
cannot compile-time optimize it.

Hence:
hash = lfsr(hash, pv_taps);

(I don't get the bits argument to the lfsr).

In any case, like I said before, I think we should try a linear probe
sequence first, the lfsr was over engineering from my side.
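
For comparison, a linear probe over the same bucket array would look something
like this sketch (names follow the quoted patch, not the code that was
eventually merged):

	static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node)
	{
		unsigned long size = 1UL << pv_lock_hash_bits;
		unsigned long hash = hash_ptr(lock, pv_lock_hash_bits);
		struct pv_hash_bucket *hb;
		unsigned long i;

		for (i = 0; i < size; i++, hash++) {
			hb = &pv_lock_hash[hash & (size - 1)];
			if (!cmpxchg(&hb->lock, NULL, lock)) {
				WRITE_ONCE(hb->node, node);
				return &hb->lock;
			}
		}
		BUG();	/* table is sized for a 0.75 max load factor */
	}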


Re: [PATCH v15 13/15] pvqspinlock: Only kick CPU at unlock time

2015-04-09 Thread Peter Zijlstra
On Mon, Apr 06, 2015 at 10:55:48PM -0400, Waiman Long wrote:

 @@ -219,24 +236,30 @@ static void pv_wait_node(struct mcs_spinlock *node)
  }
  
  /*
 + * Called after setting next->locked = 1 & lock acquired.
 + * Check if the CPU has been halted. If so, set the _Q_SLOW_VAL flag
 + * and put an entry into the lock hash table to be woken up at unlock time.
   */
 -static void pv_kick_node(struct mcs_spinlock *node)
 +static void pv_scan_next(struct qspinlock *lock, struct mcs_spinlock *node)

I'm not too sure about that name change..

  {
   struct pv_node *pn = (struct pv_node *)node;
 + struct __qspinlock *l = (void *)lock;
  
   /*
 +  * Transition CPU state: halted => hashed
 +  * Quit if the transition failed.
*/
 + if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
 + return;
 +
 + /*
 +  * Put the lock into the hash table & set the _Q_SLOW_VAL in the lock.
 +  * As this is the same CPU that will check the _Q_SLOW_VAL value and
 +  * the hash table later on at unlock time, no atomic instruction is
 +  * needed.
 +  */
 + WRITE_ONCE(l->locked, _Q_SLOW_VAL);
 + (void)pv_hash(lock, pn);
  }

This is broken. The unlock path relies on:

  pv_hash()
   MB
  l->locked = SLOW

such that when it observes SLOW, it must then also observe a consistent
bucket.

The above can have us do pv_hash_find() _before_ we actually hash the
lock, which will result in us triggering that BUG_ON() in there.


Re: [PATCH v15 13/15] pvqspinlock: Only kick CPU at unlock time

2015-04-09 Thread Peter Zijlstra
On Thu, Apr 09, 2015 at 09:57:21PM +0200, Peter Zijlstra wrote:
 On Mon, Apr 06, 2015 at 10:55:48PM -0400, Waiman Long wrote:
 
  @@ -219,24 +236,30 @@ static void pv_wait_node(struct mcs_spinlock *node)
   }
   
   /*
  + * Called after setting next->locked = 1 & lock acquired.
  + * Check if the CPU has been halted. If so, set the _Q_SLOW_VAL flag
  + * and put an entry into the lock hash table to be woken up at unlock time.
*/
  -static void pv_kick_node(struct mcs_spinlock *node)
  +static void pv_scan_next(struct qspinlock *lock, struct mcs_spinlock *node)
 
 I'm not too sure about that name change..
 
   {
  struct pv_node *pn = (struct pv_node *)node;
  +   struct __qspinlock *l = (void *)lock;
   
  /*
  +* Transition CPU state: halted => hashed
  +* Quit if the transition failed.
   */
  +   if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
  +   return;
  +
  +   /*
  +* Put the lock into the hash table & set the _Q_SLOW_VAL in the lock.
  +* As this is the same CPU that will check the _Q_SLOW_VAL value and
  +* the hash table later on at unlock time, no atomic instruction is
  +* needed.
  +*/
  +   WRITE_ONCE(l->locked, _Q_SLOW_VAL);
  +   (void)pv_hash(lock, pn);
   }
 
 This is broken. The unlock path relies on:
 
   pv_hash()
MB
   l->locked = SLOW
 
 such that when it observes SLOW, it must then also observe a consistent
 bucket.
 
 The above can have us do pv_hash_find() _before_ we actually hash the
 lock, which will result in us triggering that BUG_ON() in there.

Urgh, clearly it's late and I cannot read. The comment explains it.


Re: [PATCH v15 16/16] unfair qspinlock: a queue based unfair lock

2015-04-09 Thread Peter Zijlstra
On Wed, Apr 08, 2015 at 02:32:19PM -0400, Waiman Long wrote:
 For a virtual guest with the qspinlock patch, a simple unfair byte lock
 will be used if PV spinlock is not configured in or the hypervisor
 isn't either KVM or Xen. The byte lock works fine with small guest
 of just a few vCPUs. On a much larger guest, however, byte lock can
 have serious performance problems.

Who cares?



Re: [PATCH 8/9] qspinlock: Generic paravirt support

2015-04-03 Thread Peter Zijlstra
On Thu, Apr 02, 2015 at 09:48:34PM +0200, Peter Zijlstra wrote:
 @@ -158,20 +257,20 @@ static void pv_wait_head(struct qspinloc
  void __pv_queue_spin_unlock(struct qspinlock *lock)
  {
   struct __qspinlock *l = (void *)lock;
 + struct pv_hash_bucket *hb;
  
   if (xchg(&l->locked, 0) != _Q_SLOW_VAL)
   return;
  
   /*
* At this point the memory pointed at by lock can be freed/reused,
 +  * however we can still use the pointer value to search in our hash
 +  * table.
*
 +  * Also, if we observe _Q_SLOW_VAL we _must_ now observe 'our' hash
 +  * bucket. See pv_wait_head().
*/
 + hb = pv_hash_find(lock);
 + pv_kick(hb->cpu);
 + WRITE_ONCE(hb->lock, NULL); /* unhash */
  }

So I _think_ I found a problem with this approach :/

If, as per the above, the lock does indeed get freed, it can get
re-allocated and stuck in the hash table (again) before we get the
lookup and unhash it.

I'll have to ponder that a bit more.
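
Roughly the interleaving being worried about (CPU numbering made up for
illustration):

	CPU0 (__pv_queue_spin_unlock)         CPU1 (re-user of the freed lock)

	xchg(&l->locked, 0) == _Q_SLOW_VAL
	  ... lock memory freed / reused ...
	                                      takes the recycled memory as a new
	                                      qspinlock, ends up in pv_wait_head()
	                                      and pv_hash()es the same pointer value
	pv_hash_find(lock)
	  can now return CPU1's fresh bucket, so we kick the wrong vcpu and unhash
	  CPU1's entry, while the entry we actually meant to consume stays stuck
	  in the table.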


Re: [PATCH 8/9] qspinlock: Generic paravirt support

2015-04-02 Thread Peter Zijlstra
On Thu, Apr 02, 2015 at 07:20:57PM +0200, Peter Zijlstra wrote:
 pv_wait_head():
 
   pv_hash()
   /* MB as per cmpxchg */
   cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL);
 
 VS
 
 __pv_queue_spin_unlock():
 
   if (xchg(&l->locked, 0) != _Q_SLOW_VAL)
   return;
 
   /* MB as per xchg */
   pv_hash_find(lock);
 
 


Something like so.. compile tested only.

I took out the LFSR because that was likely over engineering from my
side :-)

--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -2,6 +2,8 @@
 #error "do not include this file"
 #endif
 
+#include <linux/hash.h>
+
 /*
  * Implement paravirt qspinlocks; the general idea is to halt the vcpus instead
  * of spinning them.
@@ -107,7 +109,84 @@ static void pv_kick_node(struct mcs_spin
pv_kick(pn-cpu);
 }
 
-static DEFINE_PER_CPU(struct qspinlock *, __pv_lock_wait);
+/*
+ * Hash table using open addressing with a linear probe sequence.
+ *
+ * Since we should not be holding locks from NMI context (very rare indeed) the
+ * max load factor is 0.75, which is around the point where open addressing
+ * breaks down.
+ *
+ * Instead of probing just the immediate bucket we probe all buckets in the
+ * same cacheline.
+ *
+ * http://en.wikipedia.org/wiki/Hash_table#Open_addressing
+ *
+ */
+
+struct pv_hash_bucket {
+   struct qspinlock *lock;
+   int cpu;
+};
+
+/*
+ * XXX dynamic allocate using nr_cpu_ids instead...
+ */
+#define PV_LOCK_HASH_BITS  (2 + NR_CPUS_BITS)
+
+#if PV_LOCK_HASH_BITS < 6
+#undef PV_LOCK_HASH_BITS
+#define PV_LOCK_HASH_BITS	6
+#endif
+
+#define PV_LOCK_HASH_SIZE	(1 << PV_LOCK_HASH_BITS)
+
+static struct pv_hash_bucket __pv_lock_hash[PV_LOCK_HASH_SIZE] ____cacheline_aligned;
+
+#define PV_HB_PER_LINE (SMP_CACHE_BYTES / sizeof(struct 
pv_hash_bucket))
+
+static inline u32 hash_align(u32 hash)
+{
+   return hash & ~(PV_HB_PER_LINE - 1);
+}
+
+#define for_each_hash_bucket(hb, off, hash)					\
+	for (hash = hash_align(hash), off = 0, hb = &__pv_lock_hash[hash + off];\
+	     off < PV_LOCK_HASH_SIZE;						\
+	     off++, hb = &__pv_lock_hash[(hash + off) % PV_LOCK_HASH_SIZE])
+
+static struct pv_hash_bucket *pv_hash_insert(struct qspinlock *lock)
+{
+   u32 offset, hash = hash_ptr(lock, PV_LOCK_HASH_BITS);
+   struct pv_hash_bucket *hb;
+
+   for_each_hash_bucket(hb, offset, hash) {
+   if (!cmpxchg(&hb->lock, NULL, lock)) {
+           WRITE_ONCE(hb->cpu, smp_processor_id());
+   return hb;
+   }
+   }
+
+   /*
+* Hard assumes there is an empty bucket somewhere.
+*/
+   BUG();
+}
+
+static struct pv_hash_bucket *pv_hash_find(struct qspinlock *lock)
+{
+   u32 offset, hash = hash_ptr(lock, PV_LOCK_HASH_BITS);
+   struct pv_hash_bucket *hb;
+
+   for_each_hash_bucket(hb, offset, hash) {
+   if (READ_ONCE(hb->lock) == lock)
+   return hb;
+   }
+
+   /*
+* Hard assumes we _WILL_ find a match.
+*/
+   BUG();
+}
 
 /*
  * Wait for l->locked to become clear; halt the vcpu after a short spin.
@@ -116,7 +195,9 @@ static DEFINE_PER_CPU(struct qspinlock *
 static void pv_wait_head(struct qspinlock *lock)
 {
struct __qspinlock *l = (void *)lock;
+   struct pv_hash_bucket *hb = NULL;
int loop;
+   u8 o;
 
for (;;) {
for (loop = SPIN_THRESHOLD; loop; loop--) {
@@ -126,29 +207,47 @@ static void pv_wait_head(struct qspinloc
cpu_relax();
}
 
-   this_cpu_write(__pv_lock_wait, lock);
-   /*
-* __pv_lock_wait must be set before setting _Q_SLOW_VAL
-*
-* [S] __pv_lock_wait = lock    [RmW] l = l->locked = 0
-*     MB                             MB
-* [S] l->locked = _Q_SLOW_VAL  [L]   __pv_lock_wait
-*
-* Matches the xchg() in pv_queue_spin_unlock().
-*/
-   if (!cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL))
-   goto done;
+   if (!hb) {
+   hb = pv_hash_insert(lock);
+   /*
+* hb must be set before setting _Q_SLOW_VAL
+*
+* [S]   hb <- lock                [RmW] l = l->locked = 0
+*       MB                              MB
+* [RmW] l->locked ?= _Q_SLOW_VAL  [L]   hb
+*
+* Matches the xchg() in pv_queue_spin_unlock().
+*/
+   o = cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL);
+   if (!o) {
+   /*
+* The lock got

Re: [PATCH 8/9] qspinlock: Generic paravirt support

2015-04-02 Thread Peter Zijlstra
On Thu, Apr 02, 2015 at 12:28:30PM -0400, Waiman Long wrote:
 On 04/01/2015 05:03 PM, Peter Zijlstra wrote:
 On Wed, Apr 01, 2015 at 03:58:58PM -0400, Waiman Long wrote:
 On 04/01/2015 02:48 PM, Peter Zijlstra wrote:
 I am sorry that I don't quite get what you mean here. My point is that in
 the hashing step, a cpu will need to scan an empty bucket to put the lock
 in. In the interim, a previously used bucket before the empty one may get
 freed. In the lookup step for that lock, the scanning will stop because of
 an empty bucket in front of the target one.
 Right, that's broken. So we need to do something else to limit the
 lookup, because without that break, a lookup needs to iterate the
 entire array in order to determine -ENOENT, which is expensive.
 
 So my alternative proposal is that IFF we can guarantee that every
 lookup will succeed -- the entry we're looking for is always there, we
 don't need the break on empty but can probe until we find the entry.
 This will be bound in cost to the same number of probes we required for
 insertion and avoids the full array scan.
 
 Now I think we can indeed do this, if as said earlier we do not clear
 the bucket on insert if the cmpxchg succeeds, in that case the unlock
 will observe _Q_SLOW_VAL and do the lookup, the lookup will then find
 the entry. And we then need the unlock to clear the entry.
 Does that explain this? Or should I try again with code?
 
 OK, I got your proposal now. However, there is still the issue that setting
 the _Q_SLOW_VAL flag and the hash bucket are not atomic wrt each other.

So? They're strictly ordered, that's sufficient. We first hash the lock,
then we set _Q_SLOW_VAL. There's a full memory barrier between them.

 It
 is possible a CPU has set the _Q_SLOW_VAL flag but not yet filling in the
 hash bucket while another one is trying to look for it.

Nope. The unlock side does an xchg() on the locked value first, xchg
also implies a full barrier, so that guarantees that if we observe
_Q_SLOW_VAL we must also observe the hash bucket with the lock value.

 So we need to have
 some kind of synchronization mechanism to let the lookup CPU know when is a
 good time to look up.

No, it's all already ordered and working.

pv_wait_head():

pv_hash()
/* MB as per cmpxchg */
cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL);

VS

__pv_queue_spin_unlock():

if (xchg(&l->locked, 0) != _Q_SLOW_VAL)
return;

/* MB as per xchg */
pv_hash_find(lock);




Re: [PATCH 8/9] qspinlock: Generic paravirt support

2015-04-01 Thread Peter Zijlstra
On Wed, Apr 01, 2015 at 12:20:30PM -0400, Waiman Long wrote:
 After more careful reading, I think the assumption that the presence of an
 unused bucket means there is no match is not true. Consider the scenario:
 
 1. cpu 0 puts lock1 into hb[0]
 2. cpu 1 puts lock2 into hb[1]
 3. cpu 2 clears hb[0]
 4. cpu 3 looks for lock2 and doesn't find it

Hmm, yes. The only way I can see that being true is if we assume entries
are never taken out again.

The wikipedia page could use some clarification here, this is not clear.

 At this point, I am thinking using back your previous idea of passing the
 queue head information down the queue.

Having to scan the entire array for a lookup sure sucks, but the wait
loops involved in the other idea can get us in the exact predicament we
were trying to get out of, because their forward progress depends on other
CPUs.

Hohumm.. time to think more I think ;-)
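
A stand-alone toy illustration of the quoted 4-step scenario (user-space
sketch; nothing kernel specific about it, the stop-on-empty lookup is the
point):

#include <stdio.h>

#define SIZE 4
static void *table[SIZE];		/* open addressing, linear probe */

static int insert(void *key, unsigned int h)
{
	for (unsigned int i = 0; i < SIZE; i++) {
		unsigned int slot = (h + i) % SIZE;
		if (!table[slot]) {
			table[slot] = key;
			return slot;
		}
	}
	return -1;
}

/* lookup that stops at the first empty bucket -- the broken assumption */
static int lookup(void *key, unsigned int h)
{
	for (unsigned int i = 0; i < SIZE; i++) {
		unsigned int slot = (h + i) % SIZE;
		if (!table[slot])
			return -1;	/* empty => assume "not present" */
		if (table[slot] == key)
			return slot;
	}
	return -1;
}

int main(void)
{
	int lock1, lock2;

	insert(&lock1, 0);			/* 1. lock1 goes into slot 0      */
	insert(&lock2, 0);			/* 2. lock2 collides, lands in 1  */
	table[0] = NULL;			/* 3. lock1 is unhashed           */
	printf("%d\n", lookup(&lock2, 0));	/* 4. prints -1: lock2 is "lost"  */
	return 0;
}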


Re: [PATCH 8/9] qspinlock: Generic paravirt support

2015-04-01 Thread Peter Zijlstra
On Wed, Apr 01, 2015 at 07:42:39PM +0200, Peter Zijlstra wrote:
  Hohumm.. time to think more I think ;-)
 
 So bear with me, I've not really pondered this well so it could be full
 of holes (again).
 
 After the cmpxchg(l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL) succeeds the
 spin_unlock() must do the hash lookup, right? We can make the lookup
 unhash.
 
 If the cmpxchg() fails the unlock will not do the lookup and we must
 unhash.

The idea being that the result is that any lookup is guaranteed to find
an entry, which reduces our worst case lookup cost to whatever the worst
case insertion cost was.
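
Something like this sketch of a find-and-unhash (pv_hash_find_and_unhash is a
made-up name, and pv_hash_bucket/for_each_hash_bucket are the helpers from the
probing code quoted elsewhere in this thread; illustrative only):

static int pv_hash_find_and_unhash(struct qspinlock *lock)
{
	u32 offset, hash = hash_ptr(lock, PV_LOCK_HASH_BITS);
	struct pv_hash_bucket *hb;
	int cpu;

	for_each_hash_bucket(hb, offset, hash) {
		if (READ_ONCE(hb->lock) == lock) {
			cpu = READ_ONCE(hb->cpu);
			WRITE_ONCE(hb->lock, NULL);	/* the lookup consumes the entry */
			return cpu;
		}
	}
	BUG();	/* by construction the entry must be present */
}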



Re: [PATCH 8/9] qspinlock: Generic paravirt support

2015-04-01 Thread Peter Zijlstra
On Wed, Apr 01, 2015 at 02:54:45PM -0400, Waiman Long wrote:
 On 04/01/2015 02:17 PM, Peter Zijlstra wrote:
 On Wed, Apr 01, 2015 at 07:42:39PM +0200, Peter Zijlstra wrote:
 Hohumm.. time to think more I think ;-)
 So bear with me, I've not really pondered this well so it could be full
 of holes (again).
 
  After the cmpxchg(l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL) succeeds the
 spin_unlock() must do the hash lookup, right? We can make the lookup
 unhash.
 
 If the cmpxchg() fails the unlock will not do the lookup and we must
 unhash.
 The idea being that the result is that any lookup is guaranteed to find
 an entry, which reduces our worst case lookup cost to whatever the worst
 case insertion cost was.
 
 
 I think it doesn't matter who did the unhashing. Multiple independent locks
 can be hashed to the same value. Since they can be unhashed independently,
 there is no way to know whether you have checked all the possible buckets.

oh but the crux is that you guarantee a lookup will find an entry. it will
never need to iterate the entire array.


Re: [PATCH 8/9] qspinlock: Generic paravirt support

2015-04-01 Thread Peter Zijlstra
On Wed, Apr 01, 2015 at 07:12:23PM +0200, Peter Zijlstra wrote:
 On Wed, Apr 01, 2015 at 12:20:30PM -0400, Waiman Long wrote:
  After more careful reading, I think the assumption that the presence of an
  unused bucket means there is no match is not true. Consider the scenario:
  
  1. cpu 0 puts lock1 into hb[0]
  2. cpu 1 puts lock2 into hb[1]
  3. cpu 2 clears hb[0]
  4. cpu 3 looks for lock2 and doesn't find it
 
 Hmm, yes. The only way I can see that being true is if we assume entries
 are never taken out again.
 
 The wikipedia page could use some clarification here, this is not clear.
 
  At this point, I am thinking using back your previous idea of passing the
  queue head information down the queue.
 
 Having to scan the entire array for a lookup sure sucks, but the wait
 loops involved in the other idea can get us in the exact predicament we
 were trying to get out of, because their forward progress depends on other
 CPUs.
 
 Hohumm.. time to think more I think ;-)

So bear with me, I've not really pondered this well so it could be full
of holes (again).

After the cmpxchg(l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL) succeeds the
spin_unlock() must do the hash lookup, right? We can make the lookup
unhash.

If the cmpxchg() fails the unlock will not do the lookup and we must
unhash.



Re: [PATCH 8/9] qspinlock: Generic paravirt support

2015-04-01 Thread Peter Zijlstra
On Wed, Apr 01, 2015 at 03:58:58PM -0400, Waiman Long wrote:
 On 04/01/2015 02:48 PM, Peter Zijlstra wrote:

 I am sorry that I don't quite get what you mean here. My point is that in
 the hashing step, a cpu will need to scan an empty bucket to put the lock
 in. In the interim, a previously used bucket before the empty one may get
 freed. In the lookup step for that lock, the scanning will stop because of
 an empty bucket in front of the target one.

Right, that's broken. So we need to do something else to limit the
lookup, because without that break, a lookup needs to iterate the
entire array in order to determine -ENOENT, which is expensive.

So my alternative proposal is that IFF we can guarantee that every
lookup will succeed -- the entry we're looking for is always there, we
don't need the break on empty but can probe until we find the entry.
This will be bound in cost to the same number of probes we required for
insertion and avoids the full array scan.

Now I think we can indeed do this, if as said earlier we do not clear
the bucket on insert if the cmpxchg succeeds, in that case the unlock
will observe _Q_SLOW_VAL and do the lookup, the lookup will then find
the entry. And we then need the unlock to clear the entry.

Does that explain this? Or should I try again with code?
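
In rough pseudo-C the proposal amounts to this (sketch only; pv_unhash and
pv_hash_find_and_unhash are hypothetical helper names, not code from any
posted patch):

	/* lock side, in pv_wait_head(): */
	pv_hash(lock);					/* publish the <lock, cpu> entry */
	if (cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL) != _Q_LOCKED_VAL)
		pv_unhash(lock);			/* lost the race: clean up ourselves */

	/* unlock side: */
	if (xchg(&l->locked, 0) == _Q_SLOW_VAL)
		pv_kick(pv_hash_find_and_unhash(lock));	/* lookup consumes the entry */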


Re: [PATCH 0/9] qspinlock stuff -v15

2015-03-30 Thread Peter Zijlstra
On Mon, Mar 30, 2015 at 12:25:12PM -0400, Waiman Long wrote:
 I did it differently in my PV portion of the qspinlock patch. Instead of
 just waking up the CPU, the new lock holder will check if the new queue head
 has been halted. If so, it will set the slowpath flag for the halted queue
 head in the lock so as to wake it up at unlock time. This should eliminate
 your concern of doing twice as many VMEXITs in an overcommitted scenario.

We can still do that on top of all this right? As you might have
realized I'm a fan of gradual complexity :-)


Re: [PATCH 0/9] qspinlock stuff -v15

2015-03-26 Thread Peter Zijlstra
On Wed, Mar 25, 2015 at 03:47:39PM -0400, Konrad Rzeszutek Wilk wrote:
 Ah nice. That could be spun out as a separate patch to optimize the existing
 ticket locks I presume.

Yes I suppose we can do something similar for the ticket and patch in
the right increment. We'd need to restructure the code a bit, but
it's not fundamentally impossible.

We could equally apply the head hashing to the current ticket
implementation and avoid the current bitmap iteration.

 Now with the old pv ticketlock code a vCPU would only go to sleep once and
 be woken up when it was its turn. With this new code it is woken up twice 
 (and twice it goes to sleep). With an overcommit scenario this would imply
 that we will have at least twice as many VMEXIT as with the previous code.

An astute observation, I had not considered that.

 I presume when you did benchmarking this did not even register? Though
 I wonder if it would if you ran the benchmark for a week or so.

You presume I benchmarked :-) I managed to boot something virt and run
hackbench in it. I wouldn't know a representative virt setup if I ran
into it.

The thing is, we want this qspinlock for real hardware because it's
faster and I really want to avoid having to carry two spinlock
implementations -- although I suppose that if we really really have to
we could.


Re: [PATCH 9/9] qspinlock,x86,kvm: Implement KVM support for paravirt qspinlock

2015-03-19 Thread Peter Zijlstra
On Wed, Mar 18, 2015 at 10:45:55PM -0400, Waiman Long wrote:
 On 03/16/2015 09:16 AM, Peter Zijlstra wrote:
 I do have some concern about this call site patching mechanism as the
 modification is not atomic. The spin_unlock() calls are in many places in
 the kernel. There is a possibility that a thread is calling a certain
 spin_unlock call site while it is being patched by another one with the
 alternative() function call.
 
 So far, I don't see any problem with bare metal where paravirt_patch_insns()
 is used to patch it to the move instruction. However, in a virtual guest
 environment where paravirt_patch_call() was used, there were situations
 where the system panic because of page fault on some invalid memory in the
 kthread. If you look at the paravirt_patch_call(), you will see:
 
 :
 b->opcode = 0xe8; /* call */
 b->delta = delta;
 
 If another CPU reads the instruction at the call site at the right moment,
 it will get the modified call instruction, but not the new delta value. It
 will then jump to a random location. I believe that was causing the system
 panic that I saw.
 
 So I think it is kind of risky to use it here unless we can guarantee that
 call site patching is atomic wrt other CPUs.

Just look at where the patching is done:

init/main.c:start_kernel()
  check_bugs()
alternative_instructions()
  apply_paravirt()

We're UP and not holding any locks, disable IRQs (see text_poke_early())
and have NMIs 'disabled'.


Re: [PATCH 8/9] qspinlock: Generic paravirt support

2015-03-19 Thread Peter Zijlstra
On Wed, Mar 18, 2015 at 04:50:37PM -0400, Waiman Long wrote:
 +this_cpu_write(__pv_lock_wait, lock);
 
 We may run into the same problem of needing to have 4 queue nodes per CPU.
 If an interrupt happens just after the write and before the actual wait and
 it goes through the same sequence, it will overwrite the __pv_lock_wait[]
 entry. So we may have lost wakeup. That is why the pvticket lock code did
 that just before the actual wait with interrupt disabled. We probably
 couldn't disable interrupt here. So we may need to move the actual write to
 the KVM and Xen code if we keep the current logic.

Hmm indeed. So I didn't actually mean to keep this code, but yes I
missed that.

 +/*
 + * At this point the memory pointed at by lock can be freed/reused,
 + * however we can still use the pointer value to search in our cpu
 + * array.
 + *
 + * XXX: get rid of this loop
 + */
 +for_each_possible_cpu(cpu) {
 +if (per_cpu(__pv_lock_wait, cpu) == lock)
 +pv_kick(cpu);
 +}
 +}
 
 I do want to get rid of this loop too. On average, we have to scan about
 half the number of CPUs available. So it isn't that different
 performance-wise compared with my original idea of following the list from
 tail to head. And how about your idea of propagating the current head down
 the linked list?

Yeah, so I was half done with that patch when I fell asleep on the
flight home. Didn't get around to 'fixing' it. It applies and builds but
doesn't actually work.

_However_ I'm not sure I actually like it much.

The problem I have with it is these wait loops; they can generate the
same problem we're trying to avoid.

So I was now thinking of hashing the lock pointer; let me go and quickly
put something together.

---
 kernel/locking/qspinlock.c  |8 +-
 kernel/locking/qspinlock_paravirt.h |  124 
 2 files changed, 101 insertions(+), 31 deletions(-)

--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -246,10 +246,10 @@ static __always_inline void set_locked(s
  */
 
 static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
+static __always_inline void __pv_wait_node(u32 old, struct mcs_spinlock *node) 
{ }
 static __always_inline void __pv_kick_node(struct mcs_spinlock *node) { }
 
-static __always_inline void __pv_wait_head(struct qspinlock *lock) { }
+static __always_inline void __pv_wait_head(struct qspinlock *lock, struct 
mcs_spinlock *node) { }
 
 #define pv_enabled()   false
 
@@ -399,7 +399,7 @@ void queue_spin_lock_slowpath(struct qsp
prev = decode_tail(old);
 WRITE_ONCE(prev->next, node);
 
-   pv_wait_node(node);
+   pv_wait_node(old, node);
 arch_mcs_spin_lock_contended(&node->locked);
}
 
@@ -414,7 +414,7 @@ void queue_spin_lock_slowpath(struct qsp
 * sequentiality; this is because the set_locked() function below
 * does not imply a full barrier.
 */
-   pv_wait_head(lock);
+   pv_wait_head(lock, node);
	while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_PENDING_MASK)
cpu_relax();
 
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -29,8 +29,14 @@ struct pv_node {
 
int cpu;
u8  state;
+   struct pv_node  *head;
 };
 
+static inline struct pv_node *pv_decode_tail(u32 tail)
+{
+   return (struct pv_node *)decode_tail(tail);
+}
+
 /*
  * Initialize the PV part of the mcs_spinlock node.
  */
@@ -42,17 +48,49 @@ static void pv_init_node(struct mcs_spin
 
 pn->cpu = smp_processor_id();
 pn->state = vcpu_running;
+   pn->head = NULL;
+}
+
+static void pv_propagate_head(struct pv_node *pn, struct pv_node *ppn)
+{
+   /*
+* When we race against the first waiter or ourselves we have to wait
+* until the previous link updates its head pointer before we can
+* propagate it.
+*/
+   while (!READ_ONCE(ppn->head)) {
+   /*
+    * queue_spin_lock_slowpath could have been waiting for the
+    * node->next store above before setting node->locked.
+    */
+   if (ppn->mcs.locked)
+   return;
+
+   cpu_relax();
+   }
+   /*
+* If we interleave such that the store from pv_set_head() happens
+* between our load and store we could have over-written the new head
+* value.
+*
+* Therefore use cmpxchg to only propagate the old head value if our
+* head value is unmodified.
+*/
+   (void)cmpxchg(&pn->head, NULL, READ_ONCE(ppn->head));
 }
 
 /*
  * Wait for node->locked to become true, halt the vcpu after a short spin.
  * pv_kick_node() is used to wake the 

Re: [PATCH 8/9] qspinlock: Generic paravirt support

2015-03-19 Thread Peter Zijlstra
On Thu, Mar 19, 2015 at 11:12:42AM +0100, Peter Zijlstra wrote:
 So I was now thinking of hashing the lock pointer; let me go and quickly
 put something together.

A little something like so; ideally we'd allocate the hashtable since
NR_CPUS is kinda bloated, but it shows the idea I think.

And while this has loops in it (the rehashing thing), their fwd progress
does not depend on other CPUs.

And I suspect that for the typical lock contention scenarios it's
unlikely we ever really get into long rehashing chains.

---
 include/linux/lfsr.h|   49 
 kernel/locking/qspinlock_paravirt.h |  143 
 2 files changed, 178 insertions(+), 14 deletions(-)

--- /dev/null
+++ b/include/linux/lfsr.h
@@ -0,0 +1,49 @@
+#ifndef _LINUX_LFSR_H
+#define _LINUX_LFSR_H
+
+/*
+ * Simple Binary Galois Linear Feedback Shift Register
+ *
+ * http://en.wikipedia.org/wiki/Linear_feedback_shift_register
+ *
+ */
+
+extern void __lfsr_needs_more_taps(void);
+
+static __always_inline u32 lfsr_taps(int bits)
+{
+   if (bits ==  1) return 0x0001;
+   if (bits ==  2) return 0x0001;
+   if (bits ==  3) return 0x0003;
+   if (bits ==  4) return 0x0009;
+   if (bits ==  5) return 0x0012;
+   if (bits ==  6) return 0x0021;
+   if (bits ==  7) return 0x0041;
+   if (bits ==  8) return 0x008E;
+   if (bits ==  9) return 0x0108;
+   if (bits == 10) return 0x0204;
+   if (bits == 11) return 0x0402;
+   if (bits == 12) return 0x0829;
+   if (bits == 13) return 0x100D;
+   if (bits == 14) return 0x2015;
+
+   /*
+* For more taps see:
+*   http://users.ece.cmu.edu/~koopman/lfsr/index.html
+*/
+   __lfsr_needs_more_taps();
+
+   return 0;
+}
+
+static inline u32 lfsr(u32 val, int bits)
+{
+   u32 bit = val & 1;
+
+   val >>= 1;
+   if (bit)
+   val ^= lfsr_taps(bits);
+   return val;
+}
+
+#endif /* _LINUX_LFSR_H */
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -2,6 +2,9 @@
 #error "do not include this file"
 #endif
 
+#include <linux/hash.h>
+#include <linux/lfsr.h>
+
 /*
  * Implement paravirt qspinlocks; the general idea is to halt the vcpus instead
  * of spinning them.
@@ -107,7 +110,120 @@ static void pv_kick_node(struct mcs_spin
 pv_kick(pn->cpu);
 }
 
-static DEFINE_PER_CPU(struct qspinlock *, __pv_lock_wait);
+/*
+ * Hash table using open addressing with an LFSR probe sequence.
+ *
+ * Since we should not be holding locks from NMI context (very rare indeed) the
+ * max load factor is 0.75, which is around the point where open addressing
+ * breaks down.
+ *
+ * Instead of probing just the immediate bucket we probe all buckets in the
+ * same cacheline.
+ *
+ * http://en.wikipedia.org/wiki/Hash_table#Open_addressing
+ *
+ */
+
+#define HB_RESERVED	((struct qspinlock *)1)
+
+struct pv_hash_bucket {
+   struct qspinlock *lock;
+   int cpu;
+};
+
+/*
+ * XXX dynamic allocate using nr_cpu_ids instead...
+ */
+#define PV_LOCK_HASH_BITS  (2 + NR_CPUS_BITS)
+
+#if PV_LOCK_HASH_BITS < 6
+#undef PV_LOCK_HASH_BITS
+#define PV_LOCK_HASH_BITS	6
+#endif
+
+#define PV_LOCK_HASH_SIZE	(1 << PV_LOCK_HASH_BITS)
+
+static struct pv_hash_bucket __pv_lock_hash[PV_LOCK_HASH_SIZE] ____cacheline_aligned;
+
+#define PV_HB_PER_LINE (SMP_CACHE_BYTES / sizeof(struct 
pv_hash_bucket))
+
+static inline u32 hash_align(u32 hash)
+{
+   return hash & ~(PV_HB_PER_LINE - 1);
+}
+
+static struct qspinlock **pv_hash(struct qspinlock *lock)
+{
+   u32 hash = hash_ptr(lock, PV_LOCK_HASH_BITS);
+   struct pv_hash_bucket *hb, *end;
+
+   if (!hash)
+   hash = 1;
+
+   hb = &__pv_lock_hash[hash_align(hash)];
+   for (;;) {
+           for (end = hb + PV_HB_PER_LINE; hb < end; hb++) {
+                   if (cmpxchg(&hb->lock, NULL, HB_RESERVED)) {
+                           WRITE_ONCE(hb->cpu, smp_processor_id());
+   /*
+* Since we must read lock first and cpu
+* second, we must write cpu first and lock
+* second, therefore use HB_RESERVE to mark an
+* entry in use before writing the values.
+*
+* This can cause hb_hash_find() to not find a
+* cpu even though _Q_SLOW_VAL, this is not a
+* problem since we re-check l->locked before
+* going to sleep and the unlock will have
+* cleared l->locked already.
+*/
+   smp_wmb(); /* matches rmb from pv_hash_find */
+   WRITE_ONCE(hb->lock, lock);
+   goto done

Re: [PATCH 8/9] qspinlock: Generic paravirt support

2015-03-19 Thread Peter Zijlstra
On Thu, Mar 19, 2015 at 01:25:36PM +0100, Peter Zijlstra wrote:
 +static struct qspinlock **pv_hash(struct qspinlock *lock)
 +{
 + u32 hash = hash_ptr(lock, PV_LOCK_HASH_BITS);
 + struct pv_hash_bucket *hb, *end;
 +
 + if (!hash)
 + hash = 1;
 +
 + hb = &__pv_lock_hash[hash_align(hash)];
 + for (;;) {
 +	for (end = hb + PV_HB_PER_LINE; hb < end; hb++) {
 +		if (cmpxchg(&hb->lock, NULL, HB_RESERVED)) {

That should be: !cmpxchg(), bit disturbing that that booted.

 + WRITE_ONCE(hb->cpu, smp_processor_id());
 + /*
 +  * Since we must read lock first and cpu
 +  * second, we must write cpu first and lock
 +  * second, therefore use HB_RESERVE to mark an
 +  * entry in use before writing the values.
 +  *
 +  * This can cause hb_hash_find() to not find a
 +  * cpu even though _Q_SLOW_VAL, this is not a
 +  * problem since we re-check l->locked before
 +  * going to sleep and the unlock will have
 +  * cleared l->locked already.
 +  */
 + smp_wmb(); /* matches rmb from pv_hash_find */
 + WRITE_ONCE(hb->lock, lock);
 + goto done;
 + }
 + }
 +
 + hash = lfsr(hash, PV_LOCK_HASH_BITS);
 + hb = &__pv_lock_hash[hash_align(hash)];
 + }
 +
 +done:
 + return hb->lock;
 +}


Re: [Xen-devel] [PATCH 0/9] qspinlock stuff -v15

2015-03-19 Thread Peter Zijlstra
On Thu, Mar 19, 2015 at 06:01:34PM +, David Vrabel wrote:
 This seems work for me, but I've not got time to give it a more thorough
 testing.
 
 You can fold this into your series.

Thanks!

 There doesn't seem to be a way to disable QUEUE_SPINLOCKS when supported by
 the arch, is this intentional?  If so, the existing ticketlock code could go.

Yeah, it's left as a rudiment such that if we find issues with the
qspinlock code we can 'revert' with a trivial patch. If no issues show
up we can rip out all the old code in a subsequent release.


[PATCH 4/9] qspinlock: Extract out code snippets for the next patch

2015-03-16 Thread Peter Zijlstra
From: Waiman Long waiman.l...@hp.com

This is a preparatory patch that extracts out the following 2 code
snippets to prepare for the next performance optimization patch.

 1) the logic for the exchange of new and previous tail code words
into a new xchg_tail() function.
 2) the logic for clearing the pending bit and setting the locked bit
into a new clear_pending_set_locked() function.

This patch also simplifies the trylock operation before queuing by
calling queue_spin_trylock() directly.

Cc: Ingo Molnar mi...@redhat.com
Cc: David Vrabel david.vra...@citrix.com
Cc: Oleg Nesterov o...@redhat.com
Cc: Scott J Norton scott.nor...@hp.com
Cc: Paolo Bonzini paolo.bonz...@gmail.com
Cc: Douglas Hatch doug.ha...@hp.com
Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com
Cc: Boris Ostrovsky boris.ostrov...@oracle.com
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Linus Torvalds torva...@linux-foundation.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: H. Peter Anvin h...@zytor.com
Cc: Rik van Riel r...@redhat.com
Cc: Raghavendra K T raghavendra...@linux.vnet.ibm.com
Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
Link: 
http://lkml.kernel.org/r/1421784755-21945-5-git-send-email-waiman.l...@hp.com
---
 include/asm-generic/qspinlock_types.h |2 
 kernel/locking/qspinlock.c|   91 ++
 2 files changed, 62 insertions(+), 31 deletions(-)

--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -58,6 +58,8 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_BITS   (32 - _Q_TAIL_CPU_OFFSET)
 #define _Q_TAIL_CPU_MASK   _Q_SET_MASK(TAIL_CPU)
 
+#define _Q_TAIL_MASK   (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
+
 #define _Q_LOCKED_VAL  (1U << _Q_LOCKED_OFFSET)
 #define _Q_PENDING_VAL (1U << _Q_PENDING_OFFSET)
 
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -97,6 +97,54 @@ static inline struct mcs_spinlock *decod
 #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
 
 /**
+ * clear_pending_set_locked - take ownership and clear the pending bit.
+ * @lock: Pointer to queue spinlock structure
+ * @val : Current value of the queue spinlock 32-bit word
+ *
+ * *,1,0 -> *,0,1
+ */
+static __always_inline void
+clear_pending_set_locked(struct qspinlock *lock, u32 val)
+{
+   u32 new, old;
+
+   for (;;) {
+   new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
+
+   old = atomic_cmpxchg(&lock->val, val, new);
+   if (old == val)
+   break;
+
+   val = old;
+   }
+}
+
+/**
+ * xchg_tail - Put in the new queue tail code word & retrieve previous one
+ * @lock : Pointer to queue spinlock structure
+ * @tail : The new queue tail code word
+ * Return: The previous queue tail code word
+ *
+ * xchg(lock, tail)
+ *
+ * p,*,* -> n,*,* ; prev = xchg(lock, node)
+ */
+static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
+{
+   u32 old, new, val = atomic_read(&lock->val);
+
+   for (;;) {
+   new = (val & _Q_LOCKED_PENDING_MASK) | tail;
+   old = atomic_cmpxchg(&lock->val, val, new);
+   if (old == val)
+   break;
+
+   val = old;
+   }
+   return old;
+}
+
+/**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
  * @val: Current value of the queue spinlock 32-bit word
@@ -178,15 +226,7 @@ void queue_spin_lock_slowpath(struct qsp
 *
 * *,1,0 -> *,0,1
 */
-   for (;;) {
-   new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
-
-   old = atomic_cmpxchg(&lock->val, val, new);
-   if (old == val)
-   break;
-
-   val = old;
-   }
+   clear_pending_set_locked(lock, val);
return;
 
/*
@@ -203,37 +243,26 @@ void queue_spin_lock_slowpath(struct qsp
 node->next = NULL;
 
/*
-* We have already touched the queueing cacheline; don't bother with
-* pending stuff.
-*
-* trylock || xchg(lock, node)
-*
-* 0,0,0 -> 0,0,1 ; no tail, not locked -> no tail, locked.
-* p,y,x -> n,y,x ; tail was p -> tail is n; preserving locked.
+* We touched a (possibly) cold cacheline in the per-cpu queue node;
+* attempt the trylock once more in the hope someone let go while we
+* weren't watching.
 */
-   for (;;) {
-   new = _Q_LOCKED_VAL;
-   if (val)
-   new = tail | (val & _Q_LOCKED_PENDING_MASK);
-
-   old = atomic_cmpxchg(&lock->val, val, new);
-   if (old == val)
-   break;
-
-   val = old;
-   }
+   if (queue_spin_trylock(lock))
+   goto release;
 
/*
-* we won the trylock; forget about queueing.
+* We

[PATCH 5/9] qspinlock: Optimize for smaller NR_CPUS

2015-03-16 Thread Peter Zijlstra
From: Peter Zijlstra pet...@infradead.org

When we allow for a max NR_CPUS < 2^14 we can optimize the pending
wait-acquire and the xchg_tail() operations.

By growing the pending bit to a byte, we reduce the tail to 16 bits.
This means we can use xchg16 for the tail part and do away with all
the repeated cmpxchg() operations.

This in turn allows us to unconditionally acquire; the locked state
as observed by the wait loops cannot change. And because both locked
and pending are now a full byte we can use simple stores for the
state transition, obviating one atomic operation entirely.

This optimization is needed to make the qspinlock achieve performance
parity with ticket spinlock at light load.

All this is horribly broken on Alpha pre EV56 (and any other arch that
cannot do single-copy atomic byte stores).

Cc: Ingo Molnar mi...@redhat.com
Cc: David Vrabel david.vra...@citrix.com
Cc: Oleg Nesterov o...@redhat.com
Cc: Scott J Norton scott.nor...@hp.com
Cc: Paolo Bonzini paolo.bonz...@gmail.com
Cc: Douglas Hatch doug.ha...@hp.com
Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com
Cc: Boris Ostrovsky boris.ostrov...@oracle.com
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Linus Torvalds torva...@linux-foundation.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: H. Peter Anvin h...@zytor.com
Cc: Rik van Riel r...@redhat.com
Cc: Raghavendra K T raghavendra...@linux.vnet.ibm.com
Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
Link: 
http://lkml.kernel.org/r/1421784755-21945-6-git-send-email-waiman.l...@hp.com
---
 include/asm-generic/qspinlock_types.h |   13 ++
 kernel/locking/qspinlock.c|   71 +-
 2 files changed, 83 insertions(+), 1 deletion(-)

--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -35,6 +35,14 @@ typedef struct qspinlock {
 /*
  * Bitfields in the atomic value:
  *
+ * When NR_CPUS < 16K
+ *  0- 7: locked byte
+ * 8: pending
+ *  9-15: not used
+ * 16-17: tail index
+ * 18-31: tail cpu (+1)
+ *
+ * When NR_CPUS >= 16K
  *  0- 7: locked byte
  * 8: pending
  *  9-10: tail index
@@ -47,7 +55,11 @@ typedef struct qspinlock {
 #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED)
 
 #define _Q_PENDING_OFFSET  (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#if CONFIG_NR_CPUS < (1U << 14)
+#define _Q_PENDING_BITS	8
+#else
 #define _Q_PENDING_BITS	1
+#endif
 #define _Q_PENDING_MASK	_Q_SET_MASK(PENDING)
 
 #define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS)
@@ -58,6 +70,7 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_BITS   (32 - _Q_TAIL_CPU_OFFSET)
 #define _Q_TAIL_CPU_MASK   _Q_SET_MASK(TAIL_CPU)
 
+#define _Q_TAIL_OFFSET _Q_TAIL_IDX_OFFSET
 #define _Q_TAIL_MASK   (_Q_TAIL_IDX_MASK | _Q_TAIL_CPU_MASK)
 
 #define _Q_LOCKED_VAL  (1U  _Q_LOCKED_OFFSET)
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -24,6 +24,7 @@
 #include <linux/percpu.h>
 #include <linux/hardirq.h>
 #include <linux/mutex.h>
+#include <asm/byteorder.h>
 #include <asm/qspinlock.h>
 
 /*
@@ -56,6 +57,10 @@
  * node; whereby avoiding the need to carry a node from lock to unlock, and
  * preserving existing lock API. This also makes the unlock code simpler and
  * faster.
+ *
+ * N.B. The current implementation only supports architectures that allow
+ *  atomic operations on smaller 8-bit and 16-bit data types.
+ *
  */
 
 #include "mcs_spinlock.h"
@@ -96,6 +101,64 @@ static inline struct mcs_spinlock *decod
 
 #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
 
+/*
+ * By using the whole 2nd least significant byte for the pending bit, we
+ * can allow better optimization of the lock acquisition for the pending
+ * bit holder.
+ */
+#if _Q_PENDING_BITS == 8
+
+struct __qspinlock {
+   union {
+   atomic_t val;
+   struct {
+#ifdef __LITTLE_ENDIAN
+   u16 locked_pending;
+   u16 tail;
+#else
+   u16 tail;
+   u16 locked_pending;
+#endif
+   };
+   };
+};
+
+/**
+ * clear_pending_set_locked - take ownership and clear the pending bit.
+ * @lock: Pointer to queue spinlock structure
+ * @val : Current value of the queue spinlock 32-bit word
+ *
+ * *,1,0 -> *,0,1
+ *
+ * Lock stealing is not allowed if this function is used.
+ */
+static __always_inline void
+clear_pending_set_locked(struct qspinlock *lock, u32 val)
+{
+   struct __qspinlock *l = (void *)lock;
+
+   WRITE_ONCE(l->locked_pending, _Q_LOCKED_VAL);
+}
+
+/*
+ * xchg_tail - Put in the new queue tail code word  retrieve previous one
+ * @lock : Pointer to queue spinlock structure
+ * @tail : The new queue tail code word
+ * Return: The previous queue tail code word
+ *
+ * xchg(lock, tail)
+ *
+ * p,*,* -> n,*,* ; prev = xchg(lock, node)
+ */
+static __always_inline u32

[PATCH 2/9] qspinlock, x86: Enable x86-64 to use queue spinlock

2015-03-16 Thread Peter Zijlstra
From: Waiman Long waiman.l...@hp.com

This patch makes the necessary changes at the x86 architecture
specific layer to enable the use of queue spinlock for x86-64. As
x86-32 machines are typically not multi-socket. The benefit of queue
spinlock may not be apparent. So queue spinlock is not enabled.

Currently, there are some incompatibilities between the para-virtualized
spinlock code (which hard-codes the use of ticket spinlock) and the
queue spinlock. Therefore, the use of queue spinlock is disabled when
the para-virtualized spinlock is enabled.

The arch/x86/include/asm/qspinlock.h header file includes some x86
specific optimization which will make the queue spinlock code perform
better than the generic implementation.

Cc: Ingo Molnar mi...@redhat.com
Cc: David Vrabel david.vra...@citrix.com
Cc: Oleg Nesterov o...@redhat.com
Cc: Scott J Norton scott.nor...@hp.com
Cc: Paolo Bonzini paolo.bonz...@gmail.com
Cc: Douglas Hatch doug.ha...@hp.com
Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com
Cc: Boris Ostrovsky boris.ostrov...@oracle.com
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Linus Torvalds torva...@linux-foundation.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: H. Peter Anvin h...@zytor.com
Cc: Rik van Riel r...@redhat.com
Cc: Raghavendra K T raghavendra...@linux.vnet.ibm.com
Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
Link: 
http://lkml.kernel.org/r/1421784755-21945-3-git-send-email-waiman.l...@hp.com
---
 arch/x86/Kconfig  |1 +
 arch/x86/include/asm/qspinlock.h  |   20 
 arch/x86/include/asm/spinlock.h   |5 +
 arch/x86/include/asm/spinlock_types.h |4 
 4 files changed, 30 insertions(+)
 create mode 100644 arch/x86/include/asm/qspinlock.h

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -125,6 +125,7 @@ config X86
select MODULES_USE_ELF_RELA if X86_64
select CLONE_BACKWARDS if X86_32
select ARCH_USE_BUILTIN_BSWAP
+   select ARCH_USE_QUEUE_SPINLOCK
select ARCH_USE_QUEUE_RWLOCK
select OLD_SIGSUSPEND3 if X86_32 || IA32_EMULATION
select OLD_SIGACTION if X86_32
--- /dev/null
+++ b/arch/x86/include/asm/qspinlock.h
@@ -0,0 +1,20 @@
+#ifndef _ASM_X86_QSPINLOCK_H
+#define _ASM_X86_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+#define queue_spin_unlock queue_spin_unlock
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ *
+ * An smp_store_release() on the least-significant byte.
+ */
+static inline void queue_spin_unlock(struct qspinlock *lock)
+{
+   smp_store_release((u8 *)lock, 0);
+}
+
+#include <asm-generic/qspinlock.h>
+
+#endif /* _ASM_X86_QSPINLOCK_H */
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -42,6 +42,10 @@
 extern struct static_key paravirt_ticketlocks_enabled;
 static __always_inline bool static_key_false(struct static_key *key);
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm/qspinlock.h>
+#else
+
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
 
 static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
@@ -196,6 +200,7 @@ static inline void arch_spin_unlock_wait
cpu_relax();
}
 }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 /*
  * Read-write spinlocks, allowing multiple readers
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -23,6 +23,9 @@ typedef u32 __ticketpair_t;
 
 #define TICKET_SHIFT   (sizeof(__ticket_t) * 8)
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+#include <asm-generic/qspinlock_types.h>
+#else
 typedef struct arch_spinlock {
union {
__ticketpair_t head_tail;
@@ -33,6 +36,7 @@ typedef struct arch_spinlock {
 } arch_spinlock_t;
 
 #define __ARCH_SPIN_LOCK_UNLOCKED  { { 0 } }
+#endif /* CONFIG_QUEUE_SPINLOCK */
 
 #include <asm-generic/qrwlock_types.h>
 




[PATCH 6/9] qspinlock: Use a simple write to grab the lock

2015-03-16 Thread Peter Zijlstra
From: Waiman Long waiman.l...@hp.com

Currently, atomic_cmpxchg() is used to get the lock. However, this
is not really necessary if there is more than one task in the queue
and the queue head doesn't need to reset the tail code. For that case,
a simple write to set the lock bit is enough as the queue head will
be the only one eligible to get the lock as long as it checks that
both the lock and pending bits are not set. The current pending bit
waiting code will ensure that the bit will not be set as soon as the
tail code in the lock is set.
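
In sketch form the grab-the-lock step then becomes a plain byte store
(assuming the __qspinlock overlay used elsewhere in this series):

static __always_inline void set_locked(struct qspinlock *lock)
{
	struct __qspinlock *l = (void *)lock;

	/* the queue head is the only possible owner here, no atomic needed */
	WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
}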

With that change, there is some slight improvement in the performance
of the queue spinlock in the 5M loop micro-benchmark run on a 4-socket
Westmere-EX machine as shown in the tables below.

[Standalone/Embedded - same node]
  # of tasks    Before patch    After patch    %Change
  ----------    ------------    -----------    -------
       3         2324/2321       2248/2265     -3%/-2%
       4         2890/2896       2819/2831     -2%/-2%
       5         3611/3595       3522/3512     -2%/-2%
       6         4281/4276       4173/4160     -3%/-3%
       7         5018/5001       4875/4861     -3%/-3%
       8         5759/5750       5563/5568     -3%/-3%

[Standalone/Embedded - different nodes]
  # of tasks    Before patch    After patch    %Change
  ----------    ------------    -----------    -------
       3        12242/12237     12087/12093    -1%/-1%
       4        10688/10696     10507/10521    -2%/-2%

It was also found that this change produced a much bigger performance
improvement in the newer IvyBridge-EX chip and essentially closed
the performance gap between the ticket spinlock and queue spinlock.

The disk workload of the AIM7 benchmark was run on a 4-socket
Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users
on a 3.14 based kernel. The results of the test runs were:

AIM7 XFS Disk Test
  kernel        JPM       Real Time   Sys Time    Usr Time
  ------        ---       ---------   --------    --------
  ticketlock    5678233   3.17        96.61       5.81
  qspinlock     5750799   3.13        94.83       5.97

AIM7 EXT4 Disk Test
  kernel        JPM       Real Time   Sys Time    Usr Time
  ------        ---       ---------   --------    --------
  ticketlock    1114551   16.15       509.72      7.11
  qspinlock     2184466   8.24        232.99      6.01

The ext4 filesystem run had a much higher spinlock contention than
the xfs filesystem run.

The ebizzy -m test was also run with the following results:

  kernel       records/s  Real Time   Sys Time    Usr Time
  ------       ---------  ---------   --------    --------
  ticketlock   2075       10.00       216.35      3.49
  qspinlock    3023       10.00       198.20      4.80

Cc: H. Peter Anvin h...@zytor.com
Cc: David Vrabel david.vra...@citrix.com
Cc: Oleg Nesterov o...@redhat.com
Cc: Scott J Norton scott.nor...@hp.com
Cc: Paolo Bonzini paolo.bonz...@gmail.com
Cc: Douglas Hatch doug.ha...@hp.com
Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com
Cc: Boris Ostrovsky boris.ostrov...@oracle.com
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Rik van Riel r...@redhat.com
Cc: Linus Torvalds torva...@linux-foundation.org
Cc: Raghavendra K T raghavendra...@linux.vnet.ibm.com
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@redhat.com
Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
Link: 
http://lkml.kernel.org/r/1421784755-21945-7-git-send-email-waiman.l...@hp.com
---
 kernel/locking/qspinlock.c |   61 +
 1 file changed, 45 insertions(+), 16 deletions(-)

--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -105,24 +105,33 @@ static inline struct mcs_spinlock *decod
  * By using the whole 2nd least significant byte for the pending bit, we
  * can allow better optimization of the lock acquisition for the pending
  * bit holder.
+ *
+ * This internal structure is also used by the set_locked function which
+ * is not restricted to _Q_PENDING_BITS == 8.
  */
-#if _Q_PENDING_BITS == 8
-
 struct __qspinlock {
union {
atomic_t val;
-   struct {
 #ifdef __LITTLE_ENDIAN
+   u8   locked;
+   struct {
u16 locked_pending;
u16 tail;
+   };
 #else
+   struct {
u16 tail;
u16 locked_pending;
-#endif
};
+   struct {
+   u8  reserved[3];
+   u8  locked;
+   };
+#endif
};
 };
 
+#if _Q_PENDING_BITS == 8
 /**
  * clear_pending_set_locked - take ownership and clear the pending bit.
  * @lock: Pointer to queue

[PATCH 7/9] qspinlock: Revert to test-and-set on hypervisors

2015-03-16 Thread Peter Zijlstra
From: Peter Zijlstra pet...@infradead.org

When we detect a hypervisor (!paravirt, see qspinlock paravirt support
patches), revert to a simple test-and-set lock to avoid the horrors
of queue preemption.

Cc: Ingo Molnar mi...@redhat.com
Cc: David Vrabel david.vra...@citrix.com
Cc: Oleg Nesterov o...@redhat.com
Cc: Scott J Norton scott.nor...@hp.com
Cc: Paolo Bonzini paolo.bonz...@gmail.com
Cc: Douglas Hatch doug.ha...@hp.com
Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com
Cc: Boris Ostrovsky boris.ostrov...@oracle.com
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Linus Torvalds torva...@linux-foundation.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: H. Peter Anvin h...@zytor.com
Cc: Rik van Riel r...@redhat.com
Cc: Raghavendra K T raghavendra...@linux.vnet.ibm.com
Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
Link: 
http://lkml.kernel.org/r/1421784755-21945-8-git-send-email-waiman.l...@hp.com
---
 arch/x86/include/asm/qspinlock.h |   14 ++
 include/asm-generic/qspinlock.h  |7 +++
 kernel/locking/qspinlock.c   |3 +++
 3 files changed, 24 insertions(+)

--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -1,6 +1,7 @@
 #ifndef _ASM_X86_QSPINLOCK_H
 #define _ASM_X86_QSPINLOCK_H
 
+#include <asm/cpufeature.h>
 #include <asm-generic/qspinlock_types.h>
 
 #definequeue_spin_unlock queue_spin_unlock
@@ -15,6 +16,19 @@ static inline void queue_spin_unlock(str
smp_store_release((u8 *)lock, 0);
 }
 
+#define virt_queue_spin_lock virt_queue_spin_lock
+
+static inline bool virt_queue_spin_lock(struct qspinlock *lock)
+{
+   if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
+   return false;
+
+   while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0)
+   cpu_relax();
+
+   return true;
+}
+
 #include <asm-generic/qspinlock.h>
 
 #endif /* _ASM_X86_QSPINLOCK_H */
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -111,6 +111,13 @@ static inline void queue_spin_unlock_wai
cpu_relax();
 }
 
+#ifndef virt_queue_spin_lock
+static __always_inline bool virt_queue_spin_lock(struct qspinlock *lock)
+{
+   return false;
+}
+#endif
+
 /*
  * Initializier
  */
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -259,6 +259,9 @@ void queue_spin_lock_slowpath(struct qsp
 
 BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
 
+   if (virt_queue_spin_lock(lock))
+   return;
+
/*
  * wait for in-progress pending->locked hand-overs
 *




[PATCH 3/9] qspinlock: Add pending bit

2015-03-16 Thread Peter Zijlstra
From: Peter Zijlstra pet...@infradead.org

Because the qspinlock needs to touch a second cacheline (the per-cpu
mcs_nodes[]); add a pending bit and allow a single in-word spinner
before we punt to the second cacheline.

It is possible to observe the pending bit without the locked bit when
the last owner has just released but the pending owner has not yet
taken ownership.

In this case we would normally queue -- because the pending bit is
already taken. However, in this case the pending bit is guaranteed
to be released 'soon', therefore wait for it and avoid queueing.
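
A sketch of the resulting wait-for-pending path in the slowpath (fragment
only, names as in the diff below):

	/*
	 * wait for in-progress pending->locked hand-over:
	 *
	 * 0,1,0 -> 0,0,1
	 */
	if (val == _Q_PENDING_VAL) {
		while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL)
			cpu_relax();
	}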

Cc: Ingo Molnar mi...@redhat.com
Cc: David Vrabel david.vra...@citrix.com
Cc: Oleg Nesterov o...@redhat.com
Cc: Scott J Norton scott.nor...@hp.com
Cc: Paolo Bonzini paolo.bonz...@gmail.com
Cc: Douglas Hatch doug.ha...@hp.com
Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com
Cc: Boris Ostrovsky boris.ostrov...@oracle.com
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Linus Torvalds torva...@linux-foundation.org
Cc: Thomas Gleixner t...@linutronix.de
Cc: H. Peter Anvin h...@zytor.com
Cc: Rik van Riel r...@redhat.com
Cc: Raghavendra K T raghavendra...@linux.vnet.ibm.com
Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
Link: 
http://lkml.kernel.org/r/1421784755-21945-4-git-send-email-waiman.l...@hp.com
---
 include/asm-generic/qspinlock_types.h |   12 ++-
 kernel/locking/qspinlock.c|  119 --
 2 files changed, 107 insertions(+), 24 deletions(-)

--- a/include/asm-generic/qspinlock_types.h
+++ b/include/asm-generic/qspinlock_types.h
@@ -36,8 +36,9 @@ typedef struct qspinlock {
  * Bitfields in the atomic value:
  *
  *  0- 7: locked byte
- *  8- 9: tail index
- * 10-31: tail cpu (+1)
+ * 8: pending
+ *  9-10: tail index
+ * 11-31: tail cpu (+1)
  */
 #define _Q_SET_MASK(type)	(((1U << _Q_ ## type ## _BITS) - 1)\
				      << _Q_ ## type ## _OFFSET)
@@ -45,7 +46,11 @@ typedef struct qspinlock {
 #define _Q_LOCKED_BITS 8
 #define _Q_LOCKED_MASK _Q_SET_MASK(LOCKED)
 
-#define _Q_TAIL_IDX_OFFSET (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_PENDING_OFFSET  (_Q_LOCKED_OFFSET + _Q_LOCKED_BITS)
+#define _Q_PENDING_BITS	1
+#define _Q_PENDING_MASK	_Q_SET_MASK(PENDING)
+
+#define _Q_TAIL_IDX_OFFSET (_Q_PENDING_OFFSET + _Q_PENDING_BITS)
 #define _Q_TAIL_IDX_BITS   2
 #define _Q_TAIL_IDX_MASK   _Q_SET_MASK(TAIL_IDX)
 
@@ -54,5 +59,6 @@ typedef struct qspinlock {
 #define _Q_TAIL_CPU_MASK   _Q_SET_MASK(TAIL_CPU)
 
 #define _Q_LOCKED_VAL  (1U << _Q_LOCKED_OFFSET)
+#define _Q_PENDING_VAL (1U << _Q_PENDING_OFFSET)
 
 #endif /* __ASM_GENERIC_QSPINLOCK_TYPES_H */
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -94,24 +94,28 @@ static inline struct mcs_spinlock *decod
 return per_cpu_ptr(&mcs_nodes[idx], cpu);
 }
 
+#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
+
 /**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
  * @val: Current value of the queue spinlock 32-bit word
  *
- * (queue tail, lock value)
- *
- *              fast      :    slow                                  :    unlock
- *                        :                                          :
- * uncontended  (0,0)   --:--> (0,1) --------------------------------:--> (*,0)
- *                        :       | ^--------.                    /  :
- *                        :       v           \                   |  :
- * uncontended            :    (n,x) --+--> (n,0)                 |  :
- *   queue                :       | ^--'                          |  :
- *                        :       v                               |  :
- * contended              :    (*,x) --+--> (*,0) -----> (*,1) ---'  :
- *   queue                :         ^--'                             :
+ * (queue tail, pending bit, lock value)
  *
+ *              fast     :    slow                                  :    unlock
+ *                       :                                          :
+ * uncontended  (0,0,0) -:--> (0,0,1) ------------------------------:--> (*,*,0)
+ *                       :       | ^--------.------.             /  :
+ *                       :       v           \      \            |  :
+ * pending               :    (0,1,1) +--> (0,1,0)   \           |  :
+ *                       :       | ^--'              |           |  :
+ *                       :       v                   |           |  :
+ * uncontended           :    (n,x,y) +--> (n,0,0) --'            |  :
+ *   queue               :       | ^--'                           |  :
+ *                       :       v                                |  :
+ * contended             :    (*,x,y) +--> (*,0,0) ---> (*,0,1) -'  :
+ *   queue               :         ^--'                             :
  */
 void queue_spin_lock_slowpath(struct

[PATCH 8/9] qspinlock: Generic paravirt support

2015-03-16 Thread Peter Zijlstra
Implement simple paravirt support for the qspinlock.

Provide a separate (second) version of the spin_lock_slowpath for
paravirt along with a special unlock path.

The second slowpath is generated by adding a few pv hooks to the
normal slowpath, but where those will compile away for the native
case, they expand into special wait/wake code for the pv version.

The actual MCS queue can use extra storage in the mcs_nodes[] array to
keep track of state and therefore uses directed wakeups.

The head contender has no such storage available and reverts to the
per-cpu lock entry similar to the current kvm code. We can do a single
entry because any nesting will wake the vcpu and cause the lower loop
to retry.
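
(The mechanics are easiest to see in a toy version of the "compile the same
slowpath twice" trick used below -- file name, macro names and the printf
bodies are invented for the sketch, only the preprocessor pattern mirrors the
real _GEN_PV_LOCK_SLOWPATH one:)

/* pv_twice.c -- standalone illustration of the twice-compiled slowpath */
#ifndef _GEN_PV_VARIANT

#include <stdio.h>

static inline void __native_hook(void) { }	/* compiles away natively */
#define pv_hook		__native_hook
#define lock_slowpath	native_lock_slowpath

#endif /* _GEN_PV_VARIANT */

void lock_slowpath(void)
{
	pv_hook();			/* NOP natively, wait/kick code for PV */
	printf("shared slowpath body\n");
}

#ifndef _GEN_PV_VARIANT
#define _GEN_PV_VARIANT

static void __pv_hook(void) { printf("pv wait/kick\n"); }

#undef pv_hook
#undef lock_slowpath
#define pv_hook		__pv_hook
#define lock_slowpath	pv_lock_slowpath

#include "pv_twice.c"			/* second pass emits the PV variant */
#endif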

Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
---
 include/asm-generic/qspinlock.h |3 
 kernel/locking/qspinlock.c  |   69 +-
 kernel/locking/qspinlock_paravirt.h |  177 
 3 files changed, 248 insertions(+), 1 deletion(-)

--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -118,6 +118,9 @@ static __always_inline bool virt_queue_s
 }
 #endif
 
+extern void __pv_queue_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+extern void __pv_queue_spin_unlock(struct qspinlock *lock);
+
 /*
  * Initializier
  */
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -18,6 +18,9 @@
  * Authors: Waiman Long waiman.l...@hp.com
  *  Peter Zijlstra pet...@infradead.org
  */
+
+#ifndef _GEN_PV_LOCK_SLOWPATH
+
 #include <linux/smp.h>
 #include <linux/bug.h>
 #include <linux/cpumask.h>
@@ -65,13 +68,21 @@
 
 #include "mcs_spinlock.h"
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define MAX_NODES  8
+#else
+#define MAX_NODES  4
+#endif
+
 /*
  * Per-CPU queue node structures; we can never have more than 4 nested
  * contexts: task, softirq, hardirq, nmi.
  *
  * Exactly fits one 64-byte cacheline on a 64-bit architecture.
+ *
+ * PV doubles the storage and uses the second cacheline for PV state.
  */
-static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[4]);
+static DEFINE_PER_CPU_ALIGNED(struct mcs_spinlock, mcs_nodes[MAX_NODES]);
 
 /*
  * We must be able to distinguish between no-tail and the tail at 0:0,
@@ -230,6 +241,32 @@ static __always_inline void set_locked(s
	WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
 }
 
+
+/*
+ * Generate the native code for queue_spin_unlock_slowpath(); provide NOPs for
+ * all the PV callbacks.
+ */
+
+static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
+static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
+static __always_inline void __pv_kick_node(struct mcs_spinlock *node) { }
+
+static __always_inline void __pv_wait_head(struct qspinlock *lock) { }
+
+#define pv_enabled()   false
+
+#define pv_init_node   __pv_init_node
+#define pv_wait_node   __pv_wait_node
+#define pv_kick_node   __pv_kick_node
+
+#define pv_wait_head   __pv_wait_head
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define queue_spin_lock_slowpath   native_queue_spin_lock_slowpath
+#endif
+
+#endif /* _GEN_PV_LOCK_SLOWPATH */
+
 /**
  * queue_spin_lock_slowpath - acquire the queue spinlock
  * @lock: Pointer to queue spinlock structure
@@ -259,6 +296,9 @@ void queue_spin_lock_slowpath(struct qsp
 
	BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
 
+   if (pv_enabled())
+   goto queue;
+
if (virt_queue_spin_lock(lock))
return;
 
@@ -335,6 +375,7 @@ void queue_spin_lock_slowpath(struct qsp
node += idx;
	node->locked = 0;
	node->next = NULL;
+   pv_init_node(node);
 
/*
 * We touched a (possibly) cold cacheline in the per-cpu queue node;
@@ -360,6 +401,7 @@ void queue_spin_lock_slowpath(struct qsp
prev = decode_tail(old);
	WRITE_ONCE(prev->next, node);
 
+   pv_wait_node(node);
	arch_mcs_spin_lock_contended(&node->locked);
}
 
@@ -374,6 +416,7 @@ void queue_spin_lock_slowpath(struct qsp
 * sequentiality; this is because the set_locked() function below
 * does not imply a full barrier.
 */
+   pv_wait_head(lock);
	while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_PENDING_MASK)
cpu_relax();
 
@@ -406,6 +449,7 @@ void queue_spin_lock_slowpath(struct qsp
cpu_relax();
 
	arch_mcs_spin_unlock_contended(&next->locked);
+   pv_kick_node(next);
 
 release:
/*
@@ -414,3 +458,26 @@ void queue_spin_lock_slowpath(struct qsp
this_cpu_dec(mcs_nodes[0].count);
 }
 EXPORT_SYMBOL(queue_spin_lock_slowpath);
+
+/*
+ * Generate the paravirt code for queue_spin_unlock_slowpath().
+ */
+#if !defined(_GEN_PV_LOCK_SLOWPATH) && defined(CONFIG_PARAVIRT_SPINLOCKS)
+#define _GEN_PV_LOCK_SLOWPATH
+
+#undef pv_enabled
+#define pv_enabled()   true
+
+#undef pv_init_node
+#undef pv_wait_node
+#undef

[PATCH 1/9] qspinlock: A simple generic 4-byte queue spinlock

2015-03-16 Thread Peter Zijlstra
From: Waiman Long waiman.l...@hp.com

This patch introduces a new generic queue spinlock implementation that
can serve as an alternative to the default ticket spinlock. Compared
with the ticket spinlock, this queue spinlock should be almost as fair
as the ticket spinlock. It has about the same speed in single-thread
and it can be much faster in high contention situations especially when
the spinlock is embedded within the data structure to be protected.

Only in light to moderate contention where the average queue depth
is around 1-3 will this queue spinlock be potentially a bit slower
due to the higher slowpath overhead.

This queue spinlock is especially suited to NUMA machines with a large
number of cores as the chance of spinlock contention is much higher
in those machines. The cost of contention is also higher because of
slower inter-node memory traffic.

Due to the fact that spinlocks are acquired with preemption disabled,
the process will not be migrated to another CPU while it is trying
to get a spinlock. Ignoring interrupt handling, a CPU can only be
contending in one spinlock at any one time. Counting soft IRQ, hard
IRQ and NMI, a CPU can only have a maximum of 4 concurrent lock waiting
activities.  By allocating a set of per-cpu queue nodes and using them
to form a waiting queue, we can encode the queue node address into a
much smaller 24-bit size (including CPU number and queue node index)
leaving one byte for the lock.

Please note that the queue node is only needed when waiting for the
lock. Once the lock is acquired, the queue node can be released to
be used later.
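
(A rough sketch of the encoding described above -- the bit positions are
hard-coded here for illustration; the real code derives them from the
_Q_TAIL_* masks and a later patch in the series also squeezes in a pending
bit:)

/* 8 bits lock byte, 2 bits per-context node index, 22 bits of cpu+1 */
static inline unsigned int sketch_encode_tail(int cpu, int idx)
{
	return ((unsigned int)(cpu + 1) << 10) | ((unsigned int)idx << 8);
}

static inline void sketch_decode_tail(unsigned int tail, int *cpu, int *idx)
{
	*cpu = (int)(tail >> 10) - 1;	/* tail == 0 means "no tail" */
	*idx = (tail >> 8) & 0x3;	/* task/softirq/hardirq/nmi nesting */
}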

Cc: H. Peter Anvin h...@zytor.com
Cc: David Vrabel david.vra...@citrix.com
Cc: Oleg Nesterov o...@redhat.com
Cc: Scott J Norton scott.nor...@hp.com
Cc: Paolo Bonzini paolo.bonz...@gmail.com
Cc: Douglas Hatch doug.ha...@hp.com
Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com
Cc: Boris Ostrovsky boris.ostrov...@oracle.com
Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
Cc: Rik van Riel r...@redhat.com
Cc: Linus Torvalds torva...@linux-foundation.org
Cc: Raghavendra K T raghavendra...@linux.vnet.ibm.com
Cc: Thomas Gleixner t...@linutronix.de
Cc: Ingo Molnar mi...@redhat.com
Signed-off-by: Waiman Long waiman.l...@hp.com
Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
Link: 
http://lkml.kernel.org/r/1421784755-21945-2-git-send-email-waiman.l...@hp.com
---
 include/asm-generic/qspinlock.h   |  132 +
 include/asm-generic/qspinlock_types.h |   58 +
 kernel/Kconfig.locks  |7 +
 kernel/locking/Makefile   |1 
 kernel/locking/mcs_spinlock.h |1 
 kernel/locking/qspinlock.c|  209 ++
 6 files changed, 408 insertions(+)
 create mode 100644 include/asm-generic/qspinlock.h
 create mode 100644 include/asm-generic/qspinlock_types.h
 create mode 100644 kernel/locking/qspinlock.c

--- /dev/null
+++ b/include/asm-generic/qspinlock.h
@@ -0,0 +1,132 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013-2014 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long waiman.l...@hp.com
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_H
+#define __ASM_GENERIC_QSPINLOCK_H
+
+#include <asm-generic/qspinlock_types.h>
+
+/**
+ * queue_spin_is_locked - is the spinlock locked?
+ * @lock: Pointer to queue spinlock structure
+ * Return: 1 if it is locked, 0 otherwise
+ */
+static __always_inline int queue_spin_is_locked(struct qspinlock *lock)
+{
+	return atomic_read(&lock->val);
+}
+
+/**
+ * queue_spin_value_unlocked - is the spinlock structure unlocked?
+ * @lock: queue spinlock structure
+ * Return: 1 if it is unlocked, 0 otherwise
+ *
+ * N.B. Whenever there are tasks waiting for the lock, it is considered
+ *  locked wrt the lockref code to avoid lock stealing by the lockref
+ *  code and change things underneath the lock. This also allows some
+ *  optimizations to be applied without conflict with lockref.
+ */
+static __always_inline int queue_spin_value_unlocked(struct qspinlock lock)
+{
+	return !atomic_read(&lock.val);
+}
+
+/**
+ * queue_spin_is_contended - check if the lock is contended
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock contended, 0 otherwise
+ */
+static __always_inline int queue_spin_is_contended(struct qspinlock *lock)
+{
+	return atomic_read(&lock->val) & ~_Q_LOCKED_MASK;
+}
+/**
+ * queue_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer

[PATCH 9/9] qspinlock,x86,kvm: Implement KVM support for paravirt qspinlock

2015-03-16 Thread Peter Zijlstra
Implement the paravirt qspinlock for x86-kvm.

We use the regular paravirt call patching to switch between:

  native_queue_spin_lock_slowpath() __pv_queue_spin_lock_slowpath()
  native_queue_spin_unlock()__pv_queue_spin_unlock()

We use a callee saved call for the unlock function which reduces the
i-cache footprint and allows 'inlining' of SPIN_UNLOCK functions
again.

We further optimize the unlock path by patching the direct call with a
movb $0,%arg1 if we are indeed using the native unlock code. This
makes the unlock code almost as fast as the !PARAVIRT case.

This significantly lowers the overhead of having
CONFIG_PARAVIRT_SPINLOCKS enabled, even for native code.


Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
---
 arch/x86/Kconfig  |2 -
 arch/x86/include/asm/paravirt.h   |   28 -
 arch/x86/include/asm/paravirt_types.h |   10 +++
 arch/x86/include/asm/qspinlock.h  |   22 -
 arch/x86/kernel/kvm.c |   44 ++
 arch/x86/kernel/paravirt-spinlocks.c  |   24 +-
 arch/x86/kernel/paravirt_patch_32.c   |   22 +
 arch/x86/kernel/paravirt_patch_64.c   |   22 +
 kernel/Kconfig.locks  |2 -
 9 files changed, 163 insertions(+), 13 deletions(-)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -692,7 +692,7 @@ config PARAVIRT_DEBUG
 config PARAVIRT_SPINLOCKS
	bool "Paravirtualization layer for spinlocks"
	depends on PARAVIRT && SMP
-   select UNINLINE_SPIN_UNLOCK
+   select UNINLINE_SPIN_UNLOCK if !QUEUE_SPINLOCK
---help---
  Paravirtualized spinlocks allow a pvops backend to replace the
  spinlock implementation with something virtualization-friendly
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -712,6 +712,30 @@ static inline void __set_fixmap(unsigned
 
 #if defined(CONFIG_SMP)  defined(CONFIG_PARAVIRT_SPINLOCKS)
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+
+static __always_inline void pv_queue_spin_lock_slowpath(struct qspinlock 
*lock, u32 val)
+{
+   PVOP_VCALL2(pv_lock_ops.queue_spin_lock_slowpath, lock, val);
+}
+
+static __always_inline void pv_queue_spin_unlock(struct qspinlock *lock)
+{
+   PVOP_VCALLEE1(pv_lock_ops.queue_spin_unlock, lock);
+}
+
+static __always_inline void pv_wait(u8 *ptr, u8 val)
+{
+   PVOP_VCALL2(pv_lock_ops.wait, ptr, val);
+}
+
+static __always_inline void pv_kick(int cpu)
+{
+   PVOP_VCALL1(pv_lock_ops.kick, cpu);
+}
+
+#else /* !CONFIG_QUEUE_SPINLOCK */
+
 static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock,
__ticket_t ticket)
 {
@@ -724,7 +748,9 @@ static __always_inline void __ticket_unl
PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket);
 }
 
-#endif
+#endif /* CONFIG_QUEUE_SPINLOCK */
+
+#endif /* SMP  PARAVIRT_SPINLOCKS */
 
 #ifdef CONFIG_X86_32
 #define PV_SAVE_REGS pushl %ecx; pushl %edx;
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -333,9 +333,19 @@ struct arch_spinlock;
 typedef u16 __ticket_t;
 #endif
 
+struct qspinlock;
+
 struct pv_lock_ops {
+#ifdef CONFIG_QUEUE_SPINLOCK
+   void (*queue_spin_lock_slowpath)(struct qspinlock *lock, u32 val);
+   struct paravirt_callee_save queue_spin_unlock;
+
+   void (*wait)(u8 *ptr, u8 val);
+   void (*kick)(int cpu);
+#else /* !CONFIG_QUEUE_SPINLOCK */
struct paravirt_callee_save lock_spinning;
void (*unlock_kick)(struct arch_spinlock *lock, __ticket_t ticket);
+#endif /* !CONFIG_QUEUE_SPINLOCK */
 };
 
 /* This contains all the paravirt structures: we get a convenient
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -3,6 +3,7 @@
 
 #include <asm/cpufeature.h>
 #include <asm-generic/qspinlock_types.h>
+#include <asm/paravirt.h>
 
 #define	queue_spin_unlock queue_spin_unlock
 /**
@@ -11,11 +12,30 @@
  *
  * An smp_store_release() on the least-significant byte.
  */
-static inline void queue_spin_unlock(struct qspinlock *lock)
+static inline void native_queue_spin_unlock(struct qspinlock *lock)
 {
smp_store_release((u8 *)lock, 0);
 }
 
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+extern void native_queue_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+
+static inline void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
+{
+   pv_queue_spin_lock_slowpath(lock, val);
+}
+
+static inline void queue_spin_unlock(struct qspinlock *lock)
+{
+   pv_queue_spin_unlock(lock);
+}
+#else
+static inline void queue_spin_unlock(struct qspinlock *lock)
+{
+   native_queue_spin_unlock(lock);
+}
+#endif
+
 #define virt_queue_spin_lock virt_queue_spin_lock
 
 static inline bool virt_queue_spin_lock(struct qspinlock *lock)
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -584,6 +584,41 @@ static void kvm_kick_cpu(int cpu

[PATCH 0/9] qspinlock stuff -v15

2015-03-16 Thread Peter Zijlstra
Hi Waiman,

As promised; here is the paravirt stuff I did during the trip to BOS last week.

All the !paravirt patches are more or less the same as before (the only real
change is the copyright lines in the first patch).

The paravirt stuff is 'simple' and KVM only -- the Xen code was a little more
convoluted and I've no real way to test that but it should be straight fwd to
make work.

I ran this using the virtme tool (thanks Andy) on my laptop with a 4x
overcommit on vcpus (16 vcpus as compared to the 4 my laptop actually has) and
it both booted and survived a hackbench run (perf bench sched messaging -g 20
-l 5000).

So while the paravirt code isn't the most optimal code ever conceived it does 
work.

Also, the paravirt patching includes replacing the call with movb $0, %arg1
for the native case, which should greatly reduce the cost of having
CONFIG_PARAVIRT_SPINLOCKS enabled on actual hardware.

I feel that if someone were to do a Xen patch we can go ahead and merge this
stuff (finally!).

These patches do not implement the paravirt spinlock debug stats currently
implemented (separately) by KVM and Xen, but that should not be too hard to do
on top and in the 'generic' code -- no reason to duplicate all that.

Of course; once this lands people can look at improving the paravirt nonsense.



Re: virtio balloon: do not call blocking ops when !TASK_RUNNING

2015-02-26 Thread Peter Zijlstra
On Thu, Feb 26, 2015 at 09:30:31AM +0100, Michael S. Tsirkin wrote:
 On Thu, Feb 26, 2015 at 11:50:42AM +1030, Rusty Russell wrote:
  Thomas Huth th...@linux.vnet.ibm.com writes:
Hi all,
  
   with the recent kernel 3.19, I get a kernel warning when I start my
   KVM guest on s390 with virtio balloon enabled:
  
  The deeper problem is that virtio_ccw_get_config just silently fails on
  OOM.
  
  Neither get_config nor set_config are expected to fail.
  
  Cornelia, I think ccw and config_area should be allocated inside vcdev.
  You could either use pointers, or simply allocate vcdev with GDP_DMA.
  
  This would avoid the kmalloc inside these calls.
  
  Thanks,
  Rusty.
 
 But it won't solve the problem of nested sleepers
 with ccw: ATM is invokes ccw_io_helper to execute
 commands, and that one calls wait_event
 to wait for an interrupt.
 
 Might be fixable but I think my patch looks like a safer
 solution for 4.0/3.19, no?

I've no idea what your patch was since I'm not subscribed to any of the
lists this discussion is had on.

But you can annotate the warning away; _however_ the annotation needs to
come with a big comment explaining why it's safe to do so. Typically that
involves talking about how it's actually rare for the call to sleep.

So occasional sleeps inside a wait_event() are ok-ish, we'll just get to
go around once more. But once you consistently sleep inside a
wait_event() things go a bit funny.

So for instance; if in ccw_io_helper() we expect the wait_event(,
!doing_io()) condition to be (mostly) true on first go, then we'll never get
into __wait_event() and ->state won't actually be mucked about with.

The thing to avoid is not actually sleeping (much) but setting
TASK_RUNNING and turning the entire thing into a giant poll loop.
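
(A cut-down model of the wait_event() shape I mean -- the real macro in
include/linux/wait.h does considerably more; this is only to show where
->state gets touched:)

#define wait_event_sketch(wq, condition)				\
do {									\
	DEFINE_WAIT(__wait);						\
	if (condition)		/* mostly-true case: ->state untouched */\
		break;							\
	for (;;) {							\
		prepare_to_wait(&wq, &__wait, TASK_UNINTERRUPTIBLE);	\
		/* from here until schedule() we are !TASK_RUNNING; a	\
		 * blocking call inside 'condition' now trips the	\
		 * nested-sleep warning */				\
		if (condition)						\
			break;						\
		schedule();						\
	}								\
	finish_wait(&wq, &__wait);					\
} while (0)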


Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu-wq

2015-02-18 Thread Peter Zijlstra
On Tue, Feb 17, 2015 at 06:44:19PM +0100, Sebastian Andrzej Siewior wrote:
 * Peter Zijlstra | 2015-01-21 16:07:16 [+0100]:
 
 On Tue, Jan 20, 2015 at 01:16:13PM -0500, Steven Rostedt wrote:
  I'm actually wondering if we should just nuke the _interruptible()
  version of swait. As it should only be all interruptible or all not
  interruptible, that the swait_wake() should just do the wake up
  regardless. In which case, swait_wake() is good enough. No need to have
  different versions where people may think do something special.
  
  Peter?
 
 Yeah, I think the latest thing I have sitting here on my disk only has
 the swake_up() which does TASK_NORMAL, no choice there.
 
 what is the swait status in terms of mainline? This sounds like it
 beeing worked on.
 I could take the series but then I would drop it again if the mainline
 implementation changes…

Well, 'worked' on might be putting too much on it, its one of the many
many 'spare' time things that never get attention unless people bug me
;-)

The below is my current patch, I've not actually tried it yet, I (think
I) had one patch doing some conversions but I'm having trouble locating
it.

Mostly-Signed-off-by: Peter Zijlstra (Intel) pet...@infradead.org
---
 include/linux/swait.h |  172 ++
 kernel/sched/swait.c  |  122 +++
 2 files changed, 294 insertions(+)

--- /dev/null
+++ b/include/linux/swait.h
@@ -0,0 +1,172 @@
+#ifndef _LINUX_SWAIT_H
+#define _LINUX_SWAIT_H
+
+#include linux/list.h
+#include linux/stddef.h
+#include linux/spinlock.h
+#include asm/current.h
+
+/*
+ * Simple wait queues
+ *
+ * While these are very similar to the other/complex wait queues (wait.h) the
+ * most important difference is that the simple waitqueue allows for
+ * deterministic behaviour -- IOW it has strictly bounded IRQ and lock hold
+ * times.
+ *
+ * In order to make this so, we had to drop a fair number of features of the
+ * other waitqueue code; notably:
+ *
+ *  - mixing INTERRUPTIBLE and UNINTERRUPTIBLE sleeps on the same waitqueue;
+ *all wakeups are TASK_NORMAL in order to avoid O(n) lookups for the right
+ *sleeper state.
+ *
+ *  - the exclusive mode; because this requires preserving the list order
+ *and this is hard.
+ *
+ *  - custom wake functions; because you cannot give any guarantees about
+ *random code.
+ *
+ * As a side effect of this; the data structures are slimmer.
+ *
+ * One would recommend using this wait queue where possible.
+ */
+
+struct task_struct;
+
+struct swait_queue_head {
+   raw_spinlock_t  lock;
+   struct list_headtask_list;
+};
+
+struct swait_queue {
+   struct task_struct  *task;
+   struct list_headtask_list;
+};
+
+#define __SWAITQUEUE_INITIALIZER(name) {   \
+   .task   = current,  \
+   .task_list  = LIST_HEAD_INIT((name).task_list), \
+}
+
+#define DECLARE_SWAITQUEUE(name)   \
+   struct swait_queue name = __SWAITQUEUE_INITIALIZER(name)
+
+#define __SWAIT_QUEUE_HEAD_INITIALIZER(name) { \
+   .lock   = __RAW_SPIN_LOCK_UNLOCKED(name.lock),  \
+   .task_list  = LIST_HEAD_INIT((name).task_list), \
+}
+
+#define DECLARE_SWAIT_QUEUE_HEAD(name) \
+   struct swait_queue_head name = __SWAIT_QUEUE_HEAD_INITIALIZER(name)
+
+extern void __init_swait_queue_head(struct swait_queue_head *q, const char 
*name,
+   struct lock_class_key *key);
+
+#define init_swait_queue_head(q)   \
+   do {\
+   static struct lock_class_key __key; \
+		__init_swait_queue_head((q), #q, &__key);	\
+   } while (0)
+
+#ifdef CONFIG_LOCKDEP
+# define __SWAIT_QUEUE_HEAD_INIT_ONSTACK(name) \
+	({ init_swait_queue_head(&name); name; })
+# define DECLARE_SWAIT_QUEUE_HEAD_ONSTACK(name)\
+   struct swait_queue_head name = __SWAIT_QUEUE_HEAD_INIT_ONSTACK(name)
+#else
+# define DECLARE_SWAIT_QUEUE_HEAD_ONSTACK(name)\
+   DECLARE_SWAIT_QUEUE_HEAD(name)
+#endif
+
+static inline int swait_active(struct swait_queue_head *q)
+{
+	return !list_empty(&q->task_list);
+}
+
+extern void swake_up(struct swait_queue_head *q);
+extern void swake_up_all(struct swait_queue_head *q);
+extern void swake_up_locked(struct swait_queue_head *q);
+
+extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue 
*wait);
+extern void prepare_to_swait(struct swait_queue_head *q, struct swait_queue 
*wait, int state);
+extern long prepare_to_swait_event(struct swait_queue_head *q, struct 
swait_queue *wait, int state);
+
+extern void __finish_swait(struct

Re: [PATCH V3] x86 spinlock: Fix memory corruption on completing completions

2015-02-12 Thread Peter Zijlstra
On Thu, Feb 12, 2015 at 05:17:27PM +0530, Raghavendra K T wrote:
 Paravirt spinlock clears slowpath flag after doing unlock.
 As explained by Linus currently it does:
 prev = *lock;
 add_smp(&lock->tickets.head, TICKET_LOCK_INC);
 
 /* add_smp() is a full mb() */
 
 if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
 __ticket_unlock_slowpath(lock, prev);
 
 
 which is *exactly* the kind of things you cannot do with spinlocks,
 because after you've done the add_smp() and released the spinlock
 for the fast-path, you can't access the spinlock any more.  Exactly
 because a fast-path lock might come in, and release the whole data
 structure.
 
 Linus suggested that we should not do any writes to lock after unlock(),
 and we can move slowpath clearing to fastpath lock.
 
 So this patch implements the fix with:
 1. Moving slowpath flag to head (Oleg).
 2. use xadd to avoid read/write after unlock that checks the need for
 unlock_kick (Linus).

Maybe spend a few more words explaining these things; something like:

Unlocked locks don't care about the slowpath flag; therefore we can keep
it set after the last unlock, as long as we clear it again on the first
(try)lock -- this removes the write after unlock.

By moving the slowpath flag from the tail to the head ticket we avoid
the need to access both the head and tail tickets on unlock.

We can further avoid the need for a read-after-release by using xadd;
the prev head value will include the slowpath flag and indicate if we
need to do PV kicking of suspended spinners -- on modern chips xadd
isn't (much) more expensive than an add + load.

Its 'obvious' now, but maybe not so much after we've all not looked at
this for a few months.
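
(In code, the suggestion boils down to roughly the following -- a sketch
only, with __ticket_unlock_kick_slowpath() standing in for whatever the kick
helper ends up being called:)

static __always_inline void sketch_ticket_unlock(arch_spinlock_t *lock)
{
	__ticket_t prev_head;

	/*
	 * xadd releases the lock and hands back the previous head in one
	 * go, so we never read the (possibly freed) lock word after the
	 * release; the slowpath flag now lives in the head ticket.
	 */
	prev_head = xadd(&lock->tickets.head, TICKET_LOCK_INC);
	if (unlikely(prev_head & TICKET_SLOWPATH_FLAG))
		__ticket_unlock_kick_slowpath(lock, prev_head);	/* hypothetical */
}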


Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-09 Thread Peter Zijlstra
On Mon, Feb 09, 2015 at 03:04:22PM +0530, Raghavendra K T wrote:
 So we have 3 choices,
 1. xadd
 2. continue with current approach.
 3. a read before unlock and also after that.

For the truly paranoid we have probe_kernel_address(), suppose the lock
was in module space and the module just got unloaded under us.




Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu-wq

2015-01-21 Thread Peter Zijlstra
On Tue, Jan 20, 2015 at 01:16:13PM -0500, Steven Rostedt wrote:
 I'm actually wondering if we should just nuke the _interruptible()
 version of swait. As it should only be all interruptible or all not
 interruptible, that the swait_wake() should just do the wake up
 regardless. In which case, swait_wake() is good enough. No need to have
 different versions where people may think do something special.
 
 Peter?

Yeah, I think the latest thing I have sitting here on my disk only has
the swake_up() which does TASK_NORMAL, no choice there.


Re: [patch -rt 1/2] KVM: use simple waitqueue for vcpu-wq

2015-01-16 Thread Peter Zijlstra
On Fri, Jan 16, 2015 at 11:48:46AM -0500, Steven Rostedt wrote:
 I notice everywhere you have a swait_wake_interruptible() but here. Is
 there a reason why?
 
 IIRC, Peter wants to make swait wakeup usage homogenous. That is, you
 either sleep in an interruptible state, or you don't. You can't mix and
 match it.
 
 Peter is that what you plan?

Yes, this is required for the single wakeup case to be bounded.

If you have both INTERRUPTIBLE and UNINTERRUPTIBLE waiters on a list,
and you were allowed to do INTERRUPTIBLE/UNINTERRUPTIBLE wakeups, the
wakeup would have to do O(n) iteration to find a matching waiter.

Seeing that the entire purpose of simple wait queues was to retain RT
properties, that's entirely retarded ;-)
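
(Put differently, the single wakeup has to be allowed to blindly take the
list head; a rough sketch of what swake_up boils down to, not the final
code:)

static void sketch_swake_up_locked(struct swait_queue_head *q)
{
	struct swait_queue *curr;

	if (list_empty(&q->task_list))
		return;

	/* O(1) only because every sleeper here accepts TASK_NORMAL */
	curr = list_first_entry(&q->task_list, typeof(*curr), task_list);
	wake_up_process(curr->task);
	list_del_init(&curr->task_list);
}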


Re: [PATCH v13 09/11] pvqspinlock, x86: Add para-virtualization support

2014-11-03 Thread Peter Zijlstra
On Wed, Oct 29, 2014 at 04:19:09PM -0400, Waiman Long wrote:
  arch/x86/include/asm/pvqspinlock.h|  411 
 +

I do wonder why all this needs to live in x86..

  
 +#ifdef CONFIG_QUEUE_SPINLOCK
 +
 +static __always_inline void pv_kick_cpu(int cpu)
 +{
 + PVOP_VCALLEE1(pv_lock_ops.kick_cpu, cpu);
 +}
 +
 +static __always_inline void pv_lockwait(u8 *lockbyte)
 +{
 + PVOP_VCALLEE1(pv_lock_ops.lockwait, lockbyte);
 +}
 +
 +static __always_inline void pv_lockstat(enum pv_lock_stats type)
 +{
 + PVOP_VCALLEE1(pv_lock_ops.lockstat, type);
 +}

Why are any of these PV ops? they're only called from other pv_*()
functions. What's the point of pv ops you only call from pv code?

 +/*
 + *   Queue Spinlock Para-Virtualization (PV) Support
 + *
 + * The PV support code for queue spinlock is roughly the same as that
 + * of the ticket spinlock.

Relative comments are bad, esp. since we'll make the ticket code go away
if this works, at which point this is a reference into a black hole.

 Each CPU waiting for the lock will spin until it
 + * reaches a threshold. When that happens, it will put itself to a halt state
 + * so that the hypervisor can reuse the CPU cycles in some other guests as
 + * well as returning other hold-up CPUs faster.




 +/**
 + * queue_spin_lock - acquire a queue spinlock
 + * @lock: Pointer to queue spinlock structure
 + *
 + * N.B. INLINE_SPIN_LOCK should not be enabled when PARAVIRT_SPINLOCK is on.

One should write a compile time fail for that, not a comment.

 + */
 +static __always_inline void queue_spin_lock(struct qspinlock *lock)
 +{
 + u32 val;
 +
 + val = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL);
 + if (likely(val == 0))
 + return;
 + if (static_key_false(paravirt_spinlocks_enabled))
 + pv_queue_spin_lock_slowpath(lock, val);
 + else
 + queue_spin_lock_slowpath(lock, val);
 +}

No, this is just vile.. _that_ is what we have PV ops for. And at that
point its the same function it was before the PV stuff, so that whole
inline thing is then gone.

 +extern void queue_spin_unlock_slowpath(struct qspinlock *lock);
 +
  /**
   * queue_spin_unlock - release a queue spinlock
   * @lock : Pointer to queue spinlock structure
   *
   * An effective smp_store_release() on the least-significant byte.
 + *
 + * Inlining of the unlock function is disabled when CONFIG_PARAVIRT_SPINLOCKS
 + * is defined. So _raw_spin_unlock() will be the only call site that will
 + * have to be patched.

again if you hard rely on the not inlining make a build fail not a
comment.

   */
  static inline void queue_spin_unlock(struct qspinlock *lock)
  {
   barrier();
 + if (!static_key_false(paravirt_spinlocks_enabled)) {
 + native_spin_unlock(lock);
 + return;
 + }
  
 + /*
 +  * Need to atomically clear the lock byte to avoid racing with
 +  * queue head waiter trying to set _QLOCK_LOCKED_SLOWPATH.
 +  */
 + if (unlikely(cmpxchg((u8 *)lock, _Q_LOCKED_VAL, 0) != _Q_LOCKED_VAL))
 + queue_spin_unlock_slowpath(lock);
 +}

Idem, that static key stuff is wrong, use PV ops to switch between
unlock paths.

 @@ -354,7 +394,7 @@ queue:
* if there was a previous node; link it and wait until reaching the
* head of the waitqueue.
*/
 - if (old & _Q_TAIL_MASK) {
 + if (!pv_link_and_wait_node(old, node) && (old & _Q_TAIL_MASK)) {
   prev = decode_tail(old);
   ACCESS_ONCE(prev->next) = node;
 @@ -369,9 +409,11 @@ queue:
*
* *,x,y - *,0,0
*/
 - while ((val = smp_load_acquire(&lock->val.counter)) &
 - _Q_LOCKED_PENDING_MASK)
 + val = pv_wait_head(lock, node);
 + while (val & _Q_LOCKED_PENDING_MASK) {
   cpu_relax();
 + val = smp_load_acquire(&lock->val.counter);
 + }
  
   /*
* claim the lock:

Please make the pv_*() calls return void and reduce to NOPs. This keeps
the logic invariant of the pv stuff.


Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support

2014-10-27 Thread Peter Zijlstra
On Mon, Oct 27, 2014 at 01:15:53PM -0400, Waiman Long wrote:
 On 10/24/2014 06:04 PM, Peter Zijlstra wrote:
 On Fri, Oct 24, 2014 at 04:53:27PM -0400, Waiman Long wrote:
 The additional register pressure may just cause a few more register moves
 which should be negligible in the overall performance . The additional
 icache pressure, however, may have some impact on performance. I was trying
 to balance the performance of the pv and non-pv versions so that we won't
 penalize the pv code too much for a bit more performance in the non-pv code.
 Doing it your way will add a lot of function call and register
 saving/restoring to the pv code.
 If people care about performance they should not be using virt crap :-)
 
 I only really care about bare metal.
 
 Yes, I am aware of that. However, the whole point of doing PV spinlock is to
 improve performance in a virtual guest.

Anything that avoids the lock holder preemption nonsense is a _massive_
win for them, a few function calls should not even register on that
scale.

 +#ifdef _GEN_PV_LOCK_SLOWPATH
 +static void pv_queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 +#else
  void queue_spin_lock_slowpath(struct qspinlock *lock, u32 val)
 +#endif

If you have two functions you might as well use the PV stuff to patch in
the right function call at the usage sites and avoid:

 +   if (pv_enabled()) {
 +   pv_queue_spin_lock_slowpath(lock, val);
 +   return;
 +   }

this alltogether.

 this_cpu_dec(mcs_nodes[0].count);
  }
  EXPORT_SYMBOL(queue_spin_lock_slowpath);
 +
 +#if !defined(_GEN_PV_LOCK_SLOWPATH)  defined(CONFIG_PARAVIRT_SPINLOCKS)
 +/*
 + * Generate the PV version of the queue_spin_lock_slowpath function
 + */
 +#undef pv_init_node
 +#undef pv_wait_check
 +#undef pv_link_and_wait_node
 +#undef pv_wait_head
 +#undef EXPORT_SYMBOL
 +#undef in_pv_code
 +
 +#define _GEN_PV_LOCK_SLOWPATH
 +#define EXPORT_SYMBOL(x)
 +#define in_pv_code return_true
 +#define pv_enabled return_false
 +
 +#include qspinlock.c
 +
 +#endif

That's properly disgusting :-) But a lot better than actually
duplicating everything I suppose.


Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support

2014-10-27 Thread Peter Zijlstra
On Mon, Oct 27, 2014 at 01:38:20PM -0400, Waiman Long wrote:
 On 10/24/2014 04:54 AM, Peter Zijlstra wrote:
 On Thu, Oct 16, 2014 at 02:10:38PM -0400, Waiman Long wrote:
 
 Since enabling paravirt spinlock will disable unlock function inlining,
 a jump label can be added to the unlock function without adding patch
 sites all over the kernel.
 But you don't have to. My patches allowed for the inline to remain,
 again reducing the overhead of enabling PV spinlocks while running on a
 real machine.
 
 Look at:
 
http://lkml.kernel.org/r/20140615130154.213923...@chello.nl
 
 In particular this hunk:
 
 Index: linux-2.6/arch/x86/kernel/paravirt_patch_64.c
 ===
 --- linux-2.6.orig/arch/x86/kernel/paravirt_patch_64.c
 +++ linux-2.6/arch/x86/kernel/paravirt_patch_64.c
 @@ -22,6 +22,10 @@ DEF_NATIVE(pv_cpu_ops, swapgs, swapgs)
   DEF_NATIVE(, mov32, mov %edi, %eax);
   DEF_NATIVE(, mov64, mov %rdi, %rax);
 
 +#if defined(CONFIG_PARAVIRT_SPINLOCKS)  defined(CONFIG_QUEUE_SPINLOCK)
 +DEF_NATIVE(pv_lock_ops, queue_unlock, movb $0, (%rdi));
 +#endif
 +
   unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
   {
  return paravirt_patch_insns(insnbuf, len,
 @@ -61,6 +65,9 @@ unsigned native_patch(u8 type, u16 clobb
  PATCH_SITE(pv_cpu_ops, clts);
  PATCH_SITE(pv_mmu_ops, flush_tlb_single);
  PATCH_SITE(pv_cpu_ops, wbinvd);
 +#if defined(CONFIG_PARAVIRT_SPINLOCKS)  defined(CONFIG_QUEUE_SPINLOCK)
 +   PATCH_SITE(pv_lock_ops, queue_unlock);
 +#endif
 
  patch_site:
  ret = paravirt_patch_insns(ibuf, len, start, end);
 
 
 That makes sure to overwrite the callee-saved call to the
 pv_lock_ops::queue_unlock with the immediate asm movb $0, (%rdi).
 
 
 Therefore you can retain the inlined unlock with hardly (there might be
 some NOP padding) any overhead at all. On PV it reverts to a callee
 saved function call.
 
 My concern is that spin_unlock() can be called in many places, including
 loadable kernel modules. Can the paravirt_patch_ident_32() function able to
 patch all of them in reasonable time? How about a kernel module loaded later
 at run time?

modules should be fine, see arch/x86/kernel/module.c:module_finalize()
- apply_paravirt().

Also note that the 'default' text is an indirect call into the paravirt
ops table which routes to the 'right' function, so even if the text
patching would be 'late' calls would 'work' as expected, just slower.


Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support

2014-10-24 Thread Peter Zijlstra
On Thu, Oct 16, 2014 at 02:10:38PM -0400, Waiman Long wrote:
 +static inline void pv_init_node(struct mcs_spinlock *node)
 +{
 + struct pv_qnode *pn = (struct pv_qnode *)node;
 +
 + BUILD_BUG_ON(sizeof(struct pv_qnode) > 5*sizeof(struct mcs_spinlock));
 +
 + if (!pv_enabled())
 + return;
 +
 + pn->cpustate = PV_CPU_ACTIVE;
 + pn->mayhalt  = false;
 + pn->mycpu    = smp_processor_id();
 + pn->head     = PV_INVALID_HEAD;
 +}


 @@ -333,6 +393,7 @@ queue:
   node += idx;
   node->locked = 0;
   node->next = NULL;
 + pv_init_node(node);
  
   /*
* We touched a (possibly) cold cacheline in the per-cpu queue node;


So even if !pv_enabled() the compiler will still have to emit the code
for that inline, which will generate additional register pressure,
icache pressure and lovely stuff like that.

The patch I had used pv-ops for these things that would turn into NOPs
in the regular case and callee-saved function calls for the PV case.

That still does not entirely eliminate cost, but does reduce it
significant. Please consider using that.


Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support

2014-10-24 Thread Peter Zijlstra
On Thu, Oct 16, 2014 at 02:10:38PM -0400, Waiman Long wrote:

 Since enabling paravirt spinlock will disable unlock function inlining,
 a jump label can be added to the unlock function without adding patch
 sites all over the kernel.

But you don't have to. My patches allowed for the inline to remain,
again reducing the overhead of enabling PV spinlocks while running on a
real machine.

Look at: 

  http://lkml.kernel.org/r/20140615130154.213923...@chello.nl

In particular this hunk:

Index: linux-2.6/arch/x86/kernel/paravirt_patch_64.c
===
--- linux-2.6.orig/arch/x86/kernel/paravirt_patch_64.c
+++ linux-2.6/arch/x86/kernel/paravirt_patch_64.c
@@ -22,6 +22,10 @@ DEF_NATIVE(pv_cpu_ops, swapgs, swapgs)
 DEF_NATIVE(, mov32, mov %edi, %eax);
 DEF_NATIVE(, mov64, mov %rdi, %rax);

+#if defined(CONFIG_PARAVIRT_SPINLOCKS) && defined(CONFIG_QUEUE_SPINLOCK)
+DEF_NATIVE(pv_lock_ops, queue_unlock, "movb $0, (%rdi)");
+#endif
+ 
 unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
 {
return paravirt_patch_insns(insnbuf, len,
@@ -61,6 +65,9 @@ unsigned native_patch(u8 type, u16 clobb
PATCH_SITE(pv_cpu_ops, clts);
PATCH_SITE(pv_mmu_ops, flush_tlb_single);
PATCH_SITE(pv_cpu_ops, wbinvd);
+#if defined(CONFIG_PARAVIRT_SPINLOCKS) && defined(CONFIG_QUEUE_SPINLOCK)
+   PATCH_SITE(pv_lock_ops, queue_unlock);
+#endif

patch_site:
ret = paravirt_patch_insns(ibuf, len, start, end);


That makes sure to overwrite the callee-saved call to the
pv_lock_ops::queue_unlock with the immediate asm movb $0, (%rdi).


Therefore you can retain the inlined unlock with hardly (there might be
some NOP padding) any overhead at all. On PV it reverts to a callee
saved function call.


Re: [PATCH v12 00/11] qspinlock: a 4-byte queue spinlock with PV support

2014-10-24 Thread Peter Zijlstra
On Thu, Oct 16, 2014 at 02:10:29PM -0400, Waiman Long wrote:
 v11-v12:
  - Based on PeterZ's version of the qspinlock patch
(https://lkml.org/lkml/2014/6/15/63).
  - Incorporated many of the review comments from Konrad Wilk and
Paolo Bonzini.
  - The pvqspinlock code is largely from my previous version with
PeterZ's way of going from queue tail to head and his idea of
using callee saved calls to KVM and XEN codes.

Thanks for taking the time to refresh this.. I would prefer you use a
little more of the PV techniques I outlined in my latest PV patch to
further reduce the overhead of PV enabled kernels on real hardware.

This is an important use case, because distro kernels will have to
enable PV support while their majority of installations will be on
physical hardware.

Other than that I see no reason not to move this forward.


Re: [PATCH v12 09/11] pvqspinlock, x86: Add para-virtualization support

2014-10-24 Thread Peter Zijlstra
On Fri, Oct 24, 2014 at 04:53:27PM -0400, Waiman Long wrote:
 The additional register pressure may just cause a few more register moves
 which should be negligible in the overall performance . The additional
 icache pressure, however, may have some impact on performance. I was trying
 to balance the performance of the pv and non-pv versions so that we won't
 penalize the pv code too much for a bit more performance in the non-pv code.
 Doing it your way will add a lot of function call and register
 saving/restoring to the pv code.

If people care about performance they should not be using virt crap :-)

I only really care about bare metal.


Re: [PATCH 1/4] mm: gup: add FOLL_TRIED

2014-10-09 Thread Peter Zijlstra
On Wed, Oct 01, 2014 at 10:56:34AM +0200, Andrea Arcangeli wrote:
 From: Andres Lagar-Cavilla andre...@google.com

This needs a changelog

 Reviewed-by: Radim Krčmář rkrc...@redhat.com
 Signed-off-by: Andres Lagar-Cavilla andre...@google.com
 Signed-off-by: Andrea Arcangeli aarca...@redhat.com
 ---
  include/linux/mm.h | 1 +
  mm/gup.c   | 4 
  2 files changed, 5 insertions(+)
 
 diff --git a/include/linux/mm.h b/include/linux/mm.h
 index 8981cc8..0f4196a 100644
 --- a/include/linux/mm.h
 +++ b/include/linux/mm.h
 @@ -1985,6 +1985,7 @@ static inline struct page *follow_page(struct 
 vm_area_struct *vma,
 #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
 #define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
  #define FOLL_MIGRATION   0x400   /* wait for page to replace migration 
 entry */
 +#define FOLL_TRIED   0x800   /* a retry, previous pass started an IO */
  
  typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
   void *data);
 diff --git a/mm/gup.c b/mm/gup.c
 index 91d044b..af7ea3e 100644
 --- a/mm/gup.c
 +++ b/mm/gup.c
 @@ -281,6 +281,10 @@ static int faultin_page(struct task_struct *tsk, struct 
 vm_area_struct *vma,
   fault_flags |= FAULT_FLAG_ALLOW_RETRY;
   if (*flags & FOLL_NOWAIT)
   fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT;
 + if (*flags & FOLL_TRIED) {
 + VM_WARN_ON_ONCE(fault_flags & FAULT_FLAG_ALLOW_RETRY);
 + fault_flags |= FAULT_FLAG_TRIED;
 + }
  
   ret = handle_mm_fault(mm, vma, address, fault_flags);
   if (ret & VM_FAULT_ERROR) {


Re: [PATCH 2/4] mm: gup: add get_user_pages_locked and get_user_pages_unlocked

2014-10-09 Thread Peter Zijlstra
On Wed, Oct 01, 2014 at 10:56:35AM +0200, Andrea Arcangeli wrote:
 +static inline long __get_user_pages_locked(struct task_struct *tsk,
 +struct mm_struct *mm,
 +unsigned long start,
 +unsigned long nr_pages,
 +int write, int force,
 +struct page **pages,
 +struct vm_area_struct **vmas,
 +int *locked,
 +bool notify_drop)

You might want to consider __always_inline to make sure it does indeed
get inlined and constant propagation works for @locked and @notify_drop.
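
(I.e. something like the following shape -- function and parameter names are
made up -- where the compiler can discard the notify_drop branch in the
callers that pass false, but only if the inline is guaranteed:)

static __always_inline long __gup_common(int *locked, bool notify_drop)
{
	long pages_done = 0;

	/* ... shared retry loop ... */

	if (notify_drop && locked && *locked) {
		/* emitted only in callers passing notify_drop == true */
		*locked = 0;
	}
	return pages_done;
}

static long gup_locked_variant(int *locked)   { return __gup_common(locked, true);  }
static long gup_unlocked_variant(int *locked) { return __gup_common(locked, false); }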


Re: [PATCH 2/4] mm: gup: add get_user_pages_locked and get_user_pages_unlocked

2014-10-09 Thread Peter Zijlstra
On Wed, Oct 01, 2014 at 10:56:35AM +0200, Andrea Arcangeli wrote:

 +static inline long __get_user_pages_locked(struct task_struct *tsk,
 +struct mm_struct *mm,
 +unsigned long start,
 +unsigned long nr_pages,
 +int write, int force,
 +struct page **pages,
 +struct vm_area_struct **vmas,
 +int *locked,
 +bool notify_drop)
 +{

 + if (notify_drop && lock_dropped && *locked) {
 + /*
 +  * We must let the caller know we temporarily dropped the lock
 +  * and so the critical section protected by it was lost.
 +  */
 + up_read(&mm->mmap_sem);
 + *locked = 0;
 + }
 + return pages_done;
 +}

 +long get_user_pages_locked(struct task_struct *tsk, struct mm_struct *mm,
 +unsigned long start, unsigned long nr_pages,
 +int write, int force, struct page **pages,
 +int *locked)
 +{
 + return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
 +pages, NULL, locked, true);
 +}

 +long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
 +  unsigned long start, unsigned long nr_pages,
 +  int write, int force, struct page **pages)
 +{
 + long ret;
 + int locked = 1;
 + down_read(mm-mmap_sem);
 + ret = __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
 +   pages, NULL, locked, false);
 + if (locked)
 + up_read(mm-mmap_sem);
 + return ret;
 +}

  long get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
   unsigned long start, unsigned long nr_pages, int write,
   int force, struct page **pages, struct vm_area_struct **vmas)
  {
 + return __get_user_pages_locked(tsk, mm, start, nr_pages, write, force,
 +pages, vmas, NULL, false);
  }

I'm wondering about that notify_drop parameter, what's the added
benefit? If you look at these 3 callers we can do away with it, since in
the second called where we have locked but !notify_drop we seem to do
the exact same thing afterwards anyway.


Re: [PATCH 3/4] mm: gup: use get_user_pages_fast and get_user_pages_unlocked

2014-10-09 Thread Peter Zijlstra
On Wed, Oct 01, 2014 at 10:56:36AM +0200, Andrea Arcangeli wrote:
 Just an optimization.

Does it make sense to split the thing in two? One where you apply
_unlocked and then one where you apply _fast?


Re: RFC: get_user_pages_locked|unlocked to leverage VM_FAULT_RETRY

2014-10-02 Thread Peter Zijlstra
On Thu, Oct 02, 2014 at 02:31:17PM +0200, Andrea Arcangeli wrote:
 On Wed, Oct 01, 2014 at 05:36:11PM +0200, Peter Zijlstra wrote:
  For all these and the other _fast() users, is there an actual limit to
  the nr_pages passed in? Because we used to have the 64 pages limit from
  DIO, but without that we get rather long IRQ-off latencies.
 
 Ok, I would tend to think this is an issue to solve in gup_fast
 implementation, I wouldn't blame or modify the callers for it.
 
 I don't think there's anything that prevents gup_fast to enable irqs
 after certain number of pages have been taken, nop; and disable the
 irqs again.
 

Agreed, I once upon a time had a patch set converting the 2 (x86 and
powerpc) gup_fast implementations at the time, but somehow that never
got anywhere.

Just saying we should probably do that before we add callers with
unlimited nr_pages.
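
(The shape of that would presumably be something like the below --
GUP_FAST_BATCH and __gup_fast_walk() are made-up names; the point is only
that the IRQ-off region gets bounded inside gup_fast itself:)

#define GUP_FAST_BATCH	64

int sketch_get_user_pages_fast(unsigned long start, int nr_pages,
			       int write, struct page **pages)
{
	int done = 0;

	while (done < nr_pages) {
		int chunk = min(nr_pages - done, GUP_FAST_BATCH);
		int ret;

		local_irq_disable();
		ret = __gup_fast_walk(start + done * PAGE_SIZE,
				      chunk, write, pages + done);
		local_irq_enable();		/* bounded IRQ-off latency */

		done += ret;
		if (ret < chunk)
			break;			/* let the caller fall back */
	}
	return done;
}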


Re: RFC: get_user_pages_locked|unlocked to leverage VM_FAULT_RETRY

2014-10-02 Thread Peter Zijlstra
On Thu, Oct 02, 2014 at 02:50:52PM +0200, Peter Zijlstra wrote:
 On Thu, Oct 02, 2014 at 02:31:17PM +0200, Andrea Arcangeli wrote:
  On Wed, Oct 01, 2014 at 05:36:11PM +0200, Peter Zijlstra wrote:
   For all these and the other _fast() users, is there an actual limit to
   the nr_pages passed in? Because we used to have the 64 pages limit from
   DIO, but without that we get rather long IRQ-off latencies.
  
  Ok, I would tend to think this is an issue to solve in gup_fast
  implementation, I wouldn't blame or modify the callers for it.
  
  I don't think there's anything that prevents gup_fast to enable irqs
  after certain number of pages have been taken, nop; and disable the
  irqs again.
  
 
 Agreed, I once upon a time had a patch set converting the 2 (x86 and
 powerpc) gup_fast implementations at the time, but somehow that never
 got anywhere.
 
 Just saying we should probably do that before we add callers with
 unlimited nr_pages.

https://lkml.org/lkml/2009/6/24/457

Clearly there's more work these days. Many more archs grew a gup.c


Re: RFC: get_user_pages_locked|unlocked to leverage VM_FAULT_RETRY

2014-10-01 Thread Peter Zijlstra
On Fri, Sep 26, 2014 at 07:25:35PM +0200, Andrea Arcangeli wrote:
 diff --git a/drivers/dma/iovlock.c b/drivers/dma/iovlock.c
 index bb48a57..12ea7c3 100644
 --- a/drivers/dma/iovlock.c
 +++ b/drivers/dma/iovlock.c
 @@ -95,17 +95,11 @@ struct dma_pinned_list *dma_pin_iovec_pages(struct iovec 
 *iov, size_t len)
   pages += page_list-nr_pages;
  
   /* pin pages down */
 - down_read(&current->mm->mmap_sem);
 - ret = get_user_pages(
 - current,
 - current->mm,
 + ret = get_user_pages_fast(
   (unsigned long) iov[i].iov_base,
   page_list-nr_pages,
   1,  /* write */
 - 0,  /* force */
 - page_list-pages,
 - NULL);
 - up_read(&current->mm->mmap_sem);
 + page_list-pages);
  
   if (ret != page_list-nr_pages)
   goto unpin;

 --- a/drivers/misc/sgi-gru/grufault.c
 +++ b/drivers/misc/sgi-gru/grufault.c
 @@ -198,8 +198,7 @@ static int non_atomic_pte_lookup(struct vm_area_struct 
 *vma,
  #else
   *pageshift = PAGE_SHIFT;
  #endif
 - if (get_user_pages
 - (current, current->mm, vaddr, 1, write, 0, &page, NULL) <= 0)
 + if (get_user_pages_fast(vaddr, 1, write, &page) <= 0)
   return -EFAULT;
   *paddr = page_to_phys(page);
   put_page(page);

 diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
 index aff9689..c89dcfa 100644
 --- a/drivers/scsi/st.c
 +++ b/drivers/scsi/st.c
 @@ -4536,18 +4536,12 @@ static int sgl_map_user_pages(struct st_buffer *STbp,
   return -ENOMEM;
  
  /* Try to fault in all of the necessary pages */
 - down_read(&current->mm->mmap_sem);
  /* rw==READ means read from drive, write into memory area */
 - res = get_user_pages(
 - current,
 - current->mm,
 + res = get_user_pages_fast(
   uaddr,
   nr_pages,
   rw == READ,
 - 0, /* don't force */
 - pages,
 - NULL);
 - up_read(&current->mm->mmap_sem);
 + pages);
  
   /* Errors and no page mapped should return here */
   if (res < nr_pages)


For all these and the other _fast() users, is there an actual limit to
the nr_pages passed in? Because we used to have the 64 pages limit from
DIO, but without that we get rather long IRQ-off latencies.


Re: [RESEND PATCH 1/3] x86: Adding structs to reflect cpuid fields

2014-09-17 Thread Peter Zijlstra
On Wed, Sep 17, 2014 at 03:54:12PM +0300, Nadav Amit wrote:
 Adding structs that reflect various cpuid fields in x86 architecture. Structs
 were added only for functions that are not pure bitmaps.
 
 Signed-off-by: Nadav Amit na...@cs.technion.ac.il
 ---
  arch/x86/include/asm/cpuid_def.h | 163 
 +++
  1 file changed, 163 insertions(+)
  create mode 100644 arch/x86/include/asm/cpuid_def.h
 
 diff --git a/arch/x86/include/asm/cpuid_def.h 
 b/arch/x86/include/asm/cpuid_def.h
 new file mode 100644
 index 000..0cf43ba
 --- /dev/null
 +++ b/arch/x86/include/asm/cpuid_def.h
 @@ -0,0 +1,163 @@
 +#ifndef ARCH_X86_KVM_CPUID_DEF_H
 +#define ARCH_X86_KVM_CPUID_DEF_H

Stale name? Its not exclusively used for KVM and the header name itself
doesn't include KVM either.


Re: [PATCH 0/3] x86: structs for cpuid info in x86

2014-09-17 Thread Peter Zijlstra
On Wed, Sep 17, 2014 at 02:37:10PM +0200, Ingo Molnar wrote:
 If hpa, tglx or Linus objects I'll yield to that objection 
 though.
 
 Opinions, objections?

They generally look fine to me. I appreciate the bitfields for
readability. I often use the same when having to deal with hardware
bitfields. See for example the cpuid10_a?x unions in asm/perf_event.h
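
(For reference, that style looks roughly like this -- field widths as I
remember them for CPUID leaf 0xA EAX, worth double-checking against the SDM:)

union cpuid10_eax {
	struct {
		unsigned int version_id:8;
		unsigned int num_counters:8;
		unsigned int bit_width:8;
		unsigned int mask_length:8;
	} split;
	unsigned int full;
};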


Re: [PATCH V6 1/2] perf ignore LBR and extra_rsp

2014-07-15 Thread Peter Zijlstra
On Mon, Jul 14, 2014 at 12:25:56PM -0700, kan.li...@intel.com wrote:

Close enough.

 +#define EVENT_EXTRA_REG(e, ms, m, vm, i, a) {\
 + .event = (e),   \
 + .msr = (ms),\
 + .config_mask = (m), \
 + .valid_mask = (vm), \
 + .idx = EXTRA_REG_##i,   \
 + .extra_msr_access = (a),\
   }
  
  #define INTEL_EVENT_EXTRA_REG(event, msr, vm, idx)   \
 - EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT, vm, idx)
 + EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT, vm, idx, true)
  
  #define INTEL_UEVENT_EXTRA_REG(event, msr, vm, idx) \
   EVENT_EXTRA_REG(event, msr, ARCH_PERFMON_EVENTSEL_EVENT | \
 - ARCH_PERFMON_EVENTSEL_UMASK, vm, idx)
 + ARCH_PERFMON_EVENTSEL_UMASK, vm, idx, true)
  
  #define INTEL_UEVENT_PEBS_LDLAT_EXTRA_REG(c) \
   INTEL_UEVENT_EXTRA_REG(c, \
 @@ -318,7 +320,7 @@ struct extra_reg {
  0x, \
  LDLAT)
  
 -#define EVENT_EXTRA_END EVENT_EXTRA_REG(0, 0, 0, 0, RSP_0)
 +#define EVENT_EXTRA_END EVENT_EXTRA_REG(0, 0, 0, 0, RSP_0, false)

 diff --git a/arch/x86/kernel/cpu/perf_event_intel_uncore.h 
 b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
 index 90236f0..d860e26 100644
 --- a/arch/x86/kernel/cpu/perf_event_intel_uncore.h
 +++ b/arch/x86/kernel/cpu/perf_event_intel_uncore.h
 @@ -366,11 +366,11 @@
   NHMEX_M_PMON_CTL_FLAG_MODE)
  #define MBOX_INC_SEL_EXTAR_REG(c, r) \
   EVENT_EXTRA_REG(MBOX_INC_SEL(c), NHMEX_M0_MSR_PMU_##r, \
 - MBOX_INC_SEL_MASK, (u64)-1, NHMEX_M_##r)
 + MBOX_INC_SEL_MASK, (u64)-1, NHMEX_M_##r, true)
  #define MBOX_SET_FLAG_SEL_EXTRA_REG(c, r) \
   EVENT_EXTRA_REG(MBOX_SET_FLAG_SEL(c), NHMEX_M0_MSR_PMU_##r, \
   MBOX_SET_FLAG_SEL_MASK, \
 - (u64)-1, NHMEX_M_##r)
 + (u64)-1, NHMEX_M_##r, true)
  
  /* NHM-EX Rbox */
  #define NHMEX_R_MSR_GLOBAL_CTL   0xe00


Since nobody ever treats EVENT_EXTRA_END as an actual event, the value
of .extra_msr_access is irrelevant, this leaves the only 'possible'
value 'true' and we can delete all those changes.

Which, combined with a few whitespace cleanups, gives the below patch.

---
Subject: x86, perf: Protect LBR and extra_regs against KVM lying
From: Kan Liang kan.li...@intel.com
Date: Mon, 14 Jul 2014 12:25:56 -0700

With -cpu host, KVM reports LBR and extra_regs support, if the host has
support.

When the guest perf driver tries to access LBR or extra_regs MSR,
it #GPs all MSR accesses,since KVM doesn't handle LBR and extra_regs support.
So check the related MSRs access right once at initialization time to avoid
the error access at runtime.

For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL = y
(for host kernel).
And CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n (for guest kernel).
Start the guest with -cpu host.
Run perf record with --branch-any or --branch-filter in guest to trigger LBR
Run perf stat offcore events (E.g. LLC-loads/LLC-load-misses ...) in guest to
trigger offcore_rsp #GP
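
(The check itself can be a simple write/read-back probe with the *_safe MSR
accessors, roughly as below -- a sketch of the idea, not the exact helper
that gets merged:)

static bool sketch_check_msr(unsigned long msr, u64 mask)
{
	u64 val_old, val_new, val_tmp;

	/* A #GP on the plain read already rules the MSR out. */
	if (rdmsrl_safe(msr, &val_old))
		return false;

	/* Flip some harmless bits and see whether they stick. */
	val_tmp = val_old ^ mask;
	if (wrmsrl_safe(msr, val_tmp) || rdmsrl_safe(msr, &val_new))
		return false;

	if (val_new != val_tmp)
		return false;

	wrmsrl_safe(msr, val_old);	/* restore the original value */
	return true;
}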

Cc: a...@firstfloor.org
Signed-off-by: Kan Liang kan.li...@intel.com
Signed-off-by: Peter Zijlstra pet...@infradead.org
Link: 
http://lkml.kernel.org/r/1405365957-20202-1-git-send-email-kan.li...@intel.com
---

 arch/x86/kernel/cpu/perf_event.c   |3 +
 arch/x86/kernel/cpu/perf_event.h   |   12 +++---
 arch/x86/kernel/cpu/perf_event_intel.c |   66 -
 3 files changed, 75 insertions(+), 6 deletions(-)

--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -118,6 +118,9 @@ static int x86_pmu_extra_regs(u64 config
continue;
if (event->attr.config1 & ~er->valid_mask)
return -EINVAL;
+   /* Check if the extra msrs can be safely accessed*/
+   if (!er->extra_msr_access)
+   return -ENXIO;
 
reg->idx = er->idx;
reg->config = event->attr.config1;
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -295,14 +295,16 @@ struct extra_reg {
u64 config_mask;
u64 valid_mask;
int idx;  /* per_xxx->regs[] reg index */
+   boolextra_msr_access;
 };
 
 #define EVENT_EXTRA_REG(e, ms, m, vm, i) { \
-   .event = (e),   \
-   .msr = (ms),\
-   .config_mask = (m), \
-   .valid_mask = (vm), \
-   .idx = EXTRA_REG_##i,   \
+   .event = (e),   \
+   .msr = (ms),\
+   .config_mask = (m), \
+   .valid_mask = (vm), \
+   .idx = EXTRA_REG_##i

Re: [PATCH V6 1/2] perf ignore LBR and extra_rsp

2014-07-15 Thread Peter Zijlstra
On Tue, Jul 15, 2014 at 03:32:36PM +, Liang, Kan wrote:
 
 
  
  
  Since nobody ever treats EVENT_EXTRA_END as an actual event, the value
  of .extra_msr_access is irrelevant, this leaves the only 'possible'
  value 'true' and we can delete all those changes.
  
 Right.
  Which, combined with a few whitespace cleanups, gives the below patch.
 
 Thanks. Your change is perfect. Do I need to resubmit the patch to the 
 mailing list?

Nope, got it queued.




Re: [PATCH V5 1/2] perf ignore LBR and extra_regs

2014-07-14 Thread Peter Zijlstra
On Thu, Jul 10, 2014 at 03:59:43AM -0700, kan.li...@intel.com wrote:
 From: Kan Liang kan.li...@intel.com
 
 x86, perf: Protect LBR and extra_regs against KVM lying
 
 With -cpu host, KVM reports LBR and extra_regs support, if the host has 
 support.
 When the guest perf driver tries to access LBR or extra_regs MSR,
 it #GPs all MSR accesses,since KVM doesn't handle LBR and extra_regs support.
 So check the related MSRs access right once at initialization time to avoid 
 the error access at runtime.
 
 For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL = y 
 (for host kernel).
 And CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n (for guest kernel).
 Start the guest with -cpu host.
 Run perf record with --branch-any or --branch-filter in guest to trigger LBR 
 #GP.
 Run perf stat offcore events (E.g. LLC-loads/LLC-load-misses ...) in guest to 
 trigger offcore_rsp #GP

This is still not properly wrapped at 78 chars.

 Signed-off-by: Kan Liang kan.li...@intel.com
 
 V2: Move the check code to initialization time.
 V3: Add flag for each extra register.
 Check all LBR MSRs at initialization time.
 V4: Remove lbr_msr_access. For LBR msr, simply set lbr_nr to 0 if check_msr 
 failed.
 Disable all extra msrs in creation places if check_msr failed.
 V5: Fix check_msr broken
 Don't check any more MSRs after the first fail
 Return error when checking fail to stop creating the event
 Remove the checking code path which never get

These things should go below the --- so they get thrown away when
applying the patch; it's of no relevance once applied.

 ---
  arch/x86/kernel/cpu/perf_event.c   |  3 +++
  arch/x86/kernel/cpu/perf_event.h   | 45 
 ++
  arch/x86/kernel/cpu/perf_event_intel.c | 38 +++-
  3 files changed, 85 insertions(+), 1 deletion(-)
 
 diff --git a/arch/x86/kernel/cpu/perf_event.c 
 b/arch/x86/kernel/cpu/perf_event.c
 index 2bdfbff..a7c5e4b 100644
 --- a/arch/x86/kernel/cpu/perf_event.c
 +++ b/arch/x86/kernel/cpu/perf_event.c
 @@ -118,6 +118,9 @@ static int x86_pmu_extra_regs(u64 config, struct 
 perf_event *event)
   continue;
   if (event->attr.config1 & ~er->valid_mask)
   return -EINVAL;
 + /* Check if the extra msrs can be safely accessed*/
 + if (!x86_pmu.extra_msr_access[er->idx])
 + return -EFAULT;

This is not a correct usage of -EFAULT. Event creation did not fail
because we took a fault dereferencing a user provided pointer. Possibly
ENXIO is appropriate.

   reg->idx = er->idx;
   reg->config = event->attr.config1;
 diff --git a/arch/x86/kernel/cpu/perf_event.h 
 b/arch/x86/kernel/cpu/perf_event.h
 index 3b2f9bd..992c678 100644
 --- a/arch/x86/kernel/cpu/perf_event.h
 +++ b/arch/x86/kernel/cpu/perf_event.h
 @@ -464,6 +464,12 @@ struct x86_pmu {
*/
   struct extra_reg *extra_regs;
   unsigned int er_flags;
 + /*
 +  * EXTRA REG MSR can be accessed
 +  * The extra registers are completely unrelated to each other.
 +  * So it needs a flag for each extra register.
 +  */
 + boolextra_msr_access[EXTRA_REG_MAX];

So why not in struct extra_reg again? You didn't give a straight answer
there.

 +/*
 + * Under certain circumstances, access certain MSR may cause #GP.
 + * The function tests if the input MSR can be safely accessed.
 + */
 +static inline bool check_msr(unsigned long msr)
 +{

This reads like a generic function;

 + u64 val_old, val_new, val_tmp;
 +
 + /*
 +  * Read the current value, change it and read it back to see if it
 +  * matches, this is needed to detect certain hardware emulators
 +  * (qemu/kvm) that don't trap on the MSR access and always return 0s.
 +  */
 + if (rdmsrl_safe(msr, &val_old))
 + goto msr_fail;
 + /*
 +  * Only chagne it slightly,
 +  * since the higher bits of some MSRs cannot be updated by wrmsrl.
 +  * E.g. MSR_LBR_TOS
 +  */
 + val_tmp = val_old ^ 0x3UL;

but this is not generally true; not all MSRs can write the 2 LSB, can
they? One option would be to extend the function with a u64 mask.

 + if (wrmsrl_safe(msr, val_tmp) ||
 + rdmsrl_safe(msr, &val_new))
 + goto msr_fail;
 +
 + if (val_new != val_tmp)
 + goto msr_fail;
 +
 + /* Here it's sure that the MSR can be safely accessed.
 +  * Restore the old value and return.
 +  */
 + wrmsrl(msr, val_old);
 +
 + return true;
 +
 +msr_fail:
 + return false;
 +}

Also, by now this function is far too large to be inline and in a
header.
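
(A minimal sketch of the direction being suggested here: check_msr() taking an explicit u64 toggle mask and living in a .c file rather than inline in a header. The shape follows the quoted patch, but the signature and the choice of mask are assumptions, not the code under review.)

	static bool check_msr(unsigned long msr, u64 mask)
	{
		u64 val_old, val_new, val_tmp;

		/* Read, flip only the bits in @mask, write, read back. */
		if (rdmsrl_safe(msr, &val_old))
			return false;

		val_tmp = val_old ^ mask;

		if (wrmsrl_safe(msr, val_tmp) ||
		    rdmsrl_safe(msr, &val_new))
			return false;

		/* Emulated MSRs that ignore writes or always read 0 fail here. */
		if (val_new != val_tmp)
			return false;

		/* Restore the original value. */
		wrmsrl(msr, val_old);
		return true;
	}

The caller then passes bits it knows are writable for each MSR (e.g. 0x3UL for MSR_LBR_TOS, as in the patch) instead of hard-coding the two low bits inside the function.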

 + /*
 +  * Access LBR MSR may cause #GP under certain circumstances.
 +  * E.g. KVM doesn't support LBR MSR
 +  * Check all LBT MSR here.
 +  * Disable LBR access if any LBR MSRs can not be accessed.
 +  */
 + if (x86_pmu.lbr_nr) {
 + if 

Re: [PATCH V5 1/2] perf ignore LBR and extra_regs

2014-07-14 Thread Peter Zijlstra
On Thu, Jul 10, 2014 at 03:59:43AM -0700, kan.li...@intel.com wrote:
 +/*
 + * Under certain circumstances, access certain MSR may cause #GP.
 + * The function tests if the input MSR can be safely accessed.
 + */
 +static inline bool check_msr(unsigned long msr)
 +{
 + u64 val_old, val_new, val_tmp;
 +
 + /*
 +  * Read the current value, change it and read it back to see if it
 +  * matches, this is needed to detect certain hardware emulators
 +  * (qemu/kvm) that don't trap on the MSR access and always return 0s.
 +  */
 + if (rdmsrl_safe(msr, &val_old))
 + goto msr_fail;
 + /*
 +  * Only chagne it slightly,
 +  * since the higher bits of some MSRs cannot be updated by wrmsrl.
 +  * E.g. MSR_LBR_TOS
 +  */
 + val_tmp = val_old ^ 0x3UL;
 + if (wrmsrl_safe(msr, val_tmp) ||
 + rdmsrl_safe(msr, &val_new))
 + goto msr_fail;
 +
 + if (val_new != val_tmp)
 + goto msr_fail;
 +
 + /* Here it's sure that the MSR can be safely accessed.
 +  * Restore the old value and return.
 +  */
 + wrmsrl(msr, val_old);
 +
 + return true;
 +
 +msr_fail:
 + return false;
 +}

I don't think we need the msr_fail thing, there's no state to clean up,
you can return false at all places you now have goto.




Re: [PATCH V5 1/2] perf ignore LBR and extra_regs

2014-07-14 Thread Peter Zijlstra
On Mon, Jul 14, 2014 at 01:55:03PM +0200, Paolo Bonzini wrote:
 On 10/07/2014 12:59, kan.li...@intel.com wrote:
 From: Kan Liang kan.li...@intel.com
 
 x86, perf: Protect LBR and extra_regs against KVM lying
 
 With -cpu host, KVM reports LBR and extra_regs support, if the host has 
 support.
 When the guest perf driver tries to access LBR or extra_regs MSR,
 it #GPs all MSR accesses,since KVM doesn't handle LBR and extra_regs support.
 So check the related MSRs access right once at initialization time to avoid 
 the error access at runtime.
 
 For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL = y 
 (for host kernel).
 And CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n (for guest kernel).
 
 I'm not sure this is a useful patch.
 
 This is #GP'ing just because of a limitation in the PMU; just compile the
 kernel with CONFIG_PARAVIRT

How's that going to help? If you run kvm -cpu host the VM is lying through
its teeth and the kernel is going to assume all those MSRs are present;
PARAVIRT isn't going to help with this.

 , or split the rdmsr is always rdmsr_safe
 behavior out of CONFIG_PARAVIRT and into a new Kconfig symbol.

That's not useful either, because none of these code-paths are going to
check the return value.

 In fact there's no reason why LBR cannot be virtualized (though it does need
 support from the processor), and it may even be possible to support
 OFFCORE_RSP_X in the KVM virtual PMU.

But its not, so something needs to be done, right?




Re: [PATCH V5 1/2] perf ignore LBR and extra_regs

2014-07-14 Thread Peter Zijlstra
On Mon, Jul 14, 2014 at 02:40:33PM +0200, Paolo Bonzini wrote:
 Hmmm, I thought rdmsr_safe was going to return zero, but it just returns
 whatever happened to be in edx:eax which maybe should also be fixed.
 
 Kan Liang, what happens if CONFIG_PARAVIRT=y?  Do you get garbage or just no
 events reported?

Either way; it's not functioning according to 'spec'. Therefore we should
'disable' things.

 In fact there's no reason why LBR cannot be virtualized (though it does need
 support from the processor), and it may even be possible to support
 OFFCORE_RSP_X in the KVM virtual PMU.
 
 But its not, so something needs to be done, right?
 
 A first thing that could be done, is to have a way for the kernel to detect
 absence of LBR

Which is exactly what this patch does, no?

 , for example an all-1s setting of the LBR format field of
 IA32_PERF_CAPABILITIES.  If Intel can tell us all 1s will never be used,
 we can have KVM set the field that way.  The kernel then should disable LBR.

So what's wrong with testing if the MSRs that provide LBR actually work
or not? That doesn't require any magic co-operation of the host/vm
machinery and is therefore far more robust.
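
(Sketched for concreteness, using the check_msr() probe from the patch under review; the exact init-time call site is an assumption:)

	/*
	 * Probe once at PMU init and quietly drop LBR support if the MSRs
	 * misbehave.  Works identically on bare metal and under any
	 * hypervisor; no capability handshake with the host required.
	 */
	if (x86_pmu.lbr_nr && !check_msr(x86_pmu.lbr_tos))
		x86_pmu.lbr_nr = 0;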




Re: [PATCH V5 1/2] perf ignore LBR and extra_regs

2014-07-14 Thread Peter Zijlstra

so once more; and then I'm going to route your emails to /dev/null, wrap
text at 78 chars.

On Mon, Jul 14, 2014 at 02:28:36PM +, Liang, Kan wrote:
   +++ b/arch/x86/kernel/cpu/perf_event.h
   @@ -464,6 +464,12 @@ struct x86_pmu {
  */
 struct extra_reg *extra_regs;
 unsigned int er_flags;
   + /*
   +  * EXTRA REG MSR can be accessed
   +  * The extra registers are completely unrelated to each other.
   +  * So it needs a flag for each extra register.
   +  */
   + boolextra_msr_access[EXTRA_REG_MAX];
  
  So why not in struct extra_reg again? You didn't give a straight answer 
  there.
 
 I think I did in the email.
 You mentioned that there's still (only) 4 empty bytes at the tail of
 extra_reg itself.  However, the extra_reg_type may be extended in the
 near future.  So that may not be a reason to move to extra_reg.

Well, you can always grow. Also be explicit, 'may be' is an empty
statement.

 Furthermore, if we move extra_msr_access to extra_reg, I guess we have
 to modify all the related macros (i.e. EVENT_EXTRA_REG) to initialize
 the new items.  That could be a big change.

Nah, trivial stuff :-)

 On the other side, in x86_pmu structure, there are extra_regs related
 items defined under the comments Extra registers for events.  And
 the bit holes are enough for current usage and future extension.
 
 So I guess  x86_pmu should be a good place to store the availability
 of the reg.

It just doesn't make sense to me to have multiple arrays of the same
thing.





Re: [PATCH V4 1/2] perf ignore LBR and extra_regs.

2014-07-10 Thread Peter Zijlstra
/me reminds you of 78 char text wrap.

On Wed, Jul 09, 2014 at 07:32:09PM +, Liang, Kan wrote:
  Sure; but what I meant was, check_msr() is broken when ran on such a
  kernel. You need to fix check_msr() to return failure on these 'ignored'
  MSRs, after all they don't function as expected, they're effectively broken.
 
 The function is designed to check if the MSRs can be safely accessed
 (no #GP). It cannot guarantee the correctness of the MSRs.  If KVM
 applied patch 2 and guest applied patch 1, from the guest's
 perspective, the MSRs can be accessed (no #GP triggered). So return
 true is expected. It should not be broken.

You're not understanding. I know you wrote that function to do that. I'm
saying that's wrong.

Look at check_hw_exists(); it explicitly checks for fake MSRs and reports
them broken.

These fake MSRs _ARE_ broken, they do not behave as expected. Not
crashing is not the right consideration here, we're interested in higher
order correct behaviour.

 The only unexpected
 thing for guest is that the counting/sampling result for LBR/extra reg
 is always 0. But the patch is a short term fix to stop things from
 crashing. I think it should be acceptable.

Patch 2 is fine, patch 1, in particular your check_msr() routine is not.


Re: [PATCH V4 1/2] perf ignore LBR and extra_regs.

2014-07-09 Thread Peter Zijlstra
On Tue, Jul 08, 2014 at 09:49:40AM -0700, kan.li...@intel.com wrote:
 diff --git a/arch/x86/kernel/cpu/perf_event.c 
 b/arch/x86/kernel/cpu/perf_event.c
 index 2bdfbff..f0e8022 100644
 --- a/arch/x86/kernel/cpu/perf_event.c
 +++ b/arch/x86/kernel/cpu/perf_event.c
 @@ -118,6 +118,9 @@ static int x86_pmu_extra_regs(u64 config, struct 
 perf_event *event)
   continue;
   if (event->attr.config1 & ~er->valid_mask)
   return -EINVAL;
 + /* Check if the extra msrs can be safely accessed*/
 + if (!x86_pmu.extra_msr_access[er->idx])
 + continue;
 
   reg->idx = er->idx;
   reg->config = event->attr.config1;

You should return an error here; doing continue will eventually make
it return 0, which is no-error and things will happily create the event
and malfunction.
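
(In other words, a sketch only, borrowing the -ENXIO that the review elsewhere in this thread settles on: the check wants to fail event creation outright.)

	if (event->attr.config1 & ~er->valid_mask)
		return -EINVAL;
	/*
	 * Broken extra MSR: refuse the event.  A 'continue' here would fall
	 * through to the function's final 'return 0' and report success.
	 */
	if (!x86_pmu.extra_msr_access[er->idx])
		return -ENXIO;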




Re: [PATCH V4 1/2] perf ignore LBR and extra_regs.

2014-07-09 Thread Peter Zijlstra
On Tue, Jul 08, 2014 at 09:49:40AM -0700, kan.li...@intel.com wrote:
 diff --git a/arch/x86/kernel/cpu/perf_event.c 
 b/arch/x86/kernel/cpu/perf_event.c
 index 2bdfbff..f0e8022 100644
 --- a/arch/x86/kernel/cpu/perf_event.c
 +++ b/arch/x86/kernel/cpu/perf_event.c
 @@ -118,6 +118,9 @@ static int x86_pmu_extra_regs(u64 config, struct 
 perf_event *event)
   continue;
   if (event->attr.config1 & ~er->valid_mask)
   return -EINVAL;
 + /* Check if the extra msrs can be safely accessed*/
 + if (!x86_pmu.extra_msr_access[er->idx])
 + continue;

If you fail here;

   reg->idx = er->idx;
   reg->config = event->attr.config1;

 diff --git a/arch/x86/kernel/cpu/perf_event_intel.c 
 b/arch/x86/kernel/cpu/perf_event_intel.c
 index adb02aa..2be44be 100644
 --- a/arch/x86/kernel/cpu/perf_event_intel.c
 +++ b/arch/x86/kernel/cpu/perf_event_intel.c
 @@ -1471,11 +1471,14 @@ static void intel_fixup_er(struct perf_event *event, 
 int idx)
  {
   event->hw.extra_reg.idx = idx;
  
 - if (idx == EXTRA_REG_RSP_0) {
 + /* The extra_reg doesn't update if msrs cannot be accessed safely */
 + if ((idx == EXTRA_REG_RSP_0) &&
 + x86_pmu.extra_msr_access[EXTRA_REG_RSP_0]) {
   event->hw.config &= ~INTEL_ARCH_EVENT_MASK;
   event->hw.config |= x86_pmu.extra_regs[EXTRA_REG_RSP_0].event;
   event->hw.extra_reg.reg = MSR_OFFCORE_RSP_0;
 - } else if (idx == EXTRA_REG_RSP_1) {
 + } else if ((idx == EXTRA_REG_RSP_1) &&
 + x86_pmu.extra_msr_access[EXTRA_REG_RSP_1]) {
   event->hw.config &= ~INTEL_ARCH_EVENT_MASK;
   event->hw.config |= x86_pmu.extra_regs[EXTRA_REG_RSP_1].event;
   event->hw.extra_reg.reg = MSR_OFFCORE_RSP_1;

You should never get here..





Re: [PATCH V4 1/2] perf ignore LBR and extra_regs.

2014-07-09 Thread Peter Zijlstra
On Tue, Jul 08, 2014 at 09:49:40AM -0700, kan.li...@intel.com wrote:
 + /*
 +  * Access LBR MSR may cause #GP under certain circumstances.
 +  * E.g. KVM doesn't support LBR MSR
 +  * Check all LBT MSR here.
 +  * Disable LBR access if any LBR MSRs can not be accessed.
 +  */
 + if (x86_pmu.lbr_nr) {
 + access = check_msr(x86_pmu.lbr_tos);
 + for (i = 0; i < x86_pmu.lbr_nr; i++) {
 + access &= check_msr(x86_pmu.lbr_from + i);
 + access &= check_msr(x86_pmu.lbr_to + i);
 + }
 + if (!access)
 + x86_pmu.lbr_nr = 0;
 + }

You _could_ write that as:

if (x86_pmu.lbr_nr) {
if (!check_msr(x86_pmu.lbr_tos))
x86_pmu.lbr_nr = 0;
for (i = 0; i < x86_pmu.lbr_nr; i++) {
if (!(check_msr(x86_pmu.lbr_from + i) &&
  check_msr(x86_pmu.lbr_to + i)))
  x86_pmu.lbr_nr = 0;
}
}

There is no point in checking any more MSRs after the first fail after
all.




Re: [PATCH V4 1/2] perf ignore LBR and extra_regs.

2014-07-09 Thread Peter Zijlstra
On Tue, Jul 08, 2014 at 09:49:40AM -0700, kan.li...@intel.com wrote:
 --- a/arch/x86/kernel/cpu/perf_event.h
 +++ b/arch/x86/kernel/cpu/perf_event.h
 @@ -464,6 +464,12 @@ struct x86_pmu {
*/
   struct extra_reg *extra_regs;
   unsigned int er_flags;
 + /*
 +  * EXTRA REG MSR can be accessed
 +  * The extra registers are completely unrelated to each other.
 +  * So it needs a flag for each extra register.
 +  */
 + boolextra_msr_access[EXTRA_REG_MAX];
  
   /*
* Intel host/guest support (KVM)

# pahole -C extra_reg arch/x86/kernel/cpu/perf_event_intel.o
struct extra_reg {
unsigned int   event;         /*  0  4 */
unsigned int   msr;           /*  4  4 */
u64            config_mask;   /*  8  8 */
u64            valid_mask;    /* 16  8 */
int            idx;           /* 24  4 */

/* size: 32, cachelines: 1, members: 5 */
/* padding: 4 */
/* last cacheline: 32 bytes */
};

There's still 4 empty bytes at the tail of extra_reg itself; would it
make sense to store the availability of the reg in there?

After all; the place we use it (x86_pmu_extra_regs) already has the
pointer to the structure.
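
(Illustration of that point, matching what the queued patch earlier in this dump ends up doing: the flag slots into the existing tail padding, so the struct stays at 32 bytes, one cacheline.)

	struct extra_reg {
		unsigned int	event;
		unsigned int	msr;
		u64		config_mask;
		u64		valid_mask;
		int		idx;			/* EXTRA_REG_* index */
		bool		extra_msr_access;	/* fits in the former 4-byte pad */
	};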




Re: [PATCH V4 1/2] perf ignore LBR and extra_regs.

2014-07-09 Thread Peter Zijlstra
On Tue, Jul 08, 2014 at 09:49:40AM -0700, kan.li...@intel.com wrote:
 +/*
 + * Under certain circumstances, access certain MSR may cause #GP.
 + * The function tests if the input MSR can be safely accessed.
 + */
 +static inline bool check_msr(unsigned long msr)
 +{
 + u64 value;
 +
 + if (rdmsrl_safe(msr, &value) < 0)
 + return false;
 + if (wrmsrl_safe(msr, value) < 0)
 + return false;
 + return true;
 +}

What does this thing return after patch 2? does the write still fault or
will KVM silently take writes too?




Re: [PATCH V4 1/2] perf ignore LBR and extra_regs.

2014-07-09 Thread Peter Zijlstra
On Wed, Jul 09, 2014 at 02:32:28PM +, Liang, Kan wrote:
 
 
  On Tue, Jul 08, 2014 at 09:49:40AM -0700, kan.li...@intel.com wrote:
   +/*
   + * Under certain circumstances, access certain MSR may cause #GP.
   + * The function tests if the input MSR can be safely accessed.
   + */
   +static inline bool check_msr(unsigned long msr) {
   + u64 value;
   +
   + if (rdmsrl_safe(msr, &value) < 0)
   + return false;
   + if (wrmsrl_safe(msr, value) < 0)
   + return false;
   + return true;
   +}
  
  What does this thing return after patch 2? does the write still fault or 
  will
  KVM silently take writes too?
 
 If applying patch 2, the function will return true. The KVM just simply 
 ignore the reads/writes.

OK, then that's broken too. We want a function to return false for any
malfunctioning MSR; ignoring writes and returning 0s is not proper
functioning.




Re: [PATCH V4 1/2] perf ignore LBR and extra_regs.

2014-07-09 Thread Peter Zijlstra
On Wed, Jul 09, 2014 at 03:43:45PM +, Liang, Kan wrote:
 
 
  -Original Message-
  From: Peter Zijlstra [mailto:pet...@infradead.org]
  Sent: Wednesday, July 09, 2014 10:58 AM
  To: Liang, Kan
  Cc: a...@firstfloor.org; linux-ker...@vger.kernel.org; kvm@vger.kernel.org
  Subject: Re: [PATCH V4 1/2] perf ignore LBR and extra_regs.
  
  On Wed, Jul 09, 2014 at 02:32:28PM +, Liang, Kan wrote:
  
  
On Tue, Jul 08, 2014 at 09:49:40AM -0700, kan.li...@intel.com wrote:
 +/*
 + * Under certain circumstances, access certain MSR may cause #GP.
 + * The function tests if the input MSR can be safely accessed.
 + */
 +static inline bool check_msr(unsigned long msr) {
 + u64 value;
 +
 + if (rdmsrl_safe(msr, &value) < 0)
 + return false;
 + if (wrmsrl_safe(msr, value) < 0)
 + return false;
 + return true;
 +}
   
What does this thing return after patch 2? does the write still
fault or will KVM silently take writes too?
  
   If applying patch 2, the function will return true. The KVM just simply 
   ignore
  the reads/writes.
  
  OK, then that's broken too. We want a function to return false for any
  malfunctioning MSR, ignoring writes and returning 0s is not proper
  functioning.
 
 The patch 2 is to handle the case that the administrator can only
 patch the host. Don't need to force user to upgrade their guest to fix
 the crash.  And ignore the annoying unhandled KVM messages

Sure; but what I meant was, check_msr() is broken when ran on such a
kernel. You need to fix check_msr() to return failure on these 'ignored'
MSRs, after all they don't function as expected, they're effectively
broken.



