Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-11 Thread Raghavendra K T

On 07/10/2013 04:03 PM, Gleb Natapov wrote:
[...] trimmed


Yes, you are right. Dynamic ple window was an attempt to solve it.

Problem is, reducing the SPIN_THRESHOLD results in excess halt
exits in under-commit, and increasing ple_window may sometimes be
counter-productive as it affects other busy-wait constructs such as
flush_tlb AFAIK.
So if we could have a dynamically changing SPIN_THRESHOLD too, that
would be nice.
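For context, a minimal user-space sketch (not the series' kernel code; cpu_relax() and arch_halt() are stand-ins for the PAUSE instruction and the pv halt/kick path) of the spin-then-halt policy that SPIN_THRESHOLD controls in the pv-ticketlock slow path:

#include <stdatomic.h>

#define SPIN_THRESHOLD (1 << 15)        /* 32k, as chosen in V9 */

struct ticketlock {
        atomic_uint head;               /* ticket currently being served */
        atomic_uint tail;               /* next ticket to hand out */
};

static void cpu_relax(void) { }         /* stand-in for the PAUSE insn */
static void arch_halt(void) { }         /* stand-in for HLT plus pv wait/kick */

static void ticket_lock(struct ticketlock *lock)
{
        unsigned int me = atomic_fetch_add(&lock->tail, 1);

        for (;;) {
                /* Busy-wait for a bounded number of iterations. */
                for (unsigned int loops = 0; loops < SPIN_THRESHOLD; loops++) {
                        if (atomic_load(&lock->head) == me)
                                return;         /* got the lock while spinning */
                        cpu_relax();
                }
                /*
                 * Slow path: the holder is probably preempted, so block
                 * instead of burning cycles.  Too small a threshold means
                 * excess halt exits in under-commit; too large a window
                 * wastes time in over-commit, hence the wish above for
                 * dynamic values.
                 */
                arch_halt();
        }
}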



Gleb, Andrew,
I tested with the global ple window change (similar to what I posted
here https://lkml.org/lkml/2012/11/11/14 ),

This does not look global. It changes PLE per vcpu.


Okay, got it. I was thinking it would change the global value. But IIRC
it is changing the global sysfs value and the per-vcpu ple_window.
Sorry, I missed this part yesterday.




But I did not see a good result. Maybe it is good to go with a per-VM
ple_window.

Gleb,
Can you elaborate a little more on what you have in mind regarding per-VM
ple_window? (Maintaining part of it as a per-VM variable is clear to me),
but is it that we have to load it on every guest entry?


Only when it changes, shouldn't be too often, no?


OK. Thinking about how to do it: read the register and write it back if there
needs to be a change during guest entry?




Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-11 Thread Gleb Natapov
On Thu, Jul 11, 2013 at 02:43:03PM +0530, Raghavendra K T wrote:
 On 07/10/2013 04:03 PM, Gleb Natapov wrote:
 [...] trimmed
 
 Yes. you are right. dynamic ple window was an attempt to solve it.
 
 Problem is, reducing the SPIN_THRESHOLD is resulting in excess halt
 exits in under-commits and increasing ple_window may be sometimes
 counter productive as it affects other busy-wait constructs such as
 flush_tlb AFAIK.
 So if we could have had a dynamically changing SPIN_THRESHOLD too, that
 would be nice.
 
 
 Gleb, Andrew,
 I tested with the global ple window change (similar to what I posted
 here https://lkml.org/lkml/2012/11/11/14 ),
 This does not look global. It changes PLE per vcpu.
 
 Okay. Got it. I was thinking it would change the global value. But IIRC
  It is changing global sysfs value and per vcpu ple_window.
 Sorry. I missed this part yesterday.
 
Yes, it changes sysfs value but this does not affect already created
vcpus.

 
 But did not see good result. May be it is good to go with per VM
 ple_window.
 
 Gleb,
 Can you elaborate little more on what you have in mind regarding per
 VM ple_window. (maintaining part of it as a per vm variable is clear
 to
   me), but is it that we have to load that every time of guest entry?
 
 Only when it changes, shouldn't be too often, no?
 
 Ok. Thinking how to do. read the register and writeback if there need
 to be a change during guest entry?
 
Why not do it like in the patch you've linked? When value changes write it
to VMCS of the current vcpu.
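For reference, a kernel-flavoured sketch of that "write only when it changes" idea (the ple_window/ple_window_dirty fields on vcpu_vmx are hypothetical here; vmcs_write32() and the PLE_WINDOW field are the existing VMX accessors):

/* Hypothetical per-vcpu cache; only the vcpu's own thread touches it. */
static void vmx_set_ple_window(struct vcpu_vmx *vmx, unsigned int new_window)
{
        if (vmx->ple_window == new_window)
                return;                         /* nothing to sync */
        vmx->ple_window = new_window;
        vmx->ple_window_dirty = true;           /* sync on next guest entry */
}

/* Called on the guest-entry path, e.g. from vmx_vcpu_run(). */
static void vmx_sync_ple_window(struct vcpu_vmx *vmx)
{
        if (vmx->ple_window_dirty) {
                vmcs_write32(PLE_WINDOW, vmx->ple_window);
                vmx->ple_window_dirty = false;
        }
}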

--
Gleb.


Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-11 Thread Raghavendra K T

On 07/11/2013 03:18 PM, Gleb Natapov wrote:

On Thu, Jul 11, 2013 at 02:43:03PM +0530, Raghavendra K T wrote:

On 07/10/2013 04:03 PM, Gleb Natapov wrote:
[...] trimmed


Yes. you are right. dynamic ple window was an attempt to solve it.

Problem is, reducing the SPIN_THRESHOLD is resulting in excess halt
exits in under-commits and increasing ple_window may be sometimes
counter productive as it affects other busy-wait constructs such as
flush_tlb AFAIK.
So if we could have had a dynamically changing SPIN_THRESHOLD too, that
would be nice.



Gleb, Andrew,
I tested with the global ple window change (similar to what I posted
here https://lkml.org/lkml/2012/11/11/14 ),

This does not look global. It changes PLE per vcpu.


Okay. Got it. I was thinking it would change the global value. But IIRC
  It is changing global sysfs value and per vcpu ple_window.
Sorry. I missed this part yesterday.


Yes, it changes sysfs value but this does not affect already created
vcpus.




But did not see good result. May be it is good to go with per VM
ple_window.

Gleb,
Can you elaborate little more on what you have in mind regarding per
VM ple_window. (maintaining part of it as a per vm variable is clear
to
  me), but is it that we have to load that every time of guest entry?


Only when it changes, shouldn't be too often, no?


Ok. Thinking how to do. read the register and writeback if there need
to be a change during guest entry?


Why not do it like in the patch you've linked? When value changes write it
to VMCS of the current vcpu.



Yes, it can be done. So the running vcpu's ple_window gets updated only
after the next PLE exit, right?



Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-11 Thread Gleb Natapov
On Thu, Jul 11, 2013 at 03:40:38PM +0530, Raghavendra K T wrote:
 Gleb,
 Can you elaborate little more on what you have in mind regarding per
 VM ple_window. (maintaining part of it as a per vm variable is clear
 to
   me), but is it that we have to load that every time of guest entry?
 
 Only when it changes, shouldn't be too often, no?
 
 Ok. Thinking how to do. read the register and writeback if there need
 to be a change during guest entry?
 
 Why not do it like in the patch you've linked? When value changes write it
 to VMCS of the current vcpu.
 
 
 Yes. can be done. So the running vcpu's ple_window gets updated only
 after next pl-exit. right?
I am not sure what you mean. You cannot change vcpu's ple_window while
vcpu is in a guest mode.

--
Gleb.


Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-11 Thread Raghavendra K T

On 07/11/2013 03:41 PM, Gleb Natapov wrote:

On Thu, Jul 11, 2013 at 03:40:38PM +0530, Raghavendra K T wrote:

Gleb,
Can you elaborate little more on what you have in mind regarding per
VM ple_window. (maintaining part of it as a per vm variable is clear
to
  me), but is it that we have to load that every time of guest entry?


Only when it changes, shouldn't be too often, no?


Ok. Thinking how to do. read the register and writeback if there need
to be a change during guest entry?


Why not do it like in the patch you've linked? When value changes write it
to VMCS of the current vcpu.



Yes. can be done. So the running vcpu's ple_window gets updated only
after next pl-exit. right?

I am not sure what you mean. You cannot change vcpu's ple_window while
vcpu is in a guest mode.



I agree with that. Both of us are on the same page.
 What I meant is,
suppose the per-VM ple_window changes while a vcpu x of that VM is running;
it will get its ple_window updated during its next run.



Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-11 Thread Gleb Natapov
On Thu, Jul 11, 2013 at 04:23:58PM +0530, Raghavendra K T wrote:
 On 07/11/2013 03:41 PM, Gleb Natapov wrote:
 On Thu, Jul 11, 2013 at 03:40:38PM +0530, Raghavendra K T wrote:
 Gleb,
 Can you elaborate little more on what you have in mind regarding per
 VM ple_window. (maintaining part of it as a per vm variable is clear
 to
   me), but is it that we have to load that every time of guest entry?
 
 Only when it changes, shouldn't be too often, no?
 
 Ok. Thinking how to do. read the register and writeback if there need
 to be a change during guest entry?
 
 Why not do it like in the patch you've linked? When value changes write it
 to VMCS of the current vcpu.
 
 
 Yes. can be done. So the running vcpu's ple_window gets updated only
 after next pl-exit. right?
 I am not sure what you mean. You cannot change vcpu's ple_window while
 vcpu is in a guest mode.
 
 
 I agree with that. Both of us are on the same page.
  What I meant is,
 suppose the per VM ple_window changes when a vcpu x of that VM  was running,
 it will get its ple_window updated during next run.
Ah, I think per VM is what confuses me. Why do you want to have a per-VM
ple_window and not a per-vcpu one? With a per-vcpu one the ple_window
cannot change while the vcpu is running.

--
Gleb.


Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-11 Thread Raghavendra K T

On 07/11/2013 04:26 PM, Gleb Natapov wrote:

On Thu, Jul 11, 2013 at 04:23:58PM +0530, Raghavendra K T wrote:

On 07/11/2013 03:41 PM, Gleb Natapov wrote:

On Thu, Jul 11, 2013 at 03:40:38PM +0530, Raghavendra K T wrote:

Gleb,
Can you elaborate little more on what you have in mind regarding per
VM ple_window. (maintaining part of it as a per vm variable is clear
to
  me), but is it that we have to load that every time of guest entry?


Only when it changes, shouldn't be too often, no?


Ok. Thinking how to do. read the register and writeback if there need
to be a change during guest entry?


Why not do it like in the patch you've linked? When value changes write it
to VMCS of the current vcpu.



Yes. can be done. So the running vcpu's ple_window gets updated only
after next pl-exit. right?

I am not sure what you mean. You cannot change vcpu's ple_window while
vcpu is in a guest mode.



I agree with that. Both of us are on the same page.
  What I meant is,
suppose the per VM ple_window changes when a vcpu x of that VM  was running,
it will get its ple_window updated during next run.

Ah, I think per VM is what confuses me. Why do you want to have per
VM ple_windows and not per vcpu one? With per vcpu one ple_windows
cannot change while vcpu is running.



Okay, got that. My initial feeling was that a vcpu would not feel the global
load, but I think that should be no problem. Instead we will not need
atomic operations to update ple_window, which is better.
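To make the "no atomics needed" point concrete, a sketch under the assumption that the per-vcpu window is only ever adjusted from that vcpu's own exit path (the struct, policy and names below are illustrative, not the series' code):

#define PLE_WINDOW_MIN          4096
#define PLE_WINDOW_MAX          (16 * 4096)

struct vcpu_ple {                       /* hypothetical per-vcpu state */
        unsigned int ple_window;
};

/*
 * Both helpers run only in the owning vcpu's thread, so plain loads and
 * stores are enough; a per-VM value shared by all vcpus would need
 * atomics or locking to be updated safely.
 */
static void ple_window_grow(struct vcpu_ple *v)         /* e.g. on a PLE exit */
{
        unsigned int w = v->ple_window * 2;

        v->ple_window = w > PLE_WINDOW_MAX ? PLE_WINDOW_MAX : w;
}

static void ple_window_shrink(struct vcpu_ple *v)       /* e.g. when contention eases */
{
        unsigned int w = v->ple_window / 2;

        v->ple_window = w < PLE_WINDOW_MIN ? PLE_WINDOW_MIN : w;
}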



Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Gleb Natapov
On Tue, Jul 09, 2013 at 02:41:30PM +0530, Raghavendra K T wrote:
 On 06/26/2013 11:24 PM, Raghavendra K T wrote:
 On 06/26/2013 09:41 PM, Gleb Natapov wrote:
 On Wed, Jun 26, 2013 at 07:10:21PM +0530, Raghavendra K T wrote:
 On 06/26/2013 06:22 PM, Gleb Natapov wrote:
 On Wed, Jun 26, 2013 at 01:37:45PM +0200, Andrew Jones wrote:
 On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote:
 On 06/25/2013 08:20 PM, Andrew Theurer wrote:
 On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote:
 This series replaces the existing paravirtualized spinlock
 mechanism
 with a paravirtualized ticketlock mechanism. The series provides
 implementation for both Xen and KVM.
 
 Changes in V9:
 - Changed spin_threshold to 32k to avoid excess halt exits that are
 causing undercommit degradation (after PLE handler
 improvement).
 - Added  kvm_irq_delivery_to_apic (suggested by Gleb)
 - Optimized halt exit path to use PLE handler
 
 V8 of PVspinlock was posted last year. After Avi's suggestions
 to look
 at PLE handler's improvements, various optimizations in PLE
 handling
 have been tried.
 
 Sorry for not posting this sooner.  I have tested the v9
 pv-ticketlock
 patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs.  I
 have
 tested these patches with and without PLE, as PLE is still not
 scalable
 with large VMs.
 
 
 Hi Andrew,
 
 Thanks for testing.
 
 System: x3850X5, 40 cores, 80 threads
 
 
 1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
 --
 Configuration          Total Throughput(MB/s)  Notes
 3.10-default-ple_on    22945                   5% CPU in host kernel, 2% spin_lock in guests
 3.10-default-ple_off   23184                   5% CPU in host kernel, 2% spin_lock in guests
 3.10-pvticket-ple_on   22895                   5% CPU in host kernel, 2% spin_lock in guests
 3.10-pvticket-ple_off  23051                   5% CPU in host kernel, 2% spin_lock in guests
 [all 1x results look good here]
 
 Yes. The 1x results look too close
 
 
 
 2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
 ---
 Configuration          Total Throughput  Notes
 3.10-default-ple_on     6287             55% CPU in host kernel, 17% spin_lock in guests
 3.10-default-ple_off    1849             2% CPU in host kernel, 95% spin_lock in guests
 3.10-pvticket-ple_on    6691             50% CPU in host kernel, 15% spin_lock in guests
 3.10-pvticket-ple_off  16464             8% CPU in host kernel, 33% spin_lock in guests
 
 I see 6.426% improvement with ple_on
 and 161.87% improvement with ple_off. I think this is a very good
 sign
   for the patches
 
 [PLE hinders pv-ticket improvements, but even with PLE off,
   we still off from ideal throughput (somewhere 2)]
 
 
 Okay, The ideal throughput you are referring is getting around
 at least
 80% of 1x throughput for over-commit. Yes we are still far away from
 there.
 
 
 1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
 --
 Configuration          Total Throughput  Notes
 3.10-default-ple_on    22736             6% CPU in host kernel, 3% spin_lock in guests
 3.10-default-ple_off   23377             5% CPU in host kernel, 3% spin_lock in guests
 3.10-pvticket-ple_on   22471             6% CPU in host kernel, 3% spin_lock in guests
 3.10-pvticket-ple_off  23445             5% CPU in host kernel, 3% spin_lock in guests
 [1x looking fine here]
 
 
 I see ple_off is little better here.
 
 
 2x over-commit with 20-vCPU VMs (8 VMs) all running dbench:
 --
 Configuration          Total Throughput  Notes
 3.10-default-ple_on     1965             70% CPU in host kernel, 34% spin_lock in guests
 3.10-default-ple_off     226             2% CPU in host kernel, 94% spin_lock in guests
 3.10-pvticket-ple_on    1942             70% CPU in host kernel, 35% spin_lock in guests
 3.10-pvticket-ple_off   8003             11% CPU in host kernel, 70% spin_lock in guests
 [quite bad all around, but pv-tickets with PLE off the best so far.
   Still quite a bit off from ideal throughput]
 
 This is again a remarkable improvement (307%).
 This motivates me to add a patch to disable ple when pvspinlock is
 on.
 probably we can add a hypercall that disables ple in kvm init patch.
 but only problem I see is what if the guests are mixed.
 
   (i.e one guest has pvspinlock support but other does not. Host
 supports pv)
 
 How about reintroducing the idea to create per-kvm ple_gap,ple_window
 state. We were headed down that road when considering a dynamic
 window at
 one point. Then 

Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Peter Zijlstra
On Wed, Jul 10, 2013 at 01:33:25PM +0300, Gleb Natapov wrote:

Here's an idea, trim the damn email ;-) -- not only directed at gleb.

  Ingo, Gleb,
  
  From the results perspective, Andrew Theurer, Vinod's test results are
  pro-pvspinlock.
  Could you please help me to know what will make it a mergeable
  candidate?.
  
 I need to spend more time reviewing it :) The problem with PV interfaces
 is that they are easy to add but hard to get rid of if better solution
 (HW or otherwise) appears.

How so? Just make sure the registration for the PV interface is optional; that
is, allow it to fail. A guest that fails the PV setup will either have to try
another PV interface or fall back to 'native'.
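As an illustration of "allow it to fail", a small user-space sketch of the guest-side check (assuming the KVM_FEATURE_PV_UNHALT CPUID bit in leaf 0x40000001 that this series defines; inside the kernel this is what a kvm_para_has_feature() check would do):

#include <cpuid.h>
#include <stdio.h>
#include <string.h>

#define KVM_CPUID_SIGNATURE   0x40000000
#define KVM_CPUID_FEATURES    0x40000001
#define KVM_FEATURE_PV_UNHALT 7         /* bit defined by the pv-ticketlock series */

static int kvm_pv_unhalt_available(void)
{
        unsigned int eax, ebx, ecx, edx;
        char sig[13] = {0};

        /* Hypervisor-present bit: CPUID.1:ECX[31]. */
        __cpuid(1, eax, ebx, ecx, edx);
        if (!(ecx & (1u << 31)))
                return 0;

        /* KVM advertises its signature at the 0x40000000 leaf. */
        __cpuid(KVM_CPUID_SIGNATURE, eax, ebx, ecx, edx);
        memcpy(sig, &ebx, 4);
        memcpy(sig + 4, &ecx, 4);
        memcpy(sig + 8, &edx, 4);
        if (strcmp(sig, "KVMKVMKVM"))
                return 0;

        /* PV feature bits live in the 0x40000001 leaf. */
        __cpuid(KVM_CPUID_FEATURES, eax, ebx, ecx, edx);
        return !!(eax & (1u << KVM_FEATURE_PV_UNHALT));
}

int main(void)
{
        if (kvm_pv_unhalt_available())
                printf("pv ticketlocks can be enabled\n");
        else
                printf("falling back to native spinlocks\n");
        return 0;
}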

  I agree that Jiannan's Preemptable Lock idea is promising and we could
  evaluate that  approach, and make the best one get into kernel and also
  will carry on discussion with Jiannan to improve that patch.
 That would be great. The work is stalled from what I can tell.

I absolutely hated that stuff because it wrecked the native code.


Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Gleb Natapov
On Wed, Jul 10, 2013 at 12:40:47PM +0200, Peter Zijlstra wrote:
 On Wed, Jul 10, 2013 at 01:33:25PM +0300, Gleb Natapov wrote:
 
 Here's an idea, trim the damn email ;-) -- not only directed at gleb.
 
Good idea.

   Ingo, Gleb,
   
   From the results perspective, Andrew Theurer, Vinod's test results are
   pro-pvspinlock.
   Could you please help me to know what will make it a mergeable
   candidate?.
   
  I need to spend more time reviewing it :) The problem with PV interfaces
  is that they are easy to add but hard to get rid of if better solution
  (HW or otherwise) appears.
 
 How so? Just make sure the registration for the PV interface is optional; that
 is, allow it to fail. A guest that fails the PV setup will either have to try
 another PV interface or fall back to 'native'.
 
We have to carry PV around for live migration purposes. PV interface
cannot disappear under a running guest.

   I agree that Jiannan's Preemptable Lock idea is promising and we could
   evaluate that  approach, and make the best one get into kernel and also
   will carry on discussion with Jiannan to improve that patch.
  That would be great. The work is stalled from what I can tell.
 
 I absolutely hated that stuff because it wrecked the native code.
Yes, the idea was to hide it from native code behind PV hooks.

--
Gleb.


Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Raghavendra K T

On 07/10/2013 04:03 PM, Gleb Natapov wrote:

On Tue, Jul 09, 2013 at 02:41:30PM +0530, Raghavendra K T wrote:

On 06/26/2013 11:24 PM, Raghavendra K T wrote:

On 06/26/2013 09:41 PM, Gleb Natapov wrote:

On Wed, Jun 26, 2013 at 07:10:21PM +0530, Raghavendra K T wrote:

On 06/26/2013 06:22 PM, Gleb Natapov wrote:

On Wed, Jun 26, 2013 at 01:37:45PM +0200, Andrew Jones wrote:

On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote:

On 06/25/2013 08:20 PM, Andrew Theurer wrote:

On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote:

This series replaces the existing paravirtualized spinlock
mechanism
with a paravirtualized ticketlock mechanism. The series provides
implementation for both Xen and KVM.

Changes in V9:
- Changed spin_threshold to 32k to avoid excess halt exits that are
causing undercommit degradation (after PLE handler
improvement).
- Added  kvm_irq_delivery_to_apic (suggested by Gleb)
- Optimized halt exit path to use PLE handler

V8 of PVspinlock was posted last year. After Avi's suggestions
to look
at PLE handler's improvements, various optimizations in PLE
handling
have been tried.


Sorry for not posting this sooner.  I have tested the v9
pv-ticketlock
patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs.  I
have
tested these patches with and without PLE, as PLE is still not
scalable
with large VMs.



Hi Andrew,

Thanks for testing.


System: x3850X5, 40 cores, 80 threads


1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
--
Configuration          Total Throughput(MB/s)  Notes
3.10-default-ple_on    22945                   5% CPU in host kernel, 2% spin_lock in guests
3.10-default-ple_off   23184                   5% CPU in host kernel, 2% spin_lock in guests
3.10-pvticket-ple_on   22895                   5% CPU in host kernel, 2% spin_lock in guests
3.10-pvticket-ple_off  23051                   5% CPU in host kernel, 2% spin_lock in guests
[all 1x results look good here]


Yes. The 1x results look too close




2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
---
Configuration          Total Throughput  Notes
3.10-default-ple_on     6287             55% CPU in host kernel, 17% spin_lock in guests
3.10-default-ple_off    1849             2% CPU in host kernel, 95% spin_lock in guests
3.10-pvticket-ple_on    6691             50% CPU in host kernel, 15% spin_lock in guests
3.10-pvticket-ple_off  16464             8% CPU in host kernel, 33% spin_lock in guests


I see 6.426% improvement with ple_on
and 161.87% improvement with ple_off. I think this is a very good
sign
  for the patches


[PLE hinders pv-ticket improvements, but even with PLE off,
  we still off from ideal throughput (somewhere 2)]



Okay, The ideal throughput you are referring is getting around
at least
80% of 1x throughput for over-commit. Yes we are still far away from
there.



1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
--
Configuration          Total Throughput  Notes
3.10-default-ple_on    22736             6% CPU in host kernel, 3% spin_lock in guests
3.10-default-ple_off   23377             5% CPU in host kernel, 3% spin_lock in guests
3.10-pvticket-ple_on   22471             6% CPU in host kernel, 3% spin_lock in guests
3.10-pvticket-ple_off  23445             5% CPU in host kernel, 3% spin_lock in guests
[1x looking fine here]



I see ple_off is little better here.



2x over-commit with 20-vCPU VMs (8 VMs) all running dbench:
--
Configuration          Total Throughput  Notes
3.10-default-ple_on     1965             70% CPU in host kernel, 34% spin_lock in guests
3.10-default-ple_off     226             2% CPU in host kernel, 94% spin_lock in guests
3.10-pvticket-ple_on    1942             70% CPU in host kernel, 35% spin_lock in guests
3.10-pvticket-ple_off   8003             11% CPU in host kernel, 70% spin_lock in guests
[quite bad all around, but pv-tickets with PLE off the best so far.
  Still quite a bit off from ideal throughput]


This is again a remarkable improvement (307%).
This motivates me to add a patch to disable ple when pvspinlock is
on.
probably we can add a hypercall that disables ple in kvm init patch.
but only problem I see is what if the guests are mixed.

  (i.e one guest has pvspinlock support but other does not. Host
supports pv)


How about reintroducing the idea to create per-kvm ple_gap,ple_window
state. We were headed down that road when considering a dynamic
window at
one point. Then you can just set a single guest's ple_gap to zero,
which
would lead to 

Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Raghavendra K T

On 07/10/2013 04:17 PM, Gleb Natapov wrote:

On Wed, Jul 10, 2013 at 12:40:47PM +0200, Peter Zijlstra wrote:

On Wed, Jul 10, 2013 at 01:33:25PM +0300, Gleb Natapov wrote:

Here's an idea, trim the damn email ;-) -- not only directed at gleb.


Good idea.


Ingo, Gleb,

 From the results perspective, Andrew Theurer, Vinod's test results are
pro-pvspinlock.
Could you please help me to know what will make it a mergeable
candidate?.


I need to spend more time reviewing it :) The problem with PV interfaces
is that they are easy to add but hard to get rid of if better solution
(HW or otherwise) appears.


How so? Just make sure the registration for the PV interface is optional; that
is, allow it to fail. A guest that fails the PV setup will either have to try
another PV interface or fall back to 'native'.


We have to carry PV around for live migration purposes. PV interface
cannot disappear under a running guest.



IIRC, the only requirement was that the running state of the vcpu be retained.
This was addressed by
[PATCH RFC V10 13/18] kvm : Fold pv_unhalt flag into GET_MP_STATE ioctl 
to aid migration.


I would have to know more if I missed something here.


I agree that Jiannan's Preemptable Lock idea is promising and we could
evaluate that  approach, and make the best one get into kernel and also
will carry on discussion with Jiannan to improve that patch.

That would be great. The work is stalled from what I can tell.


I absolutely hated that stuff because it wrecked the native code.

Yes, the idea was to hide it from native code behind PV hooks.





Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Gleb Natapov
On Wed, Jul 10, 2013 at 04:58:29PM +0530, Raghavendra K T wrote:
 On 07/10/2013 04:17 PM, Gleb Natapov wrote:
 On Wed, Jul 10, 2013 at 12:40:47PM +0200, Peter Zijlstra wrote:
 On Wed, Jul 10, 2013 at 01:33:25PM +0300, Gleb Natapov wrote:
 
 Here's an idea, trim the damn email ;-) -- not only directed at gleb.
 
 Good idea.
 
 Ingo, Gleb,
 
  From the results perspective, Andrew Theurer, Vinod's test results are
 pro-pvspinlock.
 Could you please help me to know what will make it a mergeable
 candidate?.
 
 I need to spend more time reviewing it :) The problem with PV interfaces
 is that they are easy to add but hard to get rid of if better solution
 (HW or otherwise) appears.
 
 How so? Just make sure the registration for the PV interface is optional; 
 that
 is, allow it to fail. A guest that fails the PV setup will either have to 
 try
 another PV interface or fall back to 'native'.
 
 We have to carry PV around for live migration purposes. PV interface
 cannot disappear under a running guest.
 
 
 IIRC, The only requirement was running state of the vcpu to be retained.
 This was addressed by
 [PATCH RFC V10 13/18] kvm : Fold pv_unhalt flag into GET_MP_STATE
 ioctl to aid migration.
 
 I would have to know more if I missed something here.
 
I was not talking about the state that has to be migrated, but about the
HV-guest interface that has to be preserved after migration.

--
Gleb.


Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Raghavendra K T

Dropping Stephen because of bounce.

On 07/10/2013 04:58 PM, Raghavendra K T wrote:

On 07/10/2013 04:17 PM, Gleb Natapov wrote:

On Wed, Jul 10, 2013 at 12:40:47PM +0200, Peter Zijlstra wrote:

On Wed, Jul 10, 2013 at 01:33:25PM +0300, Gleb Natapov wrote:

Here's an idea, trim the damn email ;-) -- not only directed at gleb.


Good idea.


Ingo, Gleb,

 From the results perspective, Andrew Theurer, Vinod's test results
are
pro-pvspinlock.
Could you please help me to know what will make it a mergeable
candidate?.


I need to spend more time reviewing it :) The problem with PV
interfaces
is that they are easy to add but hard to get rid of if better solution
(HW or otherwise) appears.


How so? Just make sure the registration for the PV interface is
optional; that
is, allow it to fail. A guest that fails the PV setup will either
have to try
another PV interface or fall back to 'native'.



Forgot to add: yes, currently pvspinlocks are not enabled by default, and
we also have the jump_label mechanism to enable them.
[...]
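For reference, a kernel-flavoured sketch of that jump_label gating (not the series' exact code: pv_unlock_kick() and the init/unlock helpers here are stand-ins, while the static_key and kvm_para_* calls are stock kernel APIs and KVM_FEATURE_PV_UNHALT is the bit this series defines):

#include <linux/jump_label.h>
#include <linux/kvm_para.h>

/* Off by default, so bare metal keeps the native fast path. */
static struct static_key pv_ticketlocks_enabled = STATIC_KEY_INIT_FALSE;

static void pv_unlock_kick(int cpu)
{
        /* Stand-in: the real series kicks the waiting vcpu via a hypercall. */
}

void pv_ticketlock_init(void)
{
        if (!kvm_para_available())
                return;                                   /* not running on KVM */
        if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
                return;                                   /* host lacks the PV bit */
        static_key_slow_inc(&pv_ticketlocks_enabled);     /* patch in the slow path */
}

static inline void ticket_unlock_slowpath(int next_cpu)
{
        /* Compiles to a patched no-op branch until the key is enabled at boot. */
        if (static_key_false(&pv_ticketlocks_enabled))
                pv_unlock_kick(next_cpu);
}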



Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Gleb Natapov
On Wed, Jul 10, 2013 at 04:54:12PM +0530, Raghavendra K T wrote:
 Ingo, Gleb,
 
  From the results perspective, Andrew Theurer, Vinod's test results are
 pro-pvspinlock.
 Could you please help me to know what will make it a mergeable
 candidate?.
 
 I need to spend more time reviewing it :) The problem with PV interfaces
 is that they are easy to add but hard to get rid of if better solution
 (HW or otherwise) appears.
 
 In fact Avi had acked the whole V8 series, but it was delayed to see how
 the PLE improvements would affect it.
 
I see that Ingo was happy with it too.

 The only addition from that series has been
 1. tuning the SPIN_THRESHOLD to 32k (from 2k)
 and
 2. the halt handler now calls vcpu_on_spin to take advantage of
 PLE improvements. (this can also go as an independent patch into
 kvm)
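A rough sketch of item 2 above (not the series' actual diff; kvm_vcpu_on_spin() and kvm_emulate_halt() are the existing KVM helpers, and the wrapper name is made up):

/*
 * Illustrative wrapper: before blocking on a halt that came from the pv
 * ticketlock slow path, reuse the PLE handler's directed yield so the
 * remaining timeslice can go to a preempted vcpu of the same VM
 * (likely the lock holder).
 */
static int kvm_pv_halt(struct kvm_vcpu *vcpu)
{
        kvm_vcpu_on_spin(vcpu);         /* same boosting logic as the PLE exit */
        return kvm_emulate_halt(vcpu);  /* then the normal halt/block path */
}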
 
 The rationale for making SPIN_THRESHOLD 32k needs a big explanation.
 Before the PLE improvements, as you know,
 the kvm undercommit scenario was much worse in ple-enabled cases
 (compared to ple-disabled cases).
 pvspinlock patches behaved equally bad in undercommit. Both had
 similar reason so at the end there was no degradation w.r.t base.
 
 The reason for bad performance in PLE case was unneeded vcpu
 iteration in ple handler resulting in high yield_to calls and double
 run queue locks.
 With pvspinlock applied, same villain role was played by excessive
 halt exits.
 
 But after ple handler improved, we needed to throttle unnecessary halts
 in undercommit for pvspinlock to be on par with 1x result.
 
Makes sense. I will review it ASAP. BTW the latest version is V10, right?

--
Gleb.


Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Raghavendra K T

On 07/10/2013 05:11 PM, Gleb Natapov wrote:

On Wed, Jul 10, 2013 at 04:54:12PM +0530, Raghavendra K T wrote:

Ingo, Gleb,

 From the results perspective, Andrew Theurer, Vinod's test results are
pro-pvspinlock.
Could you please help me to know what will make it a mergeable
candidate?.


I need to spend more time reviewing it :) The problem with PV interfaces
is that they are easy to add but hard to get rid of if better solution
(HW or otherwise) appears.


Infact Avi had acked the whole V8 series, but delayed for seeing how
PLE improvement would affect it.


I see that Ingo was happy with it too.


The only addition from that series has been
1. tuning the SPIN_THRESHOLD to 32k (from 2k)
and
2. the halt handler now calls vcpu_on_spin to take the advantage of
PLE improvements. (this can also go as an independent patch into
kvm)

The rationale for making SPIN_THRESHOLD 32k needs big explanation.
Before PLE improvements, as you know,
kvm undercommit scenario was very worse in ple enabled cases.
(compared to ple disabled cases).
pvspinlock patches behaved equally bad in undercommit. Both had
similar reason so at the end there was no degradation w.r.t base.

The reason for bad performance in PLE case was unneeded vcpu
iteration in ple handler resulting in high yield_to calls and double
run queue locks.
With pvspinlock applied, same villain role was played by excessive
halt exits.

But after ple handler improved, we needed to throttle unnecessary halts
in undercommit for pvspinlock to be on par with 1x result.


Make sense. I will review it ASAP. BTW the latest version is V10 right?



Yes. Thank you.



Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Konrad Rzeszutek Wilk
On Wed, Jul 10, 2013 at 01:47:17PM +0300, Gleb Natapov wrote:
 On Wed, Jul 10, 2013 at 12:40:47PM +0200, Peter Zijlstra wrote:
  On Wed, Jul 10, 2013 at 01:33:25PM +0300, Gleb Natapov wrote:
  
  Here's an idea, trim the damn email ;-) -- not only directed at gleb.
  
 Good idea.
 
Ingo, Gleb,

From the results perspective, Andrew Theurer, Vinod's test results are
pro-pvspinlock.
Could you please help me to know what will make it a mergeable
candidate?.

   I need to spend more time reviewing it :) The problem with PV interfaces
   is that they are easy to add but hard to get rid of if better solution
   (HW or otherwise) appears.
  
  How so? Just make sure the registration for the PV interface is optional; 
  that
  is, allow it to fail. A guest that fails the PV setup will either have to 
  try
  another PV interface or fall back to 'native'.
  
 We have to carry PV around for live migration purposes. PV interface
 cannot disappear under a running guest.

Why can't it? This is the same as handling say XSAVE operations. Some hosts
might have it - some might not. It is the job of the toolstack to make sure
to not migrate to the hosts which don't have it. Or bound the guest to the
lowest interface (so don't enable the PV interface if the other hosts in the
cluster can't support this flag)?

 
I agree that Jiannan's Preemptable Lock idea is promising and we could
evaluate that  approach, and make the best one get into kernel and also
will carry on discussion with Jiannan to improve that patch.
   That would be great. The work is stalled from what I can tell.
  
  I absolutely hated that stuff because it wrecked the native code.
 Yes, the idea was to hide it from native code behind PV hooks.
 
 --
   Gleb.


Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Gleb Natapov
On Wed, Jul 10, 2013 at 11:03:15AM -0400, Konrad Rzeszutek Wilk wrote:
 On Wed, Jul 10, 2013 at 01:47:17PM +0300, Gleb Natapov wrote:
  On Wed, Jul 10, 2013 at 12:40:47PM +0200, Peter Zijlstra wrote:
   On Wed, Jul 10, 2013 at 01:33:25PM +0300, Gleb Natapov wrote:
   
   Here's an idea, trim the damn email ;-) -- not only directed at gleb.
   
  Good idea.
  
 Ingo, Gleb,
 
 From the results perspective, Andrew Theurer, Vinod's test results are
 pro-pvspinlock.
 Could you please help me to know what will make it a mergeable
 candidate?.
 
I need to spend more time reviewing it :) The problem with PV interfaces
is that they are easy to add but hard to get rid of if better solution
(HW or otherwise) appears.
   
   How so? Just make sure the registration for the PV interface is optional; 
   that
   is, allow it to fail. A guest that fails the PV setup will either have to 
   try
   another PV interface or fall back to 'native'.
   
  We have to carry PV around for live migration purposes. PV interface
  cannot disappear under a running guest.
 
 Why can't it? This is the same as handling say XSAVE operations. Some hosts
 might have it - some might not. It is the job of the toolstack to make sure
 to not migrate to the hosts which don't have it. Or bound the guest to the
 lowest interface (so don't enable the PV interface if the other hosts in the
 cluster can't support this flag)?
XSAVE is a HW feature and it is not going to disappear under you after a software
upgrade. Upgrading the kernel on part of your hosts and no longer being
able to migrate to them is not something people who use live migration
expect. In practice it means that updating all hosts in a datacenter to a
newer kernel is no longer possible without rebooting VMs.

--
Gleb.


Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-10 Thread Konrad Rzeszutek Wilk
Gleb Natapov g...@redhat.com wrote:
On Wed, Jul 10, 2013 at 11:03:15AM -0400, Konrad Rzeszutek Wilk wrote:
 On Wed, Jul 10, 2013 at 01:47:17PM +0300, Gleb Natapov wrote:
  On Wed, Jul 10, 2013 at 12:40:47PM +0200, Peter Zijlstra wrote:
   On Wed, Jul 10, 2013 at 01:33:25PM +0300, Gleb Natapov wrote:
   
   Here's an idea, trim the damn email ;-) -- not only directed at
gleb.
   
  Good idea.
  
 Ingo, Gleb,
 
 From the results perspective, Andrew Theurer, Vinod's test
results are
 pro-pvspinlock.
 Could you please help me to know what will make it a
mergeable
 candidate?.
 
I need to spend more time reviewing it :) The problem with PV
interfaces
is that they are easy to add but hard to get rid of if better
solution
(HW or otherwise) appears.
   
   How so? Just make sure the registration for the PV interface is
optional; that
   is, allow it to fail. A guest that fails the PV setup will either
have to try
   another PV interface or fall back to 'native'.
   
  We have to carry PV around for live migration purposes. PV
interface
  cannot disappear under a running guest.
 
 Why can't it? This is the same as handling say XSAVE operations. Some
hosts
 might have it - some might not. It is the job of the toolstack to
make sure
 to not migrate to the hosts which don't have it. Or bound the guest
to the
 lowest interface (so don't enable the PV interface if the other hosts
in the
 cluster can't support this flag)?
XSAVE is HW feature and it is not going disappear under you after
software
upgrade. Upgrading kernel on part of your hosts and no longer been
able to migrate to them is not something people who use live migration
expect. In practise it means that updating all hosts in a datacenter to
newer kernel is no longer possible without rebooting VMs.

--
   Gleb.

I see. Perhaps then, if the hardware becomes much better at this, another PV 
interface can be provided which will use the static_key to turn off the PV spin 
lock and use the bare-metal version (or perhaps some form of super elision 
locks). That does mean the host has to do something when this PV interface is 
invoked for the older guests. 

Anyhow, that said, I think the benefits are pretty neat right now and are of much 
benefit, and worrying about whether the hardware vendors will provide something new 
is not benefiting users. What perhaps then needs to be addressed is how to obsolete 
this mechanism if the hardware becomes superb? 
-- 
Sent from my Android phone. Please excuse my brevity.


Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-09 Thread Raghavendra K T

On 06/26/2013 11:24 PM, Raghavendra K T wrote:

On 06/26/2013 09:41 PM, Gleb Natapov wrote:

On Wed, Jun 26, 2013 at 07:10:21PM +0530, Raghavendra K T wrote:

On 06/26/2013 06:22 PM, Gleb Natapov wrote:

On Wed, Jun 26, 2013 at 01:37:45PM +0200, Andrew Jones wrote:

On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote:

On 06/25/2013 08:20 PM, Andrew Theurer wrote:

On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote:

This series replaces the existing paravirtualized spinlock
mechanism
with a paravirtualized ticketlock mechanism. The series provides
implementation for both Xen and KVM.

Changes in V9:
- Changed spin_threshold to 32k to avoid excess halt exits that are
causing undercommit degradation (after PLE handler
improvement).
- Added  kvm_irq_delivery_to_apic (suggested by Gleb)
- Optimized halt exit path to use PLE handler

V8 of PVspinlock was posted last year. After Avi's suggestions
to look
at PLE handler's improvements, various optimizations in PLE
handling
have been tried.


Sorry for not posting this sooner.  I have tested the v9
pv-ticketlock
patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs.  I
have
tested these patches with and without PLE, as PLE is still not
scalable
with large VMs.



Hi Andrew,

Thanks for testing.


System: x3850X5, 40 cores, 80 threads


1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
--
Configuration          Total Throughput(MB/s)  Notes
3.10-default-ple_on    22945                   5% CPU in host kernel, 2% spin_lock in guests
3.10-default-ple_off   23184                   5% CPU in host kernel, 2% spin_lock in guests
3.10-pvticket-ple_on   22895                   5% CPU in host kernel, 2% spin_lock in guests
3.10-pvticket-ple_off  23051                   5% CPU in host kernel, 2% spin_lock in guests
[all 1x results look good here]


Yes. The 1x results look too close




2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
---
Configuration          Total Throughput  Notes
3.10-default-ple_on     6287             55% CPU in host kernel, 17% spin_lock in guests
3.10-default-ple_off    1849             2% CPU in host kernel, 95% spin_lock in guests
3.10-pvticket-ple_on    6691             50% CPU in host kernel, 15% spin_lock in guests
3.10-pvticket-ple_off  16464             8% CPU in host kernel, 33% spin_lock in guests


I see 6.426% improvement with ple_on
and 161.87% improvement with ple_off. I think this is a very good
sign
  for the patches


[PLE hinders pv-ticket improvements, but even with PLE off,
  we still off from ideal throughput (somewhere 2)]



Okay, The ideal throughput you are referring is getting around
at least
80% of 1x throughput for over-commit. Yes we are still far away from
there.



1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
--
Configuration          Total Throughput  Notes
3.10-default-ple_on    22736             6% CPU in host kernel, 3% spin_lock in guests
3.10-default-ple_off   23377             5% CPU in host kernel, 3% spin_lock in guests
3.10-pvticket-ple_on   22471             6% CPU in host kernel, 3% spin_lock in guests
3.10-pvticket-ple_off  23445             5% CPU in host kernel, 3% spin_lock in guests
[1x looking fine here]



I see ple_off is little better here.



2x over-commit with 20-vCPU VMs (8 VMs) all running dbench:
--
Configuration          Total Throughput  Notes
3.10-default-ple_on     1965             70% CPU in host kernel, 34% spin_lock in guests
3.10-default-ple_off     226             2% CPU in host kernel, 94% spin_lock in guests
3.10-pvticket-ple_on    1942             70% CPU in host kernel, 35% spin_lock in guests
3.10-pvticket-ple_off   8003             11% CPU in host kernel, 70% spin_lock in guests
[quite bad all around, but pv-tickets with PLE off the best so far.
  Still quite a bit off from ideal throughput]


This is again a remarkable improvement (307%).
This motivates me to add a patch to disable ple when pvspinlock is
on.
probably we can add a hypercall that disables ple in kvm init patch.
but only problem I see is what if the guests are mixed.

  (i.e one guest has pvspinlock support but other does not. Host
supports pv)


How about reintroducing the idea to create per-kvm ple_gap,ple_window
state. We were headed down that road when considering a dynamic
window at
one point. Then you can just set a single guest's ple_gap to zero,
which
would lead to PLE being disabled for that guest. We could also revisit
the dynamic window then.


Can be done, but lets 
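A sketch of that per-kvm idea (this is proposed, not existing, code: the kvm_ple_state struct is hypothetical, while SECONDARY_EXEC_PAUSE_LOOP_EXITING, PLE_GAP and PLE_WINDOW are the VMX control/field names vmx.c already uses for the global parameters):

struct kvm_ple_state {                  /* hypothetical per-VM state */
        unsigned int ple_gap;           /* 0 => PLE exiting off for this guest */
        unsigned int ple_window;
};

/* Mirrors what vmx.c already does for the global ple_gap module parameter. */
static u32 ple_adjust_secondary_exec(struct kvm_ple_state *ple, u32 exec_control)
{
        if (!ple->ple_gap)
                exec_control &= ~SECONDARY_EXEC_PAUSE_LOOP_EXITING;
        return exec_control;
}

static void ple_setup_vcpu(struct kvm_ple_state *ple)
{
        if (!ple->ple_gap)
                return;                 /* nothing to program, PLE disabled */
        vmcs_write32(PLE_GAP, ple->ple_gap);
        vmcs_write32(PLE_WINDOW, ple->ple_window);
}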

Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-07-01 Thread Raghavendra K T

On 06/26/2013 09:26 PM, Andrew Theurer wrote:

On Wed, 2013-06-26 at 15:52 +0300, Gleb Natapov wrote:

On Wed, Jun 26, 2013 at 01:37:45PM +0200, Andrew Jones wrote:

On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote:

On 06/25/2013 08:20 PM, Andrew Theurer wrote:

On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote:

This series replaces the existing paravirtualized spinlock mechanism
with a paravirtualized ticketlock mechanism. The series provides
implementation for both Xen and KVM.

Changes in V9:
- Changed spin_threshold to 32k to avoid excess halt exits that are
causing undercommit degradation (after PLE handler improvement).
- Added  kvm_irq_delivery_to_apic (suggested by Gleb)
- Optimized halt exit path to use PLE handler

V8 of PVspinlock was posted last year. After Avi's suggestions to look
at PLE handler's improvements, various optimizations in PLE handling
have been tried.


Sorry for not posting this sooner.  I have tested the v9 pv-ticketlock
patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs.  I have
tested these patches with and without PLE, as PLE is still not scalable
with large VMs.



Hi Andrew,

Thanks for testing.


System: x3850X5, 40 cores, 80 threads


1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
--
Total
Configuration   Throughput(MB/s)   Notes

3.10-default-ple_on 22945   5% CPU in host 
kernel, 2% spin_lock in guests
3.10-default-ple_off23184   5% CPU in host 
kernel, 2% spin_lock in guests
3.10-pvticket-ple_on22895   5% CPU in host 
kernel, 2% spin_lock in guests
3.10-pvticket-ple_off   23051   5% CPU in host 
kernel, 2% spin_lock in guests
[all 1x results look good here]


Yes. The 1x results look too close




2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
---
Total
Configuration   Throughput  Notes

3.10-default-ple_on  6287   55% CPU  host 
kernel, 17% spin_lock in guests
3.10-default-ple_off 1849   2% CPU in host 
kernel, 95% spin_lock in guests
3.10-pvticket-ple_on 6691   50% CPU in host 
kernel, 15% spin_lock in guests
3.10-pvticket-ple_off   16464   8% CPU in host 
kernel, 33% spin_lock in guests


I see 6.426% improvement with ple_on
and 161.87% improvement with ple_off. I think this is a very good sign
  for the patches


[PLE hinders pv-ticket improvements, but even with PLE off,
  we still off from ideal throughput (somewhere 2)]



Okay, The ideal throughput you are referring is getting around at least
80% of 1x throughput for over-commit. Yes we are still far away from
there.



1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
--
Total
Configuration   Throughput  Notes

3.10-default-ple_on 22736   6% CPU in host 
kernel, 3% spin_lock in guests
3.10-default-ple_off23377   5% CPU in host 
kernel, 3% spin_lock in guests
3.10-pvticket-ple_on22471   6% CPU in host 
kernel, 3% spin_lock in guests
3.10-pvticket-ple_off   23445   5% CPU in host 
kernel, 3% spin_lock in guests
[1x looking fine here]



I see ple_off is little better here.



2x over-commit with 20-vCPU VMs (8 VMs) all running dbench:
--
Total
Configuration   Throughput  Notes

3.10-default-ple_on  1965   70% CPU in host 
kernel, 34% spin_lock in guests 
3.10-default-ple_off  226   2% CPU in host 
kernel, 94% spin_lock in guests
3.10-pvticket-ple_on 1942   70% CPU in host 
kernel, 35% spin_lock in guests
3.10-pvticket-ple_off8003   11% CPU in host 
kernel, 70% spin_lock in guests
[quite bad all around, but pv-tickets with PLE off the best so far.
  Still quite a bit off from ideal throughput]


This is again a remarkable improvement (307%).
This motivates me to add a patch to disable ple when pvspinlock is on.
probably we can add a hypercall that disables ple in kvm init patch.
but only problem I see is what if the guests are mixed.

  (i.e one guest has pvspinlock support but other does not. Host
supports pv)


How about 

Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-26 Thread Raghavendra K T

On 06/25/2013 08:20 PM, Andrew Theurer wrote:

On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote:

This series replaces the existing paravirtualized spinlock mechanism
with a paravirtualized ticketlock mechanism. The series provides
implementation for both Xen and KVM.

Changes in V9:
- Changed spin_threshold to 32k to avoid excess halt exits that are
causing undercommit degradation (after PLE handler improvement).
- Added  kvm_irq_delivery_to_apic (suggested by Gleb)
- Optimized halt exit path to use PLE handler

V8 of PVspinlock was posted last year. After Avi's suggestions to look
at PLE handler's improvements, various optimizations in PLE handling
have been tried.


Sorry for not posting this sooner.  I have tested the v9 pv-ticketlock
patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs.  I have
tested these patches with and without PLE, as PLE is still not scalable
with large VMs.



Hi Andrew,

Thanks for testing.


System: x3850X5, 40 cores, 80 threads


1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
--
Total
Configuration   Throughput(MB/s)   Notes

3.10-default-ple_on 22945   5% CPU in host 
kernel, 2% spin_lock in guests
3.10-default-ple_off23184   5% CPU in host 
kernel, 2% spin_lock in guests
3.10-pvticket-ple_on22895   5% CPU in host 
kernel, 2% spin_lock in guests
3.10-pvticket-ple_off   23051   5% CPU in host 
kernel, 2% spin_lock in guests
[all 1x results look good here]


Yes. The 1x results look too close




2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
---
Total
Configuration   Throughput  Notes

3.10-default-ple_on  6287   55% CPU  host 
kernel, 17% spin_lock in guests
3.10-default-ple_off 1849   2% CPU in host 
kernel, 95% spin_lock in guests
3.10-pvticket-ple_on 6691   50% CPU in host 
kernel, 15% spin_lock in guests
3.10-pvticket-ple_off   16464   8% CPU in host 
kernel, 33% spin_lock in guests


I see 6.426% improvement with ple_on
and 161.87% improvement with ple_off. I think this is a very good sign
 for the patches


[PLE hinders pv-ticket improvements, but even with PLE off,
  we still off from ideal throughput (somewhere 2)]



Okay, The ideal throughput you are referring is getting around at least
80% of 1x throughput for over-commit. Yes we are still far away from
there.



1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
--
Total
Configuration   Throughput  Notes

3.10-default-ple_on 22736   6% CPU in host 
kernel, 3% spin_lock in guests
3.10-default-ple_off23377   5% CPU in host 
kernel, 3% spin_lock in guests
3.10-pvticket-ple_on22471   6% CPU in host 
kernel, 3% spin_lock in guests
3.10-pvticket-ple_off   23445   5% CPU in host 
kernel, 3% spin_lock in guests
[1x looking fine here]



I see ple_off is little better here.



2x over-commit with 20-vCPU VMs (8 VMs) all running dbench:
--
Total
Configuration   Throughput  Notes

3.10-default-ple_on  1965   70% CPU in host 
kernel, 34% spin_lock in guests 
3.10-default-ple_off  226   2% CPU in host 
kernel, 94% spin_lock in guests
3.10-pvticket-ple_on 1942   70% CPU in host 
kernel, 35% spin_lock in guests
3.10-pvticket-ple_off8003   11% CPU in host 
kernel, 70% spin_lock in guests
[quite bad all around, but pv-tickets with PLE off the best so far.
  Still quite a bit off from ideal throughput]


This is again a remarkable improvement (307%).
This motivates me to add a patch to disable ple when pvspinlock is on.
probably we can add a hypercall that disables ple in kvm init patch.
but only problem I see is what if the guests are mixed.

 (i.e one guest has pvspinlock support but other does not. Host
supports pv)

/me thinks



In summary, I would state that the pv-ticket is an overall win, but the
current PLE handler tends to get in the way on these larger guests.

-Andrew




Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-26 Thread Andrew Jones
On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote:
 On 06/25/2013 08:20 PM, Andrew Theurer wrote:
 On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote:
 This series replaces the existing paravirtualized spinlock mechanism
 with a paravirtualized ticketlock mechanism. The series provides
 implementation for both Xen and KVM.
 
 Changes in V9:
 - Changed spin_threshold to 32k to avoid excess halt exits that are
 causing undercommit degradation (after PLE handler improvement).
 - Added  kvm_irq_delivery_to_apic (suggested by Gleb)
 - Optimized halt exit path to use PLE handler
 
 V8 of PVspinlock was posted last year. After Avi's suggestions to look
 at PLE handler's improvements, various optimizations in PLE handling
 have been tried.
 
 Sorry for not posting this sooner.  I have tested the v9 pv-ticketlock
 patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs.  I have
 tested these patches with and without PLE, as PLE is still not scalable
 with large VMs.
 
 
 Hi Andrew,
 
 Thanks for testing.
 
 System: x3850X5, 40 cores, 80 threads
 
 
 1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
 --
  Total
 Configuration        Throughput(MB/s)    Notes
 
 3.10-default-ple_on  22945   5% CPU in host 
 kernel, 2% spin_lock in guests
 3.10-default-ple_off 23184   5% CPU in host 
 kernel, 2% spin_lock in guests
 3.10-pvticket-ple_on 22895   5% CPU in host 
 kernel, 2% spin_lock in guests
 3.10-pvticket-ple_off   23051   5% CPU 
 in host kernel, 2% spin_lock in guests
 [all 1x results look good here]
 
 Yes. The 1x results look too close
 
 
 
 2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
 ---
  Total
 Configuration        Throughput  Notes
 
 3.10-default-ple_on   6287   55% CPU  host 
 kernel, 17% spin_lock in guests
 3.10-default-ple_off  1849   2% CPU in host 
 kernel, 95% spin_lock in guests
 3.10-pvticket-ple_on  6691   50% CPU in host 
 kernel, 15% spin_lock in guests
 3.10-pvticket-ple_off   16464   8% CPU 
 in host kernel, 33% spin_lock in guests
 
 I see 6.426% improvement with ple_on
 and 161.87% improvement with ple_off. I think this is a very good sign
  for the patches
 
 [PLE hinders pv-ticket improvements, but even with PLE off,
   we still off from ideal throughput (somewhere 2)]
 
 
 Okay, The ideal throughput you are referring is getting around at least
 80% of 1x throughput for over-commit. Yes we are still far away from
 there.
 
 
 1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
 --
  Total
 Configuration        Throughput  Notes
 
 3.10-default-ple_on  22736   6% CPU in host 
 kernel, 3% spin_lock in guests
 3.10-default-ple_off 23377   5% CPU in host 
 kernel, 3% spin_lock in guests
 3.10-pvticket-ple_on 22471   6% CPU in host 
 kernel, 3% spin_lock in guests
 3.10-pvticket-ple_off   23445   5% CPU 
 in host kernel, 3% spin_lock in guests
 [1x looking fine here]
 
 
 I see ple_off is little better here.
 
 
 2x over-commit with 20-vCPU VMs (8 VMs) all running dbench:
 --
  Total
 Configuration        Throughput  Notes
 
 3.10-default-ple_on   1965   70% CPU in host 
 kernel, 34% spin_lock in guests 
 3.10-default-ple_off   226   2% CPU in host 
 kernel, 94% spin_lock in guests
 3.10-pvticket-ple_on  1942   70% CPU in host 
 kernel, 35% spin_lock in guests
 3.10-pvticket-ple_off 8003   11% CPU 
 in host kernel, 70% spin_lock in guests
 [quite bad all around, but pv-tickets with PLE off the best so far.
   Still quite a bit off from ideal throughput]
 
 This is again a remarkable improvement (307%).
 This motivates me to add a patch to disable ple when pvspinlock is on.
 probably we can add a hypercall that disables ple in kvm init patch.
 but only problem I see is what if the guests are mixed.
 
  (i.e one guest has pvspinlock support but other does not. Host
 supports pv)

How about reintroducing the idea to create per-kvm ple_gap,ple_window
state. We were headed 

Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-26 Thread Gleb Natapov
On Wed, Jun 26, 2013 at 01:37:45PM +0200, Andrew Jones wrote:
 On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote:
  On 06/25/2013 08:20 PM, Andrew Theurer wrote:
  On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote:
  This series replaces the existing paravirtualized spinlock mechanism
  with a paravirtualized ticketlock mechanism. The series provides
  implementation for both Xen and KVM.
  
  Changes in V9:
  - Changed spin_threshold to 32k to avoid excess halt exits that are
  causing undercommit degradation (after PLE handler improvement).
  - Added  kvm_irq_delivery_to_apic (suggested by Gleb)
  - Optimized halt exit path to use PLE handler
  
  V8 of PVspinlock was posted last year. After Avi's suggestions to look
  at PLE handler's improvements, various optimizations in PLE handling
  have been tried.
  
  Sorry for not posting this sooner.  I have tested the v9 pv-ticketlock
  patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs.  I have
  tested these patches with and without PLE, as PLE is still not scalable
  with large VMs.
  
  
  Hi Andrew,
  
  Thanks for testing.
  
  System: x3850X5, 40 cores, 80 threads
  
  
  1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
  --
 Total
  Configuration          Throughput(MB/s)    Notes
  
  3.10-default-ple_on    22945   5% CPU 
  in host kernel, 2% spin_lock in guests
  3.10-default-ple_off   23184   5% CPU 
  in host kernel, 2% spin_lock in guests
  3.10-pvticket-ple_on   22895   5% CPU 
  in host kernel, 2% spin_lock in guests
  3.10-pvticket-ple_off  23051   5% CPU 
  in host kernel, 2% spin_lock in guests
  [all 1x results look good here]
  
  Yes. The 1x results look too close
  
  
  
  2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
  ---
 Total
  Configuration  Throughput  Notes
  
  3.10-default-ple_on 6287   55% CPU 
   host kernel, 17% spin_lock in guests
  3.10-default-ple_off    1849   2% CPU 
  in host kernel, 95% spin_lock in guests
  3.10-pvticket-ple_on    6691   50% CPU 
  in host kernel, 15% spin_lock in guests
  3.10-pvticket-ple_off  16464   8% CPU 
  in host kernel, 33% spin_lock in guests
  
  I see 6.426% improvement with ple_on
  and 161.87% improvement with ple_off. I think this is a very good sign
   for the patches
  
  [PLE hinders pv-ticket improvements, but even with PLE off,
we still off from ideal throughput (somewhere 2)]
  
  
  Okay, The ideal throughput you are referring is getting around at least
  80% of 1x throughput for over-commit. Yes we are still far away from
  there.
  
  
  1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
  --
 Total
  Configuration  Throughput  Notes
  
  3.10-default-ple_on    22736   6% CPU 
  in host kernel, 3% spin_lock in guests
  3.10-default-ple_off   23377   5% CPU 
  in host kernel, 3% spin_lock in guests
  3.10-pvticket-ple_on   22471   6% CPU 
  in host kernel, 3% spin_lock in guests
  3.10-pvticket-ple_off  23445   5% CPU 
  in host kernel, 3% spin_lock in guests
  [1x looking fine here]
  
  
  I see ple_off is little better here.
  
  
2x over-commit with 20-vCPU VMs (8 VMs) all running dbench:
-----------------------------------------------------------
                                   Total
Configuration               Throughput       Notes

3.10-default-ple_on             1965         70% CPU in host kernel, 34% spin_lock in guests
3.10-default-ple_off             226          2% CPU in host kernel, 94% spin_lock in guests
3.10-pvticket-ple_on            1942         70% CPU in host kernel, 35% spin_lock in guests
3.10-pvticket-ple_off           8003         11% CPU in host kernel, 70% spin_lock in guests
[quite bad all around, but pv-tickets with PLE off the best so far.
 Still quite a bit off from ideal throughput]
  
  This is again a remarkable improvement (307%).
  This motivates me to add a patch to disable ple when pvspinlock is on.
  probably we can add a hypercall that disables ple in kvm init patch.
  but 

Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-26 Thread Raghavendra K T

On 06/26/2013 06:22 PM, Gleb Natapov wrote:

On Wed, Jun 26, 2013 at 01:37:45PM +0200, Andrew Jones wrote:

On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote:

On 06/25/2013 08:20 PM, Andrew Theurer wrote:

On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote:

This series replaces the existing paravirtualized spinlock mechanism
with a paravirtualized ticketlock mechanism. The series provides
implementation for both Xen and KVM.

Changes in V9:
- Changed spin_threshold to 32k to avoid excess halt exits that are
causing undercommit degradation (after PLE handler improvement).
- Added  kvm_irq_delivery_to_apic (suggested by Gleb)
- Optimized halt exit path to use PLE handler

V8 of PVspinlock was posted last year. After Avi's suggestions to look
at PLE handler's improvements, various optimizations in PLE handling
have been tried.


Sorry for not posting this sooner.  I have tested the v9 pv-ticketlock
patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs.  I have
tested these patches with and without PLE, as PLE is still not scalable
with large VMs.



Hi Andrew,

Thanks for testing.


System: x3850X5, 40 cores, 80 threads


1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
--
Total
Configuration   Throughput(MB/s)Notes

3.10-default-ple_on 22945   5% CPU in host 
kernel, 2% spin_lock in guests
3.10-default-ple_off23184   5% CPU in host 
kernel, 2% spin_lock in guests
3.10-pvticket-ple_on22895   5% CPU in host 
kernel, 2% spin_lock in guests
3.10-pvticket-ple_off   23051   5% CPU in host 
kernel, 2% spin_lock in guests
[all 1x results look good here]


Yes. The 1x results look too close




2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
---
Total
Configuration   Throughput  Notes

3.10-default-ple_on  6287   55% CPU  host 
kernel, 17% spin_lock in guests
3.10-default-ple_off 1849   2% CPU in host 
kernel, 95% spin_lock in guests
3.10-pvticket-ple_on 6691   50% CPU in host 
kernel, 15% spin_lock in guests
3.10-pvticket-ple_off   16464   8% CPU in host 
kernel, 33% spin_lock in guests


I see 6.426% improvement with ple_on
and 161.87% improvement with ple_off. I think this is a very good sign
  for the patches


[PLE hinders pv-ticket improvements, but even with PLE off,
  we still off from ideal throughput (somewhere 2)]



Okay, The ideal throughput you are referring is getting around atleast
80% of 1x throughput for over-commit. Yes we are still far away from
there.



1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
--
Total
Configuration   Throughput  Notes

3.10-default-ple_on 22736   6% CPU in host 
kernel, 3% spin_lock in guests
3.10-default-ple_off23377   5% CPU in host 
kernel, 3% spin_lock in guests
3.10-pvticket-ple_on22471   6% CPU in host 
kernel, 3% spin_lock in guests
3.10-pvticket-ple_off   23445   5% CPU in host 
kernel, 3% spin_lock in guests
[1x looking fine here]



I see ple_off is little better here.



2x over-commit with 20-vCPU VMs (8 VMs) all running dbench:
--
Total
Configuration   Throughput  Notes

3.10-default-ple_on  1965   70% CPU in host 
kernel, 34% spin_lock in guests 
3.10-default-ple_off  226   2% CPU in host 
kernel, 94% spin_lock in guests
3.10-pvticket-ple_on 1942   70% CPU in host 
kernel, 35% spin_lock in guests
3.10-pvticket-ple_off8003   11% CPU in host 
kernel, 70% spin_lock in guests
[quite bad all around, but pv-tickets with PLE off the best so far.
  Still quite a bit off from ideal throughput]


This is again a remarkable improvement (307%).
This motivates me to add a patch to disable ple when pvspinlock is on.
probably we can add a hypercall that disables ple in kvm init patch.
but only problem I see is what if the guests are mixed.

  (i.e one guest has pvspinlock support but other does not. Host
supports pv)


How about reintroducing the idea to create per-kvm ple_gap,ple_window

Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-26 Thread Konrad Rzeszutek Wilk
On Wed, Jun 26, 2013 at 03:52:40PM +0300, Gleb Natapov wrote:
 On Wed, Jun 26, 2013 at 01:37:45PM +0200, Andrew Jones wrote:
  On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote:
   On 06/25/2013 08:20 PM, Andrew Theurer wrote:
   On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote:
   This series replaces the existing paravirtualized spinlock mechanism
   with a paravirtualized ticketlock mechanism. The series provides
   implementation for both Xen and KVM.
   
   Changes in V9:
   - Changed spin_threshold to 32k to avoid excess halt exits that are
   causing undercommit degradation (after PLE handler improvement).
   - Added  kvm_irq_delivery_to_apic (suggested by Gleb)
   - Optimized halt exit path to use PLE handler
   
   V8 of PVspinlock was posted last year. After Avi's suggestions to look
   at PLE handler's improvements, various optimizations in PLE handling
   have been tried.
   
   Sorry for not posting this sooner.  I have tested the v9 pv-ticketlock
   patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs.  I have
   tested these patches with and without PLE, as PLE is still not scalable
   with large VMs.
   
   
   Hi Andrew,
   
   Thanks for testing.
   
   System: x3850X5, 40 cores, 80 threads
   
   
   1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
   --
Total
   ConfigurationThroughput(MB/s)Notes
   
   3.10-default-ple_on  22945   5% CPU 
   in host kernel, 2% spin_lock in guests
   3.10-default-ple_off 23184   5% CPU 
   in host kernel, 2% spin_lock in guests
   3.10-pvticket-ple_on 22895   5% CPU 
   in host kernel, 2% spin_lock in guests
   3.10-pvticket-ple_off23051   5% CPU 
   in host kernel, 2% spin_lock in guests
   [all 1x results look good here]
   
   Yes. The 1x results look too close
   
   
   
   2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
   ---
Total
   ConfigurationThroughput  Notes
   
   3.10-default-ple_on   6287   55% CPU 
host kernel, 17% spin_lock in guests
   3.10-default-ple_off  1849   2% CPU 
   in host kernel, 95% spin_lock in guests
   3.10-pvticket-ple_on  6691   50% CPU 
   in host kernel, 15% spin_lock in guests
   3.10-pvticket-ple_off16464   8% CPU 
   in host kernel, 33% spin_lock in guests
   
   I see 6.426% improvement with ple_on
   and 161.87% improvement with ple_off. I think this is a very good sign
for the patches
   
   [PLE hinders pv-ticket improvements, but even with PLE off,
 we still off from ideal throughput (somewhere 2)]
   
   
   Okay, The ideal throughput you are referring is getting around atleast
   80% of 1x throughput for over-commit. Yes we are still far away from
   there.
   
   
   1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
   --
Total
   ConfigurationThroughput  Notes
   
   3.10-default-ple_on  22736   6% CPU 
   in host kernel, 3% spin_lock in guests
   3.10-default-ple_off 23377   5% CPU 
   in host kernel, 3% spin_lock in guests
   3.10-pvticket-ple_on 22471   6% CPU 
   in host kernel, 3% spin_lock in guests
   3.10-pvticket-ple_off23445   5% CPU 
   in host kernel, 3% spin_lock in guests
   [1x looking fine here]
   
   
   I see ple_off is little better here.
   
   
   2x over-commit with 20-vCPU VMs (8 VMs) all running dbench:
   --
Total
   ConfigurationThroughput  Notes
   
   3.10-default-ple_on   1965   70% CPU 
   in host kernel, 34% spin_lock in guests 
   3.10-default-ple_off   226   2% CPU 
   in host kernel, 94% spin_lock in guests
   3.10-pvticket-ple_on  1942   70% CPU 
   in host kernel, 35% spin_lock in guests
   3.10-pvticket-ple_off 8003   11% CPU 
   in host kernel, 70% spin_lock in guests
   [quite bad all around, but pv-tickets with PLE off the best so far.
 Still quite a bit off from ideal throughput]
   
   This is again a remarkable improvement (307%).
   This motivates me to 

Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-26 Thread Raghavendra K T

On 06/26/2013 08:09 PM, Chegu Vinod wrote:

On 6/26/2013 6:40 AM, Raghavendra K T wrote:

On 06/26/2013 06:22 PM, Gleb Natapov wrote:

On Wed, Jun 26, 2013 at 01:37:45PM +0200, Andrew Jones wrote:

On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote:

On 06/25/2013 08:20 PM, Andrew Theurer wrote:

On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote:

This series replaces the existing paravirtualized spinlock mechanism
with a paravirtualized ticketlock mechanism. The series provides
implementation for both Xen and KVM.

Changes in V9:
- Changed spin_threshold to 32k to avoid excess halt exits that are
causing undercommit degradation (after PLE handler improvement).
- Added  kvm_irq_delivery_to_apic (suggested by Gleb)
- Optimized halt exit path to use PLE handler

V8 of PVspinlock was posted last year. After Avi's suggestions to
look
at PLE handler's improvements, various optimizations in PLE handling
have been tried.


Sorry for not posting this sooner.  I have tested the v9
pv-ticketlock
patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs.  I
have
tested these patches with and without PLE, as PLE is still not
scalable
with large VMs.



Hi Andrew,

Thanks for testing.


System: x3850X5, 40 cores, 80 threads


1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
--
Total
ConfigurationThroughput(MB/s)Notes

3.10-default-ple_on229455% CPU in host
kernel, 2% spin_lock in guests
3.10-default-ple_off231845% CPU in host
kernel, 2% spin_lock in guests
3.10-pvticket-ple_on228955% CPU in host
kernel, 2% spin_lock in guests
3.10-pvticket-ple_off230515% CPU in host
kernel, 2% spin_lock in guests
[all 1x results look good here]


Yes. The 1x results look too close




2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
---
Total
ConfigurationThroughputNotes

3.10-default-ple_on 628755% CPU host
kernel, 17% spin_lock in guests
3.10-default-ple_off 18492% CPU in host
kernel, 95% spin_lock in guests
3.10-pvticket-ple_on 669150% CPU in host
kernel, 15% spin_lock in guests
3.10-pvticket-ple_off164648% CPU in host
kernel, 33% spin_lock in guests


I see 6.426% improvement with ple_on
and 161.87% improvement with ple_off. I think this is a very good sign
  for the patches


[PLE hinders pv-ticket improvements, but even with PLE off,
  we still off from ideal throughput (somewhere 2)]



Okay, The ideal throughput you are referring is getting around atleast
80% of 1x throughput for over-commit. Yes we are still far away from
there.



1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
--
Total
ConfigurationThroughputNotes

3.10-default-ple_on227366% CPU in host
kernel, 3% spin_lock in guests
3.10-default-ple_off233775% CPU in host
kernel, 3% spin_lock in guests
3.10-pvticket-ple_on224716% CPU in host
kernel, 3% spin_lock in guests
3.10-pvticket-ple_off234455% CPU in host
kernel, 3% spin_lock in guests
[1x looking fine here]



I see ple_off is little better here.



2x over-commit with 20-vCPU VMs (8 VMs) all running dbench:
--
Total
ConfigurationThroughputNotes

3.10-default-ple_on 196570% CPU in host
kernel, 34% spin_lock in guests
3.10-default-ple_off  2262% CPU in host
kernel, 94% spin_lock in guests
3.10-pvticket-ple_on 194270% CPU in host
kernel, 35% spin_lock in guests
3.10-pvticket-ple_off 800311% CPU in host
kernel, 70% spin_lock in guests
[quite bad all around, but pv-tickets with PLE off the best so far.
  Still quite a bit off from ideal throughput]


This is again a remarkable improvement (307%).
This motivates me to add a patch to disable ple when pvspinlock is on.
probably we can add a hypercall that disables ple in kvm init patch.
but only problem I see is what if the guests are mixed.

  (i.e one guest has pvspinlock support but other does not. Host
supports pv)


How about reintroducing the idea to create per-kvm ple_gap,ple_window
state. We were headed down that road when considering a dynamic
window at
one point. Then you can just set a single guest's ple_gap to zero,
which
would lead to PLE being disabled for that guest. We could also revisit
the dynamic window then.


Can be done, but lets understand why ple on is such a big problem. Is it
possible that ple gap 

Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-26 Thread Andrew Theurer
On Wed, 2013-06-26 at 15:52 +0300, Gleb Natapov wrote:
 On Wed, Jun 26, 2013 at 01:37:45PM +0200, Andrew Jones wrote:
  On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote:
   On 06/25/2013 08:20 PM, Andrew Theurer wrote:
   On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote:
   This series replaces the existing paravirtualized spinlock mechanism
   with a paravirtualized ticketlock mechanism. The series provides
   implementation for both Xen and KVM.
   
   Changes in V9:
   - Changed spin_threshold to 32k to avoid excess halt exits that are
   causing undercommit degradation (after PLE handler improvement).
   - Added  kvm_irq_delivery_to_apic (suggested by Gleb)
   - Optimized halt exit path to use PLE handler
   
   V8 of PVspinlock was posted last year. After Avi's suggestions to look
   at PLE handler's improvements, various optimizations in PLE handling
   have been tried.
   
   Sorry for not posting this sooner.  I have tested the v9 pv-ticketlock
   patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs.  I have
   tested these patches with and without PLE, as PLE is still not scalable
   with large VMs.
   
   
   Hi Andrew,
   
   Thanks for testing.
   
   System: x3850X5, 40 cores, 80 threads
   
   
   1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
   --
Total
   ConfigurationThroughput(MB/s)Notes
   
   3.10-default-ple_on  22945   5% CPU 
   in host kernel, 2% spin_lock in guests
   3.10-default-ple_off 23184   5% CPU 
   in host kernel, 2% spin_lock in guests
   3.10-pvticket-ple_on 22895   5% CPU 
   in host kernel, 2% spin_lock in guests
   3.10-pvticket-ple_off23051   5% CPU 
   in host kernel, 2% spin_lock in guests
   [all 1x results look good here]
   
   Yes. The 1x results look too close
   
   
   
   2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
   ---
Total
   ConfigurationThroughput  Notes
   
   3.10-default-ple_on   6287   55% CPU 
host kernel, 17% spin_lock in guests
   3.10-default-ple_off  1849   2% CPU 
   in host kernel, 95% spin_lock in guests
   3.10-pvticket-ple_on  6691   50% CPU 
   in host kernel, 15% spin_lock in guests
   3.10-pvticket-ple_off16464   8% CPU 
   in host kernel, 33% spin_lock in guests
   
   I see 6.426% improvement with ple_on
   and 161.87% improvement with ple_off. I think this is a very good sign
for the patches
   
   [PLE hinders pv-ticket improvements, but even with PLE off,
 we still off from ideal throughput (somewhere 2)]
   
   
   Okay, The ideal throughput you are referring is getting around atleast
   80% of 1x throughput for over-commit. Yes we are still far away from
   there.
   
   
   1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
   --
Total
   ConfigurationThroughput  Notes
   
   3.10-default-ple_on  22736   6% CPU 
   in host kernel, 3% spin_lock in guests
   3.10-default-ple_off 23377   5% CPU 
   in host kernel, 3% spin_lock in guests
   3.10-pvticket-ple_on 22471   6% CPU 
   in host kernel, 3% spin_lock in guests
   3.10-pvticket-ple_off23445   5% CPU 
   in host kernel, 3% spin_lock in guests
   [1x looking fine here]
   
   
   I see ple_off is little better here.
   
   
   2x over-commit with 20-vCPU VMs (8 VMs) all running dbench:
   --
Total
   ConfigurationThroughput  Notes
   
   3.10-default-ple_on   1965   70% CPU 
   in host kernel, 34% spin_lock in guests 
   3.10-default-ple_off   226   2% CPU 
   in host kernel, 94% spin_lock in guests
   3.10-pvticket-ple_on  1942   70% CPU 
   in host kernel, 35% spin_lock in guests
   3.10-pvticket-ple_off 8003   11% CPU 
   in host kernel, 70% spin_lock in guests
   [quite bad all around, but pv-tickets with PLE off the best so far.
 Still quite a bit off from ideal throughput]
   
   This is again a remarkable improvement (307%).
   This motivates me to add a 

Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-26 Thread Gleb Natapov
On Wed, Jun 26, 2013 at 07:10:21PM +0530, Raghavendra K T wrote:
 On 06/26/2013 06:22 PM, Gleb Natapov wrote:
 On Wed, Jun 26, 2013 at 01:37:45PM +0200, Andrew Jones wrote:
 On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote:
 On 06/25/2013 08:20 PM, Andrew Theurer wrote:
 On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote:
 This series replaces the existing paravirtualized spinlock mechanism
 with a paravirtualized ticketlock mechanism. The series provides
 implementation for both Xen and KVM.
 
 Changes in V9:
 - Changed spin_threshold to 32k to avoid excess halt exits that are
 causing undercommit degradation (after PLE handler improvement).
 - Added  kvm_irq_delivery_to_apic (suggested by Gleb)
 - Optimized halt exit path to use PLE handler
 
 V8 of PVspinlock was posted last year. After Avi's suggestions to look
 at PLE handler's improvements, various optimizations in PLE handling
 have been tried.
 
 Sorry for not posting this sooner.  I have tested the v9 pv-ticketlock
 patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs.  I have
 tested these patches with and without PLE, as PLE is still not scalable
 with large VMs.
 
 
 Hi Andrew,
 
 Thanks for testing.
 
 System: x3850X5, 40 cores, 80 threads
 
 
 1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
 --
   Total
 Configuration Throughput(MB/s)Notes
 
 3.10-default-ple_on   22945   5% CPU 
 in host kernel, 2% spin_lock in guests
 3.10-default-ple_off  23184   5% CPU 
 in host kernel, 2% spin_lock in guests
 3.10-pvticket-ple_on  22895   5% CPU 
 in host kernel, 2% spin_lock in guests
 3.10-pvticket-ple_off 23051   5% CPU 
 in host kernel, 2% spin_lock in guests
 [all 1x results look good here]
 
 Yes. The 1x results look too close
 
 
 
 2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
 ---
   Total
 Configuration Throughput  Notes
 
 3.10-default-ple_on6287   55% CPU 
  host kernel, 17% spin_lock in guests
 3.10-default-ple_off   1849   2% CPU 
 in host kernel, 95% spin_lock in guests
 3.10-pvticket-ple_on   6691   50% CPU 
 in host kernel, 15% spin_lock in guests
 3.10-pvticket-ple_off 16464   8% CPU 
 in host kernel, 33% spin_lock in guests
 
 I see 6.426% improvement with ple_on
 and 161.87% improvement with ple_off. I think this is a very good sign
   for the patches
 
 [PLE hinders pv-ticket improvements, but even with PLE off,
   we still off from ideal throughput (somewhere 2)]
 
 
 Okay, The ideal throughput you are referring is getting around atleast
 80% of 1x throughput for over-commit. Yes we are still far away from
 there.
 
 
 1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
 --
   Total
 Configuration Throughput  Notes
 
 3.10-default-ple_on   22736   6% CPU 
 in host kernel, 3% spin_lock in guests
 3.10-default-ple_off  23377   5% CPU 
 in host kernel, 3% spin_lock in guests
 3.10-pvticket-ple_on  22471   6% CPU 
 in host kernel, 3% spin_lock in guests
 3.10-pvticket-ple_off 23445   5% CPU 
 in host kernel, 3% spin_lock in guests
 [1x looking fine here]
 
 
 I see ple_off is little better here.
 
 
 2x over-commit with 20-vCPU VMs (8 VMs) all running dbench:
 --
   Total
 Configuration Throughput  Notes
 
 3.10-default-ple_on1965   70% CPU 
 in host kernel, 34% spin_lock in guests 
 3.10-default-ple_off226   2% CPU 
 in host kernel, 94% spin_lock in guests
 3.10-pvticket-ple_on   1942   70% CPU 
 in host kernel, 35% spin_lock in guests
 3.10-pvticket-ple_off  8003   11% CPU 
 in host kernel, 70% spin_lock in guests
 [quite bad all around, but pv-tickets with PLE off the best so far.
   Still quite a bit off from ideal throughput]
 
 This is again a remarkable improvement (307%).
 This motivates me to add a patch to disable ple when pvspinlock is on.
 probably we can add a hypercall that disables ple in kvm init patch.
 but only problem I see is what 

Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-26 Thread Raghavendra K T

On 06/26/2013 09:41 PM, Gleb Natapov wrote:

On Wed, Jun 26, 2013 at 07:10:21PM +0530, Raghavendra K T wrote:

On 06/26/2013 06:22 PM, Gleb Natapov wrote:

On Wed, Jun 26, 2013 at 01:37:45PM +0200, Andrew Jones wrote:

On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote:

On 06/25/2013 08:20 PM, Andrew Theurer wrote:

On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote:

This series replaces the existing paravirtualized spinlock mechanism
with a paravirtualized ticketlock mechanism. The series provides
implementation for both Xen and KVM.

Changes in V9:
- Changed spin_threshold to 32k to avoid excess halt exits that are
causing undercommit degradation (after PLE handler improvement).
- Added  kvm_irq_delivery_to_apic (suggested by Gleb)
- Optimized halt exit path to use PLE handler

V8 of PVspinlock was posted last year. After Avi's suggestions to look
at PLE handler's improvements, various optimizations in PLE handling
have been tried.


Sorry for not posting this sooner.  I have tested the v9 pv-ticketlock
patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs.  I have
tested these patches with and without PLE, as PLE is still not scalable
with large VMs.



Hi Andrew,

Thanks for testing.


System: x3850X5, 40 cores, 80 threads


1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
--
Total
Configuration   Throughput(MB/s)Notes

3.10-default-ple_on 22945   5% CPU in host 
kernel, 2% spin_lock in guests
3.10-default-ple_off23184   5% CPU in host 
kernel, 2% spin_lock in guests
3.10-pvticket-ple_on22895   5% CPU in host 
kernel, 2% spin_lock in guests
3.10-pvticket-ple_off   23051   5% CPU in host 
kernel, 2% spin_lock in guests
[all 1x results look good here]


Yes. The 1x results look too close




2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
---
Total
Configuration   Throughput  Notes

3.10-default-ple_on  6287   55% CPU  host 
kernel, 17% spin_lock in guests
3.10-default-ple_off 1849   2% CPU in host 
kernel, 95% spin_lock in guests
3.10-pvticket-ple_on 6691   50% CPU in host 
kernel, 15% spin_lock in guests
3.10-pvticket-ple_off   16464   8% CPU in host 
kernel, 33% spin_lock in guests


I see 6.426% improvement with ple_on
and 161.87% improvement with ple_off. I think this is a very good sign
  for the patches


[PLE hinders pv-ticket improvements, but even with PLE off,
  we still off from ideal throughput (somewhere 2)]



Okay, The ideal throughput you are referring is getting around atleast
80% of 1x throughput for over-commit. Yes we are still far away from
there.



1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
--
Total
Configuration   Throughput  Notes

3.10-default-ple_on 22736   6% CPU in host 
kernel, 3% spin_lock in guests
3.10-default-ple_off23377   5% CPU in host 
kernel, 3% spin_lock in guests
3.10-pvticket-ple_on22471   6% CPU in host 
kernel, 3% spin_lock in guests
3.10-pvticket-ple_off   23445   5% CPU in host 
kernel, 3% spin_lock in guests
[1x looking fine here]



I see ple_off is little better here.



2x over-commit with 20-vCPU VMs (8 VMs) all running dbench:
--
Total
Configuration   Throughput  Notes

3.10-default-ple_on  1965   70% CPU in host 
kernel, 34% spin_lock in guests 
3.10-default-ple_off  226   2% CPU in host 
kernel, 94% spin_lock in guests
3.10-pvticket-ple_on 1942   70% CPU in host 
kernel, 35% spin_lock in guests
3.10-pvticket-ple_off8003   11% CPU in host 
kernel, 70% spin_lock in guests
[quite bad all around, but pv-tickets with PLE off the best so far.
  Still quite a bit off from ideal throughput]


This is again a remarkable improvement (307%).
This motivates me to add a patch to disable ple when pvspinlock is on.
probably we can add a hypercall that disables ple in kvm init patch.
but only problem I see is what if the guests are mixed.

  (i.e one guest has pvspinlock support 

Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-25 Thread Andrew Theurer
On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote:
 This series replaces the existing paravirtualized spinlock mechanism
 with a paravirtualized ticketlock mechanism. The series provides
 implementation for both Xen and KVM.
 
 Changes in V9:
 - Changed spin_threshold to 32k to avoid excess halt exits that are
causing undercommit degradation (after PLE handler improvement).
 - Added  kvm_irq_delivery_to_apic (suggested by Gleb)
 - Optimized halt exit path to use PLE handler
 
 V8 of PVspinlock was posted last year. After Avi's suggestions to look
 at PLE handler's improvements, various optimizations in PLE handling
 have been tried.

Sorry for not posting this sooner.  I have tested the v9 pv-ticketlock
patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs.  I have
tested these patches with and without PLE, as PLE is still not scalable
with large VMs.

System: x3850X5, 40 cores, 80 threads


1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
-----------------------------------------------------------
                                   Total
Configuration             Throughput(MB/s)   Notes

3.10-default-ple_on            22945         5% CPU in host kernel, 2% spin_lock in guests
3.10-default-ple_off           23184         5% CPU in host kernel, 2% spin_lock in guests
3.10-pvticket-ple_on           22895         5% CPU in host kernel, 2% spin_lock in guests
3.10-pvticket-ple_off          23051         5% CPU in host kernel, 2% spin_lock in guests
[all 1x results look good here]


2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
------------------------------------------------------------
                                   Total
Configuration               Throughput       Notes

3.10-default-ple_on             6287         55% CPU in host kernel, 17% spin_lock in guests
3.10-default-ple_off            1849          2% CPU in host kernel, 95% spin_lock in guests
3.10-pvticket-ple_on            6691         50% CPU in host kernel, 15% spin_lock in guests
3.10-pvticket-ple_off          16464          8% CPU in host kernel, 33% spin_lock in guests
[PLE hinders pv-ticket improvements, but even with PLE off,
 we still off from ideal throughput (somewhere 2)]


1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
-----------------------------------------------------------
                                   Total
Configuration               Throughput       Notes

3.10-default-ple_on            22736          6% CPU in host kernel, 3% spin_lock in guests
3.10-default-ple_off           23377          5% CPU in host kernel, 3% spin_lock in guests
3.10-pvticket-ple_on           22471          6% CPU in host kernel, 3% spin_lock in guests
3.10-pvticket-ple_off          23445          5% CPU in host kernel, 3% spin_lock in guests
[1x looking fine here]


2x over-commit with 20-vCPU VMs (8 VMs) all running dbench:
-----------------------------------------------------------
                                   Total
Configuration               Throughput       Notes

3.10-default-ple_on             1965         70% CPU in host kernel, 34% spin_lock in guests
3.10-default-ple_off             226          2% CPU in host kernel, 94% spin_lock in guests
3.10-pvticket-ple_on            1942         70% CPU in host kernel, 35% spin_lock in guests
3.10-pvticket-ple_off           8003         11% CPU in host kernel, 70% spin_lock in guests
[quite bad all around, but pv-tickets with PLE off the best so far.
 Still quite a bit off from ideal throughput]

In summary, I would state that the pv-ticket is an overall win, but the
current PLE handler tends to get in the way on these larger guests.

-Andrew



Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-07 Thread Raghavendra K T

On 06/03/2013 11:51 AM, Raghavendra K T wrote:

On 06/03/2013 07:10 AM, Raghavendra K T wrote:

On 06/02/2013 09:50 PM, Jiannan Ouyang wrote:

On Sun, Jun 2, 2013 at 1:07 AM, Gleb Natapov g...@redhat.com wrote:


High level question here. We have a big hope for Preemptable Ticket
Spinlock patch series by Jiannan Ouyang to solve most, if not all,
ticketing spinlocks in overcommit scenarios problem without need for
PV.
So how this patch series compares with his patches on PLE enabled
processors?



No experiment results yet.

An error is reported on a 20 core VM. I'm during an internship
relocation, and will start work on it next week.


Preemptable spinlocks' testing update:
I hit the same softlockup problem while testing on 32 core machine with
32 guest vcpus that Andrew had reported.

After that i started tuning TIMEOUT_UNIT, and when I went till (18),
things seemed to be manageable for undercommit cases.
But I still see degradation for undercommit w.r.t baseline itself on 32
core machine (after tuning).

(37.5% degradation w.r.t base line).
I can give the full report after the all tests complete.

For over-commit cases, I again started hitting softlockups (and
degradation is worse). But as I said in the preemptable thread, the
concept of preemptable locks looks promising (though I am still not a
fan of  embedded TIMEOUT mechanism)

Here is my opinion of TODOs for preemptable locks to make it better ( I
think I need to paste in the preemptable thread also)

1. Current TIMEOUT UNIT seem to be on higher side and also it does not
scale well with large guests and also overcommit. we need to have a
sort of adaptive mechanism and better is sort of different TIMEOUT_UNITS
for different types of lock too. The hashing mechanism that was used in
Rik's spinlock backoff series fits better probably.

2. I do not think TIMEOUT_UNIT itself would work great when we have a
big queue (for large guests / overcommits) for lock.
one way is to add a PV hook that does yield hypercall immediately for
the waiters above some THRESHOLD so that they don't burn the CPU.
( I can do POC to check if  that idea works in improving situation
at some later point of time)



Preemptable-lock results from my run with 2^8 TIMEOUT:

+----+-------------+-----------+------------+-----------+--------------+
                  ebizzy (records/sec)  higher is better
+----+-------------+-----------+------------+-----------+--------------+
           base        stdev      patched       stdev     %improvement
+----+-------------+-----------+------------+-----------+--------------+
1x     5574.9000     237.4997    3484.2000    113.4449     -37.50202
2x     2741.5000     561.3090     351.5000    140.5420     -87.17855
3x     2146.2500     216.7718     194.8333     85.0303     -90.92215
4x     1663.0000     141.9235     101.0000     57.7853     -93.92664
+----+-------------+-----------+------------+-----------+--------------+
+----+-------------+-----------+------------+-----------+--------------+
                  dbench (Throughput)  higher is better
+----+-------------+-----------+------------+-----------+--------------+
           base        stdev      patched       stdev     %improvement
+----+-------------+-----------+------------+-----------+--------------+
1x    14111.5600     754.4525    3930.1602   2547.2369     -72.14936
2x     2481.6270      71.2665     181.1816     89.5368     -92.69908
3x     1510.2483      31.8634     104.7243     53.2470     -93.06576
4x     1029.4875      16.9166      72.3738     38.2432     -92.96992
+----+-------------+-----------+------------+-----------+--------------+

Note we can not trust on overcommit results because of softlock-ups



Hi, I tried
(1) TIMEOUT=(2^7)

(2) having yield hypercall that uses kvm_vcpu_on_spin() to do directed 
yield to other vCPUs.

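The host side of (2) can be as small as the sketch below; KVM_HC_YIELD is a
made-up hypercall number here, while kvm_vcpu_on_spin() is the existing
directed-yield helper that the PLE handler already uses:

        /* sketch: wired into the hypercall switch in kvm_emulate_hypercall() */
        static int kvm_pv_yield(struct kvm_vcpu *vcpu)
        {
                /*
                 * Same directed yield the PLE handler does: hand this
                 * physical CPU to another runnable vcpu of the same VM.
                 */
                kvm_vcpu_on_spin(vcpu);
                return 0;
        }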

Now I do not see any soft-lockup in overcommit cases and results are 
better now (except ebizzy 1x). and for dbench I see now it is closer to 
base and even improvement in 4x


+-------------+-----------+------------+-----------+--------------+
               ebizzy (records/sec)  higher is better
+-------------+-----------+------------+-----------+--------------+
        base        stdev      patched       stdev     %improvement
+-------------+-----------+------------+-----------+--------------+
   5574.9000     237.4997     523.7000      1.4181     -90.60611
   2741.5000     561.3090     597.8000     34.9755     -78.19442
   2146.2500     216.7718     902.6667     82.4228     -57.94215
   1663.0000     141.9235    1245.0000     67.2989     -25.13530
+-------------+-----------+------------+-----------+--------------+
+-------------+-----------+------------+-----------+--------------+
               dbench (Throughput)  higher is better
+-------------+-----------+------------+-----------+--------------+
        base        stdev      patched       stdev     %improvement
+-------------+-----------+------------+-----------+--------------+
  14111.5600     754.4525     884.9051     24.4723     -93.72922
   2481.6270      71.2665    2383.5700    333.2435      -3.95132
   1510.2483      31.8634    1477.7358     50.5126      -2.15279
   1029.4875      16.9166    1075.9225     13.9911

Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-07 Thread Andrew Theurer
On Fri, 2013-06-07 at 11:45 +0530, Raghavendra K T wrote:
 On 06/03/2013 11:51 AM, Raghavendra K T wrote:
  On 06/03/2013 07:10 AM, Raghavendra K T wrote:
  On 06/02/2013 09:50 PM, Jiannan Ouyang wrote:
  On Sun, Jun 2, 2013 at 1:07 AM, Gleb Natapov g...@redhat.com wrote:
 
  High level question here. We have a big hope for Preemptable Ticket
  Spinlock patch series by Jiannan Ouyang to solve most, if not all,
  ticketing spinlocks in overcommit scenarios problem without need for
  PV.
  So how this patch series compares with his patches on PLE enabled
  processors?
 
 
  No experiment results yet.
 
  An error is reported on a 20 core VM. I'm during an internship
  relocation, and will start work on it next week.
 
  Preemptable spinlocks' testing update:
  I hit the same softlockup problem while testing on 32 core machine with
  32 guest vcpus that Andrew had reported.
 
  After that i started tuning TIMEOUT_UNIT, and when I went till (18),
  things seemed to be manageable for undercommit cases.
  But I still see degradation for undercommit w.r.t baseline itself on 32
  core machine (after tuning).
 
  (37.5% degradation w.r.t base line).
  I can give the full report after the all tests complete.
 
  For over-commit cases, I again started hitting softlockups (and
  degradation is worse). But as I said in the preemptable thread, the
  concept of preemptable locks looks promising (though I am still not a
  fan of  embedded TIMEOUT mechanism)
 
  Here is my opinion of TODOs for preemptable locks to make it better ( I
  think I need to paste in the preemptable thread also)
 
  1. Current TIMEOUT UNIT seem to be on higher side and also it does not
  scale well with large guests and also overcommit. we need to have a
  sort of adaptive mechanism and better is sort of different TIMEOUT_UNITS
  for different types of lock too. The hashing mechanism that was used in
  Rik's spinlock backoff series fits better probably.
 
  2. I do not think TIMEOUT_UNIT itself would work great when we have a
  big queue (for large guests / overcommits) for lock.
  one way is to add a PV hook that does yield hypercall immediately for
  the waiters above some THRESHOLD so that they don't burn the CPU.
  ( I can do POC to check if  that idea works in improving situation
  at some later point of time)
 
 
  Preemptable-lock results from my run with 2^8 TIMEOUT:
 
  +---+---+---++---+
ebizzy (records/sec) higher is better
  +---+---+---++---+
   basestdevpatchedstdev%improvement
  +---+---+---++---+
  1x  5574.9000   237.49973484.2000   113.4449   -37.50202
  2x  2741.5000   561.3090 351.5000   140.5420   -87.17855
  3x  2146.2500   216.7718 194.833385.0303   -90.92215
  4x  1663.   141.9235 101.57.7853   -93.92664
  +---+---+---++---+
  +---+---+---++---+
  dbench  (Throughput) higher is better
  +---+---+---++---+
basestdevpatchedstdev%improvement
  +---+---+---++---+
  1x  14111.5600   754.4525   3930.1602   2547.2369-72.14936
  2x  2481.627071.2665  181.181689.5368-92.69908
  3x  1510.248331.8634  104.724353.2470-93.06576
  4x  1029.487516.9166   72.373838.2432-92.96992
  +---+---+---++---+
 
  Note we can not trust on overcommit results because of softlock-ups
 
 
 Hi, I tried
 (1) TIMEOUT=(2^7)
 
 (2) having yield hypercall that uses kvm_vcpu_on_spin() to do directed 
 yield to other vCPUs.
 
 Now I do not see any soft-lockup in overcommit cases and results are 
 better now (except ebizzy 1x). and for dbench I see now it is closer to 
 base and even improvement in 4x
 
 +---+---+---++---+
 ebizzy (records/sec) higher is better
 +---+---+---++---+
basestdevpatchedstdev%improvement
 +---+---+---++---+
5574.9000   237.4997 523.7000 1.4181   -90.60611
2741.5000   561.3090 597.800034.9755   -78.19442
2146.2500   216.7718 902.666782.4228   -57.94215
1663.   141.92351245.67.2989   -25.13530
 +---+---+---++---+
 +---+---+---++---+
  dbench  (Throughput) higher is better
 +---+---+---++---+
 basestdevpatchedstdev%improvement
 +---+---+---++---+
   

Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-07 Thread Jiannan Ouyang
Raghu, thanks for you input. I'm more than glad to work together with
you to make this idea work better.

-Jiannan

On Thu, Jun 6, 2013 at 11:15 PM, Raghavendra K T
raghavendra...@linux.vnet.ibm.com wrote:
 On 06/03/2013 11:51 AM, Raghavendra K T wrote:

 On 06/03/2013 07:10 AM, Raghavendra K T wrote:

 On 06/02/2013 09:50 PM, Jiannan Ouyang wrote:

 On Sun, Jun 2, 2013 at 1:07 AM, Gleb Natapov g...@redhat.com wrote:

 High level question here. We have a big hope for Preemptable Ticket
 Spinlock patch series by Jiannan Ouyang to solve most, if not all,
 ticketing spinlocks in overcommit scenarios problem without need for
 PV.
 So how this patch series compares with his patches on PLE enabled
 processors?


 No experiment results yet.

 An error is reported on a 20 core VM. I'm during an internship
 relocation, and will start work on it next week.


 Preemptable spinlocks' testing update:
 I hit the same softlockup problem while testing on 32 core machine with
 32 guest vcpus that Andrew had reported.

 After that i started tuning TIMEOUT_UNIT, and when I went till (18),
 things seemed to be manageable for undercommit cases.
 But I still see degradation for undercommit w.r.t baseline itself on 32
 core machine (after tuning).

 (37.5% degradation w.r.t base line).
 I can give the full report after the all tests complete.

 For over-commit cases, I again started hitting softlockups (and
 degradation is worse). But as I said in the preemptable thread, the
 concept of preemptable locks looks promising (though I am still not a
 fan of  embedded TIMEOUT mechanism)

 Here is my opinion of TODOs for preemptable locks to make it better ( I
 think I need to paste in the preemptable thread also)

 1. Current TIMEOUT UNIT seem to be on higher side and also it does not
 scale well with large guests and also overcommit. we need to have a
 sort of adaptive mechanism and better is sort of different TIMEOUT_UNITS
 for different types of lock too. The hashing mechanism that was used in
 Rik's spinlock backoff series fits better probably.

 2. I do not think TIMEOUT_UNIT itself would work great when we have a
 big queue (for large guests / overcommits) for lock.
 one way is to add a PV hook that does yield hypercall immediately for
 the waiters above some THRESHOLD so that they don't burn the CPU.
 ( I can do POC to check if  that idea works in improving situation
 at some later point of time)


 Preemptable-lock results from my run with 2^8 TIMEOUT:

 +---+---+---++---+
   ebizzy (records/sec) higher is better
 +---+---+---++---+
  basestdevpatchedstdev%improvement
 +---+---+---++---+
 1x  5574.9000   237.49973484.2000   113.4449   -37.50202
 2x  2741.5000   561.3090 351.5000   140.5420   -87.17855
 3x  2146.2500   216.7718 194.833385.0303   -90.92215
 4x  1663.   141.9235 101.57.7853   -93.92664
 +---+---+---++---+
 +---+---+---++---+
 dbench  (Throughput) higher is better
 +---+---+---++---+
   basestdevpatchedstdev%improvement
 +---+---+---++---+
 1x  14111.5600   754.4525   3930.1602   2547.2369-72.14936
 2x  2481.627071.2665  181.181689.5368-92.69908
 3x  1510.248331.8634  104.724353.2470-93.06576
 4x  1029.487516.9166   72.373838.2432-92.96992
 +---+---+---++---+

 Note we can not trust on overcommit results because of softlock-ups


 Hi, I tried
 (1) TIMEOUT=(2^7)

 (2) having yield hypercall that uses kvm_vcpu_on_spin() to do directed yield
 to other vCPUs.

 Now I do not see any soft-lockup in overcommit cases and results are better
 now (except ebizzy 1x). and for dbench I see now it is closer to base and
 even improvement in 4x


 +---+---+---++---+
ebizzy (records/sec) higher is better
 +---+---+---++---+
   basestdevpatchedstdev%improvement
 +---+---+---++---+
   5574.9000   237.4997 523.7000 1.4181   -90.60611
   2741.5000   561.3090 597.800034.9755   -78.19442
   2146.2500   216.7718 902.666782.4228   -57.94215
   1663.   141.92351245.67.2989   -25.13530

 +---+---+---++---+
 +---+---+---++---+
 dbench  (Throughput) higher is better
 +---+---+---++---+
basestdevpatchedstdev%improvement
 

Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-04 Thread Raghavendra K T

On 06/02/2013 01:44 AM, Andi Kleen wrote:


FWIW I use the paravirt spinlock ops for adding lock elision
to the spinlocks.

This needs to be done at the top level (so the level you're removing)

However I don't like the pv mechanism very much and would
be fine with using an static key hook in the main path
like I do for all the other lock types.

It also uses interrupt ops patching, for that it would
be still needed though.



Hi Andi, IIUC, you are okay with the current approach overall right?



Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-03 Thread Raghavendra K T

On 06/03/2013 07:10 AM, Raghavendra K T wrote:

On 06/02/2013 09:50 PM, Jiannan Ouyang wrote:

On Sun, Jun 2, 2013 at 1:07 AM, Gleb Natapov g...@redhat.com wrote:


High level question here. We have a big hope for Preemptable Ticket
Spinlock patch series by Jiannan Ouyang to solve most, if not all,
ticketing spinlocks in overcommit scenarios problem without need for PV.
So how this patch series compares with his patches on PLE enabled
processors?



No experiment results yet.

An error is reported on a 20 core VM. I'm during an internship
relocation, and will start work on it next week.


Preemptable spinlocks' testing update:
I hit the same softlockup problem while testing on 32 core machine with
32 guest vcpus that Andrew had reported.

After that i started tuning TIMEOUT_UNIT, and when I went till (18),
things seemed to be manageable for undercommit cases.
But I still see degradation for undercommit w.r.t baseline itself on 32
core machine (after tuning).

(37.5% degradation w.r.t base line).
I can give the full report after the all tests complete.

For over-commit cases, I again started hitting softlockups (and
degradation is worse). But as I said in the preemptable thread, the
concept of preemptable locks looks promising (though I am still not a
fan of  embedded TIMEOUT mechanism)

Here is my opinion of TODOs for preemptable locks to make it better ( I
think I need to paste in the preemptable thread also)

1. Current TIMEOUT UNIT seem to be on higher side and also it does not
scale well with large guests and also overcommit. we need to have a
sort of adaptive mechanism and better is sort of different TIMEOUT_UNITS
for different types of lock too. The hashing mechanism that was used in
Rik's spinlock backoff series fits better probably.

2. I do not think TIMEOUT_UNIT itself would work great when we have a
big queue (for large guests / overcommits) for lock.
one way is to add a PV hook that does yield hypercall immediately for
the waiters above some THRESHOLD so that they don't burn the CPU.
( I can do POC to check if  that idea works in improving situation
at some later point of time)



Preemptable-lock results from my run with 2^8 TIMEOUT:

+----+-------------+-----------+------------+-----------+--------------+
                  ebizzy (records/sec)  higher is better
+----+-------------+-----------+------------+-----------+--------------+
           base        stdev      patched       stdev     %improvement
+----+-------------+-----------+------------+-----------+--------------+
1x     5574.9000     237.4997    3484.2000    113.4449     -37.50202
2x     2741.5000     561.3090     351.5000    140.5420     -87.17855
3x     2146.2500     216.7718     194.8333     85.0303     -90.92215
4x     1663.0000     141.9235     101.0000     57.7853     -93.92664
+----+-------------+-----------+------------+-----------+--------------+
+----+-------------+-----------+------------+-----------+--------------+
                  dbench (Throughput)  higher is better
+----+-------------+-----------+------------+-----------+--------------+
           base        stdev      patched       stdev     %improvement
+----+-------------+-----------+------------+-----------+--------------+
1x    14111.5600     754.4525    3930.1602   2547.2369     -72.14936
2x     2481.6270      71.2665     181.1816     89.5368     -92.69908
3x     1510.2483      31.8634     104.7243     53.2470     -93.06576
4x     1029.4875      16.9166      72.3738     38.2432     -92.96992
+----+-------------+-----------+------------+-----------+--------------+

Note we can not trust on overcommit results because of softlock-ups



Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-02 Thread Gleb Natapov
On Sun, Jun 02, 2013 at 12:51:25AM +0530, Raghavendra K T wrote:
 
 This series replaces the existing paravirtualized spinlock mechanism
 with a paravirtualized ticketlock mechanism. The series provides
 implementation for both Xen and KVM.
 
High level question here. We have a big hope for Preemptable Ticket
Spinlock patch series by Jiannan Ouyang to solve most, if not all,
ticketing spinlocks in overcommit scenarios problem without need for PV.
So how this patch series compares with his patches on PLE enabled processors?

 Changes in V9:
 - Changed spin_threshold to 32k to avoid excess halt exits that are
causing undercommit degradation (after PLE handler improvement).
 - Added  kvm_irq_delivery_to_apic (suggested by Gleb)
 - Optimized halt exit path to use PLE handler
 
 V8 of PVspinlock was posted last year. After Avi's suggestions to look
 at PLE handler's improvements, various optimizations in PLE handling
 have been tried.
 
 With this series we see that we could get little more improvements on top
 of that. 
 
 Ticket locks have an inherent problem in a virtualized case, because
 the vCPUs are scheduled rather than running concurrently (ignoring
 gang scheduled vCPUs).  This can result in catastrophic performance
 collapses when the vCPU scheduler doesn't schedule the correct next
 vCPU, and ends up scheduling a vCPU which burns its entire timeslice
 spinning.  (Note that this is not the same problem as lock-holder
 preemption, which this series also addresses; that's also a problem,
 but not catastrophic).
 
 (See Thomas Friebel's talk Prevent Guests from Spinning Around
 http://www.xen.org/files/xensummitboston08/LHP.pdf for more details.)
 
 Currently we deal with this by having PV spinlocks, which adds a layer
 of indirection in front of all the spinlock functions, and defining a
 completely new implementation for Xen (and for other pvops users, but
 there are none at present).
 
 PV ticketlocks keeps the existing ticketlock implemenentation
 (fastpath) as-is, but adds a couple of pvops for the slow paths:
 
 - If a CPU has been waiting for a spinlock for SPIN_THRESHOLD
   iterations, then call out to the __ticket_lock_spinning() pvop,
   which allows a backend to block the vCPU rather than spinning.  This
   pvop can set the lock into slowpath state.
 
 - When releasing a lock, if it is in slowpath state, the call
   __ticket_unlock_kick() to kick the next vCPU in line awake.  If the
   lock is no longer in contention, it also clears the slowpath flag.
 
 The slowpath state is stored in the LSB of the within the lock tail
 ticket.  This has the effect of reducing the max number of CPUs by
 half (so, a small ticket can deal with 128 CPUs, and large ticket
 32768).
 
 For KVM, one hypercall is introduced in hypervisor,that allows a vcpu to kick
 another vcpu out of halt state.
 The blocking of vcpu is done using halt() in (lock_spinning) slowpath.
 
 Overall, it results in a large reduction in code, it makes the native
 and virtualized cases closer, and it removes a layer of indirection
 around all the spinlock functions.
 
 The fast path (taking an uncontended lock which isn't in slowpath
 state) is optimal, identical to the non-paravirtualized case.
 
 The inner part of ticket lock code becomes:
  inc = xadd(&lock->tickets, inc);
  inc.tail &= ~TICKET_SLOWPATH_FLAG;
 
  if (likely(inc.head == inc.tail))
          goto out;
  for (;;) {
          unsigned count = SPIN_THRESHOLD;
          do {
                  if (ACCESS_ONCE(lock->tickets.head) == inc.tail)
                          goto out;
                  cpu_relax();
          } while (--count);
          __ticket_lock_spinning(lock, inc.tail);
  }
  out:  barrier();
 which results in:
   push   %rbp
   mov%rsp,%rbp
 
   mov$0x200,%eax
   lock xadd %ax,(%rdi)
   movzbl %ah,%edx
   cmp%al,%dl
   jne1f   # Slowpath if lock in contention
 
   pop%rbp
   retq   
 
   ### SLOWPATH START
 1:and$-2,%edx
   movzbl %dl,%esi
 
 2:mov$0x800,%eax
   jmp4f
 
 3:pause  
   sub$0x1,%eax
   je 5f
 
 4:movzbl (%rdi),%ecx
   cmp%cl,%dl
   jne3b
 
   pop%rbp
   retq   
 
 5:callq  *__ticket_lock_spinning
   jmp2b
   ### SLOWPATH END
 
 with CONFIG_PARAVIRT_SPINLOCKS=n, the code has changed slightly, where
 the fastpath case is straight through (taking the lock without
 contention), and the spin loop is out of line:
 
   push   %rbp
   mov%rsp,%rbp
 
   mov$0x100,%eax
   lock xadd %ax,(%rdi)
   movzbl %ah,%edx
   cmp%al,%dl
   jne1f
 
   pop%rbp
   retq   
 
   ### SLOWPATH START
 1:pause  
   movzbl (%rdi),%eax
   cmp%dl,%al
   jne1b
 
   pop%rbp
   retq   
   ### SLOWPATH END
 
 The unlock code is complicated by the need to both add to 

Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-02 Thread Jiannan Ouyang
On Sun, Jun 2, 2013 at 1:07 AM, Gleb Natapov g...@redhat.com wrote:

 High level question here. We have a big hope for Preemptable Ticket
 Spinlock patch series by Jiannan Ouyang to solve most, if not all,
 ticketing spinlocks in overcommit scenarios problem without need for PV.
 So how this patch series compares with his patches on PLE enabled processors?


No experiment results yet.

An error is reported on a 20 core VM. I'm during an internship
relocation, and will start work on it next week.

--
Jiannan


Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-02 Thread Raghavendra K T

On 06/02/2013 09:50 PM, Jiannan Ouyang wrote:

On Sun, Jun 2, 2013 at 1:07 AM, Gleb Natapov g...@redhat.com wrote:


High level question here. We have a big hope for Preemptable Ticket
Spinlock patch series by Jiannan Ouyang to solve most, if not all,
ticketing spinlocks in overcommit scenarios problem without need for PV.
So how this patch series compares with his patches on PLE enabled processors?



No experiment results yet.

An error is reported on a 20 core VM. I'm during an internship
relocation, and will start work on it next week.


Preemptable spinlocks' testing update:
I hit the same softlockup problem while testing on 32 core machine with
32 guest vcpus that Andrew had reported.

After that i started tuning TIMEOUT_UNIT, and when I went till (18),
things seemed to be manageable for undercommit cases.
But I still see degradation for undercommit w.r.t baseline itself on 32
core machine (after tuning).

(37.5% degradation w.r.t base line).
I can give the full report after the all tests complete.

For over-commit cases, I again started hitting softlockups (and
degradation is worse). But as I said in the preemptable thread, the
concept of preemptable locks looks promising (though I am still not a
fan of  embedded TIMEOUT mechanism)

Here is my opinion of TODOs for preemptable locks to make it better ( I
think I need to paste in the preemptable thread also)

1. Current TIMEOUT UNIT seem to be on higher side and also it does not
scale well with large guests and also overcommit. we need to have a
sort of adaptive mechanism and better is sort of different TIMEOUT_UNITS
for different types of lock too. The hashing mechanism that was used in
Rik's spinlock backoff series fits better probably.

2. I do not think TIMEOUT_UNIT itself would work great when we have a
big queue (for large guests / overcommits) for lock.
one way is to add a PV hook that does yield hypercall immediately for
the waiters above some THRESHOLD so that they don't burn the CPU.
( I can do POC to check if  that idea works in improving situation
at some later point of time)

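A rough sketch of the hook described in (2), just to make the idea concrete;
PV_YIELD_THRESHOLD, KVM_HC_YIELD and pv_wait_for_ticket() do not exist
anywhere, and the ticket arithmetic only loosely follows the series:

        #define PV_YIELD_THRESHOLD      4

        static void pv_wait_for_ticket(arch_spinlock_t *lock, __ticket_t want)
        {
                unsigned count = SPIN_THRESHOLD;

                for (;;) {
                        __ticket_t head = ACCESS_ONCE(lock->tickets.head);
                        unsigned depth = (__ticket_t)(want - head) / TICKET_LOCK_INC;

                        if (head == want)
                                return;                 /* our turn */
                        /* deep waiters give the pcpu back instead of spinning */
                        if (depth > PV_YIELD_THRESHOLD || --count == 0) {
                                kvm_hypercall1(KVM_HC_YIELD, depth);
                                count = SPIN_THRESHOLD;
                        }
                        cpu_relax();
                }
        }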


[PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-01 Thread Raghavendra K T

This series replaces the existing paravirtualized spinlock mechanism
with a paravirtualized ticketlock mechanism. The series provides
implementation for both Xen and KVM.

Changes in V9:
- Changed spin_threshold to 32k to avoid excess halt exits that are
   causing undercommit degradation (after PLE handler improvement).
- Added  kvm_irq_delivery_to_apic (suggested by Gleb)
- Optimized halt exit path to use PLE handler

V8 of PVspinlock was posted last year. After Avi's suggestions to look
at PLE handler's improvements, various optimizations in PLE handling
have been tried.

With this series we see that we could get a little more improvement on top
of that. 

Ticket locks have an inherent problem in a virtualized case, because
the vCPUs are scheduled rather than running concurrently (ignoring
gang scheduled vCPUs).  This can result in catastrophic performance
collapses when the vCPU scheduler doesn't schedule the correct next
vCPU, and ends up scheduling a vCPU which burns its entire timeslice
spinning.  (Note that this is not the same problem as lock-holder
preemption, which this series also addresses; that's also a problem,
but not catastrophic).

(See Thomas Friebel's talk Prevent Guests from Spinning Around
http://www.xen.org/files/xensummitboston08/LHP.pdf for more details.)

Currently we deal with this by having PV spinlocks, which adds a layer
of indirection in front of all the spinlock functions, and defining a
completely new implementation for Xen (and for other pvops users, but
there are none at present).

PV ticketlocks keeps the existing ticketlock implementation
(fastpath) as-is, but adds a couple of pvops for the slow paths:

- If a CPU has been waiting for a spinlock for SPIN_THRESHOLD
  iterations, then call out to the __ticket_lock_spinning() pvop,
  which allows a backend to block the vCPU rather than spinning.  This
  pvop can set the lock into slowpath state.

- When releasing a lock, if it is in slowpath state, then call
  __ticket_unlock_kick() to kick the next vCPU in line awake.  If the
  lock is no longer in contention, it also clears the slowpath flag.

The slowpath state is stored in the LSB of the lock tail
ticket.  This has the effect of reducing the max number of CPUs by
half (so, a small ticket can deal with 128 CPUs, and large ticket
32768).

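To make that layout concrete, the encoding it implies is roughly the
following (the constant names follow the series loosely and should be read
as illustrative):

        #ifdef CONFIG_PARAVIRT_SPINLOCKS
        #define __TICKET_LOCK_INC       2               /* tickets advance in steps of 2 ...    */
        #define TICKET_SLOWPATH_FLAG    ((__ticket_t)1) /* ... leaving bit 0 for "slowpath"     */
        #else
        #define __TICKET_LOCK_INC       1
        #define TICKET_SLOWPATH_FLAG    ((__ticket_t)0)
        #endif

        /* so an 8-bit ticket covers 256/2 = 128 CPUs, a 16-bit one 65536/2 = 32768 */
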
For KVM, one hypercall is introduced in the hypervisor that allows a vcpu to kick
another vcpu out of halt state.
The blocking of vcpu is done using halt() in (lock_spinning) slowpath.

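A condensed sketch of that KVM backend follows. It is heavily simplified: the
real lock_spinning in the series also disables interrupts, handles the race
with an incoming kick via safe_halt(), and keeps per-cpu statistics.
KVM_HC_KICK_CPU, kvm_hypercall2() and x86_cpu_to_apicid are real; the
per-cpu waiting structure is only an approximation of what the patches use:

        struct kvm_lock_waiting {               /* roughly as in the series */
                struct arch_spinlock *lock;
                __ticket_t want;
        };
        static DEFINE_PER_CPU(struct kvm_lock_waiting, klock_waiting);

        static void kvm_kick_cpu(int cpu)
        {
                unsigned long flags = 0;
                int apicid = per_cpu(x86_cpu_to_apicid, cpu);

                kvm_hypercall2(KVM_HC_KICK_CPU, flags, apicid);
        }

        static void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
        {
                struct kvm_lock_waiting *w = &__get_cpu_var(klock_waiting);

                w->want = want;                 /* publish what we wait for,     */
                w->lock = lock;                 /* so the unlocker can find us   */
                smp_wmb();

                /* wake-up race handling omitted in this sketch */
                if (ACCESS_ONCE(lock->tickets.head) != want)
                        halt();                 /* woken by KVM_HC_KICK_CPU      */

                w->lock = NULL;
        }
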
Overall, it results in a large reduction in code, it makes the native
and virtualized cases closer, and it removes a layer of indirection
around all the spinlock functions.

The fast path (taking an uncontended lock which isn't in slowpath
state) is optimal, identical to the non-paravirtualized case.

The inner part of ticket lock code becomes:
        inc = xadd(&lock->tickets, inc);
        inc.tail &= ~TICKET_SLOWPATH_FLAG;

        if (likely(inc.head == inc.tail))
                goto out;
        for (;;) {
                unsigned count = SPIN_THRESHOLD;
                do {
                        if (ACCESS_ONCE(lock->tickets.head) == inc.tail)
                                goto out;
                        cpu_relax();
                } while (--count);
                __ticket_lock_spinning(lock, inc.tail);
        }
out:    barrier();
which results in:
        push   %rbp
        mov    %rsp,%rbp

        mov    $0x200,%eax
        lock xadd %ax,(%rdi)
        movzbl %ah,%edx
        cmp    %al,%dl
        jne    1f       # Slowpath if lock in contention

        pop    %rbp
        retq

        ### SLOWPATH START
1:      and    $-2,%edx
        movzbl %dl,%esi

2:      mov    $0x800,%eax
        jmp    4f

3:      pause
        sub    $0x1,%eax
        je     5f

4:      movzbl (%rdi),%ecx
        cmp    %cl,%dl
        jne    3b

        pop    %rbp
        retq

5:      callq  *__ticket_lock_spinning
        jmp    2b
        ### SLOWPATH END

with CONFIG_PARAVIRT_SPINLOCKS=n, the code has changed slightly, where
the fastpath case is straight through (taking the lock without
contention), and the spin loop is out of line:

        push   %rbp
        mov    %rsp,%rbp

        mov    $0x100,%eax
        lock xadd %ax,(%rdi)
        movzbl %ah,%edx
        cmp    %al,%dl
        jne    1f

        pop    %rbp
        retq

        ### SLOWPATH START
1:      pause
        movzbl (%rdi),%eax
        cmp    %dl,%al
        jne    1b

        pop    %rbp
        retq
        ### SLOWPATH END

The unlock code is complicated by the need to both add to the lock's
head and fetch the slowpath flag from tail.  This version of the
patch uses a locked add to do this, followed by a test to see if the
slowflag is set.  The lock prefix acts as a full memory barrier, so we
can be sure that other CPUs will have seen the unlock before we read
the flag (without the barrier the read could be fetched from the
store queue before it hits 

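In shape, that unlock path boils down to something like the sketch below
(the jump-label check the patch adds is omitted, and the locked-add helper
and slowpath function names are approximations rather than verbatim code):

        static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
        {
                if (TICKET_SLOWPATH_FLAG) {
                        arch_spinlock_t prev = *lock;

                        /* locked add: releases the lock and is a full barrier */
                        add_smp(&lock->tickets.head, TICKET_LOCK_INC);

                        /* ...so this read cannot be satisfied from the store queue */
                        if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
                                __ticket_unlock_slowpath(lock, prev);
                } else
                        __add(&lock->tickets.head, TICKET_LOCK_INC,
                              UNLOCK_LOCK_PREFIX);
        }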
[PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-01 Thread Raghavendra K T

This series replaces the existing paravirtualized spinlock mechanism
with a paravirtualized ticketlock mechanism. The series provides
implementations for both Xen and KVM.

Changes in V9:
- Changed SPIN_THRESHOLD to 32k to avoid the excess halt exits that were
  causing under-commit degradation (after the PLE handler improvement).
- Added kvm_irq_delivery_to_apic (suggested by Gleb)
- Optimized halt exit path to use PLE handler

V8 of PVspinlock was posted last year. Following Avi's suggestion to look
at improvements to the PLE handler, various optimizations in PLE handling
have been tried.

With this series we see that we could get a little more improvement on top
of that.

Ticket locks have an inherent problem in a virtualized case, because
the vCPUs are scheduled rather than running concurrently (ignoring
gang scheduled vCPUs).  This can result in catastrophic performance
collapses when the vCPU scheduler doesn't schedule the correct next
vCPU, and ends up scheduling a vCPU which burns its entire timeslice
spinning.  (Note that this is not the same problem as lock-holder
preemption, which this series also addresses; that's also a problem,
but not catastrophic).

(See Thomas Friebel's talk "Prevent Guests from Spinning Around"
http://www.xen.org/files/xensummitboston08/LHP.pdf for more details.)

Currently we deal with this by having PV spinlocks, which add a layer
of indirection in front of all the spinlock functions, and define a
completely new implementation for Xen (and for other pvops users, but
there are none at present).

PV ticketlocks keep the existing ticketlock implementation
(fastpath) as-is, but add a couple of pvops for the slow paths:

- If a CPU has been waiting for a spinlock for SPIN_THRESHOLD
  iterations, then call out to the __ticket_lock_spinning() pvop,
  which allows a backend to block the vCPU rather than spinning.  This
  pvop can set the lock into slowpath state.

- When releasing a lock, if it is in slowpath state, call
  __ticket_unlock_kick() to kick the next vCPU in line awake (a minimal
  sketch of both hooks follows this list).  If the lock is no longer in
  contention, it also clears the slowpath flag.
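
As a rough sketch of the shape of these two hooks (type names simplified
here; the series routes them through the pvops machinery rather than a
plain ops struct like this, so take it as illustration only):

typedef unsigned char __ticket_t;               /* "small" tickets */

typedef struct arch_spinlock {
        union {
                unsigned short head_tail;
                struct { __ticket_t head, tail; } tickets;
        };
} arch_spinlock_t;

struct pv_lock_ops {
        /* called after SPIN_THRESHOLD iterations of spinning */
        void (*lock_spinning)(arch_spinlock_t *lock, __ticket_t ticket);
        /* called on unlock when the lock is in slowpath state */
        void (*unlock_kick)(arch_spinlock_t *lock, __ticket_t ticket);
};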

The slowpath state is stored in the LSB of the lock's tail ticket.  This
has the effect of reducing the max number of CPUs by half (so a small
ticket can deal with 128 CPUs, and a large ticket with 32768).
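
Concretely (a sketch consistent with the description above; the actual
macro names and values in the patches may differ), reserving bit 0 of the
tail for the flag means tickets have to advance in steps of two, which is
where the halving of the usable ticket space comes from:

#define TICKET_SLOWPATH_FLAG    ((__ticket_t)1) /* bit 0 of the tail      */
#define TICKET_LOCK_INC         ((__ticket_t)2) /* tickets step by 2      */

/* 8-bit tickets:  256/2 = 128 CPUs;  16-bit tickets: 65536/2 = 32768 */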

For KVM, one hypercall is introduced in the hypervisor that allows a vCPU
to kick another vCPU out of halt state.
The blocking of a vCPU is done using halt() in the (lock_spinning) slowpath.
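
A heavily condensed sketch of that KVM guest side (the per-cpu bookkeeping
that lets the unlocker find the right waiter, interrupt handling and the
slowpath-flag details are omitted; find_waiting_cpu() is a made-up
placeholder for that bookkeeping, and the hypercall is assumed to be
exposed as KVM_HC_KICK_CPU as in the posted patches):

static void kvm_lock_spinning(arch_spinlock_t *lock, __ticket_t want)
{
        /* record (lock, want) so unlockers can find us, re-check the
         * ticket, then block in halt() until kicked or interrupted */
        if (ACCESS_ONCE(lock->tickets.head) == want)
                return;
        halt();
}

static void kvm_unlock_kick(arch_spinlock_t *lock, __ticket_t ticket)
{
        int cpu = find_waiting_cpu(lock, ticket);

        if (cpu >= 0)
                kvm_hypercall2(KVM_HC_KICK_CPU, 0,
                               per_cpu(x86_cpu_to_apicid, cpu));
}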

Overall, it results in a large reduction in code, it makes the native
and virtualized cases closer, and it removes a layer of indirection
around all the spinlock functions.

The fast path (taking an uncontended lock which isn't in slowpath
state) is optimal, identical to the non-paravirtualized case.

The inner part of ticket lock code becomes:
        inc = xadd(&lock->tickets, inc);
        inc.tail &= ~TICKET_SLOWPATH_FLAG;

        if (likely(inc.head == inc.tail))
                goto out;
        for (;;) {
                unsigned count = SPIN_THRESHOLD;
                do {
                        if (ACCESS_ONCE(lock->tickets.head) == inc.tail)
                                goto out;
                        cpu_relax();
                } while (--count);
                __ticket_lock_spinning(lock, inc.tail);
        }
out:    barrier();
which results in:
    push   %rbp
    mov    %rsp,%rbp

    mov    $0x200,%eax
    lock xadd %ax,(%rdi)
    movzbl %ah,%edx
    cmp    %al,%dl
    jne    1f              # Slowpath if lock in contention

    pop    %rbp
    retq

### SLOWPATH START
1:  and    $-2,%edx
    movzbl %dl,%esi

2:  mov    $0x800,%eax
    jmp    4f

3:  pause
    sub    $0x1,%eax
    je     5f

4:  movzbl (%rdi),%ecx
    cmp    %cl,%dl
    jne    3b

    pop    %rbp
    retq

5:  callq  *__ticket_lock_spinning
    jmp    2b
### SLOWPATH END

With CONFIG_PARAVIRT_SPINLOCKS=n, the code changes slightly: the fastpath
case is straight through (taking the lock without contention), and the
spin loop is out of line:

    push   %rbp
    mov    %rsp,%rbp

    mov    $0x100,%eax
    lock xadd %ax,(%rdi)
    movzbl %ah,%edx
    cmp    %al,%dl
    jne    1f

    pop    %rbp
    retq

### SLOWPATH START
1:  pause
    movzbl (%rdi),%eax
    cmp    %dl,%al
    jne    1b

    pop    %rbp
    retq
### SLOWPATH END

The unlock code is complicated by the need to both add to the lock's
head and fetch the slowpath flag from the tail.  This version of the
patch uses a locked add to do this, followed by a test to see if the
slowpath flag is set.  The lock prefix acts as a full memory barrier, so we
can be sure that other CPUs will have seen the unlock before we read
the flag (without the barrier the read could be fetched from the
store queue before it hits 
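
A minimal sketch of that unlock sequence (using the types and flag
constants from the sketches above, and assuming a locked-add helper such
as add_smp(); the actual patch is structured somewhat differently):

static __always_inline void __ticket_unlock(arch_spinlock_t *lock)
{
        /* the locked add releases the lock and is a full memory barrier */
        add_smp(&lock->tickets.head, TICKET_LOCK_INC);

        /* only safe to look at the tail because of the barrier above */
        if (unlikely(ACCESS_ONCE(lock->tickets.tail) & TICKET_SLOWPATH_FLAG))
                __ticket_unlock_kick(lock, ACCESS_ONCE(lock->tickets.head));
}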

Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-01 Thread Andi Kleen

FWIW I use the paravirt spinlock ops for adding lock elision
to the spinlocks.

This needs to be done at the top level (so the level you're removing)

However I don't like the pv mechanism very much and would
be fine with using a static key hook in the main path
like I do for all the other lock types.

It also uses interrupt ops patching; for that it would
still be needed, though.
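
For illustration only (not from this thread), such a static-key hook in
the lock path might look roughly like this; the key name is made up here:

struct static_key paravirt_ticketlocks_enabled = STATIC_KEY_INIT_FALSE;

static __always_inline void maybe_lock_spinning(arch_spinlock_t *lock,
                                                __ticket_t ticket)
{
        /* the branch is patched out unless the key is enabled at boot */
        if (static_key_false(&paravirt_ticketlocks_enabled))
                __ticket_lock_spinning(lock, ticket);
}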

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only.


Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-01 Thread Jeremy Fitzhardinge
On 06/01/2013 01:14 PM, Andi Kleen wrote:
 FWIW I use the paravirt spinlock ops for adding lock elision
 to the spinlocks.

Does lock elision still use the ticketlock algorithm/structure, or are
they different?  If they're still basically ticketlocks, then it seems
to me that they're complementary - hle handles the fastpath, and pv the
slowpath.

 This needs to be done at the top level (so the level you're removing)

 However I don't like the pv mechanism very much and would 
 be fine with using a static key hook in the main path
 like I do for all the other lock types.

Right.

J


Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-01 Thread Andi Kleen
On Sat, Jun 01, 2013 at 01:28:00PM -0700, Jeremy Fitzhardinge wrote:
 On 06/01/2013 01:14 PM, Andi Kleen wrote:
  FWIW I use the paravirt spinlock ops for adding lock elision
  to the spinlocks.
 
 Does lock elision still use the ticketlock algorithm/structure, or are
 they different?  If they're still basically ticketlocks, then it seems
 to me that they're complementary - hle handles the fastpath, and pv the
 slowpath.

It uses the ticketlock algorithm/structure, but:

- it needs to know that the lock is free with its own operation (see the
  sketch below)
- it has an additional field for strong adaptation state
(but that field is independent of the low level lock implementation,
so can be used with any kind of lock)

So currently it inlines the ticket lock code into its own.

Doing pv on the slow path would be possible, but would need
some additional (minor) hooks I think.
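
The "lock is free" check from the first point above could, as a sketch
(not from the patches), look like the following; the slowpath flag has to
be masked out so a lock marked for the PV slowpath still reads as free or
busy correctly:

static inline bool ticket_lock_is_free(arch_spinlock_t lock)
{
        return lock.tickets.head ==
               (lock.tickets.tail & ~TICKET_SLOWPATH_FLAG);
}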

-Andi

-- 
a...@linux.intel.com -- Speaking for myself only.