Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On 07/10/2013 04:03 PM, Gleb Natapov wrote: [...] trimmed

Yes, you are right. Dynamic ple window was an attempt to solve it. Problem is, reducing the SPIN_THRESHOLD results in excess halt exits in under-commit, and increasing ple_window may sometimes be counter-productive as it affects other busy-wait constructs such as flush_tlb AFAIK. So if we could have had a dynamically changing SPIN_THRESHOLD too, that would be nice.

Gleb, Andrew, I tested with the global ple window change (similar to what I posted here https://lkml.org/lkml/2012/11/11/14 ).

This does not look global. It changes PLE per vcpu.

Okay, got it. I was thinking it would change the global value. But IIRC it is changing the global sysfs value and the per vcpu ple_window. Sorry, I missed this part yesterday. But I did not see a good result. Maybe it is good to go with a per VM ple_window.

Gleb, can you elaborate a little more on what you have in mind regarding a per VM ple_window? (Maintaining part of it as a per vm variable is clear to me.) But is it that we have to load that every time on guest entry?

Only when it changes, which shouldn't be too often, no?

Ok. Thinking about how to do it: read the register and write it back if there needs to be a change during guest entry?

-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On Thu, Jul 11, 2013 at 02:43:03PM +0530, Raghavendra K T wrote: On 07/10/2013 04:03 PM, Gleb Natapov wrote: [...] trimmed

Yes, you are right. Dynamic ple window was an attempt to solve it. Problem is, reducing the SPIN_THRESHOLD results in excess halt exits in under-commit, and increasing ple_window may sometimes be counter-productive as it affects other busy-wait constructs such as flush_tlb AFAIK. So if we could have had a dynamically changing SPIN_THRESHOLD too, that would be nice.

Gleb, Andrew, I tested with the global ple window change (similar to what I posted here https://lkml.org/lkml/2012/11/11/14 ).

This does not look global. It changes PLE per vcpu.

Okay, got it. I was thinking it would change the global value. But IIRC it is changing the global sysfs value and the per vcpu ple_window. Sorry, I missed this part yesterday.

Yes, it changes the sysfs value, but this does not affect already created vcpus.

But I did not see a good result. Maybe it is good to go with a per VM ple_window.

Gleb, can you elaborate a little more on what you have in mind regarding a per VM ple_window? (Maintaining part of it as a per vm variable is clear to me.) But is it that we have to load that every time on guest entry?

Only when it changes, which shouldn't be too often, no?

Ok. Thinking about how to do it: read the register and write it back if there needs to be a change during guest entry?

Why not do it like in the patch you've linked? When the value changes, write it to the VMCS of the current vcpu.

-- Gleb.
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On 07/11/2013 03:18 PM, Gleb Natapov wrote: On Thu, Jul 11, 2013 at 02:43:03PM +0530, Raghavendra K T wrote: On 07/10/2013 04:03 PM, Gleb Natapov wrote: [...] trimmed

Yes, you are right. Dynamic ple window was an attempt to solve it. Problem is, reducing the SPIN_THRESHOLD results in excess halt exits in under-commit, and increasing ple_window may sometimes be counter-productive as it affects other busy-wait constructs such as flush_tlb AFAIK. So if we could have had a dynamically changing SPIN_THRESHOLD too, that would be nice.

Gleb, Andrew, I tested with the global ple window change (similar to what I posted here https://lkml.org/lkml/2012/11/11/14 ).

This does not look global. It changes PLE per vcpu.

Okay, got it. I was thinking it would change the global value. But IIRC it is changing the global sysfs value and the per vcpu ple_window. Sorry, I missed this part yesterday.

Yes, it changes the sysfs value, but this does not affect already created vcpus.

But I did not see a good result. Maybe it is good to go with a per VM ple_window.

Gleb, can you elaborate a little more on what you have in mind regarding a per VM ple_window? (Maintaining part of it as a per vm variable is clear to me.) But is it that we have to load that every time on guest entry?

Only when it changes, which shouldn't be too often, no?

Ok. Thinking about how to do it: read the register and write it back if there needs to be a change during guest entry?

Why not do it like in the patch you've linked? When the value changes, write it to the VMCS of the current vcpu.

Yes, it can be done. So the running vcpu's ple_window gets updated only after the next PLE exit, right?
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On Thu, Jul 11, 2013 at 03:40:38PM +0530, Raghavendra K T wrote:

Gleb, can you elaborate a little more on what you have in mind regarding a per VM ple_window? (Maintaining part of it as a per vm variable is clear to me.) But is it that we have to load that every time on guest entry?

Only when it changes, which shouldn't be too often, no?

Ok. Thinking about how to do it: read the register and write it back if there needs to be a change during guest entry?

Why not do it like in the patch you've linked? When the value changes, write it to the VMCS of the current vcpu.

Yes, it can be done. So the running vcpu's ple_window gets updated only after the next PLE exit, right?

I am not sure what you mean. You cannot change a vcpu's ple_window while the vcpu is in guest mode.

-- Gleb.
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On 07/11/2013 03:41 PM, Gleb Natapov wrote: On Thu, Jul 11, 2013 at 03:40:38PM +0530, Raghavendra K T wrote:

Gleb, can you elaborate a little more on what you have in mind regarding a per VM ple_window? (Maintaining part of it as a per vm variable is clear to me.) But is it that we have to load that every time on guest entry?

Only when it changes, which shouldn't be too often, no?

Ok. Thinking about how to do it: read the register and write it back if there needs to be a change during guest entry?

Why not do it like in the patch you've linked? When the value changes, write it to the VMCS of the current vcpu.

Yes, it can be done. So the running vcpu's ple_window gets updated only after the next PLE exit, right?

I am not sure what you mean. You cannot change a vcpu's ple_window while the vcpu is in guest mode.

I agree with that; both of us are on the same page. What I meant is: suppose the per VM ple_window changes while a vcpu x of that VM was running; it will get its ple_window updated during the next run.
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On Thu, Jul 11, 2013 at 04:23:58PM +0530, Raghavendra K T wrote: On 07/11/2013 03:41 PM, Gleb Natapov wrote: On Thu, Jul 11, 2013 at 03:40:38PM +0530, Raghavendra K T wrote:

Gleb, can you elaborate a little more on what you have in mind regarding a per VM ple_window? (Maintaining part of it as a per vm variable is clear to me.) But is it that we have to load that every time on guest entry?

Only when it changes, which shouldn't be too often, no?

Ok. Thinking about how to do it: read the register and write it back if there needs to be a change during guest entry?

Why not do it like in the patch you've linked? When the value changes, write it to the VMCS of the current vcpu.

Yes, it can be done. So the running vcpu's ple_window gets updated only after the next PLE exit, right?

I am not sure what you mean. You cannot change a vcpu's ple_window while the vcpu is in guest mode.

I agree with that; both of us are on the same page. What I meant is: suppose the per VM ple_window changes while a vcpu x of that VM was running; it will get its ple_window updated during the next run.

Ah, I think "per VM" is what confuses me. Why do you want to have a per VM ple_window and not a per vcpu one? With a per vcpu one, the ple_window cannot change while the vcpu is running.

-- Gleb.
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On 07/11/2013 04:26 PM, Gleb Natapov wrote: On Thu, Jul 11, 2013 at 04:23:58PM +0530, Raghavendra K T wrote: On 07/11/2013 03:41 PM, Gleb Natapov wrote: On Thu, Jul 11, 2013 at 03:40:38PM +0530, Raghavendra K T wrote:

Gleb, can you elaborate a little more on what you have in mind regarding a per VM ple_window? (Maintaining part of it as a per vm variable is clear to me.) But is it that we have to load that every time on guest entry?

Only when it changes, which shouldn't be too often, no?

Ok. Thinking about how to do it: read the register and write it back if there needs to be a change during guest entry?

Why not do it like in the patch you've linked? When the value changes, write it to the VMCS of the current vcpu.

Yes, it can be done. So the running vcpu's ple_window gets updated only after the next PLE exit, right?

I am not sure what you mean. You cannot change a vcpu's ple_window while the vcpu is in guest mode.

I agree with that; both of us are on the same page. What I meant is: suppose the per VM ple_window changes while a vcpu x of that VM was running; it will get its ple_window updated during the next run.

Ah, I think "per VM" is what confuses me. Why do you want to have a per VM ple_window and not a per vcpu one? With a per vcpu one, the ple_window cannot change while the vcpu is running.

Okay, got that. My initial feeling was that a vcpu would not feel the global load. But I think that should be no problem; instead we will not need atomic operations to update the ple_window, which is better.
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On Tue, Jul 09, 2013 at 02:41:30PM +0530, Raghavendra K T wrote: On 06/26/2013 11:24 PM, Raghavendra K T wrote: On 06/26/2013 09:41 PM, Gleb Natapov wrote: On Wed, Jun 26, 2013 at 07:10:21PM +0530, Raghavendra K T wrote: On 06/26/2013 06:22 PM, Gleb Natapov wrote: On Wed, Jun 26, 2013 at 01:37:45PM +0200, Andrew Jones wrote: On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote: On 06/25/2013 08:20 PM, Andrew Theurer wrote: On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote:

This series replaces the existing paravirtualized spinlock mechanism with a paravirtualized ticketlock mechanism. The series provides an implementation for both Xen and KVM.

Changes in V9:
- Changed spin_threshold to 32k to avoid excess halt exits that were causing undercommit degradation (after the PLE handler improvement).
- Added kvm_irq_delivery_to_apic (suggested by Gleb)
- Optimized the halt exit path to use the PLE handler

V8 of PVspinlock was posted last year. After Avi's suggestions to look at the PLE handler's improvements, various optimizations in PLE handling have been tried. Sorry for not posting this sooner.

I have tested the v9 pv-ticketlock patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs. I have tested these patches with and without PLE, as PLE is still not scalable with large VMs.

Hi Andrew, thanks for testing.

System: x3850X5, 40 cores, 80 threads

1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
--------------------------------------------------------------
Configuration            Total Throughput (MB/s)  Notes
3.10-default-ple_on      22945   5% CPU in host kernel, 2% spin_lock in guests
3.10-default-ple_off     23184   5% CPU in host kernel, 2% spin_lock in guests
3.10-pvticket-ple_on     22895   5% CPU in host kernel, 2% spin_lock in guests
3.10-pvticket-ple_off    23051   5% CPU in host kernel, 2% spin_lock in guests
[all 1x results look good here]

Yes. The 1x results look too close.

2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
--------------------------------------------------------------
Configuration            Total Throughput (MB/s)  Notes
3.10-default-ple_on       6287  55% CPU in host kernel, 17% spin_lock in guests
3.10-default-ple_off      1849   2% CPU in host kernel, 95% spin_lock in guests
3.10-pvticket-ple_on      6691  50% CPU in host kernel, 15% spin_lock in guests
3.10-pvticket-ple_off    16464   8% CPU in host kernel, 33% spin_lock in guests
[PLE hinders pv-ticket improvements, but even with PLE off, we are still off from ideal throughput (somewhere 2)]

I see a 6.426% improvement with ple_on and a 161.87% improvement with ple_off. I think this is a very good sign for the patches.

Okay, the ideal throughput you are referring to is getting at least around 80% of the 1x throughput for over-commit. Yes, we are still far away from there.

1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
--------------------------------------------------------------
Configuration            Total Throughput (MB/s)  Notes
3.10-default-ple_on      22736   6% CPU in host kernel, 3% spin_lock in guests
3.10-default-ple_off     23377   5% CPU in host kernel, 3% spin_lock in guests
3.10-pvticket-ple_on     22471   6% CPU in host kernel, 3% spin_lock in guests
3.10-pvticket-ple_off    23445   5% CPU in host kernel, 3% spin_lock in guests
[1x looking fine here]

I see ple_off is a little better here.

2x over-commit with 20-vCPU VMs (8 VMs) all running dbench:
--------------------------------------------------------------
Configuration            Total Throughput (MB/s)  Notes
3.10-default-ple_on       1965  70% CPU in host kernel, 34% spin_lock in guests
3.10-default-ple_off       226   2% CPU in host kernel, 94% spin_lock in guests
3.10-pvticket-ple_on      1942  70% CPU in host kernel, 35% spin_lock in guests
3.10-pvticket-ple_off     8003  11% CPU in host kernel, 70% spin_lock in guests
[quite bad all around, but pv-tickets with PLE off the best so far. Still quite a bit off from ideal throughput]

This is again a remarkable improvement (307%). This motivates me to add a patch to disable ple when pvspinlock is on. Probably we can add a hypercall that disables ple in the kvm init patch. But the only problem I see is what if the guests are mixed (i.e. one guest has pvspinlock support but the other does not, while the host supports pv).

How about reintroducing the idea to create per-kvm ple_gap, ple_window state? We were headed down that road when considering a dynamic window at one point. Then
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On Wed, Jul 10, 2013 at 01:33:25PM +0300, Gleb Natapov wrote:

Here's an idea, trim the damn email ;-) -- not only directed at gleb.

Ingo, Gleb, from the results perspective, Andrew Theurer's and Vinod's test results are pro-pvspinlock. Could you please help me to know what will make it a mergeable candidate?

I need to spend more time reviewing it :) The problem with PV interfaces is that they are easy to add but hard to get rid of if a better solution (HW or otherwise) appears.

How so? Just make sure the registration for the PV interface is optional; that is, allow it to fail. A guest that fails the PV setup will either have to try another PV interface or fall back to 'native'.

I agree that Jiannan's Preemptable Lock idea is promising and we could evaluate that approach, make the best one get into the kernel, and I will also carry on the discussion with Jiannan to improve that patch.

That would be great. The work is stalled from what I can tell. I absolutely hated that stuff because it wrecked the native code.
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On Wed, Jul 10, 2013 at 12:40:47PM +0200, Peter Zijlstra wrote: On Wed, Jul 10, 2013 at 01:33:25PM +0300, Gleb Natapov wrote:

Here's an idea, trim the damn email ;-) -- not only directed at gleb.

Good idea.

Ingo, Gleb, from the results perspective, Andrew Theurer's and Vinod's test results are pro-pvspinlock. Could you please help me to know what will make it a mergeable candidate?

I need to spend more time reviewing it :) The problem with PV interfaces is that they are easy to add but hard to get rid of if a better solution (HW or otherwise) appears.

How so? Just make sure the registration for the PV interface is optional; that is, allow it to fail. A guest that fails the PV setup will either have to try another PV interface or fall back to 'native'.

We have to carry PV around for live migration purposes. The PV interface cannot disappear under a running guest.

I agree that Jiannan's Preemptable Lock idea is promising and we could evaluate that approach, make the best one get into the kernel, and I will also carry on the discussion with Jiannan to improve that patch.

That would be great. The work is stalled from what I can tell. I absolutely hated that stuff because it wrecked the native code.

Yes, the idea was to hide it from native code behind PV hooks.

-- Gleb.
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On 07/10/2013 04:03 PM, Gleb Natapov wrote: On Tue, Jul 09, 2013 at 02:41:30PM +0530, Raghavendra K T wrote: On 06/26/2013 11:24 PM, Raghavendra K T wrote: On 06/26/2013 09:41 PM, Gleb Natapov wrote: On Wed, Jun 26, 2013 at 07:10:21PM +0530, Raghavendra K T wrote: On 06/26/2013 06:22 PM, Gleb Natapov wrote: On Wed, Jun 26, 2013 at 01:37:45PM +0200, Andrew Jones wrote: On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote: On 06/25/2013 08:20 PM, Andrew Theurer wrote: On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote:

This series replaces the existing paravirtualized spinlock mechanism with a paravirtualized ticketlock mechanism. The series provides an implementation for both Xen and KVM.

Changes in V9:
- Changed spin_threshold to 32k to avoid excess halt exits that were causing undercommit degradation (after the PLE handler improvement).
- Added kvm_irq_delivery_to_apic (suggested by Gleb)
- Optimized the halt exit path to use the PLE handler

V8 of PVspinlock was posted last year. After Avi's suggestions to look at the PLE handler's improvements, various optimizations in PLE handling have been tried. Sorry for not posting this sooner.

I have tested the v9 pv-ticketlock patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs. I have tested these patches with and without PLE, as PLE is still not scalable with large VMs.

Hi Andrew, thanks for testing.

System: x3850X5, 40 cores, 80 threads

1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
--------------------------------------------------------------
Configuration            Total Throughput (MB/s)  Notes
3.10-default-ple_on      22945   5% CPU in host kernel, 2% spin_lock in guests
3.10-default-ple_off     23184   5% CPU in host kernel, 2% spin_lock in guests
3.10-pvticket-ple_on     22895   5% CPU in host kernel, 2% spin_lock in guests
3.10-pvticket-ple_off    23051   5% CPU in host kernel, 2% spin_lock in guests
[all 1x results look good here]

Yes. The 1x results look too close.

2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
--------------------------------------------------------------
Configuration            Total Throughput (MB/s)  Notes
3.10-default-ple_on       6287  55% CPU in host kernel, 17% spin_lock in guests
3.10-default-ple_off      1849   2% CPU in host kernel, 95% spin_lock in guests
3.10-pvticket-ple_on      6691  50% CPU in host kernel, 15% spin_lock in guests
3.10-pvticket-ple_off    16464   8% CPU in host kernel, 33% spin_lock in guests
[PLE hinders pv-ticket improvements, but even with PLE off, we are still off from ideal throughput (somewhere 2)]

I see a 6.426% improvement with ple_on and a 161.87% improvement with ple_off. I think this is a very good sign for the patches.

Okay, the ideal throughput you are referring to is getting at least around 80% of the 1x throughput for over-commit. Yes, we are still far away from there.

1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
--------------------------------------------------------------
Configuration            Total Throughput (MB/s)  Notes
3.10-default-ple_on      22736   6% CPU in host kernel, 3% spin_lock in guests
3.10-default-ple_off     23377   5% CPU in host kernel, 3% spin_lock in guests
3.10-pvticket-ple_on     22471   6% CPU in host kernel, 3% spin_lock in guests
3.10-pvticket-ple_off    23445   5% CPU in host kernel, 3% spin_lock in guests
[1x looking fine here]

I see ple_off is a little better here.

2x over-commit with 20-vCPU VMs (8 VMs) all running dbench:
--------------------------------------------------------------
Configuration            Total Throughput (MB/s)  Notes
3.10-default-ple_on       1965  70% CPU in host kernel, 34% spin_lock in guests
3.10-default-ple_off       226   2% CPU in host kernel, 94% spin_lock in guests
3.10-pvticket-ple_on      1942  70% CPU in host kernel, 35% spin_lock in guests
3.10-pvticket-ple_off     8003  11% CPU in host kernel, 70% spin_lock in guests
[quite bad all around, but pv-tickets with PLE off the best so far. Still quite a bit off from ideal throughput]

This is again a remarkable improvement (307%). This motivates me to add a patch to disable ple when pvspinlock is on. Probably we can add a hypercall that disables ple in the kvm init patch. But the only problem I see is what if the guests are mixed (i.e. one guest has pvspinlock support but the other does not, while the host supports pv).

How about reintroducing the idea to create per-kvm ple_gap, ple_window state? We were headed down that road when considering a dynamic window at one point. Then you can just set a single guest's ple_gap to zero, which would lead to
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On 07/10/2013 04:17 PM, Gleb Natapov wrote: On Wed, Jul 10, 2013 at 12:40:47PM +0200, Peter Zijlstra wrote: On Wed, Jul 10, 2013 at 01:33:25PM +0300, Gleb Natapov wrote:

Here's an idea, trim the damn email ;-) -- not only directed at gleb.

Good idea.

Ingo, Gleb, from the results perspective, Andrew Theurer's and Vinod's test results are pro-pvspinlock. Could you please help me to know what will make it a mergeable candidate?

I need to spend more time reviewing it :) The problem with PV interfaces is that they are easy to add but hard to get rid of if a better solution (HW or otherwise) appears.

How so? Just make sure the registration for the PV interface is optional; that is, allow it to fail. A guest that fails the PV setup will either have to try another PV interface or fall back to 'native'.

We have to carry PV around for live migration purposes. The PV interface cannot disappear under a running guest.

IIRC, the only requirement was that the running state of the vcpu be retained. This was addressed by [PATCH RFC V10 13/18] kvm : Fold pv_unhalt flag into GET_MP_STATE ioctl to aid migration. I would have to know more if I missed something here.

I agree that Jiannan's Preemptable Lock idea is promising and we could evaluate that approach, make the best one get into the kernel, and I will also carry on the discussion with Jiannan to improve that patch.

That would be great. The work is stalled from what I can tell. I absolutely hated that stuff because it wrecked the native code.

Yes, the idea was to hide it from native code behind PV hooks.
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On Wed, Jul 10, 2013 at 04:58:29PM +0530, Raghavendra K T wrote: On 07/10/2013 04:17 PM, Gleb Natapov wrote: On Wed, Jul 10, 2013 at 12:40:47PM +0200, Peter Zijlstra wrote: On Wed, Jul 10, 2013 at 01:33:25PM +0300, Gleb Natapov wrote:

Here's an idea, trim the damn email ;-) -- not only directed at gleb.

Good idea.

Ingo, Gleb, from the results perspective, Andrew Theurer's and Vinod's test results are pro-pvspinlock. Could you please help me to know what will make it a mergeable candidate?

I need to spend more time reviewing it :) The problem with PV interfaces is that they are easy to add but hard to get rid of if a better solution (HW or otherwise) appears.

How so? Just make sure the registration for the PV interface is optional; that is, allow it to fail. A guest that fails the PV setup will either have to try another PV interface or fall back to 'native'.

We have to carry PV around for live migration purposes. The PV interface cannot disappear under a running guest.

IIRC, the only requirement was that the running state of the vcpu be retained. This was addressed by [PATCH RFC V10 13/18] kvm : Fold pv_unhalt flag into GET_MP_STATE ioctl to aid migration. I would have to know more if I missed something here.

I was not talking about the state that has to be migrated, but the HV-guest interface that has to be preserved after migration.

-- Gleb.
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
Dropping Stephen because of bounce.

On 07/10/2013 04:58 PM, Raghavendra K T wrote: On 07/10/2013 04:17 PM, Gleb Natapov wrote: On Wed, Jul 10, 2013 at 12:40:47PM +0200, Peter Zijlstra wrote: On Wed, Jul 10, 2013 at 01:33:25PM +0300, Gleb Natapov wrote:

Here's an idea, trim the damn email ;-) -- not only directed at gleb.

Good idea.

Ingo, Gleb, from the results perspective, Andrew Theurer's and Vinod's test results are pro-pvspinlock. Could you please help me to know what will make it a mergeable candidate?

I need to spend more time reviewing it :) The problem with PV interfaces is that they are easy to add but hard to get rid of if a better solution (HW or otherwise) appears.

How so? Just make sure the registration for the PV interface is optional; that is, allow it to fail. A guest that fails the PV setup will either have to try another PV interface or fall back to 'native'.

Forgot to add: yes, currently pvspinlocks are not enabled by default, and we also have the jump_label mechanism to enable them. [...]
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On Wed, Jul 10, 2013 at 04:54:12PM +0530, Raghavendra K T wrote:

Ingo, Gleb, from the results perspective, Andrew Theurer's and Vinod's test results are pro-pvspinlock. Could you please help me to know what will make it a mergeable candidate?

I need to spend more time reviewing it :) The problem with PV interfaces is that they are easy to add but hard to get rid of if a better solution (HW or otherwise) appears.

In fact Avi had acked the whole V8 series, but delayed it to see how the PLE improvement would affect it. I see that Ingo was happy with it too. The only additions since that series have been: 1. tuning the SPIN_THRESHOLD to 32k (from 2k), and 2. the halt handler now calls vcpu_on_spin to take advantage of the PLE improvements (this can also go as an independent patch into kvm).

The rationale for making SPIN_THRESHOLD 32k needs a big explanation. Before the PLE improvements, as you know, the kvm undercommit scenario was much worse in ple enabled cases (compared to ple disabled cases). The pvspinlock patches behaved equally badly in undercommit. Both had a similar reason, so at the end there was no degradation w.r.t. base. The reason for the bad performance in the PLE case was unneeded vcpu iteration in the ple handler, resulting in high yield_to calls and double run queue locks. With pvspinlock applied, the same villain role was played by excessive halt exits. But after the ple handler improved, we needed to throttle unnecessary halts in undercommit for pvspinlock to be on par with the 1x result.

Makes sense. I will review it ASAP. BTW the latest version is V10, right?

-- Gleb.
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On 07/10/2013 05:11 PM, Gleb Natapov wrote: On Wed, Jul 10, 2013 at 04:54:12PM +0530, Raghavendra K T wrote:

Ingo, Gleb, from the results perspective, Andrew Theurer's and Vinod's test results are pro-pvspinlock. Could you please help me to know what will make it a mergeable candidate?

I need to spend more time reviewing it :) The problem with PV interfaces is that they are easy to add but hard to get rid of if a better solution (HW or otherwise) appears.

In fact Avi had acked the whole V8 series, but delayed it to see how the PLE improvement would affect it. I see that Ingo was happy with it too. The only additions since that series have been: 1. tuning the SPIN_THRESHOLD to 32k (from 2k), and 2. the halt handler now calls vcpu_on_spin to take advantage of the PLE improvements (this can also go as an independent patch into kvm).

The rationale for making SPIN_THRESHOLD 32k needs a big explanation. Before the PLE improvements, as you know, the kvm undercommit scenario was much worse in ple enabled cases (compared to ple disabled cases). The pvspinlock patches behaved equally badly in undercommit. Both had a similar reason, so at the end there was no degradation w.r.t. base. The reason for the bad performance in the PLE case was unneeded vcpu iteration in the ple handler, resulting in high yield_to calls and double run queue locks. With pvspinlock applied, the same villain role was played by excessive halt exits. But after the ple handler improved, we needed to throttle unnecessary halts in undercommit for pvspinlock to be on par with the 1x result.

Makes sense. I will review it ASAP. BTW the latest version is V10, right?

Yes. Thank you.
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On Wed, Jul 10, 2013 at 01:47:17PM +0300, Gleb Natapov wrote: On Wed, Jul 10, 2013 at 12:40:47PM +0200, Peter Zijlstra wrote: On Wed, Jul 10, 2013 at 01:33:25PM +0300, Gleb Natapov wrote:

Here's an idea, trim the damn email ;-) -- not only directed at gleb.

Good idea.

Ingo, Gleb, from the results perspective, Andrew Theurer's and Vinod's test results are pro-pvspinlock. Could you please help me to know what will make it a mergeable candidate?

I need to spend more time reviewing it :) The problem with PV interfaces is that they are easy to add but hard to get rid of if a better solution (HW or otherwise) appears.

How so? Just make sure the registration for the PV interface is optional; that is, allow it to fail. A guest that fails the PV setup will either have to try another PV interface or fall back to 'native'.

We have to carry PV around for live migration purposes. The PV interface cannot disappear under a running guest.

Why can't it? This is the same as handling, say, XSAVE operations. Some hosts might have it - some might not. It is the job of the toolstack to make sure not to migrate to the hosts which don't have it. Or bind the guest to the lowest interface (so don't enable the PV interface if the other hosts in the cluster can't support this flag)?

I agree that Jiannan's Preemptable Lock idea is promising and we could evaluate that approach, make the best one get into the kernel, and I will also carry on the discussion with Jiannan to improve that patch.

That would be great. The work is stalled from what I can tell. I absolutely hated that stuff because it wrecked the native code.

Yes, the idea was to hide it from native code behind PV hooks.

-- Gleb.
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On Wed, Jul 10, 2013 at 11:03:15AM -0400, Konrad Rzeszutek Wilk wrote: On Wed, Jul 10, 2013 at 01:47:17PM +0300, Gleb Natapov wrote: On Wed, Jul 10, 2013 at 12:40:47PM +0200, Peter Zijlstra wrote: On Wed, Jul 10, 2013 at 01:33:25PM +0300, Gleb Natapov wrote:

Here's an idea, trim the damn email ;-) -- not only directed at gleb.

Good idea.

Ingo, Gleb, from the results perspective, Andrew Theurer's and Vinod's test results are pro-pvspinlock. Could you please help me to know what will make it a mergeable candidate?

I need to spend more time reviewing it :) The problem with PV interfaces is that they are easy to add but hard to get rid of if a better solution (HW or otherwise) appears.

How so? Just make sure the registration for the PV interface is optional; that is, allow it to fail. A guest that fails the PV setup will either have to try another PV interface or fall back to 'native'.

We have to carry PV around for live migration purposes. The PV interface cannot disappear under a running guest.

Why can't it? This is the same as handling, say, XSAVE operations. Some hosts might have it - some might not. It is the job of the toolstack to make sure not to migrate to the hosts which don't have it. Or bind the guest to the lowest interface (so don't enable the PV interface if the other hosts in the cluster can't support this flag)?

XSAVE is a HW feature and it is not going to disappear under you after a software upgrade. Upgrading the kernel on part of your hosts and no longer being able to migrate to them is not something people who use live migration expect. In practice it means that updating all hosts in a datacenter to a newer kernel is no longer possible without rebooting VMs.

-- Gleb.
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
Gleb Natapov g...@redhat.com wrote: On Wed, Jul 10, 2013 at 11:03:15AM -0400, Konrad Rzeszutek Wilk wrote: On Wed, Jul 10, 2013 at 01:47:17PM +0300, Gleb Natapov wrote: On Wed, Jul 10, 2013 at 12:40:47PM +0200, Peter Zijlstra wrote: On Wed, Jul 10, 2013 at 01:33:25PM +0300, Gleb Natapov wrote: Here's an idea, trim the damn email ;-) -- not only directed at gleb. Good idea. Ingo, Gleb, From the results perspective, Andrew Theurer's and Vinod's test results are pro-pvspinlock. Could you please help me understand what will make it a mergeable candidate? I need to spend more time reviewing it :) The problem with PV interfaces is that they are easy to add but hard to get rid of if a better solution (HW or otherwise) appears. How so? Just make sure the registration for the PV interface is optional; that is, allow it to fail. A guest that fails the PV setup will either have to try another PV interface or fall back to 'native'. We have to carry PV around for live migration purposes. A PV interface cannot disappear under a running guest. Why can't it? This is the same as handling, say, XSAVE: some hosts might have it, some might not. It is the job of the toolstack to make sure not to migrate to the hosts which don't have it. Or bind the guest to the lowest common interface (so don't enable the PV interface if the other hosts in the cluster can't support this flag)? XSAVE is a HW feature and it is not going to disappear under you after a software upgrade. Upgrading the kernel on part of your hosts and no longer being able to migrate to them is not something people who use live migration expect. In practice it means that updating all hosts in a datacenter to a newer kernel is no longer possible without rebooting VMs. -- Gleb. I see. Perhaps then, if the hardware becomes much better at this, another PV interface can be provided which will use the static_key to turn off the PV spinlock and use the bare-metal version (or perhaps some form of lock elision). 
That does mean the host has to do something when this PV interface is invoked for the older guests. Anyhow, that said, I think the benefits are pretty neat right now and help users much, and worrying about whether the hardware vendors will provide something new is not benefiting users. What perhaps then needs to be addressed is how to obsolete this mechanism if the hardware becomes superb? -- Sent from my Android phone. Please excuse my brevity.
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On 06/26/2013 11:24 PM, Raghavendra K T wrote: On 06/26/2013 09:41 PM, Gleb Natapov wrote: On Wed, Jun 26, 2013 at 07:10:21PM +0530, Raghavendra K T wrote: On 06/26/2013 06:22 PM, Gleb Natapov wrote: On Wed, Jun 26, 2013 at 01:37:45PM +0200, Andrew Jones wrote: On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote: On 06/25/2013 08:20 PM, Andrew Theurer wrote: On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote: This series replaces the existing paravirtualized spinlock mechanism with a paravirtualized ticketlock mechanism. The series provides implementation for both Xen and KVM. Changes in V9:
- Changed spin_threshold to 32k to avoid excess halt exits that are causing undercommit degradation (after PLE handler improvement).
- Added kvm_irq_delivery_to_apic (suggested by Gleb)
- Optimized halt exit path to use PLE handler
V8 of PVspinlock was posted last year. After Avi's suggestions to look at PLE handler's improvements, various optimizations in PLE handling have been tried. Sorry for not posting this sooner. I have tested the v9 pv-ticketlock patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs. I have tested these patches with and without PLE, as PLE is still not scalable with large VMs. Hi Andrew, Thanks for testing. System: x3850X5, 40 cores, 80 threads

1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
-----------------------------------------------------------
Configuration          Total Throughput(MB/s)  Notes
3.10-default-ple_on    22945                   5% CPU in host kernel, 2% spin_lock in guests
3.10-default-ple_off   23184                   5% CPU in host kernel, 2% spin_lock in guests
3.10-pvticket-ple_on   22895                   5% CPU in host kernel, 2% spin_lock in guests
3.10-pvticket-ple_off  23051                   5% CPU in host kernel, 2% spin_lock in guests

[all 1x results look good here] Yes. The 1x results look too close

2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
------------------------------------------------------------
Configuration          Total Throughput  Notes
3.10-default-ple_on    6287              55% CPU in host kernel, 17% spin_lock in guests
3.10-default-ple_off   1849              2% CPU in host kernel, 95% spin_lock in guests
3.10-pvticket-ple_on   6691              50% CPU in host kernel, 15% spin_lock in guests
3.10-pvticket-ple_off  16464             8% CPU in host kernel, 33% spin_lock in guests

I see 6.426% improvement with ple_on and 161.87% improvement with ple_off. I think this is a very good sign for the patches [PLE hinders pv-ticket improvements, but even with PLE off, we are still off from ideal throughput (somewhere 2)] Okay, the ideal throughput you are referring to is getting at least 80% of 1x throughput for over-commit. Yes, we are still far away from there.

1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
-----------------------------------------------------------
Configuration          Total Throughput  Notes
3.10-default-ple_on    22736             6% CPU in host kernel, 3% spin_lock in guests
3.10-default-ple_off   23377             5% CPU in host kernel, 3% spin_lock in guests
3.10-pvticket-ple_on   22471             6% CPU in host kernel, 3% spin_lock in guests
3.10-pvticket-ple_off  23445             5% CPU in host kernel, 3% spin_lock in guests

[1x looking fine here] I see ple_off is a little better here.

2x over-commit with 20-vCPU VMs (8 VMs) all running dbench:
-----------------------------------------------------------
Configuration          Total Throughput  Notes
3.10-default-ple_on    1965              70% CPU in host kernel, 34% spin_lock in guests
3.10-default-ple_off   226               2% CPU in host kernel, 94% spin_lock in guests
3.10-pvticket-ple_on   1942              70% CPU in host kernel, 35% spin_lock in guests
3.10-pvticket-ple_off  8003              11% CPU in host kernel, 70% spin_lock in guests

[quite bad all around, but pv-tickets with PLE off the best so far. Still quite a bit off from ideal throughput] This is again a remarkable improvement (307%). This motivates me to add a patch to disable ple when pvspinlock is on. Probably we can add a hypercall that disables ple in the kvm init patch, but the only problem I see is what if the guests are mixed. 
(i.e one guest has pvspinlock support but other does not. Host supports pv) How about reintroducing the idea to create per-kvm ple_gap,ple_window state. We were headed down that road when considering a dynamic window at one point. Then you can just set a single guest's ple_gap to zero, which would lead to PLE being disabled for that guest. We could also revisit the dynamic window then. Can be done, but lets
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On 06/26/2013 09:26 PM, Andrew Theurer wrote: On Wed, 2013-06-26 at 15:52 +0300, Gleb Natapov wrote: On Wed, Jun 26, 2013 at 01:37:45PM +0200, Andrew Jones wrote: On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote: On 06/25/2013 08:20 PM, Andrew Theurer wrote: On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote: This series replaces the existing paravirtualized spinlock mechanism with a paravirtualized ticketlock mechanism. The series provides implementation for both Xen and KVM. Changes in V9: - Changed spin_threshold to 32k to avoid excess halt exits that are causing undercommit degradation (after PLE handler improvement). - Added kvm_irq_delivery_to_apic (suggested by Gleb) - Optimized halt exit path to use PLE handler V8 of PVspinlock was posted last year. After Avi's suggestions to look at PLE handler's improvements, various optimizations in PLE handling have been tried. Sorry for not posting this sooner. I have tested the v9 pv-ticketlock patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs. I have tested these patches with and without PLE, as PLE is still not scalable with large VMs. Hi Andrew, Thanks for testing. System: x3850X5, 40 cores, 80 threads 1x over-commit with 10-vCPU VMs (8 VMs) all running dbench: -- Total Configuration Throughput(MB/s)Notes 3.10-default-ple_on 22945 5% CPU in host kernel, 2% spin_lock in guests 3.10-default-ple_off23184 5% CPU in host kernel, 2% spin_lock in guests 3.10-pvticket-ple_on22895 5% CPU in host kernel, 2% spin_lock in guests 3.10-pvticket-ple_off 23051 5% CPU in host kernel, 2% spin_lock in guests [all 1x results look good here] Yes. 
The 1x results look too close 2x over-commit with 10-vCPU VMs (16 VMs) all running dbench: --- Total Configuration Throughput Notes 3.10-default-ple_on 6287 55% CPU host kernel, 17% spin_lock in guests 3.10-default-ple_off 1849 2% CPU in host kernel, 95% spin_lock in guests 3.10-pvticket-ple_on 6691 50% CPU in host kernel, 15% spin_lock in guests 3.10-pvticket-ple_off 16464 8% CPU in host kernel, 33% spin_lock in guests I see 6.426% improvement with ple_on and 161.87% improvement with ple_off. I think this is a very good sign for the patches [PLE hinders pv-ticket improvements, but even with PLE off, we still off from ideal throughput (somewhere 2)] Okay, The ideal throughput you are referring is getting around atleast 80% of 1x throughput for over-commit. Yes we are still far away from there. 1x over-commit with 20-vCPU VMs (4 VMs) all running dbench: -- Total Configuration Throughput Notes 3.10-default-ple_on 22736 6% CPU in host kernel, 3% spin_lock in guests 3.10-default-ple_off23377 5% CPU in host kernel, 3% spin_lock in guests 3.10-pvticket-ple_on22471 6% CPU in host kernel, 3% spin_lock in guests 3.10-pvticket-ple_off 23445 5% CPU in host kernel, 3% spin_lock in guests [1x looking fine here] I see ple_off is little better here. 2x over-commit with 20-vCPU VMs (8 VMs) all running dbench: -- Total Configuration Throughput Notes 3.10-default-ple_on 1965 70% CPU in host kernel, 34% spin_lock in guests 3.10-default-ple_off 226 2% CPU in host kernel, 94% spin_lock in guests 3.10-pvticket-ple_on 1942 70% CPU in host kernel, 35% spin_lock in guests 3.10-pvticket-ple_off8003 11% CPU in host kernel, 70% spin_lock in guests [quite bad all around, but pv-tickets with PLE off the best so far. Still quite a bit off from ideal throughput] This is again a remarkable improvement (307%). This motivates me to add a patch to disable ple when pvspinlock is on. probably we can add a hypercall that disables ple in kvm init patch. 
but only problem I see is what if the guests are mixed. (i.e one guest has pvspinlock support but other does not. Host supports pv) How about
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On 06/25/2013 08:20 PM, Andrew Theurer wrote: On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote: This series replaces the existing paravirtualized spinlock mechanism with a paravirtualized ticketlock mechanism. The series provides implementation for both Xen and KVM. Changes in V9:
- Changed spin_threshold to 32k to avoid excess halt exits that are causing undercommit degradation (after PLE handler improvement).
- Added kvm_irq_delivery_to_apic (suggested by Gleb)
- Optimized halt exit path to use PLE handler
V8 of PVspinlock was posted last year. After Avi's suggestions to look at PLE handler's improvements, various optimizations in PLE handling have been tried. Sorry for not posting this sooner. I have tested the v9 pv-ticketlock patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs. I have tested these patches with and without PLE, as PLE is still not scalable with large VMs. Hi Andrew, Thanks for testing. System: x3850X5, 40 cores, 80 threads

1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
-----------------------------------------------------------
Configuration          Total Throughput(MB/s)  Notes
3.10-default-ple_on    22945                   5% CPU in host kernel, 2% spin_lock in guests
3.10-default-ple_off   23184                   5% CPU in host kernel, 2% spin_lock in guests
3.10-pvticket-ple_on   22895                   5% CPU in host kernel, 2% spin_lock in guests
3.10-pvticket-ple_off  23051                   5% CPU in host kernel, 2% spin_lock in guests

[all 1x results look good here] Yes. The 1x results look too close

2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
------------------------------------------------------------
Configuration          Total Throughput  Notes
3.10-default-ple_on    6287              55% CPU in host kernel, 17% spin_lock in guests
3.10-default-ple_off   1849              2% CPU in host kernel, 95% spin_lock in guests
3.10-pvticket-ple_on   6691              50% CPU in host kernel, 15% spin_lock in guests
3.10-pvticket-ple_off  16464             8% CPU in host kernel, 33% spin_lock in guests

I see 6.426% improvement with ple_on and 161.87% improvement with ple_off. I think this is a very good sign for the patches [PLE hinders pv-ticket improvements, but even with PLE off, we are still off from ideal throughput (somewhere 2)] Okay, the ideal throughput you are referring to is getting at least 80% of 1x throughput for over-commit. Yes, we are still far away from there.

1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
-----------------------------------------------------------
Configuration          Total Throughput  Notes
3.10-default-ple_on    22736             6% CPU in host kernel, 3% spin_lock in guests
3.10-default-ple_off   23377             5% CPU in host kernel, 3% spin_lock in guests
3.10-pvticket-ple_on   22471             6% CPU in host kernel, 3% spin_lock in guests
3.10-pvticket-ple_off  23445             5% CPU in host kernel, 3% spin_lock in guests

[1x looking fine here] I see ple_off is a little better here.

2x over-commit with 20-vCPU VMs (8 VMs) all running dbench:
-----------------------------------------------------------
Configuration          Total Throughput  Notes
3.10-default-ple_on    1965              70% CPU in host kernel, 34% spin_lock in guests
3.10-default-ple_off   226               2% CPU in host kernel, 94% spin_lock in guests
3.10-pvticket-ple_on   1942              70% CPU in host kernel, 35% spin_lock in guests
3.10-pvticket-ple_off  8003              11% CPU in host kernel, 70% spin_lock in guests

[quite bad all around, but pv-tickets with PLE off the best so far. Still quite a bit off from ideal throughput] This is again a remarkable improvement (307%). This motivates me to add a patch to disable ple when pvspinlock is on. Probably we can add a hypercall that disables ple in the kvm init patch, but the only problem I see is what if the guests are mixed. (i.e., one guest has pvspinlock support but the other does not. Host supports pv) /me thinks In summary, I would state that the pv-ticket is an overall win, but the current PLE handler tends to get in the way on these larger guests. -Andrew
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote: On 06/25/2013 08:20 PM, Andrew Theurer wrote: On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote: This series replaces the existing paravirtualized spinlock mechanism with a paravirtualized ticketlock mechanism. The series provides implementation for both Xen and KVM. Changes in V9: - Changed spin_threshold to 32k to avoid excess halt exits that are causing undercommit degradation (after PLE handler improvement). - Added kvm_irq_delivery_to_apic (suggested by Gleb) - Optimized halt exit path to use PLE handler V8 of PVspinlock was posted last year. After Avi's suggestions to look at PLE handler's improvements, various optimizations in PLE handling have been tried. Sorry for not posting this sooner. I have tested the v9 pv-ticketlock patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs. I have tested these patches with and without PLE, as PLE is still not scalable with large VMs. Hi Andrew, Thanks for testing. System: x3850X5, 40 cores, 80 threads 1x over-commit with 10-vCPU VMs (8 VMs) all running dbench: -- Total ConfigurationThroughput(MB/s)Notes 3.10-default-ple_on 22945 5% CPU in host kernel, 2% spin_lock in guests 3.10-default-ple_off 23184 5% CPU in host kernel, 2% spin_lock in guests 3.10-pvticket-ple_on 22895 5% CPU in host kernel, 2% spin_lock in guests 3.10-pvticket-ple_off23051 5% CPU in host kernel, 2% spin_lock in guests [all 1x results look good here] Yes. The 1x results look too close 2x over-commit with 10-vCPU VMs (16 VMs) all running dbench: --- Total ConfigurationThroughput Notes 3.10-default-ple_on 6287 55% CPU host kernel, 17% spin_lock in guests 3.10-default-ple_off 1849 2% CPU in host kernel, 95% spin_lock in guests 3.10-pvticket-ple_on 6691 50% CPU in host kernel, 15% spin_lock in guests 3.10-pvticket-ple_off16464 8% CPU in host kernel, 33% spin_lock in guests I see 6.426% improvement with ple_on and 161.87% improvement with ple_off. 
I think this is a very good sign for the patches [PLE hinders pv-ticket improvements, but even with PLE off, we still off from ideal throughput (somewhere 2)] Okay, The ideal throughput you are referring is getting around atleast 80% of 1x throughput for over-commit. Yes we are still far away from there. 1x over-commit with 20-vCPU VMs (4 VMs) all running dbench: -- Total ConfigurationThroughput Notes 3.10-default-ple_on 22736 6% CPU in host kernel, 3% spin_lock in guests 3.10-default-ple_off 23377 5% CPU in host kernel, 3% spin_lock in guests 3.10-pvticket-ple_on 22471 6% CPU in host kernel, 3% spin_lock in guests 3.10-pvticket-ple_off23445 5% CPU in host kernel, 3% spin_lock in guests [1x looking fine here] I see ple_off is little better here. 2x over-commit with 20-vCPU VMs (8 VMs) all running dbench: -- Total ConfigurationThroughput Notes 3.10-default-ple_on 1965 70% CPU in host kernel, 34% spin_lock in guests 3.10-default-ple_off 226 2% CPU in host kernel, 94% spin_lock in guests 3.10-pvticket-ple_on 1942 70% CPU in host kernel, 35% spin_lock in guests 3.10-pvticket-ple_off 8003 11% CPU in host kernel, 70% spin_lock in guests [quite bad all around, but pv-tickets with PLE off the best so far. Still quite a bit off from ideal throughput] This is again a remarkable improvement (307%). This motivates me to add a patch to disable ple when pvspinlock is on. probably we can add a hypercall that disables ple in kvm init patch. but only problem I see is what if the guests are mixed. (i.e one guest has pvspinlock support but other does not. Host supports pv) How about reintroducing the idea to create per-kvm ple_gap,ple_window state. We were headed
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On Wed, Jun 26, 2013 at 01:37:45PM +0200, Andrew Jones wrote: On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote: On 06/25/2013 08:20 PM, Andrew Theurer wrote: On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote: This series replaces the existing paravirtualized spinlock mechanism with a paravirtualized ticketlock mechanism. The series provides implementation for both Xen and KVM. Changes in V9: - Changed spin_threshold to 32k to avoid excess halt exits that are causing undercommit degradation (after PLE handler improvement). - Added kvm_irq_delivery_to_apic (suggested by Gleb) - Optimized halt exit path to use PLE handler V8 of PVspinlock was posted last year. After Avi's suggestions to look at PLE handler's improvements, various optimizations in PLE handling have been tried. Sorry for not posting this sooner. I have tested the v9 pv-ticketlock patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs. I have tested these patches with and without PLE, as PLE is still not scalable with large VMs. Hi Andrew, Thanks for testing. System: x3850X5, 40 cores, 80 threads 1x over-commit with 10-vCPU VMs (8 VMs) all running dbench: -- Total Configuration Throughput(MB/s)Notes 3.10-default-ple_on22945 5% CPU in host kernel, 2% spin_lock in guests 3.10-default-ple_off 23184 5% CPU in host kernel, 2% spin_lock in guests 3.10-pvticket-ple_on 22895 5% CPU in host kernel, 2% spin_lock in guests 3.10-pvticket-ple_off 23051 5% CPU in host kernel, 2% spin_lock in guests [all 1x results look good here] Yes. 
The 1x results look too close 2x over-commit with 10-vCPU VMs (16 VMs) all running dbench: --- Total Configuration Throughput Notes 3.10-default-ple_on 6287 55% CPU host kernel, 17% spin_lock in guests 3.10-default-ple_off1849 2% CPU in host kernel, 95% spin_lock in guests 3.10-pvticket-ple_on6691 50% CPU in host kernel, 15% spin_lock in guests 3.10-pvticket-ple_off 16464 8% CPU in host kernel, 33% spin_lock in guests I see 6.426% improvement with ple_on and 161.87% improvement with ple_off. I think this is a very good sign for the patches [PLE hinders pv-ticket improvements, but even with PLE off, we still off from ideal throughput (somewhere 2)] Okay, The ideal throughput you are referring is getting around atleast 80% of 1x throughput for over-commit. Yes we are still far away from there. 1x over-commit with 20-vCPU VMs (4 VMs) all running dbench: -- Total Configuration Throughput Notes 3.10-default-ple_on22736 6% CPU in host kernel, 3% spin_lock in guests 3.10-default-ple_off 23377 5% CPU in host kernel, 3% spin_lock in guests 3.10-pvticket-ple_on 22471 6% CPU in host kernel, 3% spin_lock in guests 3.10-pvticket-ple_off 23445 5% CPU in host kernel, 3% spin_lock in guests [1x looking fine here] I see ple_off is little better here. 2x over-commit with 20-vCPU VMs (8 VMs) all running dbench: -- Total Configuration Throughput Notes 3.10-default-ple_on 1965 70% CPU in host kernel, 34% spin_lock in guests 3.10-default-ple_off 226 2% CPU in host kernel, 94% spin_lock in guests 3.10-pvticket-ple_on1942 70% CPU in host kernel, 35% spin_lock in guests 3.10-pvticket-ple_off 8003 11% CPU in host kernel, 70% spin_lock in guests [quite bad all around, but pv-tickets with PLE off the best so far. Still quite a bit off from ideal throughput] This is again a remarkable improvement (307%). This motivates me to add a patch to disable ple when pvspinlock is on. probably we can add a hypercall that disables ple in kvm init patch. but
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On 06/26/2013 06:22 PM, Gleb Natapov wrote: On Wed, Jun 26, 2013 at 01:37:45PM +0200, Andrew Jones wrote: On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote: On 06/25/2013 08:20 PM, Andrew Theurer wrote: On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote: This series replaces the existing paravirtualized spinlock mechanism with a paravirtualized ticketlock mechanism. The series provides implementation for both Xen and KVM. Changes in V9: - Changed spin_threshold to 32k to avoid excess halt exits that are causing undercommit degradation (after PLE handler improvement). - Added kvm_irq_delivery_to_apic (suggested by Gleb) - Optimized halt exit path to use PLE handler V8 of PVspinlock was posted last year. After Avi's suggestions to look at PLE handler's improvements, various optimizations in PLE handling have been tried. Sorry for not posting this sooner. I have tested the v9 pv-ticketlock patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs. I have tested these patches with and without PLE, as PLE is still not scalable with large VMs. Hi Andrew, Thanks for testing. System: x3850X5, 40 cores, 80 threads 1x over-commit with 10-vCPU VMs (8 VMs) all running dbench: -- Total Configuration Throughput(MB/s)Notes 3.10-default-ple_on 22945 5% CPU in host kernel, 2% spin_lock in guests 3.10-default-ple_off23184 5% CPU in host kernel, 2% spin_lock in guests 3.10-pvticket-ple_on22895 5% CPU in host kernel, 2% spin_lock in guests 3.10-pvticket-ple_off 23051 5% CPU in host kernel, 2% spin_lock in guests [all 1x results look good here] Yes. 
The 1x results look too close 2x over-commit with 10-vCPU VMs (16 VMs) all running dbench: --- Total Configuration Throughput Notes 3.10-default-ple_on 6287 55% CPU host kernel, 17% spin_lock in guests 3.10-default-ple_off 1849 2% CPU in host kernel, 95% spin_lock in guests 3.10-pvticket-ple_on 6691 50% CPU in host kernel, 15% spin_lock in guests 3.10-pvticket-ple_off 16464 8% CPU in host kernel, 33% spin_lock in guests I see 6.426% improvement with ple_on and 161.87% improvement with ple_off. I think this is a very good sign for the patches [PLE hinders pv-ticket improvements, but even with PLE off, we still off from ideal throughput (somewhere 2)] Okay, The ideal throughput you are referring is getting around atleast 80% of 1x throughput for over-commit. Yes we are still far away from there. 1x over-commit with 20-vCPU VMs (4 VMs) all running dbench: -- Total Configuration Throughput Notes 3.10-default-ple_on 22736 6% CPU in host kernel, 3% spin_lock in guests 3.10-default-ple_off23377 5% CPU in host kernel, 3% spin_lock in guests 3.10-pvticket-ple_on22471 6% CPU in host kernel, 3% spin_lock in guests 3.10-pvticket-ple_off 23445 5% CPU in host kernel, 3% spin_lock in guests [1x looking fine here] I see ple_off is little better here. 2x over-commit with 20-vCPU VMs (8 VMs) all running dbench: -- Total Configuration Throughput Notes 3.10-default-ple_on 1965 70% CPU in host kernel, 34% spin_lock in guests 3.10-default-ple_off 226 2% CPU in host kernel, 94% spin_lock in guests 3.10-pvticket-ple_on 1942 70% CPU in host kernel, 35% spin_lock in guests 3.10-pvticket-ple_off8003 11% CPU in host kernel, 70% spin_lock in guests [quite bad all around, but pv-tickets with PLE off the best so far. Still quite a bit off from ideal throughput] This is again a remarkable improvement (307%). This motivates me to add a patch to disable ple when pvspinlock is on. probably we can add a hypercall that disables ple in kvm init patch. 
but only problem I see is what if the guests are mixed. (i.e one guest has pvspinlock support but other does not. Host supports pv) How about reintroducing the idea to create per-kvm ple_gap,ple_window
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On Wed, Jun 26, 2013 at 03:52:40PM +0300, Gleb Natapov wrote: On Wed, Jun 26, 2013 at 01:37:45PM +0200, Andrew Jones wrote: On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote: On 06/25/2013 08:20 PM, Andrew Theurer wrote: On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote: This series replaces the existing paravirtualized spinlock mechanism with a paravirtualized ticketlock mechanism. The series provides implementation for both Xen and KVM. Changes in V9: - Changed spin_threshold to 32k to avoid excess halt exits that are causing undercommit degradation (after PLE handler improvement). - Added kvm_irq_delivery_to_apic (suggested by Gleb) - Optimized halt exit path to use PLE handler V8 of PVspinlock was posted last year. After Avi's suggestions to look at PLE handler's improvements, various optimizations in PLE handling have been tried. Sorry for not posting this sooner. I have tested the v9 pv-ticketlock patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs. I have tested these patches with and without PLE, as PLE is still not scalable with large VMs. Hi Andrew, Thanks for testing. System: x3850X5, 40 cores, 80 threads 1x over-commit with 10-vCPU VMs (8 VMs) all running dbench: -- Total ConfigurationThroughput(MB/s)Notes 3.10-default-ple_on 22945 5% CPU in host kernel, 2% spin_lock in guests 3.10-default-ple_off 23184 5% CPU in host kernel, 2% spin_lock in guests 3.10-pvticket-ple_on 22895 5% CPU in host kernel, 2% spin_lock in guests 3.10-pvticket-ple_off23051 5% CPU in host kernel, 2% spin_lock in guests [all 1x results look good here] Yes. 
The 1x results look too close 2x over-commit with 10-vCPU VMs (16 VMs) all running dbench: --- Total ConfigurationThroughput Notes 3.10-default-ple_on 6287 55% CPU host kernel, 17% spin_lock in guests 3.10-default-ple_off 1849 2% CPU in host kernel, 95% spin_lock in guests 3.10-pvticket-ple_on 6691 50% CPU in host kernel, 15% spin_lock in guests 3.10-pvticket-ple_off16464 8% CPU in host kernel, 33% spin_lock in guests I see 6.426% improvement with ple_on and 161.87% improvement with ple_off. I think this is a very good sign for the patches [PLE hinders pv-ticket improvements, but even with PLE off, we still off from ideal throughput (somewhere 2)] Okay, The ideal throughput you are referring is getting around atleast 80% of 1x throughput for over-commit. Yes we are still far away from there. 1x over-commit with 20-vCPU VMs (4 VMs) all running dbench: -- Total ConfigurationThroughput Notes 3.10-default-ple_on 22736 6% CPU in host kernel, 3% spin_lock in guests 3.10-default-ple_off 23377 5% CPU in host kernel, 3% spin_lock in guests 3.10-pvticket-ple_on 22471 6% CPU in host kernel, 3% spin_lock in guests 3.10-pvticket-ple_off23445 5% CPU in host kernel, 3% spin_lock in guests [1x looking fine here] I see ple_off is little better here. 2x over-commit with 20-vCPU VMs (8 VMs) all running dbench: -- Total ConfigurationThroughput Notes 3.10-default-ple_on 1965 70% CPU in host kernel, 34% spin_lock in guests 3.10-default-ple_off 226 2% CPU in host kernel, 94% spin_lock in guests 3.10-pvticket-ple_on 1942 70% CPU in host kernel, 35% spin_lock in guests 3.10-pvticket-ple_off 8003 11% CPU in host kernel, 70% spin_lock in guests [quite bad all around, but pv-tickets with PLE off the best so far. Still quite a bit off from ideal throughput] This is again a remarkable improvement (307%). This motivates me to
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On 06/26/2013 08:09 PM, Chegu Vinod wrote: On 6/26/2013 6:40 AM, Raghavendra K T wrote: On 06/26/2013 06:22 PM, Gleb Natapov wrote: On Wed, Jun 26, 2013 at 01:37:45PM +0200, Andrew Jones wrote: On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote: On 06/25/2013 08:20 PM, Andrew Theurer wrote: On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote: This series replaces the existing paravirtualized spinlock mechanism with a paravirtualized ticketlock mechanism. The series provides implementation for both Xen and KVM. Changes in V9:
- Changed spin_threshold to 32k to avoid excess halt exits that are causing undercommit degradation (after PLE handler improvement).
- Added kvm_irq_delivery_to_apic (suggested by Gleb)
- Optimized halt exit path to use PLE handler
V8 of PVspinlock was posted last year. After Avi's suggestions to look at PLE handler's improvements, various optimizations in PLE handling have been tried. Sorry for not posting this sooner. I have tested the v9 pv-ticketlock patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs. I have tested these patches with and without PLE, as PLE is still not scalable with large VMs. Hi Andrew, Thanks for testing. System: x3850X5, 40 cores, 80 threads

1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
-----------------------------------------------------------
Configuration          Total Throughput(MB/s)  Notes
3.10-default-ple_on    22945                   5% CPU in host kernel, 2% spin_lock in guests
3.10-default-ple_off   23184                   5% CPU in host kernel, 2% spin_lock in guests
3.10-pvticket-ple_on   22895                   5% CPU in host kernel, 2% spin_lock in guests
3.10-pvticket-ple_off  23051                   5% CPU in host kernel, 2% spin_lock in guests

[all 1x results look good here] Yes. The 1x results look too close

2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
------------------------------------------------------------
Configuration          Total Throughput  Notes
3.10-default-ple_on    6287              55% CPU in host kernel, 17% spin_lock in guests
3.10-default-ple_off   1849              2% CPU in host kernel, 95% spin_lock in guests
3.10-pvticket-ple_on   6691              50% CPU in host kernel, 15% spin_lock in guests
3.10-pvticket-ple_off  16464             8% CPU in host kernel, 33% spin_lock in guests

I see 6.426% improvement with ple_on and 161.87% improvement with ple_off. I think this is a very good sign for the patches [PLE hinders pv-ticket improvements, but even with PLE off, we are still off from ideal throughput (somewhere 2)] Okay, the ideal throughput you are referring to is getting at least 80% of 1x throughput for over-commit. Yes, we are still far away from there.

1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
-----------------------------------------------------------
Configuration          Total Throughput  Notes
3.10-default-ple_on    22736             6% CPU in host kernel, 3% spin_lock in guests
3.10-default-ple_off   23377             5% CPU in host kernel, 3% spin_lock in guests
3.10-pvticket-ple_on   22471             6% CPU in host kernel, 3% spin_lock in guests
3.10-pvticket-ple_off  23445             5% CPU in host kernel, 3% spin_lock in guests

[1x looking fine here] I see ple_off is a little better here.

2x over-commit with 20-vCPU VMs (8 VMs) all running dbench:
-----------------------------------------------------------
Configuration          Total Throughput  Notes
3.10-default-ple_on    1965              70% CPU in host kernel, 34% spin_lock in guests
3.10-default-ple_off   226               2% CPU in host kernel, 94% spin_lock in guests
3.10-pvticket-ple_on   1942              70% CPU in host kernel, 35% spin_lock in guests
3.10-pvticket-ple_off  8003              11% CPU in host kernel, 70% spin_lock in guests

[quite bad all around, but pv-tickets with PLE off the best so far. Still quite a bit off from ideal throughput] This is again a remarkable improvement (307%). This motivates me to add a patch to disable ple when pvspinlock is on. Probably we can add a hypercall that disables ple in the kvm init patch, but the only problem I see is what if the guests are mixed. 
(i.e one guest has pvspinlock support but other does not. Host supports pv) How about reintroducing the idea to create per-kvm ple_gap,ple_window state. We were headed down that road when considering a dynamic window at one point. Then you can just set a single guest's ple_gap to zero, which would lead to PLE being disabled for that guest. We could also revisit the dynamic window then. Can be done, but lets understand why ple on is such a big problem. Is it possible that ple gap
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On Wed, 2013-06-26 at 15:52 +0300, Gleb Natapov wrote: [...] trimmed
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On Wed, Jun 26, 2013 at 07:10:21PM +0530, Raghavendra K T wrote: [...] trimmed
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On 06/26/2013 09:41 PM, Gleb Natapov wrote: On Wed, Jun 26, 2013 at 07:10:21PM +0530, Raghavendra K T wrote: [...] trimmed
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote:

This series replaces the existing paravirtualized spinlock mechanism with a paravirtualized ticketlock mechanism. The series provides implementations for both Xen and KVM.

Changes in V9:
- Changed spin_threshold to 32k to avoid the excess halt exits that were causing undercommit degradation (after the PLE handler improvement).
- Added kvm_irq_delivery_to_apic (suggested by Gleb)
- Optimized the halt exit path to use the PLE handler

V8 of PVspinlock was posted last year. After Avi's suggestion to look at improvements to the PLE handler, various optimizations in PLE handling have been tried.

Sorry for not posting this sooner. I have tested the v9 pv-ticketlock patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs. I have tested these patches with and without PLE, as PLE is still not scalable with large VMs.

System: x3850X5, 40 cores, 80 threads

1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
-----------------------------------------------------------
Configuration            Total Throughput (MB/s)   Notes
3.10-default-ple_on        22945    5% CPU in host kernel, 2% spin_lock in guests
3.10-default-ple_off       23184    5% CPU in host kernel, 2% spin_lock in guests
3.10-pvticket-ple_on       22895    5% CPU in host kernel, 2% spin_lock in guests
3.10-pvticket-ple_off      23051    5% CPU in host kernel, 2% spin_lock in guests
[all 1x results look good here]

2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
------------------------------------------------------------
Configuration            Total Throughput   Notes
3.10-default-ple_on         6287   55% CPU in host kernel, 17% spin_lock in guests
3.10-default-ple_off        1849    2% CPU in host kernel, 95% spin_lock in guests
3.10-pvticket-ple_on        6691   50% CPU in host kernel, 15% spin_lock in guests
3.10-pvticket-ple_off      16464    8% CPU in host kernel, 33% spin_lock in guests
[PLE hinders pv-ticket improvements, but even with PLE off, we are still off from ideal throughput]

1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
-----------------------------------------------------------
Configuration            Total Throughput   Notes
3.10-default-ple_on        22736    6% CPU in host kernel, 3% spin_lock in guests
3.10-default-ple_off       23377    5% CPU in host kernel, 3% spin_lock in guests
3.10-pvticket-ple_on       22471    6% CPU in host kernel, 3% spin_lock in guests
3.10-pvticket-ple_off      23445    5% CPU in host kernel, 3% spin_lock in guests
[1x looking fine here]

2x over-commit with 20-vCPU VMs (8 VMs) all running dbench:
-----------------------------------------------------------
Configuration            Total Throughput   Notes
3.10-default-ple_on         1965   70% CPU in host kernel, 34% spin_lock in guests
3.10-default-ple_off         226    2% CPU in host kernel, 94% spin_lock in guests
3.10-pvticket-ple_on        1942   70% CPU in host kernel, 35% spin_lock in guests
3.10-pvticket-ple_off       8003   11% CPU in host kernel, 70% spin_lock in guests
[quite bad all around, but pv-tickets with PLE off is the best so far. Still quite a bit off from ideal throughput]

In summary, I would state that the pv-ticket is an overall win, but the current PLE handler tends to get in the way on these larger guests.

-Andrew

-- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On 06/03/2013 11:51 AM, Raghavendra K T wrote: On 06/03/2013 07:10 AM, Raghavendra K T wrote: On 06/02/2013 09:50 PM, Jiannan Ouyang wrote: On Sun, Jun 2, 2013 at 1:07 AM, Gleb Natapov g...@redhat.com wrote:

High level question here. We have a big hope for the Preemptable Ticket Spinlock patch series by Jiannan Ouyang to solve most, if not all, of the ticket-spinlock-under-overcommit problem without the need for PV. So how does this patch series compare with his patches on PLE-enabled processors?

No experiment results yet. An error is reported on a 20 core VM. I'm in the middle of an internship relocation, and will start work on it next week.

Preemptable spinlocks' testing update:

I hit the same softlockup problem that Andrew had reported while testing on a 32 core machine with 32 guest vcpus. After that I started tuning TIMEOUT_UNIT, and when I went up to (1<<8), things seemed to be manageable for the undercommit cases. But I still see degradation for undercommit w.r.t. the baseline itself on the 32 core machine (after tuning): 37.5% degradation w.r.t. baseline. I can give the full report after all the tests complete. For the over-commit cases, I again started hitting softlockups (and the degradation is worse). But as I said in the preemptable thread, the concept of preemptable locks looks promising (though I am still not a fan of the embedded TIMEOUT mechanism).

Here is my opinion of the TODOs for preemptable locks to make them better (I think I need to paste this in the preemptable thread also):

1. The current TIMEOUT_UNIT seems to be on the higher side, and it also does not scale well with large guests and with overcommit. We need a sort of adaptive mechanism, and better still, different TIMEOUT_UNITs for different types of lock. The hashing mechanism that was used in Rik's spinlock backoff series probably fits better.

2. I do not think TIMEOUT_UNIT by itself would work well when we have a big queue for a lock (large guests / overcommit). One way is to add a PV hook that does a yield hypercall immediately for the waiters above some THRESHOLD, so that they don't burn the CPU. (I can do a POC to check whether that idea improves the situation at some later point in time.)

Preemptable-lock results from my run with 2^8 TIMEOUT:

ebizzy (records/sec), higher is better
      base        stdev      patched    stdev      %improvement
1x    5574.9000   237.4997   3484.2000  113.4449   -37.50202
2x    2741.5000   561.3090    351.5000  140.5420   -87.17855
3x    2146.2500   216.7718    194.8333   85.0303   -90.92215
4x    1663.0000   141.9235    101.0000   57.7853   -93.92664

dbench (Throughput), higher is better
      base         stdev      patched    stdev       %improvement
1x    14111.5600   754.4525   3930.1602  2547.2369   -72.14936
2x    2481.6270     71.2665    181.1816    89.5368   -92.69908
3x    1510.2483     31.8634    104.7243    53.2470   -93.06576
4x    1029.4875     16.9166     72.3738    38.2432   -92.96992

Note: we cannot trust the overcommit results because of the soft lockups.

Hi, I tried (1) TIMEOUT=(2^7) and (2) having a yield hypercall that uses kvm_vcpu_on_spin() to do a directed yield to other vCPUs.

Now I do not see any soft lockups in the overcommit cases and the results are better (except ebizzy 1x); for dbench I see it is now closer to base, and there is even an improvement in 4x.

ebizzy (records/sec), higher is better
      base        stdev      patched    stdev     %improvement
1x    5574.9000   237.4997    523.7000    1.4181  -90.60611
2x    2741.5000   561.3090    597.8000   34.9755  -78.19442
3x    2146.2500   216.7718    902.6667   82.4228  -57.94215
4x    1663.0000   141.9235   1245.0000   67.2989  -25.13530

dbench (Throughput), higher is better
      base         stdev      patched     stdev      %improvement
1x    14111.5600   754.4525    884.9051    24.4723   -93.72922
2x    2481.6270     71.2665   2383.5700   333.2435    -3.95132
3x    1510.2483     31.8634   1477.7358    50.5126    -2.15279
4x    1029.4875     16.9166   1075.9225    13.9911
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On Fri, 2013-06-07 at 11:45 +0530, Raghavendra K T wrote: [...] trimmed
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
Raghu, thanks for your input. I'm more than glad to work together with you to make this idea work better. -Jiannan

On Thu, Jun 6, 2013 at 11:15 PM, Raghavendra K T raghavendra...@linux.vnet.ibm.com wrote: [...] trimmed
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On 06/02/2013 01:44 AM, Andi Kleen wrote:

FWIW I use the paravirt spinlock ops for adding lock elision to the spinlocks. This needs to be done at the top level (so the level you're removing). However, I don't like the pv mechanism very much and would be fine with using a static key hook in the main path, like I do for all the other lock types. It also uses interrupt ops patching; for that it would still be needed, though.

Hi Andi, IIUC, you are okay with the current approach overall, right?
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On 06/03/2013 07:10 AM, Raghavendra K T wrote: [...] trimmed
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On Sun, Jun 02, 2013 at 12:51:25AM +0530, Raghavendra K T wrote:

This series replaces the existing paravirtualized spinlock mechanism with a paravirtualized ticketlock mechanism. The series provides implementations for both Xen and KVM.

High level question here. We have a big hope for the Preemptable Ticket Spinlock patch series by Jiannan Ouyang to solve most, if not all, of the ticket-spinlock-under-overcommit problem without the need for PV. So how does this patch series compare with his patches on PLE-enabled processors?

Changes in V9:
- Changed spin_threshold to 32k to avoid the excess halt exits that were causing undercommit degradation (after the PLE handler improvement).
- Added kvm_irq_delivery_to_apic (suggested by Gleb)
- Optimized the halt exit path to use the PLE handler

V8 of PVspinlock was posted last year. After Avi's suggestion to look at improvements to the PLE handler, various optimizations in PLE handling have been tried. With this series we see that we can get a little more improvement on top of that.

Ticket locks have an inherent problem in the virtualized case, because the vCPUs are scheduled rather than running concurrently (ignoring gang-scheduled vCPUs). This can result in catastrophic performance collapses when the vCPU scheduler doesn't schedule the correct "next" vCPU, and ends up scheduling a vCPU which burns its entire timeslice spinning. (Note that this is not the same problem as lock-holder preemption, which this series also addresses; that's also a problem, but not catastrophic.)

(See Thomas Friebel's talk "Prevent Guests from Spinning Around", http://www.xen.org/files/xensummitboston08/LHP.pdf, for more details.)

Currently we deal with this by having PV spinlocks, which adds a layer of indirection in front of all the spinlock functions, and defines a completely new implementation for Xen (and for other pvops users, but there are none at present).
PV ticketlocks keep the existing ticketlock implementation (fastpath) as-is, but add a couple of pvops for the slow paths:

- If a CPU has been waiting for a spinlock for SPIN_THRESHOLD iterations, then call out to the __ticket_lock_spinning() pvop, which allows a backend to block the vCPU rather than spinning. This pvop can set the lock into slowpath state.

- When releasing a lock, if it is in slowpath state, then call __ticket_unlock_kick() to kick the next vCPU in line awake. If the lock is no longer in contention, it also clears the slowpath flag.

The slowpath state is stored in the LSB of the lock's tail ticket. This has the effect of reducing the max number of CPUs by half (so a small ticket can deal with 128 CPUs, and a large ticket with 32768).

For KVM, one hypercall is introduced in the hypervisor that allows a vcpu to kick another vcpu out of halt state. The blocking of a vcpu is done using halt() in the (lock_spinning) slowpath.

Overall, it results in a large reduction in code, it makes the native and virtualized cases closer, and it removes a layer of indirection around all the spinlock functions. The fast path (taking an uncontended lock which isn't in slowpath state) is optimal, identical to the non-paravirtualized case.
The inner part of the ticket lock code becomes:

	inc = xadd(&lock->tickets, inc);
	inc.tail &= ~TICKET_SLOWPATH_FLAG;

	if (likely(inc.head == inc.tail))
		goto out;
	for (;;) {
		unsigned count = SPIN_THRESHOLD;
		do {
			if (ACCESS_ONCE(lock->tickets.head) == inc.tail)
				goto out;
			cpu_relax();
		} while (--count);
		__ticket_lock_spinning(lock, inc.tail);
	}
out:
	barrier();

which results in:

	push   %rbp
	mov    %rsp,%rbp
	mov    $0x200,%eax
	lock xadd %ax,(%rdi)
	movzbl %ah,%edx
	cmp    %al,%dl
	jne    1f	# Slowpath if lock in contention
	pop    %rbp
	retq

	### SLOWPATH START
1:	and    $-2,%edx
	movzbl %dl,%esi
2:	mov    $0x800,%eax
	jmp    4f
3:	pause
	sub    $0x1,%eax
	je     5f
4:	movzbl (%rdi),%ecx
	cmp    %cl,%dl
	jne    3b
	pop    %rbp
	retq
5:	callq  *__ticket_lock_spinning
	jmp    2b
	### SLOWPATH END

With CONFIG_PARAVIRT_SPINLOCKS=n, the code has changed slightly: the fastpath case is straight through (taking the lock without contention), and the spin loop is out of line:

	push   %rbp
	mov    %rsp,%rbp
	mov    $0x100,%eax
	lock xadd %ax,(%rdi)
	movzbl %ah,%edx
	cmp    %al,%dl
	jne    1f
	pop    %rbp
	retq

	### SLOWPATH START
1:	pause
	movzbl (%rdi),%eax
	cmp    %dl,%al
	jne    1b
	pop    %rbp
	retq
	### SLOWPATH END

The unlock code is complicated by the need to both add to the lock's head and fetch the slowpath flag from the tail.
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On Sun, Jun 2, 2013 at 1:07 AM, Gleb Natapov g...@redhat.com wrote:

High level question here. We have a big hope for the Preemptable Ticket Spinlock patch series by Jiannan Ouyang to solve most, if not all, of the ticketing-spinlocks-in-overcommit-scenarios problem without the need for PV. So how does this patch series compare with his patches on PLE-enabled processors?

No experiment results yet. An error was reported on a 20-core VM. I'm in the middle of an internship relocation, and will start work on it next week. -- Jiannan
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On 06/02/2013 09:50 PM, Jiannan Ouyang wrote:

On Sun, Jun 2, 2013 at 1:07 AM, Gleb Natapov g...@redhat.com wrote:

High level question here. We have a big hope for the Preemptable Ticket Spinlock patch series by Jiannan Ouyang to solve most, if not all, of the ticketing-spinlocks-in-overcommit-scenarios problem without the need for PV. So how does this patch series compare with his patches on PLE-enabled processors?

No experiment results yet. An error was reported on a 20-core VM. I'm in the middle of an internship relocation, and will start work on it next week.

Preemptable spinlocks' testing update:

I hit the same softlockup problem that Andrew had reported while testing on a 32-core machine with 32 guest vcpus. After that I started tuning TIMEOUT_UNIT, and when I went till (18), things seemed to be manageable for undercommit cases. But I still see degradation for undercommit w.r.t. the baseline itself on the 32-core machine, even after tuning (37.5% degradation w.r.t. baseline). I can give the full report after all the tests complete. For overcommit cases, I again started hitting softlockups (and the degradation is worse). But as I said in the preemptable thread, the concept of preemptable locks looks promising (though I am still not a fan of the embedded TIMEOUT mechanism).

Here is my opinion of TODOs for preemptable locks to make them better (I think I need to paste this in the preemptable thread also):

1. The current TIMEOUT_UNIT seems to be on the higher side, and it also does not scale well with large guests and overcommit. We need a sort of adaptive mechanism, and better still, different TIMEOUT_UNITs for different types of locks. The hashing mechanism that was used in Rik's spinlock backoff series probably fits better.

2. I do not think TIMEOUT_UNIT by itself would work well when we have a big queue for a lock (for large guests / overcommit). One way is to add a PV hook that does a yield hypercall immediately for the waiters above some THRESHOLD, so that they don't burn the CPU.
(I can do a PoC to check if that idea works in improving the situation at some later point of time.)
[PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
This series replaces the existing paravirtualized spinlock mechanism with a paravirtualized ticketlock mechanism. The series provides implementations for both Xen and KVM.

Changes in V9:
- Changed spin_threshold to 32k to avoid excess halt exits that were causing undercommit degradation (after PLE handler improvement).
- Added kvm_irq_delivery_to_apic (suggested by Gleb)
- Optimized halt exit path to use PLE handler

V8 of PVspinlock was posted last year. After Avi's suggestions to look at PLE handler improvements, various optimizations in PLE handling have been tried. With this series we see that we could get a little more improvement on top of that.

Ticket locks have an inherent problem in a virtualized case, because the vCPUs are scheduled rather than running concurrently (ignoring gang-scheduled vCPUs). This can result in catastrophic performance collapses when the vCPU scheduler doesn't schedule the correct next vCPU, and ends up scheduling a vCPU which burns its entire timeslice spinning. (Note that this is not the same problem as lock-holder preemption, which this series also addresses; that's also a problem, but not catastrophic.)

(See Thomas Friebel's talk "Prevent Guests from Spinning Around", http://www.xen.org/files/xensummitboston08/LHP.pdf, for more details.)

Currently we deal with this by having PV spinlocks, which adds a layer of indirection in front of all the spinlock functions, and defining a completely new implementation for Xen (and for other pvops users, but there are none at present).

PV ticketlocks keep the existing ticketlock implementation (fastpath) as-is, but add a couple of pvops for the slow paths:

- If a CPU has been waiting for a spinlock for SPIN_THRESHOLD iterations, then call out to the __ticket_lock_spinning() pvop, which allows a backend to block the vCPU rather than spinning. This pvop can set the lock into slowpath state.
- When releasing a lock, if it is in slowpath state, then call __ticket_unlock_kick() to kick the next vCPU in line awake. If the lock is no longer in contention, it also clears the slowpath flag.

The slowpath state is stored in the LSB of the lock's tail ticket. This has the effect of reducing the max number of CPUs by half (so a small ticket can deal with 128 CPUs, and a large ticket with 32768).

For KVM, one hypercall is introduced in the hypervisor that allows a vcpu to kick another vcpu out of halt state. The blocking of a vcpu is done using halt() in the (lock_spinning) slowpath.

Overall, it results in a large reduction in code, it makes the native and virtualized cases closer, and it removes a layer of indirection around all the spinlock functions. The fast path (taking an uncontended lock which isn't in slowpath state) is optimal, identical to the non-paravirtualized case.

The inner part of the ticket lock code becomes:

	inc = xadd(&lock->tickets, inc);
	inc.tail &= ~TICKET_SLOWPATH_FLAG;

	if (likely(inc.head == inc.tail))
		goto out;
	for (;;) {
		unsigned count = SPIN_THRESHOLD;
		do {
			if (ACCESS_ONCE(lock->tickets.head) == inc.tail)
				goto out;
			cpu_relax();
		} while (--count);
		__ticket_lock_spinning(lock, inc.tail);
	}
out:
	barrier();

which results in:

	push   %rbp
	mov    %rsp,%rbp
	mov    $0x200,%eax
	lock xadd %ax,(%rdi)
	movzbl %ah,%edx
	cmp    %al,%dl
	jne    1f	# Slowpath if lock in contention
	pop    %rbp
	retq

	### SLOWPATH START
1:	and    $-2,%edx
	movzbl %dl,%esi
2:	mov    $0x800,%eax
	jmp    4f
3:	pause
	sub    $0x1,%eax
	je     5f
4:	movzbl (%rdi),%ecx
	cmp    %cl,%dl
	jne    3b
	pop    %rbp
	retq
5:	callq  *__ticket_lock_spinning
	jmp    2b
	### SLOWPATH END

With CONFIG_PARAVIRT_SPINLOCKS=n, the code has changed slightly: the fastpath case is straight through (taking the lock without contention), and the spin loop is out of line:

	push   %rbp
	mov    %rsp,%rbp
	mov    $0x100,%eax
	lock xadd %ax,(%rdi)
	movzbl %ah,%edx
	cmp    %al,%dl
	jne    1f
	pop    %rbp
	retq

	### SLOWPATH START
1:	pause
	movzbl (%rdi),%eax
	cmp    %dl,%al
	jne    1b
	pop    %rbp
	retq
	### SLOWPATH END

The unlock code is
complicated by the need to both add to the lock's head and fetch the slowpath flag from the tail. This version of the patch uses a locked add to do this, followed by a test to see if the slowpath flag is set. The lock prefix acts as a full memory barrier, so we can be sure that other CPUs will have seen the unlock before we read the flag (without the barrier the read could be fetched from the store queue before it hits memory).
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
FWIW I use the paravirt spinlock ops for adding lock elision to the spinlocks. This needs to be done at the top level (so the level you're removing). However, I don't like the pv mechanism very much and would be fine with using a static-key hook in the main path, like I do for all the other lock types. It also uses interrupt-ops patching; for that it would still be needed, though. -Andi -- a...@linux.intel.com -- Speaking for myself only.
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On 06/01/2013 01:14 PM, Andi Kleen wrote: FWIW I use the paravirt spinlock ops for adding lock elision to the spinlocks.

Does lock elision still use the ticketlock algorithm/structure, or are they different? If they're still basically ticketlocks, then it seems to me that they're complementary - HLE handles the fastpath, and PV the slowpath.

This needs to be done at the top level (so the level you're removing). However, I don't like the pv mechanism very much and would be fine with using a static-key hook in the main path, like I do for all the other lock types.

Right. J
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On Sat, Jun 01, 2013 at 01:28:00PM -0700, Jeremy Fitzhardinge wrote: On 06/01/2013 01:14 PM, Andi Kleen wrote: FWIW I use the paravirt spinlock ops for adding lock elision to the spinlocks.

Does lock elision still use the ticketlock algorithm/structure, or are they different? If they're still basically ticketlocks, then it seems to me that they're complementary - HLE handles the fastpath, and PV the slowpath.

It uses the ticketlock algorithm/structure, but:
- it needs to know that the lock is free, with its own operation
- it has an additional field for strong adaptation state (but that field is independent of the low-level lock implementation, so it can be used with any kind of lock)

So currently it inlines the ticket-lock code into its own. Doing PV on the slow path would be possible, but would need some additional (minor) hooks, I think. -Andi -- a...@linux.intel.com -- Speaking for myself only.