Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-19 Thread Raghavendra K T


On 10/18/2012 06:09 PM, Avi Kivity wrote:

On 10/09/2012 08:51 PM, Raghavendra K T wrote:

Here is the summary:
We do get good benefit by increasing ple window. Though we don't
see good benefit for kernbench and sysbench, for ebizzy, we get huge
improvement for 1x scenario. (almost 2/3rd of ple disabled case).

Let me know if you think we can increase the default ple_window
itself to 16k.



I think so, there is no point running with untuned defaults.



Okay.



I can respin the whole series including this default ple_window change.


It can come as a separate patch.


Yes. Will spin it separately.





I also have the perf kvm top result for both ebizzy and kernbench.
I think they are in expected lines now.

Improvements


16 core PLE machine with 16 vcpu guest

base = 3.6.0-rc5 + ple handler optimization patches
base_pleopt_16k = base + ple_window = 16k
base_pleopt_32k = base + ple_window = 32k
base_pleopt_nople = base + ple_gap = 0
kernbench, hackbench, sysbench (time in sec lower is better)
ebizzy (rec/sec higher is better)

% improvements w.r.t base (ple_window = 4k)
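---------------+-----------------+-----------------+-------------------+
               | base_pleopt_16k | base_pleopt_32k | base_pleopt_nople |
---------------+-----------------+-----------------+-------------------+
kernbench_1x   |      0.42371    |      1.15164    |       0.09320     |
kernbench_2x   |     -1.40981    |    -17.48282    |    -570.77053     |
---------------+-----------------+-----------------+-------------------+
sysbench_1x    |     -0.92367    |      0.24241    |      -0.27027     |
sysbench_2x    |     -2.22706    |     -0.30896    |      -1.27573     |
sysbench_3x    |     -0.75509    |      0.09444    |      -2.97756     |
---------------+-----------------+-----------------+-------------------+
ebizzy_1x      |     54.99976    |     67.29460    |      74.14076     |
ebizzy_2x      |     -8.83386    |    -27.38403    |     -96.22066     |
---------------+-----------------+-----------------+-------------------+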
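
For readers who want to reproduce the percentages, here is a minimal sketch of how such a "% improvement" figure is commonly computed. The thread does not spell out the formula, so the sign convention below is an assumption: positive means better, with the direction flipped for time-based benchmarks. The function name and the sample numbers are hypothetical.

```python
def pct_improvement(base, new, lower_is_better):
    """Percent improvement of `new` over `base`; positive means better."""
    if lower_is_better:
        # Time-based benchmarks (kernbench, sysbench): less time is a gain.
        return (base - new) / base * 100.0
    # Throughput benchmarks (ebizzy records/sec): more is a gain.
    return (new - base) / base * 100.0


# Hypothetical inputs, chosen only to show the sign convention.
print(pct_improvement(100.0, 95.0, lower_is_better=True))     # run time dropped
print(pct_improvement(1000.0, 1550.0, lower_is_better=False))  # throughput rose
print(pct_improvement(10.0, 67.0, lower_is_better=True))       # run time blew up
```

Under this convention a large negative value for a time-based benchmark means the run took several times longer than the base, which is how a figure below -100% can arise.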


So it seems we want dynamic PLE windows.  As soon as we enter overcommit
we need to decrease the window.
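
A rough sketch of what such a dynamic policy could look like, purely as an illustration. This is not code from the patch series: the function name, the doubling step, and the way overcommit is detected are all assumptions. Only the 4k and 16k values come from this thread.

```python
PLE_WINDOW_MIN = 4096    # the 4k base value used in the measurements
PLE_WINDOW_MAX = 16384   # the 16k value proposed for the undercommitted case


def next_ple_window(current, overcommitted):
    """Pick the next window size (hypothetical policy).

    Drop straight back to the minimum as soon as overcommit is seen,
    otherwise grow gradually towards the maximum.
    """
    if overcommitted:
        return PLE_WINDOW_MIN
    return min(current * 2, PLE_WINDOW_MAX)


# Walk through a hypothetical sequence: undercommitted, then overcommitted.
window = PLE_WINDOW_MIN
for overcommitted in (False, False, False, True, False):
    window = next_ple_window(window, overcommitted)
    print(overcommitted, window)
```

The asymmetry (slow growth, immediate shrink) is one possible design choice; it reflects that a window that is too large costs much more in the 2x rows of the table than a window that is too small costs at 1x.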



Okay.
I have a rough idea of the implementation. I'll try that once these
V2 experiments are over.
So in brief, this is my queue, in priority order:

1) V2 version of this patch series (in progress)
2) default PLE window
3) preemption notifiers
4) PV spinlock

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-19 Thread Raghavendra K T


On 10/15/2012 08:04 PM, Andrew Theurer wrote:

On Mon, 2012-10-15 at 17:40 +0530, Raghavendra K T wrote:

On 10/11/2012 01:06 AM, Andrew Theurer wrote:

On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:

On 10/10/2012 08:29 AM, Andrew Theurer wrote:

On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:

* Avi Kivity a...@redhat.com [2012-10-04 17:00:28]:


On 10/04/2012 03:07 PM, Peter Zijlstra wrote:

On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:



[...]

A big concern I have (if this is 1x overcommit) for ebizzy is that it
has just terrible scalability to begin with.  I do not think we should
try to optimize such a bad workload.



I think my way of running dbench has some flaw, so I went to ebizzy.
Could you let me know how you generally run dbench?


I mount a tmpfs and then specify that mount for dbench to run on.  This
eliminates all IO.  I use a 300 second run time and number of threads is
equal to number of vcpus.  All of the VMs of course need to have a
synchronized start.
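
The setup described above could be scripted roughly as follows. This is a sketch under assumptions: the mount point /mnt/dbench is hypothetical, and the dbench options used (-D for the directory, -t for the time limit) should be checked against the installed dbench version. The helper only builds the command lines; it does not run them.

```python
def dbench_commands(mount_point, vcpus, runtime_s=300):
    """Build the tmpfs mount and dbench command lines (not executed here)."""
    # Mount a tmpfs so the run does no real IO.
    mount = ["mount", "-t", "tmpfs", "tmpfs", mount_point]
    # One dbench client per vcpu, 300 second run time by default.
    run = ["dbench", "-D", mount_point, "-t", str(runtime_s), str(vcpus)]
    return mount, run


# Example for a 16-vcpu guest; /mnt/dbench is a made-up path.
for cmd in dbench_commands("/mnt/dbench", 16):
    print(" ".join(cmd))
```

The synchronized start across VMs is not covered here; it needs some external coordination between the guests.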

I would also make sure you are using a recent kernel for dbench, where
the dcache scalability is much improved.  Without any lock-holder
preemption, the time in spin_lock should be very low:



  21.54%  78016  dbench  [kernel.kallsyms]  [k] copy_user_generic_unrolled
   3.51%  12723  dbench  libc-2.12.so       [.] __strchr_sse42
   2.81%  10176  dbench  dbench             [.] child_run
   2.54%   9203  dbench  [kernel.kallsyms]  [k] _raw_spin_lock
   2.33%   8423  dbench  dbench             [.] next_token
   2.02%   7335  dbench  [kernel.kallsyms]  [k] __d_lookup_rcu
   1.89%   6850  dbench  libc-2.12.so       [.] __strstr_sse42
   1.53%   5537  dbench  libc-2.12.so       [.] __memset_sse2
   1.47%   5337  dbench  [kernel.kallsyms]  [k] link_path_walk
   1.40%   5084  dbench  [kernel.kallsyms]  [k] kmem_cache_alloc
   1.38%   5009  dbench  libc-2.12.so       [.] memmove
   1.24%   4496  dbench  libc-2.12.so       [.] vfprintf
   1.15%   4169  dbench  [kernel.kallsyms]  [k] __audit_syscall_exit




Hi Andrew,
I ran the test with dbench with tmpfs. I do not see any improvements in
dbench for 16k ple window.

So it seems that, apart from ebizzy, no workload benefited from it, and I
agree that it may not be good to optimize for ebizzy.
I shall drop the change to a 16k default window and continue with the
original patch series. I need to experiment with the latest kernel.


Thanks for running this again.  I do believe there are some workloads
that, when run at 1x overcommit, would benefit from a larger ple_window [with
the current ple handling code], but I also do not want to potentially
degrade >1x with a larger window.  I do, however, think there may be
another option.  I have not fully worked this out, but I think I am on
to something.

I decided to revert back to just a yield() instead of a yield_to().  My
motivation was that yield_to() [for large VMs] is like a dog chasing its
tail, round and round we go.  Just yield(), in particular a yield()
which results in yielding to something -other- than the current VM's
vcpus, helps synchronize the execution of sibling vcpus by deferring
them until the lock holder vcpu is running again.  The more we can do to
get all vcpus running at the same time, the less we have to deal with the
preemption problem.  The other benefit is that yield() is far, far lower
overhead than yield_to().

This does assume that vcpus from the same VM do not share the same runqueues.
Yielding to a sibling vcpu with yield() is not productive for larger VMs
in the same way that yield_to() is not.  My recent results include
restricting vcpu placement so that sibling vcpus do not get to run on
the same runqueue.  I do believe we could implement an initial placement
and load balance policy to strive for this restriction (making it purely
optional, but I bet it could also help user apps which use spin locks).

For 1x VMs which still vm_exit due to PLE, I believe we could probably
just leave the ple_window alone, as long as we mostly use yield()
instead of yield_to().  The problem with the unneeded exits in this case
has been the overhead in routines leading up to yield_to() and the
yield_to() itself.  If we use yield() most of the time, this overhead
will go away.

Here is a comparison of yield_to() and yield():

dbench with 20-way VMs, 8 of them on 80-way host:

no PLE                    426 +/- 11.03%
no PLE w/ gangsched     32001 +/- .37%
PLE with yield()        29207 +/- .28%
PLE with yield_to()      8175 +/- 1.37%

Yield() is far and away better than yield_to() here and almost approaches
gang sched result.  Here is a link for the perf sched map bitmap:

https://docs.google.com/open?id=0B6tfUNlZ-14weXBfVnFFZGw1akU
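
To put the comparison above into ratios, here is a small sketch that only re-uses the four dbench figures quoted in this mail. It assumes the first row reads as 426 for "no PLE"; the helper name is made up.

```python
# dbench throughput figures quoted above (8 x 20-way VMs on an 80-way host).
results = {
    "no PLE": 426,
    "no PLE w/ gangsched": 32001,
    "PLE with yield()": 29207,
    "PLE with yield_to()": 8175,
}


def ratio(a, b):
    """Throughput of configuration `a` relative to configuration `b`."""
    return results[a] / results[b]


print("yield() vs yield_to(): %.2fx"
      % ratio("PLE with yield()", "PLE with yield_to()"))
print("yield() as a share of the gang sched result: %.1f%%"
      % (100.0 * ratio("PLE with yield()", "no PLE w/ gangsched")))
```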

The thrashing is way down and sibling vcpus tend to run together,
approximating the behavior of the gang

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-19 Thread Andrew Theurer

On Fri, 2012-10-19 at 14:00 +0530, Raghavendra K T wrote:
 On 10/15/2012 08:04 PM, Andrew Theurer wrote:
  On Mon, 2012-10-15 at 17:40 +0530, Raghavendra K T wrote:
  On 10/11/2012 01:06 AM, Andrew Theurer wrote:
  On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:
  On 10/10/2012 08:29 AM, Andrew Theurer wrote:
  On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
  * Avi Kivity a...@redhat.com [2012-10-04 17:00:28]:
 
  On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
  On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
 
  [...]
  A big concern I have (if this is 1x overcommit) for ebizzy is that it
  has just terrible scalability to begin with.  I do not think we should
  try to optimize such a bad workload.
 
 
  I think my way of running dbench has some flaw, so I went to ebizzy.
  Could you let me know how you generally run dbench?
 
  I mount a tmpfs and then specify that mount for dbench to run on.  This
  eliminates all IO.  I use a 300 second run time and number of threads is
  equal to number of vcpus.  All of the VMs of course need to have a
  synchronized start.
 
  I would also make sure you are using a recent kernel for dbench, where
  the dcache scalability is much improved.  Without any lock-holder
  preemption, the time in spin_lock should be very low:
 
 
    21.54%  78016  dbench  [kernel.kallsyms]  [k] copy_user_generic_unrolled
     3.51%  12723  dbench  libc-2.12.so       [.] __strchr_sse42
     2.81%  10176  dbench  dbench             [.] child_run
     2.54%   9203  dbench  [kernel.kallsyms]  [k] _raw_spin_lock
     2.33%   8423  dbench  dbench             [.] next_token
     2.02%   7335  dbench  [kernel.kallsyms]  [k] __d_lookup_rcu
     1.89%   6850  dbench  libc-2.12.so       [.] __strstr_sse42
     1.53%   5537  dbench  libc-2.12.so       [.] __memset_sse2
     1.47%   5337  dbench  [kernel.kallsyms]  [k] link_path_walk
     1.40%   5084  dbench  [kernel.kallsyms]  [k] kmem_cache_alloc
     1.38%   5009  dbench  libc-2.12.so       [.] memmove
     1.24%   4496  dbench  libc-2.12.so       [.] vfprintf
     1.15%   4169  dbench  [kernel.kallsyms]  [k] __audit_syscall_exit
 
 
  Hi Andrew,
  I ran the test with dbench with tmpfs. I do not see any improvements in
  dbench for 16k ple window.
 
  So it seems that, apart from ebizzy, no workload benefited from it, and I
  agree that it may not be good to optimize for ebizzy.
  I shall drop the change to a 16k default window and continue with the
  original patch series. I need to experiment with the latest kernel.
 
  Thanks for running this again.  I do believe there are some workloads
  that, when run at 1x overcommit, would benefit from a larger ple_window [with
  the current ple handling code], but I also do not want to potentially
  degrade >1x with a larger window.  I do, however, think there may be
  another option.  I have not fully worked this out, but I think I am on
  to something.
 
  I decided to revert back to just a yield() instead of a yield_to().  My
  motivation was that yield_to() [for large VMs] is like a dog chasing its
  tail, round and round we go.  Just yield(), in particular a yield()
  which results in yielding to something -other- than the current VM's
  vcpus, helps synchronize the execution of sibling vcpus by deferring
  them until the lock holder vcpu is running again.  The more we can do to
  get all vcpus running at the same time, the less we have to deal with the
  preemption problem.  The other benefit is that yield() is far, far lower
  overhead than yield_to().
 
  This does assume that vcpus from the same VM do not share the same runqueues.
  Yielding to a sibling vcpu with yield() is not productive for larger VMs
  in the same way that yield_to() is not.  My recent results include
  restricting vcpu placement so that sibling vcpus do not get to run on
  the same runqueue.  I do believe we could implement an initial placement
  and load balance policy to strive for this restriction (making it purely
  optional, but I bet it could also help user apps which use spin locks).
 
  For 1x VMs which still vm_exit due to PLE, I believe we could probably
  just leave the ple_window alone, as long as we mostly use yield()
  instead of yield_to().  The problem with the unneeded exits in this case
  has been the overhead in routines leading up to yield_to() and the
  yield_to() itself.  If we use yield() most of the time, this overhead
  will go away.
 
  Here is a comparison of yield_to() and yield():
 
  dbench with 20-way VMs, 8 of them on 80-way host:
 
  no PLE                    426 +/- 11.03%
  no PLE w/ gangsched     32001 +/- .37%
  PLE with yield()        29207 +/- .28%
  PLE with yield_to()      8175 +/- 1.37%
 
  Yield() is far and away better than yield_to() here and almost approaches

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-18 Thread Avi Kivity

On 10/09/2012 08:51 PM, Raghavendra K T wrote:
 Here is the summary:
 We do get good benefit by increasing ple window. Though we don't
 see good benefit for kernbench and sysbench, for ebizzy, we get huge
 improvement for 1x scenario. (almost 2/3rd of ple disabled case).
 
 Let me know if you think we can increase the default ple_window
 itself to 16k.


I think so, there is no point running with untuned defaults.

 
 I can respin the whole series including this default ple_window change.

It can come as a separate patch.

 
 I also have the perf kvm top result for both ebizzy and kernbench.
 I think they are in expected lines now.
 
 Improvements
 
 
 16 core PLE machine with 16 vcpu guest
 
 base = 3.6.0-rc5 + ple handler optimization patches
 base_pleopt_16k = base + ple_window = 16k
 base_pleopt_32k = base + ple_window = 32k
 base_pleopt_nople = base + ple_gap = 0
 kernbench, hackbench, sysbench (time in sec lower is better)
 ebizzy (rec/sec higher is better)
 
 % improvements w.r.t base (ple_window = 4k)
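 ---------------+-----------------+-----------------+-------------------+
                | base_pleopt_16k | base_pleopt_32k | base_pleopt_nople |
 ---------------+-----------------+-----------------+-------------------+
 kernbench_1x   |      0.42371    |      1.15164    |       0.09320     |
 kernbench_2x   |     -1.40981    |    -17.48282    |    -570.77053     |
 ---------------+-----------------+-----------------+-------------------+
 sysbench_1x    |     -0.92367    |      0.24241    |      -0.27027     |
 sysbench_2x    |     -2.22706    |     -0.30896    |      -1.27573     |
 sysbench_3x    |     -0.75509    |      0.09444    |      -2.97756     |
 ---------------+-----------------+-----------------+-------------------+
 ebizzy_1x      |     54.99976    |     67.29460    |      74.14076     |
 ebizzy_2x      |     -8.83386    |    -27.38403    |     -96.22066     |
 ---------------+-----------------+-----------------+-------------------+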

So it seems we want dynamic PLE windows.  As soon as we enter overcommit
we need to decrease the window.


-- 
error compiling committee.c: too many arguments to function

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-15 Thread Raghavendra K T


On 10/11/2012 01:06 AM, Andrew Theurer wrote:

On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:

On 10/10/2012 08:29 AM, Andrew Theurer wrote:

On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:

* Avi Kivity a...@redhat.com [2012-10-04 17:00:28]:


On 10/04/2012 03:07 PM, Peter Zijlstra wrote:

On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:



[...]

A big concern I have (if this is 1x overcommit) for ebizzy is that it
has just terrible scalability to begin with.  I do not think we should
try to optimize such a bad workload.



I think my way of running dbench has some flaw, so I went to ebizzy.
Could you let me know how you generally run dbench?


I mount a tmpfs and then specify that mount for dbench to run on.  This
eliminates all IO.  I use a 300 second run time and number of threads is
equal to number of vcpus.  All of the VMs of course need to have a
synchronized start.

I would also make sure you are using a recent kernel for dbench, where
the dcache scalability is much improved.  Without any lock-holder
preemption, the time in spin_lock should be very low:



  21.54%  78016  dbench  [kernel.kallsyms]  [k] copy_user_generic_unrolled
   3.51%  12723  dbench  libc-2.12.so       [.] __strchr_sse42
   2.81%  10176  dbench  dbench             [.] child_run
   2.54%   9203  dbench  [kernel.kallsyms]  [k] _raw_spin_lock
   2.33%   8423  dbench  dbench             [.] next_token
   2.02%   7335  dbench  [kernel.kallsyms]  [k] __d_lookup_rcu
   1.89%   6850  dbench  libc-2.12.so       [.] __strstr_sse42
   1.53%   5537  dbench  libc-2.12.so       [.] __memset_sse2
   1.47%   5337  dbench  [kernel.kallsyms]  [k] link_path_walk
   1.40%   5084  dbench  [kernel.kallsyms]  [k] kmem_cache_alloc
   1.38%   5009  dbench  libc-2.12.so       [.] memmove
   1.24%   4496  dbench  libc-2.12.so       [.] vfprintf
   1.15%   4169  dbench  [kernel.kallsyms]  [k] __audit_syscall_exit




Hi Andrew,
I ran the test with dbench with tmpfs. I do not see any improvements in
dbench for 16k ple window.

So it seems that, apart from ebizzy, no workload benefited from it, and I
agree that it may not be good to optimize for ebizzy.
I shall drop the change to a 16k default window and continue with the
original patch series. I need to experiment with the latest kernel.

(PS: Thanks for pointing me towards perf in the latest kernel. It works fine.)

Results:
dbench run for 120 sec 30 sec warmup 8 iterations using tmpfs
base = 3.6.0-rc5 with ple handler optimization patch.

x = base + ple_window = 4k
+ = base + ple_window = 16k
* = base + ple_gap = 0

dbench 1x overcommit case
=========================
    N       Min       Max    Median        Avg     Stddev
x   8    5322.5   5519.05   5482.71  5461.0962  63.522276
+   8   5255.45   5530.55   5496.94  5455.2137  93.070363
*   8   5350.85   5477.81  5408.065  5418.4338  44.762697


dbench 2x overcommit case
=========================

    N       Min       Max    Median        Avg     Stddev
x   8   3054.32   3194.47   3137.33   3132.625  54.491615
+   8    3040.8   3148.87  3088.615  3088.1887  32.862336
*   8   3031.51   3171.99    3083.6  3097.4612  50.526977
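
One way to read the tables above is to express each average relative to the 4k base and compare the gap with the run-to-run spread. The sketch below does that with the Avg and Stddev columns only. The "within one standard deviation" check is a crude assumption of mine, not a proper significance test, and the dictionary keys and function name are made up.

```python
# (Avg, Stddev) pairs copied from the dbench tables above.
runs_1x = {"4k": (5461.0962, 63.522276),
           "16k": (5455.2137, 93.070363),
           "ple_gap=0": (5418.4338, 44.762697)}
runs_2x = {"4k": (3132.625, 54.491615),
           "16k": (3088.1887, 32.862336),
           "ple_gap=0": (3097.4612, 50.526977)}


def delta_vs_base(runs, base="4k"):
    """Return {name: (pct_change_vs_base, within_one_stddev)}."""
    base_avg, base_sd = runs[base]
    out = {}
    for name, (avg, sd) in runs.items():
        if name == base:
            continue
        pct = (avg - base_avg) / base_avg * 100.0
        # Crude noise check: is the gap smaller than the larger stddev?
        within_noise = abs(avg - base_avg) <= max(sd, base_sd)
        out[name] = (pct, within_noise)
    return out


for label, runs in (("1x", runs_1x), ("2x", runs_2x)):
    for name, (pct, noise) in delta_vs_base(runs).items():
        print("%s %-10s %+.2f%% within_noise=%s" % (label, name, pct, noise))
```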


Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-15 Thread Andrew Theurer

On Mon, 2012-10-15 at 17:40 +0530, Raghavendra K T wrote:
 On 10/11/2012 01:06 AM, Andrew Theurer wrote:
  On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:
  On 10/10/2012 08:29 AM, Andrew Theurer wrote:
  On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
  * Avi Kivity a...@redhat.com [2012-10-04 17:00:28]:
 
  On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
  On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
 
 [...]
  A big concern I have (if this is 1x overcommit) for ebizzy is that it
  has just terrible scalability to begin with.  I do not think we should
  try to optimize such a bad workload.
 
 
  I think my way of running dbench has some flaw, so I went to ebizzy.
  Could you let me know how you generally run dbench?
 
  I mount a tmpfs and then specify that mount for dbench to run on.  This
  eliminates all IO.  I use a 300 second run time and number of threads is
  equal to number of vcpus.  All of the VMs of course need to have a
  synchronized start.
 
  I would also make sure you are using a recent kernel for dbench, where
  the dcache scalability is much improved.  Without any lock-holder
  preemption, the time in spin_lock should be very low:
 
 
    21.54%  78016  dbench  [kernel.kallsyms]  [k] copy_user_generic_unrolled
     3.51%  12723  dbench  libc-2.12.so       [.] __strchr_sse42
     2.81%  10176  dbench  dbench             [.] child_run
     2.54%   9203  dbench  [kernel.kallsyms]  [k] _raw_spin_lock
     2.33%   8423  dbench  dbench             [.] next_token
     2.02%   7335  dbench  [kernel.kallsyms]  [k] __d_lookup_rcu
     1.89%   6850  dbench  libc-2.12.so       [.] __strstr_sse42
     1.53%   5537  dbench  libc-2.12.so       [.] __memset_sse2
     1.47%   5337  dbench  [kernel.kallsyms]  [k] link_path_walk
     1.40%   5084  dbench  [kernel.kallsyms]  [k] kmem_cache_alloc
     1.38%   5009  dbench  libc-2.12.so       [.] memmove
     1.24%   4496  dbench  libc-2.12.so       [.] vfprintf
     1.15%   4169  dbench  [kernel.kallsyms]  [k] __audit_syscall_exit
 
 
 Hi Andrew,
 I ran the test with dbench with tmpfs. I do not see any improvements in
 dbench for 16k ple window.
 
 So it seems that, apart from ebizzy, no workload benefited from it, and I
 agree that it may not be good to optimize for ebizzy.
 I shall drop the change to a 16k default window and continue with the
 original patch series. I need to experiment with the latest kernel.

Thanks for running this again.  I do believe there are some workloads
that, when run at 1x overcommit, would benefit from a larger ple_window [with
the current ple handling code], but I also do not want to potentially
degrade >1x with a larger window.  I do, however, think there may be
another option.  I have not fully worked this out, but I think I am on
to something.

I decided to revert back to just a yield() instead of a yield_to().  My
motivation was that yield_to() [for large VMs] is like a dog chasing its
tail, round and round we go.  Just yield(), in particular a yield()
which results in yielding to something -other- than the current VM's
vcpus, helps synchronize the execution of sibling vcpus by deferring
them until the lock holder vcpu is running again.  The more we can do to
get all vcpus running at the same time, the less we have to deal with the
preemption problem.  The other benefit is that yield() is far, far lower
overhead than yield_to().

This does assume that vcpus from the same VM do not share the same runqueues.
Yielding to a sibling vcpu with yield() is not productive for larger VMs
in the same way that yield_to() is not.  My recent results include
restricting vcpu placement so that sibling vcpus do not get to run on
the same runqueue.  I do believe we could implement an initial placement
and load balance policy to strive for this restriction (making it purely
optional, but I bet it could also help user apps which use spin locks).

For 1x VMs which still vm_exit due to PLE, I believe we could probably
just leave the ple_window alone, as long as we mostly use yield()
instead of yield_to().  The problem with the unneeded exits in this case
has been the overhead in routines leading up to yield_to() and the
yield_to() itself.  If we use yield() most of the time, this overhead
will go away.

Here is a comparison of yield_to() and yield():

dbench with 20-way VMs, 8 of them on 80-way host:

no PLE                    426 +/- 11.03%
no PLE w/ gangsched     32001 +/- .37%
PLE with yield()        29207 +/- .28%
PLE with yield_to()      8175 +/- 1.37%

Yield() is far and away better than yield_to() here and almost approaches
gang sched result.  Here is a link for the perf sched map bitmap:

https://docs.google.com/open?id=0B6tfUNlZ-14weXBfVnFFZGw1akU

The thrashing is way down and sibling vcpus tend to run together,

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-11 Thread Nikunj A Dadhania

On Wed, 10 Oct 2012 09:24:55 -0500, Andrew Theurer 
haban...@linux.vnet.ibm.com wrote:
 
 Below is again 8 x 20-way VMs, but this time I tried out Nikunj's gang
 scheduling patches.  While I am not recommending gang scheduling, I
 think it's a good data point.  The performance is 3.88x the PLE result.
 
 https://docs.google.com/open?id=0B6tfUNlZ-14wWXdscWcwNTVEY3M

That looks pretty good and serves the purpose. And the result says it all.

 Note that the task switching intervals of 4ms are quite obvious again,
 and this time all vCPUs from same VM run at the same time.  It
 represents the best possible outcome.
 
 
 Anyway, I thought the bitmaps might help better visualize what's going
 on.
 
 -Andrew
 

Regards
Nikunj


Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-11 Thread Raghavendra K T


On 10/11/2012 12:57 AM, Andrew Theurer wrote:

On Wed, 2012-10-10 at 23:13 +0530, Raghavendra K T wrote:

On 10/10/2012 07:54 PM, Andrew Theurer wrote:

I ran 'perf sched map' on the dbench workload for medium and large VMs,
and I thought I would share some of the results.  I think it helps to
visualize what's going on regarding the yielding.

These files are png bitmaps, generated from processing output from 'perf
sched map' (and perf data generated from 'perf sched record').  The Y
axis is the host cpus, each row being 10 pixels high.  For these tests,
there are 80 host cpus, so the total height is 800 pixels.  The X axis
is time (in microseconds), with each pixel representing 1 microsecond.
Each bitmap plots 30,000 microseconds.  The bitmaps are quite wide
obviously, and zooming in/out while viewing is recommended.
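
The bitmap geometry described above works out as in the sketch below. The constants are taken from the description; the function names, and the assumption that host cpu 0 is the top row, are mine and not part of the actual script that generated the images.

```python
ROW_HEIGHT_PX = 10     # each host cpu row is 10 pixels high
HOST_CPUS = 80         # 80 host cpus in these tests
US_PER_PIXEL = 1       # one pixel per microsecond on the X axis
WINDOW_US = 30000      # each bitmap plots 30,000 microseconds


def bitmap_size():
    """Return (width, height) of one bitmap in pixels."""
    return (WINDOW_US // US_PER_PIXEL, HOST_CPUS * ROW_HEIGHT_PX)


def run_to_rect(cpu, start_us, end_us):
    """Pixel box (x0, y0, x1, y1) for a thread running on `cpu`.

    Assumes cpu 0 is drawn at the top and times are offsets from the
    start of the plotted window.
    """
    x0 = start_us // US_PER_PIXEL
    x1 = end_us // US_PER_PIXEL
    y0 = cpu * ROW_HEIGHT_PX
    return (x0, y0, x1, y0 + ROW_HEIGHT_PX)


print(bitmap_size())
# A 4 ms scheduler slice on host cpu 3, starting 4000 us into the window.
print(run_to_rect(3, 4000, 8000))
```

At this scale a full 4 ms time slice is 4000 pixels wide, which is why zooming out is needed to see the slice boundaries line up across cpus.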

Each row (each host cpu) is assigned a color based on what thread is
running.  vCPUs of the same VM are assigned a common color (like red,
blue, magenta, etc), and each vCPU has a unique brightness for that
color.  There are a maximum of 12 assignable colors, so any VMs beyond 12
revert to a vCPU color of gray. I would use more colors, but it becomes
harder to distinguish one color from another.  The white color
represents missing data from perf, and black color represents any thread
which is not a vCPU.
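
The color scheme described above could be sketched like this. It is an illustration under assumptions, not the script actually used: the palette (evenly spaced hues), the brightness range, the exact gray value, and the function signature are all made up. Only the rules themselves (one color per VM, one brightness per vCPU, at most 12 colors, gray beyond that, white for missing data, black for non-vCPU threads) come from the description.

```python
import colorsys

MAX_COLORS = 12
GRAY = (128, 128, 128)    # VMs beyond the 12 assignable colors
WHITE = (255, 255, 255)   # missing data from perf
BLACK = (0, 0, 0)         # any thread which is not a vCPU


def thread_color(kind, vm_index=0, vcpu_index=0, vcpus_per_vm=1):
    """Return an (r, g, b) tuple for one sample.

    kind is "missing", "other" (non-vCPU thread) or "vcpu".
    """
    if kind == "missing":
        return WHITE
    if kind == "other":
        return BLACK
    if vm_index >= MAX_COLORS:
        return GRAY
    # One hue per VM, one brightness step per vCPU within that VM.
    hue = vm_index / MAX_COLORS
    value = 0.4 + 0.6 * (vcpu_index + 1) / vcpus_per_vm
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, value)
    return (round(r * 255), round(g * 255), round(b * 255))


print(thread_color("vcpu", vm_index=0, vcpu_index=9, vcpus_per_vm=10))
print(thread_color("vcpu", vm_index=12, vcpu_index=0, vcpus_per_vm=10))
```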

For the following tests, VMs were pinned to host NUMA nodes and to
specific cpus to help with consistency and operate within the
constraints of the last test (gang scheduler).

Here is a good example of PLE.  These are 10-way VMs, 16 of them (as
described above only 12 of the VMs have a color, rest are gray).

https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU


This is a very nice way to visualize what is happening. The beginning of the
graph looks a little messy, but later it is clear.



If you zoom out and look at the whole bitmap, you may notice the 4ms
intervals of the scheduler.  They are pretty well aligned across all
cpus.  Normally, for cpu bound workloads, we would expect to see each
thread to run for 4 ms, then something else getting to run, and so on.
That is mostly true in this test.  We have 2x over-commit and we
generally see the switching of threads at 4ms.  One thing to note is
that not all vCPU threads for the same VM run at exactly the same time,
and that is expected and the whole reason for lock-holder preemption.
Now, if you zoom in on the bitmap, you should notice within the 4ms
intervals there is some task switching going on.  This is most likely
because of the yield_to initiated by the PLE handler.  In this case
there is not that much yielding to do.   It's quite clean, and the
performance is quite good.

Below is an example of PLE, but this time with 20-way VMs, 8 of them.
CPU over-commit is still 2x.

https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU


I think this link is still the 10x16 one. Could you paste the link again?


Oops
https://docs.google.com/open?id=0B6tfUNlZ-14wSGtYYzZtRTcyVjQ





This one looks quite different.  In short, it's a mess.  The switching
between tasks can be lower than 10 microseconds.  It basically never
recovers.  There is constant yielding all the time.

Below is again 8 x 20-way VMs, but this time I tried out Nikunj's gang
scheduling patches.  While I am not recommending gang scheduling, I
think it's a good data point.  The performance is 3.88x the PLE result.

https://docs.google.com/open?id=0B6tfUNlZ-14wWXdscWcwNTVEY3M


Yes, we see a lot of yields.


Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-10 Thread Raghavendra K T


On 10/10/2012 07:54 PM, Andrew Theurer wrote:

I ran 'perf sched map' on the dbench workload for medium and large VMs,
and I thought I would share some of the results.  I think it helps to
visualize what's going on regarding the yielding.

These files are png bitmaps, generated from processing output from 'perf
sched map' (and perf data generated from 'perf sched record').  The Y
axis is the host cpus, each row being 10 pixels high.  For these tests,
there are 80 host cpus, so the total height is 800 pixels.  The X axis
is time (in microseconds), with each pixel representing 1 microsecond.
Each bitmap plots 30,000 microseconds.  The bitmaps are quite wide
obviously, and zooming in/out while viewing is recommended.

Each row (each host cpu) is assigned a color based on what thread is
running.  vCPUs of the same VM are assigned a common color (like red,
blue, magenta, etc), and each vCPU has a unique brightness for that
color.  There are a maximum of 12 assignable colors, so any VMs beyond 12
revert to a vCPU color of gray. I would use more colors, but it becomes
harder to distinguish one color from another.  The white color
represents missing data from perf, and black color represents any thread
which is not a vCPU.

For the following tests, VMs were pinned to host NUMA nodes and to
specific cpus to help with consistency and operate within the
constraints of the last test (gang scheduler).

Here is a good example of PLE.  These are 10-way VMs, 16 of them (as
described above only 12 of the VMs have a color, rest are gray).

https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU


This is a very nice way to visualize what is happening. The beginning of the
graph looks a little messy, but later it is clear.




If you zoom out and look at the whole bitmap, you may notice the 4ms
intervals of the scheduler.  They are pretty well aligned across all
cpus.  Normally, for cpu bound workloads, we would expect to see each
thread to run for 4 ms, then something else getting to run, and so on.
That is mostly true in this test.  We have 2x over-commit and we
generally see the switching of threads at 4ms.  One thing to note is
that not all vCPU threads for the same VM run at exactly the same time,
and that is expected and the whole reason for lock-holder preemption.
Now, if you zoom in on the bitmap, you should notice within the 4ms
intervals there is some task switching going on.  This is most likely
because of the yield_to initiated by the PLE handler.  In this case
there is not that much yielding to do.   It's quite clean, and the
performance is quite good.

Below is an example of PLE, but this time with 20-way VMs, 8 of them.
CPU over-commit is still 2x.

https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU


I think this link is still the 10x16 one. Could you paste the link again?



This one looks quite different.  In short, it's a mess.  The switching
between tasks can be lower than 10 microseconds.  It basically never
recovers.  There is constant yielding all the time.

Below is again 8 x 20-way VMs, but this time I tried out Nikunj's gang
scheduling patches.  While I am not recommending gang scheduling, I
think it's a good data point.  The performance is 3.88x the PLE result.

https://docs.google.com/open?id=0B6tfUNlZ-14wWXdscWcwNTVEY3M

Note that the task switching intervals of 4ms are quite obvious again,
and this time all vCPUs from same VM run at the same time.  It
represents the best possible outcome.


Anyway, I thought the bitmaps might help better visualize what's going
on.

-Andrew







Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-10 Thread Raghavendra K T


On 10/10/2012 08:29 AM, Andrew Theurer wrote:

On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:

* Avi Kivity a...@redhat.com [2012-10-04 17:00:28]:


On 10/04/2012 03:07 PM, Peter Zijlstra wrote:

On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:


Again the numbers are ridiculously high for arch_local_irq_restore.
Maybe there's a bad perf/kvm interaction when we're injecting an
interrupt, I can't believe we're spending 84% of the time running the
popf instruction.


Smells like a software fallback that doesn't do NMI, hrtimer based
sampling typically hits popf where we re-enable interrupts.


Good nose, that's probably it.  Raghavendra, can you ensure that the PMU
is properly exposed?  'dmesg' in the guest will tell.  If it isn't, -cpu
host will expose it (and a good idea anyway to get best performance).



Hi Avi, you are right. The SandyBridge machine result was not proper.
I cleaned up the services, enabled the PMU, and re-ran all the tests.

Here is the summary:
We do get good benefit by increasing ple window. Though we don't
see good benefit for kernbench and sysbench, for ebizzy, we get huge
improvement for 1x scenario. (almost 2/3rd of ple disabled case).

Let me know if you think we can increase the default ple_window
itself to 16k.

I am experimenting with the V2 version of this undercommit improvement patch
series, but I think that if you wish to increase the
default ple_window, then we would have to measure the benefit of the patches
with ple_window = 16k.

I can respin the whole series including this default ple_window change.

I also have the perf kvm top result for both ebizzy and kernbench.
I think they are in expected lines now.

Improvements


16 core PLE machine with 16 vcpu guest

base = 3.6.0-rc5 + ple handler optimization patches
base_pleopt_16k = base + ple_window = 16k
base_pleopt_32k = base + ple_window = 32k
base_pleopt_nople = base + ple_gap = 0
kernbench, hackbench, sysbench (time in sec lower is better)
ebizzy (rec/sec higher is better)

% improvements w.r.t base (ple_window = 4k)
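-------------+-----------------+-----------------+-------------------+
             | base_pleopt_16k | base_pleopt_32k | base_pleopt_nople |
-------------+-----------------+-----------------+-------------------+
kernbench_1x |         0.42371 |         1.15164 |           0.09320 |
kernbench_2x |        -1.40981 |       -17.48282 |        -570.77053 |
-------------+-----------------+-----------------+-------------------+
sysbench_1x  |        -0.92367 |         0.24241 |          -0.27027 |
sysbench_2x  |        -2.22706 |        -0.30896 |          -1.27573 |
sysbench_3x  |        -0.75509 |         0.09444 |          -2.97756 |
-------------+-----------------+-----------------+-------------------+
ebizzy_1x    |        54.99976 |        67.29460 |          74.14076 |
ebizzy_2x    |        -8.83386 |       -27.38403 |         -96.22066 |
-------------+-----------------+-----------------+-------------------+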

perf kvm top observation for kernbench and ebizzy (nople, 4k, 32k window)



Is the perf data for 1x overcommit?


Yes, 16vcpu guest on 16 core




pleopt   ple_gap=0

ebizzy : 18131 records/s
63.78%  [guest.kernel]  [g] _raw_spin_lock_irqsave
 5.65%  [guest.kernel]  [g] smp_call_function_many
 3.12%  [guest.kernel]  [g] clear_page
 3.02%  [guest.kernel]  [g] down_read_trylock
 1.85%  [guest.kernel]  [g] async_page_fault
 1.81%  [guest.kernel]  [g] up_read
 1.76%  [guest.kernel]  [g] native_apic_mem_write
 1.70%  [guest.kernel]  [g] find_vma


Does 'perf kvm top' not give host samples at the same time?  Would be
nice to see the host overhead as a function of varying ple window.  I
would expect that to be the major difference between 4/16/32k window
sizes.


No, I did something like this:
perf kvm --guestvmlinux ./vmlinux.guest top -g -U -d 3
Yes, that is a good idea.

(I am getting some segfaults with perf top. I think it is already fixed,
but I have yet to see the patch that fixes it.)





A big concern I have (if this is 1x overcommit) for ebizzy is that it
has just terrible scalability to begin with.  I do not think we should
try to optimize such a bad workload.



I think my way of running dbench has some flaw, so I went to ebizzy.
Could you let me know how you generally run dbench?


Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-10 Thread David Ahern


On 10/10/12 11:54 AM, Raghavendra K T wrote:

No, I did something like this
perf kvm  --guestvmlinux ./vmlinux.guest top -g  -U -d 3. Yes that is a
good idea.

(I am getting some segfaults with perf top, I think it is already fixed
but yet to see the patch that fixes)


What version of perf:  perf --version



Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-10 Thread Raghavendra K T


On 10/10/2012 11:33 PM, David Ahern wrote:

On 10/10/12 11:54 AM, Raghavendra K T wrote:

No, I did something like this
perf kvm  --guestvmlinux ./vmlinux.guest top -g  -U -d 3. Yes that is a
good idea.

(I am getting some segfaults with perf top, I think it is already fixed
but yet to see the patch that fixes)


What version of perf:  perf --version



perf version 2.6.32-279.el6.x86_64.debug

(From searching, it appears to be fixed in 288. I could not dig out the
actual patch, though.)


Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-10 Thread Andrew Theurer

On Wed, 2012-10-10 at 23:13 +0530, Raghavendra K T wrote:
 On 10/10/2012 07:54 PM, Andrew Theurer wrote:
  I ran 'perf sched map' on the dbench workload for medium and large VMs,
  and I thought I would share some of the results.  I think it helps to
  visualize what's going on regarding the yielding.
 
  These files are png bitmaps, generated from processing output from 'perf
  sched map' (and perf data generated from 'perf sched record').  The Y
  axis is the host cpus, each row being 10 pixels high.  For these tests,
  there are 80 host cpus, so the total height is 800 pixels.  The X axis
  is time (in microseconds), with each pixel representing 1 microsecond.
  Each bitmap plots 30,000 microseconds.  The bitmaps are quite wide
  obviously, and zooming in/out while viewing is recommended.
 
  Each row (each host cpu) is assigned a color based on what thread is
  running.  vCPUs of the same VM are assigned a common color (like red,
  blue, magenta, etc), and each vCPU has a unique brightness for that
  color.  There are a maximum of 12 assignable colors, so any VMs beyond
  the first 12 revert to a vCPU color of gray.  I would use more colors, but it becomes
  harder to distinguish one color from another.  The white color
  represents missing data from perf, and black color represents any thread
  which is not a vCPU.
 
  For the following tests, VMs were pinned to host NUMA nodes and to
  specific cpus to help with consistency and operate within the
  constraints of the last test (gang scheduler).
 
  Here is a good example of PLE.  These are 10-way VMs, 16 of them (as
  described above only 12 of the VMs have a color, rest are gray).
 
  https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU
 
 This is a very nice way to visualize what is happening. The beginning of the
 graph looks a little messy, but later it is clear.
 
 
  If you zoom out and look at the whole bitmap, you may notice the 4ms
  intervals of the scheduler.  They are pretty well aligned across all
  cpus.  Normally, for cpu bound workloads, we would expect to see each
  thread to run for 4 ms, then something else getting to run, and so on.
  That is mostly true in this test.  We have 2x over-commit and we
  generally see the switching of threads at 4ms.  One thing to note is
  that not all vCPU threads for the same VM run at exactly the same time,
  and that is expected and the whole reason for lock-holder preemption.
  Now, if you zoom in on the bitmap, you should notice within the 4ms
  intervals there is some task switching going on.  This is most likely
  because of the yield_to initiated by the PLE handler.  In this case
  there is not that much yielding to do.   It's quite clean, and the
  performance is quite good.
 
  Below is an example of PLE, but this time with 20-way VMs, 8 of them.
  CPU over-commit is still 2x.
 
  https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU
 
 I think this link is still the 10x16 one. Could you paste the link again?

Oops
https://docs.google.com/open?id=0B6tfUNlZ-14wSGtYYzZtRTcyVjQ

 
 
  This one looks quite different.  In short, it's a mess.  The switching
  between tasks can be lower than 10 microseconds.  It basically never
  recovers.  There is constant yielding all the time.
 
  Below is again 8 x 20-way VMs, but this time I tried out Nikunj's gang
  scheduling patches.  While I am not recommending gang scheduling, I
  think it's a good data point.  The performance is 3.88x the PLE result.
 
  https://docs.google.com/open?id=0B6tfUNlZ-14wWXdscWcwNTVEY3M
 
  Note that the task switching intervals of 4ms are quite obvious again,
  and this time all vCPUs from same VM run at the same time.  It
  represents the best possible outcome.
 
 
  Anyway, I thought the bitmaps might help better visualize what's going
  on.
 
  -Andrew
 
 
 
 
 



Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-10 Thread Andrew Theurer

On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:
 On 10/10/2012 08:29 AM, Andrew Theurer wrote:
  On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
  * Avi Kivity a...@redhat.com [2012-10-04 17:00:28]:
 
  On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
  On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
 
  Again the numbers are ridiculously high for arch_local_irq_restore.
  Maybe there's a bad perf/kvm interaction when we're injecting an
  interrupt, I can't believe we're spending 84% of the time running the
  popf instruction.
 
  Smells like a software fallback that doesn't do NMI, hrtimer based
  sampling typically hits popf where we re-enable interrupts.
 
  Good nose, that's probably it.  Raghavendra, can you ensure that the PMU
  is properly exposed?  'dmesg' in the guest will tell.  If it isn't, -cpu
  host will expose it (and a good idea anyway to get best performance).
 
 
  Hi Avi, you are right. SandyBridge machine result was not proper.
  I cleaned up the services, enabled PMU, re-ran all the test again.
 
  Here is the summary:
  We do get good benefit by increasing ple window. Though we don't
  see good benefit for kernbench and sysbench, for ebizzy, we get huge
  improvement for 1x scenario. (almost 2/3rd of ple disabled case).
 
  Let me know if you think we can increase the default ple_window
  itself to 16k.
 
  I am experimenting with V2 version of undercommit improvement(this) patch
  series, But I think if you wish  to go for increase of
  default ple_window, then we would have to measure the benefit of patches
  when ple_window = 16k.
 
  I can respin the whole series including this default ple_window change.
 
  I also have the perf kvm top result for both ebizzy and kernbench.
  I think they are in expected lines now.
 
  Improvements
  
 
  16 core PLE machine with 16 vcpu guest
 
  base = 3.6.0-rc5 + ple handler optimization patches
  base_pleopt_16k = base + ple_window = 16k
  base_pleopt_32k = base + ple_window = 32k
  base_pleopt_nople = base + ple_gap = 0
  kernbench, hackbench, sysbench (time in sec lower is better)
  ebizzy (rec/sec higher is better)
 
  % improvements w.r.t base (ple_window = 4k)
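  -------------+-----------------+-----------------+-------------------+
               | base_pleopt_16k | base_pleopt_32k | base_pleopt_nople |
  -------------+-----------------+-----------------+-------------------+
  kernbench_1x |         0.42371 |         1.15164 |           0.09320 |
  kernbench_2x |        -1.40981 |       -17.48282 |        -570.77053 |
  -------------+-----------------+-----------------+-------------------+
  sysbench_1x  |        -0.92367 |         0.24241 |          -0.27027 |
  sysbench_2x  |        -2.22706 |        -0.30896 |          -1.27573 |
  sysbench_3x  |        -0.75509 |         0.09444 |          -2.97756 |
  -------------+-----------------+-----------------+-------------------+
  ebizzy_1x    |        54.99976 |        67.29460 |          74.14076 |
  ebizzy_2x    |        -8.83386 |       -27.38403 |         -96.22066 |
  -------------+-----------------+-----------------+-------------------+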
 
  perf kvm top observation for kernbench and ebizzy (nople, 4k, 32k window)
  
 
  Is the perf data for 1x overcommit?
 
 Yes, 16vcpu guest on 16 core
 
 
  pleopt   ple_gap=0
  
  ebizzy : 18131 records/s
  63.78%  [guest.kernel]  [g] _raw_spin_lock_irqsave
   5.65%  [guest.kernel]  [g] smp_call_function_many
   3.12%  [guest.kernel]  [g] clear_page
   3.02%  [guest.kernel]  [g] down_read_trylock
   1.85%  [guest.kernel]  [g] async_page_fault
   1.81%  [guest.kernel]  [g] up_read
   1.76%  [guest.kernel]  [g] native_apic_mem_write
   1.70%  [guest.kernel]  [g] find_vma
 
  Does 'perf kvm top' not give host samples at the same time?  Would be
  nice to see the host overhead as a function of varying ple window.  I
  would expect that to be the major difference between 4/16/32k window
  sizes.
 
 No, I did something like this
 perf kvm  --guestvmlinux ./vmlinux.guest top -g  -U -d 3. Yes that is a
 good idea.
 
 (I am getting some segfaults with perf top, I think it is already fixed
 but yet to see the patch that fixes)
 
 
 
 
  A big concern I have (if this is 1x overcommit) for ebizzy is that it
  has just terrible scalability to begin with.  I do not think we should
  try to optimize such a bad workload.
 
 
 I think my way of running dbench has some flaw, so I went to ebizzy.
 Could you let me know how you generally run dbench?

I mount a tmpfs and then specify that mount for dbench to run on.  This
eliminates all IO.  I use a 300 second run time and number of threads is
equal to number of vcpus.  All of the VMs of course need to have a
synchronized start.
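
The run described above could be scripted roughly as follows. This is only a
sketch: the mount point is a hypothetical path, and the -D (directory) and
-t (time limit) options are assumed from dbench 4.x, so check `dbench --help`
on the version in use. The helper only builds the command line; mounting the
tmpfs and synchronizing the start across VMs are left out.

```python
def dbench_command(tmpfs_mount, num_vcpus, runtime_s=300):
    """Build a dbench command line for a tmpfs-backed, IO-free run.

    tmpfs_mount: directory where a tmpfs is already mounted (hypothetical path)
    num_vcpus:   number of dbench clients, set equal to the guest's vcpu count
    runtime_s:   run time in seconds (300 in the description above)
    """
    # -D and -t are assumed to be the directory and time-limit options.
    return ["dbench", "-D", tmpfs_mount, "-t", str(runtime_s), str(num_vcpus)]


cmd = dbench_command("/mnt/dbench-tmpfs", num_vcpus=16)
print(" ".join(cmd))
```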

I would also make sure you are using a recent kernel for dbench, where
the dcache scalability is much improved.  Without any

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-10 Thread Andrew Theurer

I ran 'perf sched map' on the dbench workload for medium and large VMs,
and I thought I would share some of the results.  I think it helps to
visualize what's going on regarding the yielding.

These files are png bitmaps, generated from processing output from 'perf
sched map' (and perf data generated from 'perf sched record').  The Y
axis is the host cpus, each row being 10 pixels high.  For these tests,
there are 80 host cpus, so the total height is 800 pixels.  The X axis
is time (in microseconds), with each pixel representing 1 microsecond.
Each bitmap plots 30,000 microseconds.  The bitmaps are quite wide
obviously, and zooming in/out while viewing is recommended.
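
As a rough check of the geometry just described, the bitmap dimensions follow
directly from the numbers given (80 host cpus, 10 pixels per row, 1 pixel per
microsecond, 30,000 microseconds per bitmap). The function and variable names
below are made up for illustration.

```python
HOST_CPUS = 80          # host cpus in these tests
ROW_HEIGHT_PX = 10      # pixels per host cpu row
US_PER_PIXEL = 1        # each pixel is one microsecond
PLOT_SPAN_US = 30_000   # each bitmap covers 30,000 microseconds
TIMESLICE_US = 4_000    # the 4 ms scheduler interval


def bitmap_size(host_cpus, row_height_px, span_us, us_per_pixel):
    """Return (width_px, height_px) of one bitmap."""
    width = span_us // us_per_pixel
    height = host_cpus * row_height_px
    return width, height


width_px, height_px = bitmap_size(HOST_CPUS, ROW_HEIGHT_PX, PLOT_SPAN_US, US_PER_PIXEL)
timeslice_px = TIMESLICE_US // US_PER_PIXEL        # width of one 4 ms slice
slices_per_bitmap = PLOT_SPAN_US / TIMESLICE_US    # 4 ms slices per bitmap
print(width_px, height_px, timeslice_px, slices_per_bitmap)
```

So a single 4 ms timeslice spans 4,000 pixels, which is why zooming out is
needed to see the slice boundaries at all.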

Each row (each host cpu) is assigned a color based on what thread is
running.  vCPUs of the same VM are assigned a common color (like red,
blue, magenta, etc), and each vCPU has a unique brightness for that
color.  There are a maximum of 12 assignable colors, so any VMs beyond
the first 12 revert to a vCPU color of gray.  I would use more colors, but it becomes
harder to distinguish one color from another.  The white color
represents missing data from perf, and black color represents any thread
which is not a vCPU.

For the following tests, VMs were pinned to host NUMA nodes and to
specific cpus to help with consistency and operate within the
constraints of the last test (gang scheduler).

Here is a good example of PLE.  These are 10-way VMs, 16 of them (as
described above only 12 of the VMs have a color, rest are gray).

https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU

If you zoom out and look at the whole bitmap, you may notice the 4ms
intervals of the scheduler.  They are pretty well aligned across all
cpus.  Normally, for cpu bound workloads, we would expect to see each
thread to run for 4 ms, then something else getting to run, and so on.
That is mostly true in this test.  We have 2x over-commit and we
generally see the switching of threads at 4ms.  One thing to note is
that not all vCPU threads for the same VM run at exactly the same time,
and that is expected and the whole reason for lock-holder preemption.
Now, if you zoom in on the bitmap, you should notice within the 4ms
intervals there is some task switching going on.  This is most likely
because of the yield_to initiated by the PLE handler.  In this case
there is not that much yielding to do.   It's quite clean, and the
performance is quite good.

Below is an example of PLE, but this time with 20-way VMs, 8 of them.
CPU over-commit is still 2x.
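
The over-commit figure for both configurations can be checked with simple
arithmetic, taking over-commit as total vCPUs divided by host cpus (80 in
these tests). The helper name is made up for illustration.

```python
def overcommit_ratio(num_vms, vcpus_per_vm, host_cpus):
    """Total vCPUs across all VMs divided by the number of host cpus."""
    return (num_vms * vcpus_per_vm) / host_cpus


# 16 x 10-way VMs and 8 x 20-way VMs, both on 80 host cpus
print(overcommit_ratio(16, 10, 80))
print(overcommit_ratio(8, 20, 80))
```

Both layouts put 160 vCPUs on 80 host cpus, so the difference between the two
bitmaps comes from VM size, not from the over-commit level.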

https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU

This one looks quite different.  In short, it's a mess.  The switching
between tasks can be lower than 10 microseconds.  It basically never
recovers.  There is constant yielding all the time.  

Below is again 8 x 20-way VMs, but this time I tried out Nikunj's gang
scheduling patches.  While I am not recommending gang scheduling, I
think it's a good data point.  The performance is 3.88x the PLE result.

https://docs.google.com/open?id=0B6tfUNlZ-14wWXdscWcwNTVEY3M

Note that the task switching intervals of 4ms are quite obvious again,
and this time all vCPUs from same VM run at the same time.  It
represents the best possible outcome.


Anyway, I thought the bitmaps might help better visualize what's going
on.

-Andrew




Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-09 Thread Raghavendra K T

* Avi Kivity a...@redhat.com [2012-10-04 17:00:28]:

 On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
  On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
  
  Again the numbers are ridiculously high for arch_local_irq_restore.
  Maybe there's a bad perf/kvm interaction when we're injecting an
  interrupt, I can't believe we're spending 84% of the time running the
  popf instruction. 
  
  Smells like a software fallback that doesn't do NMI, hrtimer based
  sampling typically hits popf where we re-enable interrupts.
 
 Good nose, that's probably it.  Raghavendra, can you ensure that the PMU
 is properly exposed?  'dmesg' in the guest will tell.  If it isn't, -cpu
 host will expose it (and a good idea anyway to get best performance).
 

Hi Avi, you are right. The SandyBridge machine result was not proper.
I cleaned up the services, enabled the PMU, and re-ran all the tests.

Here is the summary:
We do get a good benefit by increasing ple_window. Though we don't
see a good benefit for kernbench and sysbench, for ebizzy we get a huge
improvement in the 1x scenario (almost 2/3rd of the ple-disabled case).

Let me know if you think we can increase the default ple_window
itself to 16k.

I am experimenting with the V2 version of this undercommit improvement
patch series, but I think if you wish to increase the
default ple_window, then we would have to measure the benefit of the patches
when ple_window = 16k.

I can respin the whole series including this default ple_window change.

I also have the perf kvm top results for both ebizzy and kernbench.
I think they are along expected lines now.

Improvements


16 core PLE machine with 16 vcpu guest

base = 3.6.0-rc5 + ple handler optimization patches
base_pleopt_16k = base + ple_window = 16k
base_pleopt_32k = base + ple_window = 32k
base_pleopt_nople = base + ple_gap = 0
kernbench, hackbench, sysbench (time in sec lower is better)
ebizzy (rec/sec higher is better)

% improvements w.r.t base (ple_window = 4k)
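-------------+-----------------+-----------------+-------------------+
             | base_pleopt_16k | base_pleopt_32k | base_pleopt_nople |
-------------+-----------------+-----------------+-------------------+
kernbench_1x |         0.42371 |         1.15164 |           0.09320 |
kernbench_2x |        -1.40981 |       -17.48282 |        -570.77053 |
-------------+-----------------+-----------------+-------------------+
sysbench_1x  |        -0.92367 |         0.24241 |          -0.27027 |
sysbench_2x  |        -2.22706 |        -0.30896 |          -1.27573 |
sysbench_3x  |        -0.75509 |         0.09444 |          -2.97756 |
-------------+-----------------+-----------------+-------------------+
ebizzy_1x    |        54.99976 |        67.29460 |          74.14076 |
ebizzy_2x    |        -8.83386 |       -27.38403 |         -96.22066 |
-------------+-----------------+-----------------+-------------------+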

perf kvm top observation for kernbench and ebizzy (nople, 4k, 32k window) 

pleopt   ple_gap=0

ebizzy : 18131 records/s
   63.78%  [guest.kernel]  [g] _raw_spin_lock_irqsave
    5.65%  [guest.kernel]  [g] smp_call_function_many
    3.12%  [guest.kernel]  [g] clear_page
    3.02%  [guest.kernel]  [g] down_read_trylock
    1.85%  [guest.kernel]  [g] async_page_fault
    1.81%  [guest.kernel]  [g] up_read
    1.76%  [guest.kernel]  [g] native_apic_mem_write
    1.70%  [guest.kernel]  [g] find_vma

kernbench : Elapsed Time 29.4933 (27.6007)
    5.72%  [guest.kernel]  [g] async_page_fault
    3.48%  [guest.kernel]  [g] pvclock_clocksource_read
    2.68%  [guest.kernel]  [g] copy_user_generic_unrolled
    2.58%  [guest.kernel]  [g] clear_page
    2.09%  [guest.kernel]  [g] page_cache_get_speculative
    2.00%  [guest.kernel]  [g] do_raw_spin_lock
    1.78%  [guest.kernel]  [g] unmap_single_vma
    1.74%  [guest.kernel]  [g] kmem_cache_alloc

pleopt ple_window = 4k
---
ebizzy: 10176 records/s
   69.17%  [guest.kernel]  [g] _raw_spin_lock_irqsave
    3.34%  [guest.kernel]  [g] clear_page
    2.16%  [guest.kernel]  [g] down_read_trylock
    1.94%  [guest.kernel]  [g] async_page_fault
    1.89%  [guest.kernel]  [g] native_apic_mem_write
    1.63%  [guest.kernel]  [g] smp_call_function_many
    1.58%  [guest.kernel]  [g] SetPageLRU
    1.37%  [guest.kernel]  [g] up_read
    1.01%  [guest.kernel]  [g] find_vma


kernbench: 29.9533
Events: 240K cycles
    6.04%  [guest.kernel]  [g] async_page_fault
    4.17%  [guest.kernel]  [g] pvclock_clocksource_read
    3.28%  [guest.kernel]  [g] clear_page
    2.57%  [guest.kernel]  [g] copy_user_generic_unrolled
    2.30%  [guest.kernel]  [g] do_raw_spin_lock
    2.13%  [guest.kernel]  [g] _raw_spin_lock_irqsave
    1.93%  [guest.kernel]  [g] page_cache_get_speculative
    1.92%  [guest.kernel]  [g] unmap_single_vma
    1.77%  [guest.kernel]  [g] kmem_cache_alloc
    1.61%  [guest.kernel]  [g] __d_lookup_rcu
1.19%

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-09 Thread Andrew Theurer

On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
 * Avi Kivity a...@redhat.com [2012-10-04 17:00:28]:
 
  On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
   On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
   
   Again the numbers are ridiculously high for arch_local_irq_restore.
   Maybe there's a bad perf/kvm interaction when we're injecting an
   interrupt, I can't believe we're spending 84% of the time running the
   popf instruction. 
   
   Smells like a software fallback that doesn't do NMI, hrtimer based
   sampling typically hits popf where we re-enable interrupts.
  
  Good nose, that's probably it.  Raghavendra, can you ensure that the PMU
  is properly exposed?  'dmesg' in the guest will tell.  If it isn't, -cpu
  host will expose it (and a good idea anyway to get best performance).
  
 
 Hi Avi, you are right. SandyBridge machine result was not proper.
 I cleaned up the services, enabled PMU, re-ran all the test again.
 
 Here is the summary:
 We do get good benefit by increasing ple window. Though we don't
 see good benefit for kernbench and sysbench, for ebizzy, we get huge
 improvement for 1x scenario. (almost 2/3rd of ple disabled case).
 
 Let me know if you think we can increase the default ple_window
 itself to 16k.
 
 I am experimenting with V2 version of undercommit improvement(this) patch
 series, But I think if you wish  to go for increase of
 default ple_window, then we would have to measure the benefit of patches
 when ple_window = 16k.
 
 I can respin the whole series including this default ple_window change.
 
 I also have the perf kvm top result for both ebizzy and kernbench.
 I think they are in expected lines now.
 
 Improvements
 
 
 16 core PLE machine with 16 vcpu guest
 
 base = 3.6.0-rc5 + ple handler optimization patches
 base_pleopt_16k = base + ple_window = 16k
 base_pleopt_32k = base + ple_window = 32k
 base_pleopt_nople = base + ple_gap = 0
 kernbench, hackbench, sysbench (time in sec lower is better)
 ebizzy (rec/sec higher is better)
 
 % improvements w.r.t base (ple_window = 4k)
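 -------------+-----------------+-----------------+-------------------+
              | base_pleopt_16k | base_pleopt_32k | base_pleopt_nople |
 -------------+-----------------+-----------------+-------------------+
 kernbench_1x |         0.42371 |         1.15164 |           0.09320 |
 kernbench_2x |        -1.40981 |       -17.48282 |        -570.77053 |
 -------------+-----------------+-----------------+-------------------+
 sysbench_1x  |        -0.92367 |         0.24241 |          -0.27027 |
 sysbench_2x  |        -2.22706 |        -0.30896 |          -1.27573 |
 sysbench_3x  |        -0.75509 |         0.09444 |          -2.97756 |
 -------------+-----------------+-----------------+-------------------+
 ebizzy_1x    |        54.99976 |        67.29460 |          74.14076 |
 ebizzy_2x    |        -8.83386 |       -27.38403 |         -96.22066 |
 -------------+-----------------+-----------------+-------------------+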
 
 perf kvm top observation for kernbench and ebizzy (nople, 4k, 32k window) 
 

Is the perf data for 1x overcommit?

 pleopt   ple_gap=0
 
 ebizzy : 18131 records/s
 63.78%  [guest.kernel]  [g] _raw_spin_lock_irqsave
 5.65%  [guest.kernel]  [g] smp_call_function_many
 3.12%  [guest.kernel]  [g] clear_page
 3.02%  [guest.kernel]  [g] down_read_trylock
 1.85%  [guest.kernel]  [g] async_page_fault
 1.81%  [guest.kernel]  [g] up_read
 1.76%  [guest.kernel]  [g] native_apic_mem_write
 1.70%  [guest.kernel]  [g] find_vma

Does 'perf kvm top' not give host samples at the same time?  Would be
nice to see the host overhead as a function of varying ple window.  I
would expect that to be the major difference between 4/16/32k window
sizes.

A big concern I have (if this is 1x overcommit) for ebizzy is that it
has just terrible scalability to begin with.  I do not think we should
try to optimize such a bad workload.

 kernbench :Elapsed Time 29.4933 (27.6007)
5.72%  [guest.kernel]  [g] async_page_fault
 3.48%  [guest.kernel]  [g] pvclock_clocksource_read
 2.68%  [guest.kernel]  [g] copy_user_generic_unrolled
 2.58%  [guest.kernel]  [g] clear_page
 2.09%  [guest.kernel]  [g] page_cache_get_speculative
 2.00%  [guest.kernel]  [g] do_raw_spin_lock
 1.78%  [guest.kernel]  [g] unmap_single_vma
 1.74%  [guest.kernel]  [g] kmem_cache_alloc

 
 pleopt ple_window = 4k
 ---
 ebizzy: 10176 records/s
69.17%  [guest.kernel]  [g] _raw_spin_lock_irqsave
 3.34%  [guest.kernel]  [g] clear_page
 2.16%  [guest.kernel]  [g] down_read_trylock
 1.94%  [guest.kernel]  [g] async_page_fault
 1.89%  [guest.kernel]  [g] native_apic_mem_write
 1.63%  [guest.kernel]  [g] smp_call_function_many
 1.58%  [guest.kernel]  [g] SetPageLRU
 1.37%  [guest.kernel]  [g] up_read
 1.01%

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-07 Thread Avi Kivity

On 10/05/2012 10:36 AM, Raghavendra K T wrote:

 You can store i in the vcpu itself:

set_bit(vcpu->index, kvm->preempted);

 This will make the fact that vcpus are stored in an array into API
 instead of implementation detail :( There were patches for vcpu
 destruction that changed the array to rculist. Well, it will be still
 possible to make the array rcu protected and copy it every time vcpu is
 deleted/added I guess.

 
 IIUC, the summary is that we are going with:
 - Let vcpu array be rcu protected.

That's for the future.  For now ->vcpus[] is statically allocated.

 - We store an index inside the vcpu, which should be updated when a vcpu is
 added/deleted.

Yes.
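
To make the indexing scheme agreed above concrete, here is a small Python
model of it. This is not kernel code: the class and method names (Vm, Vcpu,
mark_preempted and so on) are hypothetical, and only the idea of storing the
table index in the vcpu and using it as the bit position in a per-VM
"preempted" bitmap comes from the thread.

```python
class Vcpu:
    def __init__(self, vcpu_id, index):
        self.vcpu_id = vcpu_id  # id chosen by userspace, may be large or sparse
        self.index = index      # position in the VM's vcpu table


class Vm:
    def __init__(self):
        self.vcpus = []     # stands in for the statically allocated vcpu table
        self.preempted = 0  # integer used as a bitmap: bit i <-> vcpus[i]

    def add_vcpu(self, vcpu_id):
        # The index is assigned once, when the vcpu is added to the table.
        vcpu = Vcpu(vcpu_id, index=len(self.vcpus))
        self.vcpus.append(vcpu)
        return vcpu

    def mark_preempted(self, vcpu):
        # Model of set_bit(vcpu->index, kvm->preempted)
        self.preempted |= 1 << vcpu.index

    def clear_preempted(self, vcpu):
        self.preempted &= ~(1 << vcpu.index)

    def preempted_vcpus(self):
        # Bit position -> vcpu lookup is a direct table index, no search.
        return [v for v in self.vcpus if (self.preempted >> v.index) & 1]


vm = Vm()
first = vm.add_vcpu(vcpu_id=0)
second = vm.add_vcpu(vcpu_id=300)  # a large id still gets a small index
vm.mark_preempted(second)
print(bin(vm.preempted), [v.vcpu_id for v in vm.preempted_vcpus()])
```

Using the table index keeps the bitmap no larger than the table itself, and
avoids the loop over kvm->vcpus[] that a lookup by pointer would need.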


-- 
error compiling committee.c: too many arguments to function

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-05 Thread Raghavendra K T


On 10/04/2012 12:59 PM, Gleb Natapov wrote:

On Wed, Oct 03, 2012 at 04:56:57PM +0200, Avi Kivity wrote:

On 10/03/2012 04:17 PM, Raghavendra K T wrote:

* Avi Kivity a...@redhat.com [2012-09-30 13:13:09]:


On 09/30/2012 01:07 PM, Gleb Natapov wrote:

On Sun, Sep 30, 2012 at 10:18:17AM +0200, Avi Kivity wrote:

On 09/28/2012 08:16 AM, Raghavendra K T wrote:




 +struct pv_sched_info {
 +   unsigned long   sched_bitmap;


Thinking, whether we need something similar to cpumask here?
Only thing is we are representing guest (v)cpumask.



DECLARE_BITMAP(sched_bitmap, KVM_MAX_VCPUS)


vcpu_id can be greater than KVM_MAX_VCPUS.


Use the index into the vcpu table as the bitmap index then.  In fact
it's better because then the lookup to get the vcpu pointer is trivial.


Did you mean, while setting the bitmap,

we should do
for (i = 1..n)
if (kvm->vcpus[i] == vcpu) set ith position in bitmap?


You can store i in the vcpu itself:

   set_bit(vcpu->index, kvm->preempted);


This will make the fact that vcpus are stored in an array into API
instead of implementation detail :( There were patches for vcpu
destruction that changed the array to rculist. Well, it will be still
possible to make the array rcu protected and copy it every time vcpu is
deleted/added I guess.



IIUC, the summary is that we are going with:
- Let the vcpu array be rcu protected.
- We store an index inside the vcpu, which should be updated when a vcpu is
added/deleted.


Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-05 Thread Raghavendra K T


On 10/04/2012 06:11 PM, Avi Kivity wrote:

On 10/04/2012 12:49 PM, Raghavendra K T wrote:

On 10/03/2012 10:35 PM, Avi Kivity wrote:

On 10/03/2012 02:22 PM, Raghavendra K T wrote:

So I think it's worth trying again with ple_window of 2-4.



Hi Avi,

I ran different benchmarks with increasing ple_window, and the results do not
seem to be encouraging for increasing ple_window.


Thanks for testing! Comments below.


Results:
16 core PLE machine with 16 vcpu guest.

base kernel = 3.6-rc5 + ple handler optimization patch
base_pleopt_8k = base kernel + ple window = 8k
base_pleopt_16k = base kernel + ple window = 16k
base_pleopt_32k = base kernel + ple window = 32k


Percentage improvements of benchmarks w.r.t base_pleopt with
ple_window = 4096

               base_pleopt_8k   base_pleopt_16k   base_pleopt_32k
-----------------------------------------------------------------

kernbench_1x         -5.54915         -15.94529         -44.31562
kernbench_2x         -7.89399         -17.75039         -37.73498


So, 44% degradation even with no overcommit?  That's surprising.


Yes. Kernbench was run with #threads = #vcpu * 2 as usual. Is spending
8 times the original ple_window cycles significant for 16 vcpus?


A PLE exit when not overcommitted cannot do any good, it is better to
spin in the guest rather than look for candidates on the host.  In fact
when we benchmark we often disable PLE completely.






I also got perf top output to analyse the difference. Difference comes
because of flushtlb (and also spinlock).


That's in the guest, yes?


Yes. Perf is in guest.





Ebizzy run for 4k ple_window
-  87.20%  [kernel]  [k] arch_local_irq_restore
 - arch_local_irq_restore
- 100.00% _raw_spin_unlock_irqrestore
   + 52.89% release_pages
   + 47.10% pagevec_lru_move_fn
-   5.71%  [kernel]  [k] arch_local_irq_restore
 - arch_local_irq_restore
+ 86.03% default_send_IPI_mask_allbutself_phys
+ 13.96% default_send_IPI_mask_sequence_phys
-   3.10%  [kernel]  [k] smp_call_function_many
   smp_call_function_many


Ebizzy run for 32k ple_window

-  91.40%  [kernel]  [k] arch_local_irq_restore
 - arch_local_irq_restore
- 100.00% _raw_spin_unlock_irqrestore
   + 53.13% release_pages
   + 46.86% pagevec_lru_move_fn
-   4.38%  [kernel]  [k] smp_call_function_many
   smp_call_function_many
-   2.51%  [kernel]  [k] arch_local_irq_restore
 - arch_local_irq_restore
+ 90.76% default_send_IPI_mask_allbutself_phys
+ 9.24% default_send_IPI_mask_sequence_phys



Both the 4k and the 32k results are crazy.  Why is
arch_local_irq_restore() so prominent?  Do you have a very high
interrupt rate in the guest?


How do I measure whether I have a high interrupt rate in the guest?
From the /proc/interrupts numbers I am not able to judge :(


'vmstat 1'



Thank you. I'll save this. Apart from in and cs, I think r (the number of
processes waiting for run time) in vmstat would be useful for me.
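
As a rough illustration, the three vmstat columns mentioned here (r, in, cs)
can be pulled out of captured 'vmstat 1' output by looking them up by header
name. The sample text below is made up for the example and is not data from
these runs. The parser assumes the usual procps layout, with one row of
column names followed by numeric rows.

```python
def parse_vmstat(text, wanted=("r", "in", "cs")):
    """Extract selected columns from captured `vmstat 1` output."""
    rows = [line.split() for line in text.strip().splitlines()]
    # The column-name row is the first one that contains every wanted name.
    header_idx = next(i for i, cols in enumerate(rows)
                      if all(name in cols for name in wanted))
    header = rows[header_idx]
    samples = []
    for cols in rows[header_idx + 1:]:
        # Skip repeated headers or anything that is not a full numeric row.
        if len(cols) != len(header) or not cols[0].isdigit():
            continue
        row = dict(zip(header, cols))
        samples.append({name: int(row[name]) for name in wanted})
    return samples


# Made-up sample, for illustration only.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 812345  10240 204800    0    0     5     3  150  300  1  1 98  0  0
17  0      0 811000  10240 204900    0    0     0     0 9800 21000 55 40  5  0  0
"""

for sample in parse_vmstat(SAMPLE):
    print(sample)
```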




I went back and got the results on a 32 core machine with a 32 vcpu guest.
Strangely, I got results supporting the claim that increasing ple_window
helps in the non-overcommitted scenario.

32 core 32 vcpu guest 1x scenarios.

ple_gap = 0
kernbench: Elapsed Time 38.61
ebizzy: 7463 records/s

ple_window = 4k
kernbench: Elapsed Time 43.5067
ebizzy: 2528 records/s

ple_window = 32k
kernbench: Elapsed Time 39.4133
ebizzy: 7196 records/s
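
To put the 32 core numbers above on the same footing as the percentage
tables, one can compute the improvement relative to ple_window = 4k. The
formula below is an assumption, since the thread does not spell out how its
percentages were derived: for elapsed time (lower is better) it uses
(base - new) / base, and for records/s (higher is better) it uses
(new - base) / base.

```python
def improvement_pct(base, new, lower_is_better):
    """Percentage improvement of `new` over `base` (assumed formula)."""
    if lower_is_better:
        return (base - new) / base * 100.0
    return (new - base) / base * 100.0


# Numbers from the 32 core, 32 vcpu, 1x runs listed above.
kernbench_s = {"ple_gap=0": 38.61, "4k": 43.5067, "32k": 39.4133}
ebizzy_rps = {"ple_gap=0": 7463, "4k": 2528, "32k": 7196}

for config in ("32k", "ple_gap=0"):
    kb = improvement_pct(kernbench_s["4k"], kernbench_s[config], lower_is_better=True)
    eb = improvement_pct(ebizzy_rps["4k"], ebizzy_rps[config], lower_is_better=False)
    print(f"{config:>10}: kernbench {kb:7.2f}%  ebizzy {eb:7.2f}%")
```

Under this formula a positive number always means "better than the 4k
baseline", regardless of whether the benchmark reports time or throughput.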


So maybe something was wrong with the first measurement.


Maybe I was not clear. The first time I ran on an x240 (SandyBridge)
16 core machine,

then ran on a 32 core x3850 to confirm the perf top results.
But yes, both had

[0.018997] Performance Events: Broken PMU hardware detected, using software events only.


problem, as rightly pointed out by you and PeterZ.

After -cpu host, I see that it is fixed on x240:

[0.017997] Performance Events: 16-deep LBR, SandyBridge events, Intel PMU driver.
[0.018868] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.


So I'll try it on x240 again.

(Somehow, on mx3850, -cpu host resulted in
[0.026995] Performance Events: unsupported p6 CPU model 26 no PMU driver, software events only.

I think qemu needs some fix, as pointed out in
http://www.mail-archive.com/kvm@vger.kernel.org/msg55836.html)







perf top for ebizzy for above:
ple_gap = 0
-  84.74%  [kernel]  [k] arch_local_irq_restore
- arch_local_irq_restore
   - 100.00% _raw_spin_unlock_irqrestore
  + 50.96% release_pages
  + 49.02% pagevec_lru_move_fn
-   6.57%  [kernel]  [k] arch_local_irq_restore
- arch_local_irq_restore
   + 92.54% default_send_IPI_mask_allbutself_phys
   + 7.46% default_send_IPI_mask_sequence_phys
-   1.54%  [kernel]  [k] smp_call_function_many
  smp_call_function_many


Again the numbers are ridiculously high for arch_local_irq_restore.
Maybe there's a bad perf/kvm interaction when we're injecting an
interrupt, I can't believe we're spending 84% of the time running the
popf instruction.

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-05 Thread Raghavendra K T


On 10/04/2012 08:11 PM, Andrew Theurer wrote:

On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:

On 10/04/2012 12:49 PM, Raghavendra K T wrote:

On 10/03/2012 10:35 PM, Avi Kivity wrote:

On 10/03/2012 02:22 PM, Raghavendra K T wrote:

So I think it's worth trying again with ple_window of 2-4.



Hi Avi,

I ran different benchmarks with increasing ple_window, and the results do
not seem encouraging for increasing ple_window.


Thanks for testing! Comments below.


Results:
16 core PLE machine with 16 vcpu guest.

base kernel = 3.6-rc5 + ple handler optimization patch
base_pleopt_8k = base kernel + ple window = 8k
base_pleopt_16k = base kernel + ple window = 16k
base_pleopt_32k = base kernel + ple window = 32k


Percentage improvements of benchmarks w.r.t base_pleopt with
ple_window = 4096

                base_pleopt_8k  base_pleopt_16k  base_pleopt_32k
-----------------------------------------------------------------
kernbench_1x    -5.54915        -15.94529        -44.31562
kernbench_2x    -7.89399        -17.75039        -37.73498


So, 44% degradation even with no overcommit?  That's surprising.


Yes. Kernbench was run with #threads = #vcpu * 2 as usual. Is spending
8 times the original ple_window cycles across 16 vcpus significant?


A PLE exit when not overcommitted cannot do any good; it is better to
spin in the guest rather than look for candidates on the host.  In fact
when we benchmark we often disable PLE completely.


Agreed.  However, I really do not understand why the kernbench regressed
with bigger ple_window.  It should stay the same or improve.  Raghu, do
you have perf data for the kernbench runs?


Andrew, no. I'll get this with perf kvm.


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-04 Thread Gleb Natapov

On Wed, Oct 03, 2012 at 04:56:57PM +0200, Avi Kivity wrote:
 On 10/03/2012 04:17 PM, Raghavendra K T wrote:
  * Avi Kivity a...@redhat.com [2012-09-30 13:13:09]:
  
  On 09/30/2012 01:07 PM, Gleb Natapov wrote:
   On Sun, Sep 30, 2012 at 10:18:17AM +0200, Avi Kivity wrote:
   On 09/28/2012 08:16 AM, Raghavendra K T wrote:

   
+struct pv_sched_info {
+   unsigned long   sched_bitmap;

Thinking, whether we need something similar to cpumask here?
Only thing is we are representing guest (v)cpumask.

   
   DECLARE_BITMAP(sched_bitmap, KVM_MAX_VCPUS)
   
   vcpu_id can be greater than KVM_MAX_VCPUS.
  
  Use the index into the vcpu table as the bitmap index then.  In fact
  it's better because then the lookup to get the vcpu pointer is trivial.
  
  Did you mean that, while setting the bitmap, we should do

  for (i = 0..n-1)
  if (kvm->vcpus[i] == vcpu) set the ith position in the bitmap?
 
 You can store i in the vcpu itself:
 
   set_bit(vcpu->index, kvm->preempted);
 
This will turn the fact that vcpus are stored in an array into API
rather than an implementation detail :( There were patches for vcpu
destruction that changed the array to an rculist. Well, I guess it will
still be possible to make the array RCU-protected and copy it every time
a vcpu is deleted or added.

--
Gleb.

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-04 Thread Raghavendra K T


On 10/03/2012 10:35 PM, Avi Kivity wrote:

On 10/03/2012 02:22 PM, Raghavendra K T wrote:

So I think it's worth trying again with ple_window of 2-4.



Hi Avi,

I ran different benchmarks with increasing ple_window, and the results do
not seem encouraging for increasing ple_window.


Thanks for testing! Comments below.


Results:
16 core PLE machine with 16 vcpu guest.

base kernel = 3.6-rc5 + ple handler optimization patch
base_pleopt_8k = base kernel + ple window = 8k
base_pleopt_16k = base kernel + ple window = 16k
base_pleopt_32k = base kernel + ple window = 32k


Percentage improvements of benchmarks w.r.t base_pleopt with ple_window = 4096

                base_pleopt_8k  base_pleopt_16k  base_pleopt_32k
-----------------------------------------------------------------
kernbench_1x    -5.54915        -15.94529        -44.31562
kernbench_2x    -7.89399        -17.75039        -37.73498


So, 44% degradation even with no overcommit?  That's surprising.


Yes. Kernbench was run with #threads = #vcpu * 2 as usual. Is spending
8 times the original ple_window cycles across 16 vcpus significant?




I also got perf top output to analyse the difference. Difference comes
because of flushtlb (and also spinlock).


That's in the guest, yes?


Yes. Perf is in guest.





Ebizzy run for 4k ple_window
-  87.20%  [kernel]  [k] arch_local_irq_restore
- arch_local_irq_restore
   - 100.00% _raw_spin_unlock_irqrestore
  + 52.89% release_pages
  + 47.10% pagevec_lru_move_fn
-   5.71%  [kernel]  [k] arch_local_irq_restore
- arch_local_irq_restore
   + 86.03% default_send_IPI_mask_allbutself_phys
   + 13.96% default_send_IPI_mask_sequence_phys
-   3.10%  [kernel]  [k] smp_call_function_many
  smp_call_function_many


Ebizzy run for 32k ple_window

-  91.40%  [kernel]  [k] arch_local_irq_restore
- arch_local_irq_restore
   - 100.00% _raw_spin_unlock_irqrestore
  + 53.13% release_pages
  + 46.86% pagevec_lru_move_fn
-   4.38%  [kernel]  [k] smp_call_function_many
  smp_call_function_many
-   2.51%  [kernel]  [k] arch_local_irq_restore
- arch_local_irq_restore
   + 90.76% default_send_IPI_mask_allbutself_phys
   + 9.24% default_send_IPI_mask_sequence_phys



Both the 4k and the 32k results are crazy.  Why is
arch_local_irq_restore() so prominent?  Do you have a very high
interrupt rate in the guest?


How do I measure whether I have a high interrupt rate in the guest?
From the /proc/interrupts numbers I am not able to judge :(

I went back and got the results on a 32 core machine with a 32 vcpu guest.
Strangely, I got results supporting the claim that increasing ple_window
helps in the non-overcommitted scenario.


32 core 32 vcpu guest 1x scenarios.

ple_gap = 0
kernbench: Elapsed Time 38.61
ebizzy: 7463 records/s

ple_window = 4k
kernbench: Elapsed Time 43.5067
ebizzy: 2528 records/s

ple_window = 32k
kernbench: Elapsed Time 39.4133
ebizzy: 7196 records/s
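
As a rough cross-check of the 32 vcpu numbers above, the relative change
against the ple_window = 4k run can be computed as sketched below. The thread
does not spell out the formula behind its percentage tables, so the two
helpers (pct_gain_lower_is_better and pct_gain_higher_is_better, both names
made up here) encode the usual convention as an assumption.

```c
/*
 * Assumed convention (not stated in the thread): a positive result means
 * the tested configuration is better than the baseline.
 */

/* For elapsed-time benchmarks such as kernbench: lower is better. */
static double pct_gain_lower_is_better(double base, double value)
{
    return (base - value) / base * 100.0;
}

/* For throughput benchmarks such as ebizzy: higher is better. */
static double pct_gain_higher_is_better(double base, double value)
{
    return (value - base) / base * 100.0;
}
```

With the figures above, pct_gain_lower_is_better(43.5067, 39.4133) gives
roughly 9.4 for kernbench, and pct_gain_higher_is_better(2528, 7196) gives
roughly 185 for ebizzy, both for ple_window = 32k relative to 4k.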


perf top for ebizzy for above:
ple_gap = 0
-  84.74%  [kernel]  [k] arch_local_irq_restore
   - arch_local_irq_restore
  - 100.00% _raw_spin_unlock_irqrestore
 + 50.96% release_pages
 + 49.02% pagevec_lru_move_fn
-   6.57%  [kernel]  [k] arch_local_irq_restore
   - arch_local_irq_restore
  + 92.54% default_send_IPI_mask_allbutself_phys
  + 7.46% default_send_IPI_mask_sequence_phys
-   1.54%  [kernel]  [k] smp_call_function_many
 smp_call_function_many

ple_window = 32k
-  84.47%  [kernel]  [k] arch_local_irq_restore
   + arch_local_irq_restore
-   6.46%  [kernel]  [k] arch_local_irq_restore
   - arch_local_irq_restore
  + 93.51% default_send_IPI_mask_allbutself_phys
  + 6.49% default_send_IPI_mask_sequence_phys
-   1.80%  [kernel]  [k] smp_call_function_many
   - smp_call_function_many
  + 99.98% native_flush_tlb_others


ple_window = 4k
-  91.35%  [kernel]  [k] arch_local_irq_restore
   - arch_local_irq_restore
  - 100.00% _raw_spin_unlock_irqrestore
 + 53.19% release_pages
 + 46.81% pagevec_lru_move_fn
-   3.90%  [kernel]  [k] smp_call_function_many
 smp_call_function_many
-   2.94%  [kernel]  [k] arch_local_irq_restore
   - arch_local_irq_restore
  + 93.12% default_send_IPI_mask_allbutself_phys
  + 6.88% default_send_IPI_mask_sequence_phys

Let me know if I can try something here..
/me confused :(


Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-04 Thread Avi Kivity

On 10/04/2012 12:49 PM, Raghavendra K T wrote:
 On 10/03/2012 10:35 PM, Avi Kivity wrote:
 On 10/03/2012 02:22 PM, Raghavendra K T wrote:
 So I think it's worth trying again with ple_window of 2-4.


 Hi Avi,

 I ran different benchmarks with increasing ple_window, and the results do
 not seem encouraging for increasing ple_window.

 Thanks for testing! Comments below.

 Results:
 16 core PLE machine with 16 vcpu guest.

 base kernel = 3.6-rc5 + ple handler optimization patch
 base_pleopt_8k = base kernel + ple window = 8k
 base_pleopt_16k = base kernel + ple window = 16k
 base_pleopt_32k = base kernel + ple window = 32k


 Percentage improvements of benchmarks w.r.t base_pleopt with
 ple_window = 4096

                 base_pleopt_8k  base_pleopt_16k  base_pleopt_32k
 -----------------------------------------------------------------
 kernbench_1x    -5.54915        -15.94529        -44.31562
 kernbench_2x    -7.89399        -17.75039        -37.73498

 So, 44% degradation even with no overcommit?  That's surprising.
 
 Yes. Kernbench was run with #threads = #vcpu * 2 as usual. Is it
 spending 8 times the original ple_window cycles for 16 vcpus
 significant?

A PLE exit when not overcommitted cannot do any good; it is better to
spin in the guest rather than look for candidates on the host.  In fact
when we benchmark we often disable PLE completely.

 

 I also got perf top output to analyse the difference. Difference comes
 because of flushtlb (and also spinlock).

 That's in the guest, yes?
 
 Yes. Perf is in guest.
 


 Ebizzy run for 4k ple_window
 -  87.20%  [kernel]  [k] arch_local_irq_restore
 - arch_local_irq_restore
- 100.00% _raw_spin_unlock_irqrestore
   + 52.89% release_pages
   + 47.10% pagevec_lru_move_fn
 -   5.71%  [kernel]  [k] arch_local_irq_restore
 - arch_local_irq_restore
+ 86.03% default_send_IPI_mask_allbutself_phys
+ 13.96% default_send_IPI_mask_sequence_phys
 -   3.10%  [kernel]  [k] smp_call_function_many
   smp_call_function_many


 Ebizzy run for 32k ple_window

 -  91.40%  [kernel]  [k] arch_local_irq_restore
 - arch_local_irq_restore
- 100.00% _raw_spin_unlock_irqrestore
   + 53.13% release_pages
   + 46.86% pagevec_lru_move_fn
 -   4.38%  [kernel]  [k] smp_call_function_many
   smp_call_function_many
 -   2.51%  [kernel]  [k] arch_local_irq_restore
 - arch_local_irq_restore
+ 90.76% default_send_IPI_mask_allbutself_phys
+ 9.24% default_send_IPI_mask_sequence_phys


 Both the 4k and the 32k results are crazy.  Why is
 arch_local_irq_restore() so prominent?  Do you have a very high
 interrupt rate in the guest?
 
 How do I measure whether I have a high interrupt rate in the guest?
 From the /proc/interrupts numbers I am not able to judge :(

'vmstat 1'

 
 I went back and got the results on a 32 core machine with 32 vcpu guest.
 Strangely, I got result supporting the claim that increasing ple_window
 helps for non-overcommitted scenario.
 
 32 core 32 vcpu guest 1x scenarios.
 
 ple_gap = 0
 kernbench: Elapsed Time 38.61
 ebizzy: 7463 records/s
 
 ple_window = 4k
 kernbench: Elapsed Time 43.5067
 ebizzy: 2528 records/s
 
 ple_window = 32k
 kernbench: Elapsed Time 39.4133
 ebizzy: 7196 records/s

So maybe something was wrong with the first measurement.

 
 
 perf top for ebizzy for above:
 ple_gap = 0
 -  84.74%  [kernel]  [k] arch_local_irq_restore
- arch_local_irq_restore
   - 100.00% _raw_spin_unlock_irqrestore
  + 50.96% release_pages
  + 49.02% pagevec_lru_move_fn
 -   6.57%  [kernel]  [k] arch_local_irq_restore
- arch_local_irq_restore
   + 92.54% default_send_IPI_mask_allbutself_phys
   + 7.46% default_send_IPI_mask_sequence_phys
 -   1.54%  [kernel]  [k] smp_call_function_many
  smp_call_function_many

Again the numbers are ridiculously high for arch_local_irq_restore.
Maybe there's a bad perf/kvm interaction when we're injecting an
interrupt, I can't believe we're spending 84% of the time running the
popf instruction.

 
 ple_window = 32k
 -  84.47%  [kernel]  [k] arch_local_irq_restore
+ arch_local_irq_restore
 -   6.46%  [kernel]  [k] arch_local_irq_restore
- arch_local_irq_restore
   + 93.51% default_send_IPI_mask_allbutself_phys
   + 6.49% default_send_IPI_mask_sequence_phys
 -   1.80%  [kernel]  [k] smp_call_function_many
- smp_call_function_many
   + 99.98% native_flush_tlb_others
 
 
 ple_window = 4k
 -  91.35%  [kernel]  [k] arch_local_irq_restore
- arch_local_irq_restore
   - 100.00% _raw_spin_unlock_irqrestore
  + 53.19% release_pages
  + 46.81% pagevec_lru_move_fn
 -   3.90%  [kernel]  [k] smp_call_function_many
  smp_call_function_many
 -   2.94%  [kernel]  [k] arch_local_irq_restore
- arch_local_irq_restore
   + 93.12% default_send_IPI_mask_allbutself_phys
   + 6.88% default_send_IPI_mask_sequence_phys
 
 Let me know if I can try something here..
 /me confused :(

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-04 Thread Peter Zijlstra

On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
 
 Again the numbers are ridiculously high for arch_local_irq_restore.
 Maybe there's a bad perf/kvm interaction when we're injecting an
 interrupt, I can't believe we're spending 84% of the time running the
 popf instruction. 

Smells like a software fallback that doesn't do NMI, hrtimer based
sampling typically hits popf where we re-enable interrupts.

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-04 Thread Andrew Theurer

On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
 On 10/04/2012 12:49 PM, Raghavendra K T wrote:
  On 10/03/2012 10:35 PM, Avi Kivity wrote:
  On 10/03/2012 02:22 PM, Raghavendra K T wrote:
  So I think it's worth trying again with ple_window of 2-4.
 
 
  Hi Avi,
 
  I ran different benchmarks with increasing ple_window, and the results do
  not seem encouraging for increasing ple_window.
 
  Thanks for testing! Comments below.
 
  Results:
  16 core PLE machine with 16 vcpu guest.
 
  base kernel = 3.6-rc5 + ple handler optimization patch
  base_pleopt_8k = base kernel + ple window = 8k
  base_pleopt_16k = base kernel + ple window = 16k
  base_pleopt_32k = base kernel + ple window = 32k
 
 
  Percentage improvements of benchmarks w.r.t base_pleopt with
  ple_window = 4096
 
                  base_pleopt_8k  base_pleopt_16k  base_pleopt_32k
  -----------------------------------------------------------------
  kernbench_1x    -5.54915        -15.94529        -44.31562
  kernbench_2x    -7.89399        -17.75039        -37.73498
 
  So, 44% degradation even with no overcommit?  That's surprising.
  
  Yes. Kernbench was run with #threads = #vcpu * 2 as usual. Is it
  spending 8 times the original ple_window cycles for 16 vcpus
  significant?
 
 A PLE exit when not overcommitted cannot do any good; it is better to
 spin in the guest rather than look for candidates on the host.  In fact
 when we benchmark we often disable PLE completely.

Agreed.  However, I really do not understand why the kernbench regressed
with bigger ple_window.  It should stay the same or improve.  Raghu, do
you have perf data for the kernbench runs?
 
  
 
  I also got perf top output to analyse the difference. Difference comes
  because of flushtlb (and also spinlock).
 
  That's in the guest, yes?
  
  Yes. Perf is in guest.
  
 
 
  Ebizzy run for 4k ple_window
  -  87.20%  [kernel]  [k] arch_local_irq_restore
  - arch_local_irq_restore
 - 100.00% _raw_spin_unlock_irqrestore
+ 52.89% release_pages
+ 47.10% pagevec_lru_move_fn
  -   5.71%  [kernel]  [k] arch_local_irq_restore
  - arch_local_irq_restore
 + 86.03% default_send_IPI_mask_allbutself_phys
 + 13.96% default_send_IPI_mask_sequence_phys
  -   3.10%  [kernel]  [k] smp_call_function_many
smp_call_function_many
 
 
  Ebizzy run for 32k ple_window
 
  -  91.40%  [kernel]  [k] arch_local_irq_restore
  - arch_local_irq_restore
 - 100.00% _raw_spin_unlock_irqrestore
+ 53.13% release_pages
+ 46.86% pagevec_lru_move_fn
  -   4.38%  [kernel]  [k] smp_call_function_many
smp_call_function_many
  -   2.51%  [kernel]  [k] arch_local_irq_restore
  - arch_local_irq_restore
 + 90.76% default_send_IPI_mask_allbutself_phys
 + 9.24% default_send_IPI_mask_sequence_phys
 
 
  Both the 4k and the 32k results are crazy.  Why is
  arch_local_irq_restore() so prominent?  Do you have a very high
  interrupt rate in the guest?
  
  How do I measure whether I have a high interrupt rate in the guest?
  From the /proc/interrupts numbers I am not able to judge :(
 
 'vmstat 1'
 
  
  I went back and got the results on a 32 core machine with 32 vcpu guest.
  Strangely, I got result supporting the claim that increasing ple_window
  helps for non-overcommitted scenario.
  
  32 core 32 vcpu guest 1x scenarios.
  
  ple_gap = 0
  kernbench: Elapsed Time 38.61
  ebizzy: 7463 records/s
  
  ple_window = 4k
  kernbench: Elapsed Time 43.5067
  ebizzy: 2528 records/s
  
  ple_window = 32k
  kernbench: Elapsed Time 39.4133
  ebizzy: 7196 records/s
 
 So maybe something was wrong with the first measurement.

OK, this is more in line with what I expected for kernbench.  FWIW, in
order to show an improvement for a larger ple_window, we really need a
workload which we know has a longer lock holding time (without factoring
in LHP).  We have noticed this on IO based locks mostly.  We saw it with
a massive disk IO test (qla2xxx lock), and also with a large web serving
test (some vfs related lock, but I forget what exactly it was).
 
  
  
  perf top for ebizzy for above:
  ple_gap = 0
  -  84.74%  [kernel]  [k] arch_local_irq_restore
 - arch_local_irq_restore
- 100.00% _raw_spin_unlock_irqrestore
   + 50.96% release_pages
   + 49.02% pagevec_lru_move_fn
  -   6.57%  [kernel]  [k] arch_local_irq_restore
 - arch_local_irq_restore
+ 92.54% default_send_IPI_mask_allbutself_phys
+ 7.46% default_send_IPI_mask_sequence_phys
  -   1.54%  [kernel]  [k] smp_call_function_many
   smp_call_function_many
 
 Again the numbers are ridiculously high for arch_local_irq_restore.
 Maybe there's a bad perf/kvm interaction when we're injecting an
 interrupt, I can't believe we're spending 84% of the time running the
 popf instruction.

I do have a feeling that ebizzy just has too many variables and LHP is
just one of many problems.  However, I am curious what perf kvm from

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-04 Thread Avi Kivity

On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
 On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
 
 Again the numbers are ridiculously high for arch_local_irq_restore.
 Maybe there's a bad perf/kvm interaction when we're injecting an
 interrupt, I can't believe we're spending 84% of the time running the
 popf instruction. 
 
 Smells like a software fallback that doesn't do NMI, hrtimer based
 sampling typically hits popf where we re-enable interrupts.

Good nose, that's probably it.  Raghavendra, can you ensure that the PMU
is properly exposed?  'dmesg' in the guest will tell.  If it isn't, -cpu
host will expose it (and a good idea anyway to get best performance).

-- 
error compiling committee.c: too many arguments to function

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-03 Thread Raghavendra K T

* Avi Kivity a...@redhat.com [2012-09-24 17:41:19]:

 On 09/21/2012 08:24 PM, Raghavendra K T wrote:
  On 09/21/2012 06:32 PM, Rik van Riel wrote:
  On 09/21/2012 08:00 AM, Raghavendra K T wrote:
  From: Raghavendra K T raghavendra...@linux.vnet.ibm.com
 
  When the total number of VCPUs in the system is less than or equal to the
  number of physical CPUs, PLE exits become costly, since each VCPU can have
  a dedicated PCPU and trying to find a target VCPU to yield_to just burns
  time in the PLE handler.
 
  This patch reduces the overhead by simply returning in such scenarios,
  which it detects by checking the length of the current CPU's runqueue.
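
The check described above amounts to something like the following sketch. It
is a userspace model under stated assumptions: nr_running stands for the
number of runnable tasks on the runqueue of the CPU where the spinning vcpu
runs, and ple_handler_should_return_early() is a hypothetical name, not a
function taken from the patch.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the undercommit check: if the spinning vcpu is the
 * only runnable task on its runqueue, assume there is nothing useful to
 * yield to locally and skip the candidate search.
 */
static bool ple_handler_should_return_early(unsigned int nr_running)
{
    return nr_running <= 1;
}
```

Note that this only looks at the local runqueue; it says nothing about the
runqueue on which the lock holder sits.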
 
  I am not convinced this is the way to go.
 
  The VCPU that is holding the lock, and is not releasing it,
  probably got scheduled out. That implies that VCPU is on a
  runqueue with at least one other task.
  
  I see your point here, we have two cases:
  
  case 1)
  
  rq1 : vcpu1->wait(lockA) (spinning)
  rq2 : vcpu2->holding(lockA) (running)
  
  Here, ideally, vcpu1 should not enter the PLE handler, since it would
  surely get the lock within ple_window cycles (assuming ple_window is
  tuned perfectly for that workload).
  
  Maybe this explains why we are not seeing a benefit with kernbench.
  
  On the other hand, since we cannot have a ple_window tuned perfectly for
  all types of workloads, we gain for those workloads that need more than
  4096 cycles. I wonder whether that is what we are seeing in the cases
  that benefit?
 
 Maybe we need to increase the ple window regardless.  4096 cycles is 2
 microseconds or less (call it t_spin).  The overhead from
 kvm_vcpu_on_spin() and the associated task switches is at least a few
 microseconds, increasing as contention is added (call it t_yield).  The
 time for a natural context switch is several milliseconds (call it
 t_slice).  There is also the time the lock holder owns the lock,
 assuming no contention (t_hold).
 
 If t_yield > t_spin, then in the undercommitted case it dominates
 t_spin.  If t_hold > t_spin we lose badly.
 
 If t_spin > t_yield, then the undercommitted case doesn't suffer as much
 as most of the spinning happens in the guest instead of the host, so it
 can pick up the unlock timely.  We don't lose too much in the
 overcommitted case provided the values aren't too far apart (say a
 factor of 3).
 
 Obviously t_spin must be significantly smaller than t_slice, otherwise
 it accomplishes nothing.
 
 Regarding t_hold: if it is small, then a larger t_spin helps avoid false
 exits.  If it is large, then we're not very sensitive to t_spin.  It
 doesn't matter if it takes us 2 usec or 20 usec to yield, if we end up
 yielding for several milliseconds.
 
 So I think it's worth trying again with ple_window of 2-4.
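
To put rough numbers on the argument above, the sketch below converts a
ple_window given in cycles to time and compares it with a yield cost. The
2 GHz clock and the 5 microsecond t_yield used in the note after the code are
illustrative assumptions, not measurements from this thread, and
cycles_to_ns() and classify_window() are hypothetical helpers.

```c
#include <stdint.h>

/*
 * Convert a spin budget in cycles to nanoseconds for a clock given in MHz.
 * cycles / MHz yields microseconds, so scale by 1000 first to get ns.
 */
static uint64_t cycles_to_ns(uint64_t cycles, uint64_t cpu_mhz)
{
    return cycles * 1000 / cpu_mhz;
}

enum window_verdict {
    YIELD_DOMINATES,  /* t_yield > t_spin: exit overhead outweighs the spin */
    SPIN_DOMINATES    /* t_spin >= t_yield: most waiting happens in the guest */
};

/* Compare the spin window (t_spin) with the cost of a PLE exit (t_yield). */
static enum window_verdict classify_window(uint64_t t_spin_ns,
                                           uint64_t t_yield_ns)
{
    return t_yield_ns > t_spin_ns ? YIELD_DOMINATES : SPIN_DOMINATES;
}
```

Under those assumptions, cycles_to_ns(4096, 2000) is about 2 usec, which is
below the assumed yield cost, while cycles_to_ns(32768, 2000) is about
16 usec and classify_window() reports SPIN_DOMINATES for it.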
 

Hi Avi,

I ran different benchmarks with increasing ple_window, and the results do
not seem encouraging for increasing ple_window.

Results:
16 core PLE machine with 16 vcpu guest. 

base kernel = 3.6-rc5 + ple handler optimization patch 
base_pleopt_8k = base kernel + ple window = 8k
base_pleopt_16k = base kernel + ple window = 16k
base_pleopt_32k = base kernel + ple window = 32k


Percentage improvements of benchmarks w.r.t base_pleopt with ple_window = 4096

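                base_pleopt_8k  base_pleopt_16k  base_pleopt_32k
-----------------------------------------------------------------
kernbench_1x    -5.54915        -15.94529        -44.31562
kernbench_2x    -7.89399        -17.75039        -37.73498
-----------------------------------------------------------------
sysbench_1x      0.45955         -0.98778          0.05252
sysbench_2x      1.44071         -0.81625          1.35620
sysbench_3x      0.45549          1.51795         -0.41573
-----------------------------------------------------------------
hackbench_1x    -3.80272        -13.91456        -40.79059
hackbench_2x    -4.78999         -7.61382         -7.24475
-----------------------------------------------------------------
ebizzy_1x       -2.54626        -16.86050        -38.46109
ebizzy_2x       -8.75526        -19.29116        -48.33314
-----------------------------------------------------------------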


I also got perf top output to analyse the difference. The difference comes
from flushtlb (and also spinlock).

Ebizzy run for 4k ple_window
-  87.20%  [kernel]  [k] arch_local_irq_restore
   - arch_local_irq_restore
  - 100.00% _raw_spin_unlock_irqrestore
 + 52.89% release_pages
 + 47.10% pagevec_lru_move_fn
-   5.71%  [kernel]  [k] arch_local_irq_restore
   - arch_local_irq_restore
  + 86.03% default_send_IPI_mask_allbutself_phys
  + 13.96% default_send_IPI_mask_sequence_phys
-   3.10%  [kernel]  [k] smp_call_function_many
 smp_call_function_many


Ebizzy run for 32k ple_window

-  91.40%  [kernel]  [k] arch_local_irq_restore
   - arch_local_irq_restore
  - 100.00% _raw_spin_unlock_irqrestore
 + 53.13%

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-03 Thread Raghavendra K T

* Avi Kivity a...@redhat.com [2012-09-30 13:13:09]:

 On 09/30/2012 01:07 PM, Gleb Natapov wrote:
  On Sun, Sep 30, 2012 at 10:18:17AM +0200, Avi Kivity wrote:
  On 09/28/2012 08:16 AM, Raghavendra K T wrote:
   
  
   +struct pv_sched_info {
   +   unsigned long   sched_bitmap;
   
   Thinking, whether we need something similar to cpumask here?
   Only thing is we are representing guest (v)cpumask.
   
  
  DECLARE_BITMAP(sched_bitmap, KVM_MAX_VCPUS)
  
  vcpu_id can be greater than KVM_MAX_VCPUS.
 
 Use the index into the vcpu table as the bitmap index then.  In fact
 it's better because then the lookup to get the vcpu pointer is trivial.

Did you mean that, while setting the bitmap, we should do

for (i = 0..n-1)
if (kvm->vcpus[i] == vcpu) set the ith position in the bitmap?

I just wanted to know whether there is any easy way to convert from a
vcpu pointer to its index in the kvm vcpu table.


Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-03 Thread Avi Kivity

On 10/03/2012 04:17 PM, Raghavendra K T wrote:
 * Avi Kivity a...@redhat.com [2012-09-30 13:13:09]:
 
 On 09/30/2012 01:07 PM, Gleb Natapov wrote:
  On Sun, Sep 30, 2012 at 10:18:17AM +0200, Avi Kivity wrote:
  On 09/28/2012 08:16 AM, Raghavendra K T wrote:
   
  
   +struct pv_sched_info {
   +   unsigned long   sched_bitmap;
   
   Thinking, whether we need something similar to cpumask here?
   Only thing is we are representing guest (v)cpumask.
   
  
  DECLARE_BITMAP(sched_bitmap, KVM_MAX_VCPUS)
  
  vcpu_id can be greater than KVM_MAX_VCPUS.
 
 Use the index into the vcpu table as the bitmap index then.  In fact
 it's better because then the lookup to get the vcpu pointer is trivial.
 
 Did you mean that, while setting the bitmap, we should do

 for (i = 0..n-1)
 if (kvm->vcpus[i] == vcpu) set the ith position in the bitmap?

You can store i in the vcpu itself:

  set_bit(vcpu->index, kvm->preempted);
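
A minimal userspace sketch of that idea follows, assuming each vcpu records
its slot in the vcpu table when it is created. The struct layouts, the
MAX_VCPUS value, and the helper names are hypothetical stand-ins, not the
kernel's definitions; in the kernel the bit operations would be set_bit(),
clear_bit() and test_bit() on a DECLARE_BITMAP() field.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_VCPUS     254  /* hypothetical table size */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_WORDS  ((MAX_VCPUS + BITS_PER_WORD - 1) / BITS_PER_WORD)

struct vcpu {
    unsigned int index;    /* slot in vm->vcpus[], always < MAX_VCPUS */
    unsigned int vcpu_id;  /* guest-visible id, may exceed MAX_VCPUS */
};

struct vm {
    struct vcpu *vcpus[MAX_VCPUS];
    unsigned long preempted[BITMAP_WORDS];  /* one bit per table slot */
};

static void mark_preempted(struct vm *vm, const struct vcpu *v)
{
    vm->preempted[v->index / BITS_PER_WORD] |= 1UL << (v->index % BITS_PER_WORD);
}

static void clear_preempted(struct vm *vm, const struct vcpu *v)
{
    vm->preempted[v->index / BITS_PER_WORD] &= ~(1UL << (v->index % BITS_PER_WORD));
}

static bool is_preempted(const struct vm *vm, unsigned int index)
{
    return (vm->preempted[index / BITS_PER_WORD] >> (index % BITS_PER_WORD)) & 1UL;
}

/* Because bit i maps to vcpus[i], going from a set bit to the vcpu is one load. */
static struct vcpu *first_preempted(struct vm *vm)
{
    for (unsigned int i = 0; i < MAX_VCPUS; i++)
        if (is_preempted(vm, i))
            return vm->vcpus[i];
    return NULL;
}
```

Indexing by table slot rather than by vcpu_id keeps every bit inside the
bitmap even when a vcpu_id is larger than the table size.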

 
 I just wanted to know whether there is any easy way to convert from 
 vcpu  pointer to index in kvm vcpu table.
 



-- 
error compiling committee.c: too many arguments to function

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-03 Thread Avi Kivity

On 10/03/2012 02:22 PM, Raghavendra K T wrote:
 So I think it's worth trying again with ple_window of 2-4.
 
 
 Hi Avi,
 
 I ran different benchmarks with increasing ple_window, and the results do
 not seem encouraging for increasing ple_window.

Thanks for testing! Comments below.

 Results:
 16 core PLE machine with 16 vcpu guest. 
 
 base kernel = 3.6-rc5 + ple handler optimization patch 
 base_pleopt_8k = base kernel + ple window = 8k
 base_pleopt_16k = base kernel + ple window = 16k
 base_pleopt_32k = base kernel + ple window = 32k
 
 
 Percentage improvements of benchmarks w.r.t base_pleopt with ple_window = 4096
 
                 base_pleopt_8k  base_pleopt_16k  base_pleopt_32k
 -----------------------------------------------------------------
 kernbench_1x    -5.54915        -15.94529        -44.31562
 kernbench_2x    -7.89399        -17.75039        -37.73498

So, 44% degradation even with no overcommit?  That's surprising.

 I also got perf top output to analyse the difference. Difference comes
 because of flushtlb (and also spinlock).

That's in the guest, yes?

 
 Ebizzy run for 4k ple_window
 -  87.20%  [kernel]  [k] arch_local_irq_restore
- arch_local_irq_restore
   - 100.00% _raw_spin_unlock_irqrestore
  + 52.89% release_pages
  + 47.10% pagevec_lru_move_fn
 -   5.71%  [kernel]  [k] arch_local_irq_restore
- arch_local_irq_restore
   + 86.03% default_send_IPI_mask_allbutself_phys
   + 13.96% default_send_IPI_mask_sequence_phys
 -   3.10%  [kernel]  [k] smp_call_function_many
  smp_call_function_many
 
 
 Ebizzy run for 32k ple_window
 
 -  91.40%  [kernel]  [k] arch_local_irq_restore
- arch_local_irq_restore
   - 100.00% _raw_spin_unlock_irqrestore
  + 53.13% release_pages
  + 46.86% pagevec_lru_move_fn
 -   4.38%  [kernel]  [k] smp_call_function_many
  smp_call_function_many
 -   2.51%  [kernel]  [k] arch_local_irq_restore
- arch_local_irq_restore
   + 90.76% default_send_IPI_mask_allbutself_phys
   + 9.24% default_send_IPI_mask_sequence_phys
 

Both the 4k and the 32k results are crazy.  Why is
arch_local_irq_restore() so prominent?  Do you have a very high
interrupt rate in the guest?




-- 
error compiling committee.c: too many arguments to function

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-30 Thread Avi Kivity

On 09/28/2012 08:18 PM, Konrad Rzeszutek Wilk wrote:
  PLE:
  - works for unmodified / non-Linux guests
  - works for all types of spins (e.g. smp_call_function*())
  - utilizes an existing hardware interface (PAUSE instruction) so likely
  more robust compared to a software interface
 
  PV:
  - has more information, so it can perform better
  
  Should we also consider that we always have an edge here for non-PLE
  machine?
 
 True.  The deployment share for these is decreasing rapidly though.  I
 hate optimizing for obsolete hardware.
 
 Keep in mind that the patchset that Jeremy provided also cleans up
 (removes) parts of the pv spinlock code. It removes the various spin_lock,
 spin_unlock, etc. that touch paravirt code. Instead the pv code is only
 in the slowpath. And if you don't compile with CONFIG_PARAVIRT_SPINLOCK
 the end code is the same as it is now.

We still need to maintain information about the lock holder if pv is
enabled at all, even if it is never used.

 On a different subject: I am curious whether the new Haswell locking
 instructions (the transactional ones?) can be put to use for the slow
 case?

Transactions are blown on any context switch, so no.

-- 
error compiling committee.c: too many arguments to function

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-30 Thread Avi Kivity

On 09/28/2012 08:16 AM, Raghavendra K T wrote:
 

 +struct pv_sched_info {
 +   unsigned long   sched_bitmap;
 
 Thinking, whether we need something similar to cpumask here?
 Only thing is we are representing guest (v)cpumask.
 

DECLARE_BITMAP(sched_bitmap, KVM_MAX_VCPUS)

cpumask is for host masks, this is a guest mask.

-- 
error compiling committee.c: too many arguments to function

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-30 Thread Gleb Natapov

On Sun, Sep 30, 2012 at 10:18:17AM +0200, Avi Kivity wrote:
 On 09/28/2012 08:16 AM, Raghavendra K T wrote:
  
 
  +struct pv_sched_info {
  +   unsigned long   sched_bitmap;
  
  Thinking, whether we need something similar to cpumask here?
  Only thing is we are representing guest (v)cpumask.
  
 
 DECLARE_BITMAP(sched_bitmap, KVM_MAX_VCPUS)
 
vcpu_id can be greater than KVM_MAX_VCPUS.

 cpumask is for host masks, this is a guest mask.
 
 -- 
 error compiling committee.c: too many arguments to function

--
Gleb.

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-30 Thread Avi Kivity

On 09/30/2012 01:07 PM, Gleb Natapov wrote:
 On Sun, Sep 30, 2012 at 10:18:17AM +0200, Avi Kivity wrote:
 On 09/28/2012 08:16 AM, Raghavendra K T wrote:
  
 
  +struct pv_sched_info {
  +   unsigned long   sched_bitmap;
  
  Thinking, whether we need something similar to cpumask here?
  Only thing is we are representing guest (v)cpumask.
  
 
 DECLARE_BITMAP(sched_bitmap, KVM_MAX_VCPUS)
 
 vcpu_id can be greater than KVM_MAX_VCPUS.

Use the index into the vcpu table as the bitmap index then.  In fact
it's better because then the lookup to get the vcpu pointer is trivial.
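
As a rough illustration of the suggestion above, here is a minimal userspace sketch of a bitmap indexed by the vcpu table slot rather than by vcpu_id. Everything named sketch_* and the table size SKETCH_MAX_VCPUS are assumptions made up for this example; they are not from the patch or from the kernel, which would use DECLARE_BITMAP() and the atomic set_bit()/clear_bit() helpers instead.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical table size for this sketch; not the kernel's KVM_MAX_VCPUS. */
#define SKETCH_MAX_VCPUS 64
#define SKETCH_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_BITMAP_WORDS \
	((SKETCH_MAX_VCPUS + SKETCH_BITS_PER_WORD - 1) / SKETCH_BITS_PER_WORD)

struct sketch_vcpu {
	int vcpu_id;	/* id chosen by userspace; may exceed the table size */
};

struct sketch_vm {
	struct sketch_vcpu *vcpus[SKETCH_MAX_VCPUS];		/* the vcpu table */
	unsigned long sched_bitmap[SKETCH_BITMAP_WORDS];	/* one bit per table slot */
};

/* Non-atomic helpers; good enough for a single-threaded illustration. */
static void sketch_set_bit(struct sketch_vm *vm, size_t idx)
{
	vm->sched_bitmap[idx / SKETCH_BITS_PER_WORD] |=
		1UL << (idx % SKETCH_BITS_PER_WORD);
}

static void sketch_clear_bit(struct sketch_vm *vm, size_t idx)
{
	vm->sched_bitmap[idx / SKETCH_BITS_PER_WORD] &=
		~(1UL << (idx % SKETCH_BITS_PER_WORD));
}

static int sketch_test_bit(const struct sketch_vm *vm, size_t idx)
{
	return (vm->sched_bitmap[idx / SKETCH_BITS_PER_WORD] >>
		(idx % SKETCH_BITS_PER_WORD)) & 1UL;
}

/* Return the vcpu in the first slot whose bit is set, or NULL if none is. */
static struct sketch_vcpu *sketch_first_set(const struct sketch_vm *vm)
{
	for (size_t idx = 0; idx < SKETCH_MAX_VCPUS; idx++)
		if (sketch_test_bit(vm, idx))
			return vm->vcpus[idx];	/* slot index doubles as table index */
	return NULL;
}
```

Because the bit position is the table slot, going from a set bit to the vcpu pointer is a single array access, even when vcpu_id itself is larger than the table size.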


-- 
error compiling committee.c: too many arguments to function

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-28 Thread Raghavendra K T


On 09/28/2012 02:37 AM, Jiannan Ouyang wrote:



On Thu, Sep 27, 2012 at 4:50 AM, Avi Kivity a...@redhat.com wrote:

On 09/25/2012 04:43 PM, Jiannan Ouyang wrote:
  I've actually implemented this preempted_bitmap idea.

Interesting, please share the code if you can.

  However, I'm doing this to expose this information to the guest, so the
  guest is able to know if the lock holder is preempted or not before
  spinning. Right now, I'm doing experiment to show that this idea works.
 
  I'm wondering what do you guys think of the relationship between the
  pv_ticketlock approach and PLE handler approach. Are we going to adopt
  PLE instead of the pv ticketlock, and why?

Right now we're searching for the best solution.  The tradeoffs are more
or less:

PLE:
- works for unmodified / non-Linux guests
- works for all types of spins (e.g. smp_call_function*())
- utilizes an existing hardware interface (PAUSE instruction) so likely
more robust compared to a software interface

PV:
- has more information, so it can perform better

Given these tradeoffs, if we can get PLE to work for moderate amounts of
overcommit then I'll prefer it (even if it slightly underperforms PV).
If we are unable to make it work well, then we'll have to add PV.

--
error compiling committee.c: too many arguments to function


FYI. The preempted_bitmap patch.

I deleted some unrelated code from the generated patch file, which seems
to have broken the patch file format... I hope someone can suggest a
fix.
However, it's pretty straightforward; four things: declaration,
initialization, set and clear. I think you guys can figure it out easily!

As Avi suggested, you could check task state TASK_RUNNING in sched_out.

Signed-off-by: Jiannan Ouyang ouy...@cs.pitt.edu

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 8613cbb..4fcb648 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -73,6 +73,16 @@ struct pv_info {
 const char *name;
  };


I suppose we need this in a common place, since s390 should also have this
if we are using this information in vcpu_on_spin().



+struct pv_sched_info {
+   unsigned long   sched_bitmap;


Thinking, whether we need something similar to cpumask here?
Only thing is we are representing guest (v)cpumask.


+} __attribute__((__packed__));
+
  struct pv_init_ops {
 /*
  * Patch may replace one of the defined code sequences with
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index 676b8c7..2242d22 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c

+struct pv_sched_info pv_sched_info = {
+.sched_bitmap = (unsigned long)-1,
+};
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 44ee712..3eb277e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -494,6 +494,11 @@ static struct kvm *kvm_create_vm(unsigned long type)
 mutex_init(&kvm->slots_lock);
 atomic_set(&kvm->users_count, 1);

+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+kvm->pv_sched_info.sched_bitmap = (unsigned long)-1;
+#endif
+
 r = kvm_init_mmu_notifier(kvm);
 if (r)
 goto out_err;
@@ -2697,7 +2702,13 @@ struct kvm_vcpu *preempt_notifier_to_vcpu(struct preempt_notifier *pn)
  static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
  {
 struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

+   set_bit(vcpu->vcpu_id, &vcpu->kvm->pv_sched_info.sched_bitmap);
 kvm_arch_vcpu_load(vcpu, cpu);
  }

@@ -2705,7 +2716,13 @@ static void kvm_sched_out(struct preempt_notifier *pn,
   struct task_struct *next)
  {
 struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

+   clear_bit(vcpu->vcpu_id, &vcpu->kvm->pv_sched_info.sched_bitmap);
 kvm_arch_vcpu_put(vcpu);
  }



Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-28 Thread Konrad Rzeszutek Wilk

  PLE:
  - works for unmodified / non-Linux guests
  - works for all types of spins (e.g. smp_call_function*())
  - utilizes an existing hardware interface (PAUSE instruction) so likely
  more robust compared to a software interface
 
  PV:
  - has more information, so it can perform better
  
  Should we also consider that we always have an edge here for non-PLE
  machine?
 
 True.  The deployment share for these is decreasing rapidly though.  I
 hate optimizing for obsolete hardware.

Keep in mind that the patchset that Jeremy provided also cleans up (removes)
parts of the pv spinlock code. It removes the various spin_lock,
spin_unlock, etc. that touch paravirt code. Instead the pv code is only
in the slowpath. And if you don't compile with CONFIG_PARAVIRT_SPINLOCKS
the end code is the same as it is now.

On a different subject: I am curious whether the new Haswell locking
instructions (the transactional ones?) can be put to use for the slow
case?

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-27 Thread Gleb Natapov

On Tue, Sep 25, 2012 at 10:54:21AM +0200, Avi Kivity wrote:
 On 09/25/2012 10:09 AM, Raghavendra K T wrote:
  On 09/24/2012 09:36 PM, Avi Kivity wrote:
  On 09/24/2012 05:41 PM, Avi Kivity wrote:
 
 
  case 2)
  rq1 : vcpu1->wait(lockA) (spinning)
  rq2 : vcpu3 (running) ,  vcpu2->holding(lockA) [scheduled out]
 
  I agree that checking rq1 length is not proper in this case, and as you
  rightly pointed out, we are in trouble here.
  nr_running()/num_online_cpus() would give more accurate picture here,
  but it seemed costly. May be load balancer save us a bit here in not
  running to such sort of cases. ( I agree load balancer is far too
  complex).
 
  In theory preempt notifier can tell us whether a vcpu is preempted or
  not (except for exits to userspace), so we can keep track of whether
  we're overcommitted in kvm itself.  It also avoids false positives
  from other guests and/or processes being overcommitted while our vm
  is fine.
 
  It also allows us to cheaply skip running vcpus.
 
  Hi Avi,
 
  Could you please elaborate on how preempt notifiers can be used
  here to keep track of overcommit or skip running vcpus?
 
  Are we planning set some flag in sched_out() handler etc?
 
 
 Keep a bitmap kvm->preempted_vcpus.
 
 In sched_out, test whether we're TASK_RUNNING, and if so, set a vcpu
 flag and our bit in kvm->preempted_vcpus.  On sched_in, if the flag is
 set, clear our bit in kvm->preempted_vcpus.  We can also keep a counter
 of preempted vcpus.
 
 We can use the bitmap and the counter to quickly see if spinning is
 worthwhile (if the counter is zero, better to spin).  If not, we can use
 the bitmap to select target vcpus quickly.
 
 The only problem is that in order to keep this accurate we need to keep
 the preempt notifiers active during exits to userspace.  But we can
 prototype this without this change, and add it later if it works.
 
Can the user return notifier be used instead? Set the bit in
kvm->preempted_vcpus on return to userspace.

--
Gleb.

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-27 Thread Avi Kivity

On 09/25/2012 04:21 PM, Takuya Yoshikawa wrote:
 On Tue, 25 Sep 2012 10:12:49 +0200
 Avi Kivity a...@redhat.com wrote:
 
 It will.  The tradeoff is between false-positive costs (undercommit) and
 true positive costs (overcommit).  I think undercommit should perform
 well no matter what.
 
 If we utilize preempt notifiers to track overcommit dynamically, then we
 can vary the spin time dynamically.  Keep it long initially, as we get
 more preempted vcpus make it shorter.
 
 What will happen if we pin each vcpu thread to some core?
 I don't want to see so many vcpu threads moving around without
 being pinned at all.

If you do that you've removed a lot of flexibility from the scheduler,
so overcommit becomes even less likely to work well (a trivial example
is pinning two vcpus from the same vm to the same core -- it's so
obviously bad no one considers doing it).

 In that case, we don't want to make KVM do any work of searching
 a vcpu thread to yield to.

Why not?  If a vcpu thread on another core has been preempted, and is
the lock holder, and we can boost it, then we've fixed our problem.
Even if the spinning thread keeps spinning because it is the only task
eligible to run on its core.
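
The dynamic spin time idea quoted above (keep the window long initially, shorten it as more vcpus are preempted) could be sketched roughly as follows. The halving policy and the DW_WINDOW_MAX / DW_WINDOW_MIN bounds are invented for this example only; nothing in the thread fixes a particular formula.

```c
/* Hypothetical bounds in cycles; not the values of any real module parameter. */
#define DW_WINDOW_MAX 16384u
#define DW_WINDOW_MIN 4096u

/*
 * Pick a spin window from the number of currently preempted vcpus:
 * start at the maximum and halve it once per preempted vcpu, never
 * going below the minimum.
 */
static unsigned int dw_spin_window(unsigned int nr_preempted)
{
	unsigned int window = DW_WINDOW_MAX;

	while (nr_preempted-- > 0 && window > DW_WINDOW_MIN)
		window /= 2;

	return window < DW_WINDOW_MIN ? DW_WINDOW_MIN : window;
}
```

With no preempted vcpus the guest spins for the full window; each additional preempted vcpu makes the PLE exit come sooner, down to the floor.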

-- 
error compiling committee.c: too many arguments to function

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-27 Thread Avi Kivity

On 09/25/2012 04:43 PM, Jiannan Ouyang wrote:
 I've actually implemented this preempted_bitmap idea. 

Interesting, please share the code if you can.

 However, I'm doing this to expose this information to the guest, so the
 guest is able to know if the lock holder is preempted or not before
 spinning. Right now, I'm doing experiment to show that this idea works.
 
 I'm wondering what do you guys think of the relationship between the
 pv_ticketlock approach and PLE handler approach. Are we going to adopt
 PLE instead of the pv ticketlock, and why?

Right now we're searching for the best solution.  The tradeoffs are more
or less:

PLE:
- works for unmodified / non-Linux guests
- works for all types of spins (e.g. smp_call_function*())
- utilizes an existing hardware interface (PAUSE instruction) so likely
more robust compared to a software interface

PV:
- has more information, so it can perform better

Given these tradeoffs, if we can get PLE to work for moderate amounts of
overcommit then I'll prefer it (even if it slightly underperforms PV).
If we are unable to make it work well, then we'll have to add PV.

-- 
error compiling committee.c: too many arguments to function

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-27 Thread Avi Kivity

On 09/27/2012 09:44 AM, Gleb Natapov wrote:
 On Tue, Sep 25, 2012 at 10:54:21AM +0200, Avi Kivity wrote:
 On 09/25/2012 10:09 AM, Raghavendra K T wrote:
  On 09/24/2012 09:36 PM, Avi Kivity wrote:
  On 09/24/2012 05:41 PM, Avi Kivity wrote:
 
 
  case 2)
  rq1 : vcpu1->wait(lockA) (spinning)
  rq2 : vcpu3 (running) ,  vcpu2->holding(lockA) [scheduled out]
 
  I agree that checking rq1 length is not proper in this case, and as you
  rightly pointed out, we are in trouble here.
  nr_running()/num_online_cpus() would give more accurate picture here,
  but it seemed costly. May be load balancer save us a bit here in not
  running to such sort of cases. ( I agree load balancer is far too
  complex).
 
  In theory preempt notifier can tell us whether a vcpu is preempted or
  not (except for exits to userspace), so we can keep track of whether
  we're overcommitted in kvm itself.  It also avoids false positives
  from other guests and/or processes being overcommitted while our vm
  is fine.
 
  It also allows us to cheaply skip running vcpus.
 
  Hi Avi,
 
  Could you please elaborate on how preempt notifiers can be used
  here to keep track of overcommit or skip running vcpus?
 
  Are we planning set some flag in sched_out() handler etc?
 
 
 Keep a bitmap kvm->preempted_vcpus.
 
 In sched_out, test whether we're TASK_RUNNING, and if so, set a vcpu
 flag and our bit in kvm->preempted_vcpus.  On sched_in, if the flag is
 set, clear our bit in kvm->preempted_vcpus.  We can also keep a counter
 of preempted vcpus.
 
 We can use the bitmap and the counter to quickly see if spinning is
 worthwhile (if the counter is zero, better to spin).  If not, we can use
 the bitmap to select target vcpus quickly.
 
 The only problem is that in order to keep this accurate we need to keep
 the preempt notifiers active during exits to userspace.  But we can
 prototype this without this change, and add it later if it works.
 
 Can the user return notifier be used instead? Set the bit in
 kvm->preempted_vcpus on return to userspace.
 

User return notifier is per-cpu, not per-task.  There is a new task_work
(linux/task_work.h) that does what you want.  With these
technicalities out of the way, I think it's the wrong idea.  If a vcpu
thread is in userspace, that doesn't mean it's preempted, there's no
point in boosting it if it's already running.

btw, we can have secondary effects.  A vcpu can be waiting for a lock in
the host kernel, or for a host page fault.  There's no point in boosting
anything for that.  Or a vcpu in userspace can be waiting for a lock
that is held by another thread, which has been preempted.  This is (like
I think Peter already said) a priority inheritance problem.  However
with fine-grained locking in userspace, we can make it go away.  The
guest kernel is unlikely to access one device simultaneously from two
threads (and if it does, we just need to improve the threading in the
device model).

-- 
error compiling committee.c: too many arguments to function

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-27 Thread Gleb Natapov

On Thu, Sep 27, 2012 at 10:59:21AM +0200, Avi Kivity wrote:
 On 09/27/2012 09:44 AM, Gleb Natapov wrote:
  On Tue, Sep 25, 2012 at 10:54:21AM +0200, Avi Kivity wrote:
  On 09/25/2012 10:09 AM, Raghavendra K T wrote:
   On 09/24/2012 09:36 PM, Avi Kivity wrote:
   On 09/24/2012 05:41 PM, Avi Kivity wrote:
  
  
   case 2)
   rq1 : vcpu1->wait(lockA) (spinning)
   rq2 : vcpu3 (running) ,  vcpu2->holding(lockA) [scheduled out]
  
   I agree that checking rq1 length is not proper in this case, and as you
   rightly pointed out, we are in trouble here.
   nr_running()/num_online_cpus() would give more accurate picture here,
   but it seemed costly. May be load balancer save us a bit here in not
   running to such sort of cases. ( I agree load balancer is far too
   complex).
  
   In theory preempt notifier can tell us whether a vcpu is preempted or
   not (except for exits to userspace), so we can keep track of whether
   we're overcommitted in kvm itself.  It also avoids false positives
   from other guests and/or processes being overcommitted while our vm
   is fine.
  
   It also allows us to cheaply skip running vcpus.
  
   Hi Avi,
  
   Could you please elaborate on how preempt notifiers can be used
   here to keep track of overcommit or skip running vcpus?
  
   Are we planning set some flag in sched_out() handler etc?
  
  
  Keep a bitmap kvm->preempted_vcpus.
  
  In sched_out, test whether we're TASK_RUNNING, and if so, set a vcpu
  flag and our bit in kvm->preempted_vcpus.  On sched_in, if the flag is
  set, clear our bit in kvm->preempted_vcpus.  We can also keep a counter
  of preempted vcpus.
  
  We can use the bitmap and the counter to quickly see if spinning is
  worthwhile (if the counter is zero, better to spin).  If not, we can use
  the bitmap to select target vcpus quickly.
  
  The only problem is that in order to keep this accurate we need to keep
  the preempt notifiers active during exits to userspace.  But we can
  prototype this without this change, and add it later if it works.
  
  Can the user return notifier be used instead? Set the bit in
  kvm->preempted_vcpus on return to userspace.
  
 
 User return notifier is per-cpu, not per-task.  There is a new task_work
 (linux/task_work.h) that does what you want.  With these
 technicalities out of the way, I think it's the wrong idea.  If a vcpu
 thread is in userspace, that doesn't mean it's preempted, there's no
 point in boosting it if it's already running.
 
Ah, so you want to set the bit in kvm->preempted_vcpus if the task is _not_
TASK_RUNNING in sched_out (you wrote the opposite in your email)? If a task
is in userspace it is definitely not preempted.
 
 btw, we can have secondary effects.  A vcpu can be waiting for a lock in
 the host kernel, or for a host page fault.  There's no point in boosting
 anything for that.  Or a vcpu in userspace can be waiting for a lock
 that is held by another thread, which has been preempted. 
Do you mean a userspace spinlock? Because otherwise a task that waits on
a kernel lock will sleep in the kernel.

 This is (like
 I think Peter already said) a priority inheritance problem.  However
 with fine-grained locking in userspace, we can make it go away.  The
 guest kernel is unlikely to access one device simultaneously from two
 threads (and if it does, we just need to improve the threading in the
 device model).
 
 -- 
 error compiling committee.c: too many arguments to function

--
Gleb.

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-27 Thread Avi Kivity

On 09/27/2012 11:11 AM, Gleb Natapov wrote:
 
 User return notifier is per-cpu, not per-task.  There is a new task_work
 (linux/task_work.h) that does what you want.  With these
 technicalities out of the way, I think it's the wrong idea.  If a vcpu
 thread is in userspace, that doesn't mean it's preempted, there's no
 point in boosting it if it's already running.
 
 Ah, so you want to set the bit in kvm->preempted_vcpus if the task is _not_
 TASK_RUNNING in sched_out (you wrote the opposite in your email)? If a task
 is in userspace it is definitely not preempted.

No, as I originally wrote.  If it's TASK_RUNNING when it saw sched_out,
then it is preempted (i.e. runnable), not sleeping on some waitqueue,
voluntarily (HLT) or involuntarily (page fault).
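
To make the rule concrete, here is a minimal, single-threaded userspace sketch of the bookkeeping described in this thread: the flag and the bitmap bit are set in sched_out only when the task is still runnable, cleared again in sched_in, and a counter is kept alongside. The pt_* names, the two-value state enum and the fixed 64-bit bitmap are assumptions for illustration; real code would hook the KVM preempt notifiers and use atomic bit operations.

```c
#include <stdbool.h>

/* Simplified stand-in for the task state seen at sched_out time. */
enum pt_task_state { PT_TASK_RUNNING, PT_TASK_SLEEPING };

struct pt_vm {
	unsigned long long preempted_vcpus;	/* bit i set: vcpu slot i is preempted */
	int nr_preempted;			/* number of bits currently set */
};

struct pt_vcpu {
	struct pt_vm *vm;
	unsigned int idx;	/* slot in the vcpu table, assumed < 64 here */
	bool preempted;		/* per-vcpu flag mirroring its bitmap bit */
};

/* Scheduled out: only a still-runnable task counts as preempted. */
static void pt_sched_out(struct pt_vcpu *vcpu, enum pt_task_state state)
{
	if (state != PT_TASK_RUNNING)
		return;		/* sleeping on a waitqueue (HLT, page fault, ...) */

	vcpu->preempted = true;
	vcpu->vm->preempted_vcpus |= 1ULL << vcpu->idx;
	vcpu->vm->nr_preempted++;
}

/* Scheduled back in: undo the bookkeeping if we were marked preempted. */
static void pt_sched_in(struct pt_vcpu *vcpu)
{
	if (!vcpu->preempted)
		return;

	vcpu->preempted = false;
	vcpu->vm->preempted_vcpus &= ~(1ULL << vcpu->idx);
	vcpu->vm->nr_preempted--;
}

/* If no vcpu of this VM is preempted, spinning is the better choice. */
static bool pt_should_keep_spinning(const struct pt_vm *vm)
{
	return vm->nr_preempted == 0;
}
```

A PLE handler built on this would check pt_should_keep_spinning() first and only walk preempted_vcpus for a yield target when it returns false.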

  
 btw, we can have secondary effects.  A vcpu can be waiting for a lock in
 the host kernel, or for a host page fault.  There's no point in boosting
 anything for that.  Or a vcpu in userspace can be waiting for a lock
 that is held by another thread, which has been preempted. 
 Do you mean userspace spinlock? Because otherwise task that's waits on
 a kernel lock will sleep in the kernel.

I meant a kernel mutex.

vcpu 0: take guest spinlock
vcpu 0: vmexit
vcpu 0: spin_lock(some_lock)
vcpu 1: take same guest spinlock
vcpu 1: PLE vmexit
vcpu 1: wtf?

Waiting on a host kernel spinlock is not too bad because we expect to be
out shortly.  Waiting on a host kernel mutex can be a lot worse.

-- 
error compiling committee.c: too many arguments to function

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-27 Thread Gleb Natapov

On Thu, Sep 27, 2012 at 11:33:56AM +0200, Avi Kivity wrote:
 On 09/27/2012 11:11 AM, Gleb Natapov wrote:
  
  User return notifier is per-cpu, not per-task.  There is a new task_work
  (linux/task_work.h) that does what you want.  With these
  technicalities out of the way, I think it's the wrong idea.  If a vcpu
  thread is in userspace, that doesn't mean it's preempted, there's no
  point in boosting it if it's already running.
  
  Ah, so you want to set the bit in kvm->preempted_vcpus if the task is _not_
  TASK_RUNNING in sched_out (you wrote the opposite in your email)? If a task
  is in userspace it is definitely not preempted.
 
 No, as I originally wrote.  If it's TASK_RUNNING when it saw sched_out,
 then it is preempted (i.e. runnable), not sleeping on some waitqueue,
 voluntarily (HLT) or involuntarily (page fault).
 
Of course, I got it all backwards. Need more coffee.

   
  btw, we can have secondary effects.  A vcpu can be waiting for a lock in
  the host kernel, or for a host page fault.  There's no point in boosting
  anything for that.  Or a vcpu in userspace can be waiting for a lock
  that is held by another thread, which has been preempted. 
  Do you mean userspace spinlock? Because otherwise task that's waits on
  a kernel lock will sleep in the kernel.
 
 I meant a kernel mutex.
 
 vcpu 0: take guest spinlock
 vcpu 0: vmexit
 vcpu 0: spin_lock(some_lock)
 vcpu 1: take same guest spinlock
 vcpu 1: PLE vmexit
 vcpu 1: wtf?
 
 Waiting on a host kernel spinlock is not too bad because we expect to be
 out shortly.  Waiting on a host kernel mutex can be a lot worse.
 
We can't do much about it without PV spinlocks, since there is no
information about which vcpu holds which guest spinlock, no?

--
Gleb.

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-27 Thread Avi Kivity

On 09/27/2012 11:58 AM, Gleb Natapov wrote:
 
   
  btw, we can have secondary effects.  A vcpu can be waiting for a lock in
  the host kernel, or for a host page fault.  There's no point in boosting
  anything for that.  Or a vcpu in userspace can be waiting for a lock
  that is held by another thread, which has been preempted. 
  Do you mean userspace spinlock? Because otherwise task that's waits on
  a kernel lock will sleep in the kernel.
 
 I meant a kernel mutex.
 
 vcpu 0: take guest spinlock
 vcpu 0: vmexit
 vcpu 0: spin_lock(some_lock)
 vcpu 1: take same guest spinlock
 vcpu 1: PLE vmexit
 vcpu 1: wtf?
 
 Waiting on a host kernel spinlock is not too bad because we expect to be
 out shortly.  Waiting on a host kernel mutex can be a lot worse.
 
 We can't do much about it without PV spinlock since there is not
 information about what vcpu holds which guest spinlock, no?

It doesn't help.  If the lock holder is waiting for another lock in the
host kernel, boosting it doesn't help even if we know who it is.  We
need to boost the real lock holder, but we have no idea who it is (and
even if we did, we often can't do anything about it).


-- 
error compiling committee.c: too many arguments to function

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-27 Thread Gleb Natapov

On Thu, Sep 27, 2012 at 12:04:58PM +0200, Avi Kivity wrote:
 On 09/27/2012 11:58 AM, Gleb Natapov wrote:
  

   btw, we can have secondary effects.  A vcpu can be waiting for a lock in
   the host kernel, or for a host page fault.  There's no point in boosting
   anything for that.  Or a vcpu in userspace can be waiting for a lock
   that is held by another thread, which has been preempted. 
   Do you mean userspace spinlock? Because otherwise task that's waits on
   a kernel lock will sleep in the kernel.
  
  I meant a kernel mutex.
  
  vcpu 0: take guest spinlock
  vcpu 0: vmexit
  vcpu 0: spin_lock(some_lock)
  vcpu 1: take same guest spinlock
  vcpu 1: PLE vmexit
  vcpu 1: wtf?
  
  Waiting on a host kernel spinlock is not too bad because we expect to be
  out shortly.  Waiting on a host kernel mutex can be a lot worse.
  
  We can't do much about it without PV spinlock since there is not
  information about what vcpu holds which guest spinlock, no?
 
 It doesn't help.  If the lock holder is waiting for another lock in the
 host kernel, boosting it doesn't help even if we know who it is.  We
 need to boost the real lock holder, but we have no idea who it is (and
 even if we did, we often can't do anything about it).
 
Without PV locks we will boost a random preempted vcpu instead of going to
sleep in the situation you described.

--
Gleb.

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-27 Thread Avi Kivity

On 09/27/2012 12:08 PM, Gleb Natapov wrote:
 On Thu, Sep 27, 2012 at 12:04:58PM +0200, Avi Kivity wrote:
 On 09/27/2012 11:58 AM, Gleb Natapov wrote:
  

   btw, we can have secondary effects.  A vcpu can be waiting for a lock in
   the host kernel, or for a host page fault.  There's no point in boosting
   anything for that.  Or a vcpu in userspace can be waiting for a lock
   that is held by another thread, which has been preempted. 
   Do you mean userspace spinlock? Because otherwise task that's waits on
   a kernel lock will sleep in the kernel.
  
  I meant a kernel mutex.
  
  vcpu 0: take guest spinlock
  vcpu 0: vmexit
  vcpu 0: spin_lock(some_lock)
  vcpu 1: take same guest spinlock
  vcpu 1: PLE vmexit
  vcpu 1: wtf?
  
  Waiting on a host kernel spinlock is not too bad because we expect to be
  out shortly.  Waiting on a host kernel mutex can be a lot worse.
  
  We can't do much about it without PV spinlock since there is not
  information about what vcpu holds which guest spinlock, no?
 
 It doesn't help.  If the lock holder is waiting for another lock in the
 host kernel, boosting it doesn't help even if we know who it is.  We
 need to boost the real lock holder, but we have no idea who it is (and
 even if we did, we often can't do anything about it).
 
 Without PV lock we will boost random preempted vcpu instead of going to
 sleep in the situation you described.

True.  In theory boosting a random vcpu shouldn't have any negative
effects though.  Right now the problem is that the boosting itself is
expensive.


-- 
error compiling committee.c: too many arguments to function

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-27 Thread Raghavendra K T


On 09/27/2012 02:20 PM, Avi Kivity wrote:

On 09/25/2012 04:43 PM, Jiannan Ouyang wrote:

I've actually implemented this preempted_bitmap idea.


Interesting, please share the code if you can.


However, I'm doing this to expose this information to the guest, so the
guest is able to know if the lock holder is preempted or not before
spinning. Right now, I'm doing experiment to show that this idea works.

I'm wondering what do you guys think of the relationship between the
pv_ticketlock approach and PLE handler approach. Are we going to adopt
PLE instead of the pv ticketlock, and why?


Right now we're searching for the best solution.  The tradeoffs are more
or less:

PLE:
- works for unmodified / non-Linux guests
- works for all types of spins (e.g. smp_call_function*())
- utilizes an existing hardware interface (PAUSE instruction) so likely
more robust compared to a software interface

PV:
- has more information, so it can perform better


Should we also consider that we always have an edge here for non-PLE
machine?



Given these tradeoffs, if we can get PLE to work for moderate amounts of
overcommit then I'll prefer it (even if it slightly underperforms PV).
If we are unable to make it work well, then we'll have to add PV.


Avi,
Thanks for this summary. It is of great help in proceeding in the right
direction.


Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-27 Thread Avi Kivity

On 09/27/2012 01:26 PM, Raghavendra K T wrote:
 On 09/27/2012 02:20 PM, Avi Kivity wrote:
 On 09/25/2012 04:43 PM, Jiannan Ouyang wrote:
 I've actually implemented this preempted_bitmap idea.

 Interesting, please share the code if you can.

 However, I'm doing this to expose this information to the guest, so the
 guest is able to know if the lock holder is preempted or not before
 spinning. Right now, I'm doing experiment to show that this idea works.

 I'm wondering what do you guys think of the relationship between the
 pv_ticketlock approach and PLE handler approach. Are we going to adopt
 PLE instead of the pv ticketlock, and why?

 Right now we're searching for the best solution.  The tradeoffs are more
 or less:

 PLE:
 - works for unmodified / non-Linux guests
 - works for all types of spins (e.g. smp_call_function*())
 - utilizes an existing hardware interface (PAUSE instruction) so likely
 more robust compared to a software interface

 PV:
 - has more information, so it can perform better
 
 Should we also consider that we always have an edge here for non-PLE
 machine?

True.  The deployment share for these is decreasing rapidly though.  I
hate optimizing for obsolete hardware.


-- 
error compiling committee.c: too many arguments to function

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-25 Thread Raghavendra K T


On 09/24/2012 09:11 PM, Avi Kivity wrote:

On 09/21/2012 08:24 PM, Raghavendra K T wrote:

On 09/21/2012 06:32 PM, Rik van Riel wrote:

On 09/21/2012 08:00 AM, Raghavendra K T wrote:

From: Raghavendra K T raghavendra...@linux.vnet.ibm.com

When the total number of VCPUs in the system is less than or equal to the
number of physical CPUs, PLE exits become costly since each VCPU can have a
dedicated PCPU, and trying to find a target VCPU to yield_to just burns time
in the PLE handler.

This patch reduces overhead by simply returning in such scenarios, after
checking the length of the current cpu runqueue.


I am not convinced this is the way to go.

The VCPU that is holding the lock, and is not releasing it,
probably got scheduled out. That implies that VCPU is on a
runqueue with at least one other task.


I see your point here, we have two cases:

case 1)

rq1 : vcpu1->wait(lockA) (spinning)
rq2 : vcpu2->holding(lockA) (running)

Here, ideally vcpu1 should not enter the PLE handler, since it would surely
get the lock within ple_window cycles (assuming ple_window is tuned
perfectly for that workload).

Maybe this explains why we are not seeing a benefit with kernbench.

On the other hand, since we cannot have a ple_window perfectly tuned for
all types of workloads, we gain for those workloads which may need more
than 4096 cycles. I wonder whether that is what we are seeing in the cases
that benefited?


Maybe we need to increase the ple window regardless.  4096 cycles is 2
microseconds or less (call it t_spin).  The overhead from
kvm_vcpu_on_spin() and the associated task switches is at least a few
microseconds, increasing as contention is added (call it t_yield).  The
time for a natural context switch is several milliseconds (call it
t_slice).  There is also the time the lock holder owns the lock,
assuming no contention (t_hold).

If t_yield > t_spin, then in the undercommitted case it dominates
t_spin.  If t_hold > t_spin we lose badly.

If t_spin > t_yield, then the undercommitted case doesn't suffer as much
as most of the spinning happens in the guest instead of the host, so it
can pick up the unlock timely.  We don't lose too much in the
overcommitted case provided the values aren't too far apart (say a
factor of 3).

Obviously t_spin must be significantly smaller than t_slice, otherwise
it accomplishes nothing.

Regarding t_hold: if it is small, then a larger t_spin helps avoid false
exits.  If it is large, then we're not very sensitive to t_spin.  It
doesn't matter if it takes us 2 usec or 20 usec to yield, if we end up
yielding for several milliseconds.

So I think it's worth trying again with ple_window of 2-4.



Agreed that spinning is not costly, and I have tried increasing
ple_window earlier. I'll give it one more shot.

I was thinking that unnecessary spinning of vcpus (spinning when the lock
holder is preempted) adds up to significant degradation, and is more
problematic in the ticketlock scenario. No?


Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-25 Thread Raghavendra K T


On 09/24/2012 09:36 PM, Avi Kivity wrote:

On 09/24/2012 05:41 PM, Avi Kivity wrote:




case 2)
rq1 : vcpu1-wait(lockA) (spinning)
rq2 : vcpu3 (running) ,  vcpu2-holding(lockA) [scheduled out]

I agree that checking the rq1 length is not proper in this case, and as you
rightly pointed out, we are in trouble here.
nr_running()/num_online_cpus() would give a more accurate picture here,
but it seemed costly. Maybe the load balancer saves us a bit here by not
running into such cases. (I agree the load balancer is far too
complex.)


In theory preempt notifier can tell us whether a vcpu is preempted or
not (except for exits to userspace), so we can keep track of whether
we're overcommitted in kvm itself.  It also avoids false positives
from other guests and/or processes being overcommitted while our vm is fine.


It also allows us to cheaply skip running vcpus.


Hi Avi,

Could you please elaborate on how preempt notifiers can be used
here to keep track of overcommit or skip running vcpus?

Are we planning to set some flag in the sched_out() handler, etc.?


Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-25 Thread Avi Kivity

On 09/25/2012 09:36 AM, Raghavendra K T wrote:
 On 09/24/2012 09:11 PM, Avi Kivity wrote:
 On 09/21/2012 08:24 PM, Raghavendra K T wrote:
 On 09/21/2012 06:32 PM, Rik van Riel wrote:
 On 09/21/2012 08:00 AM, Raghavendra K T wrote:
 From: Raghavendra K T raghavendra...@linux.vnet.ibm.com

 When total number of VCPUs of system is less than or equal to
 physical
 CPUs,
 PLE exits become costly since each VCPU can have dedicated PCPU, and
 trying to find a target VCPU to yield_to just burns time in PLE
 handler.

 This patch reduces overhead, by simply doing a return in such
 scenarios by
 checking the length of current cpu runqueue.

 I am not convinced this is the way to go.

 The VCPU that is holding the lock, and is not releasing it,
 probably got scheduled out. That implies that VCPU is on a
 runqueue with at least one other task.

 I see your point here, we have two cases:

 case 1)

 rq1 : vcpu1-wait(lockA) (spinning)
 rq2 : vcpu2-holding(lockA) (running)

 Here, ideally vcpu1 should not enter the PLE handler, since it would surely
 get the lock within ple_window cycles (assuming ple_window is tuned
 perfectly for that workload).

 Maybe this explains why we are not seeing a benefit with kernbench.

 On the other hand, since we cannot have a ple_window tuned perfectly for
 all types of workloads, we gain for those workloads that may need more
 than 4096 cycles. Is that what we are seeing in the cases that benefit?

 Maybe we need to increase the ple window regardless.  4096 cycles is 2
 microseconds or less (call it t_spin).  The overhead from
 kvm_vcpu_on_spin() and the associated task switches is at least a few
 microseconds, increasing as contention is added (call it t_yield).  The
 time for a natural context switch is several milliseconds (call it
 t_slice).  There is also the time the lock holder owns the lock,
 assuming no contention (t_hold).

 If t_yield > t_spin, then in the undercommitted case it dominates
 t_spin.  If t_hold > t_spin we lose badly.

 If t_spin > t_yield, then the undercommitted case doesn't suffer as much
 as most of the spinning happens in the guest instead of the host, so it
 can pick up the unlock timely.  We don't lose too much in the
 overcommitted case provided the values aren't too far apart (say a
 factor of 3).

 Obviously t_spin must be significantly smaller than t_slice, otherwise
 it accomplishes nothing.

 Regarding t_hold: if it is small, then a larger t_spin helps avoid false
 exits.  If it is large, then we're not very sensitive to t_spin.  It
 doesn't matter if it takes us 2 usec or 20 usec to yield, if we end up
 yielding for several milliseconds.

 So I think it's worth trying again with ple_window of 2-4.


 Agreed that spinning is not costly, and I have tried increasing
 ple_window earlier. I'll give it one more shot.

 I was thinking that unnecessary spinning of vcpus (spinning when the lock
 holder is preempted) adds up to significant degradation, and is more
 problematic in the ticketlock scenario. No?


It will.  The tradeoff is between false-positive costs (undercommit) and
true positive costs (overcommit).  I think undercommit should perform
well no matter what.

If we utilize preempt notifiers to track overcommit dynamically, then we
can vary the spin time dynamically.  Keep it long initially, as we get
more preempted vcpus make it shorter.
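
One way the "long initially, shorter as more vcpus are preempted" idea
could be sketched is below. This is only an illustration under stated
assumptions: the function name, the halving policy, and the bounds are
all hypothetical and are not taken from any posted patch.

```c
/*
 * Hypothetical policy sketch: start from a long spin window and halve it
 * once per preempted vcpu, never dropping below a floor.
 *
 * max_window, min_window: illustrative bounds, in the same unit as ple_window.
 * nr_preempted: number of vcpus currently known to be preempted.
 */
static unsigned int model_ple_window(unsigned int max_window,
				     unsigned int min_window,
				     unsigned int nr_preempted)
{
	unsigned int window = max_window;

	/* Stop shrinking as soon as another halving would go below the floor. */
	while (nr_preempted-- > 0 && window / 2 >= min_window)
		window /= 2;

	return window;
}
```

With illustrative bounds of 16384 and 4096, no preempted vcpus keeps the
window at 16384, one preempted vcpu gives 8192, and two or more settle at
the 4096 floor. Halving is just one possible curve; the right shape would
have to come from measurements.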

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-25 Thread Avi Kivity

On 09/25/2012 10:09 AM, Raghavendra K T wrote:
 On 09/24/2012 09:36 PM, Avi Kivity wrote:
 On 09/24/2012 05:41 PM, Avi Kivity wrote:


 case 2)
 rq1 : vcpu1-wait(lockA) (spinning)
 rq2 : vcpu3 (running) ,  vcpu2-holding(lockA) [scheduled out]

 I agree that checking the rq1 length is not proper in this case, and as you
 rightly pointed out, we are in trouble here.
 nr_running()/num_online_cpus() would give a more accurate picture here,
 but it seemed costly. Maybe the load balancer saves us a bit here by not
 running into such cases. (I agree the load balancer is far too
 complex.)

 In theory preempt notifier can tell us whether a vcpu is preempted or
 not (except for exits to userspace), so we can keep track of whether
 we're overcommitted in kvm itself.  It also avoids false positives
 from other guests and/or processes being overcommitted while our vm
 is fine.

 It also allows us to cheaply skip running vcpus.

 Hi Avi,

 Could you please elaborate on how preempt notifiers can be used
 here to keep track of overcommit or skip running vcpus?

 Are we planning to set some flag in the sched_out() handler, etc.?


Keep a bitmap kvm->preempted_vcpus.

In sched_out, test whether we're TASK_RUNNING, and if so, set a vcpu
flag and our bit in kvm->preempted_vcpus.  On sched_in, if the flag is
set, clear our bit in kvm->preempted_vcpus.  We can also keep a counter
of preempted vcpus.

We can use the bitmap and the counter to quickly see if spinning is
worthwhile (if the counter is zero, better to spin).  If not, we can use
the bitmap to select target vcpus quickly.

The only problem is that in order to keep this accurate we need to keep
the preempt notifiers active during exits to userspace.  But we can
prototype this without this change, and add it later if it works.
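
A minimal userspace model of the bookkeeping described above, offered as
a sketch rather than kernel code. Assumptions: a single-threaded model,
at most 64 vcpus, and hypothetical names throughout (struct vm_model and
the model_* helpers do not exist in KVM). A real implementation would
hang off the preempt notifier callbacks and use atomic bitops.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAX_VCPUS 64

/* Hypothetical per-VM state: one bit per preempted vcpu, plus a count. */
struct vm_model {
	uint64_t preempted_vcpus;   /* bit i set: vcpu i was preempted while runnable */
	unsigned int nr_preempted;  /* number of bits set in preempted_vcpus */
};

/* sched_out: only a vcpu that was still runnable counts as preempted. */
static void model_sched_out(struct vm_model *vm, unsigned int id, bool runnable)
{
	uint64_t bit = UINT64_C(1) << id;

	if (runnable && !(vm->preempted_vcpus & bit)) {
		vm->preempted_vcpus |= bit;
		vm->nr_preempted++;
	}
}

/* sched_in: clear our bit if it was set on the way out. */
static void model_sched_in(struct vm_model *vm, unsigned int id)
{
	uint64_t bit = UINT64_C(1) << id;

	if (vm->preempted_vcpus & bit) {
		vm->preempted_vcpus &= ~bit;
		vm->nr_preempted--;
	}
}

/* Counter is zero: no vcpu is preempted, so spinning is the better choice. */
static bool model_should_spin(const struct vm_model *vm)
{
	return vm->nr_preempted == 0;
}

/* Lowest-numbered preempted vcpu other than the spinner, or -1 if none. */
static int model_pick_target(const struct vm_model *vm, unsigned int me)
{
	uint64_t candidates = vm->preempted_vcpus & ~(UINT64_C(1) << me);
	int i;

	for (i = 0; i < MODEL_MAX_VCPUS; i++)
		if (candidates & (UINT64_C(1) << i))
			return i;
	return -1;
}
```

In this model the PLE handler would first call model_should_spin() and
return early when it is true, and only otherwise walk the bitmap via
model_pick_target() to find a yield_to() candidate.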

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.


Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-25 Thread Raghavendra K T


On 09/25/2012 02:24 PM, Avi Kivity wrote:

On 09/25/2012 10:09 AM, Raghavendra K T wrote:

On 09/24/2012 09:36 PM, Avi Kivity wrote:

On 09/24/2012 05:41 PM, Avi Kivity wrote:




case 2)
rq1 : vcpu1-wait(lockA) (spinning)
rq2 : vcpu3 (running) ,  vcpu2-holding(lockA) [scheduled out]

I agree that checking the rq1 length is not proper in this case, and as you
rightly pointed out, we are in trouble here.
nr_running()/num_online_cpus() would give a more accurate picture here,
but it seemed costly. Maybe the load balancer saves us a bit here by not
running into such cases. (I agree the load balancer is far too
complex.)


In theory preempt notifier can tell us whether a vcpu is preempted or
not (except for exits to userspace), so we can keep track of whether
we're overcommitted in kvm itself.  It also avoids false positives
from other guests and/or processes being overcommitted while our vm
is fine.


It also allows us to cheaply skip running vcpus.


Hi Avi,

Could you please elaborate on how preempt notifiers can be used
here to keep track of overcommit or skip running vcpus?

Are we planning to set some flag in the sched_out() handler, etc.?



Keep a bitmap kvm->preempted_vcpus.

In sched_out, test whether we're TASK_RUNNING, and if so, set a vcpu
flag and our bit in kvm->preempted_vcpus.  On sched_in, if the flag is
set, clear our bit in kvm->preempted_vcpus.  We can also keep a counter
of preempted vcpus.

We can use the bitmap and the counter to quickly see if spinning is
worthwhile (if the counter is zero, better to spin).  If not, we can use
the bitmap to select target vcpus quickly.

The only problem is that in order to keep this accurate we need to keep
the preempt notifiers active during exits to userspace.  But we can
prototype this without this change, and add it later if it works.



Avi, thanks for the idea. I want to try this some time soon.

So ideally it means that if we are undercommitted, the effective value of
the counter/bitmap is zero.


Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-25 Thread Takuya Yoshikawa

On Tue, 25 Sep 2012 10:12:49 +0200
Avi Kivity a...@redhat.com wrote:

 It will.  The tradeoff is between false-positive costs (undercommit) and
 true positive costs (overcommit).  I think undercommit should perform
 well no matter what.
 
 If we utilize preempt notifiers to track overcommit dynamically, then we
 can vary the spin time dynamically.  Keep it long initially, as we get
 more preempted vcpus make it shorter.

What will happen if we pin each vcpu thread to some core?
I don't want to see so many vcpu threads moving around without
being pinned at all.

In that case, we don't want to make KVM do any work of searching for
a vcpu thread to yield to.

Thanks,
Takuya

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-24 Thread Peter Zijlstra

On Fri, 2012-09-21 at 17:30 +0530, Raghavendra K T wrote:
 +unsigned long rq_nr_running(void)
 +{
 +   return this_rq()->nr_running;
 +}
 +EXPORT_SYMBOL(rq_nr_running); 

Uhm,.. no, that's a horrible thing to export.

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-24 Thread Raghavendra K T


On 09/24/2012 05:03 PM, Peter Zijlstra wrote:

On Fri, 2012-09-21 at 17:30 +0530, Raghavendra K T wrote:

+unsigned long rq_nr_running(void)
+{
+   return this_rq()->nr_running;
+}
+EXPORT_SYMBOL(rq_nr_running);


Uhm,.. no, that's a horrible thing to export.



True, I had the same fear :).  Other options I thought of were something
like nr_running()/num_online_cpus(), this_cpu_load(), etc.

Could you please let me know if there is some good metric we can rely on to
say whether the system is overcommitted or not?



Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-24 Thread Avi Kivity

On 09/21/2012 08:24 PM, Raghavendra K T wrote:
 On 09/21/2012 06:32 PM, Rik van Riel wrote:
 On 09/21/2012 08:00 AM, Raghavendra K T wrote:
 From: Raghavendra K T raghavendra...@linux.vnet.ibm.com

 When total number of VCPUs of system is less than or equal to physical
 CPUs,
 PLE exits become costly since each VCPU can have dedicated PCPU, and
 trying to find a target VCPU to yield_to just burns time in PLE handler.

 This patch reduces overhead, by simply doing a return in such
 scenarios by
 checking the length of current cpu runqueue.

 I am not convinced this is the way to go.

 The VCPU that is holding the lock, and is not releasing it,
 probably got scheduled out. That implies that VCPU is on a
 runqueue with at least one other task.
 
 I see your point here, we have two cases:
 
 case 1)
 
 rq1 : vcpu1-wait(lockA) (spinning)
 rq2 : vcpu2-holding(lockA) (running)
 
 Here, ideally vcpu1 should not enter the PLE handler, since it would surely
 get the lock within ple_window cycles (assuming ple_window is tuned
 perfectly for that workload).

 Maybe this explains why we are not seeing a benefit with kernbench.

 On the other hand, since we cannot have a ple_window tuned perfectly for
 all types of workloads, we gain for those workloads that may need more
 than 4096 cycles. Is that what we are seeing in the cases that benefit?

Maybe we need to increase the ple window regardless.  4096 cycles is 2
microseconds or less (call it t_spin).  The overhead from
kvm_vcpu_on_spin() and the associated task switches is at least a few
microseconds, increasing as contention is added (call it t_yield).  The
time for a natural context switch is several milliseconds (call it
t_slice).  There is also the time the lock holder owns the lock,
assuming no contention (t_hold).

If t_yield > t_spin, then in the undercommitted case it dominates
t_spin.  If t_hold > t_spin we lose badly.

If t_spin > t_yield, then the undercommitted case doesn't suffer as much
as most of the spinning happens in the guest instead of the host, so it
can pick up the unlock timely.  We don't lose too much in the
overcommitted case provided the values aren't too far apart (say a
factor of 3).

Obviously t_spin must be significantly smaller than t_slice, otherwise
it accomplishes nothing.

Regarding t_hold: if it is small, then a larger t_spin helps avoid false
exits.  If it is large, then we're not very sensitive to t_spin.  It
doesn't matter if it takes us 2 usec or 20 usec to yield, if we end up
yielding for several milliseconds.

So I think it's worth trying again with ple_window of 2-4.
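
For scale, the arithmetic behind the "2 microseconds or less" figure can
be sketched as follows. The clock rates and the t_yield value used in the
note below are assumptions for illustration, not measurements from this
thread, and the helper names are hypothetical.

```c
#include <stdint.h>

/*
 * Convert a spin window given in cycles to nanoseconds.
 * cycles_per_us is an assumed clock rate (e.g. 2000 for a 2 GHz clock).
 */
static uint64_t ple_window_ns(uint64_t cycles, uint64_t cycles_per_us)
{
	return cycles * 1000 / cycles_per_us;
}

/*
 * Returns 1 when t_spin > t_yield, i.e. the guest spins for longer than
 * one yield attempt costs, which is the case where undercommit suffers less.
 */
static int spin_exceeds_yield(uint64_t t_spin_ns, uint64_t t_yield_ns)
{
	return t_spin_ns > t_yield_ns;
}
```

At an assumed 2 GHz, 4096 cycles comes to 2048 ns, and anything faster
makes it shorter. If t_yield were, say, 5000 ns (an illustrative value
for "a few microseconds"), a 4096-cycle window would leave
t_spin < t_yield, while a 16384-cycle window (8192 ns) would put
t_spin above it.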

 
 case 2)
 rq1 : vcpu1-wait(lockA) (spinning)
 rq2 : vcpu3 (running) ,  vcpu2-holding(lockA) [scheduled out]
 
 I agree that checking the rq1 length is not proper in this case, and as you
 rightly pointed out, we are in trouble here.
 nr_running()/num_online_cpus() would give a more accurate picture here,
 but it seemed costly. Maybe the load balancer saves us a bit here by not
 running into such cases. (I agree the load balancer is far too
 complex.)

In theory preempt notifier can tell us whether a vcpu is preempted or
not (except for exits to userspace), so we can keep track of whether
we're overcommitted in kvm itself.  It also avoids false positives
from other guests and/or processes being overcommitted while our vm is fine.


-- 
error compiling committee.c: too many arguments to function

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-24 Thread Avi Kivity

On 09/24/2012 05:41 PM, Avi Kivity wrote:
 
 
 case 2)
 rq1 : vcpu1-wait(lockA) (spinning)
 rq2 : vcpu3 (running) ,  vcpu2-holding(lockA) [scheduled out]
 
 I agree that checking the rq1 length is not proper in this case, and as you
 rightly pointed out, we are in trouble here.
 nr_running()/num_online_cpus() would give a more accurate picture here,
 but it seemed costly. Maybe the load balancer saves us a bit here by not
 running into such cases. (I agree the load balancer is far too
 complex.)
 
 In theory preempt notifier can tell us whether a vcpu is preempted or
 not (except for exits to userspace), so we can keep track of whether
 we're overcommitted in kvm itself.  It also avoids false positives
 from other guests and/or processes being overcommitted while our vm is fine.

It also allows us to cheaply skip running vcpus.

We would probably need a ->sched_exit() preempt notifier to make this
work.  Peter, I know how much you love those, would it be acceptable?
We'd still need yield_to() but the pressure on it might be reduced.

-- 
error compiling committee.c: too many arguments to function

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-24 Thread Peter Zijlstra

On Mon, 2012-09-24 at 18:06 +0200, Avi Kivity wrote:
 
 We would probably need a ->sched_exit() preempt notifier to make this
 work.  Peter, I know how much you love those, would it be acceptable? 

Where exactly do you want this? TASK_DEAD? or another exit?

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-24 Thread Avi Kivity

On 09/24/2012 06:14 PM, Peter Zijlstra wrote:
 On Mon, 2012-09-24 at 18:06 +0200, Avi Kivity wrote:
 
 We would probably need a ->sched_exit() preempt notifier to make this
 work.  Peter, I know how much you love those, would it be acceptable? 
 
 Where exactly do you want this? TASK_DEAD? or another exit?

TASK_DEAD of the task that registered the preempt notifier.

The idea is that I want to hold on to the notifier even when exiting to
userspace.  Since userspace is under no obligation to call kvm again, I
need a chance to unregister the notifier and otherwise clean up.

Eh, looking at the code, we'll have a ->sched_out() after the state is
set to TASK_DEAD.  So all we need to do is examine the state.  We'll
need to examine the state anyway to see if we were preempted or blocking.
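
The state check in the sched_out callback could be modelled roughly like
this. It is a standalone sketch: the enums and the function are
hypothetical stand-ins for the real task states and notifier callback,
and only the three-way decision is the point.

```c
/* Hypothetical stand-ins for the task state seen at sched_out time. */
enum model_task_state {
	MODEL_RUNNING,   /* still runnable: the task was preempted */
	MODEL_SLEEPING,  /* blocked voluntarily */
	MODEL_DEAD       /* task is exiting */
};

/* What the notifier would do for each state. */
enum model_sched_out_action {
	MODEL_MARK_PREEMPTED,  /* record the vcpu as preempted */
	MODEL_IGNORE_BLOCKED,  /* nothing to track for a blocked vcpu */
	MODEL_UNREGISTER       /* drop the notifier and clean up */
};

static enum model_sched_out_action
model_on_sched_out(enum model_task_state state)
{
	switch (state) {
	case MODEL_RUNNING:
		return MODEL_MARK_PREEMPTED;
	case MODEL_DEAD:
		return MODEL_UNREGISTER;
	default:
		return MODEL_IGNORE_BLOCKED;
	}
}
```

The same callback thus covers both needs: telling preemption apart from
blocking, and noticing task exit without a separate hook.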

-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-21 Thread Rik van Riel


On 09/21/2012 08:00 AM, Raghavendra K T wrote:

From: Raghavendra K T raghavendra...@linux.vnet.ibm.com

When total number of VCPUs of system is less than or equal to physical CPUs,
PLE exits become costly since each VCPU can have dedicated PCPU, and
trying to find a target VCPU to yield_to just burns time in PLE handler.

This patch reduces overhead, by simply doing a return in such scenarios by
checking the length of current cpu runqueue.


I am not convinced this is the way to go.

The VCPU that is holding the lock, and is not releasing it,
probably got scheduled out. That implies that VCPU is on a
runqueue with at least one other task.


--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1629,6 +1629,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	int pass;
 	int i;

+	if (unlikely(rq_nr_running() == 1))
+		return;
+
 	kvm_vcpu_set_in_spin_loop(me, true);
 	/*
 	 * We boost the priority of a VCPU that is runnable but not




--
All rights reversed

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-09-21 Thread Raghavendra K T


On 09/21/2012 06:32 PM, Rik van Riel wrote:

On 09/21/2012 08:00 AM, Raghavendra K T wrote:

From: Raghavendra K T raghavendra...@linux.vnet.ibm.com

When total number of VCPUs of system is less than or equal to physical
CPUs,
PLE exits become costly since each VCPU can have dedicated PCPU, and
trying to find a target VCPU to yield_to just burns time in PLE handler.

This patch reduces overhead, by simply doing a return in such
scenarios by
checking the length of current cpu runqueue.


I am not convinced this is the way to go.

The VCPU that is holding the lock, and is not releasing it,
probably got scheduled out. That implies that VCPU is on a
runqueue with at least one other task.


I see your point here, we have two cases:

case 1)

rq1 : vcpu1-wait(lockA) (spinning)
rq2 : vcpu2-holding(lockA) (running)

Here, ideally vcpu1 should not enter the PLE handler, since it would surely
get the lock within ple_window cycles (assuming ple_window is tuned
perfectly for that workload).

Maybe this explains why we are not seeing a benefit with kernbench.

On the other hand, since we cannot have a ple_window tuned perfectly for
all types of workloads, we gain for those workloads that may need more
than 4096 cycles. Is that what we are seeing in the cases that benefit?

case 2)
rq1 : vcpu1-wait(lockA) (spinning)
rq2 : vcpu3 (running) ,  vcpu2-holding(lockA) [scheduled out]

I agree that checking the rq1 length is not proper in this case, and as you
rightly pointed out, we are in trouble here.
nr_running()/num_online_cpus() would give a more accurate picture here,
but it seemed costly. Maybe the load balancer saves us a bit here by not
running into such cases. (I agree the load balancer is far too
complex.)
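
To make the difference between the two cases concrete, here is a small
model of the two checks. It is a sketch only: the function names are
hypothetical, the "local" check mirrors the rq_nr_running() == 1 test
from the RFC patch, and the "global" check stands in for the
nr_running()/num_online_cpus() ratio mentioned above.

```c
#include <stdbool.h>

/* Check from the RFC patch: only the spinner's own runqueue is examined. */
static bool model_local_says_undercommitted(unsigned int local_nr_running)
{
	return local_nr_running == 1;
}

/* Alternative: compare all runnable tasks against the online cpus. */
static bool model_global_says_undercommitted(unsigned int total_nr_running,
					     unsigned int online_cpus)
{
	return total_nr_running <= online_cpus;
}
```

In case 1 (rq1 has one task, rq2 has one task, two cpus) both checks
report undercommit. In case 2 (rq1 has one task, rq2 has two) the local
check still reports undercommit from vcpu1's point of view even though
the lock holder is scheduled out, while the global check, with three
runnable tasks on two cpus, does not.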

