Re: [PATCH 0/3] Unmapped page cache control (v5)

2011-03-31 Thread Dave Chinner
On Fri, Apr 01, 2011 at 08:38:11AM +0530, Balbir Singh wrote:
> * Dave Chinner  [2011-04-01 08:40:33]:
> 
> > On Wed, Mar 30, 2011 at 11:00:26AM +0530, Balbir Singh wrote:
> > > 
> > > The following series implements page cache control;
> > > it is a split-out version of patch 1 of version 3 of the
> > > page cache optimization patches posted earlier.
> > > Previous posting: http://lwn.net/Articles/425851/; analysis:
> > > http://lwn.net/Articles/419713/
> > > 
> > > Detailed Description
> > > 
> > > This patch implements unmapped page cache control via preferred
> > > page cache reclaim. The current patch hooks into kswapd and reclaims
> > > page cache if the user has requested unmapped page cache control.
> > > This is useful in the following scenario:
> > > - In a virtualized environment with cache=writethrough, we see
> > >   double caching (once in the host and once in the guest). As
> > >   we try to scale guests, cache usage across the system grows.
> > >   The goal of this patch is to reclaim page cache when Linux is running
> > >   as a guest and get the host to hold the page cache and manage it.
> > >   There might be temporary duplication, but in the long run, memory
> > >   in the guests would be used for mapped pages.
> > 
> > What does this do that "cache=none" for the VMs and using the page
> > cache inside the guest doesn't achieve? That avoids double caching
> > and doesn't require any new complexity inside the host OS to
> > achieve...
> >
> 
> There was a long discussion on cache=none in the first posting and the
> downsides/impact on throughput. Please see
> http://www.mail-archive.com/kvm@vger.kernel.org/msg30655.html 

All there is in that thread is handwaving about the differences
between cache=none vs cache=writeback behaviour and about the amount
of data loss/corruption when failures occur.  There is only one real
example of real-world performance in the entire thread,
and the root cause of the performance difference is not analysed,
determined or understood.  Hence this thread does not convince me
that using cache=write* plus this functionality is anything other
than papering over some still-unknown problem.
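For reference, the cache modes in question are per-drive QEMU options.
A minimal sketch (the image path is a placeholder):

  # guest page cache only: the host opens the image with O_DIRECT,
  # so there is no host-side caching and no double caching
  qemu-kvm -drive file=/path/to/guest.img,if=virtio,cache=none

  # host page cache in writethrough mode: the double-caching case
  # this patch series is trying to mitigate
  qemu-kvm -drive file=/path/to/guest.img,if=virtio,cache=writethrough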

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 0/3] Unmapped page cache control (v5)

2011-03-31 Thread Balbir Singh
* Dave Chinner  [2011-04-01 08:40:33]:

> On Wed, Mar 30, 2011 at 11:00:26AM +0530, Balbir Singh wrote:
> > 
> > The following series implements page cache control;
> > it is a split-out version of patch 1 of version 3 of the
> > page cache optimization patches posted earlier.
> > Previous posting: http://lwn.net/Articles/425851/; analysis:
> > http://lwn.net/Articles/419713/
> > 
> > Detailed Description
> > 
> > This patch implements unmapped page cache control via preferred
> > page cache reclaim. The current patch hooks into kswapd and reclaims
> > page cache if the user has requested unmapped page cache control.
> > This is useful in the following scenario:
> > - In a virtualized environment with cache=writethrough, we see
> >   double caching (once in the host and once in the guest). As
> >   we try to scale guests, cache usage across the system grows.
> >   The goal of this patch is to reclaim page cache when Linux is running
> >   as a guest and get the host to hold the page cache and manage it.
> >   There might be temporary duplication, but in the long run, memory
> >   in the guests would be used for mapped pages.
> 
> What does this do that "cache=none" for the VMs and using the page
> cache inside the guest doesn't achieve? That avoids double caching
> and doesn't require any new complexity inside the host OS to
> achieve...
>

There was a long discussion on cache=none in the first posting and the
downsides/impact on throughput. Please see
http://www.mail-archive.com/kvm@vger.kernel.org/msg30655.html 

-- 
Three Cheers,
Balbir


Re: [PATCH 0/3] Unmapped page cache control (v5)

2011-03-31 Thread KOSAKI Motohiro
> On Wed, Mar 30, 2011 at 11:00:26AM +0530, Balbir Singh wrote:
> > 
> > The following series implements page cache control;
> > it is a split-out version of patch 1 of version 3 of the
> > page cache optimization patches posted earlier.
> > Previous posting: http://lwn.net/Articles/425851/; analysis:
> > http://lwn.net/Articles/419713/
> > 
> > Detailed Description
> > 
> > This patch implements unmapped page cache control via preferred
> > page cache reclaim. The current patch hooks into kswapd and reclaims
> > page cache if the user has requested unmapped page cache control.
> > This is useful in the following scenario:
> > - In a virtualized environment with cache=writethrough, we see
> >   double caching (once in the host and once in the guest). As
> >   we try to scale guests, cache usage across the system grows.
> >   The goal of this patch is to reclaim page cache when Linux is running
> >   as a guest and get the host to hold the page cache and manage it.
> >   There might be temporary duplication, but in the long run, memory
> >   in the guests would be used for mapped pages.
> 
> What does this do that "cache=none" for the VMs and using the page
> cache inside the guest doesn't achieve? That avoids double caching
> and doesn't require any new complexity inside the host OS to
> achieve...

Right. 

"cache=none" has no double caching issue and KSM already solved
cross gues cache sharing. So, I _guess_ this is not a core motivation 
of his patch. But I'm not him. I'm not sure.




Re: [PATCH 0/3] Unmapped page cache control (v5)

2011-03-31 Thread Dave Chinner
On Wed, Mar 30, 2011 at 11:00:26AM +0530, Balbir Singh wrote:
> 
> The following series implements page cache control;
> it is a split-out version of patch 1 of version 3 of the
> page cache optimization patches posted earlier.
> Previous posting: http://lwn.net/Articles/425851/; analysis:
> http://lwn.net/Articles/419713/
> 
> Detailed Description
> 
> This patch implements unmapped page cache control via preferred
> page cache reclaim. The current patch hooks into kswapd and reclaims
> page cache if the user has requested unmapped page cache control.
> This is useful in the following scenario:
> - In a virtualized environment with cache=writethrough, we see
>   double caching (once in the host and once in the guest). As
>   we try to scale guests, cache usage across the system grows.
>   The goal of this patch is to reclaim page cache when Linux is running
>   as a guest and get the host to hold the page cache and manage it.
>   There might be temporary duplication, but in the long run, memory
>   in the guests would be used for mapped pages.

What does this do that "cache=none" for the VMs and using the page
cache inside the guest doesn't achieve? That avoids double caching
and doesn't require any new complexity inside the host OS to
achieve...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Guest virtual CPUs limited to only 50% of host CPU

2011-03-31 Thread Drew Johnson
Hi,

I am using Qemu-KVM 0.12.5 on Intel Xeon (VT-x enabled) processors and
monitoring the system using htop on the host. On the processors that
are running Qemu-KVM I am seeing a 50/50 split between userspace and
guest ("gu:" in htop). I have pinned the vCPU qemu-kvm threads to
specific host CPUs using taskset. In the guest, the CPUs are nearly
100% userspace in htop.
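For concreteness, pinning a vCPU thread looks something like this
(the thread ID is a placeholder):

  taskset -pc 3 <qemu-vcpu-tid>   # bind one vCPU thread to host CPU 3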

Does anyone have ideas on why this is? Is there a way I can get much
higher utilization for the guest virtual CPUs wrt the host?

Thanks in advance!


Re: [PATCH 0/3] Unmapped page cache control (v5)

2011-03-31 Thread Christoph Lameter
On Thu, 31 Mar 2011, KOSAKI Motohiro wrote:

> 1) zone reclaim doesn't work if the system has multiple nodes and the
>    workload is file-cache oriented (e.g. file server, web server, mail
>    server, et al.), because zone reclaim frees many more pages than
>    zone->pages_min; new page cache requests then consume the nearest node's
>    memory, which triggers the next zone reclaim. Memory utilization is
>    reduced and unnecessary LRU discard increases dramatically.

That is only true if the webserver only allocates from a single node. If
the allocation load is balanced then it will be fine. It is useful to
reclaim pages from the node where we allocate memory, since that keeps the
dataset node-local.

> SGI folks added a CPUSET-specific solution in the past
>    (cpuset.memory_spread_page), but global reclaim still has this issue.
>    Zone reclaim is an HPC-workload-specific feature, and HPC folks have no
>    motivation to avoid CPUSET.

The spreading can also be done via memory policies. But that is only
required if the application has unbalanced allocation behavior.
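For example, an interleaved allocation policy can be imposed from
userspace without application changes (the binary name is a placeholder):

  numactl --interleave=all ./webserver   # spread this process's pages across all nodes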

> 2) Before 2.6.27, the VM had only one LRU, and calc_reclaim_mapped() was
>    used to decide whether to filter out mapped pages. That caused a lot of
>    problems for DB servers and large application servers, because if the
>    system has a lot of mapped pages, 1) the LRU was churned and the reclaim
>    algorithm degenerated into a lottery, and 2) reclaim latency became
>    terribly slow, so hang-up detectors misdetected the state and forced
>    reboots. That was a big problem for RHEL5-based banking systems.
>    So sc->may_unmap should be killed in the future. Don't add new users.

Because a bank could not configure its system properly, we need to get rid
of may_unmap? Maybe raise min_unmapped_ratio instead, and take care that
either the allocation load is balanced or a round-robin scheme is
used by the app?
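The knobs in question, as a sketch with illustrative values:

  # reclaim node-local page cache before falling back to other nodes
  echo 1  > /proc/sys/vm/zone_reclaim_mode

  # zone reclaim only runs while more than this percentage of a zone
  # consists of unmapped, file-backed (i.e. reclaimable) pages
  echo 10 > /proc/sys/vm/min_unmapped_ratio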

> And this patch introduces new allocator fast-path overhead. I haven't seen
> any justification for it.

We could do the triggering differently.

> In other words, to get an ack you have to kill the following three:
> 1) zone-reclaim-oriented reclaim, 2) filter-based LRU scanning (e.g.
> sc->may_unmap), and 3) fast-path overhead. If you want a feature for VM
> guests, any hardcoded machine-configuration assumption and/or workload
> assumption is wrong.

It would be good if you could come up with a new reclaim scheme that
avoids the need for zone reclaim and still allows one to take advantage of
memory distances. I agree that the current scheme sometimes requires
tuning too many esoteric knobs to get useful behavior.

> But I agree that we now perhaps have to consider a somewhat larger VM
> change (or perhaps not). OK, it's a good opportunity to sort some things
> out. Historically, Linux MM has had a "free memory is wasted memory"
> policy, and it has worked completely fine. But now we have a few exceptions.
>
> 1) RT, embedded and finance systems. They really want to avoid reclaim
>    latency (i.e. avoid foreground reclaim completely), and they can accept
>    keeping somewhat more free pages ahead of a memory shortage.

In general we need a mechanism to ensure we can avoid reclaim during
critical sections of an application. So some way to hint to the
machine to free up lots of memory (/proc/sys/vm/drop_caches is far too
drastic) may be useful.
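For comparison, the interface referred to:

  sync                                # write back dirty data; drop_caches only drops clean caches
  echo 1 > /proc/sys/vm/drop_caches   # drop clean page cache
  echo 2 > /proc/sys/vm/drop_caches   # drop reclaimable slab (dentries and inodes)
  echo 3 > /proc/sys/vm/drop_caches   # drop both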

> And now we have four proposals addressing utilization-related issues:
>
> 1) cleancache (from Oracle)
> 2) VirtFS (from IBM)
> 3) kstaled (from Google)
> 4) unmapped page reclaim (from you)
>
> Probably we can't merge all of them, and we need to consolidate the
> requirements and implementations.

Well, all these approaches show that we have major issues with reclaim and
large memory. Things get overly complicated. Time for a new approach that
integrates all the goals these try to accomplish?

> Personally, I think cleancache or another multi-level page cache framework
> looks promising, but another solution is also acceptable. Anyway, I hope
> everyone can step back to a 1000-foot bird's-eye view for a moment and
> sort out all the requirements with everyone involved.

It would be good if you could tackle that problem.


[ANNOUNCE] Native Linux KVM tool

2011-03-31 Thread Pekka Enberg
Hi all,

We’re proud to announce the native Linux KVM tool!

The goal of this tool is to provide a clean, from-scratch, lightweight
KVM host tool implementation that can boot Linux guest images (just a
hobby, won't be big and professional like QEMU) with no BIOS
dependencies and with only the minimal amount of legacy device
emulation.

Note that this is a development prototype for the time being: there's no
networking support and no graphics support, amongst other missing
essentials.

It's great as a learning tool if you want to get your feet wet in
virtualization land: it's only 5 KLOC of clean C code that can already
boot a guest Linux image.

Right now it can boot a Linux image and provide output via a serial
console on the host terminal, i.e. you can use it to boot a guest
Linux image in a terminal or over ssh and log into the guest without
much guest- or host-side setup work needed.

1. To try out the tool, clone the git repository:

  git clone git://github.com/penberg/linux-kvm.git

or alternatively, if you already have a kernel source tree:

  git checkout -b kvm/tool
  git pull git://github.com/penberg/linux-kvm.git

2. Compile the tool:

  cd tools/kvm && make

3. Download a raw userspace image:

  wget http://wiki.qemu.org/download/linux-0.2.img.bz2 && bunzip2 linux-0.2.img.bz2

4. Build a kernel with CONFIG_VIRTIO_BLK=y and
CONFIG_SERIAL_8250_CONSOLE=y configuration options. Note: also make sure
you have CONFIG_EXT2_FS or CONFIG_EXT4_FS if you use the above image.

5. And finally, launch the hypervisor:

  ./kvm --image=linux-0.2.img --kernel=../../arch/x86/boot/bzImage

The tool has been written by Pekka Enberg, Cyrill Gorcunov, and Asias
He. Special thanks to Avi Kivity for his help on KVM internals and Ingo
Molnar for all-around support and encouragement!

See the following thread for original discussion for motivation of this
project:

http://thread.gmane.org/gmane.linux.kernel/962051/focus=962620

Pekka



Re: KVM, iSCSI and High Availability

2011-03-31 Thread Guido Winkelmann
On Thursday 31 March 2011, you wrote:
> That's what CLVM is for: it propagates the volume changes to every member
> of the 'cluster'.

Oh, right. I didn't know about clvm until now.

It sounds very promising though, certainly better than working with the
proprietary API of whichever SAN vendor you have, just to create a new LUN for
every VM. Also, the machine we have got here, a Dell PowerVault, appears to be
limited to at most 255 LUNs. I don't know if that's a limitation of iSCSI or
just a problem with this particular array.

Guido


Re: [PATCH]arch:x86:kvm:i8254.h Fix typo in kvm_pit

2011-03-31 Thread Justin P. Mattock

On 03/31/2011 02:20 AM, Avi Kivity wrote:

On 03/30/2011 07:42 PM, Justin P. Mattock wrote:

On 03/30/2011 10:17 AM, Avi Kivity wrote:

On 03/30/2011 06:30 PM, Justin P. Mattock wrote:

On 03/30/2011 09:26 AM, Avi Kivity wrote:

On 03/30/2011 06:19 PM, Justin P. Mattock wrote:

The patch below changes base_addresss to base_address.
Note: I have grepped for base_addresss and nothing shows up;
grepping for base_address gets me lots of output, telling me that
this is a typo, but I could be wrong.

Signed-off-by: Justin P. Mattock

---
arch/x86/kvm/i8254.h | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/i8254.h b/arch/x86/kvm/i8254.h
index 46d08ca..c2fa48b 100644
--- a/arch/x86/kvm/i8254.h
+++ b/arch/x86/kvm/i8254.h
@@ -33,7 +33,7 @@ struct kvm_kpit_state {
};

struct kvm_pit {
- unsigned long base_addresss;
+ unsigned long base_address;
struct kvm_io_device dev;
struct kvm_io_device speaker_dev;
struct kvm *kvm;


Why not remove the variable completely?



didn't even think to completely remove the variable (figured it was used
somewhere). I will look at that and resend with removal of the variable
for you.


Well if it was used, you ought to have changed all of the users, no?



at the moment I see:
(keep in mind my reading skills only go so far!)

grep -Re base_address kvm/* -n
kvm/ioapic.c:276:   return ((addr >= ioapic->base_address &&
kvm/ioapic.c:277:           (addr < ioapic->base_address + IOAPIC_MEM_LENGTH)));
kvm/ioapic.c:371:   ioapic->base_address = IOAPIC_DEFAULT_BASE_ADDRESS;
kvm/ioapic.h:38:    u64 base_address;

so changing base_addresss to base_address gets kvm_ioapic_reset to
function correctly, as well as ioapic_in_range?
(but could be wrong)



Can you explain how kvm_ioapic_reset() would be affected by the change?

Really, you need to understand what you're doing before sending patches.



well, looking at the code:
virt/kvm/ioapic.c @@276

static inline int ioapic_in_range(struct kvm_ioapic *ioapic, gpa_t addr)
{
	return ((addr >= ioapic->base_address &&
		 (addr < ioapic->base_address + IOAPIC_MEM_LENGTH)));
}

I see base_address in there, but looking more at the code, it's for
something completely different..


Justin P. Mattock


Re: KVM, iSCSI and High Availability

2011-03-31 Thread David Martin
That's what CLVM is for: it propagates the volume changes to every member of
the 'cluster'.

David Martin

- Original Message -
> On Monday 28 March 2011, David Martin wrote:
> > - Original Message -
> >
> > > On 3/28/11 2:46 PM, Avi Kivity wrote:
> > > > On 03/25/2011 10:26 PM, Marcin M. Jessa wrote:
> > > [...]
> > >
> > > > One LUN per image allows you to implement failover, LVM doesn't
> > > > (but
> > > > cluster-LVM does). I recommend using one LUN per image; it's
> > > > much
> > > > simpler.
> > >
> > > Some people say "Use one LUN, it's easier and use CLVM". Why is it
> > > easier to use CLVM and one LUN per virtual guest?
> >
> > I find it easier because i can do:
> > lvcreate -n vm1 --size 40G iscsi_vg
> > then virt-install or whatever
> > If I were using 1 lun per vm then I would have to provision the lun,
> > make
> > ALL hosts aware of the lun, and finally screw with the multipath
> > configs
> > etc.
> 
> Don't you have basically the same problem when using LVM on one LUN?
> You still have to make all the hosts aware of the new LV manually. I
> don't even know if LVM supports this; it wasn't exactly designed for a
> situation where multiple hosts might simultaneously read and write to a
> volume group, let alone create and destroy logical volumes while the VG
> is in use by any number of other hosts...
> 
> Guido


[PATCH 2/2 v3] kvm/x86: move and fix substitute search for missing CPUID entries

2011-03-31 Thread Andre Przywara
If KVM cannot find an exact match for a requested CPUID leaf, the
code will try to find the closest match instead of simply confessing
its failure.
The implementation was meant to satisfy the CPUID specification, but
did not properly check for extended and standard leaves and also
didn't account for the index subleaf.
Besides, this rule only applies to CPUID intercepts, which are not
the only user of the kvm_find_cpuid_entry() function.

So fix this algorithm and call it from kvm_emulate_cpuid().
This fixes a crash of newer Linux kernels as KVM guests on
AMD Bulldozer CPUs, where bogus values were returned in response to
a CPUID intercept.

CC:  [2.6.38]
Signed-off-by: Andre Przywara 
---
 arch/x86/kvm/x86.c |   31 +--
 1 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6e86cec..a38fb9b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4959,12 +4959,6 @@ struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu,
best = e;
break;
}
-   /*
-* Both basic or both extended?
-*/
-   if (((e->function ^ function) & 0x80000000) == 0)
-   if (!best || e->function > best->function)
-   best = e;
}
return best;
 }
@@ -4984,6 +4978,27 @@ not_found:
return 36;
 }
 
+/*
+ * If no match is found, check whether we exceed the vCPU's limit
+ * and return the content of the highest valid _standard_ leaf instead.
+ * This is to satisfy the CPUID specification.
+ */
+static struct kvm_cpuid_entry2* check_cpuid_limit(struct kvm_vcpu *vcpu,
+  u32 function, u32 index)
+{
+   struct kvm_cpuid_entry2 *maxlevel;
+
+   maxlevel = kvm_find_cpuid_entry(vcpu, function & 0x80000000, 0);
+   if (!maxlevel || maxlevel->eax >= function)
+   return NULL;
+   if (function & 0x80000000) {
+   maxlevel = kvm_find_cpuid_entry(vcpu, 0, 0);
+   if (!maxlevel)
+   return NULL;
+   }
+   return kvm_find_cpuid_entry(vcpu, maxlevel->eax, index);
+}
+
 void kvm_emulate_cpuid(struct kvm_vcpu *vcpu)
 {
u32 function, index;
@@ -4996,6 +5011,10 @@ void kvm_emulate_cpuid(struct kvm_vcpu *vcpu)
kvm_register_write(vcpu, VCPU_REGS_RCX, 0);
kvm_register_write(vcpu, VCPU_REGS_RDX, 0);
best = kvm_find_cpuid_entry(vcpu, function, index);
+
+   if (!best)
+   best = check_cpuid_limit(vcpu, function, index);
+
if (best) {
kvm_register_write(vcpu, VCPU_REGS_RAX, best->eax);
kvm_register_write(vcpu, VCPU_REGS_RBX, best->ebx);
-- 
1.6.4
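A worked example of the fallback above (the leaf values are illustrative,
not taken from the patch):

  /*
   * vCPU reports: leaf 0x0        -> eax = 0xd        (highest standard leaf)
   *               leaf 0x80000000 -> eax = 0x8000001a (highest extended leaf)
   * Guest executes CPUID with eax = 0x8000001b, ecx = 0:
   *   kvm_find_cpuid_entry(vcpu, 0x8000001b, 0) finds no entry;
   *   check_cpuid_limit() reads leaf 0x80000000, sees 0x8000001a < 0x8000001b,
   *   and, since the request is an extended leaf, re-reads leaf 0x0
   *   (eax = 0xd) and returns the contents of standard leaf 0xd -- the
   *   highest valid standard leaf, as the CPUID specification requires.
   */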




Re: [PATCH 2/2] kvm/x86: move and fix substitute search for missing CPUID entries

2011-03-31 Thread Andre Przywara

Avi Kivity wrote:

On 03/31/2011 03:13 PM, Andre Przywara wrote:

If KVM cannot find an exact match for a requested CPUID leaf, the
code will try to find the closest match instead of simply confessing
its failure.
The implementation was meant to satisfy the CPUID specification, but
did not properly check for extended and standard leaves and also
didn't account for the index subleaf.
Besides, this rule only applies to CPUID intercepts, which are not
the only user of the kvm_find_cpuid_entry() function.

So fix this algorithm and move it into kvm_emulate_cpuid().
This fixes a crash of newer Linux kernels as KVM guests on
AMD Bulldozer CPUs, where bogus values were returned in response to
a CPUID intercept.

@@ -4996,6 +4990,19 @@ void kvm_emulate_cpuid(struct kvm_vcpu *vcpu)
kvm_register_write(vcpu, VCPU_REGS_RCX, 0);
kvm_register_write(vcpu, VCPU_REGS_RDX, 0);
best = kvm_find_cpuid_entry(vcpu, function, index);
+
+   /* if no match is found, check whether we exceed the vCPU's limit
+* and return the content of the highest valid standard leaf instead.
+* This is to satisfy the CPUID specification.
+*/
+   if (!best) {
+   best = kvm_find_cpuid_entry(vcpu, function & 0x80000000, 0);


"highest valid standard leaf" means the second argument should be zero, no?


Weird, but somehow true. I fixed this in another version (following).

Thanks for spotting this.

Andre.




+   if (best && best->eax < function)
+   best = kvm_find_cpuid_entry(vcpu, best->eax, index);
+   else
+   best = NULL;
+   }
+
if (best) {
kvm_register_write(vcpu, VCPU_REGS_RAX, best->eax);
kvm_register_write(vcpu, VCPU_REGS_RBX, best->ebx);





--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany



Re: KVM, iSCSI and High Availability

2011-03-31 Thread Guido Winkelmann
On Monday 28 March 2011, David Martin wrote:
> - Original Message -
> 
> > On 3/28/11 2:46 PM, Avi Kivity wrote:
> > > On 03/25/2011 10:26 PM, Marcin M. Jessa wrote:
> > [...]
> > 
> > > One LUN per image allows you to implement failover, LVM doesn't (but
> > > cluster-LVM does). I recommend using one LUN per image; it's much
> > > simpler.
> > 
> > Some people say "Use one LUN, it's easier and use CLVM". Why is it
> > easier to use CLVM and one LUN per virtual guest?
> 
> I find it easier because i can do:
> lvcreate -n vm1 --size 40G iscsi_vg
> then virt-install or whatever
> If I were using 1 lun per vm then I would have to provision the lun, make
> ALL hosts aware of the lun, and finally screw with the multipath configs
> etc.

Don't you have basically the same problem when using LVM on one LUN? You still
have to make all the hosts aware of the new LV manually. I don't even know if
LVM supports this; it wasn't exactly designed for a situation where multiple
hosts might simultaneously read and write to a volume group, let alone create
and destroy logical volumes while the VG is in use by any number of other
hosts...

Guido 


Re: [PATCH 2/2] kvm/x86: move and fix substitute search for missing CPUID entries

2011-03-31 Thread Avi Kivity

On 03/31/2011 03:13 PM, Andre Przywara wrote:

If KVM cannot find an exact match for a requested CPUID leaf, the
code will try to find the closest match instead of simply confessing
its failure.
The implementation was meant to satisfy the CPUID specification, but
did not properly check for extended and standard leaves and also
didn't account for the index subleaf.
Besides, this rule only applies to CPUID intercepts, which are not
the only user of the kvm_find_cpuid_entry() function.

So fix this algorithm and move it into kvm_emulate_cpuid().
This fixes a crash of newer Linux kernels as KVM guests on
AMD Bulldozer CPUs, where bogus values were returned in response to
a CPUID intercept.

@@ -4996,6 +4990,19 @@ void kvm_emulate_cpuid(struct kvm_vcpu *vcpu)
kvm_register_write(vcpu, VCPU_REGS_RCX, 0);
kvm_register_write(vcpu, VCPU_REGS_RDX, 0);
best = kvm_find_cpuid_entry(vcpu, function, index);
+
+   /* if no match is found, check whether we exceed the vCPU's limit
+* and return the content of the highest valid standard leaf instead.
+* This is to satisfy the CPUID specification.
+*/
+   if (!best) {
+   best = kvm_find_cpuid_entry(vcpu, function & 0x80000000, 0);


"highest valid standard leaf" means the second argument should be zero, no?


+   if (best && best->eax < function)
+   best = kvm_find_cpuid_entry(vcpu, best->eax, index);
+   else
+   best = NULL;
+   }
+
if (best) {
kvm_register_write(vcpu, VCPU_REGS_RAX, best->eax);
kvm_register_write(vcpu, VCPU_REGS_RBX, best->ebx);



--
error compiling committee.c: too many arguments to function



[PATCH 2/2] kvm/x86: move and fix substitute search for missing CPUID entries

2011-03-31 Thread Andre Przywara
If KVM cannot find an exact match for a requested CPUID leaf, the
code will try to find the closest match instead of simply confessing
its failure.
The implementation was meant to satisfy the CPUID specification, but
did not properly check for extended and standard leaves and also
didn't account for the index subleaf.
Besides, this rule only applies to CPUID intercepts, which are not
the only user of the kvm_find_cpuid_entry() function.

So fix this algorithm and move it into kvm_emulate_cpuid().
This fixes a crash of newer Linux kernels as KVM guests on
AMD Bulldozer CPUs, where bogus values were returned in response to
a CPUID intercept.

CC:  [2.6.38]
Signed-off-by: Andre Przywara 
---
 arch/x86/kvm/x86.c |   19 +--
 1 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6e86cec..552b8f8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4959,12 +4959,6 @@ struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu,
best = e;
break;
}
-   /*
-* Both basic or both extended?
-*/
-   if (((e->function ^ function) & 0x80000000) == 0)
-   if (!best || e->function > best->function)
-   best = e;
}
return best;
 }
@@ -4996,6 +4990,19 @@ void kvm_emulate_cpuid(struct kvm_vcpu *vcpu)
kvm_register_write(vcpu, VCPU_REGS_RCX, 0);
kvm_register_write(vcpu, VCPU_REGS_RDX, 0);
best = kvm_find_cpuid_entry(vcpu, function, index);
+
+   /* if no match is found, check whether we exceed the vCPU's limit
+* and return the content of the highest valid standard leaf instead.
+* This is to satisfy the CPUID specification.
+*/
+   if (!best) {
+   best = kvm_find_cpuid_entry(vcpu, function & 0x80000000, 0);
+   if (best && best->eax < function)
+   best = kvm_find_cpuid_entry(vcpu, best->eax, index);
+   else
+   best = NULL;
+   }
+
if (best) {
kvm_register_write(vcpu, VCPU_REGS_RAX, best->eax);
kvm_register_write(vcpu, VCPU_REGS_RBX, best->ebx);
-- 
1.6.4




Re: [PATCH 2/2] kvm/x86: remove unneeded substitute search for missing CPUID entries

2011-03-31 Thread Avi Kivity

On 03/31/2011 12:12 PM, Andre Przywara wrote:

Avi Kivity wrote:

On 03/30/2011 03:01 PM, Andre Przywara wrote:

If KVM cannot find an exact match for a requested CPUID leaf, the
code will try to find the closest match instead of simply confessing
its failure. The heuristic is, on the one hand, wrong nowadays,
since it does not take the KVM CPUID leaves (0x400000xx) into
account. On the other hand, the callers of this function can all deal
with the no-match situation. So let's remove this code, as it serves
no purpose.
This fixes a crash of newer Linux kernels as KVM guests on
AMD Bulldozer CPUs, where bogus values were returned in response to
a CPUID intercept.


@@ -4959,12 +4959,6 @@ struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu,

  best = e;
  break;
  }
-/*
- * Both basic or both extended?
- */
-if (((e->function ^ function) & 0x80000000) == 0)
-if (!best || e->function > best->function)
-best = e;
  }
  return best;
  }



This behaviour is mandated by the spec (looking at the Intel one),
though it is implemented incorrectly - it should always return the
largest basic leaf and ignore the kvm leaves.


But the spec says that this applies only if EAX is higher than the
largest supported leaf. The code as-is checks whether KVM has an entry
in the cpuid "cache" for it, which is not the same. In particular, the
case that hit me was a missing index entry, which should return 0.


Ah, I see.

The check for too large leaf numbers should be moved into 
kvm_emulate_cpuid(). There is already some code in QEMU 
(cpu_x86_cpuid) to handle this, but that path does not apply to KVM.


I will make a new version of this patch which replaces the old check 
with a sane version in kvm_emulate_cpuid().


Thanks for pointing this out.



I think the correct behaviour is:

if (e->function < 1 && (!best || e->function > best->function))
 best = e;

We probably need a find_exact_cpuid_entry() that returns NULL if it 
doesn't find a match, for internal use.


As mentioned, this behavior only applies to the actual intercept case, 
not to all users of kvm_find_cpuid_entry(). So I'd like to make this 
check in the intercept code path and not in this function.




Right.

--
error compiling committee.c: too many arguments to function



Re: [PATCH 12/13] KVM: SVM: Add checks for IO instructions

2011-03-31 Thread Roedel, Joerg
On Thu, Mar 31, 2011 at 06:03:37AM -0400, Avi Kivity wrote:
> On 03/31/2011 11:42 AM, Roedel, Joerg wrote:
> > On Thu, Mar 31, 2011 at 05:18:28AM -0400, Avi Kivity wrote:
> > >  On 03/31/2011 09:14 AM, Roedel, Joerg wrote:
> > >  >  On Mon, Mar 28, 2011 at 08:28:12AM -0400, Avi Kivity wrote:
> > >  >  >   The spec indicates we need to check the TSS and IOPL based 
> > > permissions
> > >  >  >   before the intercept (vmx agrees).  With the code as is, it 
> > > happens
> > >  >  >   afterwards.
> > >  >  >
> > >  >  >   One way to do this is to have an ExtraChecks bit in the 
> > > opcode::flags.
> > >  >  >   Then opcode::u.xcheck->perms() is the pre-intercept check and
> > >  >  >   opcode::u.xcheck->execute() is the post-intercept execution.  
> > > Should
> > >  >  >   work for monitor/mwait/rdtsc(p)/rdpmc/other crap x86 throws at us.
> > >  >
> > >  >  Okay, as you suggested, I put these checks into the instruction 
> > > emulator
> > >  >  and let the hard work of implementing per-arch checks to the 
> > > nested-vmx
> > >  >  people ;)
> > >  I doubt that this makes the opcode tables more readable, but let's see :)
> > >
> > >  I think we're miscommunicating.  I'm talking about x86 checks, not virt
> > >  vendor specific checks.
> >
The place of the intercept check may be vendor-specific. I haven't looked
at the Intel spec, though. But there are probably differences.
> 
> That's why there are three hooks: pre-ex, post-ex, post-mem.  If an 
> intercept fits in between, use the pre-ex hook and duplicate the checks 
> in the intercept.
> 
> As far as I recall, everything should fit into those three, though.

Okay, that's the way to go then, thanks.

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: [PATCH 2/2] kvm/x86: remove unneeded substitute search for missing CPUID entries

2011-03-31 Thread Andre Przywara

Avi Kivity wrote:

On 03/30/2011 03:01 PM, Andre Przywara wrote:

If KVM cannot find an exact match for a requested CPUID leaf, the
code will try to find the closest match instead of simply confessing
its failure. The heuristic is, on the one hand, wrong nowadays,
since it does not take the KVM CPUID leaves (0x400000xx) into
account. On the other hand, the callers of this function can all deal
with the no-match situation. So let's remove this code, as it serves
no purpose.
This fixes a crash of newer Linux kernels as KVM guests on
AMD Bulldozer CPUs, where bogus values were returned in response to
a CPUID intercept.


@@ -4959,12 +4959,6 @@ struct kvm_cpuid_entry2 *kvm_find_cpuid_entry(struct kvm_vcpu *vcpu,
best = e;
break;
}
-   /*
-* Both basic or both extended?
-*/
-   if (((e->function ^ function) & 0x80000000) == 0)
-   if (!best || e->function > best->function)
-   best = e;
}
return best;
  }



This behaviour is mandated by the spec (looking at the Intel one),
though it is implemented incorrectly - it should always return the
largest basic leaf and ignore the kvm leaves.


But the spec says that this applies only if EAX is higher than the
largest supported leaf. The code as-is checks whether KVM has an entry
in the cpuid "cache" for it, which is not the same. In particular, the
case that hit me was a missing index entry, which should return 0.
The check for too large leaf numbers should be moved into 
kvm_emulate_cpuid(). There is already some code in QEMU (cpu_x86_cpuid) 
to handle this, but that path does not apply to KVM.


I will make a new version of this patch which replaces the old check 
with a sane version in kvm_emulate_cpuid().


Thanks for pointing this out.



I think the correct behaviour is:

if (e->function < 1 && (!best || e->function > best->function))
 best = e;

We probably need a find_exact_cpuid_entry() that returns NULL if it 
doesn't find a match, for internal use.


As mentioned, this behavior only applies to the actual intercept case, 
not to all users of kvm_find_cpuid_entry(). So I'd like to make this 
check in the intercept code path and not in this function.


Regards,
Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany



[PATCHv2] KVM: emulator: do not needlesly sync registers from emulator ctxt to vcpu

2011-03-31 Thread Gleb Natapov
Currently we sync registers back and forth before/after exiting
to userspace for IO, but during IO the device model shouldn't need to
read/write the registers, so we can just as well skip those sync points.
The only exception is the broken vmware backdoor interface. The new code
syncs register content during IO only if the registers are read from or
written to by userspace in the middle of the IO operation, and this
almost never happens in practice.

Signed-off-by: Gleb Natapov 
---
 v1->v2
  - rebased onto the SSE emulation patchset.

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e820c63..f68bacc 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -370,6 +370,8 @@ struct kvm_vcpu_arch {
/* emulate context */
 
struct x86_emulate_ctxt emulate_ctxt;
+   bool emulate_regs_need_sync_to_vcpu;
+   bool emulate_regs_need_sync_from_vcpu;
 
gpa_t time;
struct pvclock_vcpu_time_info hv_clock;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9686547..eecad1b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4378,6 +4378,7 @@ static void init_emulate_ctxt(struct kvm_vcpu *vcpu)
? X86EMUL_MODE_PROT32 : X86EMUL_MODE_PROT16;
memset(c, 0, sizeof(struct decode_cache));
memcpy(c->regs, vcpu->arch.regs, sizeof c->regs);
+   vcpu->arch.emulate_regs_need_sync_from_vcpu = false;
 }
 
 int kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq)
@@ -4460,6 +4461,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
 {
int r;
struct decode_cache *c = &vcpu->arch.emulate_ctxt.decode;
+   bool writeback = true;
 
kvm_clear_exception_queue(vcpu);
vcpu->arch.mmio_fault_cr2 = cr2;
@@ -4500,9 +4502,12 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu,
return EMULATE_DONE;
}
 
-   /* this is needed for vmware backdor interface to work since it
+   /* this is needed for vmware backdoor interface to work since it
   changes registers values  during IO operation */
-   memcpy(c->regs, vcpu->arch.regs, sizeof c->regs);
+   if (vcpu->arch.emulate_regs_need_sync_from_vcpu) {
+   vcpu->arch.emulate_regs_need_sync_from_vcpu = false;
+   memcpy(c->regs, vcpu->arch.regs, sizeof c->regs);
+   }
 
 restart:
r = x86_emulate_insn(&vcpu->arch.emulate_ctxt);
@@ -4520,19 +4525,28 @@ restart:
} else if (vcpu->arch.pio.count) {
if (!vcpu->arch.pio.in)
vcpu->arch.pio.count = 0;
+   else
+   writeback = false;  
r = EMULATE_DO_MMIO;
-   } else if (vcpu->mmio_needed)
+   } else if (vcpu->mmio_needed) {
+   if (!vcpu->mmio_is_write)
+   writeback = false;
r = EMULATE_DO_MMIO;
-   else if (r == EMULATION_RESTART)
+   } else if (r == EMULATION_RESTART)
goto restart;
else
r = EMULATE_DONE;
 
-   toggle_interruptibility(vcpu, vcpu->arch.emulate_ctxt.interruptibility);
-   kvm_set_rflags(vcpu, vcpu->arch.emulate_ctxt.eflags);
-   kvm_make_request(KVM_REQ_EVENT, vcpu);
-   memcpy(vcpu->arch.regs, c->regs, sizeof c->regs);
-   kvm_rip_write(vcpu, vcpu->arch.emulate_ctxt.eip);
+   if (writeback) {
+   toggle_interruptibility(vcpu,
+   vcpu->arch.emulate_ctxt.interruptibility);
+   kvm_set_rflags(vcpu, vcpu->arch.emulate_ctxt.eflags);
+   kvm_make_request(KVM_REQ_EVENT, vcpu);
+   memcpy(vcpu->arch.regs, c->regs, sizeof c->regs);
+   vcpu->arch.emulate_regs_need_sync_to_vcpu = false;
+   kvm_rip_write(vcpu, vcpu->arch.emulate_ctxt.eip);
+   } else
+   vcpu->arch.emulate_regs_need_sync_to_vcpu = true;
 
return r;
 }
@@ -5516,6 +5530,18 @@ out:
 
 int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 {
+   if (vcpu->arch.emulate_regs_need_sync_to_vcpu) {
+   /*
+* We are here if userspace calls get_regs() in the middle of
+* instruction emulation. Registers state needs to be copied
+* back from emulation context to vcpu. Userspace shouldn't do
+* that usually, but some badly designed PV devices (vmware
+* backdoor interface) need this to work
+*/
+   struct decode_cache *c = &vcpu->arch.emulate_ctxt.decode;
+   memcpy(vcpu->arch.regs, c->regs, sizeof c->regs);
+   vcpu->arch.emulate_regs_need_sync_to_vcpu = false;
+   }
regs->rax = kvm_register_read(vcpu, VCPU_REGS_RAX);
regs->rbx = kvm_register_read(vcpu, VCPU_REGS_RBX);
regs->rcx = kvm_register_read(vcpu, VCPU_REGS_RCX);
@@ -5543,6 +5569,9 @@ int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct

Re: [PATCH 12/13] KVM: SVM: Add checks for IO instructions

2011-03-31 Thread Avi Kivity

On 03/31/2011 11:42 AM, Roedel, Joerg wrote:

On Thu, Mar 31, 2011 at 05:18:28AM -0400, Avi Kivity wrote:
>  On 03/31/2011 09:14 AM, Roedel, Joerg wrote:
>  >  On Mon, Mar 28, 2011 at 08:28:12AM -0400, Avi Kivity wrote:
>  >  >   The spec indicates we need to check the TSS and IOPL based permissions
>  >  >   before the intercept (vmx agrees).  With the code as is, it happens
>  >  >   afterwards.
>  >  >
>  >  >   One way to do this is to have an ExtraChecks bit in the opcode::flags.
>  >  >   Then opcode::u.xcheck->perms() is the pre-intercept check and
>  >  >   opcode::u.xcheck->execute() is the post-intercept execution.  Should
>  >  >   work for monitor/mwait/rdtsc(p)/rdpmc/other crap x86 throws at us.
>  >
>  >  Okay, as you suggested, I put these checks into the instruction emulator
>  >  and let the hard work of implementing per-arch checks to the nested-vmx
>  >  people ;)
> >  I doubt that this makes the opcode tables more readable, but let's see :)
>
>  I think we're miscommunicating.  I'm talking about x86 checks, not virt
>  vendor specific checks.

The place of the intercept check may be vendor-specific. I haven't looked
at the Intel spec, though. But there are probably differences.


That's why there are three hooks: pre-ex, post-ex, post-mem.  If an 
intercept fits in between, use the pre-ex hook and duplicate the checks 
in the intercept.


As far as I recall, everything should fit into those three, though.


>  For example, the flow for IOIO would be:
>
> #UD check (lock prefix)
> PE/IOPL/CPL/VM check
> TSS bitmap check (can cause #PF)
> Intercept check
> Operand segment check
> Possible #PF
> Execution
>
>  We need to make sure the TSS bitmap check happens before the intercept,
>  so we need to split ->execute() into two.

Right. For the generic case, how about factoring out the checks (for the
POST_EX intercept case) into a separate excp_check callback (similar to the
execute callback) and executing it before the post-exception-intercept
check?


That is exactly my suggestion.
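For reference, the three checkpoints map onto an emulator-side stage enum;
a sketch of the shape such an enum could take (names illustrative):

  enum x86_intercept_stage {
          X86_ICPT_PRE_EXCEPT,     /* before any exception/permission checks  */
          X86_ICPT_POST_EXCEPT,    /* after #UD/privilege checks, before the
                                      memory-operand access (possible #PF)    */
          X86_ICPT_POST_MEMACCESS, /* after the memory access, before execute */
  };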

--
error compiling committee.c: too many arguments to function



Re: KVM: x86: better fix for race between nmi injection and enabling nmi window

2011-03-31 Thread Gleb Natapov
On Thu, Mar 31, 2011 at 11:25:46AM +0200, Avi Kivity wrote:
> On 03/31/2011 11:24 AM, Gleb Natapov wrote:
> >On Thu, Mar 31, 2011 at 11:23:28AM +0200, Avi Kivity wrote:
> >>  On 03/30/2011 08:47 PM, Gleb Natapov wrote:
> >>  >On Wed, Mar 30, 2011 at 07:16:34PM +0200, Avi Kivity wrote:
> >>  >>   On 03/30/2011 06:30 PM, Marcelo Tosatti wrote:
> >>  >>   >Based on Gleb's idea, fix race between nmi injection and enabling
> >>  >>   >nmi window in a simpler way.
> >>  >>   >
> >>  >>   >Signed-off-by: Marcelo Tosatti
> >>  >>   >
> >>  >>   >diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >>  >>   >index a6a129f..9a7cc1be 100644
> >>  >>   >--- a/arch/x86/kvm/x86.c
> >>  >>   >+++ b/arch/x86/kvm/x86.c
> >>  >>   >@@ -5152,6 +5152,7 @@ static void kvm_put_guest_xcr0(struct 
> >> kvm_vcpu *vcpu)
> >>  >>   >static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >>  >>   >{
> >>  >>   >int r;
> >>  >>   >+   int nmi_pending;
> >>  >>   >bool req_int_win = !irqchip_in_kernel(vcpu->kvm)&&
> >>  >>   >vcpu->run->request_interrupt_window;
> >>  >>   >
> >>  >>   >@@ -5195,11 +5196,13 @@ static int vcpu_enter_guest(struct kvm_vcpu 
> >> *vcpu)
> >>  >>   >if (unlikely(r))
> >>  >>   >goto out;
> >>  >>   >
> >>  >>   >+   nmi_pending = ACCESS_ONCE(vcpu->arch.nmi_pending);
> >>  >>   >+
> >>  >>   >if (kvm_check_request(KVM_REQ_EVENT, vcpu) || 
> >> req_int_win) {
> >>  >>   >inject_pending_event(vcpu);
> >>  >>   >
> >>  >>   >/* enable NMI/IRQ window open exits if needed */
> >>  >>   >-   if (vcpu->arch.nmi_pending)
> >>  >>   >+   if (nmi_pending)
> >>  >>   >kvm_x86_ops->enable_nmi_window(vcpu);
> >>  >>   >else if (kvm_cpu_has_interrupt(vcpu) || 
> >> req_int_win)
> >>  >>   >kvm_x86_ops->enable_irq_window(vcpu);
> >>  >>   >
> >>  >>
> >>  >>   What about the check in inject_pending_events()?
> >>  >>
> >>  >Didn't we decide that this check is not a problem? Worst that can happen
> >>  >is NMI injection will be delayed till next exit.
> >>
> >>  Could be very far in the future.
> >>
> >Next host interrupt. But with a tickless host and guest, yeah.
> >
> 
> esp. important with NMI, which may be used in a situation where your
> tick (and everything else) are dead.
> 
If the host is alive, eventually the CPU should receive a reschedule IPI.

--
Gleb.


Re: [PATCH 12/13] KVM: SVM: Add checks for IO instructions

2011-03-31 Thread Roedel, Joerg
On Thu, Mar 31, 2011 at 05:18:28AM -0400, Avi Kivity wrote:
> On 03/31/2011 09:14 AM, Roedel, Joerg wrote:
> > On Mon, Mar 28, 2011 at 08:28:12AM -0400, Avi Kivity wrote:
> > >  The spec indicates we need to check the TSS and IOPL based permissions
> > >  before the intercept (vmx agrees).  With the code as is, it happens
> > >  afterwards.
> > >
> > >  One way to do this is to have an ExtraChecks bit in the opcode::flags.
> > >  Then opcode::u.xcheck->perms() is the pre-intercept check and
> > >  opcode::u.xcheck->execute() is the post-intercept execution.  Should
> > >  work for monitor/mwait/rdtsc(p)/rdpmc/other crap x86 throws at us.
> >
> > Okay, as you suggested, I put these checks into the instruction emulator
> > and let the hard work of implementing per-arch checks to the nested-vmx
> > people ;)
> > I doubt that this makes the opcode tables more readable, but let's see :)
> 
> I think we're miscommunicating.  I'm talking about x86 checks, not virt 
> vendor specific checks.

The place of the intercept check may be vendor-specific. I haven't looked
at the Intel spec, though. But there are probably differences.

> For example, the flow for IOIO would be:
> 
>#UD check (lock prefix)
>PE/IOPL/CPL/VM check
>TSS bitmap check (can cause #PF)
>Intercept check
>Operand segment check
>Possible #PF
>Execution
> 
> We need to make sure the TSS bitmap check happens before the intercept, 
> so we need to split ->execute() into two.

Right. For the generic case, how about factoring out the checks (for the
POST_EX intercept case) into a separate excp_check callback (similar to the
execute callback) and executing it before the post-exception-intercept
check?

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632



Re: KVM: x86: better fix for race between nmi injection and enabling nmi window

2011-03-31 Thread Marcelo Tosatti
On Wed, Mar 30, 2011 at 08:47:03PM +0200, Gleb Natapov wrote:
> On Wed, Mar 30, 2011 at 07:16:34PM +0200, Avi Kivity wrote:
> > On 03/30/2011 06:30 PM, Marcelo Tosatti wrote:
> > >Based on Gleb's idea, fix race between nmi injection and enabling
> > >nmi window in a simpler way.
> > >
> > >Signed-off-by: Marcelo Tosatti
> > >
> > >diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > >index a6a129f..9a7cc1be 100644
> > >--- a/arch/x86/kvm/x86.c
> > >+++ b/arch/x86/kvm/x86.c
> > >@@ -5152,6 +5152,7 @@ static void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu)
> > >  static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> > >  {
> > >   int r;
> > >+  int nmi_pending;
> > >   bool req_int_win = !irqchip_in_kernel(vcpu->kvm)&&
> > >   vcpu->run->request_interrupt_window;
> > >
> > >@@ -5195,11 +5196,13 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> > >   if (unlikely(r))
> > >   goto out;
> > >
> > >+  nmi_pending = ACCESS_ONCE(vcpu->arch.nmi_pending);
> > >+
> > >   if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
> > >   inject_pending_event(vcpu);
> > >
> > >   /* enable NMI/IRQ window open exits if needed */
> > >-  if (vcpu->arch.nmi_pending)
> > >+  if (nmi_pending)
> > >   kvm_x86_ops->enable_nmi_window(vcpu);
> > >   else if (kvm_cpu_has_interrupt(vcpu) || req_int_win)
> > >   kvm_x86_ops->enable_irq_window(vcpu);
> > >
> > 
> > What about the check in inject_pending_events()?
> > 
> Didn't we decide that this check is not a problem? Worst that can happen
> is NMI injection will be delayed till next exit.

Yes.



Re: KVM: x86: better fix for race between nmi injection and enabling nmi window

2011-03-31 Thread Marcelo Tosatti
On Thu, Mar 31, 2011 at 11:25:46AM +0200, Avi Kivity wrote:

> >>  >>   >else if (kvm_cpu_has_interrupt(vcpu) || 
> >> req_int_win)
> >>  >>   >kvm_x86_ops->enable_irq_window(vcpu);
> >>  >>   >
> >>  >>
> >>  >>   What about the check in inject_pending_events()?
> >>  >>
> >>  >Didn't we decide that this check is not a problem? Worst that can happen
> >>  >is NMI injection will be delayed till next exit.
> >>
> >>  Could be very far in the future.
> >>
> >Next host interrupt. But with a tickless host and guest, yeah.
> >
> 
> esp. important with NMI, which may be used in a situation where your
> tick (and everything else) are dead.

Well, the host must be alive. So eventually the NMI will be delivered to the
guest, but not closely matching guest-visible clocks.

What is the problem with that? The race should be rare enough not to
interfere with NMI-watchdog-like things?



Re: KVM: x86: better fix for race between nmi injection and enabling nmi window

2011-03-31 Thread Avi Kivity

On 03/31/2011 11:24 AM, Gleb Natapov wrote:

On Thu, Mar 31, 2011 at 11:23:28AM +0200, Avi Kivity wrote:
>  On 03/30/2011 08:47 PM, Gleb Natapov wrote:
>  >On Wed, Mar 30, 2011 at 07:16:34PM +0200, Avi Kivity wrote:
>  >>   On 03/30/2011 06:30 PM, Marcelo Tosatti wrote:
>  >>   >Based on Gleb's idea, fix race between nmi injection and enabling
>  >>   >nmi window in a simpler way.
>  >>   >
>  >>   >Signed-off-by: Marcelo Tosatti
>  >>   >
>  >>   >diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>  >>   >index a6a129f..9a7cc1be 100644
>  >>   >--- a/arch/x86/kvm/x86.c
>  >>   >+++ b/arch/x86/kvm/x86.c
>  >>   >@@ -5152,6 +5152,7 @@ static void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu)
>  >>   >static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  >>   >{
>  >>   >   int r;
>  >>   >+  int nmi_pending;
>  >>   >   bool req_int_win = !irqchip_in_kernel(vcpu->kvm)&&
>  >>   >   vcpu->run->request_interrupt_window;
>  >>   >
>  >>   >@@ -5195,11 +5196,13 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  >>   >   if (unlikely(r))
>  >>   >   goto out;
>  >>   >
>  >>   >+  nmi_pending = ACCESS_ONCE(vcpu->arch.nmi_pending);
>  >>   >+
>  >>   >   if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
>  >>   >   inject_pending_event(vcpu);
>  >>   >
>  >>   >   /* enable NMI/IRQ window open exits if needed */
>  >>   >-  if (vcpu->arch.nmi_pending)
>  >>   >+  if (nmi_pending)
>  >>   >   kvm_x86_ops->enable_nmi_window(vcpu);
>  >>   >   else if (kvm_cpu_has_interrupt(vcpu) || req_int_win)
>  >>   >   kvm_x86_ops->enable_irq_window(vcpu);
>  >>   >
>  >>
>  >>   What about the check in inject_pending_events()?
>  >>
>  >Didn't we decide that this check is not a problem? Worst that can happen
>  >is NMI injection will be delayed till next exit.
>
>  Could be very far in the future.
>
Next host interrupt. But with a tickless host and guest, yeah.



esp. important with NMI, which may be used in a situation where your 
tick (and everything else) are dead.


--
error compiling committee.c: too many arguments to function



Re: KVM: x86: better fix for race between nmi injection and enabling nmi window

2011-03-31 Thread Gleb Natapov
On Thu, Mar 31, 2011 at 11:23:28AM +0200, Avi Kivity wrote:
> On 03/30/2011 08:47 PM, Gleb Natapov wrote:
> >On Wed, Mar 30, 2011 at 07:16:34PM +0200, Avi Kivity wrote:
> >>  On 03/30/2011 06:30 PM, Marcelo Tosatti wrote:
> >>  >Based on Gleb's idea, fix race between nmi injection and enabling
> >>  >nmi window in a simpler way.
> >>  >
> >>  >Signed-off-by: Marcelo Tosatti
> >>  >
> >>  >diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >>  >index a6a129f..9a7cc1be 100644
> >>  >--- a/arch/x86/kvm/x86.c
> >>  >+++ b/arch/x86/kvm/x86.c
> >>  >@@ -5152,6 +5152,7 @@ static void kvm_put_guest_xcr0(struct kvm_vcpu 
> >> *vcpu)
> >>  >   static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >>  >   {
> >>  > int r;
> >>  >+int nmi_pending;
> >>  > bool req_int_win = !irqchip_in_kernel(vcpu->kvm)&&
> >>  > vcpu->run->request_interrupt_window;
> >>  >
> >>  >@@ -5195,11 +5196,13 @@ static int vcpu_enter_guest(struct kvm_vcpu 
> >> *vcpu)
> >>  > if (unlikely(r))
> >>  > goto out;
> >>  >
> >>  >+nmi_pending = ACCESS_ONCE(vcpu->arch.nmi_pending);
> >>  >+
> >>  > if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
> >>  > inject_pending_event(vcpu);
> >>  >
> >>  > /* enable NMI/IRQ window open exits if needed */
> >>  >-if (vcpu->arch.nmi_pending)
> >>  >+if (nmi_pending)
> >>  > kvm_x86_ops->enable_nmi_window(vcpu);
> >>  > else if (kvm_cpu_has_interrupt(vcpu) || req_int_win)
> >>  > kvm_x86_ops->enable_irq_window(vcpu);
> >>  >
> >>
> >>  What about the check in inject_pending_events()?
> >>
> >Didn't we decide that this check is not a problem? Worst that can happen
> >is NMI injection will be delayed till next exit.
> 
> Could be very far in the future.
> 
Next host interrupt. But with a tickless host and guest, yeah.

--
Gleb.


Re: KVM: x86: better fix for race between nmi injection and enabling nmi window

2011-03-31 Thread Avi Kivity

On 03/30/2011 08:47 PM, Gleb Natapov wrote:

On Wed, Mar 30, 2011 at 07:16:34PM +0200, Avi Kivity wrote:
>  On 03/30/2011 06:30 PM, Marcelo Tosatti wrote:
>  >Based on Gleb's idea, fix race between nmi injection and enabling
>  >nmi window in a simpler way.
>  >
>  >Signed-off-by: Marcelo Tosatti
>  >
>  >diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>  >index a6a129f..9a7cc1be 100644
>  >--- a/arch/x86/kvm/x86.c
>  >+++ b/arch/x86/kvm/x86.c
>  >@@ -5152,6 +5152,7 @@ static void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu)
>  >   static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  >   {
>  >  int r;
>  >+ int nmi_pending;
>  >  bool req_int_win = !irqchip_in_kernel(vcpu->kvm)&&
>  >  vcpu->run->request_interrupt_window;
>  >
>  >@@ -5195,11 +5196,13 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  >  if (unlikely(r))
>  >  goto out;
>  >
>  >+ nmi_pending = ACCESS_ONCE(vcpu->arch.nmi_pending);
>  >+
>  >  if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
>  >  inject_pending_event(vcpu);
>  >
>  >  /* enable NMI/IRQ window open exits if needed */
>  >- if (vcpu->arch.nmi_pending)
>  >+ if (nmi_pending)
>  >  kvm_x86_ops->enable_nmi_window(vcpu);
>  >  else if (kvm_cpu_has_interrupt(vcpu) || req_int_win)
>  >  kvm_x86_ops->enable_irq_window(vcpu);
>  >
>
>  What about the check in inject_pending_event()?
>
Didn't we decide that this check is not a problem? The worst that can
happen is that NMI injection will be delayed until the next exit.


Could be very far in the future.

--
error compiling committee.c: too many arguments to function



Re: [PATCH]arch:x86:kvm:i8254.h Fix typo in kvm_pit

2011-03-31 Thread Avi Kivity

On 03/30/2011 07:42 PM, Justin P. Mattock wrote:

On 03/30/2011 10:17 AM, Avi Kivity wrote:

On 03/30/2011 06:30 PM, Justin P. Mattock wrote:

On 03/30/2011 09:26 AM, Avi Kivity wrote:

On 03/30/2011 06:19 PM, Justin P. Mattock wrote:

The patch below changes base_addresss to base_address.
Note: I have grepped for base_addresss and nothing shows up, while
grepping for base_address gets lots of output, which tells me that this
is a typo, but I could be wrong.

Signed-off-by: Justin P. Mattock

---
arch/x86/kvm/i8254.h | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/i8254.h b/arch/x86/kvm/i8254.h
index 46d08ca..c2fa48b 100644
--- a/arch/x86/kvm/i8254.h
+++ b/arch/x86/kvm/i8254.h
@@ -33,7 +33,7 @@ struct kvm_kpit_state {
};

struct kvm_pit {
- unsigned long base_addresss;
+ unsigned long base_address;
struct kvm_io_device dev;
struct kvm_io_device speaker_dev;
struct kvm *kvm;


Why not remove the variable completely?



I didn't even think to completely remove the variable (I figured it was
used somewhere). I will look at that and resend with the variable
removed.


Well if it was used, you ought to have changed all of the users, no?



at the moment I see:
(keep in mind my reading skills only go so far!)

 grep -Re base_address kvm/* -n
kvm/ioapic.c:276:   return ((addr >= ioapic->base_address &&
kvm/ioapic.c:277:(addr < ioapic->base_address + IOAPIC_MEM_LENGTH)));
kvm/ioapic.c:371:   ioapic->base_address = IOAPIC_DEFAULT_BASE_ADDRESS;

kvm/ioapic.h:38:u64 base_address;

so changing base_addresss; to base_address; gets kvm_ioapic_reset to 
function correctly as well as ioapic_in_range?

(but could be wrong)



Can you explain how kvm_ioapic_reset() would be affected by the change?

Really, you need to understand what you're doing before sending patches.

--
error compiling committee.c: too many arguments to function


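To make Avi's suggestion concrete: since nothing references the field,
the safer fix is to delete it rather than rename it. A hedged sketch
follows, with a stand-in for struct kvm_io_device so it compiles on its
own; the real header change would simply drop one line:

    struct kvm_io_device { int dummy; };  /* stand-in for the real type */
    struct kvm;

    struct kvm_pit {
        /* 'unsigned long base_addresss' was never read or written, so
         * it is removed entirely; any user that grep missed would now
         * fail to compile, which is a stronger check than renaming. */
        struct kvm_io_device dev;
        struct kvm_io_device speaker_dev;
        struct kvm *kvm;
    };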


Re: [PATCH 12/13] KVM: SVM: Add checks for IO instructions

2011-03-31 Thread Avi Kivity

On 03/31/2011 09:14 AM, Roedel, Joerg wrote:

On Mon, Mar 28, 2011 at 08:28:12AM -0400, Avi Kivity wrote:
>  The spec indicates we need to check the TSS and IOPL based permissions
>  before the intercept (vmx agrees).  With the code as is, it happens
>  afterwards.
>
>  One way to do this is to have an ExtraChecks bit in the opcode::flags.
>  Then opcode::u.xcheck->perms() is the pre-intercept check and
>  opcode::u.xcheck->execute() is the post-intercept execution.  Should
>  work for monitor/mwait/rdtsc(p)/rdpmc/other crap x86 throws at us.

Okay, as you suggested, I put these checks into the instruction emulator
and leave the hard work of implementing per-arch checks to the nested-vmx
people ;)
I doubt that this makes the opcode tables more readable, but let's see :)


I think we're miscommunicating.  I'm talking about x86 checks, not
virt-vendor-specific checks.


For example, the flow for IOIO would be:

  #UD check (lock prefix)
  PE/IOPL/CPL/VM check
  TSS bitmap check (can cause #PF)
  Intercept check
  Operand segment check
  Possible #PF
  Execution

We need to make sure the TSS bitmap check happens before the intercept, 
so we need to split ->execute() into two.
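
A hedged sketch of that split, under stated assumptions: the names
(struct extra_checks, check_intercept, emulate_with_checks) are
illustrative, not the emulator's real API; the point is only the
ordering, with the x86-architectural permission checks running before
the intercept check and the rest of execution after it:

    struct x86_emulate_ctxt;                 /* opaque here */

    struct extra_checks {
        /* pre-intercept: #UD, PE/IOPL/CPL/VM, TSS bitmap (may #PF) */
        int (*perms)(struct x86_emulate_ctxt *ctxt);
        /* post-intercept: operand segment checks, possible #PF, execution */
        int (*execute)(struct x86_emulate_ctxt *ctxt);
    };

    /* stub standing in for the vendor-specific intercept test */
    static int check_intercept(struct x86_emulate_ctxt *ctxt)
    {
        (void)ctxt;
        return 0;  /* 0 == not intercepted */
    }

    static int emulate_with_checks(struct x86_emulate_ctxt *ctxt,
                                   const struct extra_checks *xcheck)
    {
        int rc = xcheck->perms(ctxt);      /* steps 1-3 above */
        if (rc)
            return rc;

        rc = check_intercept(ctxt);        /* step 4: exit if taken */
        if (rc)
            return rc;

        return xcheck->execute(ctxt);      /* steps 5-7 above */
    }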


--
error compiling committee.c: too many arguments to function



Re: [PATCH 0/3] Unmapped page cache control (v5)

2011-03-31 Thread Balbir Singh
* KOSAKI Motohiro  [2011-03-31 14:40:33]:

> > 
> > The following series implements page cache control,
> > this is a split out version of patch 1 of version 3 of the
> > page cache optimization patches posted earlier at
> > Previous posting http://lwn.net/Articles/425851/ and analysis
> > at http://lwn.net/Articles/419713/
> > 
> > Detailed Description
> > 
> > This patch implements unmapped page cache control via preferred
> > page cache reclaim. The current patch hooks into kswapd and reclaims
> > page cache if the user has requested for unmapped page control.
> > This is useful in the following scenario
> > - In a virtualized environment with cache=writethrough, we see
> >   double caching - (one in the host and one in the guest). As
> >   we try to scale guests, cache usage across the system grows.
> >   The goal of this patch is to reclaim page cache when Linux is running
> >   as a guest and get the host to hold the page cache and manage it.
> >   There might be temporary duplication, but in the long run, memory
> >   in the guests would be used for mapped pages.
> >
> > - The option is controlled via a boot option and the administrator
> >   can selectively turn it on, on a need to use basis.
> > 
> > A lot of the code is borrowed from zone_reclaim_mode logic for
> > __zone_reclaim(). One might argue that the with ballooning and
> > KSM this feature is not very useful, but even with ballooning,
> > we need extra logic to balloon multiple VM machines and it is hard
> > to figure out the correct amount of memory to balloon. With these
> > patches applied, each guest has a sufficient amount of free memory
> > available, that can be easily seen and reclaimed by the balloon driver.
> > The additional memory in the guest can be reused for additional
> > applications or used to start additional guests/balance memory in
> > the host.
> 
> If anyone thinks this series works, they are just crazy. This patch
> reintroduces two old issues.
> 
> 1) Zone reclaim doesn't work if the system has multiple nodes and the
>    workload is file-cache oriented (e.g. file server, web server, mail
>    server, et al), because zone reclaim frees many more pages than
>    zone->pages_min; new page cache requests then consume the nearest
>    node's memory, which triggers the next zone reclaim. As a result,
>    memory utilization drops and unnecessary LRU discard increases
>    dramatically.
>
>    SGI folks added a CPUSET-specific solution in the past
>    (cpuset.memory_spread_page), but global reclaim still has the issue.
>    Zone reclaim is an HPC-workload-specific feature, and HPC folks have
>    no motivation not to use CPUSET.
>

I am afraid you misread the patches and the intent. The intent is to
explicitly enable control of unmapped pages; it has nothing
specifically to do with multiple nodes at this point. The control is
system-wide and deliberately enabled by the administrator.
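
For what it's worth, here is a minimal sketch of the kind of check the
series adds, under loudly stated assumptions: the variable name, the
percentage threshold, and the helper are illustrative, not the actual
patch; only zone_page_state(), NR_FILE_PAGES and NR_FILE_MAPPED are
real kernel symbols of the era:

    #include <linux/mmzone.h>  /* struct zone, NR_FILE_* counters */
    #include <linux/vmstat.h>  /* zone_page_state() */

    /* 0 (default) = feature off; set via a boot parameter */
    static unsigned long max_unmapped_ratio;

    static bool should_reclaim_unmapped(struct zone *zone)
    {
        unsigned long unmapped, limit;

        if (!max_unmapped_ratio)   /* admin did not opt in */
            return false;

        /* unmapped page cache = file-backed pages minus mapped ones */
        unmapped = zone_page_state(zone, NR_FILE_PAGES) -
                   zone_page_state(zone, NR_FILE_MAPPED);
        limit = zone->present_pages * max_unmapped_ratio / 100;
        return unmapped > limit;
    }

kswapd would then run zone_reclaim()-style logic with mapped pages
filtered out, so only the unmapped page cache is shrunk.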
 
> 2) Before 2.6.27, the VM had only one LRU, and calc_reclaim_mapped() was
>    used to decide whether to filter out mapped pages. That caused a lot
>    of problems for DB servers and large application servers: if the
>    system has a lot of mapped pages, 1) the LRU was churned and the
>    reclaim algorithm degenerated, and 2) reclaim latency became terribly
>    slow, so hangup detectors misdetected the state and started forcing
>    reboots. That was a big problem for RHEL5-based banking systems.
>    So sc->may_unmap should be killed in the future. Don't add new users.
> 

Can you remove sc->may_unmap without removing zone_reclaim()? The LRU
churn can be addressed at the time of isolation; I'll send out an
incremental patch for that.

> 
> But I agree that perhaps we now have to consider a somewhat larger VM
> change (or perhaps not). OK, it's a good opportunity to spell a few
> things out. Historically, Linux MM has had a "free memory is wasted
> memory" policy, and it worked completely fine. But now we have a few
> exceptions.
> 
> 1) RT, embedded and finance systems. They really want to avoid reclaim
>    latency (i.e. avoid foreground reclaim completely) and can accept
>    keeping somewhat more free pages ahead of a memory shortage.
> 
> 2) VM guests.
>    A VM host plus VM guests naturally makes a two-level page cache
>    model, and the Linux page cache doesn't work well with two levels.
>    It has two issues:
>    1) it is hard to visualize real memory consumption, which makes it
>       harder to make ballooning work well; and Google wants to
>       visualize memory utilization to pack in more jobs.
>    2) it is hard to build an in-kernel mechanism for improving memory
>       utilization.
> 
> 
> And now we have four proposals for these utilization issues.
> 
> 1) cleancache (from Oracle)

Cleancache requires both hypervisor and guest support. With these
patches, Linux can run well under a hypervisor if we know the
hypervisor does a lot of the IO and maintains the cache.

> 2) VirtFS (from IBM)
> 3) kstaled (from Google)
> 4) unmapped page reclaim (from you)
> 
> Probably, we can't merge all of them and we need t

Re: [PATCH 12/13] KVM: SVM: Add checks for IO instructions

2011-03-31 Thread Roedel, Joerg
On Mon, Mar 28, 2011 at 08:28:12AM -0400, Avi Kivity wrote:
> The spec indicates we need to check the TSS and IOPL based permissions 
> before the intercept (vmx agrees).  With the code as is, it happens 
> afterwards.
> 
> One way to do this is to have an ExtraChecks bit in the opcode::flags.  
> Then opcode::u.xcheck->perms() is the pre-intercept check and 
> opcode::u.xcheck->execute() is the post-intercept execution.  Should 
> work for monitor/mwait/rdtsc(p)/rdpmc/other crap x86 throws at us.

Okay, as you suggested, I put these checks into the instruction emulator
and leave the hard work of implementing per-arch checks to the nested-vmx
people ;)
I doubt that this makes the opcode tables more readable, but let's see :)

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632
