[ kvm-Bugs-1884444 ] Virtual network (set up via TCP socket) dog slow, but vde ok

2010-06-14 Thread SourceForge.net
Bugs item #1884444, was opened at 2008-02-01 15:30
Message generated for change (Comment added) made by jessorensen
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1884444&group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: qemu
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Alain Knaff (alain_knaff)
Assigned to: Anthony Liguori (aliguori)
Summary: Virtual network (set up via TCP socket) dog slow, but vde ok

Initial Comment:
I've got two kvm virtual machines that I connect together via a virtual 
network:

1. A samba server (Kubuntu 7.10), whose network I set up with -net 
nic,vlan=1,macaddr=00:1B:FC:1C:00:01 -net socket,vlan=1,listen=127.0.0.1:1234

2. A Windows XP client, whose network I set up with -net 
nic,vlan=1,macaddr=00:1B:FC:1C:01:11 -net socket,vlan=1,connect=127.0.0.1:1234

The Windows XP VM has joined the Samba VM's domain.
When a domain user logs in, it literally takes _minutes_ until its profile 
data is copied to the client, and the same to copy it back to the server on 
logout.

Setting up both with VDE, it's *much* quicker (only a couple of seconds, rather 
than minutes for login/logout).

I also tried udp mcast networking, but couldn't get both VMs to communicate 
with each other. A tcpdump on the host shows traffic on port 1234. However, a 
tcpdump on the Kubuntu guest shows no traffic (not even arp) when I try to ping 
it from the Windows VM.


--

Comment By: Jes Sorensen (jessorensen)
Date: 2010-06-14 08:21

Message:
Hi,

Per Anthony's comment, any chance you could let us know if this problem was
resolved?

I am trying to go through the old bugs and if this one is fixed, I'd like
to close it.

Thanks,
Jes


--

Comment By: Anthony Liguori (aliguori)
Date: 2008-05-26 19:16

Message:
Logged In: YES 
user_id=120449
Originator: NO

I suspect this is related to whether SIGIO was enabled on the file
descriptors.  With the recent IO thread refactorings, this should be fixed.
 Please test if you can or I will queue it up to test myself.

--

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Autotest] [KVM-AUTOTEST PATCH 13/14] KVM test: kvm_monitor.py: add QMP interface

2010-06-14 Thread Michael Goldish
On 06/14/2010 05:39 AM, Amos Kong wrote:
 On Sun, Jun 13, 2010 at 05:33:44PM +0300, Michael Goldish wrote:
 An initial QMP client implementation.
 Should be fully functional and supports asynchronous events.
 However, most tests must be modified to support it, because it returns output
 in a different format from the human monitor (the human monitor returns
 strings and the QMP one returns dicts or lists).

 To enable QMP, set main_monitor to a monitor whose monitor_type is qmp.

 For example (a single QMP monitor):

 monitors = monitor1
 monitor_type_monitor1 = qmp
 main_monitor = monitor1

 Another example (multiple monitors, both human and QMP):

 monitors = MyMonitor SomeOtherMonitor YetAnotherMonitor   # defines 3 monitors
 monitor_type = human                    # default for all monitors
 monitor_type_SomeOtherMonitor = qmp     # applies only to SomeOtherMonitor
 monitor_type_YetAnotherMonitor = qmp    # applies only to YetAnotherMonitor
 main_monitor = SomeOtherMonitor         # the main monitor is a QMP one, so
                                         # the test will use QMP

 Note:
 Monitor methods now raise exceptions such as MonitorLockError and QMPCmdError.
 If this turns out to be a bad idea, it shouldn't be hard to revert to the old
 convention of returning a (status, output) tuple.
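For callers, the practical difference between the two conventions can be
sketched as follows. This is a minimal standalone illustration, not code from
the patch; only the exception class names mirror the ones described above:

```python
# Illustrative only: a caller adapting from the old (status, output)
# convention to exception-based error reporting.

class MonitorError(Exception):
    pass

class QMPCmdError(MonitorError):
    pass

def cmd_old_style(fail):
    # Old convention: errors signalled via the status element of a tuple.
    if fail:
        return (1, "command failed")
    return (0, {"return": {}})

def cmd_new_style(fail):
    # New convention: errors raised as exceptions.
    if fail:
        raise QMPCmdError("command failed")
    return {"return": {}}

def run(fail):
    # A test wrapper that catches the new-style exception.
    try:
        return ("ok", cmd_new_style(fail))
    except QMPCmdError as e:
        return ("error", str(e))
```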

 Signed-off-by: Michael Goldish mgold...@redhat.com
 ---
  client/tests/kvm/kvm_monitor.py |  275 +++
  client/tests/kvm/kvm_vm.py      |    6 +-
  2 files changed, 279 insertions(+), 2 deletions(-)

 diff --git a/client/tests/kvm/kvm_monitor.py b/client/tests/kvm/kvm_monitor.py
 index c5cf9c3..76a1a83 100644
 --- a/client/tests/kvm/kvm_monitor.py
 +++ b/client/tests/kvm/kvm_monitor.py
 @@ -6,6 +6,11 @@ Interfaces to the QEMU monitor.
  
  import socket, time, threading, logging
  import kvm_utils
 +try:
 +    import json
 +except ImportError:
 +    logging.warning("Could not import json module. "
 +                    "QMP monitor functionality disabled.")
  
  
  class MonitorError(Exception):
 @@ -28,6 +33,10 @@ class MonitorProtocolError(MonitorError):
  pass
  
  
 +class QMPCmdError(MonitorError):
 +pass
 +
 +
  class Monitor:
  
  Common code for monitor classes.
 @@ -114,6 +123,8 @@ class HumanMonitor(Monitor):
             suppress_exceptions is False
         @raise MonitorProtocolError: Raised if the initial (qemu) prompt isn't
             found and suppress_exceptions is False
 +       @note: Other exceptions may be raised.  See _get_command_output's
 +           docstring.
  
  try:
  Monitor.__init__(self, filename)
 @@ -354,3 +365,267 @@ class HumanMonitor(Monitor):
  @return: The command's output
  
 return self._get_command_output("mouse_button %d" % state)
 +
 +
 +class QMPMonitor(Monitor):
 +    """
 +    Wraps QMP monitor commands.
 +    """
 +
 +    def __init__(self, filename, suppress_exceptions=False):
 +        """
 +        Connect to the monitor socket and issue the qmp_capabilities command
 +
 +        @param filename: Monitor socket filename
 +        @raise MonitorConnectError: Raised if the connection fails and
 +            suppress_exceptions is False
 +        @note: Other exceptions may be raised if the qmp_capabilities command
 +            fails.  See _get_command_output's docstring.
 +        """
 +        try:
 +            Monitor.__init__(self, filename)
 +
 +            self.protocol = "qmp"
 +            self.events = []
 +
 +            # Issue qmp_capabilities
 +            self._get_command_output("qmp_capabilities")
 +
 +        except MonitorError, e:
 +            if suppress_exceptions:
 +                logging.warn(e)
 +            else:
 +                raise
 +
 +
 +    # Private methods
 +
 +    def _build_cmd(self, cmd, args=None):
 +        obj = {"execute": cmd}
 +        if args:
 +            obj["arguments"] = args
 +        return obj
 +
 +
 +    def _read_objects(self, timeout=5):
 +        """
 +        Read lines from monitor and try to decode them.
 +        Stop when all available lines have been successfully decoded, or when
 +        timeout expires.  If any decoded objects are asynchronous events,
 +        store them in self.events.  Return all decoded objects.
 +
 +        @param timeout: Time to wait for all lines to decode successfully
 +        @return: A list of objects
 +        """
 +        s = ""
 +        objs = []
 +        end_time = time.time() + timeout
 +        while time.time() < end_time:
 +            s += self._recvall()
 +            for line in s.splitlines():
 +                if not line:
 +                    continue
 +                try:
 +                    obj = json.loads(line)
 +                except:
 +                    # Found an incomplete or broken line -- keep reading
 +                    break
 +                objs += [obj]
 +            else:
 
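As a rough standalone sketch of the wire format the patch above handles: QMP
commands are JSON objects with an "execute" key and optional "arguments", and
responses and asynchronous events arrive as newline-delimited JSON objects.
This is an illustration of the protocol shape, not the patch's own code:

```python
import json

def build_cmd(cmd, args=None):
    # A QMP command is a JSON object: {"execute": ..., "arguments": {...}}.
    obj = {"execute": cmd}
    if args:
        obj["arguments"] = args
    return json.dumps(obj)

def split_objects(buf):
    # Responses arrive as newline-delimited JSON; objects carrying an
    # "event" key are asynchronous events, kept apart from command replies.
    replies, events = [], []
    for line in buf.splitlines():
        if not line.strip():
            continue
        try:
            obj = json.loads(line)
        except ValueError:
            # Incomplete trailing line -- the caller keeps reading.
            break
        (events if "event" in obj else replies).append(obj)
    return replies, events
```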

Re: [Qemu-devel] [RFC PATCH 1/5] qdev: Create qdev_get_dev_path()

2010-06-14 Thread Markus Armbruster
Alex Williamson alex.william...@redhat.com writes:

 qdev_get_dev_path() is intended to be the canonical utility for creating
 a string representing the qdev hierarchy of a device.  The path consists
 of bus and device names as well as identified properties of the immediate
 parent bus and device.  This results in strings like:

 /main-system-bus/pci.0,addr=00.0/i440FX
 /main-system-bus/pci.0,addr=01.0/PIIX3
 /main-system-bus/pci.0,addr=02.0/cirrus-vga
 /main-system-bus/pci.0/isa.0/mc146818rtc
 /main-system-bus/pci.0/isa.0/isa-serial
 /main-system-bus/pci.0/isa.0/i8042
 /main-system-bus/pci.0/isa.0/isa-fdc
 /main-system-bus/pci.0,addr=03.0/i82551,mac=52:54:00:12:34:56
 /main-system-bus/pci.0,addr=04.0/virtio-net-pci,mac=52:54:00:12:34:57
 /main-system-bus/pci.0,addr=05.0/e1000,mac=52:54:00:12:34:58
 /main-system-bus/pci.0,addr=06.0/rtl8139,mac=52:54:00:12:34:59
 /main-system-bus/pci.0,addr=07.0/pcnet,mac=52:54:00:12:34:5a
 /main-system-bus/pci.0,addr=01.1/piix3-ide
 /main-system-bus/pci.0,addr=01.3/PIIX4_PM
 /main-system-bus/pci.0,addr=08.0/lsi53c895a
 /main-system-bus/pci.0,addr=09.0/virtio-blk-pci

 There are two primary targets for these strings.  The first is vmsave, where
 we currently use a device provided string plus instance number to track
 SaveStateEntries.  This fails when we introduce device hotplug, particularly
 in a case were we have gaps in the instance numbers that cannot easily be
 reproduced on a migration target.  The second is for naming RAMBlocks.  For
 these, we would like a string that matches the vmstate string.

Could you explain why you add identified properties of the immediate
parent bus and device?  They make the result very much *not* a dev
path in the qdev sense...
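The example paths quoted above suggest a parent-walk construction. A minimal
sketch, under the assumption that each node knows its name, its identifying
properties, and its parent bus -- illustrative only, not the actual
qdev_get_dev_path() implementation:

```python
def qdev_get_dev_path(dev):
    # dev: nested dicts of the form
    #   {"name": ..., "props": {...}, "parent_bus": ...}
    # Identifying properties (e.g. addr for PCI, mac for NICs) are appended
    # to a segment, mirroring strings like
    #   /main-system-bus/pci.0,addr=02.0/cirrus-vga
    segments = []
    node = dev
    while node:
        seg = node["name"]
        props = node.get("props") or {}
        if props:
            seg += "," + ",".join("%s=%s" % kv for kv in sorted(props.items()))
        segments.append(seg)
        node = node.get("parent_bus")
    # Walk collected leaf-to-root; the path reads root-to-leaf.
    return "/" + "/".join(reversed(segments))
```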


Re: [PATCH] [BSG]: Add support for struct sg_io_v4->d[out,in]_iovec_count

2010-06-14 Thread Nicholas A. Bellinger
On Mon, 2010-06-14 at 15:22 +0900, FUJITA Tomonori wrote:
 On Sun, 13 Jun 2010 23:11:29 -0700
 Nicholas A. Bellinger n...@linux-iscsi.org wrote:
 
  From: Nicholas Bellinger n...@linux-iscsi.org
  
  Greetings Jens and co,
  
  This patch adds the missing support to block/bsg.c:bsg_map_hdr() to accept
   struct sg_io_v4->d[out,in]_iovec_count and process struct
   sg_io_v4->d[out,in]_xferp memory containing userspace iovecs for kernel
   level BSG.  It adds a new wrapper bsg_rq_map_user_iov() that will call
   copy_from_user() and blk_rq_map_user_iov() following the original SG_IO
   logic in drivers/scsi/sg.c:sg_start_req().
  
  So far this has been tested on a x86_64 v2.6.34 KVM Host with TCM_Loop 
  Virtual SAS Port/LUNs
  into a x86_64 v2.6.26 KVM Guest with Megasas 8707EM2 HBA Emulation + my new 
  scsi-bsg backstore code.
  
  Please consider this for v2.6.36 as it will be required in order for 
  QEMU-KVM MegaSAS and VirtIO HBA
  emulation using QEMU scatterlist memory and BSG backstores.
  
  Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org
  ---
   block/bsg.c |   55 +++
   1 files changed, 47 insertions(+), 8 deletions(-)
 
 This has been rejected several times.
 

I guess that would explain why the iovec_count structure members are
part of struct sg_io_v4 and block/bsg.c:bsg_map_hdr() silently accepts
and overwrites them..  :(

 This doesn't work on 32bit user/64bit kernel.

Hmmm, being able to use QEMU-KVM HBA emulation with userspace
scatterlist memory for the various HBA emulation with a matched
user/kernel is still useful though, yes..?  Being able to handle TMR
emulation from QEMU-KVM HBA emulation would also be useful I think.

Having QEMU-KVM fall back to SG_IO for the 32bit/64bit kernel case would be
acceptable for me if we can still make BSG + iovecs work for the typical
case.

 
 The compat code doesn't work for the read/write interface (does for
 the ioctl interface though). So we can't support this feature cleanly.

So I am curious to see if there will be a performance improvement
between QEMU-KVM + scsi-bsg w/ AIO IOVECs compared to the legacy SG_IO
ioctl().  Anyways, I will post my QEMU-KVM scsi-bsg patches and run some
benchmarks this week with this patch on 5500 series Nehalem to see, aside
from the userspace TMR case, how useful having a BSG backstore for
QEMU-KVM may be.

Best,

--nab



Re: [RFC][PATCH 1/2] Linux/Guest unmapped page cache control

2010-06-14 Thread Balbir Singh
* KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-06-14 09:28:19]:

 On Mon, 14 Jun 2010 00:01:45 +0530
 Balbir Singh bal...@linux.vnet.ibm.com wrote:
 
  * Balbir Singh bal...@linux.vnet.ibm.com [2010-06-08 21:21:46]:
  
   Selectively control Unmapped Page Cache (nospam version)
   
   From: Balbir Singh bal...@linux.vnet.ibm.com
   
   This patch implements unmapped page cache control via preferred
   page cache reclaim. The current patch hooks into kswapd and reclaims
   page cache if the user has requested for unmapped page control.
   This is useful in the following scenario
   
   - In a virtualized environment with cache=writethrough, we see
 double caching - (one in the host and one in the guest). As
 we try to scale guests, cache usage across the system grows.
 The goal of this patch is to reclaim page cache when Linux is running
 as a guest and get the host to hold the page cache and manage it.
 There might be temporary duplication, but in the long run, memory
 in the guests would be used for mapped pages.
   - The option is controlled via a boot option and the administrator
 can selectively turn it on, on a need to use basis.
   
   A lot of the code is borrowed from zone_reclaim_mode logic for
   __zone_reclaim(). One might argue that the with ballooning and
   KSM this feature is not very useful, but even with ballooning,
   we need extra logic to balloon multiple VM machines and it is hard
   to figure out the correct amount of memory to balloon. With these
   patches applied, each guest has a sufficient amount of free memory
   available, that can be easily seen and reclaimed by the balloon driver.
   The additional memory in the guest can be reused for additional
   applications or used to start additional guests/balance memory in
   the host.
   
   KSM currently does not de-duplicate host and guest page cache. The goal
   of this patch is to help automatically balance unmapped page cache when
   instructed to do so.
   
   There are some magic numbers in use in the code, UNMAPPED_PAGE_RATIO
   and the number of pages to reclaim when unmapped_page_control argument
   is supplied. These numbers were chosen to avoid aggressiveness in
   reaping page cache ever so frequently, at the same time providing control.
   
   The sysctl for min_unmapped_ratio provides further control from
   within the guest on the amount of unmapped pages to reclaim.
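A rough model of the knob described above: reclaim only unmapped page cache,
down to a floor expressed through min_unmapped_ratio, and only when the boot
option is enabled. The function and parameter names here are illustrative,
not taken from the patch:

```python
def unmapped_pages_to_reclaim(nr_file_pages, nr_mapped, total_pages,
                              min_unmapped_ratio=1,
                              unmapped_page_control=True):
    # Unmapped page cache = file-backed pages minus mapped ones.
    # Reclaim only the excess above a floor given as a percentage of
    # total memory, mirroring the zone_reclaim_mode-style logic the
    # patch borrows. All constants are hypothetical.
    if not unmapped_page_control:
        return 0
    unmapped = nr_file_pages - nr_mapped
    floor = total_pages * min_unmapped_ratio // 100
    return max(0, unmapped - floor)
```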
  
  
  Are there any major objections to this patch?
   
 
 This kind of patch needs measurements of how well it works.
 
 - How did you measure the effect of the patch ? kernbench is not enough,
 of course.

I can run other benchmarks as well; I will do so.

 - Why don't you believe LRU ? And if LRU doesn't work well, should it be
   fixed by a knob rather than generic approach ?
 - No side effects ?

I believe in LRU, just that the problem I am trying to solve is of
using double the memory for caching the same data (consider kvm
running in cache=writethrough or writeback mode, both the hypervisor
and the guest OS maintain a page cache of the same data). As the VMs
grow the overhead is substantial. In my runs I found up to 60%
duplication in some cases.

 
 - Linux vm guys tend to say, free memory is bad memory. ok, for what
   free memory created by your patch is used ? IOW, I can't see the benefit.
   If free memory that your patch created will be used for another page-cache,
   it will be dropped soon by your patch itself.
 

Free memory is good for cases when you want to do more in the same
system. I agree that in a bare metal environment that might be
partially true. I don't have a problem with frequently used data being
cached, but I am targeting a consolidated environment at the moment.
Moreover, the administrator has control via a boot option, so it is
non-intrusive in many ways.

   If your patch just drops duplicated, but no more necessary for other kvm,
   I agree your patch may increase available size of page-caches. But you just
   drops unmapped pages.


unmapped and unused are the best targets, I plan to add slab cache control 
later. 

-- 
Three Cheers,
Balbir


Re: [RFC PATCH 0/5] Introduce canonical device hierarchy string

2010-06-14 Thread Gerd Hoffmann

  Hi,


My premise with this attempt is that we walk the hierarchy and use the
names to create the base of the path.  As we get to the device,
particularly to the parent bus of the device, we need to start looking at
properties to ensure uniqueness.


You'll need that for every bus along the way down to the device.  Create 
a virtual machine with two lsi scsi host adapters, then attach a disk 
with scsi id 0 to each.  Just the scsi id isn't good enough to identify 
the device.  You'll need the lsi pci address too.



For now, the only properties I've tagged as path
properties are PCI bus addresses and MAC addresses.


mac address isn't needed here.  You need the property which specifies 
the bus address.  For PCI this obviously is the PCI address.  For scsi 
the scsi id.  For ISA you can use the I/O port base.  virtio-serial the 
port number, ...
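Gerd's point amounts to a per-bus-type table of bus-address properties. A
hypothetical sketch; the property names come from the discussion above and
are not verified against the qdev source:

```python
# Illustrative mapping of bus type to the property that fixes a device's
# address on that bus (names per the discussion, not the qdev source).
BUS_ADDR_PROP = {
    "PCI": "addr",          # PCI slot.function, e.g. addr=02.0
    "SCSI": "scsi-id",      # SCSI target id
    "ISA": "iobase",        # I/O port base
    "virtio-serial": "nr",  # virtio-serial port number
}

def addr_segment(bus_type, bus_name, value):
    # Build one path segment carrying the bus-address property.
    prop = BUS_ADDR_PROP[bus_type]
    return "%s,%s=%s" % (bus_name, prop, value)
```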


cheers,
  Gerd


Re: [RFC][PATCH 1/2] Linux/Guest unmapped page cache control

2010-06-14 Thread KAMEZAWA Hiroyuki
On Mon, 14 Jun 2010 12:19:55 +0530
Balbir Singh bal...@linux.vnet.ibm.com wrote:
  - Why don't you believe LRU ? And if LRU doesn't work well, should it be
fixed by a knob rather than generic approach ?
  - No side effects ?
 
 I believe in LRU, just that the problem I am trying to solve is of
 using double the memory for caching the same data (consider kvm
 running in cache=writethrough or writeback mode, both the hypervisor
 and the guest OS maintain a page cache of the same data). As the VM's
 grow the overhead is substantial. In my runs I found upto 60%
 duplication in some cases.
 
 
 - Linux vm guys tend to say, free memory is bad memory. ok, for what
   free memory created by your patch is used ? IOW, I can't see the benefit.
   If free memory that your patch created will be used for another page-cache,
   it will be dropped soon by your patch itself.
 
 Free memory is good for cases when you want to do more in the same
 system. I agree that in a bare metail environment that might be
 partially true. I don't have a problem with frequently used data being
 cached, but I am targetting a consolidated environment at the moment.
 Moreover, the administrator has control via a boot option, so it is
 non-instrusive in many ways.

It sounds that what you want is to improve performance etc. but to make it
easy sizing the system and to help admins. Right ?

From a performance perspective, I don't see any advantage to dropping caches
which can be dropped easily anyway. It just uses cpus for a purpose that may
not be necessary.

Thanks,
-Kame



Re: [RFC][PATCH 1/2] Linux/Guest unmapped page cache control

2010-06-14 Thread Balbir Singh
* KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-06-14 16:00:21]:

 On Mon, 14 Jun 2010 12:19:55 +0530
 Balbir Singh bal...@linux.vnet.ibm.com wrote:
   - Why don't you believe LRU ? And if LRU doesn't work well, should it be
 fixed by a knob rather than generic approach ?
   - No side effects ?
  
  I believe in LRU, just that the problem I am trying to solve is of
  using double the memory for caching the same data (consider kvm
  running in cache=writethrough or writeback mode, both the hypervisor
  and the guest OS maintain a page cache of the same data). As the VM's
  grow the overhead is substantial. In my runs I found upto 60%
  duplication in some cases.
  
  
  - Linux vm guys tend to say, free memory is bad memory. ok, for what
free memory created by your patch is used ? IOW, I can't see the benefit.
If free memory that your patch created will be used for another 
  page-cache,
it will be dropped soon by your patch itself.
  
  Free memory is good for cases when you want to do more in the same
  system. I agree that in a bare metail environment that might be
  partially true. I don't have a problem with frequently used data being
  cached, but I am targetting a consolidated environment at the moment.
  Moreover, the administrator has control via a boot option, so it is
  non-instrusive in many ways.
 
 It sounds that what you want is to improve performance etc. but to make it
 easy sizing the system and to help admins. Right ?


Right, to allow freeing up of using double the memory to cache data.
 
 From performance perspective, I don't see any advantage to drop caches
 which can be dropped easily. I just use cpus for the purpose it may no
 be necessary.
 

It is not that easy: in a virtualized environment, you don't directly
reclaim, but use a mechanism like ballooning, and that too requires
smart software to decide where to balloon from. This patch (optionally
if enabled) optimizes that by

1. Reducing double caching
2. Not requiring newer smarts or a management software to monitor and
balloon
3. Allows better estimation of free memory by avoiding double caching
4. Allows immediate use of free memory for other applications or
startup of newer guest instances.

-- 
Three Cheers,
Balbir


Re: [RFC][PATCH 1/2] Linux/Guest unmapped page cache control

2010-06-14 Thread KAMEZAWA Hiroyuki
On Mon, 14 Jun 2010 13:06:46 +0530
Balbir Singh bal...@linux.vnet.ibm.com wrote:
 
  It sounds that what you want is to improve performance etc. but to make it
  easy sizing the system and to help admins. Right ?
 
 
 Right, to allow freeing up of using double the memory to cache data.
  
Oh, sorry. ask again..

It sounds that what you want is _not_ to improve performance etc. but to make it
...

?

-Kame



Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-14 Thread Avi Kivity

On 06/11/2010 07:56 AM, Balbir Singh wrote:



Just to be clear, let's say we have a mapped page (say of /sbin/init)
that's been unreferenced since _just_ after the system booted.  We also
have an unmapped page cache page of a file often used at runtime, say
one from /etc/resolv.conf or /etc/passwd.

Which page will be preferred for eviction with this patch set?

 

In this case the order is as follows

1. First we pick free pages if any
2. If we don't have free pages, we go after unmapped page cache and
slab cache
3. If that fails as well, we go after regular memory

In the scenario that you describe, we'll not be able to easily free up
the frequently referenced page from /etc/*. The code will move on to
step 3 and do its regular reclaim.
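The preference order described above can be expressed as a small illustrative
model (plain Python, not kernel code; the category names are hypothetical):

```python
def eviction_order(pages):
    # Preference described above: free pages first, then unmapped page
    # cache and slab cache, then regular (mapped) memory as a last resort.
    rank = {"free": 0, "unmapped_cache": 1, "slab": 1, "mapped": 2}
    # sorted() is stable, so pages of equal rank keep their relative order.
    return sorted(pages, key=lambda p: rank[p["kind"]])
```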
   


Still it seems to me you are subverting the normal order of reclaim.  I 
don't see why an unmapped page cache or slab cache item should be 
evicted before a mapped page.  Certainly the cost of rebuilding a dentry, 
compared to the gain from evicting it, is much higher than that of 
reestablishing a mapped page.


--
error compiling committee.c: too many arguments to function



Re: [PATCH 1/24] Move nested option from svm.c to x86.c

2010-06-14 Thread Avi Kivity

On 06/13/2010 03:23 PM, Nadav Har'El wrote:

The SVM module had a nested option, on by default, which controls whether
to allow nested virtualization. Now that VMX also supports nested
virtualization, we can move this option to x86.c, for both SVM and VMX.

The nested option takes three possible values. 0 disables nested
virtualization on both SVM and VMX, and 1 enables it on both.
The value 2, which is the default when this module option is not explicitly
set, asks each of SVM or VMX to choose its own default; Currently, VMX
disables nested virtualization in this case, while SVM leaves it enabled.

When nested VMX becomes more mature, this default should probably be changed
to enable nested virtualization on both architectures.

--- .before/arch/x86/kvm/x86.c  2010-06-13 15:01:28.0 +0300
+++ .after/arch/x86/kvm/x86.c   2010-06-13 15:01:28.0 +0300
@@ -95,6 +95,17 @@ EXPORT_SYMBOL_GPL(kvm_x86_ops);
  int ignore_msrs = 0;
  module_param_named(ignore_msrs, ignore_msrs, bool, S_IRUGO | S_IWUSR);

+/* If nested=1, nested virtualization is supported. I.e., the guest may use
+ * VMX or SVM (as appropriate) and be a hypervisor for its own guests.
+ * If nested=0, nested virtualization is not supported.
+ * When nested starts as 2 (which is the default), it is later modified by the
+ * specific module used (VMX or SVM). Currently, nested will be left enabled
+ * on SVM, but reset to 0 on VMX.
+ */
+int nested = 2;
+EXPORT_SYMBOL_GPL(nested);
+module_param(nested, int, S_IRUGO);
+
   



A global variable named 'nested' is not a good idea.  I recommend having 
a kvm-intel scoped module parameter instead; that also avoids the 0/1/2 
values.


After the patches are merged we can try to consolidate here.

--
error compiling committee.c: too many arguments to function



Re: [PATCH 2/24] Add VMX and SVM to list of supported cpuid features

2010-06-14 Thread Avi Kivity

On 06/13/2010 03:23 PM, Nadav Har'El wrote:

Add the VMX CPU feature to the list of CPU features KVM advertises with
the KVM_GET_SUPPORTED_CPUID ioctl (unless the nested module option is off).

Qemu uses this ioctl, and intersects KVM's list with its own list of desired
cpu features (depending on the -cpu option given to qemu) to determine the
final list of features presented to the guest.
This patch also does the same for SVM: KVM now advertises it supports SVM,
unless the nested module option is off.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
--- .before/arch/x86/kvm/x86.c  2010-06-13 15:01:28.0 +0300
+++ .after/arch/x86/kvm/x86.c   2010-06-13 15:01:28.0 +0300
@@ -1923,7 +1923,7 @@ static void do_cpuid_ent(struct kvm_cpui
/* cpuid 1.ecx */
const u32 kvm_supported_word4_x86_features =
F(XMM3) | 0 /* Reserved, DTES64, MONITOR */ |
-   0 /* DS-CPL, VMX, SMX, EST */ |
+   0 /* DS-CPL */ | (nested ? F(VMX) : 0) | 0 /* SMX, EST */ |
0 /* TM2 */ | F(SSSE3) | 0 /* CNXT-ID */ | 0 /* Reserved */ |
0 /* Reserved */ | F(CX16) | 0 /* xTPR Update, PDCM */ |
0 /* Reserved, DCA */ | F(XMM4_1) |
   


You can use kvm_x86_ops->set_supported_cpuid() to alter features.


@@ -1931,7 +1931,8 @@ static void do_cpuid_ent(struct kvm_cpui
0 /* Reserved, XSAVE, OSXSAVE */;
/* cpuid 0x8001.ecx */
const u32 kvm_supported_word6_x86_features =
-   F(LAHF_LM) | F(CMP_LEGACY) | F(SVM) | 0 /* ExtApicSpace */ |
+   F(LAHF_LM) | F(CMP_LEGACY) | (nested ? F(SVM) : 0) |
+   0 /* ExtApicSpace */ |
F(CR8_LEGACY) | F(ABM) | F(SSE4A) | F(MISALIGNSSE) |
F(3DNOWPREFETCH) | 0 /* OSVW */ | 0 /* IBS */ | F(SSE5) |
0 /* SKINIT */ | 0 /* WDT */;
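The change above gates a CPUID feature bit on the module option. As a tiny
illustrative model of the feature-word masking pattern -- the bit positions
below are hypothetical, not the real CPUID layout:

```python
def F(bit):
    # Build a single-bit feature mask from a bit position (illustrative).
    return 1 << bit

# Hypothetical bit positions, for illustration only.
LAHF_LM, SVM, SSE4A = 0, 2, 6

def supported_word6(nested):
    # SVM is advertised only when nested virtualization is enabled,
    # mirroring the (nested ? F(SVM) : 0) pattern in the patch.
    return F(LAHF_LM) | (F(SVM) if nested else 0) | F(SSE4A)
```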
   


Good idea, but let's leave it out of the nvmx patches.

--
error compiling committee.c: too many arguments to function



Re: [PATCH 3/24] Implement VMXON and VMXOFF

2010-06-14 Thread Avi Kivity

On 06/13/2010 03:24 PM, Nadav Har'El wrote:

This patch allows a guest to use the VMXON and VMXOFF instructions, and
emulates them accordingly. Basically this amounts to checking some
prerequisites, and then remembering whether the guest has enabled or disabled
VMX operation.

   


Should probably reorder with next patch.


+/* The nested_vmx structure is part of vcpu_vmx, and holds information we need
+ * for correct emulation of VMX (i.e., nested VMX) on this vcpu. For example,
+ * the current VMCS set by L1, a list of the VMCSs used to run the active
+ * L2 guests on the hardware, and more.
+ */
   


Please (here and elsewhere) use the standard kernel style for multiline 
comments - start with /* on a line by itself.




+/* Emulate the VMXON instruction.
+ * Currently, we just remember that VMX is active, and do not save or even
+ * inspect the argument to VMXON (the so-called VMXON pointer) because we
+ * do not currently need to store anything in that guest-allocated memory
+ * region. Consequently, VMCLEAR and VMPTRLD also do not verify that the their
+ * argument is different from the VMXON pointer (which the spec says they do).
+ */
+static int handle_vmon(struct kvm_vcpu *vcpu)
+{
+   struct kvm_segment cs;
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+   /* The Intel VMX Instruction Reference lists a bunch of bits that
+* are prerequisite to running VMXON, most notably CR4.VMXE must be
+* set to 1. Otherwise, we should fail with #UD. We test these now:
+*/
+   if (!nested) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+   }
+
+   if (!(vcpu->arch.cr4 & X86_CR4_VMXE) ||
+       !(vcpu->arch.cr0 & X86_CR0_PE) ||
+       (vmx_get_rflags(vcpu) & X86_EFLAGS_VM)) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+   }
+
+   vmx_get_segment(vcpu, &cs, VCPU_SREG_CS);
+   if (is_long_mode(vcpu) && !cs.l) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 1;
+   }
+
+   if (vmx_get_cpl(vcpu)) {
+   kvm_inject_gp(vcpu, 0);
+   return 1;
+   }
+
+   vmx->nested.vmxon = 1;
   


= true


+
+   skip_emulated_instruction(vcpu);
+   return 1;
+}
   


Need to block INIT signals in the local apic as well (fine for a 
separate patch).


--
error compiling committee.c: too many arguments to function



Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1

2010-06-14 Thread Avi Kivity

On 06/13/2010 03:25 PM, Nadav Har'El wrote:

An implementation of VMX needs to define a VMCS structure. This structure
is kept in guest memory, but is opaque to the guest (who can only read or
write it with VMX instructions).

This patch starts to define the VMCS structure which our nested VMX
implementation will present to L1. We call it vmcs12, as it is the VMCS
that L1 keeps for its L2 guests.

This patch also adds the notion (as required by the VMX spec) of the current
VMCS, and finally includes utility functions for mapping the guest-allocated
VMCSs in host memory.

+#define VMCS12_REVISION 0x11e57ed0
   


Where did this number come from?  It's not from real hardware, yes?


+
+/*
+ * struct vmcs12 describes the state that our guest hypervisor (L1) keeps for a
+ * single nested guest (L2), hence the name vmcs12. Any VMX implementation has
+ * a VMCS structure (which is opaque to the guest), and vmcs12 is our emulated
+ * VMX's VMCS. This structure is stored in guest memory specified by VMPTRLD,
+ * and accessed by the guest using VMREAD/VMWRITE/VMCLEAR instructions. More
+ * than one of these structures may exist, if L1 runs multiple L2 guests.
+ * nested_vmx_run() will use the data here to build a VMCS for the underlying
+ * hardware which will be used to run L2.
+ * This structure is packed in order to preserve the binary content after live
+ * migration. If there are changes in the content or layout, VMCS12_REVISION
+ * must be changed.
+ */
+struct __attribute__ ((__packed__)) vmcs12 {
   


__packed is a convenient define for this.


+   /* According to the Intel spec, a VMCS region must start with the
+* following two fields. Then follow implementation-specific data.
+*/
+   u32 revision_id;
+   u32 abort;
+};
   


Note that this structure becomes an ABI, it cannot change except in a 
backward compatible way due to the need for live migration.  So I'd like 
a documentation patch that adds a description of the content to 
Documentation/kvm/.  It can be as simple as listing the structure 
definition.




+static struct page *nested_get_page(struct kvm_vcpu *vcpu, u64 vmcs_addr)
+{
+   struct page *vmcs_page =
+   gfn_to_page(vcpu->kvm, vmcs_addr >> PAGE_SHIFT);
+
+   if (is_error_page(vmcs_page)) {
+   printk(KERN_ERR "%s error allocating page 0x%llx\n",
+  __func__, vmcs_addr);
   


Those printks can be used by a malicious guest to spam the host logs.  
Please wrap them with something that is conditional on a debug flag.


I'm not sure what we need to do with vmcs that is not in RAM.  It may 
simplify things to return the error_page to the caller and set 
KVM_REQ_TRIPLE_FAULT, so we don't have to deal with error handling later on.



+   kvm_release_page_clean(vmcs_page);
+   return NULL;
+   }
+   return vmcs_page;
+}
+
+static void nested_unmap_current(struct kvm_vcpu *vcpu)
+{
+   struct page *page;
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+   if (!vmx->nested.current_l2_page) {
+   printk(KERN_INFO "Shadow vmcs already unmapped\n");
+   BUG_ON(1);
+   return;
+   }
+
+   page = kmap_atomic_to_page(vmx->nested.current_l2_page);
+
+   kunmap_atomic(vmx->nested.current_l2_page, KM_USER0);
+
+   kvm_release_page_dirty(page);
   


Do we always dirty the page?

I guess it is no big deal even if we don't.


--
error compiling committee.c: too many arguments to function



Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-14 Thread Balbir Singh
* Avi Kivity a...@redhat.com [2010-06-14 11:09:44]:

 On 06/11/2010 07:56 AM, Balbir Singh wrote:
 
 Just to be clear, let's say we have a mapped page (say of /sbin/init)
 that's been unreferenced since _just_ after the system booted.  We also
 have an unmapped page cache page of a file often used at runtime, say
 one from /etc/resolv.conf or /etc/passwd.
 
 Which page will be preferred for eviction with this patch set?
 
 In this case the order is as follows
 
 1. First we pick free pages if any
 2. If we don't have free pages, we go after unmapped page cache and
 slab cache
 3. If that fails as well, we go after regular memory
 
 In the scenario that you describe, we'll not be able to easily free up
 the frequently referenced page from /etc/*. The code will move on to
 step 3 and do its regular reclaim.
 
 Still it seems to me you are subverting the normal order of reclaim.
 I don't see why an unmapped page cache or slab cache item should be
 evicted before a mapped page.  Certainly the cost of rebuilding a
 dentry compared to the gain from evicting it, is much higher than
 that of reestablishing a mapped page.


Subverting to avoid memory duplication. The word subverting is
overloaded, so let me try to reason a bit. First, let me explain the
problem.

Memory is a precious resource in a consolidated environment.
We don't want to waste memory via page cache duplication
(cache=writethrough and cache=writeback mode).

Now here is what we are trying to do

1. A slab page will not be freed until the entire page is free (all
slabs have been kfree'd so to speak). Normal reclaim will definitely
free this page, but a lot of it depends on how frequently we are
scanning the LRU list and when this page got added.
2. In the case of page cache (specifically unmapped page cache), there
is duplication already, so why not go after unmapped page caches when
the system is under memory pressure?

In the case of 1, we don't force a dentry to be freed, but rather a
freed page in the slab cache to be reclaimed ahead of forcing reclaim
of mapped pages.

Does the problem statement make sense? If so, do you agree with 1 and
2? Is there major concern about subverting regular reclaim? Does
subverting it make sense in the duplicated scenario?

-- 
Three Cheers,
Balbir


Re: [PATCH 7/24] Understanding guest pointers to vmcs12 structures

2010-06-14 Thread Avi Kivity

On 06/13/2010 03:26 PM, Nadav Har'El wrote:

This patch includes a couple of utility functions for extracting pointer
operands of VMX instructions issued by L1 (a guest hypervisor), and
translating guest-given vmcs12 virtual addresses to guest-physical addresses.

+/*
+ * Decode the memory-address operand of a vmx instruction, according to the
+ * Intel spec.
+ */
+#define VMX_OPERAND_SCALING(vii)   ((vii) & 3)
+#define VMX_OPERAND_ADDR_SIZE(vii) (((vii) >> 7) & 7)
+#define VMX_OPERAND_IS_REG(vii)    ((vii) & (1u << 10))
+#define VMX_OPERAND_SEG_REG(vii)   (((vii) >> 15) & 7)
+#define VMX_OPERAND_INDEX_REG(vii) (((vii) >> 18) & 0xf)
+#define VMX_OPERAND_INDEX_INVALID(vii) ((vii) & (1u << 22))
+#define VMX_OPERAND_BASE_REG(vii)  (((vii) >> 23) & 0xf)
+#define VMX_OPERAND_BASE_INVALID(vii)  ((vii) & (1u << 27))
+#define VMX_OPERAND_REG(vii)   (((vii) >> 3) & 0xf)
+#define VMX_OPERAND_REG2(vii)  (((vii) >> 28) & 0xf)
+static gva_t get_vmx_mem_address(struct kvm_vcpu *vcpu,
+unsigned long exit_qualification,
+u32 vmx_instruction_info)
+{
+   int  scaling = VMX_OPERAND_SCALING(vmx_instruction_info);
+   int  addr_size = VMX_OPERAND_ADDR_SIZE(vmx_instruction_info);
+   bool is_reg = VMX_OPERAND_IS_REG(vmx_instruction_info);
+   int  seg_reg = VMX_OPERAND_SEG_REG(vmx_instruction_info);
+   int  index_reg = VMX_OPERAND_INDEX_REG(vmx_instruction_info);
+   bool index_is_valid = !VMX_OPERAND_INDEX_INVALID(vmx_instruction_info);
+   int  base_reg   = VMX_OPERAND_BASE_REG(vmx_instruction_info);
+   bool base_is_valid  = !VMX_OPERAND_BASE_INVALID(vmx_instruction_info);
   


Since those defines are used just once, you can fold them into their 
uses.  It doesn't add much to repeat the variable name.



+   gva_t addr;
+
+   if (is_reg) {
+   kvm_queue_exception(vcpu, UD_VECTOR);
+   return 0;
+   }
+
+   switch (addr_size) {
+   case 1: /* 32 bit. high bits are undefined according to the spec: */
+   exit_qualification &= 0xffffffff;
+   break;
+   case 2: /* 64 bit */
+   break;
+   default: /* addr_size=0 means 16 bit */
+   return 0;
+   }
+
+   /* Addr = segment_base + offset */
+   /* offset = Base + [Index * Scale] + Displacement */
+   addr = vmx_get_segment_base(vcpu, seg_reg);
+   if (base_is_valid)
+   addr += kvm_register_read(vcpu, base_reg);
+   if (index_is_valid)
+   addr += kvm_register_read(vcpu, index_reg) << scaling;
+   addr += exit_qualification; /* holds the displacement */
   


Do we need a segment limit and access rights check?


+
+   return addr;
+}
+
   





Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1

2010-06-14 Thread Nadav Har'El
On Mon, Jun 14, 2010, Avi Kivity wrote about Re: [PATCH 5/24] Introduce 
vmcs12: a VMCS structure for L1:
 On 06/13/2010 03:25 PM, Nadav Har'El wrote:
 +#define VMCS12_REVISION 0x11e57ed0

 
 Where did this number come from?  It's not from real hardware, yes?

Basically, we are presenting emulated VMX for the L1 guest, complete with
its own VMCS structure. This structure needs to have some VMCS revision id,
which should be an arbitrary number that we invent - it is not related to any
revision id that any real hardware uses. If you look closely, you can see that
the number I used is leetspeak for Nested0 ;-)

As you can see in the following patches, MSR_IA32_VMX_BASIC will return this
arbitrary VMCS revision id, and VMPTRLD will verify that the VMCS region
that L1 is trying to load contains this revision id.

-- 
Nadav Har'El|   Monday, Jun 14 2010, 2 Tammuz 5770
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |If you're looking for a helping hand,
http://nadav.harel.org.il   |look first at the end of your arm.


Re: [PATCH 9/24] Implement VMCLEAR

2010-06-14 Thread Avi Kivity

On 06/13/2010 03:27 PM, Nadav Har'El wrote:

This patch implements the VMCLEAR instruction.

+
+/* Emulate the VMCLEAR instruction */
+static int handle_vmclear(struct kvm_vcpu *vcpu)
+{
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   gpa_t guest_vmcs_addr, save_current_vmptr;
+
+   if (!nested_vmx_check_permission(vcpu))
+   return 1;
+
+   if (read_guest_vmcs_gpa(vcpu, &guest_vmcs_addr))
+   return 1;
+
+   save_current_vmptr = vmx->nested.current_vmptr;
+
+   vmx->nested.current_vmptr = guest_vmcs_addr;
+   if (!nested_map_current(vcpu))
+   return 1;
   


Haven't you leaked current_vmptr here?

If I read the code correctly, you are implementing a sort of stack here 
and pushing the current vmptr into save_current_vmptr.  Perhaps it's 
simpler to have an nvmxptr structure which holds a vmptr and a kmap'ed 
pointer to it, and pass that around to functions.



+   vmx->nested.current_l2_page->launch_state = 0;
+   nested_unmap_current(vcpu);
+
+   nested_free_current_vmcs(vcpu);
+
+   if (save_current_vmptr == guest_vmcs_addr)
+   vmx->nested.current_vmptr = -1ull;
+   else
+   vmx->nested.current_vmptr = save_current_vmptr;
+
+   skip_emulated_instruction(vcpu);
+   clear_rflags_cf_zf(vcpu);
+   return 1;
+}
+
   





Re: [PATCH 10/24] Implement VMPTRLD

2010-06-14 Thread Avi Kivity

On 06/13/2010 03:27 PM, Nadav Har'El wrote:

This patch implements the VMPTRLD instruction.


  static void clear_rflags_cf_zf(struct kvm_vcpu *vcpu)
  {
unsigned long rflags;
@@ -3869,6 +3889,57 @@ static int handle_vmclear(struct kvm_vcp
return 1;
  }

+static bool verify_vmcs12_revision(struct kvm_vcpu *vcpu, gpa_t 
guest_vmcs_addr)
+{
+   bool ret;
+   struct vmcs12 *vmcs12;
+   struct page *vmcs_page = nested_get_page(vcpu, guest_vmcs_addr);
   


Blank line so I can catch my breath.


+   if (vmcs_page == NULL)
+   return 0;
   


Doesn't seem right.


+   vmcs12 = (struct vmcs12 *)kmap_atomic(vmcs_page, KM_USER0);
+   if (vmcs12->revision_id == VMCS12_REVISION)
+   ret = 1;
+   else {
+   set_rflags_to_vmx_fail_valid(vcpu);
+   ret = 0;
+   }
+   kunmap_atomic(vmcs12, KM_USER0);
+   kvm_release_page_dirty(vmcs_page);
   


Can release a clean page here.  But what happened to those mapping helpers?


+   return ret;
+}
+
+/* Emulate the VMPTRLD instruction */
+static int handle_vmptrld(struct kvm_vcpu *vcpu)
+{
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   gpa_t guest_vmcs_addr;
+
+   if (!nested_vmx_check_permission(vcpu))
+   return 1;
+
+   if (read_guest_vmcs_gpa(vcpu, &guest_vmcs_addr)) {
+   set_rflags_to_vmx_fail_invalid(vcpu);
   


Need to skip_emulated_instruction() in this case.


+   return 1;
+   }
+
+   if (!verify_vmcs12_revision(vcpu, guest_vmcs_addr))
+   return 1;
   


Here too.


+
+   if (vmx->nested.current_vmptr != guest_vmcs_addr) {
+   vmx->nested.current_vmptr = guest_vmcs_addr;
+
+   if (nested_create_current_vmcs(vcpu)) {
+   printk(KERN_ERR "%s error could not allocate memory",
+   __func__);
   


In general ftrace and the ENOMEM itself are sufficient documentation 
that something went wrong.



+   return -ENOMEM;
+   }
+   }
+
+   clear_rflags_cf_zf(vcpu);
+   skip_emulated_instruction(vcpu);
+   return 1;
+}
+
   





Re: [PATCH 11/24] Implement VMPTRST

2010-06-14 Thread Avi Kivity

On 06/13/2010 03:28 PM, Nadav Har'El wrote:

This patch implements the VMPTRST instruction.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
--- .before/arch/x86/kvm/x86.c  2010-06-13 15:01:29.0 +0300
+++ .after/arch/x86/kvm/x86.c   2010-06-13 15:01:29.0 +0300
@@ -3301,7 +3301,7 @@ static int kvm_read_guest_virt_system(gv
return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, 0, error);
  }

-static int kvm_write_guest_virt_system(gva_t addr, void *val,
+int kvm_write_guest_virt_system(gva_t addr, void *val,
   unsigned int bytes,
   struct kvm_vcpu *vcpu,
   u32 *error)
   


write_guest_virt_system() is used by writes which need to ignore the 
cpl, for example when a cpl 3 instruction loads a segment, the processor 
needs to update the accessed flag even though it is only accessible to 
cpl 0.  That's not your case, you need the ordinary write_guest_virt().


Um, I see there is no kvm_write_guest_virt(), you'll have to introduce it.



+/* Emulate the VMPTRST instruction */
+static int handle_vmptrst(struct kvm_vcpu *vcpu)
+{
+   int r = 0;
+   unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+   u32 vmx_instruction_info = vmcs_read32(VMX_INSTRUCTION_INFO);
+   gva_t vmcs_gva;
+
+   if (!nested_vmx_check_permission(vcpu))
+   return 1;
+
+   vmcs_gva = get_vmx_mem_address(vcpu, exit_qualification,
+  vmx_instruction_info);
+   if (vmcs_gva == 0)
+   return 1;
   


What's wrong with gva 0?  It's favoured by exploiters everywhere.


+   r = kvm_write_guest_virt_system(vmcs_gva,
+(void *)&to_vmx(vcpu)->nested.current_vmptr,
+sizeof(u64), vcpu, NULL);
+   if (r) {
   


Check against the X86EMUL return codes.  You'll need to inject a page 
fault on failure.



+   printk(KERN_INFO "%s failed to write vmptr\n", __func__);
+   return 1;
+   }
+   clear_rflags_cf_zf(vcpu);
+   skip_emulated_instruction(vcpu);
+   return 1;
+}
+
   





Re: [PATCH 12/24] Add VMCS fields to the vmcs12

2010-06-14 Thread Avi Kivity

On 06/13/2010 03:28 PM, Nadav Har'El wrote:

In this patch we add to vmcs12 (the VMCS that L1 keeps for L2) all the
standard VMCS fields. These fields are encapsulated in a struct shadow_vmcs.

Later patches will enable L1 to read and write these fields using VMREAD/
VMWRITE, and they will be used during a VMLAUNCH/VMRESUME in preparing a real
VMCS for running L2.


+/* shadow_vmcs is a structure used in nested VMX for holding a copy of all
+ * standard VMCS fields. It is used for emulating a VMCS for L1 (see vmcs12),
+ * and also for easier access to VMCS data (see l1_shadow_vmcs).
+ */
+struct __attribute__ ((__packed__)) shadow_vmcs {
   




+   u32 host_ia32_sysenter_cs;
+   unsigned long cr0_guest_host_mask;
   


I think I counted an odd number of u32 fields, which means the ulong 
fields will be unaligned.  Please add padding to preserve natural alignment.


Have you considered placing often used fields together to reduce cache 
misses?  I'm not sure whether it's worth the effort.




  /*
@@ -139,6 +269,8 @@ struct __attribute__ ((__packed__)) vmcs
u32 revision_id;
u32 abort;

+   struct shadow_vmcs shadow_vmcs;
+
bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
   


That will make it difficult to expand shadow_vmcs in the future.  I 
suggest putting it at the end, and reserving some space in the middle.




+#define OFFSET(x) offsetof(struct shadow_vmcs, x)
+
+static unsigned short vmcs_field_to_offset_table[HOST_RIP+1] = {
   


Can leave the size unspecified (and use ARRAY_SIZE() later on).

The encoding of the indexes is very sparse, so the table will be very 
large.  No need to deal with that now though.





Re: [PATCH 6/24] Implement reading and writing of VMX MSRs

2010-06-14 Thread Avi Kivity

On 06/13/2010 03:25 PM, Nadav Har'El wrote:

When the guest can use VMX instructions (when the nested module option is
on), it should also be able to read and write VMX MSRs, e.g., to query about
VMX capabilities. This patch adds this support.

Signed-off-by: Nadav Har'El n...@il.ibm.com
---
--- .before/arch/x86/kvm/x86.c  2010-06-13 15:01:28.0 +0300
+++ .after/arch/x86/kvm/x86.c   2010-06-13 15:01:28.0 +0300
@@ -702,7 +702,11 @@ static u32 msrs_to_save[] = {
  #ifdef CONFIG_X86_64
MSR_CSTAR, MSR_KERNEL_GS_BASE, MSR_SYSCALL_MASK, MSR_LSTAR,
  #endif
-   MSR_IA32_TSC, MSR_IA32_PERF_STATUS, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA
+   MSR_IA32_TSC, MSR_IA32_PERF_STATUS, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
+   MSR_IA32_FEATURE_CONTROL,  MSR_IA32_VMX_BASIC,
+   MSR_IA32_VMX_PINBASED_CTLS, MSR_IA32_VMX_PROCBASED_CTLS,
+   MSR_IA32_VMX_EXIT_CTLS, MSR_IA32_VMX_ENTRY_CTLS,
+   MSR_IA32_VMX_PROCBASED_CTLS2, MSR_IA32_VMX_EPT_VPID_CAP,
  };
   


These are read only from the guest point of view, but we need write 
support from the host to allow for tuning the features exposed to the guest.



  /*
+ * If we allow our guest to use VMX instructions, we should also let it use
+ * VMX-specific MSRs.
+ */
+static int nested_vmx_get_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
+{
+   u64 vmx_msr = 0;
+   u32 vmx_msr_high, vmx_msr_low;
+
+   switch (msr_index) {
+   case MSR_IA32_FEATURE_CONTROL:
+   *pdata = 0;
+   break;
+   case MSR_IA32_VMX_BASIC:
+   /*
+* This MSR reports some information about VMX support of the
+* processor. We should return information about the VMX we
+* emulate for the guest, and the VMCS structure we give it -
+* not about the VMX support of the underlying hardware.
+* However, some capabilities of the underlying hardware are
+* used directly by our emulation (e.g., the physical address
+* width), so these are copied from what the hardware reports.
+*/
+   *pdata = VMCS12_REVISION |
+   (((u64)sizeof(struct vmcs12)) << 32);
+   rdmsrl(MSR_IA32_VMX_BASIC, vmx_msr);
+#define VMX_BASIC_64   0x0001000000000000LLU
+#define VMX_BASIC_MEM_TYPE 0x003c000000000000LLU
+#define VMX_BASIC_INOUT0x0040000000000000LLU
   


Please move those defines (with longer names) to msr-index.h.


+   *pdata |= vmx_msr &
+   (VMX_BASIC_64 | VMX_BASIC_MEM_TYPE | VMX_BASIC_INOUT);
+   break;
+#define CORE2_PINBASED_CTLS_MUST_BE_ONE  0x0016
+#define MSR_IA32_VMX_TRUE_PINBASED_CTLS  0x48d
+   case MSR_IA32_VMX_TRUE_PINBASED_CTLS:
+   case MSR_IA32_VMX_PINBASED_CTLS:
+   vmx_msr_low  = CORE2_PINBASED_CTLS_MUST_BE_ONE;
+   vmx_msr_high = CORE2_PINBASED_CTLS_MUST_BE_ONE |
+   PIN_BASED_EXT_INTR_MASK |
+   PIN_BASED_NMI_EXITING |
+   PIN_BASED_VIRTUAL_NMIS;
   


IIRC not all processors support PIN_BASED_VIRTUAL_NMIs.  Can we support 
this feature on hosts that don't have it?





[PATCH 0/5] *** SUBJECT HERE ***

2010-06-14 Thread Nicholas A. Bellinger
From: Nicholas Bellinger n...@linux-iscsi.org

Greetings Gerd, Hannes and co,

This series adds initial support for a hw/scsi-bsg.c backstore for scsi-bus
compatible HBA emulation in QEMU-KVM on Linux hosts supporting the BSG driver.
This code is available from the scsi-bsg branch in the megasas/scsi friendly 
QEMU-KVM tree at:

http://git.kernel.org/?p=virt/kvm/nab/qemu-kvm.git;a=shortlog;h=refs/heads/scsi-bsg

Note that this initial code is being posted for review and to see how useful a 
BSG backstore
would be for QEMU-KVM and Linux hosts.  Note that in order for BSG I/O to 
function using vectored
AIO, a kernel patch to linux/block/bsg.c:bsg_map_hdr() is currently required 
when running on a bit-paired
user/kernel environment.  The kernel patch in question is here:

http://marc.info/?l=linux-scsim=127649585524598w=2

The first three patches involve updating block code to support the BSG 
backstore for scsi-bsg.

The fourth patch adds the minor changes to hw/scsi-bus.c and hw/scsi-disk.c in 
order to
function with scsi-bsg.

And the fifth patch adds the main hw/scsi-bsg.c logic necessary to run the new 
struct
SCSIDeviceInfo and for BSG AIO using struct iovec and paio_submit_len() to 
function.

So far this patch series has been tested with a Linux based x86_64 KVM host and 
guest
using the hw/megasas.c 8708EM2 HBA Emulation with TCM_Loop virtual SAS Port 
LUNs.

Comments are welcome,

Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org

Nicholas Bellinger (5):
  [block]: Add top level BSG support
  [block]: Add BSG qemu_open() in block/raw.c:raw_open()
  [block]: Add paio_submit_len() non sector sized AIO
  [scsi]: Add BSG support for scsi-bus and scsi-disk
  [scsi-bsg]: Add initial support for BSG based SCSIDeviceInfo

 Makefile.objs |2 +-
 block.c   |   23 ++-
 block.h   |1 +
 block/raw-posix-aio.h |3 +
 block/raw-posix.c |   17 ++-
 block/raw.c   |   20 ++
 block_int.h   |5 +
 hw/scsi-bsg.c |  588 +
 hw/scsi-bus.c |3 +-
 hw/scsi-disk.c|4 +
 posix-aio-compat.c|   28 +++
 11 files changed, 687 insertions(+), 7 deletions(-)
 create mode 100644 hw/scsi-bsg.c



[PATCH 1/5] [block]: Add top level BSG support

2010-06-14 Thread Nicholas A. Bellinger
From: Nicholas Bellinger n...@linux-iscsi.org

This patch adds top level BSG support to QEMU-KVM block and adds the
BDS_* prefixed defines for SG_IO and BSG.

It adds the BDS_SCSI_GENERIC and BDS_BSG assignments in 
block/raw-posix.c:hdev_open()
using S_ISCHR() and major(st.st_rdev) in order to determine when we are dealing 
with
scsi-generic or scsi-bsg backstores.

It also adds a special case BSG check in find_protocol() using the BSG
major in order to avoid the strchr() for ':' as we expect filenames to
contain /dev/bsg/H:C:T:L.

This patch also adds a struct BlockDriverState->fd to save an opened file 
descriptor
with format_name = 'raw' for use by scsi-bsg.

Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org
---
 block.c   |   23 ---
 block.h   |1 +
 block/raw-posix.c |   17 +++--
 block_int.h   |5 +
 4 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/block.c b/block.c
index 88dbc00..3cd18ec 100644
--- a/block.c
+++ b/block.c
@@ -285,8 +285,12 @@ static BlockDriver *find_protocol(const char *filename)
 {
 BlockDriver *drv1;
 char protocol[128];
-int len;
+int len, bsg = 0;
 const char *p;
+#if defined(__linux__)
+struct stat st;
+#endif
+
 
 /* TODO Drivers without bdrv_file_open must be specified explicitly */
 
@@ -296,7 +300,15 @@ static BlockDriver *find_protocol(const char *filename)
 return bdrv_find_format("file");
 #endif
 p = strchr(filename, ':');
-if (!p) {
+#if defined(__linux__)
+if (stat(filename, &st) < 0)
+return NULL;
+/* This is not yet defined in include/linux/major.h.. */
+if (S_ISCHR(st.st_mode) && major(st.st_rdev) == 254)
+bsg = 1;
+#endif
+
+if (!p || bsg) {
 drv1 = find_hdev_driver(filename);
 if (!drv1) {
 drv1 = bdrv_find_format("file");
@@ -1209,7 +1221,12 @@ int bdrv_is_read_only(BlockDriverState *bs)
 
 int bdrv_is_sg(BlockDriverState *bs)
 {
-return bs->sg;
+return bs->sg == BDS_SCSI_GENERIC;
+}
+
+int bdrv_is_bsg(BlockDriverState *bs)
+{
+return bs->sg == BDS_BSG;
 }
 
 int bdrv_enable_write_cache(BlockDriverState *bs)
diff --git a/block.h b/block.h
index f87d24e..4faeedc 100644
--- a/block.h
+++ b/block.h
@@ -146,6 +146,7 @@ int bdrv_get_translation_hint(BlockDriverState *bs);
 int bdrv_is_removable(BlockDriverState *bs);
 int bdrv_is_read_only(BlockDriverState *bs);
 int bdrv_is_sg(BlockDriverState *bs);
+int bdrv_is_bsg(BlockDriverState *bs);
 int bdrv_enable_write_cache(BlockDriverState *bs);
 int bdrv_is_inserted(BlockDriverState *bs);
 int bdrv_media_changed(BlockDriverState *bs);
diff --git a/block/raw-posix.c b/block/raw-posix.c
index 1515ca9..d349109 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -52,6 +52,7 @@
 #include <sys/ioctl.h>
 #include <linux/cdrom.h>
 #include <linux/fd.h>
+#include <linux/major.h>
 #endif
 #if defined (__FreeBSD__) || defined(__FreeBSD_kernel__)
 #include <signal.h>
@@ -843,6 +844,9 @@ static int hdev_probe_device(const char *filename)
 static int hdev_open(BlockDriverState *bs, const char *filename, int flags)
 {
 BDRVRawState *s = bs->opaque;
+#if defined(__linux__)
+struct stat st;
+#endif
 
 #ifdef CONFIG_COCOA
 if (strstart(filename, "/dev/cdrom", NULL)) {
@@ -873,8 +877,17 @@ static int hdev_open(BlockDriverState *bs, const char 
*filename, int flags)
 
 s->type = FTYPE_FILE;
 #if defined(__linux__)
-if (strstart(filename, "/dev/sg", NULL)) {
-bs->sg = 1;
+if (stat(filename, &st) < 0) {
+printf("stat() failed errno: %d\n", errno);
+return -1;
+}
+if (S_ISCHR(st.st_mode)) {
+if (major(st.st_rdev) == SCSI_GENERIC_MAJOR) {
+bs->sg = BDS_SCSI_GENERIC;
+} else if (major(st.st_rdev) == 254) {
+/* This is not yet defined in include/linux/major.h.. */
+bs->sg = BDS_BSG;
+}
 }
 #endif
 
diff --git a/block_int.h b/block_int.h
index 1a7240c..74bcb1a 100644
--- a/block_int.h
+++ b/block_int.h
@@ -40,6 +40,10 @@
 #define BLOCK_OPT_CLUSTER_SIZE  "cluster_size"
 #define BLOCK_OPT_PREALLOC  "preallocation"
 
+#define BDS_NONE0
+#define BDS_SCSI_GENERIC1
+#define BDS_BSG 2
+
 typedef struct AIOPool {
 void (*cancel)(BlockDriverAIOCB *acb);
 int aiocb_size;
@@ -141,6 +145,7 @@ struct BlockDriverState {
 int encrypted; /* if true, the media is encrypted */
 int valid_key; /* if true, a valid encryption key has been set */
 int sg;/* if true, the device is a /dev/sg* */
+int fd;/* Used for BSG file descriptor */
 /* event callback when inserting/removing */
 void (*change_cb)(void *opaque);
 void *change_opaque;
-- 
1.5.6.5



[PATCH 2/5] [block]: Add BSG qemu_open() in block/raw.c:raw_open()

2010-06-14 Thread Nicholas A. Bellinger
From: Nicholas Bellinger n...@linux-iscsi.org

This patch adds a BSG specific qemu_open() call in block/raw.c:raw_open() that
saves the opened file descriptor for BSG AIO into BlockDriverState-fd.

It also adds the reverse close() call to block/raw.c:raw_close()

Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org
---
 block/raw.c |   20 
 1 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/block/raw.c b/block/raw.c
index 4406b8c..7820c78 100644
--- a/block/raw.c
+++ b/block/raw.c
@@ -5,7 +5,25 @@
 
 static int raw_open(BlockDriverState *bs, int flags)
 {
+int fd, ret;
+
 bs->sg = bs->file->sg;
+/*
+ * scsi-generic and other raw types do not call qemu_open()
+ */
+if (bs->sg != BDS_BSG)
+return 0;
+/*
+ * Obtain a file descriptor for the underlying BSG device for AIO w/ iovecs
+ */
+fd = qemu_open(bs->filename, flags, 0644);
+if (fd < 0) {
+ret = -errno;
+if (ret == -EROFS)
+ret = -EACCES;
+return ret;
+}
+bs->fd = fd;
 return 0;
 }
 
@@ -37,6 +55,8 @@ static BlockDriverAIOCB *raw_aio_writev(BlockDriverState *bs,
 
 static void raw_close(BlockDriverState *bs)
 {
+if (bs->fd > 0)
+close(bs->fd);
 }
 
 static void raw_flush(BlockDriverState *bs)
-- 
1.5.6.5



[PATCH 3/5] [block]: Add paio_submit_len() non sector sized AIO

2010-06-14 Thread Nicholas A. Bellinger
From: Nicholas Bellinger n...@linux-iscsi.org

This patch adds posix-aio-compat.c:paio_submit_len(), which is identical
to paio_submit() except that it expects nb_len instead of nb_sectors (* 512)
so that it can be used by BSG AIO for write()/read() of struct sg_io_v4.

Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org
---
 block/raw-posix-aio.h |3 +++
 posix-aio-compat.c|   28 
 2 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/block/raw-posix-aio.h b/block/raw-posix-aio.h
index dfc63b8..29df842 100644
--- a/block/raw-posix-aio.h
+++ b/block/raw-posix-aio.h
@@ -30,6 +30,9 @@ int paio_init(void);
 BlockDriverAIOCB *paio_submit(BlockDriverState *bs, int fd,
 int64_t sector_num, QEMUIOVector *qiov, int nb_sectors,
 BlockDriverCompletionFunc *cb, void *opaque, int type);
+BlockDriverAIOCB *paio_submit_len(BlockDriverState *bs, int fd,
+int64_t sector_num, QEMUIOVector *qiov, int nb_len,
+BlockDriverCompletionFunc *cb, void *opaque, int type);
 BlockDriverAIOCB *paio_ioctl(BlockDriverState *bs, int fd,
 unsigned long int req, void *buf,
 BlockDriverCompletionFunc *cb, void *opaque);
diff --git a/posix-aio-compat.c b/posix-aio-compat.c
index 272e998..ac9276c 100644
--- a/posix-aio-compat.c
+++ b/posix-aio-compat.c
@@ -585,6 +585,34 @@ BlockDriverAIOCB *paio_submit(BlockDriverState *bs, int fd,
 return &acb->common;
 }
 
+BlockDriverAIOCB *paio_submit_len(BlockDriverState *bs, int fd,
+int64_t sector_num, QEMUIOVector *qiov, int nb_len,
+BlockDriverCompletionFunc *cb, void *opaque, int type)
+{
+struct qemu_paiocb *acb;
+
+acb = qemu_aio_get(&raw_aio_pool, bs, cb, opaque);
+if (!acb)
+return NULL;
+acb->aio_type = type;
+acb->aio_fildes = fd;
+acb->ev_signo = SIGUSR2;
+acb->async_context_id = get_async_context_id();
+
+if (qiov) {
+acb->aio_iov = qiov->iov;
+acb->aio_niov = qiov->niov;
+}
+acb->aio_nbytes = nb_len;
+acb->aio_offset = 0;
+
+acb->next = posix_aio_state->first_aio;
+posix_aio_state->first_aio = acb;
+
+qemu_paio_submit(acb);
+return &acb->common;
+}
+
 BlockDriverAIOCB *paio_ioctl(BlockDriverState *bs, int fd,
 unsigned long int req, void *buf,
 BlockDriverCompletionFunc *cb, void *opaque)
-- 
1.5.6.5



[PATCH 4/5] [scsi]: Add BSG support for scsi-bus and scsi-disk

2010-06-14 Thread Nicholas A. Bellinger
From: Nicholas Bellinger n...@linux-iscsi.org

This patch updates hw/scsi-bus.c:scsi_bus_legacy_add_drive() to check for the 
scsi-bsg backstore.

It also updates hw/scsi-disk.c:scsi_disk_initfn() to check for when 
bdrv_is_bsg() is true and
fail for the fileio backed scsi-disk code.

Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org
---
 hw/scsi-bus.c  |3 ++-
 hw/scsi-disk.c |4 
 2 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/hw/scsi-bus.c b/hw/scsi-bus.c
index 7d80405..b5f5fbb 100644
--- a/hw/scsi-bus.c
+++ b/hw/scsi-bus.c
@@ -90,7 +90,8 @@ SCSIDevice *scsi_bus_legacy_add_drive(SCSIBus *bus, DriveInfo 
*dinfo, int unit)
 const char *driver;
 DeviceState *dev;
 
-driver = bdrv_is_sg(dinfo->bdrv) ? "scsi-generic" : "scsi-disk";
+driver = bdrv_is_bsg(dinfo->bdrv) ? "scsi-bsg" :
+ bdrv_is_sg(dinfo->bdrv) ? "scsi-generic" : "scsi-disk";
 dev = qdev_create(&bus->qbus, driver);
 qdev_prop_set_uint32(dev, "scsi-id", unit);
 qdev_prop_set_drive(dev, "drive", dinfo);
diff --git a/hw/scsi-disk.c b/hw/scsi-disk.c
index fb3c189..290a2e7 100644
--- a/hw/scsi-disk.c
+++ b/hw/scsi-disk.c
@@ -704,6 +704,10 @@ static int scsi_disk_initfn(SCSIDevice *dev)
 error_report("scsi-disk: unwanted /dev/sg*");
 return -1;
 }
+if (bdrv_is_bsg(s->bs)) {
+error_report("scsi-disk: unwanted BSG");
+return -1;
+}
 
 if (bdrv_get_type_hint(s->bs) == BDRV_TYPE_CDROM) {
 s->qdev.blocksize = 2048;
-- 
1.5.6.5



[PATCH 5/5] [scsi-bsg]: Add initial support for BSG based SCSIDeviceInfo

2010-06-14 Thread Nicholas A. Bellinger
From: Nicholas Bellinger n...@linux-iscsi.org

This patch adds initial support for using the Linux BSG interface with
vectored write/read AIO as a QEMU backstore (SCSIDeviceInfo) for
hw/scsi-bus.c compatible HBA emulation.

So far it has been tested with an x86_64 host and guest using hw/megasas.c
and TCM_Loop LLD Port LUNs.  Because this path uses struct iovec for the
struct sg_io_v4->d[out,in]_xferp payloads, it currently requires a patch to
linux/block/bsg.c:bsg_map_hdr() in order to set up the user -> kernel
iovecs.  It also currently works only with matching user/kernel pointer
sizes (e.g. 64-bit user / 64-bit kernel) because of the different pointer
sizes in struct iovec->iov_base.

There are also two FIXMEs in hw/scsi-bsg.c:bsg_generic_initfn() related to
extracting the SCSI LUN and device type values via BSG, as required by
QEMU-KVM.

Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org
---
 Makefile.objs |2 +-
 hw/scsi-bsg.c |  588 +
 2 files changed, 589 insertions(+), 1 deletions(-)
 create mode 100644 hw/scsi-bsg.c

diff --git a/Makefile.objs b/Makefile.objs
index 188d617..c4fcb72 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -197,7 +197,7 @@ hw-obj-$(CONFIG_IDE_CMD646) += ide/cmd646.o
 hw-obj-$(CONFIG_IDE_MACIO) += ide/macio.o
 
 # SCSI layer
-hw-obj-y += scsi-disk.o scsi-generic.o
+hw-obj-y += scsi-disk.o scsi-generic.o scsi-bsg.o
 hw-obj-y += lsi53c895a.o megasas.o
 hw-obj-$(CONFIG_ESP) += esp.o
 
diff --git a/hw/scsi-bsg.c b/hw/scsi-bsg.c
new file mode 100644
index 000..fc76b76
--- /dev/null
+++ b/hw/scsi-bsg.c
@@ -0,0 +1,588 @@
+/*
+ * block layer implementation of the sg v4 interface for Linux hosts
+ *
+ * Copyright (c) 2010 Rising Tide Systems
+ * Written by Nicholas A. Bellinger n...@linux-iscsi.org
+ *
+ * Based on hw/scsi-generic code by Laurent Vivier, Paul Brook, and
+ * Fabrice Bellard
+ *
+ * This code is licensed under the LGPL.
+ */
+
+#include "qemu-common.h"
+#include "qemu-error.h"
+#include "block.h"
+#include "scsi.h"
+#include "dma.h"
+#include "block/raw-posix-aio.h"
+
+#ifdef __linux__
+
+#define DEBUG_BSG
+#undef DEBUG_BSG_IO
+#undef DEBUG_BSG_MAP
+
+#ifdef DEBUG_BSG
+#define DPRINTF(fmt, ...) \
+do { printf("scsi-bsg: " fmt , ## __VA_ARGS__); } while (0)
+#else
+#define DPRINTF(fmt, ...) do {} while (0)
+#endif
+
+#define BADF(fmt, ...) \
+do { fprintf(stderr, "scsi-bsg: " fmt , ## __VA_ARGS__); } while (0)
+
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/epoll.h>
+#include <unistd.h>
+#include <scsi/sg.h>
+#include <linux/bsg.h>
+#include "scsi-defs.h"
+
+#define SCSI_SENSE_BUF_SIZE 96
+
+#define SG_ERR_DRIVER_TIMEOUT 0x06
+#define SG_ERR_DRIVER_SENSE 0x08
+
+#ifndef MAX_UINT
+#define MAX_UINT ((unsigned int)-1)
+#endif
+
+typedef struct SCSIBSGState SCSIBSGState;
+
+typedef struct SCSIBSGReq {
+SCSIRequest req;
+uint8_t *buf;
+int buflen;
+QEMUIOVector iov;
+QEMUIOVector aio_iov;
+struct sg_io_v4 bsg_hdr;
+} SCSIBSGReq;
+
+struct SCSIBSGState {
+SCSIDevice qdev;
+BlockDriverState *bs;
+int lun;
+int driver_status;
+uint8_t sensebuf[SCSI_SENSE_BUF_SIZE];
+uint8_t senselen;
+};
+
+static int bsg_read(int fd, void *p_read, int to_read)
+{
+    int err;
+
+    while (to_read > 0) {
+        err = read(fd, p_read, to_read);
+        if (err >= 0) {
+            to_read -= err;
+            p_read += err;
+        } else if (errno == EINTR) {
+            continue;
+        } else {
+            printf("bsg device %d read failed, errno: %d\n",
+                    fd, errno);
+            return errno;
+        }
+    }
+    return 0;
+}
+
+static SCSIBSGReq *bsg_new_request(SCSIDevice *d, uint32_t tag, uint32_t lun)
+{
+SCSIRequest *req;
+SCSIBSGReq *r;
+
+    req = scsi_req_alloc(sizeof(SCSIBSGReq), d, tag, lun);
+    r = DO_UPCAST(SCSIBSGReq, req, req);
+    qemu_iovec_init(&r->iov, 1);
+    qemu_iovec_init(&r->aio_iov, 1);
+    return r;
+}
+
+static void bsg_remove_request(SCSIBSGReq *r)
+{
+    qemu_free(r->buf);
+    qemu_iovec_destroy(&r->iov);
+    qemu_iovec_destroy(&r->aio_iov);
+    scsi_req_free(&r->req);
+}
+
+static void bsg_command_complete(void *opaque, int ret)
+{
+    SCSIBSGReq *r = (SCSIBSGReq *)opaque;
+    SCSIBSGState *s = DO_UPCAST(SCSIBSGState, qdev, r->req.dev);
+
+    s->driver_status = r->bsg_hdr.driver_status;
+    if (s->driver_status)
+        s->senselen = SCSI_SENSE_BUF_SIZE;
+
+    if (ret != 0) {
+        scsi_req_print(&r->req);
+        fprintf(stderr, "%s: ret %d (%s)\n", __FUNCTION__,
+                ret, strerror(-ret));
+        s->senselen = scsi_build_sense(SENSE_CODE(INVALID_FIELD),
+                s->sensebuf, SCSI_SENSE_BUF_SIZE, 0);
+        s->driver_status = SG_ERR_DRIVER_SENSE;
+        r->req.status = CHECK_CONDITION;
+    } else {
+        if (s->driver_status & SG_ERR_DRIVER_TIMEOUT) {
+            scsi_req_print(&r->req);
+            fprintf(stderr, "%s: timeout\n", __FUNCTION__);
+            r->req.status = BUSY << 1;
+        } else if 

[PATCH 0/5] [QEMU-KVM]: Add BSG backstore using struct sg_io_v4

2010-06-14 Thread Nicholas A. Bellinger
From: Nicholas Bellinger n...@linux-iscsi.org

Quick resend with project subject for cover letter..

Greetings Gerd, Hannes and co,

This series adds initial support for a hw/scsi-bsg.c backstore for scsi-bus
compatible HBA emulation in QEMU-KVM on Linux hosts supporting the BSG driver.
This code is available from the scsi-bsg branch in the megasas/scsi friendly 
QEMU-KVM tree at:

http://git.kernel.org/?p=virt/kvm/nab/qemu-kvm.git;a=shortlog;h=refs/heads/scsi-bsg

Note that this initial code is being posted for review and to gauge how
useful a BSG backstore would be for QEMU-KVM on Linux hosts.  For BSG I/O
to function using vectored AIO, a kernel patch to
linux/block/bsg.c:bsg_map_hdr() is currently required, and the user and
kernel pointer sizes must match.  The kernel patch in question is here:

http://marc.info/?l=linux-scsim=127649585524598w=2

The first three patches update block code to support the BSG backstore for
scsi-bsg.

The fourth patch adds the minor changes to hw/scsi-bus.c and hw/scsi-disk.c
needed to work with scsi-bsg.

And the fifth patch adds the main hw/scsi-bsg.c logic: the new struct
SCSIDeviceInfo, and BSG AIO using struct iovec and paio_submit_len().
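For readers unfamiliar with the sg v4 interface this series builds on, here is a minimal illustrative sketch of filling a struct sg_io_v4 header for a data-out command. Field names come from linux/bsg.h; the helper name and the omitted open()/write() of the /dev/bsg/* node are assumptions, not code from the series:

```c
#include <linux/bsg.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only: describe one SCSI command with a data-out (write)
 * payload via BSG.  Submitting the header to a /dev/bsg/* fd (and the
 * iovec variant this series uses via dout_iovec_count) is omitted. */
static struct sg_io_v4 bsg_fill_hdr(uint8_t *cdb, uint32_t cdb_len,
                                    void *buf, uint32_t buf_len,
                                    uint8_t *sense, uint32_t sense_len)
{
    struct sg_io_v4 hdr;

    memset(&hdr, 0, sizeof(hdr));
    hdr.guard = 'Q';                            /* required magic */
    hdr.protocol = BSG_PROTOCOL_SCSI;
    hdr.subprotocol = BSG_SUB_PROTOCOL_SCSI_CMD;
    hdr.request_len = cdb_len;                  /* the CDB itself */
    hdr.request = (uintptr_t)cdb;
    hdr.dout_xfer_len = buf_len;                /* data-out payload */
    hdr.dout_xferp = (uintptr_t)buf;
    hdr.max_response_len = sense_len;           /* sense buffer */
    hdr.response = (uintptr_t)sense;
    hdr.timeout = 30000;                        /* milliseconds */
    return hdr;
}
```

The series points dout_xferp/din_xferp at a struct iovec array instead of a flat buffer, which is exactly what requires the bsg_map_hdr() kernel patch mentioned above.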

So far this patch series has been tested with a Linux based x86_64 KVM host and 
guest
using the hw/megasas.c 8708EM2 HBA Emulation with TCM_Loop virtual SAS Port 
LUNs.

Comments are welcome,

Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org

Nicholas Bellinger (5):
  [block]: Add top level BSG support
  [block]: Add BSG qemu_open() in block/raw.c:raw_open()
  [block]: Add paio_submit_len() non sector sized AIO
  [scsi]: Add BSG support for scsi-bus and scsi-disk
  [scsi-bsg]: Add initial support for BSG based SCSIDeviceInfo

 Makefile.objs |2 +-
 block.c   |   23 ++-
 block.h   |1 +
 block/raw-posix-aio.h |3 +
 block/raw-posix.c |   17 ++-
 block/raw.c   |   20 ++
 block_int.h   |5 +
 hw/scsi-bsg.c |  588 +
 hw/scsi-bus.c |3 +-
 hw/scsi-disk.c|4 +
 posix-aio-compat.c|   28 +++
 11 files changed, 687 insertions(+), 7 deletions(-)
 create mode 100644 hw/scsi-bsg.c



Re: [PATCH 14/24] Prepare vmcs02 from vmcs01 and vmcs12

2010-06-14 Thread Avi Kivity

On 06/13/2010 03:29 PM, Nadav Har'El wrote:

This patch contains code to prepare the VMCS which can be used to actually
run the L2 guest, vmcs02. prepare_vmcs02 appropriately merges the information
in shadow_vmcs that L1 built for L2 (vmcs12), and that in the VMCS that we
built for L1 (vmcs01).

VMREAD/WRITE can only access one VMCS at a time (the current VMCS), which
makes it difficult for us to read from vmcs01 while writing to vmcs12. This
is why we first make a copy of vmcs01 in memory (l1_shadow_vmcs) and then
read that memory copy while writing to vmcs12.

Signed-off-by: Nadav Har'Eln...@il.ibm.com
---
--- .before/arch/x86/kvm/vmx.c  2010-06-13 15:01:29.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2010-06-13 15:01:29.0 +0300
@@ -849,6 +849,36 @@ static inline bool report_flexpriority(v
return flexpriority_enabled;
  }

+static inline bool nested_cpu_has_vmx_tpr_shadow(struct kvm_vcpu *vcpu)
+{
+	return cpu_has_vmx_tpr_shadow() &&
+		get_shadow_vmcs(vcpu)->cpu_based_vm_exec_control &
+		CPU_BASED_TPR_SHADOW;
   


Operator precedence is with you, but the line width limit is not.  Use 
parentheses to improve readability.



@@ -1292,6 +1322,39 @@ static void vmx_load_host_state(struct v
preempt_enable();
  }

+int load_vmcs_host_state(struct shadow_vmcs *src)
+{
+	vmcs_write16(HOST_ES_SELECTOR, src->host_es_selector);
+	vmcs_write16(HOST_CS_SELECTOR, src->host_cs_selector);
+	vmcs_write16(HOST_SS_SELECTOR, src->host_ss_selector);
+	vmcs_write16(HOST_DS_SELECTOR, src->host_ds_selector);
+	vmcs_write16(HOST_FS_SELECTOR, src->host_fs_selector);
+	vmcs_write16(HOST_GS_SELECTOR, src->host_gs_selector);
+	vmcs_write16(HOST_TR_SELECTOR, src->host_tr_selector);
   


Why do we need to go through a shadow_vmcs for host fields?  Instead of 
cloning a vmcs, you can call a common init routing to initialize the 
host fields.



+
+	vmcs_write64(TSC_OFFSET, src->tsc_offset);
   


Don't you need to adjust for our TSC_OFFSET?


+/* prepare_vmcs_02 is called when the L1 guest hypervisor runs its nested
+ * L2 guest. L1 has a vmcs for L2 (vmcs12), and this function merges it
+ * with L0's wishes for its guest (vmcs01), so we can run the L2 guest in a
+ * way that will be appropriate both to L1's requests and to our needs.
+ */
+int prepare_vmcs_02(struct kvm_vcpu *vcpu,
+   struct shadow_vmcs *vmcs12, struct shadow_vmcs *vmcs01)
+{
+   u32 exec_control;
+
+   load_vmcs_common(vmcs12);
+
+	vmcs_write64(VMCS_LINK_POINTER, vmcs12->vmcs_link_pointer);
   


Not sure about this.  We don't use the vmcs link pointer, it's better to 
keep it at its default of -1ull.



+	vmcs_write64(IO_BITMAP_A, vmcs01->io_bitmap_a);
+	vmcs_write64(IO_BITMAP_B, vmcs01->io_bitmap_b);
   


I guess merging the io bitmaps doesn't make much sense, at least at this 
stage.



+   if (cpu_has_vmx_msr_bitmap())
+		vmcs_write64(MSR_BITMAP, vmcs01->msr_bitmap);
   


However, merging the msr bitmaps is critical.  Perhaps you do it in a 
later patch.



+
+	if (vmcs12->vm_entry_msr_load_count > 0 ||
+	    vmcs12->vm_exit_msr_load_count > 0 ||
+	    vmcs12->vm_exit_msr_store_count > 0) {
+		printk(KERN_WARNING
+			"%s: VMCS MSR_{LOAD,STORE} unsupported\n", __func__);
   


Unfortunate, since kvm has started to use this feature.

For all unsupported mandatory features, we need reporting that is always 
enabled (i.e. no dprintk()), but to avoid flooding dmesg, use 
printk_ratelimit() or report just once.  Also, it's better to kill the 
guest (KVM_REQ_TRIPLE_FAULT, or a VM instruction error) than to let it 
continue incorrectly.



+   }
+
+   if (nested_cpu_has_vmx_tpr_shadow(vcpu)) {
+		struct page *page =
+			nested_get_page(vcpu, vmcs12->virtual_apic_page_addr);
+   if (!page)
+   return 1;
   


?


+   vmcs_write64(VIRTUAL_APIC_PAGE_ADDR, page_to_phys(page));
+   kvm_release_page_clean(page);
   


Hm.  The host can move this page around.  If it happens not to be mapped 
into the guest, we won't get any notification.  So we need to keep a 
reference to this page, or else force an exit from the mmu notifier 
callbacks if it is removed by the host.



+   }
+
+   if (nested_vm_need_virtualize_apic_accesses(vcpu)) {
+		struct page *page =
+			nested_get_page(vcpu, vmcs12->apic_access_addr);
+   if (!page)
+   return 1;
+   vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(page));
+   kvm_release_page_clean(page);
+   }
   


Ditto.


+
+   vmcs_write32(PIN_BASED_VM_EXEC_CONTROL,
+		     (vmcs01->pin_based_vm_exec_control |
+		      vmcs12->pin_based_vm_exec_control));
   


We don't really want the guest's pin-based controls to influence 

Re: [PATCH 0/3]: Fixes to IRQ routing

2010-06-14 Thread Gleb Natapov
On Thu, Jun 10, 2010 at 04:44:04PM -0400, Chris Lalancette wrote:
 As we've discussed previously, here is a series of patches to
 fix some of the IRQ routing issues we have in KVM.  With this series
 in place I was able to successfully kdump a RHEL-5 64-bit, and RHEL-6
 32- and 64-bit guest on CPUs other than the BSP.  RHEL-5 32-bit kdump still
 does not work; it gets stuck on "Checking 'hlt' instruction".  However,
 it does that both before and after this series, so there is something
 else going on there that I still have to debug.
 
 I also need to change the kvm_migrate_pit_timer function to migrate the
 timer over to the last CPU that handled the timer interrupt, on the
 theory that that particular CPU is likely to handle the timer interrupt again
 in the near future.
 
 Changes since RFC:
  - Changed ps->inject_lock from raw_spinlock_t to spinlock_t
  - Fixed up some formatting issues
  - Changed to have one PIT workqueue per-guest
  - Remember to cancel_work_sync when destroying the PIT
 
 Chris Lalancette
 
Looks good to me.

--
Gleb.


Re: [PATCH 16/24] Implement VMLAUNCH and VMRESUME

2010-06-14 Thread Avi Kivity

On 06/13/2010 03:30 PM, Nadav Har'El wrote:

Implement the VMLAUNCH and VMRESUME instructions, allowing a guest
hypervisor to run its own guests.

Signed-off-by: Nadav Har'Eln...@il.ibm.com
---
--- .before/arch/x86/kvm/vmx.c  2010-06-13 15:01:29.0 +0300
+++ .after/arch/x86/kvm/vmx.c   2010-06-13 15:01:29.0 +0300
@@ -272,6 +272,9 @@ struct __attribute__ ((__packed__)) vmcs
struct shadow_vmcs shadow_vmcs;

bool launch_state; /* set to 0 by VMCLEAR, to 1 by VMLAUNCH */
+
+   int cpu;
   


Not sure cpu should be here.  It certainly won't survive live 
migration.  Perhaps in struct vmcs_list (which should be renamed, 
perhaps struct cached_vmcs).



+   int launched;
  };
   


What's the difference between this and launch_state?



  struct vmcs_list {
@@ -297,6 +300,24 @@ struct nested_vmx {
/* list of real (hardware) VMCS, one for each L2 guest of L1 */
struct list_head l2_vmcs_list; /* a vmcs_list */
int l2_vmcs_num;
+
+   /* Are we running a nested guest now */
+   bool nested_mode;
+   /* Level 1 state for switching to level 2 and back */
+   struct  {
+   u64 efer;
+   unsigned long cr3;
+   unsigned long cr4;
+   u64 io_bitmap_a;
+   u64 io_bitmap_b;
+   u64 msr_bitmap;
+   int cpu;
+   int launched;
+   } l1_state;
   


This state needs save/restore support (as well as the current vmptr and 
vmxon state).



+   /* Level 1 shadow vmcs for switching to level 2 and back */
+   struct shadow_vmcs *l1_shadow_vmcs;
   


Again, not really happy about shadowing the non-nested vmcs.


+   /* Level 1 vmcs loaded into the processor */
+   struct vmcs *l1_vmcs;
  };

  enum vmcs_field_type {
@@ -1407,6 +1428,19 @@ static void vmx_vcpu_load(struct kvm_vcp
new_offset = vmcs_read64(TSC_OFFSET) + delta;
vmcs_write64(TSC_OFFSET, new_offset);
}
+
+	if (vmx->nested.l1_shadow_vmcs != NULL) {
+		struct shadow_vmcs *l1svmcs =
+			vmx->nested.l1_shadow_vmcs;
+		l1svmcs->host_tr_base = vmcs_readl(HOST_TR_BASE);
+		l1svmcs->host_gdtr_base = vmcs_readl(HOST_GDTR_BASE);
+		l1svmcs->host_ia32_sysenter_esp =
+			vmcs_readl(HOST_IA32_SYSENTER_ESP);
   


These are all static (at least on a single cpu).  No need to read them 
from a vmcs.



+		if (tsc_this < vcpu->arch.host_tsc)
+			l1svmcs->tsc_offset = vmcs_read64(TSC_OFFSET);
+		if (vmx->nested.nested_mode)
+			load_vmcs_host_state(l1svmcs);
+   }
}
  }


@@ -4348,6 +4392,42 @@ static int handle_vmclear(struct kvm_vcp
return 1;
  }

+static int nested_vmx_run(struct kvm_vcpu *vcpu);
+
+static int handle_launch_or_resume(struct kvm_vcpu *vcpu, bool launch)
+{
+   if (!nested_vmx_check_permission(vcpu))
+   return 1;
+
+   if (!nested_map_current(vcpu))
+   return 1;
   


Better error handling needed, perhaps triple fault.


+	if (to_vmx(vcpu)->nested.current_l2_page->launch_state == launch) {
+   /* Must use VMLAUNCH for the first time, VMRESUME later */
+   set_rflags_to_vmx_fail_valid(vcpu);
+   nested_unmap_current(vcpu);
   


skip_emulated_instruction();


+   return 1;
+   }
+   nested_unmap_current(vcpu);
+
+   skip_emulated_instruction(vcpu);
+
+   nested_vmx_run(vcpu);
+   return 1;
+}

@@ -4958,7 +5038,8 @@ static int vmx_handle_exit(struct kvm_vc
   (0x%x) and exit reason is 0x%x\n,
   __func__, vectoring_info, exit_reason);

-	if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked)) {
+	if (!vmx->nested.nested_mode &&
+	    unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked)) {
   


Too much indent.  the unlikely() looks like the first statement of the 
block.


I think it isn't enough to check for nested mode.  If the guest hasn't 
enabled virtual NMIs, then the nested guest should behave exactly like 
the guest.




+static int nested_vmx_run(struct kvm_vcpu *vcpu)
+{
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	vmx->nested.nested_mode = 1;
   


true


+   sync_cached_regs_to_vmcs(vcpu);
+	save_vmcs(vmx->nested.l1_shadow_vmcs);
+
+	vmx->nested.l1_state.efer = vcpu->arch.efer;
   


Not sure why you need to save efer.  Ordinarily, vmx reconstructs it 
from the guest efer and the host size exit control, you can do the same.



+   if (!enable_ept)
+		vmx->nested.l1_state.cr3 = vcpu->arch.cr3;
   


Ditto, isn't that HOST_CR3?


+	vmx->nested.l1_state.cr4 = vcpu->arch.cr4;
   


Ditto.


+
+   if (!nested_map_current(vcpu)) {

Re: [PATCH 3/5] [block]: Add paio_submit_len() non sector sized AIO

2010-06-14 Thread Christoph Hellwig
On Mon, Jun 14, 2010 at 02:44:31AM -0700, Nicholas A. Bellinger wrote:
 From: Nicholas Bellinger n...@linux-iscsi.org
 
 This patch adds posix-aio-compat.c:paio_submit_len(), which is identical
 to paio_submit() except that it expects nb_len instead of nb_sectors (* 512)
 so that it can be used by BSG AIO for write()/read() of struct sg_io_v4.

Just remove the nb_sectors argument; we already get the length passed
in the iovec.
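Christoph's point — that the byte count is already implied by the iovec — can be sketched as follows (an illustrative helper, not code from the patch):

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* The total transfer length is just the sum of the iovec element
 * lengths, so a separate nb_len/nb_sectors argument carries no
 * information the caller has not already supplied. */
static size_t iov_total_len(const struct iovec *iov, int niov)
{
    size_t total = 0;
    int i;

    for (i = 0; i < niov; i++)
        total += iov[i].iov_len;
    return total;
}
```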



Re: [PATCH 3/5] [block]: Add paio_submit_len() non sector sized AIO

2010-06-14 Thread Nicholas A. Bellinger
On Mon, 2010-06-14 at 07:42 -0400, Christoph Hellwig wrote:
 On Mon, Jun 14, 2010 at 02:44:31AM -0700, Nicholas A. Bellinger wrote:
  From: Nicholas Bellinger n...@linux-iscsi.org
  
  This patch adds posix-aio-compat.c:paio_submit_len(), which is identical
  to paio_submit() except that it expects nb_len instead of nb_sectors (* 512)
  so that it can be used by BSG AIO for write()/read() of struct sg_io_v4.
 
 Just remove the nb_sectors argument; we already get the length passed
 in the iovec.
 

Mmmm, should this stay in a separate paio_submit_len() function..?  Or
would it be preferred to calculate struct qemu_paiocb->aio_nbytes from
the passed iov[i].iov_len for the BlockDriverState->sg == BDS_BSG
case..?

Best,

--nab



Re: [PATCH 19/24] Deciding if L0 or L1 should handle an L2 exit

2010-06-14 Thread Avi Kivity

On 06/13/2010 03:32 PM, Nadav Har'El wrote:

This patch contains the logic of whether an L2 exit should be handled by L0
and then L2 should be resumed, or whether L1 should be run to handle this
exit (using the nested_vmx_vmexit() function of the previous patch).

The basic idea is to let L1 handle the exit only if it actually asked to
trap this sort of event. For example, when L2 exits on a change to CR0,
we check L1's CR0_GUEST_HOST_MASK to see if L1 expressed interest in any
bit which changed; if it did, we exit to L1. But if it didn't, it means that
it is we (L0) that wished to trap this event, so we handle it ourselves.

The next two patches add additional logic of what to do when an interrupt or
exception is injected: Does L0 need to do it, should we exit to L1 to do it,
or should we resume L2 and keep the exception to be injected later.

We keep a new flag, nested_run_pending, which can override the decision of
which should run next, L1 or L2. nested_run_pending=1 means that we *must* run
L2 next, not L1. This is necessary in several situations where had L1 run on
bare metal it would not have expected to be resumed at this stage. One
example is when L1 did a VMLAUNCH of L2 and therefore expects L2 to be run.
Another example is when L2 exits on an #NM exception that L0 asked for
(because of lazy FPU loading), and L0 must deal with the exception and resume
L2 which was in a middle of an instruction, and not resume L1 which does not
expect to see an exit from L2 at this point. nested_run_pending is especially
intended to avoid switching to L1 in the injection decision-point described
above.

@@ -3819,6 +3841,8 @@ static int handle_exception(struct kvm_v

if (is_no_device(intr_info)) {
vmx_fpu_activate(vcpu);
+   if (vmx-nested.nested_mode)
+   vmx-nested.nested_run_pending = 1;
return 1;
}
   


Isn't this true for many other exceptions?  #UD which we emulate (but 
the guest doesn't trap), page faults which we handle completely...




+
+/* Return 1 if we should exit from L2 to L1 to handle a CR access exit,
+ * rather than handle it ourselves in L0. I.e., check if L1 wanted to
+ * intercept (via guest_host_mask etc.) the current event.
+ */
+static bool nested_vmx_exit_handled_cr(struct kvm_vcpu *vcpu,
+	struct shadow_vmcs *l2svmcs)
+{
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	int cr = exit_qualification & 15;
+	int reg = (exit_qualification >> 8) & 15;
+	unsigned long val = kvm_register_read(vcpu, reg);
+
+	switch ((exit_qualification >> 4) & 3) {
+	case 0: /* mov to cr */
+		switch (cr) {
+		case 0:
+			if (l2svmcs->cr0_guest_host_mask &
+			    (val ^ l2svmcs->cr0_read_shadow))
+				return 1;
+			break;
+		case 3:
+			if (l2svmcs->cpu_based_vm_exec_control &
+			    CPU_BASED_CR3_LOAD_EXITING)
+				return 1;
+			break;
+		case 4:
+			if (l2svmcs->cr4_guest_host_mask &
+			    (l2svmcs->cr4_read_shadow ^ val))
+				return 1;
+			break;
+		case 8:
+			if (l2svmcs->cpu_based_vm_exec_control &
+			    CPU_BASED_CR8_LOAD_EXITING)
+				return 1;
   


Should check TPR threshold here too if enabled.



+	case 3: /* lmsw */
+		if (l2svmcs->cr0_guest_host_mask &
+		    (val ^ l2svmcs->cr0_read_shadow))
+			return 1;
   


Need to mask off bit 0 (cr0.pe) of val, since lmsw can't clear it.


+   break;
+   }
+   return 0;
+}
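Avi's lmsw observation can be sketched as a standalone test of the masked comparison. The helper and its names are illustrative, not from the patch; the point is that lmsw loads only CR0[3:0] and can set, but never clear, CR0.PE:

```c
#include <assert.h>

#define X86_CR0_PE 0x1UL

/* Decide whether an lmsw in L2 touches a CR0 bit that L1 wants to
 * intercept.  val is masked to the 4 bits lmsw actually loads, and PE
 * is forced to stay set, since lmsw cannot clear it. */
static int lmsw_needs_l1_exit(unsigned long cr0_guest_host_mask,
                              unsigned long cr0_read_shadow,
                              unsigned long val)
{
    val &= 0xf;                       /* lmsw writes CR0[3:0] only */
    if (cr0_read_shadow & X86_CR0_PE)
        val |= X86_CR0_PE;            /* PE cannot be cleared by lmsw */
    return (cr0_guest_host_mask & (val ^ cr0_read_shadow)) != 0;
}
```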
+
+/* Return 1 if we should exit from L2 to L1 to handle an exit, or 0 if we
+ * should handle it ourselves in L0. Only call this when in nested_mode (L2).
+ */
+static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu, bool afterexit)
+{
+   u32 exit_code = vmcs_read32(VM_EXIT_REASON);
+   struct vcpu_vmx *vmx = to_vmx(vcpu);
+   u32 intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
+   struct shadow_vmcs *l2svmcs;
+   int r = 0;
+
+	if (vmx->nested.nested_run_pending)
+   return 0;
+
+	if (unlikely(vmx->fail)) {
+		printk(KERN_INFO "%s failed vm entry %x\n",
+		       __func__, vmcs_read32(VM_INSTRUCTION_ERROR));
+   return 1;
+   }
+
+   if (afterexit) {
+   /* There are some cases where we should let L1 handle certain
+* events when these are injected (afterexit==0) but we should
+* handle them in L0 on an exit (afterexit==1).
+*/
+   switch (exit_code) {
+   case EXIT_REASON_EXTERNAL_INTERRUPT:
+   return 0;
+

Re: [PATCH 20/24] Correct handling of interrupt injection

2010-06-14 Thread Avi Kivity

On 06/13/2010 03:32 PM, Nadav Har'El wrote:

When KVM wants to inject an interrupt, the guest should think a real interrupt
has happened. Normally (in the non-nested case) this means checking that the
guest doesn't block interrupts (and if it does, inject when it doesn't - using
the interrupt window VMX mechanism), and setting up the appropriate VMCS
fields for the guest to receive the interrupt.

However, when we are running a nested guest (L2) and its hypervisor (L1)
requested exits on interrupts (as most hypervisors do), the most efficient
thing to do is to exit L2, telling L1 that the exit was caused by an
interrupt, the one we were injecting; Only when L1 asked not to be notified
of interrupts, we should inject it directly to the running guest L2 (i.e.,
the normal code path).

However, properly doing what is described above requires invasive changes to
the flow of the existing code, which we elected not to do in this stage.
Instead we do something more simplistic and less efficient: we modify
vmx_interrupt_allowed(), which kvm calls to see if it can inject the interrupt
now, to exit from L2 to L1 before continuing the normal code. The normal kvm
code then notices that L1 is blocking interrupts, and sets the interrupt
window to inject the interrupt later to L1. Shortly after, L1 gets the
interrupt while it is itself running, not as an exit from L2. The cost is an
extra L1 exit (the interrupt window).

   


That's a little sad.



cpu_based_vm_exec_control = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
cpu_based_vm_exec_control |= CPU_BASED_VIRTUAL_INTR_PENDING;
@@ -3718,6 +3738,13 @@ static int nested_vmx_vmexit(struct kvm_

  static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu)
  {
+	if (to_vmx(vcpu)->nested.nested_mode && nested_exit_on_intr(vcpu)) {
+		if (to_vmx(vcpu)->nested.nested_run_pending)
+   return 0;
+   nested_vmx_vmexit(vcpu, true);
+   /* fall through to normal code, but now in L1, not L2 */
+   }
+
   


What exit is reported here?

--
error compiling committee.c: too many arguments to function



Re: [PATCH 0/24] Nested VMX, v5

2010-06-14 Thread Avi Kivity

On 06/13/2010 03:22 PM, Nadav Har'El wrote:

Hi Avi,

This is a followup of our nested VMX patches that Orit Wasserman posted in
December. We've addressed most of the comments and concerns that you and
others on the mailing list had with the previous patch set. We hope you'll
find these patches easier to understand, and suitable for applying to KVM.


The following 24 patches implement nested VMX support. The patches enable a
guest to use the VMX APIs in order to run its own nested guests. I.e., it
allows running hypervisors (that use VMX) under KVM. We describe the theory
behind this work, our implementation, and its performance characteristics,
in IBM Research report H-0282, The Turtles Project: Design and Implementation
of Nested Virtualization, available at:

http://bit.ly/a0o9te

The current patches support running Linux under a nested KVM using shadow
page table (with bypass_guest_pf disabled). They support multiple nested
hypervisors, which can run multiple guests. Only 64-bit nested hypervisors
are supported. SMP is supported. Additional patches for running Windows under
nested KVM, and Linux under nested VMware server, and support for nested EPT,
are currently running in the lab, and will be sent as follow-on patchsets.

These patches were written by:
  Abel Gordon, abelg at il.ibm.com
  Nadav Har'El, nyh at il.ibm.com
  Orit Wasserman, oritw at il.ibm.com
  Ben-Ami Yassour, benami at il.ibm.com
  Muli Ben-Yehuda, muli at il.ibm.com

With contributions by:
  Anthony Liguori, aliguori at us.ibm.com
  Mike Day, mdday at us.ibm.com

This work was inspired by the nested SVM support by Alexander Graf and Joerg
Roedel.


Changes since v4:
* Rebased to the current KVM tree.
* Support for lazy FPU loading.
* Implemented about 90 requests and suggestions made on the mailing list
   regarding the previous version of this patch set.
* Split the changes into many more, and better documented, patches.

   


Overall, very nice.  The finer split and better documentation really 
help reviewing, thanks.


Let's try to get this merged quickly.

--
error compiling committee.c: too many arguments to function



Re: [PATCH 5/24] Introduce vmcs12: a VMCS structure for L1

2010-06-14 Thread Avi Kivity

On 06/14/2010 11:49 AM, Nadav Har'El wrote:

On Mon, Jun 14, 2010, Avi Kivity wrote about Re: [PATCH 5/24] Introduce vmcs12: a 
VMCS structure for L1:
   

On 06/13/2010 03:25 PM, Nadav Har'El wrote:
 

+#define VMCS12_REVISION 0x11e57ed0

   

Where did this number come from?  It's not from real hardware, yes?
 

Basically, we are presenting emulated VMX for the L1 guest, complete with
its own VMCS structure. This structure needs to have some VMCS revision id,
which should be an arbitrary number that we invent - it is not related to any
revision id that any real hardware uses. If you look closely, you can see that
the number I used is leetspeak for Nested0 ;-)
   


Will have to brush up on my leetspeak, I see.


As you can see in the following patches, MSR_IA32_VMX_BASIC will return this
arbitrary VMCS revision id, and VMPTRLD will verify that the VMCS region
that L1 is trying to load contains this revision id.
   


Ok good, I was worried this was a real hardware ID.

--
error compiling committee.c: too many arguments to function



Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-14 Thread Avi Kivity

On 06/14/2010 11:48 AM, Balbir Singh wrote:


In this case the order is as follows

1. First we pick free pages if any
2. If we don't have free pages, we go after unmapped page cache and
slab cache
3. If that fails as well, we go after regular memory

In the scenario that you describe, we'll not be able to easily free up
the frequently referenced page from /etc/*. The code will move on to
step 3 and do its regular reclaim.
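The three-step ordering Balbir describes could be sketched like this (all names are hypothetical illustrations, not code from the patch):

```c
#include <assert.h>

/* Hypothetical sketch of the reclaim-source ordering described above:
 * free pages first, then unmapped page/slab cache, then regular
 * (mapped-page) reclaim as the fallback. */
enum reclaim_source { RECLAIM_FREE, RECLAIM_UNMAPPED_CACHE, RECLAIM_MAPPED };

static enum reclaim_source pick_reclaim_source(unsigned long nr_free,
                                               unsigned long nr_unmapped_cache)
{
    if (nr_free > 0)
        return RECLAIM_FREE;              /* 1. pick free pages if any */
    if (nr_unmapped_cache > 0)
        return RECLAIM_UNMAPPED_CACHE;    /* 2. unmapped page/slab cache */
    return RECLAIM_MAPPED;                /* 3. regular reclaim */
}
```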
   

Still it seems to me you are subverting the normal order of reclaim.
I don't see why an unmapped page cache or slab cache item should be
evicted before a mapped page.  Certainly the cost of rebuilding a
dentry compared to the gain from evicting it, is much higher than
that of reestablishing a mapped page.

 

Subverting to avoid memory duplication; the word subverting is
overloaded,


Right, should have used a different one.


let me try to reason a bit. First let me explain the
problem

Memory is a precious resource in a consolidated environment.
We don't want to waste memory via page cache duplication
(cache=writethrough and cache=writeback mode).

Now here is what we are trying to do

1. A slab page will not be freed until the entire page is free (all
slabs have been kfree'd so to speak). Normal reclaim will definitely
free this page, but a lot of it depends on how frequently we are
scanning the LRU list and when this page got added.
2. In the case of page cache (specifically unmapped page cache), there
is duplication already, so why not go after unmapped page caches when
the system is under memory pressure?

In the case of 1, we don't force a dentry to be freed, but rather a
freed page in the slab cache to be reclaimed ahead of forcing reclaim
of mapped pages.
   


Sounds like this should be done unconditionally, then.  An empty slab 
page is worth less than an unmapped pagecache page at all times, no?



Does the problem statement make sense? If so, do you agree with 1 and
2? Is there major concern about subverting regular reclaim? Does
subverting it make sense in the duplicated scenario?

   


In the case of 2, how do you know there is duplication?  You know the 
guest caches the page, but you have no information about the host.  
Since the page is cached in the guest, the host doesn't see it 
referenced, and is likely to drop it.


If there is no duplication, then you may have dropped a recently-used 
page and will likely cause a major fault soon.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 20/24] Correct handling of interrupt injection

2010-06-14 Thread Avi Kivity

On 06/14/2010 03:29 PM, Avi Kivity wrote:

On 06/13/2010 03:32 PM, Nadav Har'El wrote:
When KVM wants to inject an interrupt, the guest should think a real
interrupt has happened. Normally (in the non-nested case) this means
checking that the guest doesn't block interrupts (and if it does, inject
when it doesn't - using the interrupt window VMX mechanism), and setting
up the appropriate VMCS fields for the guest to receive the interrupt.

However, when we are running a nested guest (L2) and its hypervisor (L1)
requested exits on interrupts (as most hypervisors do), the most efficient
thing to do is to exit L2, telling L1 that the exit was caused by an
interrupt, the one we were injecting; only when L1 asked not to be notified
of interrupts should we inject it directly into the running guest L2 (i.e.,
the normal code path).

However, properly doing what is described above requires invasive changes
to the flow of the existing code, which we elected not to do at this stage.
Instead we do something more simplistic and less efficient: we modify
vmx_interrupt_allowed(), which kvm calls to see if it can inject the
interrupt now, to exit from L2 to L1 before continuing the normal code.
The normal kvm code then notices that L1 is blocking interrupts, and sets
the interrupt window to inject the interrupt later to L1. Shortly after,
L1 gets the interrupt while it is itself running, not as an exit from L2.
The cost is an extra L1 exit (the interrupt window).
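The injection decision described above reduces to a small truth table. The sketch below is a standalone toy model, not the actual KVM code; the names `decide_injection` and `inject_target` are invented for illustration (the real logic lives in vmx_interrupt_allowed() and the interrupt-window handling).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the decision sketched in the patch description above.
 * All names are invented for illustration only. */
typedef enum { INJECT_TO_GUEST, EXIT_TO_L1 } inject_target;

static inject_target decide_injection(bool nested_guest_running,
                                      bool l1_wants_intr_exits)
{
    /* Running L2 while L1 asked to exit on external interrupts:
     * exit to L1 and let it observe the interrupt as an exit reason. */
    if (nested_guest_running && l1_wants_intr_exits)
        return EXIT_TO_L1;
    /* Otherwise inject directly into the running guest (normal path). */
    return INJECT_TO_GUEST;
}
```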



That's a little sad.


It can also be broken if the guest chooses to keep interrupts disabled 
during exits, and instead ask vmx to ack interrupts.  The guest can then 
vmread the vector number and dispatch the interrupt itself.


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-14 Thread Balbir Singh
* Avi Kivity a...@redhat.com [2010-06-14 15:40:28]:

 On 06/14/2010 11:48 AM, Balbir Singh wrote:
 
 In this case the order is as follows
 
 1. First we pick free pages if any
 2. If we don't have free pages, we go after unmapped page cache and
 slab cache
 3. If that fails as well, we go after regular memory
 
 In the scenario that you describe, we'll not be able to easily free up
 the frequently referenced page from /etc/*. The code will move on to
 step 3 and do its regular reclaim.
 Still it seems to me you are subverting the normal order of reclaim.
 I don't see why an unmapped page cache or slab cache item should be
 evicted before a mapped page.  Certainly the cost of rebuilding a
 dentry compared to the gain from evicting it, is much higher than
 that of reestablishing a mapped page.
 
 Subverting to avoid memory duplication; the word subverting is
 overloaded,
 
 Right, should have used a different one.
 
 let me try to reason a bit. First let me explain the
 problem
 
 Memory is a precious resource in a consolidated environment.
 We don't want to waste memory via page cache duplication
 (cache=writethrough and cache=writeback mode).
 
 Now here is what we are trying to do
 
 1. A slab page will not be freed until the entire page is free (all
 slabs have been kfree'd so to speak). Normal reclaim will definitely
 free this page, but a lot of it depends on how frequently we are
 scanning the LRU list and when this page got added.
 2. In the case of page cache (specifically unmapped page cache), there
 is duplication already, so why not go after unmapped page caches when
 the system is under memory pressure?
 
 In the case of 1, we don't force a dentry to be freed, but rather a
 freed page in the slab cache to be reclaimed ahead of forcing reclaim
 of mapped pages.
 
 Sounds like this should be done unconditionally, then.  An empty
 slab page is worth less than an unmapped pagecache page at all
 times, no?


In a consolidated environment, even at the cost of some CPU to run
shrinkers, I think potentially yes.
 
 Does the problem statement make sense? If so, do you agree with 1 and
 2? Is there major concern about subverting regular reclaim? Does
 subverting it make sense in the duplicated scenario?
 
 
 In the case of 2, how do you know there is duplication?  You know
 the guest caches the page, but you have no information about the
 host.  Since the page is cached in the guest, the host doesn't see
 it referenced, and is likely to drop it.

True, that is why the first patch is controlled via a boot parameter
that the host can pass. For the second patch, I think we'll need
something like a `balloon <size> <cache>` command, with the cache argument being
optional. 

 
 If there is no duplication, then you may have dropped a
 recently-used page and will likely cause a major fault soon.


Yes, agreed. 
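The three-step reclaim order discussed above can be modelled as a toy selector. This is purely illustrative; it bears no relation to the actual kernel reclaim code, and the function name is invented.

```c
#include <assert.h>
#include <string.h>

/* Toy model of the proposed reclaim order:
 * 1. free pages first, 2. unmapped page cache / slab, 3. regular reclaim.
 * Illustrative only; not the kernel implementation. */
static const char *pick_reclaim_source(int free_pages,
                                       int unmapped_or_slab_pages)
{
    if (free_pages > 0)
        return "free";              /* step 1: free pages, if any */
    if (unmapped_or_slab_pages > 0)
        return "unmapped/slab";     /* step 2: unmapped page cache and slab */
    return "regular";               /* step 3: fall back to regular reclaim */
}
```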

-- 
Three Cheers,
Balbir
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC PATCH 1/5] qdev: Create qdev_get_dev_path()

2010-06-14 Thread Alex Williamson
On Mon, 2010-06-14 at 08:39 +0200, Markus Armbruster wrote:
 Alex Williamson alex.william...@redhat.com writes:
 
  qdev_get_dev_path() is intended to be the canonical utility for creating
  a string representing the qdev hierarchy of a device.  The path consists
  of bus and device names as well as identified properties of the immediate
  parent bus and device.  This results in strings like:
 
  /main-system-bus/pci.0,addr=00.0/i440FX
  /main-system-bus/pci.0,addr=01.0/PIIX3
  /main-system-bus/pci.0,addr=02.0/cirrus-vga
  /main-system-bus/pci.0/isa.0/mc146818rtc
  /main-system-bus/pci.0/isa.0/isa-serial
  /main-system-bus/pci.0/isa.0/i8042
  /main-system-bus/pci.0/isa.0/isa-fdc
  /main-system-bus/pci.0,addr=03.0/i82551,mac=52:54:00:12:34:56
  /main-system-bus/pci.0,addr=04.0/virtio-net-pci,mac=52:54:00:12:34:57
  /main-system-bus/pci.0,addr=05.0/e1000,mac=52:54:00:12:34:58
  /main-system-bus/pci.0,addr=06.0/rtl8139,mac=52:54:00:12:34:59
  /main-system-bus/pci.0,addr=07.0/pcnet,mac=52:54:00:12:34:5a
  /main-system-bus/pci.0,addr=01.1/piix3-ide
  /main-system-bus/pci.0,addr=01.3/PIIX4_PM
  /main-system-bus/pci.0,addr=08.0/lsi53c895a
  /main-system-bus/pci.0,addr=09.0/virtio-blk-pci
 
  There are two primary targets for these strings.  The first is vmsave, where
  we currently use a device provided string plus instance number to track
  SaveStateEntries.  This fails when we introduce device hotplug, particularly
  in a case where we have gaps in the instance numbers that cannot easily be
  reproduced on a migration target.  The second is for naming RAMBlocks.  For
  these, we would like a string that matches the vmstate string.
 
 Could you explain why you add identified properties of the immediate
 parent bus and device?  They make the result very much *not* a dev
 path in the qdev sense...

In order to try to get a unique string.  Without looking into device
properties, two e1000s would both be:

/main-system-bus/pci.0/e1000
/main-system-bus/pci.0/e1000

Which is no better than simply e1000 and would require us to fall back
to instance ids again.  The goal here is that anything that makes use of
passing a dev when registering a vmstate gets an instance id of zero.
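As a rough illustration of how such strings compose from bus name, bus address property, and device name, here is a minimal standalone sketch; `build_dev_path` is an invented helper, not the real qdev_get_dev_path() implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Illustrative only: compose a path like the examples above from its
 * components. build_dev_path() is an invented helper and handles a
 * single bus level, unlike the real recursive qdev walk. */
static void build_dev_path(char *buf, size_t len,
                           const char *bus, const char *busname,
                           const char *prop, const char *dev)
{
    if (prop && *prop)
        snprintf(buf, len, "/%s/%s,%s/%s", bus, busname, prop, dev);
    else
        snprintf(buf, len, "/%s/%s/%s", bus, busname, dev);
}
```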

Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC PATCH 1/5] qdev: Create qdev_get_dev_path()

2010-06-14 Thread Jan Kiszka
Alex Williamson wrote:
 On Mon, 2010-06-14 at 08:39 +0200, Markus Armbruster wrote:
 Alex Williamson alex.william...@redhat.com writes:

 qdev_get_dev_path() is intended to be the canonical utility for creating
 a string representing the qdev hierarchy of a device.  The path consists
 of bus and device names as well as identified properties of the immediate
 parent bus and device.  This results in strings like:

 /main-system-bus/pci.0,addr=00.0/i440FX
 /main-system-bus/pci.0,addr=01.0/PIIX3
 /main-system-bus/pci.0,addr=02.0/cirrus-vga
 /main-system-bus/pci.0/isa.0/mc146818rtc
 /main-system-bus/pci.0/isa.0/isa-serial
 /main-system-bus/pci.0/isa.0/i8042
 /main-system-bus/pci.0/isa.0/isa-fdc
 /main-system-bus/pci.0,addr=03.0/i82551,mac=52:54:00:12:34:56
 /main-system-bus/pci.0,addr=04.0/virtio-net-pci,mac=52:54:00:12:34:57
 /main-system-bus/pci.0,addr=05.0/e1000,mac=52:54:00:12:34:58
 /main-system-bus/pci.0,addr=06.0/rtl8139,mac=52:54:00:12:34:59
 /main-system-bus/pci.0,addr=07.0/pcnet,mac=52:54:00:12:34:5a
 /main-system-bus/pci.0,addr=01.1/piix3-ide
 /main-system-bus/pci.0,addr=01.3/PIIX4_PM
 /main-system-bus/pci.0,addr=08.0/lsi53c895a
 /main-system-bus/pci.0,addr=09.0/virtio-blk-pci

 There are two primary targets for these strings.  The first is vmsave, where
 we currently use a device provided string plus instance number to track
 SaveStateEntries.  This fails when we introduce device hotplug, particularly
 in a case where we have gaps in the instance numbers that cannot easily be
 reproduced on a migration target.  The second is for naming RAMBlocks.  For
 these, we would like a string that matches the vmstate string.
 Could you explain why you add identified properties of the immediate
 parent bus and device?  They make the result very much *not* a dev
 path in the qdev sense...
 
 In order to try to get a unique string.  Without looking into device
 properties, two e1000s would both be:
 
 /main-system-bus/pci.0/e1000
 /main-system-bus/pci.0/e1000
 
 Which is no better than simply e1000 and would require us to fall back
 to instance ids again.  The goal here is that anything that makes use of
 passing a dev when registering a vmstate gets an instance id of zero.

Will soon (re-)post a patch that adds per-bus instance numbers to deal
with identically named devices. That's required as not all buses have
canonical node IDs (e.g. ISA or the main system bus).

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-14 Thread Avi Kivity

On 06/14/2010 03:50 PM, Balbir Singh wrote:





let me try to reason a bit. First let me explain the
problem

Memory is a precious resource in a consolidated environment.
We don't want to waste memory via page cache duplication
(cache=writethrough and cache=writeback mode).

Now here is what we are trying to do

1. A slab page will not be freed until the entire page is free (all
slabs have been kfree'd so to speak). Normal reclaim will definitely
free this page, but a lot of it depends on how frequently we are
scanning the LRU list and when this page got added.
2. In the case of page cache (specifically unmapped page cache), there
is duplication already, so why not go after unmapped page caches when
the system is under memory pressure?

In the case of 1, we don't force a dentry to be freed, but rather a
freed page in the slab cache to be reclaimed ahead of forcing reclaim
of mapped pages.
   

Sounds like this should be done unconditionally, then.  An empty
slab page is worth less than an unmapped pagecache page at all
times, no?

 

In a consolidated environment, even at the cost of some CPU to run
shrinkers, I think potentially yes.
   


I don't understand.  If you're running the shrinkers then you're 
evicting live entries, which could cost you an I/O each.  That's 
expensive, consolidated or not.


If you're not running the shrinkers, why does it matter if you're 
consolidated or not?  Drop that page unconditionally.



Does the problem statement make sense? If so, do you agree with 1 and
2? Is there major concern about subverting regular reclaim? Does
subverting it make sense in the duplicated scenario?

   

In the case of 2, how do you know there is duplication?  You know
the guest caches the page, but you have no information about the
host.  Since the page is cached in the guest, the host doesn't see
it referenced, and is likely to drop it.
 

True, that is why the first patch is controlled via a boot parameter
that the host can pass. For the second patch, I think we'll need
something like a `balloon <size> <cache>` command, with the cache argument being
optional.
   


Whether a page is duplicated on the host or not is per-page, it cannot 
be a boot parameter.


If we drop unmapped pagecache pages, we need to be sure they can be 
backed by the host, and that depends on the amount of sharing.


Overall, I don't see how a user can tune this.  If I were a guest admin, 
I'd play it safe by not assuming the host will back me, and disabling 
the feature.


To get something like this to work, we need to reward cooperating guests 
somehow.



If there is no duplication, then you may have dropped a
recently-used page and will likely cause a major fault soon.
 

Yes, agreed.
   


So how do we deal with this?



--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/24] Nested VMX, v5

2010-06-14 Thread Nadav Har'El
On Mon, Jun 14, 2010, Avi Kivity wrote about Re: [PATCH 0/24] Nested VMX, v5:
 Overall, very nice.  The finer split and better documentation really 
 help reviewing, thanks.

Thank you for the review and all the accurate comments!

 Let's try to get this merged quickly.

I'll start fixing the individual patches and resending them individually, and
when I've fixed everything I'll resubmit the whole lot. I hope that this time
I can do it in a matter of days, not months.

Thanks,
Nadav.

-- 
Nadav Har'El|   Monday, Jun 14 2010, 2 Tammuz 5770
n...@math.technion.ac.il |-
Phone +972-523-790466, ICQ 13349191 |An egotist is a person of low taste, more
http://nadav.harel.org.il   |interested in himself than in me.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC PATCH 1/5] qdev: Create qdev_get_dev_path()

2010-06-14 Thread Paul Brook
   /main-system-bus/pci.0,addr=09.0/virtio-blk-pci

There's a device missing between the main system bus and the pci bus.  Should 
be something like:

/main-system-bus/piix4-pcihost/pci.0/_09.0

  Could you explain why you add identified properties of the immediate
  parent bus and device?  They make the result very much *not* a dev
  path in the qdev sense...
 
 In order to try to get a unique string.  Without looking into device
 properties, two e1000s would both be:
 
 /main-system-bus/pci.0/e1000
 /main-system-bus/pci.0/e1000
 
 Which is no better than simply e1000 and would require us to fall back
 to instance ids again.  The goal here is that anything that makes use of
 passing a dev when registering a vmstate gets an instance id of zero.

You already got the information you need, you just put it in the wrong place. 
The canonical ID for the device could be its bus address. In practice we'd 
probably want to allow the user to specify it by name, provided these are 
unique. e.g. in the above machine we could accept [...]/virtio-blk-pci 
as an alias for [...]:_09.0. Device names have a restricted namespace, so we 
can use an initial prefix to disambiguate a name/label from a bus address.

For busses that don't have a consistent addressing scheme then some sort of 
instance ID is unavoidable. I guess it may be possible to invent something 
based on other device properties (e.g. address of the first IO port/memory 
region).

Paul
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[KVM_AUTOTEST][PATCH] Add support for newer (sysfs) ksm interface

2010-06-14 Thread Lukas Doktor

This patch adds support for the newer sysfs interface used since Fedora 12.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[ kvm-Bugs-1760424 ] Intel real mode emulation failures

2010-06-14 Thread SourceForge.net
Bugs item #1760424, was opened at 2007-07-25 18:40
Message generated for change (Comment added) made by technologov
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detailatid=893831aid=1760424group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: intel
Group: None
Status: Open
Resolution: Fixed
Priority: 8
Private: No
Submitted By: Technologov (technologov)
Assigned to: Nobody/Anonymous (nobody)
Summary: Intel real mode emulation failures

Initial Comment:
Major Linux distros crash on KVM-29, Intel64 CPU.

Those include:

Mandriva 2007 (32/64-bit)
Ubuntu 7.04 (32/64-bit)
openSUSE 10.2 (32/64-bit)
PCLinuxOS 2007 (32-bit)

They all crash for the same reason: the graphical bootloader (GRUB).

If we fix the root of the issue, all the problems will be gone.

All of the above work fine with -no-kvm (QEMU).

I thought that bug 1740031 (Kubuntu 7.04 fails to boot as guest - hangs after 
kernel boot) was related to this one, but since it was opened against the 
kernel, not against the bootloader, I decided to open this bug.

-Alexey Eremenko.

--

Comment By: Technologov (technologov)
Date: 2010-06-14 16:55

Message:
The problem happened on Core 2 CPUs...

Do you have a system with Core 2 CPU ?

--

Comment By: Jes Sorensen (jessorensen)
Date: 2010-06-11 15:49

Message:
Pulled down a copy and tested.

Ubuntu 7.04 installs and boots fine under recent KVM/QEMU. Looks like
problem has been resolved.

Closing
Jes


--

Comment By: Technologov (technologov)
Date: 2008-08-03 11:25

Message:
Logged In: YES 
user_id=1839746
Originator: YES

The situation is improving - distros integrate the newest GRUB, which
works around this problem.

Fixed distros: openSUSE 11.0, Ubuntu 8.04.

Mandriva Linux (2007/2008/2009-beta1) are still problematic. Bug opened.

Mandriva Linux unable to boot or install on KVM Virtualizer due to old
GRUB: 
https://qa.mandriva.com/show_bug.cgi?id=42358

-Alexey, 3.8.2008.

--

Comment By: Henrik Holst (henrik_holst)
Date: 2008-02-08 16:47

Message:
Logged In: YES 
user_id=2003951
Originator: NO

I see the same here with KVM-60, on a Core 2 Duo 6600.

Host is debian testing on a 2.6.22-3-amd64 kernel, using kvm-intel. Trying
to boot a kubuntu-7.10 iso crashes directly, works flawlessly with -no-kvm,
and I have tried to run both debian and w2k3 as guests without any
problems.

--

Comment By: Technologov (technologov)
Date: 2007-09-05 13:19

Message:
Logged In: YES 
user_id=1839746
Originator: YES

Update: with KVM-36 this is partially fixed. Works on SVM, but still
gets stuck on VMX.

openSUSE 10.2 (32/64-bit) and Mandriva Linux 2007 boot now.

-Alexey Technologov

--

Comment By: Technologov (technologov)
Date: 2007-07-25 18:51

Message:
Logged In: YES 
user_id=1839746
Originator: YES

File Added: KVM29-VMX64-Ubuntu7-i386.txt

--

Comment By: Technologov (technologov)
Date: 2007-07-25 18:46

Message:
Logged In: YES 
user_id=1839746
Originator: YES

File Added: KVM29-VMX64-PCLinuxOS2007-i386.txt

--

Comment By: Technologov (technologov)
Date: 2007-07-25 18:45

Message:
Logged In: YES 
user_id=1839746
Originator: YES

File Added: KVM29-VMX64-openSUSE10.2.txt

--

Comment By: Technologov (technologov)
Date: 2007-07-25 18:44

Message:
Logged In: YES 
user_id=1839746
Originator: YES

File Added: KVM29-VMX64-Mandriva2007-i386.txt

--

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detailatid=893831aid=1760424group_id=180599
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[ kvm-Bugs-1760424 ] Intel real mode emulation failures

2010-06-14 Thread SourceForge.net
Bugs item #1760424, was opened at 2007-07-25 17:40
Message generated for change (Settings changed) made by jessorensen
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detailatid=893831aid=1760424group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: intel
Group: None
Status: Closed
Resolution: Fixed
Priority: 8
Private: No
Submitted By: Technologov (technologov)
Assigned to: Nobody/Anonymous (nobody)
Summary: Intel real mode emulation failures

Initial Comment:
Major Linux distros crash on KVM-29, Intel64 CPU.

Those include:

Mandriva 2007 (32/64-bit)
Ubuntu 7.04 (32/64-bit)
openSUSE 10.2 (32/64-bit)
PCLinuxOS 2007 (32-bit)

They all crash for the same reason: the graphical bootloader (GRUB).

If we fix the root of the issue, all the problems will be gone.

All of the above work fine with -no-kvm (QEMU).

I thought that bug 1740031 (Kubuntu 7.04 fails to boot as guest - hangs after 
kernel boot) was related to this one, but since it was opened against the 
kernel, not against the bootloader, I decided to open this bug.

-Alexey Eremenko.

--

Comment By: Jes Sorensen (jessorensen)
Date: 2010-06-14 15:59

Message:
Yes I ran my test on a core2

If you can reproduce the problem, please open a new bug for it in
launchpad.



--

Comment By: Technologov (technologov)
Date: 2010-06-14 15:55

Message:
The problem happened on Core 2 CPUs...

Do you have a system with Core 2 CPU ?

--

Comment By: Jes Sorensen (jessorensen)
Date: 2010-06-11 14:49

Message:
Pulled down a copy and tested.

Ubuntu 7.04 installs and boots fine under recent KVM/QEMU. Looks like
problem has been resolved.

Closing
Jes


--

Comment By: Technologov (technologov)
Date: 2008-08-03 10:25

Message:
Logged In: YES 
user_id=1839746
Originator: YES

The situation is improving - distros integrate the newest GRUB, which
works around this problem.

Fixed distros: openSUSE 11.0, Ubuntu 8.04.

Mandriva Linux (2007/2008/2009-beta1) are still problematic. Bug opened.

Mandriva Linux unable to boot or install on KVM Virtualizer due to old
GRUB: 
https://qa.mandriva.com/show_bug.cgi?id=42358

-Alexey, 3.8.2008.

--

Comment By: Henrik Holst (henrik_holst)
Date: 2008-02-08 15:47

Message:
Logged In: YES 
user_id=2003951
Originator: NO

I see the same here with KVM-60, on a Core 2 Duo 6600.

Host is debian testing on a 2.6.22-3-amd64 kernel, using kvm-intel. Trying
to boot a kubuntu-7.10 iso crashes directly, works flawlessly with -no-kvm,
and I have tried to run both debian and w2k3 as guests without any
problems.

--

Comment By: Technologov (technologov)
Date: 2007-09-05 12:19

Message:
Logged In: YES 
user_id=1839746
Originator: YES

Update: with KVM-36 this is partially fixed. Works on SVM, but still
gets stuck on VMX.

openSUSE 10.2 (32/64-bit) and Mandriva Linux 2007 boot now.

-Alexey Technologov

--

Comment By: Technologov (technologov)
Date: 2007-07-25 17:51

Message:
Logged In: YES 
user_id=1839746
Originator: YES

File Added: KVM29-VMX64-Ubuntu7-i386.txt

--

Comment By: Technologov (technologov)
Date: 2007-07-25 17:46

Message:
Logged In: YES 
user_id=1839746
Originator: YES

File Added: KVM29-VMX64-PCLinuxOS2007-i386.txt

--

Comment By: Technologov (technologov)
Date: 2007-07-25 17:45

Message:
Logged In: YES 
user_id=1839746
Originator: YES

File Added: KVM29-VMX64-openSUSE10.2.txt

--

Comment By: Technologov (technologov)
Date: 2007-07-25 17:44

Message:
Logged In: YES 
user_id=1839746
Originator: YES

File Added: KVM29-VMX64-Mandriva2007-i386.txt

--

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detailatid=893831aid=1760424group_id=180599
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-14 Thread Dave Hansen
On Mon, 2010-06-14 at 16:01 +0300, Avi Kivity wrote:
 If we drop unmapped pagecache pages, we need to be sure they can be 
 backed by the host, and that depends on the amount of sharing.

You also have to set the host up properly, and continue to maintain
it in a way that finds and eliminates duplicates.

I saw some benchmarks where KSM was doing great, finding lots of
duplicate pages.  Then, the host filled up, and guests started
reclaiming.  As memory pressure got worse, so did KSM's ability to find
duplicates.

At the same time, I see what you're trying to do with this.  It really
can be an alternative to ballooning if we do it right, since ballooning
would probably evict similar pages.  Although it would only work in idle
guests, what about a knob that the host can turn to just get the guest
to start running reclaim?

-- Dave

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-14 Thread Avi Kivity

On 06/14/2010 06:12 PM, Dave Hansen wrote:

On Mon, 2010-06-14 at 14:18 +0530, Balbir Singh wrote:
   

1. A slab page will not be freed until the entire page is free (all
slabs have been kfree'd so to speak). Normal reclaim will definitely
free this page, but a lot of it depends on how frequently we are
scanning the LRU list and when this page got added.
 

You don't have to be freeing entire slab pages for the reclaim to have
been useful.  You could just be making space so that _future_
allocations fill in the slab holes you just created.  You may not be
freeing pages, but you're reducing future system pressure.
   


Depends.  If you've evicted something that will be referenced soon, 
you're increasing system pressure.



If unmapped page cache is the easiest thing to evict, then it should be
the first thing that goes when a balloon request comes in, which is the
case this patch is trying to handle.  If it isn't the easiest thing to
evict, then we _shouldn't_ evict it.
   


Easy to evict is just one measure.  There's benefit (size of data 
evicted), cost to refill (seeks, cpu), and likelihood that the cost to 
refill will be incurred (recency).


It's all very complicated.  We need better information to make these 
decisions.  For one thing, I'd like to see age information tied to 
objects.  We may have two pages that were referenced in wildly different 
times be next to each other in LRU order.  We have many LRUs, but no 
idea of the relative recency of the tails of those LRUs.


If each page or object had an age, we could scale those ages by the 
benefit from reclaim and cost to refill and make a better decision as to 
what to evict first.  But of course page-age means increasing sizeof 
struct page, and we can only approximate its value by scanning the 
accessed bit, not determine it accurately (unlike the other objects 
managed by the cache).
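A hypothetical sketch of the age-scaled scoring described above, assuming invented fields (`age`, `benefit`, `refill_cost`) that struct page does not actually have — which is part of the point being made:

```c
#include <assert.h>

/* Hypothetical per-object reclaim scoring: scale age by the benefit of
 * evicting and divide by the cost of refilling. All fields are invented
 * for illustration; no such metadata exists on struct page. */
struct cached_obj {
    unsigned age;         /* time since last reference (bigger = colder) */
    unsigned benefit;     /* e.g. bytes freed by evicting the object */
    unsigned refill_cost; /* e.g. expected I/Os to rebuild it */
};

/* Higher score = better eviction candidate. */
static unsigned long evict_score(const struct cached_obj *o)
{
    return (unsigned long)o->age * o->benefit / (o->refill_cost + 1);
}
```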


--
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC PATCH 1/5] qdev: Create qdev_get_dev_path()

2010-06-14 Thread Paul Brook
 On Mon, 2010-06-14 at 14:09 +0100, Paul Brook wrote:
 /main-system-bus/pci.0,addr=09.0/virtio-blk-pci
  
  There's a device missing between the main system bus and the pci bus. 
  Should be something like:
  
  /main-system-bus/piix4-pcihost/pci.0/_09.0
 
 Ok, I can easily come up with:
 
 /System/main-system-bus/i440FX-pcihost/PCI/pci.0/_09.0/virtio-blk-pci/virtio-blk

No. Now you've got way too many elements in the path.  My point is that you 
replace the name of the device with the bus address of the device.

However you determine the element names the path should be /busname/devicename 
pairs. I'm undecided whether the root element should be the main system bus, 
or a system device node that contains the main system bus.

  You already got the information you need, you just put it in the wrong
  place. The canonical ID for the device could be its bus address. In
  practice we'd probably want to allow the user to specify it by name,
  provided these are unique. e.g. in the above machine we could accept
  [...]/virtio-blk-pci as an alias for [...]:_09.0.
 
 Sure, if there was a guaranteed unique char* on a DeviceState that was
 consistent between guest invocations, we could just use that instead.  I
 suppose all devices could have a global system id property and if that
 was present we'd use that instead of creating the device path.

All we require is some way of uniquely addressing each device on the bus. For 
PCI that's trivial (devfn). For other busses we get to pick something 
appropriate.
 
  Device names have a restricted namespace, so we
  can use an initial prefix to disambiguate a name/label from a bus
  address.
 
 I'm not sure it's necessary.  It seems like it would only be necessary
 to differentiate the two if we wanted to translate between namespaces.
 But I think it's reasonable to require that if a global name is
 provided, it must always be provided for the life of the VM.

I don't think requiring that all devices are given a globally unique name is 
going to fly. Locally (per-bus) unique user-specified names are still a PITA, 
but may be acceptable. Having a canonical addressing system that's independent 
of user-assigned IDs seems like a good thing.

  For busses that don't have a consistent addressing scheme then some sort
  of instance ID is unavoidable. I guess it may be possible to invent
  something based on other device properties (e.g. address of the first IO
  port/memory region).
 
 Instance IDs aren't always bad, we just run into trouble with hotplug
 and maybe creating unique ramblock names.  But, such busses typically
 don't support hotplug and don't have multiple instances of the same
 device on the bus, so I don't expect us to hit many issues there as long
 as we can get a stable address path.  Thanks,

Simple consecutive instance IDs are inherently unstable. They depend on the 
order of device creation. The ID really wants to be something that can be 
reliably determined from the device bus itself (and/or its interaction with 
its parent bus).
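For PCI the stable per-bus address is the devfn; the encoding below mirrors the kernel's PCI_DEVFN macro, while the "_09.0"-style label format is an assumption taken from the examples in this thread, not an established API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* devfn packs PCI slot and function: slot in bits 7..3, function in
 * bits 2..0 (same layout as the kernel's PCI_DEVFN macro). */
#define PCI_DEVFN(slot, fn) ((((slot) & 0x1f) << 3) | ((fn) & 0x07))

/* Format a devfn as a "_09.0"-style path element; this label convention
 * is assumed from the thread's examples, not a defined QEMU interface. */
static void format_pci_addr(char *buf, size_t len, unsigned devfn)
{
    snprintf(buf, len, "_%02x.%x", devfn >> 3, devfn & 0x07);
}
```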

Paul
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-14 Thread Avi Kivity

On 06/14/2010 06:33 PM, Dave Hansen wrote:

On Mon, 2010-06-14 at 16:01 +0300, Avi Kivity wrote:
   

If we drop unmapped pagecache pages, we need to be sure they can be
backed by the host, and that depends on the amount of sharing.
 

You also have to set the host up properly, and continue to maintain
it in a way that finds and eliminates duplicates.

I saw some benchmarks where KSM was doing great, finding lots of
duplicate pages.  Then, the host filled up, and guests started
reclaiming.  As memory pressure got worse, so did KSM's ability to find
duplicates.
   


Yup.  KSM needs to be backed up by ballooning, swap, and live migration.


At the same time, I see what you're trying to do with this.  It really
can be an alternative to ballooning if we do it right, since ballooning
would probably evict similar pages.  Although it would only work in idle
guests, what about a knob that the host can turn to just get the guest
to start running reclaim?
   


Isn't the knob in this proposal the balloon?  AFAICT, the idea here is 
to change how the guest reacts to being ballooned, but the trigger 
itself would not change.


My issue is that changing the type of object being preferentially 
reclaimed just changes the type of workload that would prematurely 
suffer from reclaim.  In this case, workloads that use a lot of unmapped 
pagecache would suffer.


btw, aren't /proc/sys/vm/swappiness and vfs_cache_pressure similar knobs?

--
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [PATCH v6 4/6] Add support for marking memory to not be migrated. On migration, memory is checked for the NO_MIGRATION_FLAG.

2010-06-14 Thread Anthony Liguori

On 06/04/2010 04:45 PM, Cam Macdonell wrote:

This is useful for devices that do not want to take memory regions data with 
them on migration.
---
  arch_init.c  |   28 
  cpu-all.h|2 ++
  cpu-common.h |2 ++
  exec.c   |   12 
  4 files changed, 32 insertions(+), 12 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index cfc03ea..7a234fa 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -118,18 +118,21 @@ static int ram_save_block(QEMUFile *f)
  current_addr + TARGET_PAGE_SIZE,
  MIGRATION_DIRTY_FLAG);

-p = qemu_get_ram_ptr(current_addr);
-
-if (is_dup_page(p, *p)) {
-qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_COMPRESS);
-qemu_put_byte(f, *p);
-} else {
-qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_PAGE);
-qemu_put_buffer(f, p, TARGET_PAGE_SIZE);
-}
+if (!cpu_physical_memory_get_dirty(current_addr,
+NO_MIGRATION_FLAG)) {
+p = qemu_get_ram_ptr(current_addr);
+
+if (is_dup_page(p, *p)) {
+qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_COMPRESS);
+qemu_put_byte(f, *p);
+} else {
+qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_PAGE);
+qemu_put_buffer(f, p, TARGET_PAGE_SIZE);
+}

-found = 1;
-break;
+found = 1;
+break;
+}
  }
   


Shouldn't we just disable live migration outright?

I would rather that the device mark migration as impossible than have the 
user hot remove the device before migration and then add it again after 
migration.  Device assignment could also use this functionality.


What this does is make migration possible but fundamentally broken which 
is not a good thing.


Regards,

Anthony Liguori


  addr += TARGET_PAGE_SIZE;
  current_addr = (saved_addr + addr) % last_ram_offset;
@@ -146,7 +149,8 @@ static ram_addr_t ram_save_remaining(void)
  ram_addr_t count = 0;

  for (addr = 0; addr < last_ram_offset; addr += TARGET_PAGE_SIZE) {
-if (cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) {
+if (!cpu_physical_memory_get_dirty(addr, NO_MIGRATION_FLAG) &&
+cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) {
  count++;
  }
  }
diff --git a/cpu-all.h b/cpu-all.h
index 9080cc7..4df00ab 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -887,6 +887,8 @@ extern int mem_prealloc;
  #define CODE_DIRTY_FLAG  0x02
  #define MIGRATION_DIRTY_FLAG 0x08

+#define NO_MIGRATION_FLAG 0x10
+
  #define DIRTY_ALL_FLAG  (VGA_DIRTY_FLAG | CODE_DIRTY_FLAG | \
MIGRATION_DIRTY_FLAG)

  /* read dirty bit (return 0 or 1) */
diff --git a/cpu-common.h b/cpu-common.h
index 4b0ba60..a1ebbbe 100644
--- a/cpu-common.h
+++ b/cpu-common.h
@@ -39,6 +39,8 @@ static inline void 
cpu_register_physical_memory(target_phys_addr_t start_addr,
  cpu_register_physical_memory_offset(start_addr, size, phys_offset, 0);
  }

+void cpu_mark_pages_no_migrate(ram_addr_t start, uint64_t size);
+
  ram_addr_t cpu_get_physical_page_desc(target_phys_addr_t addr);
  ram_addr_t qemu_ram_map(ram_addr_t size, void *host);
  ram_addr_t qemu_ram_alloc(ram_addr_t);
diff --git a/exec.c b/exec.c
index 39c18a7..c11d22f 100644
--- a/exec.c
+++ b/exec.c
@@ -2786,6 +2786,18 @@ static void *file_ram_alloc(ram_addr_t memory, const 
char *path)
  }
  #endif

+void cpu_mark_pages_no_migrate(ram_addr_t start, uint64_t length)
+{
+int i, len;
+uint8_t *p;
+
+len = length >> TARGET_PAGE_BITS;
+p = phys_ram_flags + (start >> TARGET_PAGE_BITS);
+for (i = 0; i < len; i++) {
+p[i] |= NO_MIGRATION_FLAG;
+}
+}
+
  ram_addr_t qemu_ram_map(ram_addr_t size, void *host)
  {
  RAMBlock *new_block;
   




Re: [Qemu-devel] Re: [PATCH v6 0/6] Inter-VM Shared Memory Device with migration support

2010-06-14 Thread Anthony Liguori

On 06/11/2010 05:03 PM, Cam Macdonell wrote:

Hi Anthony,

Is my implementation of master/peer roles acceptable?


Yes, it looks good.


   I realize with
Alex's RAMList changes I may need to modify my patch, but is the
approach of marking memory non-migratable an acceptable
implementation?
   


Please make sure to address some of the CODING_STYLE comments too.

Regards,

Anthony Liguori


Thanks,
Cam

On Fri, Jun 4, 2010 at 3:45 PM, Cam Macdonellc...@cs.ualberta.ca  wrote:
   

Latest patch for PCI shared memory device that maps a host shared memory object
to be shared between guests

new in this series
- migration support with 'master' and 'peer' roles for guest to determine
  who owns memory.  With 'master', the guest has the canonical copy of
  the shared memory and will copy it with it on migration.  With 
'role=peer',
  the guest will not copy the shared memory, but attach to what is on the
  destination machine.
- modified phys_ram_dirty array for marking memory as not to be migrated
- add support for non-migrated memory regions

v5:
- fixed segfault for non-server case
- code style fixes
- removed limit on the number of guests
- shared memory server is now in qemu.git/contrib
- made ioeventfd setup function generic
- removed interrupts when guest joined (let application handle it)

v4:
- moved to single Doorbell register and use datamatch to trigger different
  VMs rather than one register per eventfd
- remove writing arbitrary values to eventfds.  Only values of 1 are now
  written to ensure correct usage

Cam Macdonell (6):
  Device specification for shared memory PCI device
  Adds two new functions for assigning ioeventfd and irqfds.
  Change phys_ram_dirty to phys_ram_status
  Add support for marking memory to not be migrated.  On migration,
memory is checked for the NO_MIGRATION_FLAG.
  Inter-VM shared memory PCI device
  the stand-alone shared memory server for inter-VM shared memory

  Makefile.target |3 +
  arch_init.c |   28 +-
  contrib/ivshmem-server/Makefile |   16 +
  contrib/ivshmem-server/README   |   30 ++
  contrib/ivshmem-server/ivshmem_server.c |  353 +
  contrib/ivshmem-server/send_scm.c   |  208 
  contrib/ivshmem-server/send_scm.h   |   19 +
  cpu-all.h   |   18 +-
  cpu-common.h|2 +
  docs/specs/ivshmem_device_spec.txt  |   96 
  exec.c  |   48 ++-
  hw/ivshmem.c|  852 +++
  kvm-all.c   |   32 ++
  kvm.h   |1 +
  qemu-char.c |6 +
  qemu-char.h |3 +
  qemu-doc.texi   |   32 ++
  17 files changed, 1710 insertions(+), 37 deletions(-)
  create mode 100644 contrib/ivshmem-server/Makefile
  create mode 100644 contrib/ivshmem-server/README
  create mode 100644 contrib/ivshmem-server/ivshmem_server.c
  create mode 100644 contrib/ivshmem-server/send_scm.c
  create mode 100644 contrib/ivshmem-server/send_scm.h
  create mode 100644 docs/specs/ivshmem_device_spec.txt
  create mode 100644 hw/ivshmem.c


 


   




Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-14 Thread Dave Hansen
On Mon, 2010-06-14 at 18:44 +0300, Avi Kivity wrote:
 On 06/14/2010 06:33 PM, Dave Hansen wrote:
  At the same time, I see what you're trying to do with this.  It really
  can be an alternative to ballooning if we do it right, since ballooning
  would probably evict similar pages.  Although it would only work in idle
  guests, what about a knob that the host can turn to just get the guest
  to start running reclaim?
 
 Isn't the knob in this proposal the balloon?  AFAICT, the idea here is 
 to change how the guest reacts to being ballooned, but the trigger 
 itself would not change.

I think the patch was made on the following assumptions:
1. Guests will keep filling their memory with relatively worthless page
   cache that they don't really need.
2. When they do this, it hurts the overall system with no real gain for
   anyone.

In the case of a ballooned guest, they _won't_ keep filling memory.  The
balloon will prevent them.  So, I guess I was just going down the path
of considering if this would be useful without ballooning in place.  To
me, it's really hard to justify _with_ ballooning in place.

 My issue is that changing the type of object being preferentially 
 reclaimed just changes the type of workload that would prematurely 
 suffer from reclaim.  In this case, workloads that use a lot of unmapped 
 pagecache would suffer.
 
 btw, aren't /proc/sys/vm/swappiness and vfs_cache_pressure similar knobs?

Those tell you how to balance going after the different classes of
things that we can reclaim.

Again, this is useless when ballooning is being used.  But, I'm thinking
of a more general mechanism to force the system to both have MemFree
_and_ be acting as if it is under memory pressure.

Balbir, can you elaborate a bit on why you would need these patches on a
guest that is being ballooned?

-- Dave



Re: [Qemu-devel] [PATCH v6 6/6] the stand-alone shared memory server for inter-VM shared memory

2010-06-14 Thread Anthony Liguori

On 06/04/2010 04:45 PM, Cam Macdonell wrote:

this code is a standalone server which will pass file descriptors for the shared
memory region and eventfds to support interrupts between guests using inter-VM
shared memory.
---
  contrib/ivshmem-server/Makefile |   16 ++
  contrib/ivshmem-server/README   |   30 +++
  contrib/ivshmem-server/ivshmem_server.c |  353 +++
  contrib/ivshmem-server/send_scm.c   |  208 ++
  contrib/ivshmem-server/send_scm.h   |   19 ++
  5 files changed, 626 insertions(+), 0 deletions(-)
  create mode 100644 contrib/ivshmem-server/Makefile
  create mode 100644 contrib/ivshmem-server/README
  create mode 100644 contrib/ivshmem-server/ivshmem_server.c
  create mode 100644 contrib/ivshmem-server/send_scm.c
  create mode 100644 contrib/ivshmem-server/send_scm.h

diff --git a/contrib/ivshmem-server/Makefile b/contrib/ivshmem-server/Makefile
new file mode 100644
index 000..da40ffa
--- /dev/null
+++ b/contrib/ivshmem-server/Makefile
@@ -0,0 +1,16 @@
+CC = gcc
+CFLAGS = -O3 -Wall -Werror
+LIBS = -lrt
+
+# a very simple makefile to build the inter-VM shared memory server
+
+all: ivshmem_server
+
+.c.o:
+   $(CC) $(CFLAGS) -c $^ -o $@
+
+ivshmem_server: ivshmem_server.o send_scm.o
+   $(CC) $(CFLAGS) -o $@ $^ $(LIBS)
+
+clean:
+   rm -f *.o ivshmem_server
diff --git a/contrib/ivshmem-server/README b/contrib/ivshmem-server/README
new file mode 100644
index 000..b1fc2a2
--- /dev/null
+++ b/contrib/ivshmem-server/README
@@ -0,0 +1,30 @@
+Using the ivshmem shared memory server
+--
+
+This server is only supported on Linux.
+
+To use the shared memory server, first compile it.  Running 'make' should
+accomplish this.  An executable named 'ivshmem_server' will be built.
+
+To display the options, run:
+
+./ivshmem_server -h
+
+Options
+---
+
+-h  print help message
+
+-p <path on host>
+unix socket to listen on.  The qemu-kvm chardev needs to connect on
+this socket. (default: '/tmp/ivshmem_socket')
+
+-s <string>
+POSIX shared object to create that is the shared memory (default: 'ivshmem')
+
+-m <#>
+size of the POSIX object in MBs (default: 1)
+
+-n <#>
+number of eventfds for each guest.  This number must match the
+'vectors' argument passed to the ivshmem device. (default: 1)
diff --git a/contrib/ivshmem-server/ivshmem_server.c 
b/contrib/ivshmem-server/ivshmem_server.c
new file mode 100644
index 000..e0a7b98
--- /dev/null
+++ b/contrib/ivshmem-server/ivshmem_server.c
   


There's no licensing here.  I don't think this belongs in the qemu tree 
either to be honest.  If it were to be included, it ought to use all of 
the existing qemu infrastructure like the other qemu-* tools.


Regards,

Anthony Liguori


@@ -0,0 +1,353 @@
+/*
+ * A stand-alone shared memory server for inter-VM shared memory for KVM
+*/
+
+#include <errno.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <sys/eventfd.h>
+#include <sys/mman.h>
+#include <sys/select.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include "send_scm.h"
+
+#define DEFAULT_SOCK_PATH "/tmp/ivshmem_socket"
+#define DEFAULT_SHM_OBJ "ivshmem"
+
+#define DEBUG 1
+
+typedef struct server_state {
+vmguest_t *live_vms;
+int nr_allocated_vms;
+int shm_size;
+long live_count;
+long total_count;
+int shm_fd;
+char * path;
+char * shmobj;
+int maxfd, conn_socket;
+long msi_vectors;
+} server_state_t;
+
+void usage(char const *prg);
+int find_set(fd_set * readset, int max);
+void print_vec(server_state_t * s, const char * c);
+
+void add_new_guest(server_state_t * s);
+void parse_args(int argc, char **argv, server_state_t * s);
+int create_listening_socket(char * path);
+
+int main(int argc, char ** argv)
+{
+fd_set readset;
+server_state_t * s;
+
+s = (server_state_t *)calloc(1, sizeof(server_state_t));
+
+s->live_count = 0;
+s->total_count = 0;
+parse_args(argc, argv, s);
+
+/* open shared memory file  */
+if ((s->shm_fd = shm_open(s->shmobj, O_CREAT|O_RDWR, S_IRWXU)) < 0)
+{
+fprintf(stderr, "kvm_ivshmem: could not open shared file\n");
+exit(-1);
+}
+
+ftruncate(s->shm_fd, s->shm_size);
+
+s->conn_socket = create_listening_socket(s->path);
+
+s->maxfd = s->conn_socket;
+
+for(;;) {
+int ret, handle, i;
+char buf[1024];
+
+print_vec(s, "vm_sockets");
+
+FD_ZERO(&readset);
+/* conn socket is in Live_vms at posn 0 */
+FD_SET(s->conn_socket, &readset);
+for (i = 0; i < s->total_count; i++) {
+if (s->live_vms[i].alive != 0) {
+FD_SET(s->live_vms[i].sockfd, &readset);
+}
+}
+
+printf("\nWaiting (maxfd = %d)\n", s->maxfd);
+
+ret = select(s->maxfd + 1, &readset, NULL, NULL, NULL);
+
+if 

Re: [Qemu-devel] [RFC PATCH 1/5] qdev: Create qdev_get_dev_path()

2010-06-14 Thread Jan Kiszka
Alex Williamson wrote:
 On Mon, 2010-06-14 at 14:09 +0100, Paul Brook wrote:
 /main-system-bus/pci.0,addr=09.0/virtio-blk-pci
 There's a device missing between the main system bus and the pci bus.  
 Should 
 be something like:

 /main-system-bus/piix4-pcihost/pci.0/_09.0
 
 Ok, I can easily come up with:
 
 /System/main-system-bus/i440FX-pcihost/PCI/pci.0/_09.0/virtio-blk-pci/virtio-blk

First two elements are redundant, '/' already stands for the main system
bus. Then I don't get what last element expresses (the target device is
virtio-blk-pci). Is this due to some vmstate layout? But that should not
be part of qdev paths (maybe I'm missing your use case).

And instead of introducing another hierarchy level with the bus address,
I would also prefer to add this as prefix or suffix to the device name,
e.g. driver.busaddr.

 
 to expand the previous output.  Code below.
 
 Could you explain why you add identified properties of the immediate
 parent bus and device?  They make the result very much *not* a dev
 path in the qdev sense...
 In order to try to get a unique string.  Without looking into device
 properties, two e1000s would both be:

 /main-system-bus/pci.0/e1000
 /main-system-bus/pci.0/e1000

 Which is no better than simply "e1000" and would require us to fall back
 to instance ids again.  The goal here is that anything that makes use of
 passing a dev when registering a vmstate gets an instance id of zero.
 You already got the information you need, you just put it in the wrong 
 place. 
 The canonical ID for the device could be its bus address. In practice we'd 
 probably want to allow the user to specify it by name, provided these are 
 unique. e.g. in the above machine we could accept [...]/virtio-blk-pci 
 as an alias for [...]:_09.0.
 
 Sure, if there was a guaranteed unique char* on a DeviceState that was
 consistent between guest invocations, we could just use that instead.  I
 suppose all devices could have a global system id property and if that
 was present we'd use that instead of creating the device path.
 
 Device names have a restricted namespace, so we 
 can use an initial prefix to disambiguate a name/label from a bus address.
 
 I'm not sure it's necessary.  It seems like it would only be necessary
 to differentiate the two if we wanted to translate between namespaces.
 But I think it's reasonable to require that if a global name is
 provided, it must always be provided for the life of the VM.
 
 For busses that don't have a consistent addressing scheme then some sort of 
 instance ID is unavoidable. I guess it may be possible to invent something 
 based on other device properties (e.g. address of the first IO port/memory 
 region).
 
 Instance IDs aren't always bad, we just run into trouble with hotplug
 and maybe creating unique ramblock names.  But, such busses typically
 don't support hotplug and don't have multiple instances of the same
 device on the bus, so I don't expect us to hit many issues there as long
 as we can get a stable address path.  Thanks,
 

If stable instance numbers are required, we could simply keep them in
DeviceState and search for the smallest free one on additions. Actually,
I'm more in favor of this direction than including the bus address. That
way we could keep a canonical format across all buses and do not have to
provide possibly complex ID generation rules for each of them.

BTW, the problem isn't truly solved by printing paths. We also need to
parse them. It would be counterproductive if such paths ever see the
light of day (i.e. leak outside vmstate) and a user tries to hack it
into the monitor or pass it on the command line. With my series, I'm
currently able to process paths like this one:

/i440FX-pcihost.0/pci.0/PIIX4_PM.0/i2c/smbus-eeprom.6

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [Qemu-devel] [PATCH v6 4/6] Add support for marking memory to not be migrated. On migration, memory is checked for the NO_MIGRATION_FLAG.

2010-06-14 Thread Cam Macdonell
On Mon, Jun 14, 2010 at 9:51 AM, Anthony Liguori anth...@codemonkey.ws wrote:
 On 06/04/2010 04:45 PM, Cam Macdonell wrote:

 This is useful for devices that do not want to take memory regions data
 with them on migration.
 ---
  arch_init.c  |   28 
  cpu-all.h    |    2 ++
  cpu-common.h |    2 ++
  exec.c       |   12 
  4 files changed, 32 insertions(+), 12 deletions(-)

 diff --git a/arch_init.c b/arch_init.c
 index cfc03ea..7a234fa 100644
 --- a/arch_init.c
 +++ b/arch_init.c
 @@ -118,18 +118,21 @@ static int ram_save_block(QEMUFile *f)
                                              current_addr +
 TARGET_PAGE_SIZE,
                                              MIGRATION_DIRTY_FLAG);

 -            p = qemu_get_ram_ptr(current_addr);
 -
 -            if (is_dup_page(p, *p)) {
 -                qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_COMPRESS);
 -                qemu_put_byte(f, *p);
 -            } else {
 -                qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_PAGE);
 -                qemu_put_buffer(f, p, TARGET_PAGE_SIZE);
 -            }
 +            if (!cpu_physical_memory_get_dirty(current_addr,
 +                                                    NO_MIGRATION_FLAG)) {
 +                p = qemu_get_ram_ptr(current_addr);
 +
 +                if (is_dup_page(p, *p)) {
 +                    qemu_put_be64(f, current_addr |
 RAM_SAVE_FLAG_COMPRESS);
 +                    qemu_put_byte(f, *p);
 +                } else {
 +                    qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_PAGE);
 +                    qemu_put_buffer(f, p, TARGET_PAGE_SIZE);
 +                }

 -            found = 1;
 -            break;
 +                found = 1;
 +                break;
 +            }
          }


 Shouldn't we just disable live migration outright?

I'm confused, as you seemed insistent on migration before.  Do you
want to support static migration (suspend/resume), but not live
migration?  What information do the master/peer roles represent then?


 I would rather that the device mark migration as impossible than have the user
 hot remove the device before migration and then add it again after
 migration.  Device assignment could also use this functionality.

Would marking migration impossible be a new mechanism or are there
other devices that mark migration impossible? or something added to
QMP: "Sorry, you can't migrate with device 'x' attached"?

Cam

          addr += TARGET_PAGE_SIZE;
          current_addr = (saved_addr + addr) % last_ram_offset;
 @@ -146,7 +149,8 @@ static ram_addr_t ram_save_remaining(void)
      ram_addr_t count = 0;

       for (addr = 0; addr < last_ram_offset; addr += TARGET_PAGE_SIZE) {
  -        if (cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) {
  +        if (!cpu_physical_memory_get_dirty(addr, NO_MIGRATION_FLAG) &&
  +                cpu_physical_memory_get_dirty(addr,
  MIGRATION_DIRTY_FLAG)) {
              count++;
          }
      }
 diff --git a/cpu-all.h b/cpu-all.h
 index 9080cc7..4df00ab 100644
 --- a/cpu-all.h
 +++ b/cpu-all.h
 @@ -887,6 +887,8 @@ extern int mem_prealloc;
  #define CODE_DIRTY_FLAG      0x02
  #define MIGRATION_DIRTY_FLAG 0x08

 +#define NO_MIGRATION_FLAG 0x10
 +
  #define DIRTY_ALL_FLAG  (VGA_DIRTY_FLAG | CODE_DIRTY_FLAG |
 MIGRATION_DIRTY_FLAG)

  /* read dirty bit (return 0 or 1) */
 diff --git a/cpu-common.h b/cpu-common.h
 index 4b0ba60..a1ebbbe 100644
 --- a/cpu-common.h
 +++ b/cpu-common.h
 @@ -39,6 +39,8 @@ static inline void
 cpu_register_physical_memory(target_phys_addr_t start_addr,
      cpu_register_physical_memory_offset(start_addr, size, phys_offset,
 0);
  }

 +void cpu_mark_pages_no_migrate(ram_addr_t start, uint64_t size);
 +
  ram_addr_t cpu_get_physical_page_desc(target_phys_addr_t addr);
  ram_addr_t qemu_ram_map(ram_addr_t size, void *host);
  ram_addr_t qemu_ram_alloc(ram_addr_t);
 diff --git a/exec.c b/exec.c
 index 39c18a7..c11d22f 100644
 --- a/exec.c
 +++ b/exec.c
 @@ -2786,6 +2786,18 @@ static void *file_ram_alloc(ram_addr_t memory,
 const char *path)
  }
  #endif

 +void cpu_mark_pages_no_migrate(ram_addr_t start, uint64_t length)
 +{
 +    int i, len;
 +    uint8_t *p;
 +
 +    len = length >> TARGET_PAGE_BITS;
 +    p = phys_ram_flags + (start >> TARGET_PAGE_BITS);
 +    for (i = 0; i < len; i++) {
 +        p[i] |= NO_MIGRATION_FLAG;
 +    }
 +}
 +
  ram_addr_t qemu_ram_map(ram_addr_t size, void *host)
  {
      RAMBlock *new_block;





Re: [Qemu-devel] [PATCH v6 4/6] Add support for marking memory to not be migrated. On migration, memory is checked for the NO_MIGRATION_FLAG.

2010-06-14 Thread Anthony Liguori

On 06/14/2010 11:08 AM, Cam Macdonell wrote:

On Mon, Jun 14, 2010 at 9:51 AM, Anthony Liguorianth...@codemonkey.ws  wrote:
   

On 06/04/2010 04:45 PM, Cam Macdonell wrote:
 

This is useful for devices that do not want to take memory regions data
with them on migration.
---
  arch_init.c  |   28 
  cpu-all.h|2 ++
  cpu-common.h |2 ++
  exec.c   |   12 
  4 files changed, 32 insertions(+), 12 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index cfc03ea..7a234fa 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -118,18 +118,21 @@ static int ram_save_block(QEMUFile *f)
  current_addr +
TARGET_PAGE_SIZE,
  MIGRATION_DIRTY_FLAG);

-p = qemu_get_ram_ptr(current_addr);
-
-if (is_dup_page(p, *p)) {
-qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_COMPRESS);
-qemu_put_byte(f, *p);
-} else {
-qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_PAGE);
-qemu_put_buffer(f, p, TARGET_PAGE_SIZE);
-}
+if (!cpu_physical_memory_get_dirty(current_addr,
+NO_MIGRATION_FLAG)) {
+p = qemu_get_ram_ptr(current_addr);
+
+if (is_dup_page(p, *p)) {
+qemu_put_be64(f, current_addr |
RAM_SAVE_FLAG_COMPRESS);
+qemu_put_byte(f, *p);
+} else {
+qemu_put_be64(f, current_addr | RAM_SAVE_FLAG_PAGE);
+qemu_put_buffer(f, p, TARGET_PAGE_SIZE);
+}

-found = 1;
-break;
+found = 1;
+break;
+}
  }

   

Shouldn't we just disable live migration outright?
 

I'm confused, as you seemed insistent on migration before.  Do you
want to support static migration (suspend/resume), but not live
migration?  What information do the master/peer roles represent then?
   


When role=master, you should not disable live migration and you should 
still migrate the contents of the data.  Otherwise, when you migrate, 
you lose the contents of shared memory and since the role is master, 
it's the one responsible for the data.



I would rather that the device mark migration as impossible than have the
user hot remove the device before migration and then add it again after
migration.  Device assignment could also use this functionality.
 

Would marking migration impossible be a new mechanism or are there
other devices that mark migration impossible? or something added to
QMP Sorry, you can't migrate with device 'x' attached?
   


We don't have such a mechanism today.

Regards,

Anthony Liguori


Cam

   

  addr += TARGET_PAGE_SIZE;
  current_addr = (saved_addr + addr) % last_ram_offset;
@@ -146,7 +149,8 @@ static ram_addr_t ram_save_remaining(void)
  ram_addr_t count = 0;

   for (addr = 0; addr < last_ram_offset; addr += TARGET_PAGE_SIZE) {
-if (cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) {
+if (!cpu_physical_memory_get_dirty(addr, NO_MIGRATION_FLAG) &&
+cpu_physical_memory_get_dirty(addr,
MIGRATION_DIRTY_FLAG)) {
  count++;
  }
  }
diff --git a/cpu-all.h b/cpu-all.h
index 9080cc7..4df00ab 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -887,6 +887,8 @@ extern int mem_prealloc;
  #define CODE_DIRTY_FLAG  0x02
  #define MIGRATION_DIRTY_FLAG 0x08

+#define NO_MIGRATION_FLAG 0x10
+
  #define DIRTY_ALL_FLAG  (VGA_DIRTY_FLAG | CODE_DIRTY_FLAG |
MIGRATION_DIRTY_FLAG)

  /* read dirty bit (return 0 or 1) */
diff --git a/cpu-common.h b/cpu-common.h
index 4b0ba60..a1ebbbe 100644
--- a/cpu-common.h
+++ b/cpu-common.h
@@ -39,6 +39,8 @@ static inline void
cpu_register_physical_memory(target_phys_addr_t start_addr,
  cpu_register_physical_memory_offset(start_addr, size, phys_offset,
0);
  }

+void cpu_mark_pages_no_migrate(ram_addr_t start, uint64_t size);
+
  ram_addr_t cpu_get_physical_page_desc(target_phys_addr_t addr);
  ram_addr_t qemu_ram_map(ram_addr_t size, void *host);
  ram_addr_t qemu_ram_alloc(ram_addr_t);
diff --git a/exec.c b/exec.c
index 39c18a7..c11d22f 100644
--- a/exec.c
+++ b/exec.c
@@ -2786,6 +2786,18 @@ static void *file_ram_alloc(ram_addr_t memory,
const char *path)
  }
  #endif

+void cpu_mark_pages_no_migrate(ram_addr_t start, uint64_t length)
+{
+int i, len;
+uint8_t *p;
+
+len = length >> TARGET_PAGE_BITS;
+p = phys_ram_flags + (start >> TARGET_PAGE_BITS);
+for (i = 0; i < len; i++) {
+p[i] |= NO_MIGRATION_FLAG;
+}
+}
+
  ram_addr_t qemu_ram_map(ram_addr_t size, void *host)
  {
  RAMBlock *new_block;

   


 



Re: [Qemu-devel] [PATCH v2] ram_blocks: Convert to a QLIST

2010-06-14 Thread Anthony Liguori

On 06/11/2010 12:11 PM, Alex Williamson wrote:

This makes the RAM block list easier to manipulate.  Also incorporate
relevant variables into the RAMList struct.

Signed-off-by: Alex Williamsonalex.william...@redhat.com
Acked-by: Chris Wrightchr...@redhat.com
   


Applied.  Thanks.

Regards,

Anthony Liguori

---

v2: For qemu.git this time

  arch_init.c |   14 +--
  cpu-all.h   |   28 +++--
  exec.c  |   78 ---
  3 files changed, 54 insertions(+), 66 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 8e849a8..eb5b67c 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -110,7 +110,7 @@ static int ram_save_block(QEMUFile *f)
  ram_addr_t addr = 0;
  int bytes_sent = 0;

-while (addr < last_ram_offset) {
+while (addr < ram_list.last_offset) {
  if (cpu_physical_memory_get_dirty(current_addr, 
MIGRATION_DIRTY_FLAG)) {
  uint8_t *p;

@@ -133,7 +133,7 @@ static int ram_save_block(QEMUFile *f)
  break;
  }
  addr += TARGET_PAGE_SIZE;
-current_addr = (saved_addr + addr) % last_ram_offset;
+current_addr = (saved_addr + addr) % ram_list.last_offset;
  }

  return bytes_sent;
@@ -146,7 +146,7 @@ static ram_addr_t ram_save_remaining(void)
  ram_addr_t addr;
  ram_addr_t count = 0;

-for (addr = 0; addr < last_ram_offset; addr += TARGET_PAGE_SIZE) {
+for (addr = 0; addr < ram_list.last_offset; addr += TARGET_PAGE_SIZE) {
  if (cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) {
  count++;
  }
@@ -167,7 +167,7 @@ uint64_t ram_bytes_transferred(void)

  uint64_t ram_bytes_total(void)
  {
-return last_ram_offset;
+return ram_list.last_offset;
  }

  int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
@@ -191,7 +191,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, 
void *opaque)
  bytes_transferred = 0;

  /* Make sure all dirty bits are set */
-for (addr = 0; addr < last_ram_offset; addr += TARGET_PAGE_SIZE) {
+for (addr = 0; addr < ram_list.last_offset; addr += TARGET_PAGE_SIZE) {
  if (!cpu_physical_memory_get_dirty(addr, MIGRATION_DIRTY_FLAG)) {
  cpu_physical_memory_set_dirty(addr);
  }
@@ -200,7 +200,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, 
void *opaque)
  /* Enable dirty memory tracking */
  cpu_physical_memory_set_dirty_tracking(1);

-qemu_put_be64(f, last_ram_offset | RAM_SAVE_FLAG_MEM_SIZE);
+qemu_put_be64(f, ram_list.last_offset | RAM_SAVE_FLAG_MEM_SIZE);
  }

  bytes_transferred_last = bytes_transferred;
@@ -259,7 +259,7 @@ int ram_load(QEMUFile *f, void *opaque, int version_id)
  addr &= TARGET_PAGE_MASK;

  if (flags & RAM_SAVE_FLAG_MEM_SIZE) {
-if (addr != last_ram_offset) {
+if (addr != ram_list.last_offset) {
  return -EINVAL;
  }
  }
diff --git a/cpu-all.h b/cpu-all.h
index 77eaf85..e31c2de 100644
--- a/cpu-all.h
+++ b/cpu-all.h
@@ -859,9 +859,21 @@ target_phys_addr_t cpu_get_phys_page_debug(CPUState *env, 
target_ulong addr);
  /* memory API */

  extern int phys_ram_fd;
-extern uint8_t *phys_ram_dirty;
  extern ram_addr_t ram_size;
-extern ram_addr_t last_ram_offset;
+
+typedef struct RAMBlock {
+uint8_t *host;
+ram_addr_t offset;
+ram_addr_t length;
+QLIST_ENTRY(RAMBlock) next;
+} RAMBlock;
+
+typedef struct RAMList {
+uint8_t *phys_dirty;
+ram_addr_t last_offset;
+QLIST_HEAD(ram, RAMBlock) blocks;
+} RAMList;
+extern RAMList ram_list;

  extern const char *mem_path;
  extern int mem_prealloc;
@@ -891,29 +903,29 @@ extern int mem_prealloc;
  /* read dirty bit (return 0 or 1) */
  static inline int cpu_physical_memory_is_dirty(ram_addr_t addr)
  {
-return phys_ram_dirty[addr >> TARGET_PAGE_BITS] == 0xff;
+return ram_list.phys_dirty[addr >> TARGET_PAGE_BITS] == 0xff;
  }

  static inline int cpu_physical_memory_get_dirty_flags(ram_addr_t addr)
  {
-return phys_ram_dirty[addr >> TARGET_PAGE_BITS];
+return ram_list.phys_dirty[addr >> TARGET_PAGE_BITS];
  }

  static inline int cpu_physical_memory_get_dirty(ram_addr_t addr,
  int dirty_flags)
  {
-return phys_ram_dirty[addr >> TARGET_PAGE_BITS] & dirty_flags;
+return ram_list.phys_dirty[addr >> TARGET_PAGE_BITS] & dirty_flags;
  }

  static inline void cpu_physical_memory_set_dirty(ram_addr_t addr)
  {
-phys_ram_dirty[addr >> TARGET_PAGE_BITS] = 0xff;
+ram_list.phys_dirty[addr >> TARGET_PAGE_BITS] = 0xff;
  }

  static inline int cpu_physical_memory_set_dirty_flags(ram_addr_t addr,
int dirty_flags)
  {
-return phys_ram_dirty[addr >> TARGET_PAGE_BITS] |= dirty_flags;
+return ram_list.phys_dirty[addr >> TARGET_PAGE_BITS] |= dirty_flags;
  }

  static 
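The byte-per-page dirty bitmap manipulated by the inline helpers above can be sketched in Python (purely illustrative: the names and the 0xff all-dirty convention follow the patch; the page size and the MIGRATION_DIRTY_FLAG value are assumptions):

```python
TARGET_PAGE_BITS = 12        # assumed 4 KiB pages
MIGRATION_DIRTY_FLAG = 0x08  # assumed flag value, one bit per dirty client

class RAMList:
    """One dirty byte per guest page; 0xff means dirty for every client."""
    def __init__(self, num_pages):
        self.phys_dirty = bytearray([0xff] * num_pages)

    def is_dirty(self, addr):
        # Mirrors cpu_physical_memory_is_dirty(): dirty for all clients.
        return self.phys_dirty[addr >> TARGET_PAGE_BITS] == 0xff

    def get_dirty(self, addr, dirty_flags):
        # Mirrors cpu_physical_memory_get_dirty(): test specific flag bits.
        return self.phys_dirty[addr >> TARGET_PAGE_BITS] & dirty_flags

    def set_dirty(self, addr):
        # Mirrors cpu_physical_memory_set_dirty(): mark dirty for everyone.
        self.phys_dirty[addr >> TARGET_PAGE_BITS] = 0xff

ram = RAMList(4)
page2 = 2 << TARGET_PAGE_BITS      # byte address of page 2
ram.phys_dirty[2] = 0              # clean the page
assert not ram.get_dirty(page2, MIGRATION_DIRTY_FLAG)
ram.set_dirty(page2)
assert ram.is_dirty(page2)
```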

Re: [Qemu-devel] [RFC PATCH 1/5] qdev: Create qdev_get_dev_path()

2010-06-14 Thread Alex Williamson
On Mon, 2010-06-14 at 18:00 +0200, Jan Kiszka wrote:
 Alex Williamson wrote:
  On Mon, 2010-06-14 at 14:09 +0100, Paul Brook wrote:
  /main-system-bus/pci.0,addr=09.0/virtio-blk-pci
  There's a device missing between the main system bus and the pci bus.  
  Should 
  be something like:
 
  /main-system-bus/piix4-pcihost/pci.0/_09.0
  
  Ok, I can easily come up with:
  
  /System/main-system-bus/i440FX-pcihost/PCI/pci.0/_09.0/virtio-blk-pci/virtio-blk
 
 First two elements are redundant, '/' already stands for the main system
 bus.

Ok, we can start printing after the system bus.

  Then I don't get what last element expresses (the target device is
 virtio-blk-pci). Is this due to some vmstate layout? But that should not
be part of qdev paths (maybe I'm missing your use case).

Sorry, I wasn't clear about that.  My printf is in the savevm code,
after it's already appended the idstr passed in from the device.  The
device path in the example above ends at virtio-blk-pci.  That's the
dev->info->name of the device passed into this function.

 And instead of introducing another hierarchy level with the bus address,
 I would also prefer to add this as prefix or suffix to the device name,
 e.g. driver.busaddr.

That's what I had started with.  The first post in this thread has
pci.0,addr=09.0 as a single hierarchy level.  The addr= may be
unnecessary there, but I also prefer something along those lines.  For
PCI it'd make sense to have name:addr, which comes out to
pci.0:09.0.  (Maybe rather than flagging properties as being relevant
to the path and printing them generically, we should extract specific
properties based on the bus type.)

  For busses that don't have a consistent addressing scheme then some sort 
  of 
  instance ID is unavoidable. I guess it may be possible to invent something 
  based on other device properties (e.g. address of the first IO port/memory 
  region).
  
  Instance IDs aren't always bad, we just run into trouble with hotplug
  and maybe creating unique ramblock names.  But, such busses typically
  don't support hotplug and don't have multiple instances of the same
  device on the bus, so I don't expect us to hit many issues there as long
  as we can get a stable address path.  Thanks,
  
 
 If stable instance numbers are required, we could simply keep them in
 DeviceState and search for the smallest free one on additions. Actually,
 I'm more in favor of this direction than including the bus address. That
 way we could keep a canonical format across all buses and do not have to
 provide possibly complex ID generation rules for each of them.

I started down that path, but it still breaks for hotplug.  If we start
a VM with two e1000 NICs, then remove the first, we can no longer
migrate because the target can't represent having a single e1000 with a
non-zero instance ID.

 BTW, the problem isn't truly solved by printing paths. We also need to
 parse them. It would be counterproductive if such paths ever see the
 light of day (ie. leak outside vmstate) and a user tries to hack it
 into the monitor or pass it on the command line. With my series, I'm
 currently able to process paths like this one:
 
 /i440FX-pcihost.0/pci.0/PIIX4_PM.0/i2c/smbus-eeprom.6

Yes, these are only intended for internal use, but we should get them to
sync up as much as possible.  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Graphics PCI Passthrough support...

2010-06-14 Thread Parker, James S


All,

Are there any plans for PCI Passthrough support for Graphics within KVM, 
similar to current Xen support?

Regards,

Scott Parker 



Re: [Qemu-devel] [RFC PATCH 1/5] qdev: Create qdev_get_dev_path()

2010-06-14 Thread Jan Kiszka
Alex Williamson wrote:
 On Mon, 2010-06-14 at 18:00 +0200, Jan Kiszka wrote:
 Alex Williamson wrote:
 On Mon, 2010-06-14 at 14:09 +0100, Paul Brook wrote:
 /main-system-bus/pci.0,addr=09.0/virtio-blk-pci
 There's a device missing between the main system bus and the pci bus.  
 Should 
 be something like:

 /main-system-bus/piix4-pcihost/pci.0/_09.0
 Ok, I can easily come up with:

 /System/main-system-bus/i440FX-pcihost/PCI/pci.0/_09.0/virtio-blk-pci/virtio-blk
 First two elements are redundant, '/' already stands for the main system
 bus.
 
 Ok, we can start printing after the system bus.
 
  Then I don't get what last element expresses (the target device is
 virtio-blk-pci). Is this due to some vmstate layout? But that should not
 be part of qdev paths (maybe I'm missing your use case).
 
 Sorry, I wasn't clear about that.  My printf is in the savevm code,
 after it's already appended the idstr passed in from the device.  The
 device path in the example above ends at virtio-blk-pci.  That's the
 dev->info->name of the device passed into this function.
 
 And instead of introducing another hierarchy level with the bus address,
 I would also prefer to add this as prefix or suffix to the device name,
 e.g. driver.busaddr.
 
 That's what I had started with.  The first post in this thread has
 pci.0,addr=09.0 as a single hierarchy level.  The addr= may be
 unnecessary there, but I also prefer something along those lines.  For
 PCI it'd make sense to have name:addr, which comes out to
 pci.0:09.0.  (Maybe rather than flagging properties as being relevant
 to the path and printing them generically, we should extract specific
 properties based on the bus type.)

Not bus.addr, driver.addr. We only have one PCI bus here, not as many as
there are slots on that bus.

 
 For busses that don't have a consistent addressing scheme then some sort 
 of 
 instance ID is unavoidable. I guess it may be possible to invent something 
 based on other device properties (e.g. address of the first IO port/memory 
 region).
 Instance IDs aren't always bad, we just run into trouble with hotplug
 and maybe creating unique ramblock names.  But, such busses typically
 don't support hotplug and don't have multiple instances of the same
 device on the bus, so I don't expect us to hit many issues there as long
 as we can get a stable address path.  Thanks,

 If stable instance numbers are required, we could simply keep them in
 DeviceState and search for the smallest free one on additions. Actually,
 I'm more in favor of this direction than including the bus address. That
 way we could keep a canonical format across all buses and do not have to
 provide possibly complex ID generation rules for each of them.
 
 I started down that path, but it still breaks for hotplug.  If we start
 a VM with two e1000 NICs, then remove the first, we can no longer
 migrate because the target can't represent having a single e1000 with a
 non-zero instance ID.

That's indeed a good point.

Still, I'm worried about having to define numbering schemes for all the
buses qemu supports. Maybe we can run a mixture: address-based for
hotplug-capable buses (they tend to be cooperative in this regard) and
simple instance numbers for the rest, likely the majority.

 
 BTW, the problem isn't truly solved by printing paths. We also need to
 parse them. It would be counterproductive if such paths ever see the
 light of day (ie. leak outside vmstate) and a user tries to hack it
 into the monitor or pass it on the command line. With my series, I'm
 currently able to process paths like this one:

 /i440FX-pcihost.0/pci.0/PIIX4_PM.0/i2c/smbus-eeprom.6
 
 Yes, these are only intended for internal use, but we should get them to
 sync up as much as possible.  Thanks,

Unless there is a good reason, the match should be 100%.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


[KVM-AUTOTEST PATCH v2 1/3] KVM test: add kvm_monitor.py, an interface to QEMU monitors

2010-06-14 Thread Michael Goldish
This module should replace vm.send_monitor_cmd().  Instead of connecting to the
monitor each time a command is issued, this module maintains a continuous
connection to the monitor.  It disconnects when a test terminates and
reconnects as soon as the next test begins (upon unpickling).

It currently contains only an interface to the human monitor.  A QMP interface
will be introduced in a future patch.

Changes from v1:
- Add name parameter to __init__()
- Remove help() method
- Rename help attribute to _help_str to indicate private use
- Rename lock to _lock
- Rename socket to _socket

Signed-off-by: Michael Goldish mgold...@redhat.com
---
 client/tests/kvm/kvm_monitor.py |  354 +++
 1 files changed, 354 insertions(+), 0 deletions(-)
 create mode 100644 client/tests/kvm/kvm_monitor.py

diff --git a/client/tests/kvm/kvm_monitor.py b/client/tests/kvm/kvm_monitor.py
new file mode 100644
index 000..27045a4
--- /dev/null
+++ b/client/tests/kvm/kvm_monitor.py
@@ -0,0 +1,354 @@
+"""
+Interfaces to the QEMU monitor.
+
+@copyright: 2008-2010 Red Hat Inc.
+"""
+
+import socket, time, threading, logging
+import kvm_utils
+
+
+class MonitorError(Exception):
+pass
+
+
+class MonitorConnectError(MonitorError):
+pass
+
+
+class MonitorSendError(MonitorError):
+pass
+
+
+class MonitorLockError(MonitorError):
+pass
+
+
+class MonitorProtocolError(MonitorError):
+pass
+
+
+class Monitor:
+"""
+Common code for monitor classes.
+"""
+
+def __init__(self, name, filename):
+"""
+Initialize the instance.
+
+@param name: Monitor identifier (a string)
+@param filename: Monitor socket filename
+@raise MonitorConnectError: Raised if the connection fails
+"""
+self.name = name
+self.filename = filename
+self._lock = threading.RLock()
+self._socket = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
+self._socket.setblocking(False)
+
+try:
+self._socket.connect(filename)
+except socket.error:
+raise MonitorConnectError("Could not connect to monitor socket")
+
+
+def __del__(self):
+# Automatically close the connection when the instance is garbage
+# collected
+try:
+self._socket.shutdown(socket.SHUT_RDWR)
+except socket.error:
+pass
+self._socket.close()
+
+
+# The following two functions are defined to make sure the state is set
+# exclusively by the constructor call as specified in __getinitargs__().
+
+def __getstate__(self):
+pass
+
+
+def __setstate__(self, state):
+pass
+
+
+def __getinitargs__(self):
+# Save some information when pickling -- will be passed to the
+# constructor upon unpickling
+return self.name, self.filename, True
+
+
+def _acquire_lock(self, timeout=20):
+end_time = time.time() + timeout
+while time.time() < end_time:
+if self._lock.acquire(False):
+return True
+time.sleep(0.05)
+return False
+
+
+def _recvall(self):
+s = ""
+while True:
+try:
+data = self._socket.recv(1024)
+except socket.error:
+break
+if not data:
+break
+s += data
+return s
+
+
+class HumanMonitor(Monitor):
+"""
+Wraps human monitor commands.
+"""
+
+def __init__(self, name, filename, suppress_exceptions=False):
+"""
+Connect to the monitor socket and find the (qemu) prompt.
+
+@param name: Monitor identifier (a string)
+@param filename: Monitor socket filename
+@raise MonitorConnectError: Raised if the connection fails and
+suppress_exceptions is False
+@raise MonitorProtocolError: Raised if the initial (qemu) prompt isn't
+found and suppress_exceptions is False
+@note: Other exceptions may be raised.  See _get_command_output's
+docstring.
+"""
+try:
+Monitor.__init__(self, name, filename)
+
+self.protocol = "human"
+
+# Find the initial (qemu) prompt
+s, o = self._read_up_to_qemu_prompt(20)
+if not s:
+raise MonitorProtocolError("Could not find (qemu) prompt "
+   "after connecting to monitor. "
+   "Output so far: %r" % o)
+
+# Save the output of 'help' for future use
+self._help_str = self._get_command_output("help")
+
+except MonitorError, e:
+if suppress_exceptions:
+logging.warn(e)
+else:
+raise
+
+
+# Private methods
+
+def _read_up_to_qemu_prompt(self, timeout=20):
+o = ""
+end_time = time.time() + timeout
+while time.time() < end_time:
+try:
+ 
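The lock-polling pattern used by `_acquire_lock()` in the patch above can be exercised standalone (an illustrative sketch; the timeout and poll interval here are arbitrary):

```python
import threading, time

def acquire_with_timeout(lock, timeout=2.0, poll=0.05):
    """Poll a non-blocking acquire until it succeeds or the timeout
    expires, mirroring Monitor._acquire_lock() from the patch above."""
    end_time = time.time() + timeout
    while time.time() < end_time:
        if lock.acquire(False):
            return True
        time.sleep(poll)
    return False

lock = threading.RLock()
assert acquire_with_timeout(lock)        # free lock: acquired immediately
lock.release()

other = threading.Lock()
other.acquire()                          # a plain Lock held elsewhere
assert not acquire_with_timeout(other, timeout=0.2)  # polls, then gives up
```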

[KVM-AUTOTEST PATCH v2 2/3] KVM test: use new monitor interface

2010-06-14 Thread Michael Goldish
- Add new monitor definition syntax that allows definition of multiple monitors.
  Monitors are now defined like other objects in the config file:

  monitors = MyMonitor SomeOtherMonitor YetAnotherMonitor  # defines 3 monitors
  monitor_type = human                    # default for all monitors
  monitor_type_SomeOtherMonitor = qmp     # applies only to SomeOtherMonitor
  monitor_type_YetAnotherMonitor = qmp    # applies only to YetAnotherMonitor
  main_monitor = MyMonitor                # defines the main monitor to use
                                          # in the test

- Use the new syntax in tests_base.cfg.sample.

- Establish monitor connections using kvm_monitor in VM.create().
  Store all monitors in self.monitors.  Store main monitor in self.monitor.

- Replace calls to send_monitor_cmd() with appropriate calls to methods of
  self.monitor (the main monitor).

- For now, ignore the parameter screendump_verbose because currently monitor
  commands are always silent (when successful).

Changes from v1:
- Turn VM.monitor into a property

Signed-off-by: Michael Goldish mgold...@redhat.com
---
 client/tests/kvm/kvm_preprocessing.py  |   33 ++--
 client/tests/kvm/kvm_test_utils.py |   35 +--
 client/tests/kvm/kvm_vm.py |  236 +---
 client/tests/kvm/tests/balloon_check.py|   12 +-
 client/tests/kvm/tests/boot_savevm.py  |   41 ++--
 client/tests/kvm/tests/ksm_overcommit.py   |8 +-
 client/tests/kvm/tests/pci_hotplug.py  |   13 +-
 client/tests/kvm/tests/physical_resources_check.py |   40 ++--
 client/tests/kvm/tests/shutdown.py |2 +-
 client/tests/kvm/tests/stepmaker.py|   44 ++--
 client/tests/kvm/tests/steps.py|   12 +-
 client/tests/kvm/tests_base.cfg.sample |7 +-
 12 files changed, 241 insertions(+), 242 deletions(-)

diff --git a/client/tests/kvm/kvm_preprocessing.py b/client/tests/kvm/kvm_preprocessing.py
index 76c8268..ee3e9b2 100644
--- a/client/tests/kvm/kvm_preprocessing.py
+++ b/client/tests/kvm/kvm_preprocessing.py
@@ -1,7 +1,7 @@
 import sys, os, time, commands, re, logging, signal, glob, threading, shutil
 from autotest_lib.client.bin import test, utils
 from autotest_lib.client.common_lib import error
-import kvm_vm, kvm_utils, kvm_subprocess, ppm_utils
+import kvm_vm, kvm_utils, kvm_subprocess, kvm_monitor, ppm_utils
 try:
 import PIL.Image
 except ImportError:
@@ -83,7 +83,11 @@ def preprocess_vm(test, params, env, name):
 raise error.TestError(Could not start VM)
 
 scrdump_filename = os.path.join(test.debugdir, "pre_%s.ppm" % name)
-vm.send_monitor_cmd("screendump %s" % scrdump_filename)
+try:
+if vm.monitor:
+vm.monitor.screendump(scrdump_filename)
+except kvm_monitor.MonitorError, e:
+logging.warn(e)
 
 
 def postprocess_image(test, params):
@@ -117,7 +121,11 @@ def postprocess_vm(test, params, env, name):
 return
 
 scrdump_filename = os.path.join(test.debugdir, "post_%s.ppm" % name)
-vm.send_monitor_cmd("screendump %s" % scrdump_filename)
+try:
+if vm.monitor:
+vm.monitor.screendump(scrdump_filename)
+except kvm_monitor.MonitorError, e:
+logging.warn(e)
 
 if params.get("kill_vm") == "yes":
 kill_vm_timeout = float(params.get("kill_vm_timeout", 0))
@@ -356,8 +364,9 @@ def postprocess(test, params, env):
 for vm in kvm_utils.env_get_all_vms(env):
 if not vm.is_dead():
 logging.info("VM '%s' is alive.", vm.name)
-logging.info("The monitor unix socket of '%s' is: %s",
- vm.name, vm.monitor_file_name)
+for m in vm.monitors:
+logging.info("'%s' has a %s monitor unix socket at: %s",
+ vm.name, m.protocol, m.filename)
 logging.info("The command line used to start '%s' was:\n%s",
  vm.name, vm.make_qemu_command())
 raise error.JobError("Abort requested (%s)" % exc_string)
@@ -403,10 +412,6 @@ def _take_screendumps(test, params, env):
  kvm_utils.generate_random_string(6))
 delay = float(params.get("screendump_delay", 5))
 quality = int(params.get("screendump_quality", 30))
-if params.get("screendump_verbose") == 'yes':
-screendump_verbose = True
-else:
-screendump_verbose = False
 
 cache = {}
 
@@ -414,11 +419,11 @@ def _take_screendumps(test, params, env):
 for vm in kvm_utils.env_get_all_vms(env):
 if vm.is_dead():
 continue
-if screendump_verbose:
-vm.send_monitor_cmd("screendump %s" % temp_filename)
-else:
-vm.send_monitor_cmd("screendump %s" % temp_filename,
-verbose=False)
+try:
+ 

[KVM-AUTOTEST PATCH v2 3/3] KVM test: kvm_monitor.py: add QMP interface

2010-06-14 Thread Michael Goldish
An initial QMP client implementation.
Should be fully functional and supports asynchronous events.
However, most tests must be modified to support it, because it returns output
in a different format from the human monitor (the human monitor returns strings
and the QMP one returns dicts or lists).

To enable QMP, set main_monitor to a monitor whose monitor_type is qmp.

For example (a single QMP monitor):

monitors = monitor1
monitor_type_monitor1 = qmp
main_monitor = monitor1

Another example (multiple monitors, both human and QMP):

monitors = MyMonitor SomeOtherMonitor YetAnotherMonitor  # defines 3 monitors
monitor_type = human                    # default for all monitors
monitor_type_SomeOtherMonitor = qmp     # applies only to SomeOtherMonitor
monitor_type_YetAnotherMonitor = qmp    # applies only to YetAnotherMonitor
main_monitor = SomeOtherMonitor         # the main monitor is a QMP one, so
                                        # the test will use QMP

Note:
Monitor methods now raise exceptions such as MonitorLockError and QMPCmdError.
If this turns out to be a bad idea, it shouldn't be hard to revert to the old
convention of returning a (status, output) tuple.

Changes from v1:
- Upon connection make sure the QMP greeting is received (otherwise raise an
  exception)
- Upon connection make sure the json module is available
- Rename the events attribute to _events to indicate private use
- Use _lock instead of lock
- Use _socket instead of socket

Signed-off-by: Michael Goldish mgold...@redhat.com
---
 client/tests/kvm/kvm_monitor.py |  305 +++
 client/tests/kvm/kvm_vm.py  |8 +-
 2 files changed, 310 insertions(+), 3 deletions(-)

diff --git a/client/tests/kvm/kvm_monitor.py b/client/tests/kvm/kvm_monitor.py
index 27045a4..12cf773 100644
--- a/client/tests/kvm/kvm_monitor.py
+++ b/client/tests/kvm/kvm_monitor.py
@@ -6,6 +6,11 @@ Interfaces to the QEMU monitor.
 
 import socket, time, threading, logging
 import kvm_utils
+try:
+import json
+except ImportError:
+logging.warning("Could not import json module. "
+"QMP monitor functionality disabled.")
 
 
 class MonitorError(Exception):
@@ -28,6 +33,14 @@ class MonitorProtocolError(MonitorError):
 pass
 
 
+class MonitorNotSupportedError(MonitorError):
+pass
+
+
+class QMPCmdError(MonitorError):
+pass
+
+
 class Monitor:
 
 Common code for monitor classes.
@@ -352,3 +365,295 @@ class HumanMonitor(Monitor):
 @return: The command's output
 """
 return self._get_command_output("mouse_button %d" % state)
+
+
+class QMPMonitor(Monitor):
+"""
+Wraps QMP monitor commands.
+"""
+
+def __init__(self, name, filename, suppress_exceptions=False):
+"""
+Connect to the monitor socket, read the greeting message and issue the
+qmp_capabilities command.  Also make sure the json module is available.
+
+@param name: Monitor identifier (a string)
+@param filename: Monitor socket filename
+@raise MonitorConnectError: Raised if the connection fails and
+suppress_exceptions is False
+@raise MonitorProtocolError: Raised if no QMP greeting message is
+received and suppress_exceptions is False
+@raise MonitorNotSupportedError: Raised if json isn't available and
+suppress_exceptions is False
+@note: Other exceptions may be raised if the qmp_capabilities command
+fails.  See _get_command_output's docstring.
+"""
+try:
+Monitor.__init__(self, name, filename)
+
+self.protocol = "qmp"
+self._greeting = None
+self._events = []
+
+# Make sure json is available
+try:
+json
+except NameError:
+raise MonitorNotSupportedError("QMP requires the json module "
+   "(Python 2.6 and up)")
+
+# Read greeting message
+end_time = time.time() + 20
+while time.time() < end_time:
+for obj in self._read_objects():
+if "QMP" in obj:
+self._greeting = obj["QMP"]
+break
+if self._greeting:
+break
+time.sleep(0.1)
+else:
+raise MonitorProtocolError("No QMP greeting message received")
+
+# Issue qmp_capabilities
+self._get_command_output("qmp_capabilities")
+
+except MonitorError, e:
+if suppress_exceptions:
+logging.warn(e)
+else:
+raise
+
+
+# Private methods
+
+def _build_cmd(self, cmd, args=None):
+obj = {"execute": cmd}
+if args:
+obj["arguments"] = args
+return obj
+
+
+def _read_objects(self, timeout=5):
+
+Read 
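The command object built by `_build_cmd()` above goes over the QMP wire as a JSON object with an "execute" key and optional "arguments". A standalone sketch (no live QEMU socket involved; the balloon command and its value are just an example):

```python
import json

def build_cmd(cmd, args=None):
    # Mirror QMPMonitor._build_cmd() from the patch above.
    obj = {"execute": cmd}
    if args:
        obj["arguments"] = args
    return obj

# Serialize as it would be sent to the QMP socket, then decode as the
# monitor would on receipt.
line = json.dumps(build_cmd("balloon", {"value": 512 * 1024 * 1024}))
decoded = json.loads(line)
assert decoded["execute"] == "balloon"
assert decoded["arguments"]["value"] == 536870912
```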

Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-14 Thread Balbir Singh
* Dave Hansen d...@linux.vnet.ibm.com [2010-06-14 08:12:56]:

 On Mon, 2010-06-14 at 14:18 +0530, Balbir Singh wrote:
  1. A slab page will not be freed until the entire page is free (all
  slabs have been kfree'd so to speak). Normal reclaim will definitely
  free this page, but a lot of it depends on how frequently we are
  scanning the LRU list and when this page got added.
 
 You don't have to be freeing entire slab pages for the reclaim to have
 been useful.  You could just be making space so that _future_
 allocations fill in the slab holes you just created.  You may not be
 freeing pages, but you're reducing future system pressure.
 
 If unmapped page cache is the easiest thing to evict, then it should be
 the first thing that goes when a balloon request comes in, which is the
 case this patch is trying to handle.  If it isn't the easiest thing to
 evict, then we _shouldn't_ evict it.


Like I said earlier, a lot of that works correctly as you said, but it
is also an idealization. If you've got duplicate pages and you know
that they are duplicated and can be retrieved at a lower cost, why
wouldn't we go after them first?

-- 
Three Cheers,
Balbir


Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-14 Thread Dave Hansen
On Mon, 2010-06-14 at 22:28 +0530, Balbir Singh wrote:
 If you've got duplicate pages and you know
 that they are duplicated and can be retrieved at a lower cost, why
 wouldn't we go after them first?

I agree with this in theory.  But, the guest lacks the information about
what is truly duplicated and what the costs are for itself and/or the
host to recreate it.  Unmapped page cache may be the best proxy that
we have at the moment for easy to recreate, but I think it's still too
poor a match to make these patches useful.

-- Dave



[PATCH v2 0/3]: Fixes to IRQ routing

2010-06-14 Thread Chris Lalancette
As we've discussed previously, here is a series of patches to
fix some of the IRQ routing issues we have in KVM.  With this series
in place I was able to successfully kdump a RHEL-5 64-bit, and RHEL-6
32- and 64-bit guest on CPU's other than the BSP.  RHEL-5 32-bit kdump still
does not work; it gets stuck on "Checking 'hlt' instruction".  However,
it does that both before and after this series, so there is something
else going on there that I still have to debug.

I also need to change the kvm_migrate_pit_timer function to migrate the
timer over to the last CPU that handled the timer interrupt, on the
theory that that particular CPU is likely to handle the timer interrupt again
in the near future.  However, this is an optimization and shouldn't delay the
inclusion of the rest of the series for correctness.

Changes since RFC:
 - Changed ps->inject_lock from raw_spinlock_t to spinlock_t
 - Fixed up some formatting issues
 - Changed to have one PIT workqueue per-guest
 - Remember to cancel_work_sync when destroying the PIT

Changes since v1:
 - Call cancel_work_sync everywhere we call hrtimer_cancel
 - Bring back the reinjection logic
 - Fix up formatting issues from checkpatch



[PATCH v2 1/3] Introduce a workqueue to deliver PIT timer interrupts.

2010-06-14 Thread Chris Lalancette
We really want to kvm_set_irq during the hrtimer callback,
but that is risky because that is during interrupt context.
Instead, offload the work to a workqueue, which is a bit safer
and should provide most of the same functionality.
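The defer-from-callback pattern described above (the hrtimer callback may not block, so it only queues work; a worker running in process context injects the interrupt later) can be illustrated with a plain Python analogy (purely illustrative, not kernel code):

```python
import queue, threading

work_q = queue.Queue()
injected = []

def worker():
    # Stand-in for pit_do_work(): runs in process context,
    # where it is safe to take sleeping locks and call kvm_set_irq().
    while True:
        item = work_q.get()
        if item is None:
            break
        injected.append("kvm_set_irq(%s)" % item)

def timer_callback(tick):
    # Stand-in for pit_timer_fn(): runs in "interrupt context",
    # so it must not block -- it only queues the expiry for the worker.
    work_q.put(tick)

t = threading.Thread(target=worker)
t.start()
for tick in range(3):
    timer_callback(tick)
work_q.put(None)   # stop the worker
t.join()
assert injected == ["kvm_set_irq(0)", "kvm_set_irq(1)", "kvm_set_irq(2)"]
```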

Signed-off-by: Chris Lalancette clala...@redhat.com
---
 arch/x86/kvm/i8254.c |  125 -
 arch/x86/kvm/i8254.h |4 +-
 arch/x86/kvm/irq.c   |1 -
 3 files changed, 74 insertions(+), 56 deletions(-)

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index 188d827..3bed8ac 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -34,6 +34,7 @@
 
 #include <linux/kvm_host.h>
 #include <linux/slab.h>
+#include <linux/workqueue.h>
 
 #include "irq.h"
 #include "i8254.h"
@@ -244,11 +245,11 @@ static void kvm_pit_ack_irq(struct kvm_irq_ack_notifier *kian)
 {
struct kvm_kpit_state *ps = container_of(kian, struct kvm_kpit_state,
 irq_ack_notifier);
-   raw_spin_lock(&ps->inject_lock);
    if (atomic_dec_return(&ps->pit_timer.pending) < 0)
        atomic_inc(&ps->pit_timer.pending);
    ps->irq_ack = 1;
-   raw_spin_unlock(&ps->inject_lock);
+   spin_lock(&ps->inject_lock);
+   spin_unlock(&ps->inject_lock);
 }
 
 void __kvm_migrate_pit_timer(struct kvm_vcpu *vcpu)
@@ -267,7 +268,8 @@ void __kvm_migrate_pit_timer(struct kvm_vcpu *vcpu)
 static void destroy_pit_timer(struct kvm_timer *pt)
 {
    pr_debug("execute del timer!\n");
-   hrtimer_cancel(&pt->timer);
+   if (hrtimer_cancel(&pt->timer))
+   cancel_work_sync(&pt->kvm->arch.vpit->expired);
 }
 
 static bool kpit_is_periodic(struct kvm_timer *ktimer)
@@ -281,6 +283,58 @@ static struct kvm_timer_ops kpit_ops = {
.is_periodic = kpit_is_periodic,
 };
 
+static void pit_do_work(struct work_struct *work)
+{
+   struct kvm_pit *pit = container_of(work, struct kvm_pit, expired);
+   struct kvm *kvm = pit->kvm;
+   struct kvm_vcpu *vcpu;
+   int i;
+   struct kvm_kpit_state *ps = &pit->pit_state;
+   int inject = 0;
+
+   /* Try to inject pending interrupts when
+* last one has been acked.
+*/
+   spin_lock(&ps->inject_lock);
+   if (ps->irq_ack) {
+   ps->irq_ack = 0;
+   inject = 1;
+   }
+   spin_unlock(&ps->inject_lock);
+   if (inject) {
+   kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 1);
+   kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 0);
+
+   /*
+* Provides NMI watchdog support via Virtual Wire mode.
+* The route is: PIT -> PIC -> LVT0 in NMI mode.
+*
+* Note: Our Virtual Wire implementation is simplified, only
+* propagating PIT interrupts to all VCPUs when they have set
+* LVT0 to NMI delivery. Other PIC interrupts are just sent to
+* VCPU0, and only if its LVT0 is in EXTINT mode.
+*/
+   if (kvm->arch.vapics_in_nmi_mode > 0)
+   kvm_for_each_vcpu(i, vcpu, kvm)
+   kvm_apic_nmi_wd_deliver(vcpu);
+   }
+}
+
+static enum hrtimer_restart pit_timer_fn(struct hrtimer *data)
+{
+   struct kvm_timer *ktimer = container_of(data, struct kvm_timer, timer);
+   struct kvm_pit *pt = ktimer->kvm->arch.vpit;
+
+   if (ktimer->reinject)
+   queue_work(pt->wq, &pt->expired);
+
+   if (ktimer->t_ops->is_periodic(ktimer)) {
+   hrtimer_add_expires_ns(&ktimer->timer, ktimer->period);
+   return HRTIMER_RESTART;
+   } else
+   return HRTIMER_NORESTART;
+}
+
 static void create_pit_timer(struct kvm_kpit_state *ps, u32 val, int is_period)
 {
    struct kvm_timer *pt = &ps->pit_timer;
@@ -291,14 +345,14 @@ static void create_pit_timer(struct kvm_kpit_state *ps, u32 val, int is_period)
    pr_debug("create pit timer, interval is %llu nsec\n", interval);
 
/* TODO The new value only affected after the retriggered */
-   hrtimer_cancel(&pt->timer);
+   if (hrtimer_cancel(&pt->timer))
+   cancel_work_sync(&pt->kvm->arch.vpit->expired);
    pt->period = interval;
    ps->is_periodic = is_period;
 
-   pt->timer.function = kvm_timer_fn;
+   pt->timer.function = pit_timer_fn;
    pt->t_ops = &kpit_ops;
    pt->kvm = ps->pit->kvm;
-   pt->vcpu = pt->kvm->bsp_vcpu;
 
    atomic_set(&pt->pending, 0);
    ps->irq_ack = 1;
@@ -626,7 +680,14 @@ struct kvm_pit *kvm_create_pit(struct kvm *kvm, u32 flags)
 
    mutex_init(&pit->pit_state.lock);
    mutex_lock(&pit->pit_state.lock);
-   raw_spin_lock_init(&pit->pit_state.inject_lock);
+   spin_lock_init(&pit->pit_state.inject_lock);
+
+   pit->wq = create_singlethread_workqueue("kvm-pit-wq");
+   if (!pit->wq) {
+   kfree(pit);
+   return NULL;
+   }
+   INIT_WORK(&pit->expired, pit_do_work);
 
    kvm->arch.vpit = 

[PATCH v2 2/3] Allow any LAPIC to accept PIC interrupts.

2010-06-14 Thread Chris Lalancette
If the guest wants to accept timer interrupts on a CPU other
than the BSP, we need to remove this gate.

Signed-off-by: Chris Lalancette clala...@redhat.com
---
 arch/x86/kvm/lapic.c |   12 +---
 1 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index d8258a0..ee0f76c 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1107,13 +1107,11 @@ int kvm_apic_accept_pic_intr(struct kvm_vcpu *vcpu)
    u32 lvt0 = apic_get_reg(vcpu->arch.apic, APIC_LVT0);
int r = 0;
 
-   if (kvm_vcpu_is_bsp(vcpu)) {
-   if (!apic_hw_enabled(vcpu->arch.apic))
-   r = 1;
-   if ((lvt0 & APIC_LVT_MASKED) == 0 &&
-   GET_APIC_DELIVERY_MODE(lvt0) == APIC_MODE_EXTINT)
-   r = 1;
-   }
+   if (!apic_hw_enabled(vcpu->arch.apic))
+   r = 1;
+   if ((lvt0 & APIC_LVT_MASKED) == 0 &&
+   GET_APIC_DELIVERY_MODE(lvt0) == APIC_MODE_EXTINT)
+   r = 1;
return r;
 }
 
-- 
1.6.6.1



[PATCH v2 3/3] In DM_LOWEST, only deliver interrupts to vcpus with enabled LAPIC's

2010-06-14 Thread Chris Lalancette
Otherwise we might try to deliver a timer interrupt to a cpu that
can't possibly handle it.
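The lowest-priority arbitration after this fix can be sketched in Python (illustrative only; vcpus are modeled as (name, lapic_enabled, priority) tuples, and as in the patch, vcpus with a disabled LAPIC are skipped entirely):

```python
def pick_lowest_priority(vcpus):
    """Pick the DM_LOWEST delivery target: the enabled LAPIC with the
    lowest priority, mirroring the kvm_lapic_enabled() check added above."""
    lowest = None
    for name, enabled, prio in vcpus:
        if not enabled:
            continue                      # never deliver to a disabled LAPIC
        if lowest is None or prio < lowest[1]:
            lowest = (name, prio)
    return lowest[0] if lowest else None

vcpus = [("vcpu0", False, 0),   # disabled: skipped even though prio is lowest
         ("vcpu1", True, 5),
         ("vcpu2", True, 3)]
assert pick_lowest_priority(vcpus) == "vcpu2"
```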

Signed-off-by: Chris Lalancette clala...@redhat.com
---
 virt/kvm/irq_comm.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/virt/kvm/irq_comm.c b/virt/kvm/irq_comm.c
index 52f412f..06cf61e 100644
--- a/virt/kvm/irq_comm.c
+++ b/virt/kvm/irq_comm.c
@@ -100,7 +100,7 @@ int kvm_irq_delivery_to_apic(struct kvm *kvm, struct kvm_lapic *src,
    if (r < 0)
r = 0;
r += kvm_apic_set_irq(vcpu, irq);
-   } else {
+   } else if (kvm_lapic_enabled(vcpu)) {
if (!lowest)
lowest = vcpu;
    else if (kvm_apic_compare_prio(vcpu, lowest) < 0)
-- 
1.6.6.1



Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-14 Thread Balbir Singh
* Dave Hansen d...@linux.vnet.ibm.com [2010-06-14 10:09:31]:

> On Mon, 2010-06-14 at 22:28 +0530, Balbir Singh wrote:
>> If you've got duplicate pages and you know
>> that they are duplicated and can be retrieved at a lower cost, why
>> wouldn't we go after them first?
>
> I agree with this in theory.  But, the guest lacks the information about
> what is truly duplicated and what the costs are for itself and/or the
> host to recreate it.  Unmapped page cache may be the best proxy that
> we have at the moment for easy to recreate, but I think it's still too
> poor a match to make these patches useful.


That is why the policy (in the next set) will come from the host. As
to whether the data is truly duplicated, my experiments show up to 60%
of the page cache is duplicated. The first patch today is again
enabled by the host. Both of them are expected to be useful in the
cache != none case.

The data I have shows more details including the performance and
overhead.

-- 
Three Cheers,
Balbir


Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-14 Thread Avi Kivity

On 06/14/2010 06:55 PM, Dave Hansen wrote:
> On Mon, 2010-06-14 at 18:44 +0300, Avi Kivity wrote:
>> On 06/14/2010 06:33 PM, Dave Hansen wrote:
>>> At the same time, I see what you're trying to do with this.  It really
>>> can be an alternative to ballooning if we do it right, since ballooning
>>> would probably evict similar pages.  Although it would only work in idle
>>> guests, what about a knob that the host can turn to just get the guest
>>> to start running reclaim?
>> Isn't the knob in this proposal the balloon?  AFAICT, the idea here is
>> to change how the guest reacts to being ballooned, but the trigger
>> itself would not change.
> I think the patch was made on the following assumptions:
> 1. Guests will keep filling their memory with relatively worthless page
> cache that they don't really need.
> 2. When they do this, it hurts the overall system with no real gain for
> anyone.
>
> In the case of a ballooned guest, they _won't_ keep filling memory.  The
> balloon will prevent them.  So, I guess I was just going down the path
> of considering if this would be useful without ballooning in place.  To
> me, it's really hard to justify _with_ ballooning in place.

There are two decisions that need to be made:

- how much memory a guest should be given
- given some guest memory, what's the best use for it

The first question can perhaps be answered by looking at guest I/O rates 
and giving more memory to more active guests.  The second question is 
hard, but not any different than running non-virtualized - except if we 
can detect sharing or duplication.  In this case, dropping a duplicated 
page is worthwhile, while dropping a shared page provides no benefit.

How the patch helps answer either question, I'm not sure.  I don't think 
preferential dropping of unmapped page cache is the answer.

>> My issue is that changing the type of object being preferentially
>> reclaimed just changes the type of workload that would prematurely
>> suffer from reclaim.  In this case, workloads that use a lot of unmapped
>> pagecache would suffer.
>>
>>> btw, aren't /proc/sys/vm/swappiness and vfs_cache_pressure similar knobs?
>> Those tell you how to balance going after the different classes of
>> things that we can reclaim.
>>
>> Again, this is useless when ballooning is being used.  But, I'm thinking
>> of a more general mechanism to force the system to both have MemFree
>> _and_ be acting as if it is under memory pressure.

If there is no memory pressure on the host, there is no reason for the 
guest to pretend it is under pressure.  If there is memory pressure on 
the host, it should share the pain among its guests by applying the 
balloon.  So I don't think voluntarily dropping cache is a good direction.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-14 Thread Balbir Singh
* Avi Kivity a...@redhat.com [2010-06-14 18:34:58]:

> On 06/14/2010 06:12 PM, Dave Hansen wrote:
>> On Mon, 2010-06-14 at 14:18 +0530, Balbir Singh wrote:
>>> 1. A slab page will not be freed until the entire page is free (all
>>> slabs have been kfree'd so to speak). Normal reclaim will definitely
>>> free this page, but a lot of it depends on how frequently we are
>>> scanning the LRU list and when this page got added.
>> You don't have to be freeing entire slab pages for the reclaim to have
>> been useful.  You could just be making space so that _future_
>> allocations fill in the slab holes you just created.  You may not be
>> freeing pages, but you're reducing future system pressure.
>
> Depends.  If you've evicted something that will be referenced soon,
> you're increasing system pressure.

I don't think slab pages care about being referenced soon, they are
either allocated or freed. A page is just a storage unit for the data
structure. A new one can be allocated on demand.
 
 

-- 
Three Cheers,
Balbir


Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-14 Thread Balbir Singh
* Avi Kivity a...@redhat.com [2010-06-14 19:34:00]:

> On 06/14/2010 06:55 PM, Dave Hansen wrote:
>> On Mon, 2010-06-14 at 18:44 +0300, Avi Kivity wrote:
>>> On 06/14/2010 06:33 PM, Dave Hansen wrote:
>>>> At the same time, I see what you're trying to do with this.  It really
>>>> can be an alternative to ballooning if we do it right, since ballooning
>>>> would probably evict similar pages.  Although it would only work in idle
>>>> guests, what about a knob that the host can turn to just get the guest
>>>> to start running reclaim?
>>> Isn't the knob in this proposal the balloon?  AFAICT, the idea here is
>>> to change how the guest reacts to being ballooned, but the trigger
>>> itself would not change.
>> I think the patch was made on the following assumptions:
>> 1. Guests will keep filling their memory with relatively worthless page
>> cache that they don't really need.
>> 2. When they do this, it hurts the overall system with no real gain for
>> anyone.
>>
>> In the case of a ballooned guest, they _won't_ keep filling memory.  The
>> balloon will prevent them.  So, I guess I was just going down the path
>> of considering if this would be useful without ballooning in place.  To
>> me, it's really hard to justify _with_ ballooning in place.
>
> There are two decisions that need to be made:
>
> - how much memory a guest should be given
> - given some guest memory, what's the best use for it
>
> The first question can perhaps be answered by looking at guest I/O
> rates and giving more memory to more active guests.  The second
> question is hard, but not any different than running non-virtualized
> - except if we can detect sharing or duplication.  In this case,
> dropping a duplicated page is worthwhile, while dropping a shared
> page provides no benefit.

I think there is another way of looking at it, given some free memory

1. Can the guest run more applications or run faster
2. Can the host potentially get this memory via ballooning or some
other means to start newer guest instances

I think the answer to 1 and 2 is yes.

> How the patch helps answer either question, I'm not sure.  I don't
> think preferential dropping of unmapped page cache is the answer.

Preferential dropping as selected by the host, which knows about the
setup and whether there is duplication involved. While we use the term
preferential dropping, remember it is still via LRU and we don't
always succeed. It is a best-effort (if you can and the unmapped pages
are not highly referenced) scenario.

>>> My issue is that changing the type of object being preferentially
>>> reclaimed just changes the type of workload that would prematurely
>>> suffer from reclaim.  In this case, workloads that use a lot of unmapped
>>> pagecache would suffer.
>>>
>>>> btw, aren't /proc/sys/vm/swappiness and vfs_cache_pressure similar knobs?
>>> Those tell you how to balance going after the different classes of
>>> things that we can reclaim.
>>>
>>> Again, this is useless when ballooning is being used.  But, I'm thinking
>>> of a more general mechanism to force the system to both have MemFree
>>> _and_ be acting as if it is under memory pressure.
>
> If there is no memory pressure on the host, there is no reason for
> the guest to pretend it is under pressure.  If there is memory
> pressure on the host, it should share the pain among its guests by
> applying the balloon.  So I don't think voluntarily dropping cache
> is a good direction.

There are two situations

1. Voluntarily drop cache, if it was set up to do so (the host knows
that it caches that information anyway)
2. Drop the cache on a special balloon option; again, the host
knows it caches that very same information, so it prefers to free that
up first. 

-- 
Three Cheers,
Balbir


Re: [RFC/T/D][PATCH 2/2] Linux/Guest cooperative unmapped page cache control

2010-06-14 Thread Dave Hansen
On Mon, 2010-06-14 at 19:34 +0300, Avi Kivity wrote:
>> Again, this is useless when ballooning is being used.  But, I'm thinking
>> of a more general mechanism to force the system to both have MemFree
>> _and_ be acting as if it is under memory pressure.
>
> If there is no memory pressure on the host, there is no reason for the 
> guest to pretend it is under pressure.

I can think of quite a few places where this would be beneficial.

Ballooning is dangerous.  I've OOMed quite a few guests by
over-ballooning them.  Anything that's voluntary like this is safer than
things imposed by the host, although you do trade off effectiveness.

If all the guests do this, then it leaves that much more free memory on
the host, which can be used flexibly for extra host page cache, new
guests, etc...  A system in this state where everyone is proactively
keeping their footprints down is more likely to be able to handle load
spikes.  Reclaim is an expensive, costly activity, and this ensures that
we don't have to do that when we're busy doing other things like
handling load spikes.  This was one of the concepts behind CMM2: reduce
the overhead during peak periods.

It's also handy for planning.  Guests exhibiting this behavior will
_act_ as if they're under pressure.  That's a good thing to approximate
how a guest will act when it _is_ under pressure.

> If there is memory pressure on 
> the host, it should share the pain among its guests by applying the 
> balloon.  So I don't think voluntarily dropping cache is a good direction.

I think we're trying to consider things slightly outside of ballooning
at this point.  If ballooning was the end-all solution, I'm fairly sure
Balbir wouldn't be looking at this stuff.  Just trying to keep options
open. :)

-- Dave



Re: [Qemu-devel] [PATCH 5/5] [scsi-bsg]: Add initial support for BSG based SCSIDeviceInfo

2010-06-14 Thread Blue Swirl
On Mon, Jun 14, 2010 at 9:44 AM, Nicholas A. Bellinger
n...@linux-iscsi.org wrote:
 From: Nicholas Bellinger n...@linux-iscsi.org

 This patch adds initial support for using the Linux BSG interface with 
 write/read vectored
 AIO as a QEMU backstore (SCSIDeviceInfo) with hw/scsi-bus.c compatible HBA 
 emulation.

Did I miss the docs?


 So far it has been tested with x86_64 host and guest using hw/megasas.c and 
 TCM_Loop LLD
 Port LUNs.  Because this path uses struct iovec for struct 
 sg_io_v4->d[out,in]_xferp payloads,
 which currently requires a patch to linux/block/bsg.c:bsg_map_hdr() in order 
 to setup the
 user -> kernel iovecs.  This also will only currently work with paired 
 user/kernel
 (eg: 64bit user / 64bit kernel) because of different pointer sizes in struct 
 iovec->iov_base.

 There are also two FIXMEs in hw/scsi-bsg.c:bsg_generic_initfn() related to 
 extraction of
 SCSI LUN and device type values using BSG and required by QEMU-KVM.

 Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org
 ---
  Makefile.objs |    2 +-
  hw/scsi-bsg.c |  588 
 +
  2 files changed, 589 insertions(+), 1 deletions(-)
  create mode 100644 hw/scsi-bsg.c

 diff --git a/Makefile.objs b/Makefile.objs
 index 188d617..c4fcb72 100644
 --- a/Makefile.objs
 +++ b/Makefile.objs
 @@ -197,7 +197,7 @@ hw-obj-$(CONFIG_IDE_CMD646) += ide/cmd646.o
  hw-obj-$(CONFIG_IDE_MACIO) += ide/macio.o

  # SCSI layer
 -hw-obj-y += scsi-disk.o scsi-generic.o
 +hw-obj-y += scsi-disk.o scsi-generic.o scsi-bsg.o

Instead of '#ifdef __linux__' (which should be '#ifdef CONFIG_LINUX'),
please compile the object only if CONFIG_LINUX is set, something like:
hw-obj-$(CONFIG_LINUX) += scsi-bsg.o

Please also check if this could be compiled in common-obj set.

  hw-obj-y += lsi53c895a.o megasas.o
  hw-obj-$(CONFIG_ESP) += esp.o

 diff --git a/hw/scsi-bsg.c b/hw/scsi-bsg.c
 new file mode 100644
 index 000..fc76b76
 --- /dev/null
 +++ b/hw/scsi-bsg.c
 @@ -0,0 +1,588 @@
 +/*
 + * block layer implementation of the sg v4 interface for Linux hosts
 + *
 + * Copyright (c) 2010 Rising Tide Systems
 + * Written by Nicholas A. Bellinger n...@linux-iscsi.org
 + *
 + * Based on hw/scsi-generic code by Laurent Vivier, Paul Brook, and Fabrice 
 Bellard
 + *
 + * This code is licenced under the LGPL.
 + */
 +
 +#include "qemu-common.h"
 +#include "qemu-error.h"
 +#include "block.h"
 +#include "scsi.h"
 +#include "dma.h"
 +#include "block/raw-posix-aio.h"
 +
 +#ifdef __linux__
 +
 +#define DEBUG_BSG
 +#undef DEBUG_BSG_IO
 +#undef DEBUG_BSG_MAP

This should be
//#define DEBUG_BSG
//#define DEBUG_BSG_IO
//#define DEBUG_BSG_MAP

 +
 +#ifdef DEBUG_BSG
 +#define DPRINTF(fmt, ...) \
 +do { printf("scsi-bsg: " fmt , ## __VA_ARGS__); } while (0)
 +#else
 +#define DPRINTF(fmt, ...) do {} while(0)
 +#endif
 +
 +#define BADF(fmt, ...) \
 +do { fprintf(stderr, "scsi-bsg: " fmt , ## __VA_ARGS__); } while (0)
 +
 +#include <stdio.h>
 +#include <sys/types.h>
 +#include <sys/stat.h>
 +#include <sys/epoll.h>
 +#include <unistd.h>
 +#include <scsi/sg.h>
 +#include <linux/bsg.h>
 +#include "scsi-defs.h"
 +
 +#define SCSI_SENSE_BUF_SIZE 96
 +
 +#define SG_ERR_DRIVER_TIMEOUT 0x06
 +#define SG_ERR_DRIVER_SENSE 0x08
 +
 +#ifndef MAX_UINT
 +#define MAX_UINT ((unsigned int)-1)

The standard macro is UINT_MAX.

 +#endif
 +
 +typedef struct SCSIBSGState SCSIBSGState;
 +
 +typedef struct SCSIBSGReq {
 +    SCSIRequest req;
 +    uint8_t *buf;
 +    int buflen;
 +    QEMUIOVector iov;
 +    QEMUIOVector aio_iov;
 +    struct sg_io_v4 bsg_hdr;
 +} SCSIBSGReq;
 +
 +struct SCSIBSGState {
 +    SCSIDevice qdev;
 +    BlockDriverState *bs;
 +    int lun;
 +    int driver_status;
 +    uint8_t sensebuf[SCSI_SENSE_BUF_SIZE];
 +    uint8_t senselen;
 +};
 +
 +static int bsg_read(int fd, void *p_read, int to_read)
 +{
 +    int err;
 +
 +    while (to_read > 0) {
 +        err = read(fd, p_read, to_read);
 +        if (err >= 0) {
 +            to_read -= err;
 +            p_read += err;
 +        } else if (errno == EINTR)
 +            continue;
 +        else {
 +            printf("bsg device %d read failed, errno: %d\n",
 +                    fd, errno);

DPRINTF?

 +            return errno;
 +        }
 +    }
 +    return 0;
 +}
 +
 +static SCSIBSGReq *bsg_new_request(SCSIDevice *d, uint32_t tag, uint32_t lun)
 +{
 +    SCSIRequest *req;
 +    SCSIBSGReq *r;
 +
 +    req = scsi_req_alloc(sizeof(SCSIBSGReq), d, tag, lun);
 +    r = DO_UPCAST(SCSIBSGReq, req, req);
 +    qemu_iovec_init(&r->iov, 1);
 +    qemu_iovec_init(&r->aio_iov, 1);
 +    return r;
 +}
 +
 +static void bsg_remove_request(SCSIBSGReq *r)
 +{
 +    qemu_free(r->buf);
 +    qemu_iovec_destroy(&r->iov);
 +    qemu_iovec_destroy(&r->aio_iov);
 +    scsi_req_free(&r->req);
 +}
 +
 +static void bsg_command_complete(void *opaque, int ret)
 +{
 +    SCSIBSGReq *r = (SCSIBSGReq *)opaque;

Useless cast in C.

 +    SCSIBSGState *s = DO_UPCAST(SCSIBSGState, qdev, r->req.dev);
 +
 +    s->driver_status = 

[ kvm-Bugs-2933400 ] virtio-blk io errors / data corruption on raw drives 1 TB

2010-06-14 Thread SourceForge.net
Bugs item #2933400, was opened at 2010-01-16 15:35
Message generated for change (Comment added) made by gyver
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detailatid=893831aid=2933400group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 9
Private: No
Submitted By: MaSc82 (masc82)
Assigned to: Nobody/Anonymous (nobody)
Summary: virtio-blk io errors / data corruption on raw drives > 1 TB

Initial Comment:
When attaching raw drives > 1 TB, buffer I/O errors will most likely occur and 
filesystems get corrupted. Processes (like mkfs.ext4) might freeze completely 
when filesystems are created on the guest.

Here's a typical log excerpt when using mkfs.ext4 on a 1.5 TB drive on a Ubuntu 
9.10 guest:
(kern.log)
Jan 15 20:40:44 q kernel: [  677.076602] Buffer I/O error on device vde, 
logical block 366283764
Jan 15 20:40:44 q kernel: [  677.076607] Buffer I/O error on device vde, 
logical block 366283765
Jan 15 20:40:44 q kernel: [  677.076611] Buffer I/O error on device vde, 
logical block 366283766
Jan 15 20:40:44 q kernel: [  677.076616] Buffer I/O error on device vde, 
logical block 366283767
Jan 15 20:40:44 q kernel: [  677.076621] Buffer I/O error on device vde, 
logical block 366283768
Jan 15 20:40:44 q kernel: [  677.076626] Buffer I/O error on device vde, 
logical block 366283769
(messages)
Jan 15 20:40:44 q kernel: [  677.076534] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [  677.076541] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [  677.076546] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [  677.076599] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [  677.076604] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [  677.076609] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [  677.076613] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [  677.076618] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [  677.076623] lost page write due to I/O error on vde
Jan 15 20:40:44 q kernel: [  677.076628] lost page write due to I/O error on vde
Jan 15 20:45:55 q Backgrounding to notify hosts...
(The following entries will repeat infinitely, mkfs.ext4 will not exit and 
cannot be killed)
Jan 15 20:49:27 q kernel: [ 1200.520096] mkfs.ext4 D  0 
 1839   1709 0x
Jan 15 20:49:27 q kernel: [ 1200.520101]  88004e157cb8 0082 
88004e157c58 00015880
Jan 15 20:49:27 q kernel: [ 1200.520115]  88004ef6c7c0 00015880 
00015880 00015880
Jan 15 20:49:27 q kernel: [ 1200.520118]  00015880 88004ef6c7c0 
00015880 00015880
Jan 15 20:49:27 q kernel: [ 1200.520123] Call Trace:
Jan 15 20:49:27 q kernel: [ 1200.520157]  [810da0f0] ? 
sync_page+0x0/0x50
Jan 15 20:49:27 q kernel: [ 1200.520178]  [815255f8] 
io_schedule+0x28/0x40
Jan 15 20:49:27 q kernel: [ 1200.520182]  [810da12d] 
sync_page+0x3d/0x50
Jan 15 20:49:27 q kernel: [ 1200.520185]  [81525b17] 
__wait_on_bit+0x57/0x80
Jan 15 20:49:27 q kernel: [ 1200.520192]  [810da29e] 
wait_on_page_bit+0x6e/0x80
Jan 15 20:49:27 q kernel: [ 1200.520205]  [81078650] ? 
wake_bit_function+0x0/0x40
Jan 15 20:49:27 q kernel: [ 1200.520210]  [810e44e0] ? 
pagevec_lookup_tag+0x20/0x30
Jan 15 20:49:27 q kernel: [ 1200.520213]  [810da745] 
wait_on_page_writeback_range+0xf5/0x190
Jan 15 20:49:27 q kernel: [ 1200.520217]  [810da807] 
filemap_fdatawait+0x27/0x30
Jan 15 20:49:27 q kernel: [ 1200.520220]  [810dacb4] 
filemap_write_and_wait+0x44/0x50
Jan 15 20:49:27 q kernel: [ 1200.520235]  [8114ba9f] 
__sync_blockdev+0x1f/0x40
Jan 15 20:49:27 q kernel: [ 1200.520239]  [8114bace] 
sync_blockdev+0xe/0x10
Jan 15 20:49:27 q kernel: [ 1200.520241]  [8114baea] 
block_fsync+0x1a/0x20
Jan 15 20:49:27 q kernel: [ 1200.520249]  [81142f26] 
vfs_fsync+0x86/0xf0
Jan 15 20:49:27 q kernel: [ 1200.520252]  [81142fc9] 
do_fsync+0x39/0x60
Jan 15 20:49:27 q kernel: [ 1200.520255]  [8114301b] 
sys_fsync+0xb/0x10
Jan 15 20:49:27 q kernel: [ 1200.520271]  [81011fc2] 
system_call_fastpath+0x16/0x1b

In my case I was switching to virtio at one point, but the behaviour didn't 
show until there was > 1 TB of data on the filesystem. Very dangerous.

Tested using 2 different SATA controllers, 1.5 TB lvm/mdraid, single 1.5 TB 
drive and 2 TB lvm/mdraid.
The behaviour does not occur with if=scsi or if=ide.

#2914397 might be related: 
https://sourceforge.net/tracker/?func=detail&aid=2914397&group_id=180599&atid=893831
This blog post might also relate: 
http://www.neuhalfen.name/2009/08/05/OpenSolaris_KVM_and_large_IDE_drives/

CPU: Intel 

Re: [Qemu-devel] [RFC PATCH 1/5] qdev: Create qdev_get_dev_path()

2010-06-14 Thread Alex Williamson
On Mon, 2010-06-14 at 18:49 +0200, Jan Kiszka wrote:
 Alex Williamson wrote:
  On Mon, 2010-06-14 at 18:00 +0200, Jan Kiszka wrote: 
  And instead of introducing another hierarchy level with the bus address,
  I would also prefer to add this as prefix or suffix to the device name,
  e.g. driver.busaddr.
  
  That's what I had started with.  The first post in this thread has
  pci.0,addr=09.0 as a single hierarchy level.  The addr= may be
  unnecessary there, but I also prefer something along those lines.  For
  PCI it'd make sense to have name:addr, which comes out to
  pci.0:09.0.  (Maybe rather than flagging properties as being relevant
  to the path and printing them generically, we should extract specific
  properties based on the bus type.)
 
 Not bus.addr, driver.addr. We only have one PCI bus here, not as many as
 there are slots on that bus.

Ok, I can get it down to something like:

/i440FX-pcihost/pci.0/virtio-blk-pci,09.0

The addr on the device is initially a little non-intuitive to me since
it's a property of the bus, but I guess it makes sense if we think of
that level as slot, which includes an address and driver.

  For busses that don't have a consistent addressing scheme then some sort 
  of 
  instance ID is unavoidable. I guess it may be possible to invent 
  something 
  based on other device properties (e.g. address of the first IO 
  port/memory 
  region).
  Instance IDs aren't always bad, we just run into trouble with hotplug
  and maybe creating unique ramblock names.  But, such busses typically
  don't support hotplug and don't have multiple instances of the same
  device on the bus, so I don't expect us to hit many issues there as long
  as we can get a stable address path.  Thanks,
 
  If stable instance numbers are required, we could simply keep them in
  DeviceState and search for the smallest free one on additions. Actually,
  I'm more in favor of this direction than including the bus address. That
  way we could keep a canonical format across all buses and do not have to
  provide possibly complex ID generation rules for each of them.
  
  I started down that path, but it still breaks for hotplug.  If we start
  a VM with two e1000 NICs, then remove the first, we can no longer
  migrate because the target can't represent having a single e1000 with a
  non-zero instance ID.
 
 That's indeed a good point.
 
 Still, I'm worried about having to define numbering schemes for all the
 buses qemu supports. Maybe we can run a mixture: address-based for
 hotplug-capably buses (they tend to be cooperative in this regard) and
 simple instance numbers for the rest, likely the majority.

Yep, I share that concern, which is why I say instance numbers aren't always
bad.  If we have devices that we don't care about doing hotplug on, we
can live with instance numbers.  Thanks,

Alex



Re: [RFC PATCH 0/5] Introduce canonical device hierarchy string

2010-06-14 Thread Alex Williamson
On Mon, 2010-06-14 at 09:02 +0200, Gerd Hoffmann wrote:
 Hi,
 
  My premise with this attempt is that we walk the hierarchy and use the
  names to create the base of the path.  As we get to the device,
  particularly to the parent bus of the device, we need to start looking at
  properties to ensure uniqueness.
 
 You'll need that for every bus along the way down to the device.  Create 
 a virtual machine with two lsi scsi host adapters, then attach a disk 
 with scsi id 0 to each.  Just the scsi id isn't good enough to identify 
 the device.  You'll need the lsi pci address too.

Yep, see below.

  For now, the only properties I've tagged as path
  properties are PCI bus addresses and MAC addresses.
 
 mac address isn't needed here.  You need the property which specifies 
 the bus address.  For PCI this obviously is the PCI address.  For scsi 
 the scsi id.  For ISA you can use the I/O port base.  virtio-serial the 
 port number, ...

PCI: addr
SCSI: scsi-id
ISA: serial/parallel = iobase, others??
ide-drive: unit
I2C: address

virtio-serial doesn't seem to make a DeviceState per port, so I think it
can be skipped.  I'm sure I'm still missing some...

That gives me the list below (printed from qdev_init()):

/i440FX-pcihost
/i440FX-pcihost/pci.0/i440FX.00.0
/i440FX-pcihost/pci.0/PIIX3.01.0
/i440FX-pcihost/pci.0/cirrus-vga.02.0
/i440FX-pcihost/pci.0/PIIX3.01.0/isa.0/mc146818rtc
/i440FX-pcihost/pci.0/PIIX3.01.0/isa.0/isa-serial.0x3f8
/i440FX-pcihost/pci.0/PIIX3.01.0/isa.0/isa-serial.0x2f8
/i440FX-pcihost/pci.0/PIIX3.01.0/isa.0/isa-parallel.0x378
/i440FX-pcihost/pci.0/PIIX3.01.0/isa.0/i8042
/i440FX-pcihost/pci.0/PIIX3.01.0/isa.0/isa-fdc
/i440FX-pcihost/pci.0/i82551.03.0
/i440FX-pcihost/pci.0/virtio-net-pci.04.0
/i440FX-pcihost/pci.0/e1000.05.0
/i440FX-pcihost/pci.0/rtl8139.06.0
/i440FX-pcihost/pci.0/pcnet.07.0
/i440FX-pcihost/pci.0/piix3-ide.01.1
/i440FX-pcihost/pci.0/piix3-ide.01.1/ide.0/ide-drive.0
/i440FX-pcihost/pci.0/piix3-ide.01.1/ide.0/ide-drive.1
/i440FX-pcihost/pci.0/piix3-ide.01.1/ide.1/ide-drive.0
/i440FX-pcihost/pci.0/PIIX4_PM.01.3
/i440FX-pcihost/pci.0/PIIX4_PM.01.3/i2c/smbus-eeprom.80
/i440FX-pcihost/pci.0/PIIX4_PM.01.3/i2c/smbus-eeprom.81
/i440FX-pcihost/pci.0/PIIX4_PM.01.3/i2c/smbus-eeprom.82
/i440FX-pcihost/pci.0/PIIX4_PM.01.3/i2c/smbus-eeprom.83
/i440FX-pcihost/pci.0/PIIX4_PM.01.3/i2c/smbus-eeprom.84
/i440FX-pcihost/pci.0/PIIX4_PM.01.3/i2c/smbus-eeprom.85
/i440FX-pcihost/pci.0/PIIX4_PM.01.3/i2c/smbus-eeprom.86
/i440FX-pcihost/pci.0/PIIX4_PM.01.3/i2c/smbus-eeprom.87
/i440FX-pcihost/pci.0/lsi53c895a.08.0/scsi.0/scsi-disk.0
/i440FX-pcihost/pci.0/lsi53c895a.08.0/scsi.0/scsi-disk.3
/i440FX-pcihost/pci.0/lsi53c895a.08.0
/i440FX-pcihost/pci.0/virtio-serial-pci.09.0
/i440FX-pcihost/pci.0/virtio-serial-pci.0a.0
/i440FX-pcihost/pci.0/virtio-blk-pci.0b.0




[RESEND PATCH] acpi_piix4: save gpe and pci hotplug slot status

2010-06-14 Thread Alex Williamson
PCI hotplug currently doesn't work after a migration because
we don't migrate the enable bits of the GPE state.  Pull hotplug
structs into vmstate.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 hw/acpi_piix4.c |   29 -
 1 files changed, 28 insertions(+), 1 deletions(-)

diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
index a87286b..8d1a628 100644
--- a/hw/acpi_piix4.c
+++ b/hw/acpi_piix4.c
@@ -283,9 +283,33 @@ static int vmstate_acpi_post_load(void *opaque, int 
version_id)
 return 0;
 }
 
+static const VMStateDescription vmstate_gpe = {
+.name = "gpe",
+.version_id = 1,
+.minimum_version_id = 1,
+.minimum_version_id_old = 1,
+.fields  = (VMStateField []) {
+VMSTATE_UINT16(sts, struct gpe_regs),
+VMSTATE_UINT16(en, struct gpe_regs),
+VMSTATE_END_OF_LIST()
+}
+};
+
+static const VMStateDescription vmstate_pci_status = {
+.name = "pci_status",
+.version_id = 1,
+.minimum_version_id = 1,
+.minimum_version_id_old = 1,
+.fields  = (VMStateField []) {
+VMSTATE_UINT32(up, struct pci_status),
+VMSTATE_UINT32(down, struct pci_status),
+VMSTATE_END_OF_LIST()
+}
+};
+
 static const VMStateDescription vmstate_acpi = {
 .name = "piix4_pm",
-.version_id = 1,
+.version_id = 2,
 .minimum_version_id = 1,
 .minimum_version_id_old = 1,
 .post_load = vmstate_acpi_post_load,
@@ -297,6 +321,9 @@ static const VMStateDescription vmstate_acpi = {
 VMSTATE_STRUCT(apm, PIIX4PMState, 0, vmstate_apm, APMState),
 VMSTATE_TIMER(tmr_timer, PIIX4PMState),
 VMSTATE_INT64(tmr_overflow_time, PIIX4PMState),
+VMSTATE_STRUCT(gpe, PIIX4PMState, 2, vmstate_gpe, struct gpe_regs),
+VMSTATE_STRUCT(pci0_status, PIIX4PMState, 2, vmstate_pci_status,
+   struct pci_status),
 VMSTATE_END_OF_LIST()
 }
 };



Re: [PATCH v4] KVM: x86: XSAVE/XRSTOR live migration support

2010-06-14 Thread Marcelo Tosatti
On Sun, Jun 13, 2010 at 05:29:39PM +0800, Sheng Yang wrote:
 This patch enable save/restore of xsave state.
 
 Signed-off-by: Sheng Yang sh...@linux.intel.com

Applied, thanks.



Re: [PATCH v3] qemu: kvm: Enable XSAVE live migration support

2010-06-14 Thread Marcelo Tosatti
On Fri, Jun 11, 2010 at 12:36:49PM +0800, Sheng Yang wrote:
 
 Signed-off-by: Sheng Yang sh...@linux.intel.com
 ---
  qemu-kvm-x86.c|  109 
 -
  qemu-kvm.c|   24 +++
  qemu-kvm.h|   28 +
  target-i386/cpu.h |5 ++
  target-i386/kvm.c |2 +
  target-i386/machine.c |   20 +
  6 files changed, 169 insertions(+), 19 deletions(-)

Applied, thanks.



Re: [PATCH v4] test: Add XSAVE unit test

2010-06-14 Thread Marcelo Tosatti
On Sun, Jun 13, 2010 at 04:49:19PM +0800, Sheng Yang wrote:
 Based on IDT test framework.
 
 Signed-off-by: Sheng Yang sh...@linux.intel.com

Applied, thanks.



Re: [PATCH v3] test: Add IDT framework

2010-06-14 Thread Marcelo Tosatti
On Sun, Jun 13, 2010 at 11:02:02AM +0300, Avi Kivity wrote:
 Signed-off-by: Sheng Yang sh...@linux.intel.com
 Signed-off-by: Avi Kivity a...@redhat.com
 ---
 
 v3: rearrange printf()s
 
 v2: accurate instruction boundary tests
 avoid playing with the stack; use exception tables instead

Applied, thanks.



Re: [Qemu-devel] [RESEND PATCH] acpi_piix4: save gpe and pci hotplug slot status

2010-06-14 Thread Anthony Liguori

On 06/14/2010 03:28 PM, Alex Williamson wrote:

PCI hotplug currently doesn't work after a migration because
we don't migrate the enable bits of the GPE state.  Pull hotplug
structs into vmstate.

Signed-off-by: Alex Williamson alex.william...@redhat.com


Applied.  Thanks.

Regards,

Anthony Liguori

---

  hw/acpi_piix4.c |   29 -
  1 files changed, 28 insertions(+), 1 deletions(-)

diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
index a87286b..8d1a628 100644
--- a/hw/acpi_piix4.c
+++ b/hw/acpi_piix4.c
@@ -283,9 +283,33 @@ static int vmstate_acpi_post_load(void *opaque, int 
version_id)
  return 0;
  }

+static const VMStateDescription vmstate_gpe = {
+.name = "gpe",
+.version_id = 1,
+.minimum_version_id = 1,
+.minimum_version_id_old = 1,
+.fields  = (VMStateField []) {
+VMSTATE_UINT16(sts, struct gpe_regs),
+VMSTATE_UINT16(en, struct gpe_regs),
+VMSTATE_END_OF_LIST()
+}
+};
+
+static const VMStateDescription vmstate_pci_status = {
+.name = "pci_status",
+.version_id = 1,
+.minimum_version_id = 1,
+.minimum_version_id_old = 1,
+.fields  = (VMStateField []) {
+VMSTATE_UINT32(up, struct pci_status),
+VMSTATE_UINT32(down, struct pci_status),
+VMSTATE_END_OF_LIST()
+}
+};
+
  static const VMStateDescription vmstate_acpi = {
  .name = "piix4_pm",
-.version_id = 1,
+.version_id = 2,
  .minimum_version_id = 1,
  .minimum_version_id_old = 1,
  .post_load = vmstate_acpi_post_load,
@@ -297,6 +321,9 @@ static const VMStateDescription vmstate_acpi = {
  VMSTATE_STRUCT(apm, PIIX4PMState, 0, vmstate_apm, APMState),
  VMSTATE_TIMER(tmr_timer, PIIX4PMState),
  VMSTATE_INT64(tmr_overflow_time, PIIX4PMState),
+VMSTATE_STRUCT(gpe, PIIX4PMState, 2, vmstate_gpe, struct gpe_regs),
+VMSTATE_STRUCT(pci0_status, PIIX4PMState, 2, vmstate_pci_status,
+   struct pci_status),
  VMSTATE_END_OF_LIST()
  }
  };







Re: [Qemu-devel] [RESEND PATCH] acpi_piix4: save gpe and pci hotplug slot status

2010-06-14 Thread Alexander Graf

On 14.06.2010, at 22:58, Anthony Liguori wrote:

 On 06/14/2010 03:28 PM, Alex Williamson wrote:
 PCI hotplug currently doesn't work after a migration because
 we don't migrate the enable bits of the GPE state.  Pull hotplug
 structs into vmstate.
 
 Signed-off-by: Alex Williamson alex.william...@redhat.com
 
 Applied.  Thanks.

Is this stable material?

Alex



[PATCH] fix -DDEBUG oops

2010-06-14 Thread Zachary Amsden

Patch is fairly self-explanatory.
KVM -DDEBUG fix

Fix a slight error with assertion in local APIC code.

Signed-off-by: Zachary Amsden zams...@redhat.com

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index d8258a0..024f6d1 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -329,7 +329,7 @@ int kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct 
kvm_lapic *source,
   "dest_mode 0x%x, short_hand 0x%x\n",
   target, source, dest, dest_mode, short_hand);
 
-   ASSERT(!target);
+   ASSERT(target);
switch (short_hand) {
case APIC_DEST_NOSHORT:
if (dest_mode == 0)


Re: [Qemu-devel] [RFC PATCH 1/5] qdev: Create qdev_get_dev_path()

2010-06-14 Thread Paul Brook
 On Mon, 2010-06-14 at 18:49 +0200, Jan Kiszka wrote:
  Alex Williamson wrote:
   On Mon, 2010-06-14 at 18:00 +0200, Jan Kiszka wrote:
   And instead of introducing another hierarchy level with the bus
   address, I would also prefer to add this as prefix or suffix to the
   device name, e.g. driver.busaddr.
   
   That's what I had started with.  The first post in this thread has
   pci.0,addr=09.0 as a single hierarchy level.  The addr= may be
   unnecessary there, but I also prefer something along those lines.  For
   PCI it'd make sense to have name:addr, which comes out to
   pci.0:09.0.  (Maybe rather than flagging properties as being relevant
   to the path and printing them generically, we should extract specific
   properties based on the bus type.)
  
  Not bus.addr, driver.addr. We only have one PCI bus here, not as many as
  there are slots on that bus.
 
 Ok, I can get it down to something like:
 
 /i440FX-pcihost/pci.0/virtio-blk-pci,09.0
 
 The addr on the device is initially a little non-intuitive to me since
 it's a property of the bus, but I guess it makes sense if we think of
 that level as slot, which includes an address and driver.

That indicates you're thinking about things the wrong way.
The point of this path is to uniquely identify an entity.

/i440FX-pcihost/pci0 identifies a PCI bus attached to the i440FX-pcihost 
device. To identify a device attached to that bus all you need to know is the 
devfn of the device.

For an end-user it may be helpful to allow devices to be identified by the 
device type (assuming only one device of a particular type on that bus), or 
specify the device type as a crude error checking mechanism. However we're 
talking about canonical addresses. These need not include the device type. 
Verifying that the device is actually what you expect is a separate problem, 
and the device type is not sufficient for that.

I.e. /i440FX-pcihost/pci.0/,09.0 is an appropriate canonical address.
 
   I started down that path, but it still breaks for hotplug.  If we start
   a VM with two e1000 NICs, then remove the first, we can no longer
   migrate because the target can't represent having a single e1000 with a
   non-zero instance ID.
  
  That's indeed a good point.

That's a feature. If you start with two NICs and remove the first, the chances 
are that the second will be in a different place to the NIC created in a 
single-NIC config. Allowing migration between these two will cause a PCI 
device to suddenly change location. This is not physically or logically 
possible, and everyone loses.

Hot-removing a nic and then hotplugging a new nic in the same location may 
result in something that is reasonable to migrate.

Paul


Re: [Qemu-devel] [PATCH v6 6/6] the stand-alone shared memory server for inter-VM shared memory

2010-06-14 Thread Cam Macdonell
On Mon, Jun 14, 2010 at 9:53 AM, Anthony Liguori anth...@codemonkey.ws wrote:
 On 06/04/2010 04:45 PM, Cam Macdonell wrote:

 this code is a standalone server which will pass file descriptors for the
 shared
 memory region and eventfds to support interrupts between guests using
 inter-VM
 shared memory.
 ---
  contrib/ivshmem-server/Makefile         |   16 ++
  contrib/ivshmem-server/README           |   30 +++
  contrib/ivshmem-server/ivshmem_server.c |  353
 +++
  contrib/ivshmem-server/send_scm.c       |  208 ++
  contrib/ivshmem-server/send_scm.h       |   19 ++
  5 files changed, 626 insertions(+), 0 deletions(-)
  create mode 100644 contrib/ivshmem-server/Makefile
  create mode 100644 contrib/ivshmem-server/README
  create mode 100644 contrib/ivshmem-server/ivshmem_server.c
  create mode 100644 contrib/ivshmem-server/send_scm.c
  create mode 100644 contrib/ivshmem-server/send_scm.h

 diff --git a/contrib/ivshmem-server/Makefile
 b/contrib/ivshmem-server/Makefile
 new file mode 100644
 index 000..da40ffa
 --- /dev/null
 +++ b/contrib/ivshmem-server/Makefile
 @@ -0,0 +1,16 @@
 +CC = gcc
 +CFLAGS = -O3 -Wall -Werror
 +LIBS = -lrt
 +
 +# a very simple makefile to build the inter-VM shared memory server
 +
 +all: ivshmem_server
 +
 +.c.o:
 +       $(CC) $(CFLAGS) -c $^ -o $@
 +
 +ivshmem_server: ivshmem_server.o send_scm.o
 +       $(CC) $(CFLAGS) -o $@ $^ $(LIBS)
 +
 +clean:
 +       rm -f *.o ivshmem_server
 diff --git a/contrib/ivshmem-server/README b/contrib/ivshmem-server/README
 new file mode 100644
 index 000..b1fc2a2
 --- /dev/null
 +++ b/contrib/ivshmem-server/README
 @@ -0,0 +1,30 @@
 +Using the ivshmem shared memory server
 +--
 +
 +This server is only supported on Linux.
 +
 +To use the shared memory server, first compile it.  Running 'make' should
 +accomplish this.  An executable named 'ivshmem_server' will be built.
 +
 +To display the options, run:
 +
 +./ivshmem_server -h
 +
 +Options
 +---
 +
 +    -h  print help message
 +
 +    -p <path on host>
 +        unix socket to listen on.  The qemu-kvm chardev needs to connect on
 +        this socket. (default: '/tmp/ivshmem_socket')
 +
 +    -s <string>
 +        POSIX shared object to create that is the shared memory (default: 'ivshmem')
 +
 +    -m <#>
 +        size of the POSIX object in MBs (default: 1)
 +
 +    -n <#>
 +        number of eventfds for each guest.  This number must match the
 +        'vectors' argument passed to the ivshmem device. (default: 1)
 diff --git a/contrib/ivshmem-server/ivshmem_server.c
 b/contrib/ivshmem-server/ivshmem_server.c
 new file mode 100644
 index 000..e0a7b98
 --- /dev/null
 +++ b/contrib/ivshmem-server/ivshmem_server.c


 There's no licensing here.  I don't think this belongs in the qemu tree
 either to be honest.  If it were to be included, it ought to use all of the
 existing qemu infrastructure like the other qemu-* tools.

For the time being, I'm willing to leave it out and host it externally.


 Regards,

 Anthony Liguori

 @@ -0,0 +1,353 @@
 +/*
 + * A stand-alone shared memory server for inter-VM shared memory for KVM
 +*/
 +
 +#include <errno.h>
 +#include <string.h>
 +#include <sys/types.h>
 +#include <sys/socket.h>
 +#include <sys/un.h>
 +#include <unistd.h>
 +#include <sys/types.h>
 +#include <sys/stat.h>
 +#include <fcntl.h>
 +#include <sys/eventfd.h>
 +#include <sys/mman.h>
 +#include <sys/select.h>
 +#include <stdio.h>
 +#include <stdlib.h>
 +#include "send_scm.h"
 +
 +#define DEFAULT_SOCK_PATH "/tmp/ivshmem_socket"
 +#define DEFAULT_SHM_OBJ "ivshmem"
 +
 +#define DEBUG 1
 +
 +typedef struct server_state {
 +    vmguest_t *live_vms;
 +    int nr_allocated_vms;
 +    int shm_size;
 +    long live_count;
 +    long total_count;
 +    int shm_fd;
 +    char * path;
 +    char * shmobj;
 +    int maxfd, conn_socket;
 +    long msi_vectors;
 +} server_state_t;
 +
 +void usage(char const *prg);
 +int find_set(fd_set * readset, int max);
 +void print_vec(server_state_t * s, const char * c);
 +
 +void add_new_guest(server_state_t * s);
 +void parse_args(int argc, char **argv, server_state_t * s);
 +int create_listening_socket(char * path);
 +
 +int main(int argc, char ** argv)
 +{
 +    fd_set readset;
 +    server_state_t * s;
 +
 +    s = (server_state_t *)calloc(1, sizeof(server_state_t));
 +
 +    s->live_count = 0;
 +    s->total_count = 0;
 +    parse_args(argc, argv, s);
 +
 +    /* open shared memory file  */
 +    if ((s->shm_fd = shm_open(s->shmobj, O_CREAT|O_RDWR, S_IRWXU)) < 0)
 +    {
 +        fprintf(stderr, "kvm_ivshmem: could not open shared file\n");
 +        exit(-1);
 +    }
 +
 +    ftruncate(s->shm_fd, s->shm_size);
 +
 +    s->conn_socket = create_listening_socket(s->path);
 +
 +    s->maxfd = s->conn_socket;
 +
 +    for(;;) {
 +        int ret, handle, i;
 +        char buf[1024];
 +
 +        print_vec(s, "vm_sockets");
 +
 +        FD_ZERO(&readset);
 +        /* conn socket is in Live_vms at posn 0 */
 +        FD_SET(s->conn_socket, &readset);
 + 

Re: [Qemu-devel] [RFC PATCH 1/5] qdev: Create qdev_get_dev_path()

2010-06-14 Thread Alex Williamson
On Mon, 2010-06-14 at 22:43 +0100, Paul Brook wrote:
  On Mon, 2010-06-14 at 18:49 +0200, Jan Kiszka wrote:
   Alex Williamson wrote:
  Ok, I can get it down to something like:
  
  /i440FX-pcihost/pci.0/virtio-blk-pci,09.0
  
  The addr on the device is initially a little non-intuitive to me since
  it's a property of the bus, but I guess it makes sense if we think of
  that level as slot, which includes an address and driver.
 
 That indicates you're thinking about things the wrong way.
 The point of this path is to uniquely identify an entity.
 
 /i440FX-pcihost/pci0 identifies a PCI bus attached to the i440FX-pcihost 
 device. To identify a device attached to that bus all you need to know is the 
 devfn of the device.

Hmm, I think that identifies where the device is, not what the device
is.  It's more helpful to say "the e1000 in slot 7" than it is to say
"the device in slot 7".

 For an end-user it may be helpful to allow devices to be identified by the 
 device type (assuming only one device of a particular type on that bus), or 
 specify the device type as a crude error checking mechanism. However we're 
 talking about canonical addresses. These need not include the device type. 
 Verifying that the device is actually what you expect is a separate problem, 
 and the device type is not sufficient for that.
 
 i.e. /i440FX-pcihost/pci.0/,09.0 Is an appropriate canonical address.

We seem to keep introducing new problems, and I'm not sure this one
exists.  If I drop the device name/type and use only the devfn, then
what's to prevent the /e1000,09.0/rom (/,09.0/rom) from being stuffed
into the /rtl8139,09.0/rom (/,09.0/rom) on a migration?  (or we match it
to /09.0/savevm when trying to migrate state)  We can argue that e1000
isn't a sufficient identifier, but I can't think of a case where it'd
fail.

I started down that path, but it still breaks for hotplug.  If we start
a VM with two e1000 NICs, then remove the first, we can no longer
migrate because the target can't represent having a single e1000 with a
non-zero instance ID.
   
   That's indeed a good point.
 
 That's a feature. If you start with two NICs and remove the first, the 
 chances 
 are that the second will be in a different place to the NIC created in a 
 single-NIC config. Allowing migration between these two will cause a PCI 
 device to suddenly change location. This is not physically or logically 
 possible, and everyone loses.

If the BAR addresses don't follow the VM when it's migrated, that's
another bug that needs to be fixed, but we can't get there until we at
least have some infrastructure in place to make that bug possible.

 Hot-removing a nic and then hotplugging a new nic in the same location may 
 result in something that is reasonable to migrate.

It might... it might not.  I'd rather make it work than try to document
the restrictions.

Alex



Re: [PATCH v2 1/3] Introduce a workqueue to deliver PIT timer interrupts.

2010-06-14 Thread Marcelo Tosatti
On Mon, Jun 14, 2010 at 01:11:20PM -0400, Chris Lalancette wrote:
 We really want to kvm_set_irq during the hrtimer callback,
 but that is risky because that is during interrupt context.
 Instead, offload the work to a workqueue, which is a bit safer
 and should provide most of the same functionality.
 
 Signed-off-by: Chris Lalancette clala...@redhat.com
 ---
  arch/x86/kvm/i8254.c |  125 -
  arch/x86/kvm/i8254.h |4 +-
  arch/x86/kvm/irq.c   |1 -
  3 files changed, 74 insertions(+), 56 deletions(-)
 
 diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
 index 188d827..3bed8ac 100644
 --- a/arch/x86/kvm/i8254.c
 +++ b/arch/x86/kvm/i8254.c
 @@ -34,6 +34,7 @@
  
  #include <linux/kvm_host.h>
  #include <linux/slab.h>
 +#include <linux/workqueue.h>
  
  #include "irq.h"
  #include "i8254.h"
 @@ -244,11 +245,11 @@ static void kvm_pit_ack_irq(struct kvm_irq_ack_notifier 
 *kian)
  {
   struct kvm_kpit_state *ps = container_of(kian, struct kvm_kpit_state,
irq_ack_notifier);
 - raw_spin_lock(&ps->inject_lock);
 + spin_lock(&ps->inject_lock);
  if (atomic_dec_return(&ps->pit_timer.pending) < 0)
  atomic_inc(&ps->pit_timer.pending);
  ps->irq_ack = 1;
 - raw_spin_unlock(&ps->inject_lock);
 + spin_unlock(&ps->inject_lock);
  }
  
  void __kvm_migrate_pit_timer(struct kvm_vcpu *vcpu)
 @@ -267,7 +268,8 @@ void __kvm_migrate_pit_timer(struct kvm_vcpu *vcpu)
  static void destroy_pit_timer(struct kvm_timer *pt)
  {
  pr_debug("execute del timer!\n");
 - hrtimer_cancel(&pt->timer);
 + if (hrtimer_cancel(&pt->timer))
 + cancel_work_sync(&pt->kvm->arch.vpit->expired);
  }
  
  static bool kpit_is_periodic(struct kvm_timer *ktimer)
 @@ -281,6 +283,58 @@ static struct kvm_timer_ops kpit_ops = {
   .is_periodic = kpit_is_periodic,
  };
  
 +static void pit_do_work(struct work_struct *work)
 +{
 + struct kvm_pit *pit = container_of(work, struct kvm_pit, expired);
 + struct kvm *kvm = pit->kvm;
 + struct kvm_vcpu *vcpu;
 + int i;
 + struct kvm_kpit_state *ps = &pit->pit_state;
 + int inject = 0;
 +
 + /* Try to inject pending interrupts when
 +  * last one has been acked.
 +  */
 + spin_lock(&ps->inject_lock);
 + if (ps->irq_ack) {
 + ps->irq_ack = 0;
 + inject = 1;
 + }
 + spin_unlock(&ps->inject_lock);
 + if (inject) {
 + kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 1);
 + kvm_set_irq(kvm, kvm->arch.vpit->irq_source_id, 0, 0);
 +
 + /*
 +  * Provides NMI watchdog support via Virtual Wire mode.
 +  * The route is: PIT -> PIC -> LVT0 in NMI mode.
 +  *
 +  * Note: Our Virtual Wire implementation is simplified, only
 +  * propagating PIT interrupts to all VCPUs when they have set
 +  * LVT0 to NMI delivery. Other PIC interrupts are just sent to
 +  * VCPU0, and only if its LVT0 is in EXTINT mode.
 +  */
 + if (kvm->arch.vapics_in_nmi_mode > 0)
 + kvm_for_each_vcpu(i, vcpu, kvm)
 + kvm_apic_nmi_wd_deliver(vcpu);
 + }
 +}
 +
 +static enum hrtimer_restart pit_timer_fn(struct hrtimer *data)
 +{
 + struct kvm_timer *ktimer = container_of(data, struct kvm_timer, timer);
 + struct kvm_pit *pt = ktimer->kvm->arch.vpit;
 +
 + if (ktimer->reinject)
 + queue_work(pt->wq, &pt->expired);

The problem is queue_work only queues the work if it was not already
queued. So multiple queue_work() calls can collapse into one executed
job.

You need to maintain a counter here in pit_timer_fn, and reinject at 
some point (perhaps on ACK) if there are multiple interrupts pending.

 +
 + if (ktimer->t_ops->is_periodic(ktimer)) {
 + hrtimer_add_expires_ns(&ktimer->timer, ktimer->period);
 + return HRTIMER_RESTART;
 + } else
 + return HRTIMER_NORESTART;
 +}
 +
  static void create_pit_timer(struct kvm_kpit_state *ps, u32 val, int 
 is_period)
  {
  struct kvm_timer *pt = &ps->pit_timer;
 @@ -291,14 +345,14 @@ static void create_pit_timer(struct kvm_kpit_state *ps, 
 u32 val, int is_period)
  pr_debug("create pit timer, interval is %llu nsec\n", interval);
  
   /* TODO The new value only affected after the retriggered */
 - hrtimer_cancel(&pt->timer);
 + if (hrtimer_cancel(&pt->timer))
 + cancel_work_sync(&pt->kvm->arch.vpit->expired);

There can be a queued work instance even if the hrtimer is not active,
so cancel_work_sync should be unconditional.



Re: [PATCH 7/7] KVM: MMU: don't walk every parent pages while mark unsync

2010-06-14 Thread Marcelo Tosatti
On Fri, Jun 11, 2010 at 09:35:15PM +0800, Xiao Guangrong wrote:
 While we mark the parent's unsync_child_bitmap, if the parent is already
 unsynced, there is no need to walk its parents; this avoids some
 unnecessary workload
 
 Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
 ---
  arch/x86/kvm/mmu.c |   61 ++-
  1 files changed, 17 insertions(+), 44 deletions(-)
 
 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index eb20682..a92863f 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -175,7 +175,7 @@ struct kvm_shadow_walk_iterator {
shadow_walk_okay((_walker));  \
shadow_walk_next((_walker)))
  
 -typedef int (*mmu_parent_walk_fn) (struct kvm_mmu_page *sp);
 +typedef void (*mmu_parent_walk_fn) (struct kvm_mmu_page *sp, u64 *spte);
  
  static struct kmem_cache *pte_chain_cache;
  static struct kmem_cache *rmap_desc_cache;
 @@ -1024,7 +1024,6 @@ static void mmu_page_remove_parent_pte(struct 
 kvm_mmu_page *sp,
   BUG();
  }
  
 -
  static void mmu_parent_walk(struct kvm_mmu_page *sp, mmu_parent_walk_fn fn)
  {
   struct kvm_pte_chain *pte_chain;
 @@ -1034,63 +1033,37 @@ static void mmu_parent_walk(struct kvm_mmu_page *sp, 
 mmu_parent_walk_fn fn)
  
  if (!sp->multimapped && sp->parent_pte) {
  parent_sp = page_header(__pa(sp->parent_pte));
 - fn(parent_sp);
 - mmu_parent_walk(parent_sp, fn);
 + fn(parent_sp, sp->parent_pte);
  return;
  }
 +
  hlist_for_each_entry(pte_chain, node, &sp->parent_ptes, link)
  for (i = 0; i < NR_PTE_CHAIN_ENTRIES; ++i) {
 - if (!pte_chain->parent_ptes[i])
 + u64 *spte = pte_chain->parent_ptes[i];
 +
 + if (!spte)
  break;
 - parent_sp = page_header(__pa(pte_chain->parent_ptes[i]));
 - fn(parent_sp);
 - mmu_parent_walk(parent_sp, fn);
 + parent_sp = page_header(__pa(spte));
 + fn(parent_sp, spte);
  }
  }
  
 -static void kvm_mmu_update_unsync_bitmap(u64 *spte)
 +static void mark_unsync(struct kvm_mmu_page *sp, u64 *spte);
 +static void kvm_mmu_mark_parents_unsync(struct kvm_mmu_page *sp)
  {
 - unsigned int index;
 - struct kvm_mmu_page *sp = page_header(__pa(spte));
 -
 - index = spte - sp->spt;
 - if (!__test_and_set_bit(index, sp->unsync_child_bitmap))
 - sp->unsync_children++;
 - WARN_ON(!sp->unsync_children);
 + mmu_parent_walk(sp, mark_unsync);
  }
  
 -static void kvm_mmu_update_parents_unsync(struct kvm_mmu_page *sp)
 +static void mark_unsync(struct kvm_mmu_page *sp, u64 *spte)
  {
 - struct kvm_pte_chain *pte_chain;
 - struct hlist_node *node;
 - int i;
 + unsigned int index;
  
 - if (!sp->parent_pte)
 + index = spte - sp->spt;
 + if (__test_and_set_bit(index, sp->unsync_child_bitmap))
  return;
 -
 - if (!sp->multimapped) {
 - kvm_mmu_update_unsync_bitmap(sp->parent_pte);
 + if (sp->unsync_children++)
  return;

This looks wrong. If the sp has an unrelated child already marked as
unsync (which increased sp->unsync_children), you stop the walk?

Applied 1-6, thanks.



KVM: VMX: optimize APIC EOI

2010-06-14 Thread Marcelo Tosatti

Use information provided in exit_qualification to shortcut EOI path.

Reduces EOI latency from 4k to 2k cycles on Nehalem.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: kvm/arch/x86/kvm/lapic.c
===
--- kvm.orig/arch/x86/kvm/lapic.c
+++ kvm/arch/x86/kvm/lapic.c
@@ -1279,3 +1279,9 @@ int kvm_hv_vapic_msr_read(struct kvm_vcp
 
return 0;
 }
+
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
+{
+   apic_set_eoi(vcpu->arch.apic);
+}
+EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
Index: kvm/arch/x86/kvm/lapic.h
===
--- kvm.orig/arch/x86/kvm/lapic.h
+++ kvm/arch/x86/kvm/lapic.h
@@ -52,6 +52,8 @@ int kvm_x2apic_msr_read(struct kvm_vcpu 
 int kvm_hv_vapic_msr_write(struct kvm_vcpu *vcpu, u32 msr, u64 data);
 int kvm_hv_vapic_msr_read(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
 
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
+
 static inline bool kvm_hv_vapic_assist_page_enabled(struct kvm_vcpu *vcpu)
 {
 return vcpu->arch.hv_vapic & HV_X64_MSR_APIC_ASSIST_PAGE_ENABLE;
Index: kvm/arch/x86/kvm/vmx.c
===
--- kvm.orig/arch/x86/kvm/vmx.c
+++ kvm/arch/x86/kvm/vmx.c
@@ -3395,6 +3395,17 @@ static int handle_wbinvd(struct kvm_vcpu
 
 static int handle_apic_access(struct kvm_vcpu *vcpu)
 {
+   unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+   int access_type, offset;
+
+   access_type = (exit_qualification >> 12) & 0xf;
+   offset = exit_qualification & 0xfff;
+   if (access_type == 1 && offset == APIC_EOI) {
+   kvm_lapic_set_eoi(vcpu);
+   skip_emulated_instruction(vcpu);
+   return 1;
+   }
+
return emulate_instruction(vcpu, 0, 0, 0) == EMULATE_DONE;
 }
 


Re: [Qemu-devel] [RFC PATCH 1/5] qdev: Create qdev_get_dev_path()

2010-06-14 Thread Paul Brook
   Ok, I can get it down to something like:
   
   /i440FX-pcihost/pci.0/virtio-blk-pci,09.0
   
   The addr on the device is initially a little non-intuitive to me since
   it's a property of the bus, but I guess it makes sense if we think of
   that level as slot, which includes an address and driver.
  
  That indicates you're thinking about things the wrong way.
  The point of this path is to uniquely identify an entity.
  
  /i440FX-pcihost/pci0 identifies a PCI bus attached to the i440FX-pcihost
  device. To identify a device attached to that bus all you need to know is
  the devfn of the device.
 
 Hmm, I think that identifies where the device is, not what the device
  is.  It's more helpful to say "the e1000 in slot 7" than it is to say
  "the device in slot 7".

Why is this more useful? Canonical addresses should not be helpful. They 
should identify entities within a machine that is already known to be 
consistent. Making them helpful just makes them more volatile.
 
  For an end-user it may be helpful to allow devices to be identified by
  the device type (assuming only one device of a particular type on that
  bus), or specify the device type as a crude error checking mechanism.
  However we're talking about canonical addresses. These need not include
  the device type. Verifying that the device is actually what you expect
  is a separate problem, and the device type is not sufficient for that.
  
  i.e. /i440FX-pcihost/pci.0/,09.0 Is an appropriate canonical address.
 
 We seem to keep introducing new problems, and I'm not sure this one
 exists.  If I drop the device name/type and use only the devfn, then
 what's to prevent the /e1000,09.0/rom (/,09.0/rom) from being stuffed
 into the /rtl8139,09.0/rom (/,09.0/rom) on a migration?  (or we match it
 to /09.0/savevm when trying to migrate state)  We can argue that e1000
 isn't a sufficient identifier, but I can't think of a case where it'd
 fail.

The migration code needs to check that the devices are actually compatible. 
I'd expect this to require much more than just the device name. What you 
actually need is more like An e1000 with 64k eeprom, fast ethernet PHY, and 
frobnitz B. In fact what you really want to do is transfer the device tree 
(including properties), and create the machine from scratch, not load state 
into a pre-supplied device tree.

 I started down that path, but it still breaks for hotplug.  If we
 start a VM with two e1000 NICs, then remove the first, we can no
 longer migrate because the target can't represent having a single
 e1000 with a non-zero instance ID.

That's indeed a good point.
  
  That's a feature. If you start with two NICs and remove the first, the
  chances are that the second will be in a different place to the NIC
  created in a single-NIC config. Allowing migration between these two
  will cause a PCI device to suddenly change location. This is not
  physically or logically possible, and everyone loses.
 
 If the BAR addresses don't follow the VM when it's migrated, that's
 another bug that needs to be fixed, but we can't get there until we at
 least have some infrastructure in place to make that bug possible.

Not BAR addresses, the actual PCI device addresses. Devices on the PCI bus are 
addressed by device and function.  This is guest visible.  The device part of 
this address corresponds to the physical slot, which typically effects IRQ 
routing (amongst other things).  If you arbitrarily move a device from slot A 
to slot B then this will have catastrophic effects on a running machine.

Paul


Re: [PATCH] VT-d: fix PCI device detach from virtual machine

2010-06-14 Thread David Woodhouse
On Thu, 2009-02-26 at 17:31 +0800, Han, Weidong wrote:
 When assigning a device behind a conventional PCI bridge or a PCIe
 to PCI/PCI-X bridge to a domain, its bridge must be assigned as well,
 and the secondary interface may also need to be assigned to the same
 domain.
 
 Dependent assignment is already there, but dependent
 deassignment is missed when detaching a device from a virtual machine.
 This results in conventional PCI device assignment failure after
 it has been assigned once. This patch adds dependent
 deassignment, and fixes the issue.

Um, this code makes my head hurt.

Why are we doing this in the first place? Because the IOMMU works on the
source-id in PCIe transactions, the pci_find_upstream_pcie_bridge()
function effectively tells us which PCI device our own device will be
masquerading as, for the purposes of DMA.

So why do we bother setting up a context in the IOMMU for the device
itself, when no DMA will ever appear to come from this device? And
likewise why do we bother setting up a context for intermediate PCI
bridges?

Why not just jump straight to the 'DMA proxy' device, and use that
_only_?

We'll have to cope with multiple devices behind the same 'proxy', but it
looks like our handling of that is totally screwed already...  what
happens right now if you have two PCI devices behind the same PCIe-PCI
bridge, and try to attach both of them to different domains... or both
to the _same_ domain, in fact, and then detach just one of them. I think
the answer to the latter question is that your newly-added
iommu_detach_dependent_devices() routine will tear down the mapping on
the 'proxy' device and faults will start happening for the device which
is supposed to still be mapped?

Confused... and tempted to rip it all out and start over.

-- 
dwmw2



Re: [Qemu-devel] [PATCH 5/5] [scsi-bsg]: Add initial support for BSG based SCSIDeviceInfo

2010-06-14 Thread Nicholas A. Bellinger
On Mon, 2010-06-14 at 18:11 +, Blue Swirl wrote:
 On Mon, Jun 14, 2010 at 9:44 AM, Nicholas A. Bellinger
 n...@linux-iscsi.org wrote:
  From: Nicholas Bellinger n...@linux-iscsi.org
 
  This patch adds initial support for using the Linux BSG interface with 
  write/read vectored
  AIO as a QEMU backstore (SCSIDeviceInfo) with hw/scsi-bus.c compatible HBA 
  emulation.
 
 Did I miss the docs?

I assume you mean the docs for CLI usage, yes?  The ops are the same
as for scsi-generic; usage with megasas HBA emulation looks like:

qemu-system-x86_64 -m 512 -boot c ~/lenny-32bit.img -drive
if=none,id=mydisk1,file=/dev/4\:0\:1\:0 -device megasas,id=raid -device
scsi-bsg,bus=raid.0,scsi-id=1,drive=mydisk1

 
 
  So far it has been tested with x86_64 host and guest using hw/megasas.c and 
  TCM_Loop LLD
  Port LUNs.  Because this path uses struct iovec for struct 
  sg_io_v4-d[out,in]_xferp payloads,
  which currently requires a patch to linux/block/bsg.c:bsg_map_hdr() in 
  order to setup the
  user - kernel iovecs.  This also will only currently work with paired 
  user/kernel
  (eg: 64bit user / 64bit kernel) because of different pointer sizes in 
  struct iovec-iov_base.
 
  There are also two FIXMEs in hw/scsi-bsg.c:bsg_generic_initfn() related to 
  extraction of
  SCSI LUN and device type values using BSG and required by QEMU-KVM.
 
  Signed-off-by: Nicholas A. Bellinger n...@linux-iscsi.org
  ---
   Makefile.objs |2 +-
   hw/scsi-bsg.c |  588 
  +
   2 files changed, 589 insertions(+), 1 deletions(-)
   create mode 100644 hw/scsi-bsg.c
 
  diff --git a/Makefile.objs b/Makefile.objs
  index 188d617..c4fcb72 100644
  --- a/Makefile.objs
  +++ b/Makefile.objs
  @@ -197,7 +197,7 @@ hw-obj-$(CONFIG_IDE_CMD646) += ide/cmd646.o
   hw-obj-$(CONFIG_IDE_MACIO) += ide/macio.o
 
   # SCSI layer
  -hw-obj-y += scsi-disk.o scsi-generic.o
  +hw-obj-y += scsi-disk.o scsi-generic.o scsi-bsg.o
 
 Instead of '#ifdef __linux__' (which should be '#ifdef CONFIG_LINUX'),
 please compile the object only if CONFIG_LINUX is set, something like:
 hw-obj-$(CONFIG_LINUX) += scsi-bsg.o
 
 Please also check if this could be compiled in common-obj set.
 

Ok, I added the 'hw-obj-$(CONFIG_LINUX) += scsi-bsg.o' mentioned above..

   hw-obj-y += lsi53c895a.o megasas.o
   hw-obj-$(CONFIG_ESP) += esp.o
 
  diff --git a/hw/scsi-bsg.c b/hw/scsi-bsg.c
  new file mode 100644
  index 000..fc76b76
  --- /dev/null
  +++ b/hw/scsi-bsg.c
  @@ -0,0 +1,588 @@
  +/*
  + * block layer implementation of the sg v4 interface for Linux hosts
  + *
  + * Copyright (c) 2010 Rising Tide Systems
  + * Written by Nicholas A. Bellinger n...@linux-iscsi.org
  + *
  + * Based on hw/scsi-generic code by Laurent Vivier, Paul Brook, and 
  Fabrice Bellard
  + *
  + * This code is licenced under the LGPL.
  + */
  +
   +#include "qemu-common.h"
   +#include "qemu-error.h"
   +#include "block.h"
   +#include "scsi.h"
   +#include "dma.h"
   +#include "block/raw-posix-aio.h"
  +
  +#ifdef __linux__
  +
  +#define DEBUG_BSG
  +#undef DEBUG_BSG_IO
  +#undef DEBUG_BSG_MAP
 
 This should be
 //#define DEBUG_BSG
 //#define DEBUG_BSG_IO
 //#define DEBUG_BSG_MAP
 

Fixed

  +
  +#ifdef DEBUG_BSG
  +#define DPRINTF(fmt, ...) \
   +do { printf("scsi-bsg: " fmt , ## __VA_ARGS__); } while (0)
  +#else
  +#define DPRINTF(fmt, ...) do {} while(0)
  +#endif
  +
  +#define BADF(fmt, ...) \
   +do { fprintf(stderr, "scsi-bsg: " fmt , ## __VA_ARGS__); } while (0)
  +
   +#include <stdio.h>
   +#include <sys/types.h>
   +#include <sys/stat.h>
   +#include <sys/epoll.h>
   +#include <unistd.h>
   +#include <scsi/sg.h>
   +#include <linux/bsg.h>
   +#include "scsi-defs.h"
  +
  +#define SCSI_SENSE_BUF_SIZE 96
  +
  +#define SG_ERR_DRIVER_TIMEOUT 0x06
  +#define SG_ERR_DRIVER_SENSE 0x08
  +
  +#ifndef MAX_UINT
  +#define MAX_UINT ((unsigned int)-1)
 
 The standard macro is UINT_MAX.
 

Fixed (This was originally from hw/scsi-generic.c)
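For reference, a trivial standalone check (an editor's aside, not from the patch): converting -1 to unsigned int wraps to the type's maximum, so the hand-rolled MAX_UINT macro and the standard UINT_MAX are the same value.

```c
#include <assert.h>
#include <limits.h>

/* Converting -1 to unsigned int wraps modulo UINT_MAX + 1, yielding
 * UINT_MAX itself, which is why the standard macro can replace the
 * local MAX_UINT definition. */
static unsigned int max_uint(void)
{
    return (unsigned int)-1;
}
```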

  +#endif
  +
  +typedef struct SCSIBSGState SCSIBSGState;
  +
  +typedef struct SCSIBSGReq {
  +SCSIRequest req;
  +uint8_t *buf;
  +int buflen;
  +QEMUIOVector iov;
  +QEMUIOVector aio_iov;
  +struct sg_io_v4 bsg_hdr;
  +} SCSIBSGReq;
  +
  +struct SCSIBSGState {
  +SCSIDevice qdev;
  +BlockDriverState *bs;
  +int lun;
  +int driver_status;
  +uint8_t sensebuf[SCSI_SENSE_BUF_SIZE];
  +uint8_t senselen;
  +};
  +
  +static int bsg_read(int fd, void *p_read, int to_read)
  +{
  +int err;
  +
   +while (to_read > 0) {
   +err = read(fd, p_read, to_read);
   +if (err >= 0) {
   +to_read -= err;
   +p_read += err;
   +} else if (errno == EINTR)
   +continue;
   +else {
   +printf("bsg device %d read failed, errno: %d\n",
   +fd, errno);
 
 DPRINTF?

I made this into a error_report()

 
  +return errno;
  +}
  +}
  +return 0;
  +}
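As a standalone illustration (not part of the patch itself), the retry pattern bsg_read() implements — keep reading until the requested byte count is consumed, restarting reads interrupted by signals — can be sketched like this:

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Sketch of the bsg_read() retry pattern: loop until the requested
 * byte count is consumed, restarting reads interrupted by signals.
 * Returns 0 on success, an errno value on failure (as the patch does). */
static int read_full(int fd, void *buf, size_t to_read)
{
    char *p = buf;

    while (to_read > 0) {
        ssize_t n = read(fd, p, to_read);
        if (n > 0) {
            to_read -= (size_t)n;
            p += n;
        } else if (n < 0 && errno == EINTR) {
            continue;                   /* interrupted: retry the read */
        } else {
            return errno ? errno : EIO; /* n == 0 means unexpected EOF */
        }
    }
    return 0;
}
```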
  +
  +static SCSIBSGReq 

[PATCH] KVM: x86 emulator: fix pusha instruction emulation

2010-06-14 Thread Wei Yongjun
The pusha instruction emulation writes back only the last
register (EDI); the writeback of the other registers is
ignored. This patch fixes it.

---
 arch/x86/kvm/emulate.c |  133 ++--
 1 files changed, 73 insertions(+), 60 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index a4c2dcd..c990db0 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1553,6 +1553,64 @@ exception:
return X86EMUL_PROPAGATE_FAULT;
 }
 
+static inline int writeback(struct x86_emulate_ctxt *ctxt,
+   struct x86_emulate_ops *ops)
+{
+   int rc;
+   struct decode_cache *c = &ctxt->decode;
+   u32 err;
+
+   switch (c->dst.type) {
+   case OP_REG:
+   /* The 4-byte case *is* correct:
+    * in 64-bit mode we zero-extend.
+    */
+   switch (c->dst.bytes) {
+   case 1:
+   *(u8 *)c->dst.ptr = (u8)c->dst.val;
+   break;
+   case 2:
+   *(u16 *)c->dst.ptr = (u16)c->dst.val;
+   break;
+   case 4:
+   *c->dst.ptr = (u32)c->dst.val;
+   break;  /* 64b: zero-ext */
+   case 8:
+   *c->dst.ptr = c->dst.val;
+   break;
+   }
+   break;
+   case OP_MEM:
+   if (c->lock_prefix)
+   rc = ops->cmpxchg_emulated(
+   (unsigned long)c->dst.ptr,
+   &c->dst.orig_val,
+   &c->dst.val,
+   c->dst.bytes,
+   &err,
+   ctxt->vcpu);
+   else
+   rc = ops->write_emulated(
+   (unsigned long)c->dst.ptr,
+   &c->dst.val,
+   c->dst.bytes,
+   &err,
+   ctxt->vcpu);
+   if (rc == X86EMUL_PROPAGATE_FAULT)
+   emulate_pf(ctxt,
+ (unsigned long)c->dst.ptr, err);
+   if (rc != X86EMUL_CONTINUE)
+   return rc;
+   break;
+   case OP_NONE:
+   /* no writeback */
+   break;
+   default:
+   break;
+   }
+   return X86EMUL_CONTINUE;
+}
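An editor's aside on the "4-byte case *is* correct" comment in the function above: in 64-bit mode a 32-bit register write zero-extends, clearing the upper half of the destination. A minimal standalone demonstration (not kernel code):

```c
#include <assert.h>
#include <stdint.h>

/* Standalone illustration of the "4-byte case" comment: in 64-bit
 * mode, writing a 32-bit value into a register zero-extends, clearing
 * the upper 32 bits of the 64-bit destination. */
static void write_reg32(uint64_t *reg, uint64_t val)
{
    *reg = (uint32_t)val;   /* truncate to 32 bits, zero-extend to 64 */
}
```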
+
 static inline void emulate_push(struct x86_emulate_ctxt *ctxt,
struct x86_emulate_ops *ops)
 {
@@ -1651,11 +1709,12 @@ static int emulate_pop_sreg(struct x86_emulate_ctxt *ctxt,
return rc;
 }
 
-static void emulate_pusha(struct x86_emulate_ctxt *ctxt,
+static int emulate_pusha(struct x86_emulate_ctxt *ctxt,
  struct x86_emulate_ops *ops)
 {
	struct decode_cache *c = &ctxt->decode;
	unsigned long old_esp = c->regs[VCPU_REGS_RSP];
+   int rc = X86EMUL_CONTINUE;
int reg = VCPU_REGS_RAX;
 
	while (reg <= VCPU_REGS_RDI) {
@@ -1663,8 +1722,18 @@ static void emulate_pusha(struct x86_emulate_ctxt *ctxt,
		(reg == VCPU_REGS_RSP) ?
		(c->src.val = old_esp) : (c->src.val = c->regs[reg]);
 
emulate_push(ctxt, ops);
+
+   rc = writeback(ctxt, ops);
+   if (rc != X86EMUL_CONTINUE)
+   return rc;
+
++reg;
}
+
+   /* Disable writeback. */
+   c-dst.type = OP_NONE;
+
+   return rc;
 }
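To illustrate the semantics the patch preserves, here is a sketch with hypothetical names (an editor's model, not the kernel code): PUSHA pushes the registers AX,CX,DX,BX,SP,BP,SI,DI in order, the value pushed for SP is the one it held before the instruction began, and the fix amounts to running writeback() after each emulate_push() so every intermediate stack-pointer update actually lands.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model (names invented for illustration) of the PUSHA
 * order the emulator implements: registers RAX..RDI are pushed in
 * sequence, and the value pushed for RSP is the one it held *before*
 * the instruction started. */
enum { RAX, RCX, RDX, RBX, RSP, RBP, RSI, RDI, NREGS };

static void model_pusha(uint32_t regs[NREGS], uint32_t pushed[NREGS])
{
    uint32_t old_esp = regs[RSP];

    for (int reg = RAX; reg <= RDI; reg++) {
        pushed[reg] = (reg == RSP) ? old_esp : regs[reg];
        regs[RSP] -= 4;   /* each push must be written back immediately */
    }
}
```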
 
 static int emulate_popa(struct x86_emulate_ctxt *ctxt,
@@ -1817,64 +1886,6 @@ static int emulate_ret_far(struct x86_emulate_ctxt *ctxt,
return rc;
 }
 
-static inline int writeback(struct x86_emulate_ctxt *ctxt,
-   struct x86_emulate_ops *ops)
-{
-   int rc;
-   struct decode_cache *c = &ctxt->decode;
-   u32 err;
-
-   switch (c->dst.type) {
-   case OP_REG:
-   /* The 4-byte case *is* correct:
-    * in 64-bit mode we zero-extend.
-    */
-   switch (c->dst.bytes) {
-   case 1:
-   *(u8 *)c->dst.ptr = (u8)c->dst.val;
-   break;
-   case 2:
-   *(u16 *)c->dst.ptr = (u16)c->dst.val;
-   break;
-   case 4:
-   *c->dst.ptr = (u32)c->dst.val;
-   break;  /* 64b: zero-ext */
-   case 8:
-   *c->dst.ptr = c->dst.val;
-   break;
-   }
-   break;
-   case OP_MEM:
-   if (c->lock_prefix)
-   rc = ops->cmpxchg_emulated(
-   (unsigned long)c->dst.ptr,
-   &c->dst.orig_val,
-   
