Re: How many threads should a kvm vm be starting?

2011-09-28 Thread Daniel P. Berrange
On Tue, Sep 27, 2011 at 04:04:41PM -0600, Thomas Fjellstrom wrote:
 On September 27, 2011, Avi Kivity wrote:
  On 09/27/2011 03:29 AM, Thomas Fjellstrom wrote:
   I just noticed something interesting, a virtual machine on one of my
   servers seems to have 69 threads (including the main thread). Other
   guests on the machine only have a couple threads.
   
   Is this normal? or has something gone horribly wrong?
  
  It's normal if the guest does a lot of I/O.  The thread count should go
  down when the guest idles.
 
 Ah, that would make sense. Though it kind of defeats assigning a vm a single 
 cpu/core. A single VM can now DOS an entire multi-core-cpu server. It pretty 
 much pegged my dual core (with HT) server for a couple hours.

You can mitigate these problems by putting each KVM process in its own
cgroup, and using the 'cpu_shares' tunable to ensure that each KVM
process gets the same relative ratio of CPU time, regardless of how
many threads it is running. With newer kernels there are other CPU
tunables for placing hard caps on CPU utilization of the process as
a whole too.
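
For illustration, a minimal sketch of doing this by hand with the cgroup
'cpu' controller (the mount point, group name and pgrep pattern are only
examples; libvirt can also place guests into cgroups for you):

  # create a group and give this guest a relative CPU weight
  mkdir -p /sys/fs/cgroup/cpu/vm1
  echo 512 > /sys/fs/cgroup/cpu/vm1/cpu.shares
  # find the guest's QEMU process (example pattern) and move it into the group
  QEMU_PID=$(pgrep -f 'qemu.*vm1' | head -n1)
  echo "$QEMU_PID" > /sys/fs/cgroup/cpu/vm1/cgroup.procs   # newer kernels: moves all threads
  # on older kernels, echo each thread id from /proc/$QEMU_PID/task into 'tasks' instead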

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|


Re: [libvirt] Qemu/KVM is 3x slower under libvirt

2011-09-28 Thread Daniel P. Berrange
On Tue, Sep 27, 2011 at 08:10:21PM +0200, Reeted wrote:
 I repost this, this time by also including the libvirt mailing list.
 
 Info on my libvirt: it's the version in Ubuntu 11.04 Natty which is
 0.8.8-1ubuntu6.5 . I didn't recompile this one, while Kernel and
 qemu-kvm are vanilla and compiled by hand as described below.
 
 My original message follows:
 
 This is really strange.
 
 I just installed a new host with kernel 3.0.3 and Qemu-KVM 0.14.1
 compiled by me.
 
 I have created the first VM.
 This is on LVM, virtio etc... if I run it directly from bash
 console, it boots in 8 seconds (it's a bare ubuntu with no
 graphics), while if I boot it under virsh (libvirt) it boots in
 20-22 seconds. This is the time from after Grub to the login prompt,
 or from after Grub to the ssh-server up.

 I was almost able to replicate the whole libvirt command line on the
 bash console, and it still goes almost 3x faster when launched from
 bash than with virsh start vmname. The part I wasn't able to
 replicate is the -netdev part because I still haven't understood the
 semantics of it.

-netdev is just an alternative way of setting up networking that
avoids QEMU's nasty VLAN concept. Using -netdev allows QEMU to
use more efficient codepaths for networking, which should improve
the network performance.

 This is my bash commandline:
 
 /opt/qemu-kvm-0.14.1/bin/qemu-system-x86_64 -M pc-0.14 -enable-kvm
 -m 2002 -smp 2,sockets=2,cores=1,threads=1 -name vmname1-1 -uuid
 ee75e28a-3bf3-78d9-3cba-65aa63973380 -nodefconfig -nodefaults
 -chardev 
 socket,id=charmonitor,path=/var/lib/libvirt/qemu/vmname1-1.monitor,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=readline -rtc base=utc
 -boot order=dc,menu=on -drive 
 file=/dev/mapper/vgPtpVM-lvVM_Vmname1_d1,if=none,id=drive-virtio-disk0,boot=on,format=raw,cache=none,aio=native
 -device 
 virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
 -drive 
 if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,cache=none,aio=native
 -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0
 -net nic,model=virtio -net tap,ifname=tap0,script=no,downscript=no
 -usb -vnc 127.0.0.1:0 -vga cirrus -device
 virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5


This shows KVM is being requested, but we should validate that KVM is
definitely being activated when under libvirt. You can test this by
doing:

virsh qemu-monitor-command vmname1 'info kvm'

 Which was taken from libvirt's command line. The only modifications
 I did to the original libvirt commandline (seen with ps aux) were:
 
 - Removed -S

Fine, has no effect on performance.

 - Network was: -netdev tap,fd=17,id=hostnet0,vhost=on,vhostfd=18
 -device 
 virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:05:36:60,bus=pci.0,addr=0x3
 Has been simplified to: -net nic,model=virtio -net
 tap,ifname=tap0,script=no,downscript=no
 and manual bridging of the tap0 interface.

You could have equivalently used

 -netdev tap,ifname=tap0,script=no,downscript=no,id=hostnet0,vhost=on
 -device 
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:05:36:60,bus=pci.0,addr=0x3

That said, I don't expect this has anything to do with the performance
since booting a guest rarely involves much network I/O unless you're
doing something odd like NFS-root / iSCSI-root.

 Firstly I had thought that this could be fault of the VNC: I have
 compiled qemu-kvm with no separate vnc thread. I thought that
 libvirt might have connected to the vnc server at all times and this
 could have slowed down the whole VM.
 But then I also tried connecting with vncviewer to the KVM machine
 launched directly from bash, and the speed of it didn't change. So
 no, it doesn't seem to be that.

Yeah, I have never seen VNC be responsible for the kind of slowdown
you describe.

 BTW: is the slowdown of the VM on no separate vnc thread only in
 effect when somebody is actually connected to VNC, or always?

Probably, but again I don't think it is likely to be relevant here.

 Also, note that the time difference is not visible in dmesg once the
 machine has booted. So it's not a slowdown in detecting devices.
 Devices are always detected within the first 3 seconds, according to
 dmesg, at 3.6 seconds the first ext4 mount begins. It seems to be
 really the OS boot that is slow... it seems like a hard disk performance
 problem.


There are a couple of things that would be different between running the
VM directly, vs via libvirt.

 - Security drivers - SELinux/AppArmor
 - CGroups

If it was AppArmor causing this slowdown I don't think you would have
been the first person to complain, so let's ignore that. Which leaves
cgroups as a likely culprit. Do a

  grep cgroup /proc/mounts

if any of them are mounted, then for each cgroups mount in turn,

 - Umount the cgroup
 - Restart libvirtd
 - Test your guest boot performance
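
Something along these lines, assuming a typical mount location (the path and
the libvirtd init script name will vary by distro):

  # for each mounted hierarchy reported above, e.g. /sys/fs/cgroup/cpu:
  umount /sys/fs/cgroup/cpu
  /etc/init.d/libvirtd restart
  # then time the guest boot again, e.g. after 'virsh start vmname1'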


Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org   

Re: How many threads should a kvm vm be starting?

2011-09-28 Thread Thomas Fjellstrom
On September 28, 2011, Daniel P. Berrange wrote:
 On Tue, Sep 27, 2011 at 04:04:41PM -0600, Thomas Fjellstrom wrote:
  On September 27, 2011, Avi Kivity wrote:
   On 09/27/2011 03:29 AM, Thomas Fjellstrom wrote:
I just noticed something interesting, a virtual machine on one of my
servers seems to have 69 threads (including the main thread). Other
guests on the machine only have a couple threads.

Is this normal? or has something gone horribly wrong?
   
   It's normal if the guest does a lot of I/O.  The thread count should go
   down when the guest idles.
  
  Ah, that would make sense. Though it kind of defeats assigning a vm a
  single cpu/core. A single VM can now DOS an entire multi-core-cpu
  server. It pretty much pegged my dual core (with HT) server for a couple
  hours.
 
 You can mitigate these problems by putting each KVM process in its own
 cgroup, and using the 'cpu_shares' tunable to ensure that each KVM
 process gets the same relative ratio of CPU time, regardless of how
 many threads it is running. With newer kernels there are other CPU
 tunables for placing hard caps on CPU utilization of the process as
 a whole too.

I'll have to look into how to set that up with libvirt. A brief search leads 
me to believe it's rather easy to set up, so I'll have to do that ASAP :)
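
For what it's worth, a recent enough libvirt can apparently set the weight
without touching the cgroup filesystem directly, e.g. (the domain name is just
an example):

  virsh schedinfo guestname                        # show current scheduler parameters
  virsh schedinfo guestname --set cpu_shares=512   # relative CPU weight for this guest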

 Regards,
 Daniel


-- 
Thomas Fjellstrom
tho...@fjellstrom.ca


Re: [Qemu-devel] RFC [v2]: vfio / device assignment -- layout of device fd files

2011-09-28 Thread Alexander Graf

On 28.09.2011, at 04:40, Alex Williamson wrote:

 On Tue, 2011-09-27 at 16:28 -0500, Scott Wood wrote:
 On 09/26/2011 07:45 PM, Alex Williamson wrote:
 On Mon, 2011-09-26 at 18:59 -0500, Scott Wood wrote:
 On 09/26/2011 01:34 PM, Alex Williamson wrote:
 /* Reset the device */
 #define VFIO_DEVICE_RESET _IO(, ,)
 
 What generic way do we have to do this?  We should probably have a way
 to determine whether it's possible, without actually asking to do it.
 
 It's not generic, it could be a VFIO_DEVICE_PCI_RESET or we could add a
 bit to the device flags to indicate if it's available or we could add a
 probe arg to the ioctl to either check for existence or do it.
 
 Even with PCI, isn't this only possible if function-level reset is
 supported?
 
 There are a couple other things we can do if FLR isn't present (D3hot
 transition, secondary bus reset, device specific resets are possible).
 
 I think we need a flag.
 
 Ok, PCI has a pci_probe_reset_function() and pci_reset_function().  I'd
 probably mimic those in the vfio device ops.  Common vfio code can probe
 the reset and set the flag appropriately and we can have a common
 VFIO_DEVICE_RESET ioctl that calls into the device ops reset function.
 
 For devices that can't be reset by the kernel, we'll want the ability to
 stop/start DMA access through the IOMMU (or other bus-specific means),
 separate from whether the fd is open.  If a device is assigned to a
 partition and that partition gets reset, we'll want to disable DMA
 before we re-use the memory, and enable it after the partition has reset
 or quiesced the device (which requires the fd to be open).
 
 Maybe this can be accomplished via an iommu_detach_device() to
 temporarily disassociate it from the domain.  We could also unmap all
 the DMA.  Anyway, a few possibilities.
 
 /* PCI MSI setup, arg[0] = #, arg[1-n] = eventfds */
 #define VFIO_DEVICE_PCI_SET_MSI_EVENTFDS  _IOW(, , int)
 #define VFIO_DEVICE_PCI_SET_MSIX_EVENTFDS _IOW(, , int)
 
 Hope that covers it.
 
 It could be done this way, but I predict that the code (both kernel and
 user side) will be larger.  Maybe not much more complex, but more
 boilerplate.
 
 How will you manage extensions to the interface?
 
 I would assume we'd do something similar to the kvm capabilities checks.
 
 This information is already built into the data-structure approach.
 
 If we define it to be part of the flags, then it's built-in to the ioctl
 approach too...
 
 The table should not be particularly large, and you'll need to keep the
 information around in some form regardless.  Maybe in the PCI case you
 could produce it dynamically (though I probably wouldn't), but it really
 wouldn't make sense in the device tree case.
 
 It would be entirely dynamic for PCI, there's no advantage to caching
 it.  Even for device tree, if you can't fetch it dynamically, you'd have
 to duplicate it between an internal data structure and a buffer reading
 the table.
 
 I don't think we'd need to keep the device tree path/index info around
 for anything but the table -- but really, this is a minor consideration.
 
 You also lose the ability to easily have a human look at the hexdump for
 debugging; you'll need a special lsvfio tool.  You might want one
 anyway to pretty-print the info, but with ioctls it's mandatory.
 
 I don't think this alone justifies duplicating the data and making it
 difficult to parse on both ends.  Chances are we won't need such a tool
 for the ioctl interface because it's easier to get it right the first
 time ;)
 
 It's not just useful for getting the code right, but for e.g. sanity
 checking that the devices were bound properly.  I think such a tool
 would be generally useful, no matter what the kernel interface ends up
 being.  I don't just use lspci to debug the PCI subsystem. :-)
 
 This is also a minor consideration.  Looking at hexdumps isn't much to
 rely on for debugging and if we take the step of writing a tool, it's
 not much harder to write for either interface.  The table is more akin
 to dumping the data, but I feel the ioctl is easier for how a driver
 would probably make use of the data (linear vs random access).
 
 Note that I'm not stuck on this interface, I was just thinking about how
 to generate the table last week, it seemed like a pain so I thought I'd
 spend a few minutes outlining an ioctl interface... turns out it's not
 so bad.  Thanks,
 
 Yeah, it can work either way, as long as the information's there and
 there's a way to add new bits of information, or new bus types, down the
 road.  Mainly a matter of aesthetics between the two.
 
 It'd be nice if David would chime back in since he objected to the
 table.  Does an ioctl interface look better?  Alex Graf, any opinions?

I'm honestly pretty indifferent on ioctl vs. linear read. I got the impression 
that people dislike ioctls for whatever reason, so we went ahead and did the 
design based on read(). With KVM, ioctls are a constant pain to extend, but so 

Re: [libvirt] Qemu/KVM is 3x slower under libvirt

2011-09-28 Thread Reeted

On 09/28/11 09:51, Daniel P. Berrange wrote:

On Tue, Sep 27, 2011 at 08:10:21PM +0200, Reeted wrote:

I repost this, this time by also including the libvirt mailing list.

Info on my libvirt: it's the version in Ubuntu 11.04 Natty which is
0.8.8-1ubuntu6.5 . I didn't recompile this one, while Kernel and
qemu-kvm are vanilla and compiled by hand as described below.

My original message follows:

This is really strange.

I just installed a new host with kernel 3.0.3 and Qemu-KVM 0.14.1
compiled by me.

I have created the first VM.
This is on LVM, virtio etc... if I run it directly from bash
console, it boots in 8 seconds (it's a bare ubuntu with no
graphics), while if I boot it under virsh (libvirt) it boots in
20-22 seconds. This is the time from after Grub to the login prompt,
or from after Grub to the ssh-server up.

I was almost able to replicate the whole libvirt command line on the
bash console, and it still goes almost 3x faster when launched from
bash than with virsh start vmname. The part I wasn't able to
replicate is the -netdev part because I still haven't understood the
semantics of it.

-netdev is just an alternative way of setting up networking that
avoids QEMU's nasty VLAN concept. Using -netdev allows QEMU to
use more efficient codepaths for networking, which should improve
the network performance.


This is my bash commandline:

/opt/qemu-kvm-0.14.1/bin/qemu-system-x86_64 -M pc-0.14 -enable-kvm
-m 2002 -smp 2,sockets=2,cores=1,threads=1 -name vmname1-1 -uuid
ee75e28a-3bf3-78d9-3cba-65aa63973380 -nodefconfig -nodefaults
-chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/vmname1-1.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=readline -rtc base=utc
-boot order=dc,menu=on -drive 
file=/dev/mapper/vgPtpVM-lvVM_Vmname1_d1,if=none,id=drive-virtio-disk0,boot=on,format=raw,cache=none,aio=native
-device 
virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
-drive 
if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,cache=none,aio=native
-device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0
-net nic,model=virtio -net tap,ifname=tap0,script=no,downscript=no
-usb -vnc 127.0.0.1:0 -vga cirrus -device
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5


This shows KVM is being requested, but we should validate that KVM is
definitely being activated when under libvirt. You can test this by
doing:

 virsh qemu-monitor-command vmname1 'info kvm'


kvm support: enabled

I think I would see a higher impact if it was KVM not enabled.


Which was taken from libvirt's command line. The only modifications
I did to the original libvirt commandline (seen with ps aux) were:

- Removed -S

Fine, has no effect on performance.


- Network was: -netdev tap,fd=17,id=hostnet0,vhost=on,vhostfd=18
-device 
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:05:36:60,bus=pci.0,addr=0x3
Has been simplified to: -net nic,model=virtio -net
tap,ifname=tap0,script=no,downscript=no
and manual bridging of the tap0 interface.

You could have equivalently used

  -netdev tap,ifname=tap0,script=no,downscript=no,id=hostnet0,vhost=on
  -device 
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:05:36:60,bus=pci.0,addr=0x3


It's this! It's this!! (thanks for the line)

It raises boot time by 10-13 seconds

But now I don't know where to look. During boot there is a pause, 
usually between /scripts/init-bottom (Ubuntu 11.04 guest) and the 
appearance of the login prompt; however, that is not really meaningful 
because there is probably much background activity going on there, with 
init etc. which don't display messages.



init-bottom does just this

-
#!/bin/sh -e
# initramfs init-bottom script for udev

PREREQ=

# Output pre-requisites
prereqs()
{
echo $PREREQ
}

case $1 in
prereqs)
prereqs
exit 0
;;
esac


# Stop udevd, we'll miss a few events while we run init, but we catch up
pkill udevd

# Move /dev to the real filesystem
mount -n -o move /dev ${rootmnt}/dev
-

It doesn't look like it should take time to execute.
So there is probably some other background activity going on... and that 
is slower, but I don't know what that is.



Another thing that can be noticed is that the dmesg message:

[   13.290173] eth0: no IPv6 routers present

(which is also the last message)

happens on average 1 (one) second earlier in the fast case (-net) than 
in the slow case (-netdev)




That said, I don't expect this has anything todo with the performance
since booting a guest rarely involves much network I/O unless you're
doing something odd like NFS-root / iSCSI-root.


No there is nothing like that. No network disks or nfs.

I had ntpdate, but I removed that and it changed nothing.



Firstly I had thought that this could be fault of the VNC: I have
compiled qemu-kvm with no separate vnc thread. I thought that
libvirt might have connected to the vnc server at all times and 

Re: [libvirt] Qemu/KVM is 3x slower under libvirt

2011-09-28 Thread Daniel P. Berrange
On Wed, Sep 28, 2011 at 11:19:43AM +0200, Reeted wrote:
 On 09/28/11 09:51, Daniel P. Berrange wrote:
 This is my bash commandline:
 
 /opt/qemu-kvm-0.14.1/bin/qemu-system-x86_64 -M pc-0.14 -enable-kvm
 -m 2002 -smp 2,sockets=2,cores=1,threads=1 -name vmname1-1 -uuid
 ee75e28a-3bf3-78d9-3cba-65aa63973380 -nodefconfig -nodefaults
 -chardev 
 socket,id=charmonitor,path=/var/lib/libvirt/qemu/vmname1-1.monitor,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=readline -rtc base=utc
 -boot order=dc,menu=on -drive 
 file=/dev/mapper/vgPtpVM-lvVM_Vmname1_d1,if=none,id=drive-virtio-disk0,boot=on,format=raw,cache=none,aio=native
 -device 
 virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
 -drive 
 if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,cache=none,aio=native
 -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0
 -net nic,model=virtio -net tap,ifname=tap0,script=no,downscript=no
 -usb -vnc 127.0.0.1:0 -vga cirrus -device
 virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5
 
 This shows KVM is being requested, but we should validate that KVM is
 definitely being activated when under libvirt. You can test this by
 doing:
 
  virsh qemu-monitor-command vmname1 'info kvm'
 
 kvm support: enabled
 
 I think I would see a higher impact if it was KVM not enabled.
 
 Which was taken from libvirt's command line. The only modifications
 I did to the original libvirt commandline (seen with ps aux) were:


 - Network was: -netdev tap,fd=17,id=hostnet0,vhost=on,vhostfd=18
 -device 
 virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:05:36:60,bus=pci.0,addr=0x3
 Has been simplified to: -net nic,model=virtio -net
 tap,ifname=tap0,script=no,downscript=no
 and manual bridging of the tap0 interface.
 You could have equivalently used
 
   -netdev tap,ifname=tap0,script=no,downscript=no,id=hostnet0,vhost=on
   -device 
  virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:05:36:60,bus=pci.0,addr=0x3
 
 It's this! It's this!! (thanks for the line)
 
 It raises boot time by 10-13 seconds

Ok, that is truly bizarre and I don't really have any explanation
for why that is. I guess you could try 'vhost=off' too and see if that
makes the difference.
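
For example, on the bash command line that would just mean flipping the flag
in the -netdev option (sketch of the two relevant arguments only):

  -netdev tap,ifname=tap0,script=no,downscript=no,id=hostnet0,vhost=off
  -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:05:36:60,bus=pci.0,addr=0x3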

 
 But now I don't know where to look During boot there is a pause
 usually between /scripts/init-bottom  (Ubuntu 11.04 guest) and the
 appearance of login prompt, however that is not really meaningful
 because there is probably much background activity going on there,
 with init etc. which don't display messages
 
 
 init-bottom does just this
 
 -
 #!/bin/sh -e
 # initramfs init-bottom script for udev
 
 PREREQ=
 
 # Output pre-requisites
 prereqs()
 {
 echo $PREREQ
 }
 
 case $1 in
 prereqs)
 prereqs
 exit 0
 ;;
 esac
 
 
 # Stop udevd, we'll miss a few events while we run init, but we catch up
 pkill udevd
 
 # Move /dev to the real filesystem
 mount -n -o move /dev ${rootmnt}/dev
 -
 
 It doesn't look like it should take time to execute.
 So there is probably some other background activity going on... and
 that is slower, but I don't know what that is.
 
 
 Another thing that can be noticed is that the dmesg message:
 
 [   13.290173] eth0: no IPv6 routers present
 
 (which is also the last message)
 
 happens on average 1 (one) second earlier in the fast case (-net)
 than in the slow case (-netdev)

Hmm, none of that looks particularly suspect. So I don't really have
much idea what else to try apart from the 'vhost=off' possibilty.


Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|


Re: [libvirt] Qemu/KVM is 3x slower under libvirt

2011-09-28 Thread Reeted

On 09/28/11 11:28, Daniel P. Berrange wrote:

On Wed, Sep 28, 2011 at 11:19:43AM +0200, Reeted wrote:

On 09/28/11 09:51, Daniel P. Berrange wrote:

This is my bash commandline:

/opt/qemu-kvm-0.14.1/bin/qemu-system-x86_64 -M pc-0.14 -enable-kvm
-m 2002 -smp 2,sockets=2,cores=1,threads=1 -name vmname1-1 -uuid
ee75e28a-3bf3-78d9-3cba-65aa63973380 -nodefconfig -nodefaults
-chardev 
socket,id=charmonitor,path=/var/lib/libvirt/qemu/vmname1-1.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=readline -rtc base=utc
-boot order=dc,menu=on -drive 
file=/dev/mapper/vgPtpVM-lvVM_Vmname1_d1,if=none,id=drive-virtio-disk0,boot=on,format=raw,cache=none,aio=native
-device 
virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
-drive 
if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,cache=none,aio=native
-device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0
-net nic,model=virtio -net tap,ifname=tap0,script=no,downscript=no
-usb -vnc 127.0.0.1:0 -vga cirrus -device
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5

This shows KVM is being requested, but we should validate that KVM is
definitely being activated when under libvirt. You can test this by
doing:

 virsh qemu-monitor-command vmname1 'info kvm'

kvm support: enabled

I think I would see a higher impact if it was KVM not enabled.


Which was taken from libvirt's command line. The only modifications
I did to the original libvirt commandline (seen with ps aux) were:



- Network was: -netdev tap,fd=17,id=hostnet0,vhost=on,vhostfd=18
-device 
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:05:36:60,bus=pci.0,addr=0x3
Has been simplified to: -net nic,model=virtio -net
tap,ifname=tap0,script=no,downscript=no
and manual bridging of the tap0 interface.

You could have equivalently used

  -netdev tap,ifname=tap0,script=no,downscript=no,id=hostnet0,vhost=on
  -device 
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:05:36:60,bus=pci.0,addr=0x3

It's this! It's this!! (thanks for the line)

It raises boot time by 10-13 seconds

Ok, that is truly bizarre and I don't really have any explanation
for why that is. I guess you could try 'vhost=off' too and see if that
makes the difference.


YES!
It's the vhost. With vhost=on it takes about 12 seconds more time to boot.

...meaning? :-)



Re: [libvirt] Qemu/KVM is 3x slower under libvirt (due to vhost=on)

2011-09-28 Thread Daniel P. Berrange
On Wed, Sep 28, 2011 at 11:49:01AM +0200, Reeted wrote:
 On 09/28/11 11:28, Daniel P. Berrange wrote:
 On Wed, Sep 28, 2011 at 11:19:43AM +0200, Reeted wrote:
 On 09/28/11 09:51, Daniel P. Berrange wrote:
 This is my bash commandline:
 
 /opt/qemu-kvm-0.14.1/bin/qemu-system-x86_64 -M pc-0.14 -enable-kvm
 -m 2002 -smp 2,sockets=2,cores=1,threads=1 -name vmname1-1 -uuid
 ee75e28a-3bf3-78d9-3cba-65aa63973380 -nodefconfig -nodefaults
 -chardev 
 socket,id=charmonitor,path=/var/lib/libvirt/qemu/vmname1-1.monitor,server,nowait
 -mon chardev=charmonitor,id=monitor,mode=readline -rtc base=utc
 -boot order=dc,menu=on -drive 
 file=/dev/mapper/vgPtpVM-lvVM_Vmname1_d1,if=none,id=drive-virtio-disk0,boot=on,format=raw,cache=none,aio=native
 -device 
 virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0
 -drive 
 if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw,cache=none,aio=native
 -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0
 -net nic,model=virtio -net tap,ifname=tap0,script=no,downscript=no
 -usb -vnc 127.0.0.1:0 -vga cirrus -device
 virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5
 This shows KVM is being requested, but we should validate that KVM is
 definitely being activated when under libvirt. You can test this by
 doing:
 
  virsh qemu-monitor-command vmname1 'info kvm'
 kvm support: enabled
 
 I think I would see a higher impact if it was KVM not enabled.
 
 Which was taken from libvirt's command line. The only modifications
 I did to the original libvirt commandline (seen with ps aux) were:
 
 - Network was: -netdev tap,fd=17,id=hostnet0,vhost=on,vhostfd=18
 -device 
 virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:05:36:60,bus=pci.0,addr=0x3
 Has been simplified to: -net nic,model=virtio -net
 tap,ifname=tap0,script=no,downscript=no
 and manual bridging of the tap0 interface.
 You could have equivalently used
 
   -netdev tap,ifname=tap0,script=no,downscript=no,id=hostnet0,vhost=on
   -device 
  virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:05:36:60,bus=pci.0,addr=0x3
 It's this! It's this!! (thanks for the line)
 
 It raises boot time by 10-13 seconds
 Ok, that is truly bizarre and I don't really have any explanation
 for why that is. I guess you could try 'vhost=off' too and see if that
 makes the difference.
 
 YES!
 It's the vhost. With vhost=on it takes about 12 seconds more time to boot.
 
 ...meaning? :-)

I've no idea. I was always under the impression that 'vhost=on' was
the 'make it go much faster' switch. So something is going wrong
here that I can't explain.

Perhaps one of the network people on this list can explain...


To turn vhost off in the libvirt XML, you should be able to use
<driver name='qemu'/> for the interface in question, e.g.


<interface type='user'>
  <mac address='52:54:00:e5:48:58'/>
  <model type='virtio'/>
  <driver name='qemu'/>
</interface>
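
After editing the guest XML (e.g. with 'virsh edit') and restarting it, one
rough way to confirm vhost is really off is to look at the generated QEMU
command line and at the vhost worker threads (a sketch; the guest name is an
example):

  ps -ef | grep '[q]emu.*vmname1' | grep -o 'vhost=[a-z]*'   # expect no vhost=on
  ps -ef | grep '[v]host-'                                   # vhost kernel threads, if any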

Regards,
Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|


Re: [libvirt] Qemu/KVM is 3x slower under libvirt (due to vhost=on)

2011-09-28 Thread Daniel P. Berrange
On Wed, Sep 28, 2011 at 12:19:09PM +0200, Reeted wrote:
 On 09/28/11 11:53, Daniel P. Berrange wrote:
 On Wed, Sep 28, 2011 at 11:49:01AM +0200, Reeted wrote:
 YES!
 It's the vhost. With vhost=on it takes about 12 seconds more time to boot.
 
 ...meaning? :-)
 I've no idea. I was always under the impression that 'vhost=on' was
 the 'make it go much faster' switch. So something is going wrong
 here that I can't explain.
 
 Perhaps one of the network people on this list can explain...
 
 
 To turn vhost off in the libvirt XML, you should be able to use
 <driver name='qemu'/> for the interface in question, e.g.
 
 
  <interface type='user'>
    <mac address='52:54:00:e5:48:58'/>
    <model type='virtio'/>
    <driver name='qemu'/>
  </interface>
 
 
 Ok that seems to work: it removes the vhost part in the virsh launch
 hence cutting down 12secs of boot time.
 
 If nobody comes out with an explanation of why, I will open another
 thread on the kvm list for this. I would probably need to test disk
 performance on vhost=on to see if it degrades or it's for another
 reason that boot time is increased.

Be sure to CC the qemu-devel mailing list too next time, since that has
a wider audience who might be able to help


Daniel
-- 
|: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o- http://virt-manager.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|


Re: [PATCHv2 0/3] acpi: fix up EJ0 in DSDT

2011-09-28 Thread Amos Kong
- Original Message -
 This is a second iteration of the patch.  The patch has been
 significantly reworked to address (offline) comments by Gleb.
 
 I think the infrastructure created is generic enough
 to be generally useful beyond the specific bug
 that I would like to fix. Specifically it
 will be able to find S3 Name to patch that,
 or process compiled CPU SSDT to avoid the need for
 hardcoded offsets.
 
 Please comment.
 
 Main changes:
   - tools rewritten in python
   - Original ASL retains _EJ0 methods, BIOS patches that to EJ0_
   - generic ACPI_EXTRACT infrastructure that can match Method
   and Name Operators
   - instead of matching specific method name, insert tags
 in original DSL source and match that to AML


Patch looks good to me.
Acked-by: Amos Kong ak...@redhat.com

 -
 
 Here's a bug: guest thinks it can eject VGA device and ISA bridge.
 
 [root@dhcp74-172 ~]#lspci
 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
 (rev 02)
 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA
 [Natoma/Triton II]
 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE
 [Natoma/Triton II]
 00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
 00:02.0 VGA compatible controller: Cirrus Logic GD 5446
 00:03.0 PCI bridge: Red Hat, Inc. Device 0001
 00:04.0 Ethernet controller: Qumranet, Inc. Virtio network device
 00:05.0 SCSI storage controller: Qumranet, Inc. Virtio block device
 01:00.0 Ethernet controller: Intel Corporation 82540EM Gigabit
 Ethernet Controller (rev 03)
 
 [root@dhcp74-172 ~]# ls /sys/bus/pci/slots/1/
 adapter  address  attention  latch  module  power
 [root@dhcp74-172 ~]# ls /sys/bus/pci/slots/2/
 adapter  address  attention  latch  module  power
 
 [root@dhcp74-172 ~]# echo 0 > /sys/bus/pci/slots/2/power
 [root@dhcp74-172 ~]# lspci
 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
 (rev 02)
 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA
 [Natoma/Triton II]
 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE
 [Natoma/Triton II]
 00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
 00:03.0 PCI bridge: Red Hat, Inc. Device 0001
 00:04.0 Ethernet controller: Qumranet, Inc. Virtio network device
 00:05.0 SCSI storage controller: Qumranet, Inc. Virtio block device
 01:00.0 Ethernet controller: Intel Corporation 82540EM Gigabit
 Ethernet Controller (rev 03)
 
 This is wrong because slots 1 and 2 are marked as not hotpluggable
 in qemu.
 
 The reason is that our acpi tables declare both _RMV with value 0,
 and _EJ0 method for these slots. What happens in this case
 is undocumented by ACPI spec, so linux ignores _RMV,
 and windows seems to ignore _EJ0.
 
 The correct way to suppress hotplug is not to have _EJ0,
 so this is what this patch does: it probes PIIX and
 modifies DSDT to match.
 
 With these patches applied, we get:
 
 [root@dhcp74-172 ~]# ls /sys/bus/pci/slots/1/
 address
 [root@dhcp74-172 ~]# ls /sys/bus/pci/slots/2/
 address
 
 
 
 Michael S. Tsirkin (3):
   acpi: generate and parse mixed asl/aml listing
   acpi: EJ0 method name patching
   acpi: remove _RMV
 
  Makefile |   10 +-
  src/acpi-dsdt.dsl|   96 ---
  src/acpi.c   |   31 ++
  tools/acpi_extract.py|  195
  ++
  tools/acpi_extract_preprocess.py |   37 +++
  5 files changed, 307 insertions(+), 62 deletions(-)
  create mode 100755 tools/acpi_extract.py
  create mode 100755 tools/acpi_extract_preprocess.py
 
 --
 1.7.5.53.gc233e


Re: [PATCH 1/2] virtio-net: Verify page list size before fitting into skb

2011-09-28 Thread Sasha Levin
On Tue, 2011-09-27 at 15:37 +0300, Michael S. Tsirkin wrote:
 On Tue, Sep 27, 2011 at 02:20:29PM +0300, Sasha Levin wrote:
  On Tue, 2011-09-27 at 10:00 +0300, Michael S. Tsirkin wrote:
  skb = page_to_skb(vi, page, len);
  ...
   
   Sorry I don't get it yet. Where is mergeable ignored here?
  
  The NULL deref happens in page_to_skb(), before merge buffers are
  handled.
 
  The len here is a single buffer length, so for mergeable
  buffers it is <= the size of the buffer we gave to hardware,
  which is PAGE_SIZE.
  
  err = virtqueue_add_buf_single(vi->rvq, page_address(page),
 PAGE_SIZE, false, page, gfp);
  
 
 Unless of course we are trying to work around broken hardware again,
 which I don't have a problem with, but should probably
 get appropriate comments in code and trigger a warning.
 
  I'll test it and see if it's really the case.

I've verified it with VIRTIO_NET_F_MRG_RXBUF set, and we still get the
NULL deref.



-- 

Sasha.



Re: [libvirt] Qemu/KVM is 3x slower under libvirt (due to vhost=on)

2011-09-28 Thread Richard W.M. Jones
On Wed, Sep 28, 2011 at 12:19:09PM +0200, Reeted wrote:
 On 09/28/11 11:53, Daniel P. Berrange wrote:
 On Wed, Sep 28, 2011 at 11:49:01AM +0200, Reeted wrote:
 YES!
 It's the vhost. With vhost=on it takes about 12 seconds more time to boot.
 
 ...meaning? :-)
 I've no idea. I was always under the impression that 'vhost=on' was
 the 'make it go much faster' switch. So something is going wrong
 here that I can't explain.
 
 Perhaps one of the network people on this list can explain...
 
 
  To turn vhost off in the libvirt XML, you should be able to use
  <driver name='qemu'/> for the interface in question, e.g.
  
  
   <interface type='user'>
     <mac address='52:54:00:e5:48:58'/>
     <model type='virtio'/>
     <driver name='qemu'/>
   </interface>
 
 
 Ok that seems to work: it removes the vhost part in the virsh launch
 hence cutting down 12secs of boot time.
 
 If nobody comes out with an explanation of why, I will open another
 thread on the kvm list for this. I would probably need to test disk
 performance on vhost=on to see if it degrades or it's for another
 reason that boot time is increased.

Is it using CPU during this time, or is the qemu-kvm process idle?

It wouldn't be the first time that a network option ROM sat around
waiting for an imaginary console user to press a key.
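
A quick way to check that from the host is to watch the per-thread CPU usage
of the QEMU process during the slow phase, e.g. (the process-matching pattern
is only an example):

  top -H -p "$(pgrep -f 'qemu-system-x86_64.*vmname1' | head -n1)"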

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org


Re: [libvirt] Qemu/KVM is 3x slower under libvirt (due to vhost=on)

2011-09-28 Thread Reeted

On 09/28/11 11:53, Daniel P. Berrange wrote:

On Wed, Sep 28, 2011 at 11:49:01AM +0200, Reeted wrote:

YES!
It's the vhost. With vhost=on it takes about 12 seconds more time to boot.

...meaning? :-)

I've no idea. I was always under the impression that 'vhost=on' was
the 'make it go much faster' switch. So something is going wrong
here that I can't explain.

Perhaps one of the network people on this list can explain...


To turn vhost off in the libvirt XML, you should be able to use
<driver name='qemu'/> for the interface in question, e.g.


 <interface type='user'>
   <mac address='52:54:00:e5:48:58'/>
   <model type='virtio'/>
   <driver name='qemu'/>
 </interface>



Ok that seems to work: it removes the vhost part in the virsh launch 
hence cutting down 12secs of boot time.


If nobody comes out with an explanation of why, I will open another 
thread on the kvm list for this. I would probably need to test disk 
performance on vhost=on to see if it degrades or it's for another reason 
that boot time is increased.


Thanks so much for your help Daniel,
Reeted


Re: [Xen-devel] [PATCH 00/10] [PATCH RFC V2] Paravirtualized ticketlocks

2011-09-28 Thread Stephan Diestelhorst
On Tuesday 27 September 2011, 12:44:02 Jeremy Fitzhardinge wrote:
 On 09/27/2011 02:34 AM, Stephan Diestelhorst wrote:
  On Wednesday 14 September 2011, 17:31:32 Jeremy Fitzhardinge wrote:
  This series replaces the existing paravirtualized spinlock mechanism
  with a paravirtualized ticketlock mechanism.
  [...] 
  The unlock code is very straightforward:
 prev = *lock;
 __ticket_unlock_release(lock);
 if (unlikely(__ticket_in_slowpath(lock)))
 __ticket_unlock_slowpath(lock, prev);
 
  which generates:
 push   %rbp
 mov%rsp,%rbp
 
  movzwl (%rdi),%esi
 addb   $0x2,(%rdi)
  movzwl (%rdi),%eax
 testb  $0x1,%ah
 jne1f
 
 pop%rbp
 retq   
 
 ### SLOWPATH START
  1: movzwl (%rdi),%edx
 movzbl %dh,%ecx
 mov%edx,%eax
 and$-2,%ecx # clear TICKET_SLOWPATH_FLAG
 mov%cl,%dh
 cmp%dl,%cl  # test to see if lock is uncontended
 je 3f
 
  2: movzbl %dl,%esi
 callq  *__ticket_unlock_kick# kick anyone waiting
 pop%rbp
 retq   
 
  3: lock cmpxchg %dx,(%rdi) # use cmpxchg to safely write back flag
 jmp2b
 ### SLOWPATH END
  [...]
  Thoughts? Comments? Suggestions?
  You have a nasty data race in your code that can cause a losing
  acquirer to sleep forever, because its setting the TICKET_SLOWPATH flag
  can race with the lock holder releasing the lock.
 
  I used the code for the slow path from the GIT repo.
 
  Let me try to point out an interleaving:
 
  Lock is held by one thread, contains 0x0200.
 
  _Lock holder_   _Acquirer_
  mov$0x200,%eax
  lock xadd %ax,(%rdi)
  // ax:= 0x0200, lock:= 0x0400
  ...
  // this guy spins for a while, reading
  // the lock
  ...
  //trying to free the lock
  movzwl (%rdi),%esi (esi:=0x0400)
  addb   $0x2,(%rdi) (LOCAL copy of lock is now: 0x0402)
  movzwl (%rdi),%eax (local forwarding from previous store: eax := 0x0402)
  testb  $0x1,%ah(no wakeup of anybody)
  jne1f
 
  callq  *__ticket_lock_spinning
...
// __ticket_enter_slowpath(lock)
lock or (%rdi), $0x100
// (global view of lock := 0x0500)
  ...
ACCESS_ONCE(lock->tickets.head) == want
// (reads 0x00)
  ...
xen_poll_irq(irq); // goes to sleep
  ...
  [addb   $0x2,(%rdi)]
  // (becomes globally visible only now! global view of lock := 0x0502)
  ...
 
  Your code is reusing the (just about) safe version of unlocking a
  spinlock without understanding the effect that close has on later
  memory ordering. It may work on CPUs that cannot do narrow -> wide
  store to load forwarding and have to make the addb store visible
  globally. This is an implementation artifact of specific uarches, and
  you mustn't rely on it, since our specified memory model allows looser
  behaviour.
 
 Ah, thanks for this observation.  I've seen this bug before when I
 didn't pay attention to the unlock W vs flag R ordering at all, and I
 was hoping the aliasing would be sufficient - and certainly this seems
 to have been OK on my Intel systems.  But you're saying that it will
 fail on current AMD systems?

I have tested this and have not seen it fail on publicly released AMD
systems. But as I have tried to point out, this does not mean it is
safe to do in software, because future microarchitectures may have more
capable forwarding engines.

 Have you tested this, or is this just from code analysis (which I
 agree with after reviewing the ordering rules in the Intel manual).

We have found a similar issue in Novell's PV ticket lock implementation
during internal product testing.

  Since you want to get that addb out to global memory before the second
  read, either use a LOCK prefix for it, add an MFENCE between addb and
  movzwl, or use a LOCKed instruction that will have a fencing effect
  (e.g., to top-of-stack)between addb and movzwl.
 
 Hm.  I don't really want to do any of those because it will probably
 have a significant effect on the unlock performance; I was really trying
 to avoid adding any more locked instructions.  A previous version of the
 code had an mfence in here, but I hit on the idea of using aliasing to
 get the ordering I want - but overlooked the possible effect of store
 forwarding.

Well, I'd be curious about the actual performance impact. If the store
needs to commit to memory due to aliasing anyways, this would slow down
execution, too. After all it is better to write 

qemu-kvm: Role of flush_icache_range on PPC

2011-09-28 Thread Jan Kiszka
Alex,

we have this diff in qemu-kvm:

diff --git a/exec.c b/exec.c
index c1e045d..f188549 100644
--- a/exec.c
+++ b/exec.c
@@ -3950,6 +3955,11 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, 
uint8_t *buf,
 cpu_physical_memory_set_dirty_flags(
 addr1, (0xff & ~CODE_DIRTY_FLAG));
 }
+   /* qemu doesn't execute guest code directly, but kvm does
+  therefore flush instruction caches */
+   if (kvm_enabled())
+   flush_icache_range((unsigned long)ptr,
+  ((unsigned long)ptr)+l);
 qemu_put_ram_ptr(ptr);
 }
 } else {


flush_icache_range() is doing something only on PPC hosts. So do we need
this upstream?

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: qemu-kvm: Role of flush_icache_range on PPC

2011-09-28 Thread Alexander Graf

On 28.09.2011, at 16:23, Jan Kiszka wrote:

 Alex,
 
 we have this diff in qemu-kvm:
 
 diff --git a/exec.c b/exec.c
 index c1e045d..f188549 100644
 --- a/exec.c
 +++ b/exec.c
 @@ -3950,6 +3955,11 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, 
 uint8_t *buf,
 cpu_physical_memory_set_dirty_flags(
  addr1, (0xff & ~CODE_DIRTY_FLAG));
 }
 + /* qemu doesn't execute guest code directly, but kvm does
 +therefore flush instruction caches */
 + if (kvm_enabled())
 + flush_icache_range((unsigned long)ptr,
 +((unsigned long)ptr)+l);
 qemu_put_ram_ptr(ptr);
 }
 } else {
 
 
 flush_icache_range() is doing something only on PPC hosts. So do we need
 this upstream?

This makes sure that when device emulation overwrites code that is already 
present in the cache of a CPU, it gets flushed from the icache. I'm fairly sure 
we want that :). But let's ask Ben and David as well.


Alex



[PATCH v2 1/2] virtio-net: Verify page list size before fitting into skb

2011-09-28 Thread Sasha Levin
This patch verifies that the length of a buffer stored in a linked list
of pages is small enough to fit into a skb.

If the size is larger than a max size of a skb, it means that we shouldn't
go ahead building skbs anyway since we won't be able to send the buffer as
the user requested.

Cc: Rusty Russell ru...@rustcorp.com.au
Cc: Michael S. Tsirkin m...@redhat.com
Cc: virtualizat...@lists.linux-foundation.org
Cc: net...@vger.kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 drivers/net/virtio_net.c |   13 +
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 0c7321c..bde0dec 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -195,6 +195,19 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
len -= copy;
offset += copy;
 
+   /*
+* Verify that we can indeed put this data into a skb.
+* This is here to handle cases when the device erroneously
+* tries to receive more than is possible. This is usually
+* the case of a broken device.
+*/
+   if (unlikely(len > MAX_SKB_FRAGS * PAGE_SIZE)) {
+   if (net_ratelimit())
+   pr_debug("%s: too much data\n", skb->dev->name);
+   dev_kfree_skb(skb);
+   return NULL;
+   }
+
while (len) {
set_skb_frag(skb, page, offset, len);
page = (struct page *)page->private;
-- 
1.7.6.1



[PATCH v2 2/2] virtio-net: Prevent NULL dereference

2011-09-28 Thread Sasha Levin
This patch prevents a NULL dereference when the user has passed a length
longer than an actual buffer to virtio-net.

Cc: Rusty Russell ru...@rustcorp.com.au
Cc: Michael S. Tsirkin m...@redhat.com
Cc: virtualizat...@lists.linux-foundation.org
Cc: net...@vger.kernel.org
Cc: kvm@vger.kernel.org
Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 drivers/net/virtio_net.c |   12 +++-
 1 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index bde0dec..4a53d2a 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -208,12 +208,22 @@ static struct sk_buff *page_to_skb(struct virtnet_info 
*vi,
return NULL;
}
 
-   while (len) {
+   while (len && page) {
set_skb_frag(skb, page, offset, len);
page = (struct page *)page->private;
offset = 0;
}
 
+   /*
+* This is the case where we ran out of pages in our linked list, but
+* supposedly have more data to read.
+*/ 
+   if (len > 0) {
+   pr_debug("%s: missing data to assemble skb\n", skb->dev->name);
+   dev_kfree_skb(skb);
+   return NULL;
+   }
+
if (page)
give_pages(vi, page);
 
-- 
1.7.6.1



Re: qemu-kvm: Role of flush_icache_range on PPC

2011-09-28 Thread Jan Kiszka

On 2011-09-28 16:26, Alexander Graf wrote:


On 28.09.2011, at 16:23, Jan Kiszka wrote:


Alex,

we have this diff in qemu-kvm:

diff --git a/exec.c b/exec.c
index c1e045d..f188549 100644
--- a/exec.c
+++ b/exec.c
@@ -3950,6 +3955,11 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, 
uint8_t *buf,
 cpu_physical_memory_set_dirty_flags(
  addr1, (0xff & ~CODE_DIRTY_FLAG));
 }
+   /* qemu doesn't execute guest code directly, but kvm does
+  therefore flush instruction caches */
+   if (kvm_enabled())
+   flush_icache_range((unsigned long)ptr,
+  ((unsigned long)ptr)+l);
 qemu_put_ram_ptr(ptr);
 }
 } else {


flush_icache_range() is doing something only on PPC hosts. So do we need
this upstream?


This makes sure that when device emulation overwrites code that is already 
present in the cache of a CPU, it gets flushed from the icache. I'm fairly sure 
we want that :). But let's ask Ben and David as well.


/me wondered which write scenario precisely needs this. It could only be 
something synchronous /wrt to some VCPU. Which operations could trigger 
such a write? Does PPC inject software breakpoints in form of trap 
operations or so?


Mmm, according to our ancient recordings, the hunk above was once 
introduced for the sake of IA64: 9dc99a2823. I skipped it in my removal 
patch as it has some non-IA64 effect, at least potentially.


Jan

--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: qemu-kvm: Role of flush_icache_range on PPC

2011-09-28 Thread Jan Kiszka

On 2011-09-28 16:45, Jan Kiszka wrote:

On 2011-09-28 16:26, Alexander Graf wrote:


On 28.09.2011, at 16:23, Jan Kiszka wrote:


Alex,

we have this diff in qemu-kvm:

diff --git a/exec.c b/exec.c
index c1e045d..f188549 100644
--- a/exec.c
+++ b/exec.c
@@ -3950,6 +3955,11 @@ void cpu_physical_memory_rw(target_phys_addr_t
addr, uint8_t *buf,
cpu_physical_memory_set_dirty_flags(
addr1, (0xff & ~CODE_DIRTY_FLAG));
}
+ /* qemu doesn't execute guest code directly, but kvm does
+ therefore flush instruction caches */
+ if (kvm_enabled())
+ flush_icache_range((unsigned long)ptr,
+ ((unsigned long)ptr)+l);
qemu_put_ram_ptr(ptr);
}
} else {


flush_icache_range() is doing something only on PPC hosts. So do we need
this upstream?


This makes sure that when device emulation overwrites code that is
already present in the cache of a CPU, it gets flushed from the
icache. I'm fairly sure we want that :). But let's ask Ben and David
as well.


/me wondered which write scenario precisely needs this. It could only be
something synchronous /wrt to some VCPU. Which operations could trigger
such a write? Does PPC inject software breakpoints in form of trap
operations or so?

Mmm, according to our ancient recordings, the hunk above was once
introduced for the sake of IA64: 9dc99a2823. I skipped it in my removal
patch as it has some non-IA64 effect, at least potentially.


Correction: It was introduced by 6d3295f7c8, but generalized with 
f067512c06. That former commit talks about DMA operations on IA64 that 
also updates/drops the icache in reality.


Jan

--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


[PATCH] qemu-kvm: Drop compatfd build hack

2011-09-28 Thread Jan Kiszka
No longer required as we use upstream posix-aio-compat now.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 Makefile.objs |3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/Makefile.objs b/Makefile.objs
index cb10816..afd6137 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -28,7 +28,6 @@ block-obj-y += nbd.o block.o aio.o aes.o qemu-config.o 
qemu-progress.o qemu-sock
 block-obj-y += $(coroutine-obj-y)
 block-obj-$(CONFIG_POSIX) += posix-aio-compat.o
 block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
-block-obj-$(CONFIG_POSIX) += compatfd.o
 
 block-nested-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o 
vvfat.o
 block-nested-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o 
qcow2-cache.o
@@ -158,7 +157,7 @@ common-obj-y += $(addprefix ui/, $(ui-obj-y))
 common-obj-$(CONFIG_VNC) += $(addprefix ui/, $(vnc-obj-y))
 
 common-obj-y += iov.o acl.o
-#common-obj-$(CONFIG_POSIX) += compatfd.o
+common-obj-$(CONFIG_POSIX) += compatfd.o
 common-obj-y += notify.o event_notifier.o
 common-obj-y += qemu-timer.o qemu-timer-common.o
 
-- 
1.7.3.4


[PATCH] qemu-kvm: Drop kvm optimization from code_gen_alloc

2011-09-28 Thread Jan Kiszka
We no longer get here if KVM is enabled.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 exec.c |3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/exec.c b/exec.c
index f188549..83ffa39 100644
--- a/exec.c
+++ b/exec.c
@@ -470,9 +470,6 @@ static uint8_t 
static_code_gen_buffer[DEFAULT_CODE_GEN_BUFFER_SIZE]
 
 static void code_gen_alloc(unsigned long tb_size)
 {
-if (kvm_enabled())
-return;
-
 #ifdef USE_STATIC_CODE_GEN_BUFFER
 code_gen_buffer = static_code_gen_buffer;
 code_gen_buffer_size = DEFAULT_CODE_GEN_BUFFER_SIZE;
-- 
1.7.3.4


Re: [libvirt] Qemu/KVM is 3x slower under libvirt (due to vhost=on)

2011-09-28 Thread Reeted

On 09/28/11 14:56, Richard W.M. Jones wrote:

On Wed, Sep 28, 2011 at 12:19:09PM +0200, Reeted wrote:

Ok that seems to work: it removes the vhost part in the virsh launch
hence cutting down 12secs of boot time.

If nobody comes out with an explanation of why, I will open another
thread on the kvm list for this. I would probably need to test disk
performance on vhost=on to see if it degrades or it's for another
reason that boot time is increased.

Is it using CPU during this time, or is the qemu-kvm process idle?

It wouldn't be the first time that a network option ROM sat around
waiting for an imaginary console user to press a key.

Rich.


Of the two qemu-kvm processes (threads?) which I see consuming CPU for 
that VM, one is at about 20%, the other at about 10%. I think it's doing 
something but maybe not much, or maybe it's really I/O bound and the I/O 
is slow (as I originally thought). I will perform some disk benchmarks 
and follow up, but I can't do that right now...
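
For the disk side, even a crude in-guest read test repeated with vhost=on and
vhost=off runs would be telling, something like (assuming the virtio disk shows
up as /dev/vda; dd is only a rough indicator):

  sync; echo 3 > /proc/sys/vm/drop_caches
  dd if=/dev/vda of=/dev/null bs=1M count=1024 iflag=direct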

Thank you


[PATCH] qemu-kvm: Drop redundant cpuid filtering from cpu_x86_cpuid

2011-09-28 Thread Jan Kiszka
This is already done in kvm_arch_init_vcpu based on the kernel-reported
supported features.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---

I'm not 100% sure, so please double-check.

 target-i386/cpuid.c |   22 --
 1 files changed, 0 insertions(+), 22 deletions(-)

diff --git a/target-i386/cpuid.c b/target-i386/cpuid.c
index f17..1e8bcff 100644
--- a/target-i386/cpuid.c
+++ b/target-i386/cpuid.c
@@ -1232,28 +1232,6 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
uint32_t count,
 *ecx |= 1 << 1;    /* CmpLegacy bit */
 }
 }
-if (kvm_enabled()) {
-uint32_t h_eax, h_edx;
-
-host_cpuid(index, 0, &h_eax, NULL, NULL, &h_edx);
-
-/* disable CPU features that the host does not support */
-
-/* long mode */
-if ((h_edx & 0x20000000) == 0 /* || !lm_capable_kernel */)
-*edx &= ~0x20000000;
-/* syscall */
-if ((h_edx & 0x00000800) == 0)
-*edx &= ~0x00000800;
-/* nx */
-if ((h_edx & 0x00100000) == 0)
-*edx &= ~0x00100000;
-
-/* disable CPU features that KVM cannot support */
-
-/* 3dnow */
-*edx &= ~0xc0000000;
-}
 break;
 case 0x8002:
 case 0x8003:
-- 
1.7.3.4


Re: qemu-kvm: Role of flush_icache_range on PPC

2011-09-28 Thread Alexander Graf

Am 28.09.2011 um 16:49 schrieb Jan Kiszka jan.kis...@siemens.com:

 On 2011-09-28 16:45, Jan Kiszka wrote:
 On 2011-09-28 16:26, Alexander Graf wrote:
 
 On 28.09.2011, at 16:23, Jan Kiszka wrote:
 
 Alex,
 
 we have this diff in qemu-kvm:
 
 diff --git a/exec.c b/exec.c
 index c1e045d..f188549 100644
 --- a/exec.c
 +++ b/exec.c
 @@ -3950,6 +3955,11 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
 cpu_physical_memory_set_dirty_flags(
 addr1, (0xff & ~CODE_DIRTY_FLAG));
 }
 + /* qemu doesn't execute guest code directly, but kvm does
 + therefore flush instruction caches */
 + if (kvm_enabled())
 + flush_icache_range((unsigned long)ptr,
 + ((unsigned long)ptr)+l);
 qemu_put_ram_ptr(ptr);
 }
 } else {
 
 
 flush_icache_range() is doing something only on PPC hosts. So do we need
 this upstream?
 
 This makes sure that when device emulation overwrites code that is
 already present in the cache of a CPU, it gets flushed from the
 icache. I'm fairly sure we want that :). But let's ask Ben and David
 as well.
 
 /me wondered which write scenario precisely needs this. It could only be
 something synchronous /wrt to some VCPU. Which operations could trigger
 such a write? Does PPC inject software breakpoints in form of trap
 operations or so?
 
 Mmm, according to our ancient recordings, the hunk above was once
 introduced for the sake of IA64: 9dc99a2823. I skipped it in my removal
 patch as it has some non-IA64 effect, at least potentially.
 
 Correction: It was introduced by 6d3295f7c8, but generalized with f067512c06.
 The former commit talks about DMA operations on IA64 that also update/drop
 the icache in reality.

Yeah I remember discussions around the topic. IIUC DMA invalidates the cache 
lines of the CPUs in the system for the region it's writing to. At least 
potentially. But again, I'll leave this to the IBM guys to answer :). They know 
best how their hardware works.

Alex

 


qemu-kvm: Remaining relevant bits in kvm/

2011-09-28 Thread Jan Kiszka

Hi all,

in order to reduce the diff between qemu-kvm.git and upstream, getting 
rid of the kvm subdirectory would be nice. What bits there are actually 
still in use today?


Off the top of my head, I only remember running kvm_stat and 
scripts/vmxcap from time to time. Obviously, there is nothing actively 
built anymore during normal qemu production. So what is the role of, e.g.,


 - kvm/bios
 - kvm/extboot
 - kvm/vgabios
 - kvm/scripts

today? Anything there we want to push upstream? And then drop the rest?

Jan

--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [Xen-devel] [PATCH 00/10] [PATCH RFC V2] Paravirtualized ticketlocks

2011-09-28 Thread Linus Torvalds
On Tue, Sep 27, 2011 at 9:44 AM, Jeremy Fitzhardinge jer...@goop.org wrote:

 I guess it comes down to throwing myself on the efficiency of some kind
 of fence instruction.  I guess an lfence would be sufficient; is that
 any more efficient than a full mfence?  At least I can make it so that
 its only present when pv ticket locks are actually in use, so it won't
 affect the native case.

Please don't play with fences, just do the final addb as a locked instruction.

In fact, don't even use an addb, this whole thing is disgusting:

  movzwl (%rdi),%esi (esi:=0x0400)
  addb   $0x2,(%rdi) (LOCAL copy of lock is now: 0x0402)
  movzwl (%rdi),%eax (local forwarding from previous store: eax := 0x0402)

just use lock xaddw there too.

The fact that the PV unlock is going to be much more expensive than a
regular native unlock is just a fact of life. It comes from
fundamentally caring about the old/new value, and has nothing to do
with aliasing. You care about the other bits, and it doesn't matter
where in memory they are.

The native unlock can do a simple addb (or incb), but that doesn't
mean the PV unlock can. There are no ordering issues with the final
unlock in the native case, because the native unlock is like the honey
badger: it don't care. It only cares that the store make it out *some*
day, but it doesn't care about what order the upper/lower bits get
updated. You do. So you have to use a locked access.

Good catch by Stephan.

 Linus
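
To make the above concrete, here is a minimal sketch of the layout under
discussion: a 16-bit ticket word with the head in the low byte and the tail
in the high byte, a plain byte add for the native unlock, and a locked xadd
that returns the whole old word for the paravirt case. The names are
illustrative only, not the kernel's actual definitions:

    #include <stdint.h>

    typedef union {
        uint16_t head_tail;
        struct {
            uint8_t head;   /* advanced by unlock */
            uint8_t tail;   /* advanced by lock; may carry a slowpath flag */
        } tickets;
    } example_spinlock_t;

    /* Native unlock: nobody cares in which order the two bytes become
     * visible, so a plain (unlocked) byte add on the head is enough. */
    static inline void example_native_unlock(example_spinlock_t *lock)
    {
        asm volatile("addb $1, %0"
                     : "+m" (lock->tickets.head)
                     :
                     : "memory");
    }

    /* PV unlock: the old head/tail pair must be obtained atomically with
     * the increment, so the whole word is updated with a locked xadd. */
    static inline uint16_t example_pv_unlock_fetch(example_spinlock_t *lock,
                                                   uint16_t inc)
    {
        asm volatile("lock; xaddw %0, %1"
                     : "+r" (inc), "+m" (lock->head_tail)
                     :
                     : "memory");
        return inc;   /* previous head_tail value */
    }

The locked xadd is what gives the unlock path the property described above:
the old value of both bytes is obtained atomically with the increment.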


Re: [Xen-devel] [PATCH 00/10] [PATCH RFC V2] Paravirtualized ticketlocks

2011-09-28 Thread Jan Beulich
 On 28.09.11 at 17:38, Linus Torvalds torva...@linux-foundation.org wrote:
 On Tue, Sep 27, 2011 at 9:44 AM, Jeremy Fitzhardinge jer...@goop.org wrote:

 I guess it comes down to throwing myself on the efficiency of some kind
 of fence instruction.  I guess an lfence would be sufficient; is that
 any more efficient than a full mfence?  At least I can make it so that
 its only present when pv ticket locks are actually in use, so it won't
 affect the native case.
 
 Please don't play with fences, just do the final addb as a locked 
 instruction.
 
 In fact, don't even use an addb, this whole thing is disgusting:
 
   movzwl (%rdi),%esi (esi:=0x0400)
   addb   $0x2,(%rdi) (LOCAL copy of lock is now: 0x0402)
   movzwl (%rdi),%eax (local forwarding from previous store: eax := 0x0402)
 
 just use lock xaddw there too.

I'm afraid that's not possible, as that might carry from the low 8 bits
into the upper 8 ones, which must be avoided.

Jan



Re: [Xen-devel] [PATCH 00/10] [PATCH RFC V2] Paravirtualized ticketlocks

2011-09-28 Thread Linus Torvalds
On Wed, Sep 28, 2011 at 8:55 AM, Jan Beulich jbeul...@suse.com wrote:

 just use lock xaddw there too.

 I'm afraid that's not possible, as that might carry from the low 8 bits
 into the upper 8 ones, which must be avoided.

Oh damn, you're right. So I guess the right way to do things is with
cmpxchg, but some nasty mfence setup could do it too.

  Linus


Re: [Xen-devel] [PATCH 00/10] [PATCH RFC V2] Paravirtualized ticketlocks

2011-09-28 Thread Jeremy Fitzhardinge
On 09/28/2011 09:10 AM, Linus Torvalds wrote:
 On Wed, Sep 28, 2011 at 8:55 AM, Jan Beulich jbeul...@suse.com wrote:
 just use lock xaddw there too.
 I'm afraid that's not possible, as that might carry from the low 8 bits
 into the upper 8 ones, which must be avoided.
 Oh damn, you're right. So I guess the right way to do things is with
 cmpxchg, but some nasty mfence setup could do it too.

Could do something like:

        if (ticket->head >= 254)
                prev = xadd(&ticket->head_tail, 0xff02);
        else
                prev = xadd(&ticket->head_tail, 0x0002);

to compensate for the overflow.

J
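
A quick worked example of why the 0xff02 constant compensates for the carry,
assuming the head/tail byte layout discussed above (head in the low byte,
tail in the high byte of a 16-bit word):

    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        /* tail = 0x05, head = 254: the next unlock would carry into the tail. */
        uint16_t head_tail = 0x05fe;

        /* 0xff02 == 0x0002 - 0x0100 (mod 2^16): add 2 to the head and
         * pre-subtract the carry that would spill into the tail byte. */
        head_tail += 0xff02;

        assert((head_tail & 0xff) == 0);      /* head wrapped from 254 to 0 */
        assert((head_tail >> 8) == 0x05);     /* tail is unchanged */
        return 0;
    }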


Re: [Xen-devel] [PATCH 00/10] [PATCH RFC V2] Paravirtualized ticketlocks

2011-09-28 Thread Jeremy Fitzhardinge
On 09/28/2011 06:58 AM, Stephan Diestelhorst wrote:
 I have tested this and have not seen it fail on publicly released AMD
 systems. But as I have tried to point out, this does not mean it is
 safe to do in software, because future microarchtectures may have more
 capable forwarding engines.

Sure.

 Have you tested this, or is this just from code analysis (which I
 agree with after reviewing the ordering rules in the Intel manual).
 We have found a similar issue in Novell's PV ticket lock implementation
 during internal product testing.

Jan may have picked it up from an earlier set of my patches.

 Since you want to get that addb out to global memory before the second
 read, either use a LOCK prefix for it, add an MFENCE between addb and
 movzwl, or use a LOCKed instruction that will have a fencing effect
  (e.g., to top-of-stack) between addb and movzwl.
 Hm.  I don't really want to do any of those because it will probably
 have a significant effect on the unlock performance; I was really trying
 to avoid adding any more locked instructions.  A previous version of the
 code had an mfence in here, but I hit on the idea of using aliasing to
 get the ordering I want - but overlooked the possible effect of store
 forwarding.
 Well, I'd be curious about the actual performance impact. If the store
 needs to commit to memory due to aliasing anyways, this would slow down
 execution, too. After all it is better to write working than fast code,
 no? ;-)

Rule of thumb is that AMD tends to do things like lock and fence more
efficiently than Intel - at least historically.  I don't know if that's
still true for current Intel microarchitectures.

 I guess it comes down to throwing myself on the efficiency of some kind
 of fence instruction.  I guess an lfence would be sufficient; is that
 any more efficient than a full mfence?
 An lfence should not be sufficient, since that essentially is a NOP on
 WB memory. You really want a full fence here, since the store needs to
 be published before reading the lock with the next load.

The Intel manual reads:

Reads cannot pass earlier LFENCE and MFENCE instructions.
Writes cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.
LFENCE instructions cannot pass earlier reads.

Which I interpreted as meaning that an lfence would prevent forwarding. 
But I guess it doesn't say lfence instructions cannot pass earlier
writes, which means that the lfence could logically happen before the
write, thereby allowing forwarding?  Or should I be reading this some
other way?

 Could you give me a pointer to AMD's description of the ordering rules?
 They should be in AMD64 Architecture Programmer's Manual Volume 2:
 System Programming, Section 7.2 Multiprocessor Memory Access Ordering.

 http://developer.amd.com/documentation/guides/pages/default.aspx#manuals

 Let me know if you have some clarifying suggestions. We are currently
 revising these documents...

I find the English descriptions of these kinds of things frustrating to
read because of ambiguities in the precise meaning of words like pass,
ahead, behind in these contexts.  I find the prose useful to get an
overview, but when I have a specific question I wonder if something more
formal would be useful.
I guess it's implied that anything that is not prohibited by the
ordering rules is allowed, but it wouldn't hurt to say it explicitly.
That said, the AMD description seems clearer and more explicit than the
Intel manual (esp since it specifically discusses the problem here).

Thanks,
J
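
For comparison, the store-then-fence variant being weighed here looks roughly
like the following sketch (illustrative names; head and tail stand for the two
bytes of the ticket word):

    #include <stdint.h>

    /* Sketch: publish the head increment with a full fence before sampling
     * the tail, instead of making the increment itself a locked RMW. */
    static inline uint8_t example_unlock_then_check(volatile uint8_t *head,
                                                    const volatile uint8_t *tail)
    {
        *head += 1;                           /* plain store of the new head */
        asm volatile("mfence" ::: "memory");  /* store must reach memory before
                                                 the load below */
        return *tail;                         /* now safe to look at the tail */
    }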


Re: [Xen-devel] [PATCH 00/10] [PATCH RFC V2] Paravirtualized ticketlocks

2011-09-28 Thread Linus Torvalds
On Wed, Sep 28, 2011 at 9:47 AM, Jeremy Fitzhardinge jer...@goop.org wrote:

 Could do something like:

         if (ticket->head >= 254)
                 prev = xadd(&ticket->head_tail, 0xff02);
         else
                 prev = xadd(&ticket->head_tail, 0x0002);

 to compensate for the overflow.

Oh wow. You have an even more twisted mind than I do.

I guess that will work, exactly because we control head and thus can
know about the overflow in the low byte. But boy is that ugly ;)

But at least you wouldn't need to do the loop with cmpxchg. So it's
twisted and ugly, but might be practical.

   Linus


Re: qemu-kvm: Role of flush_icache_range on PPC

2011-09-28 Thread Scott Wood
On 09/28/2011 09:45 AM, Jan Kiszka wrote:
 On 2011-09-28 16:26, Alexander Graf wrote:

 On 28.09.2011, at 16:23, Jan Kiszka wrote:

 Alex,

 we have this diff in qemu-kvm:

 diff --git a/exec.c b/exec.c
 index c1e045d..f188549 100644
 --- a/exec.c
 +++ b/exec.c
 @@ -3950,6 +3955,11 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
  cpu_physical_memory_set_dirty_flags(
  addr1, (0xff & ~CODE_DIRTY_FLAG));
  }
 +/* qemu doesn't execute guest code directly, but kvm does
 +   therefore flush instruction caches */
 +if (kvm_enabled())
 +flush_icache_range((unsigned long)ptr,
 +   ((unsigned long)ptr)+l);
  qemu_put_ram_ptr(ptr);
  }
  } else {

Been meaning to send a poke about this one:

http://patchwork.ozlabs.org/patch/90403/

I've seen issues with this when the executable images are initially
loaded by qemu.

 flush_icache_range() is doing something only on PPC hosts. So do we need
 this upstream?

 This makes sure that when device emulation overwrites code that is
 already present in the cache of a CPU, it gets flushed from the
 icache. I'm fairly sure we want that :). But let's ask Ben and David
 as well.
 
 /me wondered which write scenario precisely needs this. It could only be
 something synchronous /wrt to some VCPU.

Why would it need to be synchronous?  Even if it's asynchronous emulated
DMA, we don't want it sitting around only in a data cache that
instruction fetches won't snoop.

 Which operations could trigger
 such a write? Does PPC inject software breakpoints in form of trap
 operations or so?

It's not implemented yet in mainline for powerpc (we have something
internal that is on the backlog of things to be cleaned up and sent
out), but this is what we'd do for kvm_arch_insert_sw_breakpoint().

-Scott



Re: [Xen-devel] [PATCH 00/10] [PATCH RFC V2] Paravirtualized ticketlocks

2011-09-28 Thread H. Peter Anvin
On 09/28/2011 10:22 AM, Linus Torvalds wrote:
 On Wed, Sep 28, 2011 at 9:47 AM, Jeremy Fitzhardinge jer...@goop.org wrote:

 Could do something like:

if (ticket->head >= 254)
        prev = xadd(&ticket->head_tail, 0xff02);
else
        prev = xadd(&ticket->head_tail, 0x0002);

 to compensate for the overflow.
 
 Oh wow. You have an even more twisted mind than I do.
 
 I guess that will work, exactly because we control head and thus can
 know about the overflow in the low byte. But boy is that ugly ;)
 
 But at least you wouldn't need to do the loop with cmpxchg. So it's
 twisted and ugly, but might be practical.
 

I suspect it should be coded as -254 in order to use a short immediate
if that is even possible...

-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



Re: [Xen-devel] [PATCH 00/10] [PATCH RFC V2] Paravirtualized ticketlocks

2011-09-28 Thread Jeremy Fitzhardinge
On 09/28/2011 10:24 AM, H. Peter Anvin wrote:
 On 09/28/2011 10:22 AM, Linus Torvalds wrote:
 On Wed, Sep 28, 2011 at 9:47 AM, Jeremy Fitzhardinge jer...@goop.org wrote:
 Could do something like:

if (ticket->head >= 254)
        prev = xadd(&ticket->head_tail, 0xff02);
else
        prev = xadd(&ticket->head_tail, 0x0002);

 to compensate for the overflow.
 Oh wow. You have an even more twisted mind than I do.

 I guess that will work, exactly because we control head and thus can
 know about the overflow in the low byte. But boy is that ugly ;)

 But at least you wouldn't need to do the loop with cmpxchg. So it's
 twisted and ugly, but might be practical.

 I suspect it should be coded as -254 in order to use a short immediate
 if that is even possible...

I'm about to test:

static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
{
        if (TICKET_SLOWPATH_FLAG &&
            unlikely(arch_static_branch(&paravirt_ticketlocks_enabled))) {
                arch_spinlock_t prev;
                __ticketpair_t inc = TICKET_LOCK_INC;

                if (lock->tickets.head >= (1 << TICKET_SHIFT) - TICKET_LOCK_INC)
                        inc += -1 << TICKET_SHIFT;

                prev.head_tail = xadd(&lock->head_tail, inc);

                if (prev.tickets.tail & TICKET_SLOWPATH_FLAG)
                        __ticket_unlock_slowpath(lock, prev);
        } else
                __ticket_unlock_release(lock);
}

Which, frankly, is not something I particularly want to put my name to.

It makes gcc go into paroxysms of trickiness:

 4a8:   80 3f fe                cmpb   $0xfe,(%rdi)
 4ab:   19 f6                   sbb    %esi,%esi
 4ad:   66 81 e6 00 01          and    $0x100,%si
 4b2:   66 81 ee fe 00          sub    $0xfe,%si
 4b7:   f0 66 0f c1 37          lock xadd %si,(%rdi)

...which is pretty neat, actually.

J


Re: [Xen-devel] [PATCH 00/10] [PATCH RFC V2] Paravirtualized ticketlocks

2011-09-28 Thread Stephan Diestelhorst
On Wednesday 28 September 2011 19:50:08 Jeremy Fitzhardinge wrote:
 On 09/28/2011 10:24 AM, H. Peter Anvin wrote:
  On 09/28/2011 10:22 AM, Linus Torvalds wrote:
  On Wed, Sep 28, 2011 at 9:47 AM, Jeremy Fitzhardinge jer...@goop.org 
  wrote:
  Could do something like:
 
  if (ticket->head >= 254)
          prev = xadd(&ticket->head_tail, 0xff02);
  else
          prev = xadd(&ticket->head_tail, 0x0002);
  
   to compensate for the overflow.
   Oh wow. You have an even more twisted mind than I do.
  
   I guess that will work, exactly because we control head and thus can
   know about the overflow in the low byte. But boy is that ugly ;)
  
   But at least you wouldn't need to do the loop with cmpxchg. So it's
   twisted and ugly, but might be practical.
 
  I suspect it should be coded as -254 in order to use a short immediate
  if that is even possible...
 
 I'm about to test:
 
 static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
   if (TICKET_SLOWPATH_FLAG  
 unlikely(arch_static_branch(paravirt_ticketlocks_enabled))) {
   arch_spinlock_t prev;
   __ticketpair_t inc = TICKET_LOCK_INC;
 
   if (lock-tickets.head = (1  TICKET_SHIFT) - TICKET_LOCK_INC)
   inc += -1  TICKET_SHIFT;
 
   prev.head_tail = xadd(lock-head_tail, inc);
 
   if (prev.tickets.tail  TICKET_SLOWPATH_FLAG)
   __ticket_unlock_slowpath(lock, prev);
   } else
   __ticket_unlock_release(lock);
 }
 
 Which, frankly, is not something I particularly want to put my name to.

I must have missed the part when this turned into the propose-the-
craziest-way-that-this-still-works contest :)

What is wrong with converting the original addb into a lock addb? The
crazy wrap around tricks add a conditional and lots of headache. The
lock addb/w is clean. We are paying an atomic in both cases, so I just
don't see the benefit of the second solution.

Stephan
-- 
Stephan Diestelhorst, AMD Operating System Research Center
stephan.diestelho...@amd.com, Tel. +49 (0)351 448 356 719

Advanced Micro Devices GmbH
Einsteinring 24
85609 Aschheim
Germany

Geschaeftsfuehrer: Alberto Bozzo;
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632, WEEE-Reg-Nr: DE 12919551 
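
The locked-add alternative Stephan describes can be sketched as below; the
point is that the head increment itself becomes a locked RMW, so a later read
of the tail cannot be satisfied by store-to-load forwarding ahead of it
(illustrative code only, not the proposed patch):

    #include <stdint.h>

    static inline void example_unlock_locked_add(volatile uint8_t *head)
    {
        /* lock addb: the increment is an atomic, globally ordered RMW. */
        asm volatile("lock; addb $1, %0"
                     : "+m" (*head)
                     :
                     : "memory");
    }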



Re: [Xen-devel] [PATCH 00/10] [PATCH RFC V2] Paravirtualized ticketlocks

2011-09-28 Thread Stephan Diestelhorst
On Wednesday 28 September 2011 18:44:25 Jeremy Fitzhardinge wrote:
 On 09/28/2011 06:58 AM, Stephan Diestelhorst wrote:
  I guess it comes down to throwing myself on the efficiency of some kind
  of fence instruction.  I guess an lfence would be sufficient; is that
  any more efficient than a full mfence?
  An lfence should not be sufficient, since that essentially is a NOP on
  WB memory. You really want a full fence here, since the store needs to
  be published before reading the lock with the next load.
 
 The Intel manual reads:
 
 Reads cannot pass earlier LFENCE and MFENCE instructions.
 Writes cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.
 LFENCE instructions cannot pass earlier reads.
 
 Which I interpreted as meaning that an lfence would prevent forwarding. 
 But I guess it doesn't say lfence instructions cannot pass earlier
 writes, which means that the lfence could logically happen before the
 write, thereby allowing forwarding?  Or should I be reading this some
 other way?

Indeed. You are reading this the right way. 

  Could you give me a pointer to AMD's description of the ordering rules?
  They should be in AMD64 Architecture Programmer's Manual Volume 2:
  System Programming, Section 7.2 Multiprocessor Memory Access Ordering.
 
  http://developer.amd.com/documentation/guides/pages/default.aspx#manuals
 
  Let me know if you have some clarifying suggestions. We are currently
  revising these documents...
 
 I find the English descriptions of these kinds of things frustrating to
 read because of ambiguities in the precise meaning of words like pass,
 ahead, behind in these contexts.  I find the prose useful to get an
 overview, but when I have a specific question I wonder if something more
 formal would be useful.

It would be, and some have started this effort:

http://www.cl.cam.ac.uk/~pes20/weakmemory/

But I am not sure whether that particular nasty forwarding case is
captured properly in their model. It is on my list of things to check.

 I guess it's implied that anything that is not prohibited by the
 ordering rules is allowed, but it wouldn't hurt to say it explicitly.
 That said, the AMD description seems clearer and more explicit than the
 Intel manual (esp since it specifically discusses the problem here).

Thanks! Glad you like it :)

Stephan
-- 
Stephan Diestelhorst, AMD Operating System Research Center
stephan.diestelho...@amd.com, Tel. +49 (0)351 448 356 719

Advanced Micro Devices GmbH
Einsteinring 24
85609 Aschheim
Germany

Geschaeftsfuehrer: Alberto Bozzo;
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632, WEEE-Reg-Nr: DE 12919551 



Re: [Xen-devel] [PATCH 00/10] [PATCH RFC V2] Paravirtualized ticketlocks

2011-09-28 Thread Jeremy Fitzhardinge
On 09/28/2011 11:08 AM, Stephan Diestelhorst wrote:
 On Wednesday 28 September 2011 19:50:08 Jeremy Fitzhardinge wrote:
 On 09/28/2011 10:24 AM, H. Peter Anvin wrote:
 On 09/28/2011 10:22 AM, Linus Torvalds wrote:
 On Wed, Sep 28, 2011 at 9:47 AM, Jeremy Fitzhardinge jer...@goop.org 
 wrote:
 Could do something like:

 if (ticket->head >= 254)
         prev = xadd(&ticket->head_tail, 0xff02);
 else
         prev = xadd(&ticket->head_tail, 0x0002);

  to compensate for the overflow.
  Oh wow. You have an even more twisted mind than I do.

  I guess that will work, exactly because we control head and thus can
  know about the overflow in the low byte. But boy is that ugly ;)

  But at least you wouldn't need to do the loop with cmpxchg. So it's
  twisted and ugly, but might be practical.

 I suspect it should be coded as -254 in order to use a short immediate
 if that is even possible...
 I'm about to test:

 static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
         if (TICKET_SLOWPATH_FLAG &&
             unlikely(arch_static_branch(&paravirt_ticketlocks_enabled))) {
                 arch_spinlock_t prev;
                 __ticketpair_t inc = TICKET_LOCK_INC;
 
                 if (lock->tickets.head >= (1 << TICKET_SHIFT) - TICKET_LOCK_INC)
                         inc += -1 << TICKET_SHIFT;
 
                 prev.head_tail = xadd(&lock->head_tail, inc);
 
                 if (prev.tickets.tail & TICKET_SLOWPATH_FLAG)
                         __ticket_unlock_slowpath(lock, prev);
         } else
                 __ticket_unlock_release(lock);
 }

 Which, frankly, is not something I particularly want to put my name to.
 I must have missed the part when this turned into the propose-the-
 craziest-way-that-this-still-works.contest :)

 What is wrong with converting the original addb into a lock addb? The
 crazy wrap around tricks add a conditional and lots of headache. The
 lock addb/w is clean. We are paying an atomic in both cases, so I just
 don't see the benefit of the second solution.

Well, it does end up generating surprisingly nice code.  And to be
honest, being able to do the unlock and atomically fetch the flag as one
operation makes it much easier to reason about.

I'll do a locked add variant as well to see how it turns out.

Do you think locked add is better than unlocked + mfence?

Thanks,
J


Re: [PATCH] virtio-net: Read MAC only after initializing MSI-X

2011-09-28 Thread Sasha Levin
On Mon, 2011-09-19 at 17:19 +0930, Rusty Russell wrote:
 On Mon, 19 Sep 2011 09:01:50 +0300, Michael S. Tsirkin m...@redhat.com 
 wrote:
  On Mon, Sep 19, 2011 at 01:05:17PM +0930, Rusty Russell wrote:
   On Sat, 20 Aug 2011 23:00:44 +0300, Michael S. Tsirkin 
   m...@redhat.com wrote:
On Fri, Aug 19, 2011 at 07:33:07PM +0300, Sasha Levin wrote:
 Maybe this is better solved by copying the way it was done in PCI 
 itself
 with capability linked list?

There are any number of ways to lay out the structure.  I went for what
seemed a simplest one.  For MSI-X the train has left the station.  We
can probably still tweak where the high 32 bit features
for 64 bit features are.  No idea if it's worth it.
   
   Sorry, this has been in the back of my mind.  I think it's a good idea;
   can we use the capability linked list for pre-device specific stuff from
   now on?
   
   Thanks,
   Rusty.
  
  Do we even want capability bits then?
  We can give each capability an ack flag ...
 
 We could have, and if I'd known PCI when I designed virtio I might have.
 
 But it's not easy now to map structure offsets to that scheme, and we
 can't really force such a change on the non-PCI users.  So I'd say we
 should only do it for the non-device-specific options.  ie. we'll still
 have the MSI-X case move the device-specific config, but we'll use a
 linked list from now on, eg. for the next 32 features bits...
 
 Thoughts?
 Rusty.

What if we create a capability list but place it in the virtio-pci
config space instead of the PCI space?

It'll work fine with non-PCI users and would leave MSI-X as the only
thing that changes offsets (and we could probably deprecate and remove
it at some point in the future).

-- 

Sasha.
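
For readers unfamiliar with the PCI model being borrowed: a capability linked
list is just a chain of self-describing entries, each pointing at the next. A
rough sketch of what such an entry could look like if placed in virtio config
space (purely hypothetical layout, not anything from the virtio spec or these
patches):

    #include <stdint.h>

    struct example_virtio_cap {
        uint8_t id;     /* what this capability describes */
        uint8_t next;   /* offset of the next capability; 0 ends the list */
        uint8_t len;    /* total length of this entry */
        /* capability-specific payload follows */
    };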



Re: [Xen-devel] [PATCH 00/10] [PATCH RFC V2] Paravirtualized ticketlocks

2011-09-28 Thread Linus Torvalds
On Wed, Sep 28, 2011 at 11:08 AM, Stephan Diestelhorst
stephan.diestelho...@amd.com wrote:

 I must have missed the part when this turned into the propose-the-
 craziest-way-that-this-still-works.contest :)

So doing it just with the lock addb probably works fine, but I have
to say that I personally shudder at the "surround the locked addb by
reads from the word, in order to approximate an atomic read of the
upper bits".

Because what you get is not really an atomic read of the upper bits,
it's a "ok, we'll get the worst case of somebody modifying the upper
bits at the same time".

Which certainly should *work*, but from a conceptual standpoint, isn't
it just *much* nicer to say "we actually know *exactly* what the upper
bits were"?

But I don't care all *that* deeply. I do agree that the xaddw trick is
pretty tricky. I just happen to think that it's actually *less* tricky
than read the upper bits separately and depend on subtle ordering
issues with another writer that happens at the same time on another
CPU.

So I can live with either form - as long as it works. I think it might
be easier to argue that the xaddw is guaranteed to work, because all
values at all points are unarguably atomic (yeah, we read the lower
bits nonatomically, but as the owner of the lock we know that nobody
else can write them).

 Linus


Re: [Xen-devel] [PATCH 00/10] [PATCH RFC V2] Paravirtualized ticketlocks

2011-09-28 Thread Jeremy Fitzhardinge
On 09/28/2011 11:49 AM, Linus Torvalds wrote:
 But I don't care all *that* deeply. I do agree that the xaddw trick is
 pretty tricky. I just happen to think that it's actually *less* tricky
 than read the upper bits separately and depend on subtle ordering
 issues with another writer that happens at the same time on another
 CPU.

 So I can live with either form - as long as it works. I think it might
 be easier to argue that the xaddw is guaranteed to work, because all
 values at all points are unarguably atomic (yeah, we read the lower
 bits nonatomically, but as the owner of the lock we know that nobody
 else can write them).

Exactly.  I just did a locked add variant, and while the code looks a
little simpler, it definitely has more actual complexity to analyze.

J


Re: qemu-kvm: Role of flush_icache_range on PPC

2011-09-28 Thread Benjamin Herrenschmidt
On Wed, 2011-09-28 at 16:26 +0200, Alexander Graf wrote:
 
 This makes sure that when device emulation overwrites code that is
 already present in the cache of a CPU, it gets flushed from the
 icache. I'm fairly sure we want that :). But let's ask Ben and David
 as well.

Hrm we don't need that. DMA doesn't flush the icache on power. The
kernel will take care of it if necessary.

The only case you do need it is when doing the initial load of the
kernel or SLOF image before you execute it.

Cheers,
Ben.



Re: qemu-kvm: Role of flush_icache_range on PPC

2011-09-28 Thread Benjamin Herrenschmidt
On Wed, 2011-09-28 at 12:27 -0500, Scott Wood wrote:

 Why would it need to be synchronous?  Even if it's asynchronous emulated
 DMA, we don't want it sitting around only in a data cache that
 instruction fetches won't snoop.

Except that this is exactly what happens on real HW :-)

The guest will do the necessary invalidations. DMA doesn't keep the
icache coherent on HW, why should it on kvm/qemu ?

 It's not implemented yet in mainline for powerpc (we have something
 internal that is on the backlog of things to be cleaned up and sent
 out), but this is what we'd do for kvm_arch_insert_sw_breakpoint().

Yes, breakpoints do need a flush, as does the initial program load.

Cheers,
Ben.



Re: qemu-kvm: Role of flush_icache_range on PPC

2011-09-28 Thread Scott Wood
On 09/28/2011 04:02 PM, Benjamin Herrenschmidt wrote:
 On Wed, 2011-09-28 at 12:27 -0500, Scott Wood wrote:
 
 Why would it need to be synchronous?  Even if it's asynchronous emulated
 DMA, we don't want it sitting around only in a data cache that
 instruction fetches won't snoop.
 
 Except that this is exactly what happens on real HW :-)

DMA does not normally go straight to data cache, at least on hardware
I'm familiar with.

 The guest will do the necessary invalidations. DMA doesn't keep the
 icache coherent on HW, why should it on kvm/qemu ?

Sure, if there might be stale stuff in the icache, the guest will need
to invalidate that.  But when running on real hardware, an OS does not
need to flush it out of data cache after a DMA transaction[1].  So
technically we just want a flush_dcache_range() for DMA.

It's moot unless we can distinguish DMA writes from breakpoint writes,
though.

-Scott

[1] Most OSes may do this anyway, to avoid needing to special case when
the dirtying is done entirely by DMA (or to avoid making assumptions
that could be broken by weird hardware), but that doesn't mean QEMU/KVM
should assume that -- maybe unless there's enough performance to be
gained by looking like the aforementioned weird hardware in certain
configurations.



Re: qemu-kvm: Role of flush_icache_range on PPC

2011-09-28 Thread Benjamin Herrenschmidt
On Wed, 2011-09-28 at 16:20 -0500, Scott Wood wrote:

 Sure, if there might be stale stuff in the icache, the guest will need
 to invalidate that.  But when running on real hardware, an OS does not
 need to flush it out of data cache after a DMA transaction[1].  So
 technically we just want a flush_dcache_range() for DMA.

 It's moot unless we can distinguish DMA writes from breakpoint writes,
 though.
 
 -Scott
 
 [1] Most OSes may do this anyway, to avoid needing to special case when
 the dirtying is done entirely by DMA (or to avoid making assumptions
 that could be broken by weird hardware), but that doesn't mean QEMU/KVM
 should assume that -- maybe unless there's enough performance to be
 gained by looking like the aforementioned weird hardware in certain
 configurations.

I see what you mean. A DMA would have had an implicit cache flush while
qemu memcpy'ing to the guest won't. Hrm.

I'm not sure any guest relies on that since architecturally, the HW is
permitted to do cache intervention tricks, and rather than flush,
transfer the data directly to the cache that originally contained the
lines (cache injection). We even support that on some embedded stuff.

In any case, we should then make that depend on a flag, because it's
certainly unnecessary on P5, P6 and P7, which have a snooping icache,
and the flush can be costly.

Cheers,
Ben.
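
A sketch of the flag-gated flush Ben suggests might look like the following;
host_icache_is_coherent is a purely hypothetical flag standing in for however
the code would distinguish snooping hosts (POWER5/6/7 class) from parts that
need the flush, and the stubs are only there so the sketch stands alone:

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical flag; on snooping hosts the flush can be skipped. */
    static bool host_icache_is_coherent = true;

    /* Stubs standing in for the real qemu helpers of the same names. */
    static bool kvm_enabled(void) { return true; }
    static void flush_icache_range(unsigned long start, unsigned long stop)
    {
        (void)start; (void)stop;   /* real implementation is per-architecture */
    }

    static void dma_write_finish(void *ptr, size_t len)
    {
        /* Only pay for the icache flush when the host actually needs it. */
        if (kvm_enabled() && !host_icache_is_coherent) {
            flush_icache_range((unsigned long)ptr, (unsigned long)ptr + len);
        }
    }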




Re: [libvirt] Qemu/KVM is 3x slower under libvirt (due to vhost=on)

2011-09-28 Thread Reeted

On 09/28/11 16:51, Reeted wrote:

On 09/28/11 14:56, Richard W.M. Jones wrote:

On Wed, Sep 28, 2011 at 12:19:09PM +0200, Reeted wrote:

Ok that seems to work: it removes the vhost part in the virsh launch
hence cutting down 12secs of boot time.

If nobody comes out with an explanation of why, I will open another
thread on the kvm list for this. I would probably need to test disk
performance on vhost=on to see if it degrades or it's for another
reason that boot time is increased.

Is it using CPU during this time, or is the qemu-kvm process idle?

It wouldn't be the first time that a network option ROM sat around
waiting for an imaginary console user to press a key.

Rich.


Of the two qemu-kvm processes (threads?) which I see consuming CPU for 
that VM, one is at about 20%, the other at about 10%. I think it's 
doing something but maybe not much, or maybe it's really I/O bound and 
the I/O is slow (as I originarily thought). I will perform some disk 
benchmarks and follow up, but I can't do that right now...

Thank you


OK, I still haven't done benchmarks, but I am now fairly convinced that 
it's either a disk performance problem or a CPU problem with vhostnet on, 
not a network performance problem or idle wait.


Because I have installed another virtual machine now, which is a fedora 
core 6 (old!),  but with a debian natty kernel vmlinuz + initrd so that 
it supports virtio devices. The initrd part from Ubuntu is extremely 
short so finishes immediately, but then the fedora core 6 boot is much 
longer than with my previous ubuntu-barebone virtual machine, and with 
more messages, and I can see the various daemons being brought up one by 
one, and I can tell you such boot (and also the teardown of services 
during shutdown) is very much faster with vhostnet disabled.


With vhostnet disabled it takes 30seconds to come up (since after grub), 
and 28 seconds to shutdown.
With vhostnet enabled it takes 1m19sec to come up (since after grub), 
and 1m04sec to shutdown.



I have some ideas about disk benchmarking, that would be fio or simple 
dd. What could I use for CPU benchmarking? Would openssl speed be too 
simple?


Thank you


Re: [kvm] Re: [kvm] Re: Questions about duplicate memory work

2011-09-28 Thread Robin Lee Powell
On Tue, Sep 27, 2011 at 12:49:29PM +0300, Avi Kivity wrote:
 On 09/27/2011 12:00 PM, Robin Lee Powell wrote:
 On Tue, Sep 27, 2011 at 01:48:43AM -0700, Robin Lee Powell wrote:
   On Tue, Sep 27, 2011 at 04:41:33PM +0800, Emmanuel Noobadmin wrote:
 On 9/27/11, Robin Lee Powellrlpow...@digitalkingdom.org  wrote:
   On Mon, Sep 26, 2011 at 04:15:37PM +0800, Emmanuel Noobadmin
   wrote:
   It's unrelated to what you're actually using as the disks,
   whether file or block devices like LVs. I think it just makes
   KVM tell the host not to cache I/O done on the storage device.
 
   Wait, hold on, I think I had it backwards.
 
   It tells the *host* to not cache the device in question, or the
   *VMs* to not cache the device in question?
   
 I'm fairly certain it tells the qemu not to cache the device in
 question. If you don't want the guest to cache their i/o, then the
 guest OS should be configured if it allows that. Although I'm not
 sure if it's possible to disable disk buffering/caching system
 wide in Linux.
 
   OK, great, thanks.
 
   Now if I could just figure out how to stop the host from swapping
   out much of the VMs' qemu-kvm procs when it has almost a GiB of RAM
   left.  -_-  swappiness 0 doesn't seem to help there.
 
 Grrr.
 
 I turned swap off to clear it.  A few minutes ago, this host was at
 zero swap:
 
 top - 01:59:10 up 10 days, 15:17,  3 users,  load average: 6.39, 4.26, 3.24
 Tasks: 151 total,   1 running, 150 sleeping,   0 stopped,   0 zombie
 Cpu(s):  6.6%us,  1.0%sy,  0.0%ni, 85.9%id,  6.3%wa,  0.0%hi,  0.2%si,  
 0.0%st
 Mem:   8128772k total,  656k used,  1617656k free,14800k buffers
 Swap:  8388604k total,   672828k used,  7715776k free,97536k cached
 
PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
   2504 qemu  20   0 2425m 1.8g  448 S 10.0 23.4   3547:59 qemu-kvm
   2258 qemu  20   0 2425m 1.7g  444 S  2.7 21.7   1288:15 qemu-kvm
 18061 qemu  20   0 2433m 1.8g  428 S  2.3 23.4 401:01.99 qemu-kvm
 10335 qemu  20   0 1864m 861m  456 S  1.0 10.9   2:04.26 qemu-kvm
 [snip]
 
 Why is it doing this?!?  ;'(
 
 
 Please post the contents of /proc/meminfo and /proc/zoneinfo when
 this is happening.

I just noticed that the amount of RAM the VMs had in VIRT added up
to considerably more than the host's actual RAM; hard_limit is now
on.  So I may not be able to replicate this.  :)

-Robin


Re: [kvm] Re: [kvm] Re: [kvm] Re: Questions about duplicate memory work

2011-09-28 Thread Robin Lee Powell
On Wed, Sep 28, 2011 at 05:11:06PM -0700, Robin Lee Powell wrote:
 On Tue, Sep 27, 2011 at 12:49:29PM +0300, Avi Kivity wrote:
  On 09/27/2011 12:00 PM, Robin Lee Powell wrote:
  On Tue, Sep 27, 2011 at 01:48:43AM -0700, Robin Lee Powell wrote:
On Tue, Sep 27, 2011 at 04:41:33PM +0800, Emmanuel Noobadmin wrote:
  On 9/27/11, Robin Lee Powellrlpow...@digitalkingdom.org  wrote:
On Mon, Sep 26, 2011 at 04:15:37PM +0800, Emmanuel Noobadmin
wrote:
It's unrelated to what you're actually using as the disks,
whether file or block devices like LVs. I think it just makes
KVM tell the host not to cache I/O done on the storage device.
  
Wait, hold on, I think I had it backwards.
  
It tells the *host* to not cache the device in question, or the
*VMs* to not cache the device in question?

  I'm fairly certain it tells the qemu not to cache the device in
  question. If you don't want the guest to cache their i/o, then the
  guest OS should be configured if it allows that. Although I'm not
  sure if it's possible to disable disk buffering/caching system
  wide in Linux.
  
OK, great, thanks.
  
Now if I could just figure out how to stop the host from swapping
out much of the VMs' qemu-kvm procs when it has almost a GiB of RAM
left.  -_-  swappiness 0 doesn't seem to help there.
  
  Grrr.
  
  I turned swap off to clear it.  A few minutes ago, this host was at
  zero swap:
  
  top - 01:59:10 up 10 days, 15:17,  3 users,  load average: 6.39, 4.26, 3.24
  Tasks: 151 total,   1 running, 150 sleeping,   0 stopped,   0 zombie
  Cpu(s):  6.6%us,  1.0%sy,  0.0%ni, 85.9%id,  6.3%wa,  0.0%hi,  0.2%si,  
  0.0%st
  Mem:   8128772k total,  656k used,  1617656k free,14800k buffers
  Swap:  8388604k total,   672828k used,  7715776k free,97536k cached
  
 PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
2504 qemu  20   0 2425m 1.8g  448 S 10.0 23.4   3547:59 qemu-kvm
2258 qemu  20   0 2425m 1.7g  444 S  2.7 21.7   1288:15 qemu-kvm
  18061 qemu  20   0 2433m 1.8g  428 S  2.3 23.4 401:01.99 qemu-kvm
  10335 qemu  20   0 1864m 861m  456 S  1.0 10.9   2:04.26 qemu-kvm
  [snip]
  
  Why is it doing this?!?  ;'(
  
  
  Please post the contents of /proc/meminfo and /proc/zoneinfo when
  this is happening.
 
 I just noticed that the amount of RAM the VMs had in VIRT added up
 to considerably more than the host's actual RAM; hard_limit is now
 on.  So I may not be able to replicate this.  :)

Or not; even with hard_limit the VIRT value goes to hundreds of MiB
more than the limit.  Is that expected?

-Robin


[PATCH] qemu-kvm: device assignment: add 82599 PCIe Cap struct quirk

2011-09-28 Thread Donald Dutile
commit f9c29774d2174df6ffc20becec20928948198914
changed the PCIe Capability structure version check
from "if < 2, fail" to "if == 1, size = x; if == 2, size = y;
else fail".
Turns out the 82599's VF has an errata where its
PCIe Cap struct version is 0, which now fails device assignment
due to the else fallout, where before, it would blissfully work.

Add a quirk: if version == 0 && intel-82599, set size to the version 2 struct.

Signed-off-by: Donald_Dutile ddut...@redhat.com
---
 hw/device-assignment.c |   12 ++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 288f80c..ed2a883 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1261,12 +1261,20 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev)
 
     if ((pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_EXP, 0))) {
         uint8_t version, size;
-        uint16_t type, devctl, lnkcap, lnksta;
+        uint16_t type, devctl, lnkcap, lnksta, vendor, device;
         uint32_t devcap;
 
+        vendor = pci_get_word(pci_dev->config + PCI_VENDOR_ID);
+        device = pci_get_word(pci_dev->config + PCI_DEVICE_ID);
         version = pci_get_byte(pci_dev->config + pos + PCI_EXP_FLAGS);
         version &= PCI_EXP_FLAGS_VERS;
-        if (version == 1) {
+        if (version == 0 && vendor == 0x8086 && device == 0x10ed) {
+            /*
+             * quirk for Intel 82599 VF with invalid PCIe capability version,
+             * should really be version 2 (same as PF)
+             */
+            size = 0x3c;
+        } else if (version == 1) {
             size = 0x14;
         } else if (version == 2) {
             /*
-- 
1.7.1



Re: [PATCH] qemu-kvm: device assignment: add 82599 PCIe Cap struct quirk

2011-09-28 Thread Chris Wright
* Donald Dutile (ddut...@redhat.com) wrote:
 commit f9c29774d2174df6ffc20becec20928948198914
 changed the PCIe Capability structure version check
 from "if < 2, fail" to "if == 1, size = x; if == 2, size = y;
 else fail".
 Turns out the 82599's VF has an errata where its
 PCIe Cap struct version is 0, which now fails device assignment
 due to the else fallout, where before, it would blissfully work.
 
 Add a quirk: if version == 0 && intel-82599, set size to the version 2 struct.
 
 Signed-off-by: Donald_Dutile ddut...@redhat.com

(not pretty, but neither is the hw errata...)

Acked-by: Chris Wright chr...@redhat.com


Re: [libvirt] Qemu/KVM is 3x slower under libvirt

2011-09-28 Thread Chris Wright
* Reeted (ree...@shiftmail.org) wrote:
 On 09/28/11 11:28, Daniel P. Berrange wrote:
 On Wed, Sep 28, 2011 at 11:19:43AM +0200, Reeted wrote:
 On 09/28/11 09:51, Daniel P. Berrange wrote:
 You could have equivalently used
 
   -netdev tap,ifname=tap0,script=no,downscript=no,id=hostnet0,vhost=on
   -device 
  virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:05:36:60,bus=pci.0,addr=0x3
 It's this! It's this!! (thanks for the line)
 
 It raises boot time by 10-13 seconds
 Ok, that is truely bizarre and I don't really have any explanation
 for why that is. I guess you could try 'vhost=off' too and see if that
 makes the difference.
 
 YES!
 It's the vhost. With vhost=on it takes about 12 seconds more time to boot.

Can you help narrow down what is happening during the additional 12
seconds in the guest?  For example, does a quick simple boot to single
user mode happen at the same boot speed w/ and w/out vhost_net?

I'm guessing (hoping) that it's the network bring-up that is slow.
Are you using dhcp to get an IP address?  Does static IP have the same
slow down?

If it's just dhcp, can you recompile qemu with this patch and see if it
causes the same slowdown you saw w/ vhost?

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 0b03b57..0c864f7 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -496,7 +496,7 @@ static int receive_header(VirtIONet *n, struct iovec *iov, int iovcnt,
     if (n->has_vnet_hdr) {
         memcpy(hdr, buf, sizeof(*hdr));
         offset = sizeof(*hdr);
-        work_around_broken_dhclient(hdr, buf + offset, size - offset);
+        //work_around_broken_dhclient(hdr, buf + offset, size - offset);
     }
 
     /* We only ever receive a struct virtio_net_hdr from the tapfd,


Re: how to assign a pci device to guest [with qemu.git upstream]?

2011-09-28 Thread Chris Wright
* Ren, Yongjie (yongjie@intel.com) wrote:
 I'm using kvm and qemu upstream on https://github.com/avikivity
 The following command line was right for me about three weeks ago, but now I 
 meet some error.
 # qemu-system-x86_64 -m 1024 -smp 2 -device pci-assign,host=0e:00.0 -hda 
 /root/rhel6u1.img
 output error is like following.
 qemu-system-x86_64: -device pci-assign,host=0d:00.0: Parameter 'driver' 
 expects a driver name
 Try with argument '?' for a list.

Looks like you don't have device assignment support compiled in.
Start with the basics (assuming tree has hw/device-assignment.c):

did your ./configure output show:

KVM device assig. yes

and does your binary agree?

qemu-system-x86_64 -device ? 2>&1 | grep pci-assign

thanks,
-chris


RE: how to assign a pci device to guest [with qemu.git upstream]?

2011-09-28 Thread Ren, Yongjie
Chris,
Thanks very much for your kind help. 
I can't find hw/device-assignment.c in the qemu.git tree.
Avi,
I cloned qemu from git://github.com/avikivity/qemu.git,
so device assignment is not available. But qemu-kvm.git had device-assignment 
code before kernel.org went down.
Any update on this issue?

Best Regards,
 Yongjie Ren  (Jay)

 -Original Message-
 From: Chris Wright [mailto:chr...@sous-sol.org]
 Sent: Thursday, September 29, 2011 8:57 AM
 To: Ren, Yongjie
 Cc: KVM General
 Subject: Re: how to assign a pci device to guest [with qemu.git upstream]?
 
 * Ren, Yongjie (yongjie@intel.com) wrote:
  I'm using kvm and qemu upstream on https://github.com/avikivity
  The following command line was right for me about three weeks ago,
 but now I meet some error.
  # qemu-system-x86_64 -m 1024 -smp 2 -device pci-assign,host=0e:00.0
 -hda /root/rhel6u1.img
  output error is like following.
  qemu-system-x86_64: -device pci-assign,host=0d:00.0: Parameter 'driver'
 expects a driver name
  Try with argument '?' for a list.
 
 Looks like you don't have device assignment support compiled in.
 Start with the basics (assuming tree has hw/device-assignment.c):
 
 did your ./configure output show:
 
 KVM device assig. yes
 
 and does your binary agree?
 
  qemu-system-x86_64 -device ? 2>&1 | grep pci-assign
 
 thanks,
 -chris


Re: how to assign a pci device to guest [with qemu.git upstream]?

2011-09-28 Thread Chris Wright
* Ren, Yongjie (yongjie@intel.com) wrote:
 Chris,
 Thanks very much for you kind help. 
 I can't find hw/device-assignment.c in the qemu.git tree.
 Avi,
 I clone qemu from git://github.com/avikivity/qemu.git 
 So device assignment is not available. But qemu-kvm.git has device-assignment 
 code before kernel.org is down.
 Any update for this issue?

Are you using the master branch?  I noticed the github web defaults to
the memory/queue branch.

thanks,
-chris


Re: [PATCH] qemu-kvm: device assignment: add 82599 PCIe Cap struct quirk

2011-09-28 Thread Alex Williamson
On Wed, 2011-09-28 at 20:20 -0400, Donald Dutile wrote:
 commit f9c29774d2174df6ffc20becec20928948198914
 changed the PCIe Capability structure version check
  from "if < 2, fail" to "if == 1, size = x; if == 2, size = y;
  else fail".
  Turns out the 82599's VF has an errata where its
  PCIe Cap struct version is 0, which now fails device assignment
  due to the else fallout, where before, it would blissfully work.
  
  Add a quirk: if version == 0 && intel-82599, set size to the version 2 struct.
 
 Signed-off-by: Donald_Dutile ddut...@redhat.com

Ugh.

Acked-by: Alex Williamson alex.william...@redhat.com



Re: how to assign a pci device to guest [with qemu.git upstream]?

2011-09-28 Thread Chris Wright
* Chris Wright (chr...@sous-sol.org) wrote:
 * Ren, Yongjie (yongjie@intel.com) wrote:
  Chris,
  Thanks very much for you kind help. 
  I can't find hw/device-assignment.c in the qemu.git tree.
  Avi,
  I clone qemu from git://github.com/avikivity/qemu.git 
  So device assignment is not available. But qemu-kvm.git has 
  device-assignment code before kernel.org is down.
  Any update for this issue?
 
 Are you using the master branch?  I noticed the github web defaults to
 the memory/queue branch.

BTW, if you hadn't used branches much before, something like this will
get you what you want:

$ git checkout -b master origin/master

Now you'll be on the master branch (and it should track upstream master
properly).

thanks,
-chris


RE: how to assign a pci device to guest [with qemu.git upstream]?

2011-09-28 Thread Ren, Yongjie
 -Original Message-
 From: Chris Wright [mailto:chr...@sous-sol.org]
 Sent: Thursday, September 29, 2011 12:33 PM
 To: Ren, Yongjie
 Cc: Chris Wright; Avi Kivity; KVM General
 Subject: Re: how to assign a pci device to guest [with qemu.git upstream]?
 
 * Chris Wright (chr...@sous-sol.org) wrote:
  * Ren, Yongjie (yongjie@intel.com) wrote:
   Chris,
   Thanks very much for you kind help.
   I can't find hw/device-assignment.c in the qemu.git tree.
   Avi,
   I clone qemu from git://github.com/avikivity/qemu.git
   So device assignment is not available. But qemu-kvm.git has
 device-assignment code before kernel.org is down.
   Any update for this issue?
 
  Are you using the master branch?  I noticed the github web defaults to
  the memory/queue branch.
 
 BTW, if you hadn't used branches much before, something like this will
 get you what you want:
 
 $ git checkout -b master origin/master
 
 Now you'll be on the master branch (and it should track upstream master
 properly).
Oh, thanks a lot. I didn't notice that qemu.git was using the 'memory/queue' 
branch by default.
I've switched to the 'master' branch. It seems device assignment works fine 
now.
 
 thanks,
 -chris