On Mon, Feb 25, 2013 at 10:48:47AM +0200, Abel Gordon wrote:
> Stefan Hajnoczi <stefa...@gmail.com> wrote on 21/02/2013 10:11:12 AM:
>
> > From: Stefan Hajnoczi <stefa...@gmail.com>
> > To: Loic Dachary <l...@dachary.org>,
> > Cc: qemu-devel <qemu-devel@nongnu.org>
> > Date: 21/02/2013 10:11 AM
> > Subject: Re: [Qemu-devel] Block I/O optimizations
> > Sent by: qemu-devel-bounces+abelg=il.ibm....@nongnu.org
> >
> > On Mon, Feb 18, 2013 at 7:19 PM, Loic Dachary <l...@dachary.org> wrote:
> > > I recently tried to figure out the best and easiest ways to
> > > increase block I/O performance with qemu. Not being a qemu expert,
> > > I expected to find a few optimization tricks. Much to my surprise,
> > > it appears that there are many significant improvements being worked
> > > on. This is excellent news :-)
> > >
> > > However, I'm not sure I understand how they all fit together. It's
> > > probably quite obvious from the developer point of view, but I would
> > > very much appreciate an overview of how dataplane, vhost-blk, ELVIS,
> > > etc. should be used or developed to maximize I/O performance. Are
> > > there documents I should read? If not, would someone be willing to
> > > share bits of wisdom?
> >
> > Hi Loic,
> > There will be more information on dataplane shortly. I'll write up a
> > blog post and share the link with you.
>
> Hi Stefan,
>
> I assume dataplane could provide a significant performance boost
> and approximate vhost-blk performance. If I understand properly,
> that's because dataplane finally removes the dependency on
> the global mutex and uses eventfd to process notifications.
Right, it's the same approach - ioeventfd for kicks and irqfd for
notifies. The difference is a kernel thread vs a userspace thread.

> However, I am concerned dataplane may not solve the scalability
> problem because QEMU will still be running 1 thread per VCPU and
> 1 per virtual device to handle I/O for each VM. Assuming we run
> N VMs with 1 VCPU and 1 virtual I/O device, we will have 2N threads
> competing for CPU cycles. In a cloud-like environment running I/O
> intensive VMs that could be a problem because the I/O threads and
> VCPU threads may starve each other. Furthermore, the Linux kernel
> can't make good scheduling decisions (from an I/O perspective)
> because it has no information about the content of the I/O queues.

The kernel knows when the dataplane thread is schedulable - when the
ioeventfd is signalled. In the worst case the scheduler could allow the
vcpu thread to complete an entire time slice before letting the
dataplane thread run.

So are you saying that the Linux scheduler wouldn't allow the dataplane
thread to run on a loaded box?

My first thought would be to raise the priority of the dataplane thread
so that it preempts the vcpu thread upon becoming schedulable.

> We did some experiments with a modified vhost-blk back-end that uses
> a single thread (or a few threads) to process I/O for many VMs, as
> opposed to 1 thread per VM (I/O device). These threads decide for
> how long and when to process the requests of each VM based on the
> I/O activity of each queue. We noticed that this model (part of what
> we call ELVIS) significantly improves the scalability of the system
> when you run many I/O intensive guests.

When you say "this model (part of what we call ELVIS) significantly
improves the scalability of the system when you run many I/O intensive
guests", do you mean exit-less vs exit-based, or shared thread vs 1
thread per device (without polling)? I'm not sure whether you're
advocating exit-less (polling) or a shared thread without polling.

> I was wondering if you have considered this type of threading model
> for dataplane as well. With vhost-blk (or -net) it's relatively easy
> to use a kernel thread to process I/O for many VMs (user-space
> processes). However, with a QEMU back-end (like dataplane/virtio-blk)
> the shared thread model may be challenging because it requires a
> shared user-space process (for the I/O threads) to handle I/O for
> many QEMU processes.
>
> Any thoughts/opinions on the shared-thread direction?

For low latency, polling makes sense, and a shared thread is an
efficient way to implement polling. But it throws away resource control
and isolation - you can no longer use cgroups and other standard
resource control mechanisms to manage guests. You also create a
privileged thread that has access to all guests on the host - a
security bug there compromises all guests. That can be fine for private
deployments where guests are trusted; for untrusted guests and public
clouds it seems risky.

Maybe a hybrid approach is possible where notifications are exit-less
but I/O emulation still happens in per-guest userspace threads. I'm not
sure how much performance can be retained by doing that - e.g. a kernel
driver that allows processes to bind an eventfd to a memory
notification area. The kernel driver does the polling in a single
thread and signals the eventfds; userspace threads do the actual I/O
emulation.

Stefan
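To make the thread-priority idea discussed above concrete, here is a
minimal standalone sketch (not QEMU code; dataplane_thread_fn() and its
contents are placeholders) of creating an I/O thread with a SCHED_FIFO
real-time priority so it preempts SCHED_OTHER vcpu threads as soon as it
becomes runnable. Running it requires CAP_SYS_NICE or a non-zero
RLIMIT_RTPRIO, and the same effect can be had externally with chrt on
the thread's TID.

    /* Minimal sketch, not QEMU code: give the I/O thread a real-time
     * priority so it preempts SCHED_OTHER vcpu threads when its
     * ioeventfd becomes readable.  Build with -pthread. */
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    static void *dataplane_thread_fn(void *opaque)
    {
        /* ...block on the ioeventfd and process virtqueue requests... */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        pthread_attr_t attr;
        struct sched_param sp;
        int err;

        pthread_attr_init(&attr);
        /* Do not inherit the creator's SCHED_OTHER policy */
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
        pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
        memset(&sp, 0, sizeof(sp));
        sp.sched_priority = 1;  /* any RT priority beats SCHED_OTHER threads */
        pthread_attr_setschedparam(&attr, &sp);

        err = pthread_create(&tid, &attr, dataplane_thread_fn, NULL);
        if (err) {
            /* Typically EPERM without CAP_SYS_NICE or RLIMIT_RTPRIO > 0 */
            fprintf(stderr, "pthread_create: %s\n", strerror(err));
            return 1;
        }
        pthread_join(tid, NULL);
        pthread_attr_destroy(&attr);
        return 0;
    }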
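And a rough sketch of the per-guest userspace side of the hybrid idea:
each guest keeps its own emulation thread (so cgroups and isolation
still apply), blocking on an eventfd that the hypothetical kernel
polling driver would signal. Only the eventfd/read loop is existing
API here; the kernel driver and process_virtqueue() are assumptions.

    /* Sketch of the per-guest userspace side of the hybrid approach.
     * The kernel polling driver is hypothetical; process_virtqueue()
     * stands in for the real virtio-blk emulation. */
    #include <pthread.h>
    #include <stdint.h>
    #include <sys/eventfd.h>
    #include <unistd.h>

    static void process_virtqueue(void)
    {
        /* ...I/O emulation for this guest only, inside its own cgroup... */
    }

    static void *guest_io_thread(void *opaque)
    {
        int efd = *(int *)opaque;
        uint64_t n;

        for (;;) {
            /* Blocks until the (hypothetical) kernel polling thread,
             * which watches this guest's notification area, signals
             * the eventfd. */
            if (read(efd, &n, sizeof(n)) == sizeof(n)) {
                process_virtqueue();
            }
        }
        return NULL;
    }

    int main(void)
    {
        int efd = eventfd(0, 0);
        pthread_t tid;

        /* In the hybrid model, efd would be registered with the kernel
         * driver together with the guest's notification memory area. */
        pthread_create(&tid, NULL, guest_io_thread, &efd);
        pthread_join(tid, NULL);
        return 0;
    }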