Re: [Qemu-devel] Re: [PATCH 2/3] virtio network device
On Wednesday 12 December 2007 12:40:43 Anthony Liguori wrote:
> Rusty Russell wrote:
> > On Sunday 09 December 2007 09:02:48 Anthony Liguori wrote:
> > > If QEMU ever got true SMP support, then virtio would not work, as it
> > > requires 16-bit atomic writes, which AFAIK are not possible on a
> > > number of non-x86 architectures.
> >
> > Hmm?  Where is this requirement coming from?
> >
> > I think everyone should stop using the word "atomic" in virtio
> > discussions; it's confusing.
>
> The virtio ring queue indices are 16-bit and are readable to one end
> while writable on the other end.  To ensure that this can be done in a
> lock-less way, it's necessary to atomically update the index.  Atomic is
> the right word here because if the 16-bit write gets converted into two
> 8-bit writes, then very bad things could happen with SMP.

Of course, but that's insane.  Your assertion that it's not possible on a
number of non-x86 architectures is what I'm questioning here.

You're confusing the inability of some architectures to atomically
*modify* a 16-bit value with our requirement, which is only an un-torn
write: even if you found an architecture which couldn't do 16-bit writes,
you could do it as a 32-bit write.

Hope that clarifies,
Rusty.
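Rusty's distinction can be sketched in C. This is illustrative only, not QEMU or virtio code: it shows that publishing a ring index needs an un-torn *store*, not an atomic read-modify-write, and that on a hypothetical architecture lacking 16-bit store instructions the field can simply be widened so the update becomes one aligned 32-bit store.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch (names invented): a 16-bit ring index widened to a 32-bit
 * slot.  Only one side ever writes a given index, so a single aligned
 * store suffices; it is never split into two 8-bit writes. */
struct ring_idx {
    volatile uint32_t val;   /* 16-bit index, widened for this sketch */
};

static void publish_idx(struct ring_idx *ri, uint16_t idx)
{
    ri->val = idx;   /* one aligned 32-bit store */
}

static uint16_t read_idx(const struct ring_idx *ri)
{
    return (uint16_t)ri->val;
}
```

The point is that no compare-and-swap or locked increment is involved anywhere; the "atomicity" needed is only freedom from store tearing.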
Re: [Qemu-devel] Re: [PATCH 2/3] virtio network device
Rusty Russell wrote:
> On Sunday 09 December 2007 09:02:48 Anthony Liguori wrote:
> > If QEMU ever got true SMP support, then virtio would not work as it
> > requires 16-bit atomic writes which AFAIK is not possible on a number
> > of non-x86 architectures.
>
> Hmm?  Where is this requirement coming from?
>
> I think everyone should stop using the word "atomic" in virtio
> discussions; it's confusing.

The virtio ring queue indices are 16-bit and are readable to one end
while writable on the other end.  To ensure that this can be done in a
lock-less way, it's necessary to atomically update the index.  Atomic is
the right word here because if the 16-bit write gets converted into two
8-bit writes, then very bad things could happen with SMP.

Regards,

Anthony Liguori

> Rusty.
Re: [Qemu-devel] Re: [PATCH 2/3] virtio network device
On Sunday 09 December 2007 09:02:48 Anthony Liguori wrote:
> If QEMU ever got true SMP support, then virtio would not work as it
> requires 16-bit atomic writes which AFAIK is not possible on a number of
> non-x86 architectures.

Hmm?  Where is this requirement coming from?

I think everyone should stop using the word "atomic" in virtio
discussions; it's confusing.

Rusty.
Re: [Qemu-devel] Re: [PATCH 2/3] virtio network device
Jamie Lokier wrote:
> Paul Brook wrote:
> > > > > virtio makes things a bit trickier though. There's a shared ring
> > > > > queue between the host and guest. The ring queue is lock-less and
> > > > > depends on the ability to atomically increment ring queue indices
> > > > > to be SMP safe. Using a copy-API wouldn't be a problem for QEMU
> > > > > since the host and guest are always running in lock-step. A copy
> > > > > API is actually needed to deal with differing host/guest
> > > > > alignment and endianness.
> > > >
> > > > That seems a rather poor design choice, as many architectures don't
> > > > have an atomic increment instruction. Oh well.
> > >
> > > Most have compare-and-swap or load-locked/store-conditional
> > > instructions, though, which can be used to implement atomic
> > > increment.
> >
> > Yes, but your "hardware" implementation has to make sure it interacts
> > with those properly. It's certainly possible to implement lockless
> > lists without requiring atomic increment. Most high-end hardware
> > manages it and that doesn't even have coherent DMA.
>
> I'm inclined to agree that a lockless structure (including not using an
> atomic op) for the most commonly used paths, such as appending to a
> ring, would be better. It increases the potential portability for
> emulation/virtualisation across all architectures now and in the future,
> and it would almost certainly perform better on architectures other than
> x86.
>
> However, occasionally you need to enter the host for synchronisation,
> e.g. when a ring is empty or full. In that case, using atomic ops in the
> way that futexes are used in Linux/Glibc can sometimes optimise the
> details of those transitions, but it would be best if they were optional
> optimisations, for cross-platform, cross-architecture portability.
>
> There's a particularly awkward problem when taking an x86 atomic op in
> the guest, and generating code on the non-x86 host which doesn't have
> any equivalent op. What's the right thing to do?
>
> Since virtio is driven by virtualisation projects rather than emulation,
> is it possible this hasn't been thought of at all, making virtio
> unusable for cross-architecture emulation? That would be really
> unfortunate.

virtio has been designed for virtualization, yes. There aren't really
restrictions that prevent its use when doing cross-architecture emulation
(yet) with QEMU.

If QEMU ever got true SMP support, then virtio would not work, as it
requires 16-bit atomic writes, which AFAIK are not possible on a number of
non-x86 architectures.

Regards,

Anthony Liguori

> --
> Jamie
Re: [Qemu-devel] Re: [PATCH 2/3] virtio network device
Paul Brook wrote:
> > virtio makes things a bit trickier though. There's a shared ring queue
> > between the host and guest. The ring queue is lock-less and depends on
> > the ability to atomically increment ring queue indices to be SMP safe.
> > Using a copy-API wouldn't be a problem for QEMU since the host and
> > guest are always running in lock-step. A copy API is actually needed
> > to deal with differing host/guest alignment and endianness.
>
> That seems a rather poor design choice, as many architectures don't have
> an atomic increment instruction. Oh well.

Sorry, I should have been more clear.  An atomic increment isn't needed,
just an atomic store.

Regards,

Anthony Liguori

> Paul
Re: [Qemu-devel] Re: [PATCH 2/3] virtio network device
Paul Brook wrote:
> > > > virtio makes things a bit trickier though. There's a shared ring
> > > > queue between the host and guest. The ring queue is lock-less and
> > > > depends on the ability to atomically increment ring queue indices
> > > > to be SMP safe. Using a copy-API wouldn't be a problem for QEMU
> > > > since the host and guest are always running in lock-step. A copy
> > > > API is actually needed to deal with differing host/guest alignment
> > > > and endianness.
> > >
> > > That seems a rather poor design choice, as many architectures don't
> > > have an atomic increment instruction. Oh well.
> >
> > Most have compare-and-swap or load-locked/store-conditional
> > instructions, though, which can be used to implement atomic increment.
>
> Yes, but your "hardware" implementation has to make sure it interacts
> with those properly. It's certainly possible to implement lockless lists
> without requiring atomic increment. Most high-end hardware manages it
> and that doesn't even have coherent DMA.

I'm inclined to agree that a lockless structure (including not using an
atomic op) for the most commonly used paths, such as appending to a ring,
would be better. It increases the potential portability for
emulation/virtualisation across all architectures now and in the future,
and it would almost certainly perform better on architectures other than
x86.

However, occasionally you need to enter the host for synchronisation,
e.g. when a ring is empty or full. In that case, using atomic ops in the
way that futexes are used in Linux/Glibc can sometimes optimise the
details of those transitions, but it would be best if they were optional
optimisations, for cross-platform, cross-architecture portability.

There's a particularly awkward problem when taking an x86 atomic op in
the guest, and generating code on the non-x86 host which doesn't have any
equivalent op. What's the right thing to do?

Since virtio is driven by virtualisation projects rather than emulation,
is it possible this hasn't been thought of at all, making virtio unusable
for cross-architecture emulation? That would be really unfortunate.

--
Jamie
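The lockless fast path Jamie describes can be sketched as a single-producer/single-consumer ring. This is a toy model, not virtio's actual layout: each index is written by exactly one side, so plain stores suffice and no atomic increment is needed (memory barriers, required on real SMP hardware, are elided here).

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 8   /* power of two, chosen for this sketch */

/* SPSC ring: 'prod' is written only by the producer, 'cons' only by
 * the consumer, so neither update needs an atomic read-modify-write. */
struct spsc_ring {
    uint32_t buf[RING_SIZE];
    volatile uint16_t prod;
    volatile uint16_t cons;
};

static int ring_put(struct spsc_ring *r, uint32_t v)
{
    if ((uint16_t)(r->prod - r->cons) == RING_SIZE)
        return 0;                       /* full: would enter host to sync */
    r->buf[r->prod % RING_SIZE] = v;
    r->prod++;                          /* single writer: plain store */
    return 1;
}

static int ring_get(struct spsc_ring *r, uint32_t *v)
{
    if (r->prod == r->cons)
        return 0;                       /* empty: would enter host to sync */
    *v = r->buf[r->cons % RING_SIZE];
    r->cons++;                          /* single writer: plain store */
    return 1;
}
```

The empty/full cases are exactly the transitions where Jamie notes the guest occasionally has to enter the host, and where futex-style atomic ops could optionally optimise the handoff.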
Re: [Qemu-devel] Re: [PATCH 2/3] virtio network device
On 12/8/07, Paul Brook <[EMAIL PROTECTED]> wrote:
> On Saturday 08 December 2007, Jamie Lokier wrote:
> > Paul Brook wrote:
> > > > virtio makes things a bit trickier though. There's a shared ring
> > > > queue between the host and guest. The ring queue is lock-less and
> > > > depends on the ability to atomically increment ring queue indices
> > > > to be SMP safe. Using a copy-API wouldn't be a problem for QEMU
> > > > since the host and guest are always running in lock-step. A copy
> > > > API is actually needed to deal with differing host/guest alignment
> > > > and endianness.
> > >
> > > That seems a rather poor design choice, as many architectures don't
> > > have an atomic increment instruction. Oh well.
> >
> > Most have compare-and-swap or load-locked/store-conditional
> > instructions, though, which can be used to implement atomic increment.
>
> Yes, but your "hardware" implementation has to make sure it interacts
> with those properly. It's certainly possible to implement lockless lists
> without requiring atomic increment. Most high-end hardware manages it
> and that doesn't even have coherent DMA.

If we start adding locks for IO, could we use the same locking model more
widely, or make it generic so that it would support an SMP host as well?
Re: [Qemu-devel] Re: [PATCH 2/3] virtio network device
On Saturday 08 December 2007, Jamie Lokier wrote:
> Paul Brook wrote:
> > > virtio makes things a bit trickier though. There's a shared ring
> > > queue between the host and guest. The ring queue is lock-less and
> > > depends on the ability to atomically increment ring queue indices to
> > > be SMP safe. Using a copy-API wouldn't be a problem for QEMU since
> > > the host and guest are always running in lock-step. A copy API is
> > > actually needed to deal with differing host/guest alignment and
> > > endianness.
> >
> > That seems a rather poor design choice, as many architectures don't
> > have an atomic increment instruction. Oh well.
>
> Most have compare-and-swap or load-locked/store-conditional
> instructions, though, which can be used to implement atomic increment.

Yes, but your "hardware" implementation has to make sure it interacts
with those properly. It's certainly possible to implement lockless lists
without requiring atomic increment. Most high-end hardware manages it and
that doesn't even have coherent DMA.

Paul
Re: [Qemu-devel] Re: [PATCH 2/3] virtio network device
Paul Brook wrote:
> > virtio makes things a bit trickier though. There's a shared ring queue
> > between the host and guest. The ring queue is lock-less and depends on
> > the ability to atomically increment ring queue indices to be SMP safe.
> > Using a copy-API wouldn't be a problem for QEMU since the host and
> > guest are always running in lock-step. A copy API is actually needed
> > to deal with differing host/guest alignment and endianness.
>
> That seems a rather poor design choice, as many architectures don't have
> an atomic increment instruction. Oh well.

Most have compare-and-swap or load-locked/store-conditional instructions,
though, which can be used to implement atomic increment.

--
Jamie
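Jamie's point — that compare-and-swap is enough to build atomic increment — can be sketched with C11 atomics (a sketch for illustration; `atomic_inc16` is an invented name, not an API from QEMU or virtio):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Atomic increment built from compare-and-swap alone: retry until our
 * CAS wins.  On LL/SC architectures the compiler lowers the CAS to a
 * load-locked/store-conditional pair with the same retry structure. */
static uint16_t atomic_inc16(_Atomic uint16_t *p)
{
    uint16_t old = atomic_load(p);
    /* On failure, atomic_compare_exchange_weak reloads 'old' from *p,
     * so the loop retries with the freshly observed value. */
    while (!atomic_compare_exchange_weak(p, &old, (uint16_t)(old + 1)))
        ;
    return (uint16_t)(old + 1);   /* the value we installed */
}
```

This is exactly why "no atomic increment instruction" is not a hard portability barrier, though as Paul notes below, an emulated "hardware" device still has to interact correctly with the guest's own CAS/LL-SC operations.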
Re: [Qemu-devel] Re: [PATCH 2/3] virtio network device
> virtio makes things a bit trickier though. There's a shared ring queue
> between the host and guest. The ring queue is lock-less and depends on
> the ability to atomically increment ring queue indices to be SMP safe.
> Using a copy-API wouldn't be a problem for QEMU since the host and guest
> are always running in lock-step. A copy API is actually needed to deal
> with differing host/guest alignment and endianness.

That seems a rather poor design choice, as many architectures don't have
an atomic increment instruction. Oh well.

Paul
Re: [Qemu-devel] Re: [PATCH 2/3] virtio network device
Anthony Liguori wrote:
> virtio makes things a bit trickier though. There's a shared ring queue
> between the host and guest.

Does this work when the host and guest are different architectures, or
different word sizes (x86/x86_64)?

--
Jamie
Re: [Qemu-devel] Re: [PATCH 2/3] virtio network device
Paul Brook wrote:
> > > Actually according to qemu's standard, one should use
> > > cpu_physical_memory_write/cpu_physical_memory_read functions.
> > > This is true also for reading the ring values.
> >
> > Yes, and unfortunately, cpu_physical_memory_{read,write} are copy
> > interfaces. We really don't want that for high speed I/O.
>
> I really don't like doing direct access to guest ram without
> implementing a proper API for zero-copy/scatter-gather access. There was
> a list thread about this not so long ago.

I agree that we need a proper API for sg ram access.  I'm going to look
into that real soon since it's necessary to optimize the network/disk
transports.

virtio makes things a bit trickier though. There's a shared ring queue
between the host and guest. The ring queue is lock-less and depends on
the ability to atomically increment ring queue indices to be SMP safe.
Using a copy-API wouldn't be a problem for QEMU since the host and guest
are always running in lock-step. A copy API is actually needed to deal
with differing host/guest alignment and endianness.

Once you introduce KVM, though, this is no longer true, since KVM
supports true SMP. The solution may be to implement some sort of
atomic_increment function and then have that use an if (kvm) guard to do
a direct access versus a copy.

Regards,

Anthony Liguori

> Paul
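Anthony's "if (kvm) guard" idea could look roughly like this. Everything here is an invented stand-in for illustration (`kvm_enabled_flag`, `copy_write16`, `store_ring_idx` are not QEMU functions): under KVM the guest really runs SMP, so the index must reach guest memory as one un-torn store; under plain QEMU, where guest and device run in lock-step, a copy interface is safe.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static int kvm_enabled_flag;  /* stand-in for QEMU's real kvm check */

/* Copy path: models a cpu_physical_memory_write()-style interface for
 * this sketch.  A byte-wise copy may be torn under true SMP. */
static void copy_write16(uint16_t *guest, uint16_t v)
{
    memcpy(guest, &v, sizeof(v));
}

/* Guarded update of a guest-visible ring index. */
static void store_ring_idx(uint16_t *guest, uint16_t v)
{
    if (kvm_enabled_flag)
        *(volatile uint16_t *)guest = v;   /* single un-torn store */
    else
        copy_write16(guest, v);            /* lock-step: copy is fine */
}
```

The design choice being weighed is that the slow copy path keeps MMIO dispatch and dirty tracking, while the direct path trades those away for SMP-safe, zero-copy index updates.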
Re: [Qemu-devel] Re: [PATCH 2/3] virtio network device
> > Actually according to qemu's standard, one should use
> > cpu_physical_memory_write/cpu_physical_memory_read functions.
> > This is true also for reading the ring values.
>
> Yes, and unfortunately, cpu_physical_memory_{read,write} are copy
> interfaces. We really don't want that for high speed I/O.

I really don't like doing direct access to guest ram without implementing
a proper API for zero-copy/scatter-gather access. There was a list thread
about this not so long ago.

Paul
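A zero-copy/scatter-gather guest-RAM API of the kind Paul asks for might look roughly like this. The names (`guest_map_sg`) and the flat `guest_ram` backing are invented for this sketch; QEMU's eventual interface is not being described here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>   /* struct iovec */

/* Flat guest RAM stand-in for this sketch. */
static uint8_t guest_ram[4096];

/* Map a guest-physical range into host iovecs without copying.
 * A real implementation would walk the RAM regions, split across
 * region boundaries, and fail (or bounce-buffer) on MMIO; this
 * sketch assumes one contiguous plain-RAM region. */
static int guest_map_sg(uint64_t gpa, size_t len,
                        struct iovec *iov, int max_iov)
{
    if (gpa + len > sizeof(guest_ram) || max_iov < 1)
        return -1;
    iov[0].iov_base = guest_ram + gpa;   /* direct pointer: zero-copy */
    iov[0].iov_len  = len;
    return 1;                            /* number of iovecs used */
}
```

Returning iovecs lets the device hand buffers straight to readv/writev-style I/O, which is what makes the interface zero-copy, unlike the cpu_physical_memory_{read,write} copy interfaces discussed above.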
[Qemu-devel] Re: [PATCH 2/3] virtio network device
Dor Laor wrote:
> Anthony Liguori wrote:
> > Index: qemu/hw/virtio-net.c
> > ===================================================================
> > --- /dev/null	1970-01-01 00:00:00.0 +
> > +++ qemu/hw/virtio-net.c	2007-12-04 14:17:37.0 -0600
> >
> > +static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
> > +{
> > +    VirtIONet *n = opaque;
> > +    VirtQueueElement elem;
> > +    struct virtio_net_hdr *hdr;
> > +    int offset, i;
> > +
> > +    /* FIXME: the drivers really need to set their status better */
> > +    if (n->rx_vq->vring.avail == NULL) {
> > +        n->can_receive = 0;
> > +        return;
> > +    }
> > +
> > +    if (virtqueue_pop(n->rx_vq, &elem) == 0) {
> > +        /* wait until the guest adds some rx bufs */
> > +        n->can_receive = 0;
> > +        return;
> > +    }
> > +
> > +    hdr = (void *)elem.in_sg[0].iov_base;
> > +    hdr->flags = 0;
> > +    hdr->gso_type = VIRTIO_NET_HDR_GSO_NONE;
> > +
> > +    /* copy in packet.  ugh */
> > +    offset = 0;
> > +    i = 1;
> > +    while (offset < size && i < elem.in_num) {
> > +        int len = MIN(elem.in_sg[i].iov_len, size - offset);
> > +        memcpy(elem.in_sg[i].iov_base, buf + offset, len);
>
> Actually according to qemu's standard, one should use
> cpu_physical_memory_write/cpu_physical_memory_read functions.
> This is true also for reading the ring values.

Yes, and unfortunately, cpu_physical_memory_{read,write} are copy
interfaces.  We really don't want that for high speed I/O.

The only down-sides of not using cpu_physical_memory_{read,write} are
that writes aren't dispatched to MMIO functions and dirty updating isn't
handled automatically.  I do need to go through and audit the dirty
updating.  It may be possible to do a one-time check of MMIO memory and
then have a fast and slow path.  I'm not sure how much it matters,
honestly.

Regards,

Anthony Liguori

> > +        offset += len;
> > +        i++;
> > +    }
> > +
> > +    /* signal other side */
> > +    virtqueue_push(n->rx_vq, &elem, sizeof(*hdr) + offset);
> > +    virtio_notify(&n->vdev, n->rx_vq);
> > +}
> > +
> > +/* TX */
> > +static void virtio_net_handle_tx(VirtIODevice *vdev, VirtQueue *vq)
> > +{
> > +    VirtIONet *n = to_virtio_net(vdev);
> > +    VirtQueueElement elem;
> > +
> > +    while (virtqueue_pop(vq, &elem)) {
> > +        int i;
> > +        size_t len = 0;
> > +
> > +        /* ignore the header for now */
> > +        for (i = 1; i < elem.out_num; i++) {
> > +            qemu_send_packet(n->vc, elem.out_sg[i].iov_base,
> > +                             elem.out_sg[i].iov_len);
> > +            len += elem.out_sg[i].iov_len;
> > +        }
> > +
> > +        virtqueue_push(vq, &elem, sizeof(struct virtio_net_hdr) + len);
> > +        virtio_notify(&n->vdev, vq);
> > +    }
>
> The virtio_notify should be left out of the while loop to minimize irq
> injection.
>
> > +}
> > +
> > +void *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn)
> > +{
> > +    VirtIONet *n;
> > +
> > +    n = (VirtIONet *)virtio_init_pci(bus, "virtio-net", 6900, 0x1000,
> > +                                     0, VIRTIO_ID_NET,
> > +                                     0x02, 0x00, 0x00,
> > +                                     6, sizeof(VirtIONet));
> > +
> > +    n->vdev.update_config = virtio_net_update_config;
> > +    n->vdev.get_features = virtio_net_get_features;
> > +    n->rx_vq = virtio_add_queue(&n->vdev, 512, virtio_net_handle_rx);
> > +    n->tx_vq = virtio_add_queue(&n->vdev, 128, virtio_net_handle_tx);
> > +    n->can_receive = 0;
> > +    memcpy(n->mac, nd->macaddr, 6);
> > +    n->vc = qemu_new_vlan_client(nd->vlan, virtio_net_receive,
> > +                                 virtio_net_can_receive, n);
> > +
> > +    return &n->vdev;
> > +}
> >
> > Index: qemu/hw/pc.h
> > ===================================================================
> > --- qemu.orig/hw/pc.h	2007-12-04 14:15:20.0 -0600
> > +++ qemu/hw/pc.h	2007-12-04 14:43:57.0 -0600
> > @@ -142,4 +142,9 @@
> >  void isa_ne2000_init(int base, qemu_irq irq, NICInfo *nd);
> >
> > +/* virtio-net.c */
> > +
> > +void *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn);
> > +
> > +
> >  #endif
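Dor's review point — hoist virtio_notify out of the TX while loop — can be modelled with a toy counter. This is a sketch, not the patch's actual code: `drain_queue` and `irqs_injected` are invented stand-ins, but the structure mirrors the suggested change of injecting one irq per drained batch instead of one per packet.

```c
#include <assert.h>

/* Counts simulated interrupt injections into the guest. */
static int irqs_injected;

static void notify(void)   /* stand-in for virtio_notify() */
{
    irqs_injected++;
}

/* Drain 'queued' pending TX elements, notifying once at the end
 * rather than once per element (Dor's suggested structure). */
static int drain_queue(int queued)
{
    int sent = 0;
    while (queued-- > 0)
        sent++;        /* stand-in for qemu_send_packet() + push */
    if (sent)
        notify();      /* one irq for the whole batch */
    return sent;
}
```

With notify() inside the loop, draining five packets would inject five irqs; hoisted, it injects one, which is the interrupt-mitigation win Dor is pointing at.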
[Qemu-devel] Re: [PATCH 2/3] virtio network device
Anthony Liguori wrote:
> Index: qemu/hw/virtio-net.c
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.0 +
> +++ qemu/hw/virtio-net.c	2007-12-04 14:17:37.0 -0600
>
> +static void virtio_net_receive(void *opaque, const uint8_t *buf, int size)
> +{
> +    VirtIONet *n = opaque;
> +    VirtQueueElement elem;
> +    struct virtio_net_hdr *hdr;
> +    int offset, i;
> +
> +    /* FIXME: the drivers really need to set their status better */
> +    if (n->rx_vq->vring.avail == NULL) {
> +        n->can_receive = 0;
> +        return;
> +    }
> +
> +    if (virtqueue_pop(n->rx_vq, &elem) == 0) {
> +        /* wait until the guest adds some rx bufs */
> +        n->can_receive = 0;
> +        return;
> +    }
> +
> +    hdr = (void *)elem.in_sg[0].iov_base;
> +    hdr->flags = 0;
> +    hdr->gso_type = VIRTIO_NET_HDR_GSO_NONE;
> +
> +    /* copy in packet.  ugh */
> +    offset = 0;
> +    i = 1;
> +    while (offset < size && i < elem.in_num) {
> +        int len = MIN(elem.in_sg[i].iov_len, size - offset);
> +        memcpy(elem.in_sg[i].iov_base, buf + offset, len);

Actually according to qemu's standard, one should use
cpu_physical_memory_write/cpu_physical_memory_read functions.
This is true also for reading the ring values.

> +        offset += len;
> +        i++;
> +    }
> +
> +    /* signal other side */
> +    virtqueue_push(n->rx_vq, &elem, sizeof(*hdr) + offset);
> +    virtio_notify(&n->vdev, n->rx_vq);
> +}
> +
> +/* TX */
> +static void virtio_net_handle_tx(VirtIODevice *vdev, VirtQueue *vq)
> +{
> +    VirtIONet *n = to_virtio_net(vdev);
> +    VirtQueueElement elem;
> +
> +    while (virtqueue_pop(vq, &elem)) {
> +        int i;
> +        size_t len = 0;
> +
> +        /* ignore the header for now */
> +        for (i = 1; i < elem.out_num; i++) {
> +            qemu_send_packet(n->vc, elem.out_sg[i].iov_base,
> +                             elem.out_sg[i].iov_len);
> +            len += elem.out_sg[i].iov_len;
> +        }
> +
> +        virtqueue_push(vq, &elem, sizeof(struct virtio_net_hdr) + len);
> +        virtio_notify(&n->vdev, vq);
> +    }

The virtio_notify should be left out of the while loop to minimize irq
injection.

> +}
> +
> +void *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn)
> +{
> +    VirtIONet *n;
> +
> +    n = (VirtIONet *)virtio_init_pci(bus, "virtio-net", 6900, 0x1000,
> +                                     0, VIRTIO_ID_NET,
> +                                     0x02, 0x00, 0x00,
> +                                     6, sizeof(VirtIONet));
> +
> +    n->vdev.update_config = virtio_net_update_config;
> +    n->vdev.get_features = virtio_net_get_features;
> +    n->rx_vq = virtio_add_queue(&n->vdev, 512, virtio_net_handle_rx);
> +    n->tx_vq = virtio_add_queue(&n->vdev, 128, virtio_net_handle_tx);
> +    n->can_receive = 0;
> +    memcpy(n->mac, nd->macaddr, 6);
> +    n->vc = qemu_new_vlan_client(nd->vlan, virtio_net_receive,
> +                                 virtio_net_can_receive, n);
> +
> +    return &n->vdev;
> +}
>
> Index: qemu/hw/pc.h
> ===================================================================
> --- qemu.orig/hw/pc.h	2007-12-04 14:15:20.0 -0600
> +++ qemu/hw/pc.h	2007-12-04 14:43:57.0 -0600
> @@ -142,4 +142,9 @@
>  void isa_ne2000_init(int base, qemu_irq irq, NICInfo *nd);
>
> +/* virtio-net.c */
> +
> +void *virtio_net_init(PCIBus *bus, NICInfo *nd, int devfn);
> +
> +
>  #endif