[dpdk-dev] dpdk/vpp and cross-version migration for vhost

2016-11-17 Thread Michael S. Tsirkin
On Thu, Nov 17, 2016 at 05:49:36PM +0800, Yuanhan Liu wrote:
> On Thu, Nov 17, 2016 at 09:47:09AM +0100, Maxime Coquelin wrote:
> > 
> > 
> > On 11/17/2016 09:29 AM, Yuanhan Liu wrote:
> > >As usual, sorry for the late response :/
> > >
> > >On Thu, Oct 13, 2016 at 08:50:52PM +0300, Michael S. Tsirkin wrote:
> > >>Hi!
> > >>So it looks like we face a problem with cross-version
> > >>migration when using vhost. It's not new but became more
> > >>acute with the advent of vhost user.
> > >>
> > >>For users to be able to migrate between different versions
> > >>of the hypervisor the interface exposed to guests
> > >>by hypervisor must stay unchanged.
> > >>
> > >>The problem is that a qemu device is connected
> > >>to a backend in another process, so the interface
> > >>exposed to guests depends on the capabilities of that
> > >>process.
> > >>
> > >>Specifically, for vhost user interface based on virtio, this includes
> > >>the "host features" bitmap that defines the interface, as well as more
> > >>host values such as the max ring size.  Adding new features/changing
> > >>values to this interface is required to make progress, but on the other
> > >>hand we need ability to get the old host features to be compatible.
> > >
> > >It looks like the same issue as vhost-user reconnect to me. For example,
> > >
> > >- start dpdk 16.07 & qemu 2.5
> > >- kill dpdk
> > >- start dpdk 16.11
> > >
> > >Though DPDK 16.11 has more features compared to dpdk 16.07 (say, 
> > >indirect),
> > >above should work. Because qemu saves the negotiated features before the
> > >disconnect and stores it back after the reconnection.
> > >
> > >commit a463215b087c41d7ca94e51aa347cde523831873
> > >Author: Marc-André Lureau 
> > >Date:   Mon Jun 6 18:45:05 2016 +0200
> > >
> > >vhost-net: save & restore vhost-user acked features
> > >
> > >The initial vhost-user connection sets the features to be 
> > > negotiated
> > >with the driver. Renegotiation isn't possible without device reset.
> > >
> > >To handle reconnection of vhost-user backend, ensure the same set 
> > > of
> > >features are provided, and reuse already acked features.
> > >
> > >Signed-off-by: Marc-André Lureau 
> > >
> > >
> > >So we could do something similar for vhost-user? I mean, save the acked
> > >features before migration and restore them after it. This should be able
> > >to keep the compatibility. If the user downgrades the DPDK version, it
> > >could also be easily detected, and then exit with an error to the user:
> > >migration failed due to incompatible vhost features.
> > >
> > >Just some rough thoughts. Does that make any sense?
> > 
> > My understanding is that the management tool has to know whether
> > versions are compatible before initiating the migration:
> 
> Makes sense. How about getting and restoring the acked features through
> qemu command lines then, say, through the monitor interface?
> 
> With that, it would be something like:
> 
> - start vhost-user backend (DPDK, VPP, or whatever) & qemu in the src host
> 
> - read the acked features (through monitor interface)
> 
> - start vhost-user backend in the dst host
> 
> - start qemu in the dst host with the just queried acked features
> 
>   QEMU then is expected to use this feature set for the later vhost-user
>   feature negotiation. Exit if feature compatibility is broken.
> 
> Thoughts?
> 
>   --yliu
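A minimal sketch of the compatibility check such a flow implies (helper and
variable names are hypothetical, not part of any existing QEMU or DPDK API):
the destination refuses to start if the source's acked feature set contains
bits the local backend does not offer.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

static int
check_vhost_feature_compat(uint64_t src_acked, uint64_t dst_supported)
{
    /* bits acked on the source that the destination backend lacks */
    uint64_t missing = src_acked & ~dst_supported;

    if (missing != 0) {
        fprintf(stderr,
                "migration failed: incompatible vhost features 0x%" PRIx64 "\n",
                missing);
        return -1;
    }
    return 0;
}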


You keep assuming that you have the VM started first and
figure out things afterwards, but this does not work.

Think about a cluster of machines. You want to start a VM in
a way that will ensure compatibility with all hosts
in a cluster.

If you don't, the guest-visible interface will change
and you won't be able to migrate.

It does not make sense to discuss feature bits specifically
since that is not the only part of interface.
For example, max ring size supported might change.


Let me describe how it works in qemu/libvirt.
When you install a VM, you can specify compatibility
level (aka "machine type"), and you can query the supported compatibility
levels. Management uses that to find a supported compatibility level
and stores it in the XML that is migrated with the VM.
There's also a way to find the latest level which is the
default unl

[dpdk-dev] [PATCH 0/2] enables vhost/virtio any layout feature

2016-11-10 Thread Michael S. Tsirkin
On Mon, Sep 26, 2016 at 02:40:54PM +0800, Yuanhan Liu wrote:
> The feature is actually supported both in virtio PMD and vhost lib.
> We just haven't enabled it yet. This patchset simply enables it.

Any input on handling versioning? Do people prefer to handle it
completely at the backend, or through libvirt?

> ---
> Yuanhan Liu (2):
>   vhost: enable any layout feature
>   net/virtio: enable any layout feature
> 
>  drivers/net/virtio/virtio_ethdev.h | 1 +
>  lib/librte_vhost/vhost.c   | 1 +
>  lib/librte_vhost/vhost.h   | 3 +++
>  3 files changed, 5 insertions(+)
> 
> -- 
> 1.9.0


[dpdk-dev] dpdk/vpp and cross-version migration for vhost

2016-10-13 Thread Michael S. Tsirkin
Hi!
So it looks like we face a problem with cross-version
migration when using vhost. It's not new but became more
acute with the advent of vhost user.

For users to be able to migrate between different versions
of the hypervisor the interface exposed to guests
by hypervisor must stay unchanged.

The problem is that a qemu device is connected
to a backend in another process, so the interface
exposed to guests depends on the capabilities of that
process.

Specifically, for vhost user interface based on virtio, this includes
the "host features" bitmap that defines the interface, as well as more
host values such as the max ring size.  Adding new features/changing
values to this interface is required to make progress, but on the other
hand we need ability to get the old host features to be compatible.

To solve this problem within qemu, qemu has a versioning system based on
a machine type concept which fundamentally is a version string, by
specifying that string one can get hardware compatible with a previous
qemu version. QEMU also reports the latest version and list of versions
supported so libvirt records the version at VM creation and then is
careful to use this machine version whenever it migrates a VM.

One might wonder how is this solved with a kernel vhost backend. The
answer is that it mostly isn't - instead an assumption is made, that
qemu versions are deployed together with the kernel - this is generally
true for downstreams.  Thus whenever qemu gains a new feature, it is
already supported by the kernel as well.  However, if one attempts
migration with a new qemu from a system with a new kernel to one with an
old kernel, one would get a failure.

In the world where we have multiple userspace backends, with some of
these supplied by ISVs, this seems unrealistic.

IMO we need to support vhost backend versioning, ideally
in a way that will also work for vhost kernel backends.

So I'd like to get some input from both backend and management
developers on what a good solution would look like.

If we want to emulate the qemu solution, this involves adding the
concept of interface versions to dpdk.  For example, dpdk could supply a
file (or a utility that prints one?) with a list of versions: the latest and
the versions supported. libvirt could read that and
- store latest version at vm creation
- pass it around with the vm
- pass it to qemu

From here, qemu could pass this over the vhost-user channel,
thus making sure it's initialized with the correct
compatible interface.

As version here is an opaque string for libvirt and qemu,
anything can be used - but I suggest either a list
of values defining the interface, e.g.
any_layout=on,max_ring=256
or a version including the name and vendor of the backend,
e.g. "org.dpdk.v4.5.6".

Note that typically the list of supported versions can only be
extended, not shrunk. Also, if the host/guest interface
does not change, don't change the current version as
this just creates work for everyone.

Thoughts? Would this work well for management? dpdk? vpp?

Thanks!

-- 
MST


[dpdk-dev] [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-10 Thread Michael S. Tsirkin
On Mon, Oct 10, 2016 at 04:16:19AM +, Wang, Zhihong wrote:
> 
> 
> > -Original Message-
> > From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> > Sent: Monday, October 10, 2016 11:59 AM
> > To: Michael S. Tsirkin 
> > Cc: Maxime Coquelin ; Stephen Hemminger
> > ; dev at dpdk.org; qemu-
> > devel at nongnu.org; Wang, Zhihong 
> > Subject: Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature
> > 
> > On Mon, Oct 10, 2016 at 06:46:44AM +0300, Michael S. Tsirkin wrote:
> > > On Mon, Oct 10, 2016 at 11:37:44AM +0800, Yuanhan Liu wrote:
> > > > On Thu, Sep 29, 2016 at 11:21:48PM +0300, Michael S. Tsirkin wrote:
> > > > > On Thu, Sep 29, 2016 at 10:05:22PM +0200, Maxime Coquelin wrote:
> > > > > >
> > > > > >
> > > > > > On 09/29/2016 07:57 PM, Michael S. Tsirkin wrote:
> > > > > Yes but two points.
> > > > >
> > > > > 1. why is this memset expensive?
> > > >
> > > > I don't have the exact answer, but just some rough thoughts:
> > > >
> > > > It's an external clib function: there is a call stack and the
> > > > IP register will bounce back and forth.
> > >
> > > for memset 0?  gcc 5.3.1 on fedora happily inlines it.
> > 
> > Good to know!
> > 
> > > > overkill to use that for resetting a 14-byte structure.
> > > >
> > > > Some trick like
> > > > *(struct virtio_net_hdr *)hdr = (struct virtio_net_hdr){ 0 };
> > > >
> > > > Or even
> > > > hdr->xxx = 0;
> > > > hdr->yyy = 0;
> > > >
> > > > should behave better.
> > > >
> > > > There was an example: the vhost enqueue optimization patchset from
> > > > Zhihong [0] uses memset, and it introduces more than a 15% drop (IIRC)
> > > > on my Ivybridge server: it has no such issue on his server though.
> > > >
> > > > [0]: http://dpdk.org/ml/archives/dev/2016-August/045272.html
> > > >
> > > > --yliu
> > >
> > > I'd say that's weird. what's your config? any chance you
> > > are using an old compiler?
> > 
> > Not really, it's gcc 5.3.1. Maybe Zhihong could explain more. IIRC,
> > he said the memset is not well optimized for Ivybridge server.
> 
> The dst is remote in that case. It's fine on Haswell but has complication
> in Ivy Bridge which (wasn't supposed to but) causes a serious frontend issue.
> 
> I don't think gcc inlined it there. I'm using fc24 gcc 6.1.1.


So try something like this then:

Signed-off-by: Michael S. Tsirkin 


diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h
index dd7693f..7a3f88e 100644
--- a/drivers/net/virtio/virtio_pci.h
+++ b/drivers/net/virtio/virtio_pci.h
@@ -292,6 +292,16 @@ vtpci_with_feature(struct virtio_hw *hw, uint64_t bit)
return (hw->guest_features & (1ULL << bit)) != 0;
 }

+static inline int
+vtnet_hdr_size(struct virtio_hw *hw)
+{
+   if (vtpci_with_feature(hw, VIRTIO_NET_F_MRG_RXBUF) ||
+   vtpci_with_feature(hw, VIRTIO_F_VERSION_1))
+   return sizeof(struct virtio_net_hdr_mrg_rxbuf);
+   else
+   return sizeof(struct virtio_net_hdr);
+}
+
 /*
  * Function declaration from virtio_pci.c
  */
diff --git a/drivers/net/virtio/virtio_rxtx.c b/drivers/net/virtio/virtio_rxtx.c
index a27208e..21a45e1 100644
--- a/drivers/net/virtio/virtio_rxtx.c
+++ b/drivers/net/virtio/virtio_rxtx.c
@@ -216,7 +216,7 @@ virtqueue_enqueue_xmit(struct virtnet_tx *txvq, struct 
rte_mbuf *cookie,
struct vring_desc *start_dp;
uint16_t seg_num = cookie->nb_segs;
uint16_t head_idx, idx;
-   uint16_t head_size = vq->hw->vtnet_hdr_size;
+   uint16_t head_size = vtnet_hdr_size(vq->hw);
unsigned long offs;

head_idx = vq->vq_desc_head_idx;

Generally pointer chasing in vq->hw->vtnet_hdr_size can't be good
for performance. Move fields used on data path into vq
and use from there to avoid indirections?
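A rough sketch of that idea (field and struct names are illustrative, not
actual DPDK code): copy the header size into the per-queue structure at setup
time so the hot path avoids the vq->hw indirection.

#include <stdint.h>

/* Illustrative only: a slimmed-down tx queue with the header size
 * cached once at queue setup time. */
struct virtnet_tx_sketch {
    uint16_t vtnet_hdr_size;    /* copied from hw at setup */
    /* ... other data-path fields ... */
};

static inline uint16_t
tx_hdr_size(const struct virtnet_tx_sketch *txq)
{
    /* single load from the queue structure, no hw pointer chasing */
    return txq->vtnet_hdr_size;
}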



[dpdk-dev] [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-10 Thread Michael S. Tsirkin
On Mon, Oct 10, 2016 at 12:22:09PM +0800, Yuanhan Liu wrote:
> On Mon, Oct 10, 2016 at 07:17:06AM +0300, Michael S. Tsirkin wrote:
> > On Mon, Oct 10, 2016 at 12:05:31PM +0800, Yuanhan Liu wrote:
> > > On Fri, Sep 30, 2016 at 10:16:43PM +0300, Michael S. Tsirkin wrote:
> > > > > > And the same is done in DPDK:
> > > > > > 
> > > > > > static inline int __attribute__((always_inline))
> > > > > > copy_desc_to_mbuf(struct virtio_net *dev, struct vring_desc *descs,
> > > > > >   uint16_t max_desc, struct rte_mbuf *m, uint16_t desc_idx,
> > > > > >   struct rte_mempool *mbuf_pool)
> > > > > > {
> > > > > > ...
> > > > > > /*
> > > > > >  * A virtio driver normally uses at least 2 desc buffers
> > > > > >  * for Tx: the first for storing the header, and others
> > > > > >  * for storing the data.
> > > > > >  */
> > > > > > if (likely((desc->len == dev->vhost_hlen) &&
> > > > > >(desc->flags & VRING_DESC_F_NEXT) != 0)) {
> > > > > > desc = &descs[desc->next];
> > > > > > if (unlikely(desc->flags & VRING_DESC_F_INDIRECT))
> > > > > > return -1;
> > > > > > 
> > > > > > desc_addr = gpa_to_vva(dev, desc->addr);
> > > > > > if (unlikely(!desc_addr))
> > > > > > return -1;
> > > > > > 
> > > > > > rte_prefetch0((void *)(uintptr_t)desc_addr);
> > > > > > 
> > > > > > desc_offset = 0;
> > > > > > desc_avail  = desc->len;
> > > > > > nr_desc+= 1;
> > > > > > 
> > > > > > PRINT_PACKET(dev, (uintptr_t)desc_addr, desc->len, 0);
> > > > > > } else {
> > > > > > desc_avail  = desc->len - dev->vhost_hlen;
> > > > > > desc_offset = dev->vhost_hlen;
> > > > > > }
> > > > > 
> > > > > Actually, the header is parsed in DPDK vhost implementation.
> > > > > But as Virtio PMD provides a zero'ed header, we could just parse
> > > > > the header only if VIRTIO_NET_F_NO_TX_HEADER is not negotiated.
> > > > 
> > > > host can always skip the header parse if it wants to.
> > > > It didn't seem worth it to add branches there but
> > > > if I'm wrong, by all means code it up.
> > > 
> > > It's added by the following commit, which yields about a 10% performance
> > > boost for the PVP case (with 64B packet size).
> > > 
> > > At that time, a packet always used 2 descs. Since indirect desc is
> > > enabled (by default) now, the assumption is no longer true. What's
> > > worse, it might even slow things down a bit. That should also be
> > > part of the reason why performance is slightly worse than before.
> > > 
> > >   --yliu
> > 
> > I'm not sure I get what you are saying
> > 
> > > commit 1d41d77cf81c448c1b09e1e859bfd300e2054a98
> > > Author: Yuanhan Liu 
> > > Date:   Mon May 2 17:46:17 2016 -0700
> > > 
> > > vhost: optimize dequeue for small packets
> > > 
> > > A virtio driver normally uses at least 2 desc buffers for Tx: the
> > > first for storing the header, and the others for storing the data.
> > > 
> > > Therefore, we could fetch the first data desc buf before the main
> > > loop, and do the copy first before the check of "are we done yet?".
> > > This could save one check for small packets that just have one data
> > > desc buffer and need one mbuf to store it.
> > > 
> > > Signed-off-by: Yuanhan Liu 
> > > Acked-by: Huawei Xie 
> > > Tested-by: Rich Lane 
> > 
> > This fast-paths the 2-descriptors format but it's not active
> > for indirect descriptors. Is this what you mean?
> 
> Yes. It's also not active when ANY_LAYOUT is actually turned on.

It's not needed there though - you only use 1 desc, no point in
fetching the next one.


> > Should be a simple matter to apply this optimization for indirect.
> 
> Might be.
> 
>   --yliu


[dpdk-dev] [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-10 Thread Michael S. Tsirkin
On Mon, Oct 10, 2016 at 04:16:19AM +, Wang, Zhihong wrote:
> 
> 
> > -Original Message-
> > From: Yuanhan Liu [mailto:yuanhan.liu at linux.intel.com]
> > Sent: Monday, October 10, 2016 11:59 AM
> > To: Michael S. Tsirkin 
> > Cc: Maxime Coquelin ; Stephen Hemminger
> > ; dev at dpdk.org; qemu-
> > devel at nongnu.org; Wang, Zhihong 
> > Subject: Re: [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature
> > 
> > On Mon, Oct 10, 2016 at 06:46:44AM +0300, Michael S. Tsirkin wrote:
> > > On Mon, Oct 10, 2016 at 11:37:44AM +0800, Yuanhan Liu wrote:
> > > > On Thu, Sep 29, 2016 at 11:21:48PM +0300, Michael S. Tsirkin wrote:
> > > > > On Thu, Sep 29, 2016 at 10:05:22PM +0200, Maxime Coquelin wrote:
> > > > > >
> > > > > >
> > > > > > On 09/29/2016 07:57 PM, Michael S. Tsirkin wrote:
> > > > > Yes but two points.
> > > > >
> > > > > 1. why is this memset expensive?
> > > >
> > > > I don't have the exact answer, but just some rough thoughts:
> > > >
> > > > It's an external clib function: there is a call stack and the
> > > > IP register will bounce back and forth.
> > >
> > > for memset 0?  gcc 5.3.1 on fedora happily inlines it.
> > 
> > Good to know!
> > 
> > > > overkill to use that for resetting a 14-byte structure.
> > > >
> > > > Some trick like
> > > > *(struct virtio_net_hdr *)hdr = (struct virtio_net_hdr){ 0 };
> > > >
> > > > Or even
> > > > hdr->xxx = 0;
> > > > hdr->yyy = 0;
> > > >
> > > > should behave better.
> > > >
> > > > There was an example: the vhost enqueue optimization patchset from
> > > > Zhihong [0] uses memset, and it introduces more than a 15% drop (IIRC)
> > > > on my Ivybridge server: it has no such issue on his server though.
> > > >
> > > > [0]: http://dpdk.org/ml/archives/dev/2016-August/045272.html
> > > >
> > > > --yliu
> > >
> > > I'd say that's weird. what's your config? any chance you
> > > are using an old compiler?
> > 
> > Not really, it's gcc 5.3.1. Maybe Zhihong could explain more. IIRC,
> > he said the memset is not well optimized for Ivybridge server.
> 
> The dst is remote in that case. It's fine on Haswell but has complication
> in Ivy Bridge which (wasn't supposed to but) causes a serious frontend issue.
> 
> I don't think gcc inlined it there. I'm using fc24 gcc 6.1.1.

I just wrote some test code and compiled with gcc -O2,
it did get inlined.

It's probably this:
uint16_t head_size = vq->hw->vtnet_hdr_size;
...
memset(hdr, 0, head_size);
IOW head_size is not known to the compiler.

Try sticking a bool there instead of vtnet_hdr_size, and
 memset(hdr, 0, bigheader ? 12 : 10);
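A sketch of that suggestion: branch on a flag so each memset() sees a
compile-time-constant length and can be inlined. The sizes assume the 10-byte
virtio_net_hdr and the 12-byte mergeable-rxbuf variant.

#include <string.h>

static inline void
clear_virtio_net_hdr(void *hdr, int big_header)
{
    if (big_header)
        memset(hdr, 0, 12);    /* struct virtio_net_hdr_mrg_rxbuf */
    else
        memset(hdr, 0, 10);    /* struct virtio_net_hdr */
}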


> > 
> > --yliu


[dpdk-dev] [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-10 Thread Michael S. Tsirkin
On Mon, Oct 10, 2016 at 12:05:31PM +0800, Yuanhan Liu wrote:
> On Fri, Sep 30, 2016 at 10:16:43PM +0300, Michael S. Tsirkin wrote:
> > > > And the same is done in DPDK:
> > > > 
> > > > static inline int __attribute__((always_inline))
> > > > copy_desc_to_mbuf(struct virtio_net *dev, struct vring_desc *descs,
> > > >   uint16_t max_desc, struct rte_mbuf *m, uint16_t desc_idx,
> > > >   struct rte_mempool *mbuf_pool)
> > > > {
> > > > ...
> > > > /*
> > > >  * A virtio driver normally uses at least 2 desc buffers
> > > >  * for Tx: the first for storing the header, and others
> > > >  * for storing the data.
> > > >  */
> > > > if (likely((desc->len == dev->vhost_hlen) &&
> > > >(desc->flags & VRING_DESC_F_NEXT) != 0)) {
> > > > desc = &descs[desc->next];
> > > > if (unlikely(desc->flags & VRING_DESC_F_INDIRECT))
> > > > return -1;
> > > > 
> > > > desc_addr = gpa_to_vva(dev, desc->addr);
> > > > if (unlikely(!desc_addr))
> > > > return -1;
> > > > 
> > > > rte_prefetch0((void *)(uintptr_t)desc_addr);
> > > > 
> > > > desc_offset = 0;
> > > > desc_avail  = desc->len;
> > > > nr_desc+= 1;
> > > > 
> > > > PRINT_PACKET(dev, (uintptr_t)desc_addr, desc->len, 0);
> > > > } else {
> > > > desc_avail  = desc->len - dev->vhost_hlen;
> > > > desc_offset = dev->vhost_hlen;
> > > > }
> > > 
> > > Actually, the header is parsed in DPDK vhost implementation.
> > > But as Virtio PMD provides a zero'ed header, we could just parse
> > > the header only if VIRTIO_NET_F_NO_TX_HEADER is not negotiated.
> > 
> > host can always skip the header parse if it wants to.
> > It didn't seem worth it to add branches there but
> > if I'm wrong, by all means code it up.
> 
> It's added by the following commit, which yields about a 10% performance
> boost for the PVP case (with 64B packet size).
> 
> At that time, a packet always used 2 descs. Since indirect desc is
> enabled (by default) now, the assumption is no longer true. What's
> worse, it might even slow things down a bit. That should also be
> part of the reason why performance is slightly worse than before.
> 
>   --yliu

I'm not sure I get what you are saying

> commit 1d41d77cf81c448c1b09e1e859bfd300e2054a98
> Author: Yuanhan Liu 
> Date:   Mon May 2 17:46:17 2016 -0700
> 
> vhost: optimize dequeue for small packets
> 
> A virtio driver normally uses at least 2 desc buffers for Tx: the
> first for storing the header, and the others for storing the data.
> 
> Therefore, we could fetch the first data desc buf before the main
> loop, and do the copy first before the check of "are we done yet?".
> This could save one check for small packets that just have one data
> desc buffer and need one mbuf to store it.
> 
> Signed-off-by: Yuanhan Liu 
> Acked-by: Huawei Xie 
> Tested-by: Rich Lane 

This fast-paths the 2-descriptors format but it's not active
for indirect descriptors. Is this what you mean?
Should be a simple matter to apply this optimization for indirect.




[dpdk-dev] [Qemu-devel] [PATCH 1/2] vhost: enable any layout feature

2016-10-10 Thread Michael S. Tsirkin
On Mon, Oct 10, 2016 at 11:37:44AM +0800, Yuanhan Liu wrote:
> On Thu, Sep 29, 2016 at 11:21:48PM +0300, Michael S. Tsirkin wrote:
> > On Thu, Sep 29, 2016 at 10:05:22PM +0200, Maxime Coquelin wrote:
> > > 
> > > 
> > > On 09/29/2016 07:57 PM, Michael S. Tsirkin wrote:
> > Yes but two points.
> > 
> > 1. why is this memset expensive?
> 
> I don't have the exact answer, but just some rough thoughts:
> 
> It's an external clib function: there is a call stack and the
> IP register will bounce back and forth.

for memset 0?  gcc 5.3.1 on fedora happily inlines it.

> BTW, it's kind of
> overkill to use that for resetting a 14-byte structure.
> 
> Some trick like
> *(struct virtio_net_hdr *)hdr = (struct virtio_net_hdr){ 0 };
> 
> Or even 
> hdr->xxx = 0;
> hdr->yyy = 0;
> 
> should behave better.
> 
> There was an example: the vhost enqueue optimization patchset from
> Zhihong [0] uses memset, and it introduces more than a 15% drop (IIRC)
> on my Ivybridge server: it has no such issue on his server though.
> 
> [0]: http://dpdk.org/ml/archives/dev/2016-August/045272.html
> 
>   --yliu

I'd say that's weird. what's your config? any chance you
are using an old compiler?


> > Is the test completely skipping looking
> > at the packet otherwise?
> > 
> > 2. As long as we are doing this, see
> > Alignment vs. Networking
> > 
> > in Documentation/unaligned-memory-access.txt
> > 
> > 
> > > From the micro-benchmarks results, we can expect +10% compared to
> > > indirect descriptors, and + 5% compared to using 2 descs in the
> > > virtqueue.
> > > Also, it should have the same benefits as indirect descriptors for 0%
> > > pkt loss (as we can fill 2x more packets in the virtqueue).
> > > 
> > > What do you think?
> > > 
> > > Thanks,
> > > Maxime


[dpdk-dev] [PATCH 1/2] vhost: enable any layout feature

2016-10-10 Thread Michael S. Tsirkin
On Mon, Oct 10, 2016 at 11:03:33AM +0800, Yuanhan Liu wrote:
> On Mon, Oct 10, 2016 at 02:20:22AM +0300, Michael S. Tsirkin wrote:
> > On Wed, Sep 28, 2016 at 10:28:48AM +0800, Yuanhan Liu wrote:
> > > On Tue, Sep 27, 2016 at 10:56:40PM +0300, Michael S. Tsirkin wrote:
> > > > On Tue, Sep 27, 2016 at 11:11:58AM +0800, Yuanhan Liu wrote:
> > > > > On Mon, Sep 26, 2016 at 10:24:55PM +0300, Michael S. Tsirkin wrote:
> > > > > > On Mon, Sep 26, 2016 at 11:01:58AM -0700, Stephen Hemminger wrote:
> > > > > > > I assume that if using Version 1 that the bit will be ignored
> > > > > 
> > > > > Yes, but I will just quote what you just said: what if the guest
> > > > > virtio device is a legacy device? I also gave my reasons in another
> > > > > email why I consistently set this flag:
> > > > > 
> > > > >   - we have to return all features we support to the guest.
> > > > >   
> > > > > We don't know the guest is a modern or legacy device. That means
> > > > > we should claim we support both: VERSION_1 and ANY_LAYOUT.
> > > > >   
> > > > > Assume guest is a legacy device and we just set VERSION_1 (the 
> > > > > current
> > > > > case), ANY_LAYOUT will never be negotiated.
> > > > >   
> > > > >   - I'm following the way Linux kernel takes: it also set both 
> > > > > features.
> > > > >   
> > > > >   Maybe, we could unset ANY_LAYOUT when VERSION_1 is _negotiated_?
> > > > > 
> > > > > The unset after negotiation I proposed turned out it won't work: the
> > > > > feature is already negotiated; unsetting it only in vhost side doesn't
> > > > > change anything. Besides, it may break the migration as Michael stated
> > > > > below.
> > > > 
> > > > I think the reverse. Teach vhost user that for future machine types
> > > > only VERSION_1 implies ANY_LAYOUT.
> > 
> > So I guess at this point, we can teach vhost-user in qemu
> > that version 1 implies any_layout but only for machine types
> > qemu 2.8 and up. It sets a bad precedent but oh well.
> 
> It should work.
> 
>   --yliu

Cool. Want to post a patch?

-- 
MST


[dpdk-dev] [PATCH 1/2] vhost: enable any layout feature

2016-10-10 Thread Michael S. Tsirkin
On Wed, Sep 28, 2016 at 10:28:48AM +0800, Yuanhan Liu wrote:
> On Tue, Sep 27, 2016 at 10:56:40PM +0300, Michael S. Tsirkin wrote:
> > On Tue, Sep 27, 2016 at 11:11:58AM +0800, Yuanhan Liu wrote:
> > > On Mon, Sep 26, 2016 at 10:24:55PM +0300, Michael S. Tsirkin wrote:
> > > > On Mon, Sep 26, 2016 at 11:01:58AM -0700, Stephen Hemminger wrote:
> > > > > I assume that if using Version 1 that the bit will be ignored
> > > 
> > > Yes, but I will just quote what you just said: what if the guest
> > > virtio device is a legacy device? I also gave my reasons in another
> > > email why I consistently set this flag:
> > > 
> > >   - we have to return all features we support to the guest.
> > >   
> > > We don't know the guest is a modern or legacy device. That means
> > > we should claim we support both: VERSION_1 and ANY_LAYOUT.
> > >   
> > > Assume guest is a legacy device and we just set VERSION_1 (the current
> > > case), ANY_LAYOUT will never be negotiated.
> > >   
> > >   - I'm following the way Linux kernel takes: it also set both features.
> > >   
> > >   Maybe, we could unset ANY_LAYOUT when VERSION_1 is _negotiated_?
> > > 
> > > The unset after negotiation I proposed turned out it won't work: the
> > > feature is already negotiated; unsetting it only in vhost side doesn't
> > > change anything. Besides, it may break the migration as Michael stated
> > > below.
> > 
> > I think the reverse. Teach vhost user that for future machine types
> > only VERSION_1 implies ANY_LAYOUT.

So I guess at this point, we can teach vhost-user in qemu
that version 1 implies any_layout but only for machine types
qemu 2.8 and up. It sets a bad precedent but oh well.

-- 
MST


[dpdk-dev] [PATCH V2 2/2] virtio: support IOMMU platform

2016-10-07 Thread Michael S. Tsirkin
On Wed, Sep 28, 2016 at 04:25:12PM +0800, Jason Wang wrote:
> Negotiate VIRTIO_F_IOMMU_PLATFORM to have IOMMU support.
> 
> Signed-off-by: Jason Wang 
> ---
> Changes from v1:
> - remove unnecessary NEED_MAPPING flag

One thing we probably should do is enable this flag
with VFIO but not with UIO or VFIO-noiommu.
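A sketch of that gating (the flag is a stand-in for however the PMD learns
which kernel driver the device is bound to; this is not the actual DPDK
logic):

#include <stdint.h>

#define VIRTIO_F_IOMMU_PLATFORM 33

/* Only advertise VIRTIO_F_IOMMU_PLATFORM when the device is bound to
 * vfio-pci behind a real IOMMU; strip it for UIO and VFIO no-IOMMU. */
static uint64_t
virtio_host_features_sketch(uint64_t features, int bound_to_vfio_iommu)
{
    if (!bound_to_vfio_iommu)
        features &= ~(1ULL << VIRTIO_F_IOMMU_PLATFORM);
    return features;
}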

> ---
>  drivers/net/virtio/virtio_ethdev.h | 3 ++-
>  drivers/net/virtio/virtio_pci.h| 3 ++-
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/virtio/virtio_ethdev.h 
> b/drivers/net/virtio/virtio_ethdev.h
> index 2ecec6e..04a06e2 100644
> --- a/drivers/net/virtio/virtio_ethdev.h
> +++ b/drivers/net/virtio/virtio_ethdev.h
> @@ -63,7 +63,8 @@
>1u << VIRTIO_NET_F_CTRL_RX   | \
>1u << VIRTIO_NET_F_CTRL_VLAN | \
>1u << VIRTIO_NET_F_MRG_RXBUF | \
> -  1ULL << VIRTIO_F_VERSION_1)
> +  1ULL << VIRTIO_F_VERSION_1   | \
> +  1ULL << VIRTIO_F_IOMMU_PLATFORM )

Space before ) looks kind of ugly.

>  
>  /*
>   * CQ function prototype
> diff --git a/drivers/net/virtio/virtio_pci.h b/drivers/net/virtio/virtio_pci.h
> index 3430a39..0aa0015 100644
> --- a/drivers/net/virtio/virtio_pci.h
> +++ b/drivers/net/virtio/virtio_pci.h
> @@ -138,6 +138,7 @@ struct virtnet_ctl;
>  #define VIRTIO_RING_F_INDIRECT_DESC  28
>  
>  #define VIRTIO_F_VERSION_1   32
> +#define VIRTIO_F_IOMMU_PLATFORM  33
>  
>  /*
>   * Some VirtIO feature bits (currently bits 28 through 31) are
> @@ -145,7 +146,7 @@ struct virtnet_ctl;
>   * rest are per-device feature bits.
>   */
>  #define VIRTIO_TRANSPORT_F_START 28
> -#define VIRTIO_TRANSPORT_F_END   32
> +#define VIRTIO_TRANSPORT_F_END   34
>  

This seems unused. Drop it?

>  /* The Guest publishes the used index for which it expects an interrupt
>   * at the end of the avail ring. Host should ignore the avail->flags field. 
> */
> -- 
> 2.7.4


[dpdk-dev] [PATCH 2/2] virtio: support IOMMU platform

2016-09-02 Thread Michael S. Tsirkin
On Fri, Sep 02, 2016 at 03:04:56PM +0200, Thomas Monjalon wrote:
> 2016-09-02 14:37, Jason Wang:
> > Virtio pmd didn't support VFIO in the past since devices bypassed the IOMMU
> > completely. But recently, the work of making virtio devices work with
> > an IOMMU is nearly complete. 
> 
> Good news!
> What are the requirements for Qemu and Linux version numbers please?

I expect QEMU 2.8 and Linux 4.8 to have the support.


[dpdk-dev] [PATCH 4/6] vhost: workaround stale vring base

2016-06-14 Thread Michael S. Tsirkin
On Mon, May 09, 2016 at 11:22:04AM -0700, Yuanhan Liu wrote:
> On Mon, May 09, 2016 at 04:25:38PM +, Xie, Huawei wrote:
> > On 5/7/2016 2:36 PM, Yuanhan Liu wrote:
> > > However, Michael claims some concerns: he made a good point: a crash
> > > is happening means some memory is corrupted, and it could be the virtio
> > > memory being corrupted. In such case, nothing will work without the
> > > reset.
> > 
> > I don't get this point. What is the scenario?
> 
> It's not a specific scenario, just a hypothetic one.
> 
> > For the crash of virtio frontend driver, i remember we discussed before,
> > we have no good recipe but some workaround. The user space frontend
> > driver crashes, and its memory is reallocated to other instances, but
> > vhost is still writing to that memory. However this has nothing to do
> > with vhost reconnect.
> 
> Hmm, yes, seems like another good point to me. This patch seems like 
> a fix but not a workaround then :)
> 
>   --yliu

I think it's a good idea to make sure the ring is
already set up by this time though.
Some future QEMU version might send the base idx before
it sets up the ring; we would not want to crash.
Hopefully it will have fixed the base value sent by then.
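A sketch of that defensive check (struct layout abbreviated, names
illustrative rather than DPDK's): only resync last_used_idx from the shared
used ring once the ring is actually mapped.

#include <stdint.h>
#include <stddef.h>

struct vring_used_min { uint16_t flags; uint16_t idx; };

struct vq_min {
    struct vring_used_min *used;    /* NULL until the ring is set up */
    uint16_t last_used_idx;
};

static int
resync_last_used_idx(struct vq_min *vq)
{
    if (vq->used == NULL)    /* ring not set up yet: nothing to resync */
        return -1;
    vq->last_used_idx = vq->used->idx;
    return 0;
}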

-- 
MST


[dpdk-dev] [PATCH 1/3] vhost: pre update used ring for Tx and Rx

2016-06-01 Thread Michael S. Tsirkin
On Wed, Jun 01, 2016 at 06:40:41AM +, Xie, Huawei wrote:
> On 5/3/2016 8:42 AM, Yuanhan Liu wrote:
> > Pre update and update used ring in batch for Tx and Rx at the stage
> > while fetching all avail desc idx. This would reduce some cache misses
> > and hence, increase the performance a bit.
> >
> > Pre update would be feasible as the guest driver will not start processing
> > those entries as long as we don't update "used->idx". (I'm not 100%
> > certain I don't miss anything, though).
> >
> > Cc: Michael S. Tsirkin 
> > Signed-off-by: Yuanhan Liu 
> > ---
> >  lib/librte_vhost/vhost_rxtx.c | 58 
> > +--
> >  1 file changed, 28 insertions(+), 30 deletions(-)
> >
> > diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
> > index c9cd1c5..2c3b810 100644
> > --- a/lib/librte_vhost/vhost_rxtx.c
> > +++ b/lib/librte_vhost/vhost_rxtx.c
> > @@ -137,7 +137,7 @@ copy_virtio_net_hdr(struct virtio_net *dev, uint64_t 
> > desc_addr,
> >  
> >  static inline int __attribute__((always_inline))
> >  copy_mbuf_to_desc(struct virtio_net *dev, struct vhost_virtqueue *vq,
> > - struct rte_mbuf *m, uint16_t desc_idx, uint32_t *copied)
> > + struct rte_mbuf *m, uint16_t desc_idx)
> >  {
> > uint32_t desc_avail, desc_offset;
> > uint32_t mbuf_avail, mbuf_offset;
> > @@ -161,7 +161,6 @@ copy_mbuf_to_desc(struct virtio_net *dev, struct 
> > vhost_virtqueue *vq,
> > desc_offset = dev->vhost_hlen;
> > desc_avail  = desc->len - dev->vhost_hlen;
> >  
> > -   *copied = rte_pktmbuf_pkt_len(m);
> > mbuf_avail  = rte_pktmbuf_data_len(m);
> > mbuf_offset = 0;
> > while (mbuf_avail != 0 || m->next != NULL) {
> > @@ -262,6 +261,7 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
> > struct vhost_virtqueue *vq;
> > uint16_t res_start_idx, res_end_idx;
> > uint16_t desc_indexes[MAX_PKT_BURST];
> > +   uint16_t used_idx;
> > uint32_t i;
> >  
> > LOG_DEBUG(VHOST_DATA, "(%d) %s\n", dev->vid, __func__);
> > @@ -285,27 +285,29 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t 
> > queue_id,
> > /* Retrieve all of the desc indexes first to avoid caching issues. */
> > rte_prefetch0(&vq->avail->ring[res_start_idx & (vq->size - 1)]);
> > for (i = 0; i < count; i++) {
> > -   desc_indexes[i] = vq->avail->ring[(res_start_idx + i) &
> > - (vq->size - 1)];
> > +   used_idx = (res_start_idx + i) & (vq->size - 1);
> > +   desc_indexes[i] = vq->avail->ring[used_idx];
> > +   vq->used->ring[used_idx].id = desc_indexes[i];
> > +   vq->used->ring[used_idx].len = pkts[i]->pkt_len +
> > +  dev->vhost_hlen;
> > +   vhost_log_used_vring(dev, vq,
> > +   offsetof(struct vring_used, ring[used_idx]),
> > +   sizeof(vq->used->ring[used_idx]));
> > }
> >  
> > rte_prefetch0(&vq->desc[desc_indexes[0]]);
> > for (i = 0; i < count; i++) {
> > uint16_t desc_idx = desc_indexes[i];
> > -   uint16_t used_idx = (res_start_idx + i) & (vq->size - 1);
> > -   uint32_t copied;
> > int err;
> >  
> > -   err = copy_mbuf_to_desc(dev, vq, pkts[i], desc_idx, &copied);
> > -
> > -   vq->used->ring[used_idx].id = desc_idx;
> > -   if (unlikely(err))
> > +   err = copy_mbuf_to_desc(dev, vq, pkts[i], desc_idx);
> > +   if (unlikely(err)) {
> > +   used_idx = (res_start_idx + i) & (vq->size - 1);
> > vq->used->ring[used_idx].len = dev->vhost_hlen;
> > -   else
> > -   vq->used->ring[used_idx].len = copied + dev->vhost_hlen;
> > -   vhost_log_used_vring(dev, vq,
> > -   offsetof(struct vring_used, ring[used_idx]),
> > -   sizeof(vq->used->ring[used_idx]));
> > +   vhost_log_used_vring(dev, vq,
> > +   offsetof(struct vring_used, ring[used_idx]),
> > +   sizeof(vq->used->ring[used_idx]));
> > +   }
> >  
> > if (i + 1 < count)
> > rte_prefetch0(&vq->desc[desc_indexes[i+1]]);

[dpdk-dev] [PATCH] virtio: use volatile to get used->idx in the loop

2016-05-25 Thread Michael S. Tsirkin
On Wed, May 25, 2016 at 10:47:30AM +0100, Bruce Richardson wrote:
> On Wed, May 25, 2016 at 11:34:24AM +0300, Michael S. Tsirkin wrote:
> > On Wed, May 25, 2016 at 08:25:20AM +, Xie, Huawei wrote:
> > > On 5/25/2016 4:12 PM, Xie, Huawei wrote:
> > > > There is no external function call or any barrier in the loop,
> > > > the used->idx would only be retrieved once.
> > > >
> > > > Signed-off-by: Huawei Xie 
> > > > ---
> > > >  drivers/net/virtio/virtio_ethdev.c | 3 ++-
> > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/net/virtio/virtio_ethdev.c 
> > > > b/drivers/net/virtio/virtio_ethdev.c
> > > > index c3fb628..f6d6305 100644
> > > > --- a/drivers/net/virtio/virtio_ethdev.c
> > > > +++ b/drivers/net/virtio/virtio_ethdev.c
> > > > @@ -204,7 +204,8 @@ virtio_send_command(struct virtqueue *vq, struct 
> > > > virtio_pmd_ctrl *ctrl,
> > > > usleep(100);
> > > > }
> > > >  
> > > > -   while (vq->vq_used_cons_idx != vq->vq_ring.used->idx) {
> > > > +   while (vq->vq_used_cons_idx !=
> > > > +  *((volatile uint16_t *)(&vq->vq_ring.used->idx))) {
> > > > uint32_t idx, desc_idx, used_idx;
> > > > struct vring_used_elem *uep;
> > > >  
> > > 
> > > Found this issue when doing the code rework of the RX/TX queue.
> > > As in other places, we also have loops retrieving the value of avail->idx
> > > or used->idx; I prefer to declare the index in the vq structure as volatile
> > > to avoid potential issues.
> 
> Is there a reason why the value is not always volatile? I would have thought
> it would be generally safer to mark the actual value as volatile inside the
> structure definition itself? In any cases where we do want to store the value
> locally and not re-access the structure, a local variable can be used.
> 
> Regards,
> /Bruce

Linux generally discourages volatile as a general style guidance:
https://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt
It doesn't have to apply to dpdk, which has a different coding style,
but IIUC this structure is inherited from linux; deviating
will make keeping things up to date harder.

> > 
> > It might be a good idea to wrap this in a macro
> > similar to ACCESS_ONCE in Linux.
> > 
> > > 
> > > Stephen:
> > > Another question is why we need a loop here?
> > > 
> > > /huawei
> > 
> > -- 
> > MST


[dpdk-dev] [PATCH] virtio: use volatile to get used->idx in the loop

2016-05-25 Thread Michael S. Tsirkin
On Wed, May 25, 2016 at 08:25:20AM +, Xie, Huawei wrote:
> On 5/25/2016 4:12 PM, Xie, Huawei wrote:
> > There is no external function call or any barrier in the loop,
> > the used->idx would only be retrieved once.
> >
> > Signed-off-by: Huawei Xie 
> > ---
> >  drivers/net/virtio/virtio_ethdev.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/virtio/virtio_ethdev.c 
> > b/drivers/net/virtio/virtio_ethdev.c
> > index c3fb628..f6d6305 100644
> > --- a/drivers/net/virtio/virtio_ethdev.c
> > +++ b/drivers/net/virtio/virtio_ethdev.c
> > @@ -204,7 +204,8 @@ virtio_send_command(struct virtqueue *vq, struct 
> > virtio_pmd_ctrl *ctrl,
> > usleep(100);
> > }
> >  
> > -   while (vq->vq_used_cons_idx != vq->vq_ring.used->idx) {
> > +   while (vq->vq_used_cons_idx !=
> > +  *((volatile uint16_t *)(&vq->vq_ring.used->idx))) {
> > uint32_t idx, desc_idx, used_idx;
> > struct vring_used_elem *uep;
> >  
> 
> Found this issue when doing the code rework of the RX/TX queue.
> As in other places, we also have loops retrieving the value of avail->idx
> or used->idx; I prefer to declare the index in the vq structure as volatile
> to avoid potential issues.

It might be a good idea to wrap this in a macro
similar to ACCESS_ONCE in Linux.
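A minimal sketch of such a wrapper, modeled on the Linux macro and relying on
GCC's typeof extension (this is not an existing DPDK definition):

#include <stdint.h>

/* Force a fresh load through a volatile-qualified access, without
 * marking the struct field itself volatile. */
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

struct vring_used_min { uint16_t flags; uint16_t idx; };

static inline uint16_t
used_idx_load(struct vring_used_min *used)
{
    return ACCESS_ONCE(used->idx);    /* re-read from memory every call */
}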

> 
> Stephen:
> Another question is why we need a loop here?
> 
> /huawei

-- 
MST


[dpdk-dev] [PATCH v4 6/8] virtio-user: add new virtual pci driver for virtio

2016-05-12 Thread Michael S. Tsirkin
On Thu, May 12, 2016 at 03:08:05PM +0800, Tan, Jianfeng wrote:
> (2) It's more aligned with the previous logic of hiding the details that
> differentiate modern/legacy devices.

Why is there a need to support legacy interfaces at all?  It's a container
so if it's in use one can be reasonably sure you have a new kernel.

-- 
MST


[dpdk-dev] [PATCH 3/6] vhost: add reconnect ability

2016-05-10 Thread Michael S. Tsirkin
On Tue, May 10, 2016 at 09:00:45AM +, Xie, Huawei wrote:
> On 5/10/2016 4:42 PM, Michael S. Tsirkin wrote:
> > On Tue, May 10, 2016 at 08:07:00AM +, Xie, Huawei wrote:
> >> On 5/10/2016 3:56 PM, Michael S. Tsirkin wrote:
> >>> On Tue, May 10, 2016 at 07:24:10AM +, Xie, Huawei wrote:
> >>>> On 5/10/2016 2:08 AM, Yuanhan Liu wrote:
> >>>>> On Mon, May 09, 2016 at 04:47:02PM +, Xie, Huawei wrote:
> >>>>>> On 5/7/2016 2:36 PM, Yuanhan Liu wrote:
> >>>>>>> +static void *
> >>>>>>> +vhost_user_client_reconnect(void *arg)
> >>>>>>> +{
> >>>>>>> + struct reconnect_info *reconn = arg;
> >>>>>>> + int ret;
> >>>>>>> +
> >>>>>>> + RTE_LOG(ERR, VHOST_CONFIG, "reconnecting...\n");
> >>>>>>> + while (1) {
> >>>>>>> + ret = connect(reconn->fd, (struct sockaddr 
> >>>>>>> *)&reconn->un,
> >>>>>>> + sizeof(reconn->un));
> >>>>>>> + if (ret == 0)
> >>>>>>> + break;
> >>>>>>> + sleep(1);
> >>>>>>> + }
> >>>>>>> +
> >>>>>>> + vhost_user_add_connection(reconn->fd, reconn->vsocket);
> >>>>>>> + free(reconn);
> >>>>>>> +
> >>>>>>> + return NULL;
> >>>>>>> +}
> >>>>>>> +
> >>>>>> We could create hundreds of vhost-user ports in OVS. Without connections
> >>>>>> with QEMU established, those ports are just inactive. This works fine 
> >>>>>> in
> >>>>>> server mode.
> >>>>>> With client modes, do we need to create hundreds of vhost threads? This
> >>>>>> would be too interruptible.
> >>>>>> How about we create only one thread and do the reconnections for all 
> >>>>>> the
> >>>>>> unconnected socket?
> >>>>> Yes, good point and good suggestion. Will do it in v2.
> >>>> Hi Michael:
> >>>> This reminds me another irrelevant issue.
> >>>> In OVS, currently for each vhost port, we create a unix domain socket,
> >>>> and QEMU vhost proxy connects to this socket, and we use this to
> >>>> identify the connection. This works fine but is our workaround,
> >>>> otherwise we have no way to identify the connection.
> >>>> Do you think if this is an issue?
> >> Let us say vhost creates one unix domain socket, with path as "sockpath",
> >> and two virtio ports in two VMS both connect to the same socket with the
> >> following command line
> >> -chardev socket,id=char0,path=sockpath
> >> How could vhost identify the connection?
> > getpeername(2)?
> 
> getpeername returns host/port? Then it isn't useful.

Maybe but I'm still in the dark. Useful for what?

> The typical scenario in my mind is:
> We create a OVS port with the name "port1", and when we receive an
> virtio connection with ID "port1", we attach this virtio interface to
> the OVS port "port1".

If you are going to listen on a socket, you can create ports
and attach socket fds to them dynamically as long as you get connections.
What is wrong with that?
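A rough sketch of that single-listener model (attach_fd_to_free_port() is a
hypothetical helper, not an existing OVS or DPDK function): one AF_UNIX
socket, each accepted connection attached to whichever port is free.

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Hypothetical: bind the accepted connection to a currently free port. */
void attach_fd_to_free_port(int conn_fd);

static int
serve_vhost_socket(const char *path)
{
    struct sockaddr_un un;
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&un, 0, sizeof(un));
    un.sun_family = AF_UNIX;
    strncpy(un.sun_path, path, sizeof(un.sun_path) - 1);
    unlink(path);
    if (bind(fd, (struct sockaddr *)&un, sizeof(un)) < 0 ||
        listen(fd, 128) < 0) {
        close(fd);
        return -1;
    }
    for (;;) {
        int conn = accept(fd, NULL, NULL);

        if (conn >= 0)
            attach_fd_to_free_port(conn);    /* dynamic attachment */
    }
    return 0;    /* not reached */
}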


> 
> >
> >
> >> Workarounds:
> >> vhost creates two unix domain sockets, with path as "sockpath1" and
> >> "sockpath2" respectively,
> >> and the virtio ports in two VMS respectively connect to "sockpath1" and
> >> "sockpath2".
> >>
> >> If we have some name string from QEMU over vhost, as you mentioned, we
> >> could create only one socket with path "sockpath".
> >> User ensure that the names are unique, just as how they do with multiple
> >> sockets.
> >>
> > Seems rather fragile.
> 
From the scenario above, it is enough. That is also how it works today
> in DPDK OVS implementation with multiple sockets.
> Any other idea?
> 
> >
> >>> I'm sorry, I have trouble understanding what you wrote above.
> >>> What is the issue you are trying to work around?
> >>>
> >>>> Do we have plan to support identification in VHOST_USER_MESSAGE? With
> >>>> the identification, if vhost as server, we only need to create one
> >>>> socket which receives multiple connections, and use the ID in the
> >>>> message to identify the connection.
> >>>>
> >>>> /huawei
> >>> Sending e.g. -name string from qemu over vhost might be useful
> >>> for debugging, but I'm not sure it's a good idea to
> >>> rely on it being unique.
> >>>
> >>>>> Thanks.
> >>>>>
> >>>>> --yliu
> >>>>>
> 


[dpdk-dev] [PATCH 3/6] vhost: add reconnect ability

2016-05-10 Thread Michael S. Tsirkin
On Tue, May 10, 2016 at 08:07:00AM +, Xie, Huawei wrote:
> On 5/10/2016 3:56 PM, Michael S. Tsirkin wrote:
> > On Tue, May 10, 2016 at 07:24:10AM +, Xie, Huawei wrote:
> >> On 5/10/2016 2:08 AM, Yuanhan Liu wrote:
> >>> On Mon, May 09, 2016 at 04:47:02PM +, Xie, Huawei wrote:
> >>>> On 5/7/2016 2:36 PM, Yuanhan Liu wrote:
> >>>>> +static void *
> >>>>> +vhost_user_client_reconnect(void *arg)
> >>>>> +{
> >>>>> +   struct reconnect_info *reconn = arg;
> >>>>> +   int ret;
> >>>>> +
> >>>>> +   RTE_LOG(ERR, VHOST_CONFIG, "reconnecting...\n");
> >>>>> +   while (1) {
> >>>>> +   ret = connect(reconn->fd, (struct sockaddr 
> >>>>> *)&reconn->un,
> >>>>> +   sizeof(reconn->un));
> >>>>> +   if (ret == 0)
> >>>>> +   break;
> >>>>> +   sleep(1);
> >>>>> +   }
> >>>>> +
> >>>>> +   vhost_user_add_connection(reconn->fd, reconn->vsocket);
> >>>>> +   free(reconn);
> >>>>> +
> >>>>> +   return NULL;
> >>>>> +}
> >>>>> +
> >>>> We could create hundreds of vhost-user ports in OVS. Without connections
> >>>> with QEMU established, those ports are just inactive. This works fine in
> >>>> server mode.
> >>>> With client modes, do we need to create hundreds of vhost threads? This
> >>>> would be too interruptible.
> >>>> How about we create only one thread and do the reconnections for all the
> >>>> unconnected socket?
> >>> Yes, good point and good suggestion. Will do it in v2.
> >> Hi Michael:
> >> This reminds me another irrelevant issue.
> >> In OVS, currently for each vhost port, we create a unix domain socket,
> >> and QEMU vhost proxy connects to this socket, and we use this to
> >> identify the connection. This works fine but is our workaround,
> >> otherwise we have no way to identify the connection.
> >> Do you think if this is an issue?
> 
> Let us say vhost creates one unix domain socket, with path as "sockpath",
> and two virtio ports in two VMS both connect to the same socket with the
> following command line
> -chardev socket,id=char0,path=sockpath
> How could vhost identify the connection?

getpeername(2)?
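For reference, a sketch of what getpeername(2) reports for an accepted
AF_UNIX connection; note that a client which never bind()s its socket shows
up with an empty sun_path, which limits how useful this is for identifying
the peer.

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

static void
print_vhost_peer(int conn_fd)
{
    struct sockaddr_un peer;
    socklen_t len = sizeof(peer);

    memset(&peer, 0, sizeof(peer));
    if (getpeername(conn_fd, (struct sockaddr *)&peer, &len) == 0)
        printf("peer path: '%.*s'\n",    /* empty for unbound clients */
               (int)sizeof(peer.sun_path), peer.sun_path);
}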


> 
> Workarounds:
> vhost creates two unix domain sockets, with path as "sockpath1" and
> "sockpath2" respectively,
> and the virtio ports in two VMS respectively connect to "sockpath1" and
> "sockpath2".
> 
> If we have some name string from QEMU over vhost, as you mentioned, we
> could create only one socket with path "sockpath".
> User ensure that the names are unique, just as how they do with multiple
> sockets.
> 

Seems rather fragile.

> > I'm sorry, I have trouble understanding what you wrote above.
> > What is the issue you are trying to work around?
> >
> >> Do we have plan to support identification in VHOST_USER_MESSAGE? With
> >> the identification, if vhost as server, we only need to create one
> >> socket which receives multiple connections, and use the ID in the
> >> message to identify the connection.
> >>
> >> /huawei
> > Sending e.g. -name string from qemu over vhost might be useful
> > for debugging, but I'm not sure it's a good idea to
> > rely on it being unique.
> >
> >>> Thanks.
> >>>
> >>>   --yliu
> >>>
> 


[dpdk-dev] [PATCH 3/6] vhost: add reconnect ability

2016-05-10 Thread Michael S. Tsirkin
On Tue, May 10, 2016 at 07:24:10AM +, Xie, Huawei wrote:
> On 5/10/2016 2:08 AM, Yuanhan Liu wrote:
> > On Mon, May 09, 2016 at 04:47:02PM +, Xie, Huawei wrote:
> >> On 5/7/2016 2:36 PM, Yuanhan Liu wrote:
> >>> +static void *
> >>> +vhost_user_client_reconnect(void *arg)
> >>> +{
> >>> + struct reconnect_info *reconn = arg;
> >>> + int ret;
> >>> +
> >>> + RTE_LOG(ERR, VHOST_CONFIG, "reconnecting...\n");
> >>> + while (1) {
> >>> + ret = connect(reconn->fd, (struct sockaddr *)&reconn->un,
> >>> + sizeof(reconn->un));
> >>> + if (ret == 0)
> >>> + break;
> >>> + sleep(1);
> >>> + }
> >>> +
> >>> + vhost_user_add_connection(reconn->fd, reconn->vsocket);
> >>> + free(reconn);
> >>> +
> >>> + return NULL;
> >>> +}
> >>> +
> >> We could create hundreds of vhost-user ports in OVS. Without connections
> >> with QEMU established, those ports are just inactive. This works fine in
> >> server mode.
> >> With client modes, do we need to create hundreds of vhost threads? This
> >> would be too interruptible.
> >> How about we create only one thread and do the reconnections for all the
> >> unconnected socket?
> > Yes, good point and good suggestion. Will do it in v2.
> 
> Hi Michael:
> This reminds me another irrelevant issue.
> In OVS, currently for each vhost port, we create a unix domain socket,
> and QEMU vhost proxy connects to this socket, and we use this to
> identify the connection. This works fine but is our workaround,
> otherwise we have no way to identify the connection.
> Do you think if this is an issue?

I'm sorry, I have trouble understanding what you wrote above.
What is the issue you are trying to work around?

> Do we have plan to support identification in VHOST_USER_MESSAGE? With
> the identification, if vhost as server, we only need to create one
> socket which receives multiple connections, and use the ID in the
> message to identify the connection.
> 
> /huawei

Sending e.g. -name string from qemu over vhost might be useful
for debugging, but I'm not sure it's a good idea to
rely on it being unique.

> 
> >
> > Thanks.
> >
> > --yliu
> >
> 


[dpdk-dev] [PATCH 2/6] vhost: add vhost-user client mode

2016-05-10 Thread Michael S. Tsirkin
On Mon, May 09, 2016 at 01:33:08PM -0700, Yuanhan Liu wrote:
> On Mon, May 09, 2016 at 06:33:45AM -0400, Victor Kaplansky wrote:
> > Adding a flag for a future extension seems fine to me.
> > Could we manage without adding a flag?
> > For example, could we always try a client mode for a while at first,
> > and if unsuccessful, then come up with server mode?
> 
> It's hard to define "how long is for a while". And I don't think
> there is a way to switch it back to client mode from server mode
> if you do so.
> 
> So, assume you create a vhost-user port (as client), and you
> start QEMU (as server) later, after the gap you mentioned of
> auto-turn into server, it simply doesn't work then.
> 
>   --yliu

I think I agree - I don't know of other software doing
such an automatic switch. Correctly setting server/client mode
seems easy enough.


> > - Original Message -
> > > From: "Yuanhan Liu" 
> > > To: dev at dpdk.org
> > > Cc: "huawei xie" , "Yuanhan Liu"  > > linux.intel.com>
> > > Sent: Saturday, May 7, 2016 9:40:20 AM
> > > Subject: [dpdk-dev] [PATCH 2/6] vhost: add vhost-user client mode
> > > 
> > > Add a new parameter (flags) to rte_vhost_driver_register(). DPDK
> > > vhost-user acts as client mode when RTE_VHOST_USER_CLIENT flag
> > > is set.
> > > 
> > > The flags would also allow future extensions without breaking the
> > > API (again).
> > > 
> > > The rest is straightforward then: allocate a unix socket, and
> > > bind/listen for server, connect for client.
> > > 
> > > Signed-off-by: Yuanhan Liu 
> > > ---
> > >  drivers/net/vhost/rte_eth_vhost.c|   2 +-
> > >  examples/vhost/main.c|   2 +-
> > >  lib/librte_vhost/rte_virtio_net.h|  11 +-
> > >  lib/librte_vhost/vhost_user/vhost-net-user.c | 215
> > >  ---
> > >  4 files changed, 142 insertions(+), 88 deletions(-)
> > > 
> > > diff --git a/drivers/net/vhost/rte_eth_vhost.c
> > > b/drivers/net/vhost/rte_eth_vhost.c
> > > index a9dada5..36697cf 100644
> > > --- a/drivers/net/vhost/rte_eth_vhost.c
> > > +++ b/drivers/net/vhost/rte_eth_vhost.c
> > > @@ -456,7 +456,7 @@ eth_dev_start(struct rte_eth_dev *dev)
> > >   int ret = 0;
> > >  
> > >   if (rte_atomic16_cmpset(&internal->once, 0, 1)) {
> > > - ret = rte_vhost_driver_register(internal->iface_name);
> > > + ret = rte_vhost_driver_register(internal->iface_name, 0);
> > >   if (ret)
> > >   return ret;
> > >   }
> > > diff --git a/examples/vhost/main.c b/examples/vhost/main.c
> > > index bbf0d28..6899189 100644
> > > --- a/examples/vhost/main.c
> > > +++ b/examples/vhost/main.c
> > > @@ -1499,7 +1499,7 @@ main(int argc, char *argv[])
> > >   rte_vhost_feature_disable(1ULL << VIRTIO_NET_F_MRG_RXBUF);
> > >  
> > >   /* Register vhost(cuse or user) driver to handle vhost messages. */
> > > - ret = rte_vhost_driver_register((char *)&dev_basename);
> > > + ret = rte_vhost_driver_register(dev_basename, 0);
> > >   if (ret != 0)
> > >   rte_exit(EXIT_FAILURE, "vhost driver register failure.\n");
> > >  
> > > diff --git a/lib/librte_vhost/rte_virtio_net.h
> > > b/lib/librte_vhost/rte_virtio_net.h
> > > index 4e50425..c84e7ab 100644
> > > --- a/lib/librte_vhost/rte_virtio_net.h
> > > +++ b/lib/librte_vhost/rte_virtio_net.h
> > > @@ -51,6 +51,8 @@
> > >  #include 
> > >  #include 
> > >  
> > > +#define RTE_VHOST_USER_CLIENT (1ULL << 0)
> > > +
> > >  struct rte_mbuf;
> > >  
> > >  #define VHOST_MEMORY_MAX_NREGIONS 8
> > > @@ -96,11 +98,14 @@ uint64_t rte_vhost_feature_get(void);
> > >  
> > >  int rte_vhost_enable_guest_notification(int vid, uint16_t queue_id, int
> > >  enable);
> > >  
> > > -/* Register vhost driver. dev_name could be different for multiple 
> > > instance
> > > support. */
> > > -int rte_vhost_driver_register(const char *dev_name);
> > > +/**
> > > + * Register vhost driver. path could be different for multiple
> > > + * instance support.
> > > + */
> > > +int rte_vhost_driver_register(const char *path, uint64_t flags);
> > >  
> > >  /* Unregister vhost driver. This is only meaningful to vhost user. */
> > > -int rte_vhost_driver_unregister(const char *dev_name);
> > > +int rte_vhost_driver_unregister(const char *path);
> > >  
> > >  /* Register callbacks. */
> > >  int rte_vhost_driver_callback_register(struct virtio_net_device_ops 
> > > const *
> > >  const);
> > > diff --git a/lib/librte_vhost/vhost_user/vhost-net-user.c
> > > b/lib/librte_vhost/vhost_user/vhost-net-user.c
> > > index f485a3b..aa98717 100644
> > > --- a/lib/librte_vhost/vhost_user/vhost-net-user.c
> > > +++ b/lib/librte_vhost/vhost_user/vhost-net-user.c
> > > @@ -58,6 +58,7 @@
> > >  struct vhost_user_socket {
> > >   char *path;
> > >   int listenfd;
> > > + int is_server;
> > >  };
> > >  
> > >  struct vhost_user_connection {
> > > @@ -75,7 +76,7 @@ struct vhost_user {
> > >  
> > >  #define MAX_VIRTIO_BACKLOG 128
> > >  
> > > -static void vhost_user_new_connection(int fd, 

[dpdk-dev] [PATCH 4/6] vhost: workaround stale vring base

2016-05-09 Thread Michael S. Tsirkin
On Fri, May 06, 2016 at 11:40:22PM -0700, Yuanhan Liu wrote:
> When DPDK app crashes (or quits, or gets killed), and when QEMU supports
> reconnecting (patches have been sent, not merged yet), a restart of DPDK
> app would get stale vring base from QEMU. That would break the kernel
> virtio net completely, making it not work any more, unless a driver reset
> is done.
> 
> So, instead of getting the stale vring base from QEMU, Huawei suggested
> we could get a proper one from used->idx.
> 
> Cc: "Michael S. Tsirkin" 
> Suggested-by: "Xie, Huawei" 
> Signed-off-by: Yuanhan Liu 
> ---
> 
> Note that the right way to handle reconnect from an abnormal quit would be
> to let the guest driver initiate a reset when a reconnect is detected
> from QEMU. A reset from the virtio net driver would reset all virtio
> related metadata and trigger the vhost-user re-initiation again,
> therefore it would make the reconnect work as expected.
> 
> What's unfortunate is that driver reset on reconnect, as the term "driver"
> implies, needs driver hacking, which could be the linux kernel virtio net
> driver, the DPDK pmd driver, or even the windows driver. Apparently, it will
> not work so far, or even for a long while, until we are adapted to the
> new kernel.
> 
> The fix (or more precisely, the workaround) from this patch would make
> reconnect work in most case, including the following two hypothesis that
> might corrupt virtio memory:
> 
> - vring_used_elem might be in stale state when we are in the middle of
>   updating used vring. Say, we might have updated the "id" field, but
>   leaving "len" untouched.
> 
> - vring desc buff might also be in stale state, when we are copying
>   data into it.
> 
> The reason it still works is that we haven't updated used->idx yet,
> which means guest driver will not start processing those buggy used
> entries. Therefore, no issues.
> 
> However, Michael claims some concerns: he made a good point: a crash
> is happening means some memory is corrupted, and it could be the virtio
> memory being corrupted. In such case, nothing will work without the
> reset.
> 
> But I would say that an app in production use would rarely crash, and even
> if it crashes, the chance of the virtio memory being corrupted would be
> relatively small then. Besides that, it would work, say, when the app
> is killed by ctrl-c or the kill command. So, it works in most cases.
> But still, I would say it's just a workaround so far.
> 
> On the other hand, without this patch, it would be 100% not working
> from abnormal quit. But with this patch, it works in most cases, just
> doesn't work in the rare case when the virtio memory is corrupted. Therefore,
> I'd say, it is still good to have, until we have a perfect solution.
> ---
>  lib/librte_vhost/virtio-net.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
> index c88aaa3..df103aa 100644
> --- a/lib/librte_vhost/virtio-net.c
> +++ b/lib/librte_vhost/virtio-net.c
> @@ -560,6 +560,14 @@ vhost_set_vring_addr(int vid, struct vhost_vring_addr 
> *addr)
>   return -1;
>   }
>  
> + if (vq->last_used_idx != vq->used->idx) {
> + RTE_LOG(WARNING, VHOST_CONFIG,
> + "last_used_idx (%u) and vq->used->idx (%u) mismatch\n",
> +     vq->last_used_idx, vq->used->idx);
> + vq->last_used_idx = vq->used->idx;
> + vq->last_used_idx_res = vq->used->idx;
> + }
> +
>   vq->log_guest_addr = addr->log_guest_addr;
>  
>   LOG_DEBUG(VHOST_CONFIG, "(%d) mapped address desc: %p\n",

Acked-by: Michael S. Tsirkin 

And I would go further and just ignore the index coming from QEMU.
DPDK processes packets in order, so it does not need it.
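
A minimal sketch of that direction (illustrative only, not the merged code;
the helper name is made up, the fields are the ones used in the patch above):

/* Once the rings are mapped, resync the backend's view of the used
 * index from the ring itself instead of trusting the base value QEMU
 * sends.  This is safe because used->idx is only advanced past fully
 * written used entries. */
static void
resync_last_used_idx(struct vhost_virtqueue *vq)
{
	vq->last_used_idx     = vq->used->idx;
	vq->last_used_idx_res = vq->used->idx;
}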

> -- 
> 1.9.0


[dpdk-dev] [RFC PATCH] avail idx update optimizations

2016-04-24 Thread Michael S. Tsirkin
On Sun, Apr 24, 2016 at 02:45:22AM +, Xie, Huawei wrote:
> Forget to cc the mailing list.
> 
> On 4/22/2016 9:53 PM, Xie, Huawei wrote:
> > Hi:
> >
> > This is a series of virtio/vhost idx/ring update optimizations for cache
> > to cache transfer. Actually I don't expect many of them, as virtio/vhost
> > already does this quite well.

Hmm - is it a series or a single patch?

> > For this patch, in a VM2VM test, I observed a ~6% performance increase.

Interesting. In that case, it seems likely that new ring layout
would give you an even bigger performance gain.
Could you take a look at tools/virtio/ringtest/ring.c
in the latest Linux and tell me what you think?
In particular, I know you looked at using vectored instructions
to do ring updates - would the layout in tools/virtio/ringtest/ring.c
interfere with that?

> > In VM1, run testpmd with txonly mode
> > In VM2, run testpmd with rxonly mode
> > In host, run testpmd(with two vhostpmds) with io forward
> >
> > Michael:
> > We have talked about this method when I tried the fixed ring.
> >
> >
> > On 4/22/2016 5:12 PM, Xie, Huawei wrote:
> >> eliminate unnecessary cache to cache transfer between virtio and vhost
> >> core

Yes, I remember proposing this, but you probably should include the
explanation of why this works in the commit log:

- pre-format avail ring with expected descriptor index values
- as long as entries are consumed in-order, there's no
  need to modify the avail ring
- as long as avail ring is not modified, it can be
  valid in caches of both consumer and producer
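
To make that concrete, a minimal sketch of the pre-formatting step
(illustrative only, assuming entries are consumed strictly in order;
the helper name is made up):

/* Fill the avail ring once at setup so that entry i always refers to
 * descriptor i.  With in-order consumption the ring contents never
 * change afterwards, so the corresponding cache lines can stay valid
 * in the caches of both the virtio and vhost cores. */
static void
prefill_avail_ring(struct virtqueue *vq)
{
	uint16_t i;

	for (i = 0; i < vq->vq_nentries; i++)
		vq->vq_ring.avail->ring[i] = i;
}

The unlikely() check in the diff below then only writes the ring in the
rare case where the pre-formatted value no longer matches.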


> >>
> >> ---
> >>  drivers/net/virtio/virtqueue.h | 3 ++-
> >>  1 file changed, 2 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/net/virtio/virtqueue.h 
> >> b/drivers/net/virtio/virtqueue.h
> >> index 4e9239e..8c46a83 100644
> >> --- a/drivers/net/virtio/virtqueue.h
> >> +++ b/drivers/net/virtio/virtqueue.h
> >> @@ -302,7 +302,8 @@ vq_update_avail_ring(struct virtqueue *vq, uint16_t 
> >> desc_idx)
> >> * descriptor.
> >> */
> >>avail_idx = (uint16_t)(vq->vq_avail_idx & (vq->vq_nentries - 1));
> >> -  vq->vq_ring.avail->ring[avail_idx] = desc_idx;
> >> +  if (unlikely(vq->vq_ring.avail->ring[avail_idx] != desc_idx))
> >> +  vq->vq_ring.avail->ring[avail_idx] = desc_idx;
> >>vq->vq_avail_idx++;
> >>  }
> >>  
> >
> 


[dpdk-dev] [PATCH v2 3/5] virtio/vdev: add embeded device emulation

2016-02-07 Thread Michael S. Tsirkin
On Fri, Feb 05, 2016 at 07:20:26PM +0800, Jianfeng Tan wrote:
> diff --git a/drivers/net/virtio/vhost.h b/drivers/net/virtio/vhost.h
> new file mode 100644
> index 000..73d4f5c
> --- /dev/null
> +++ b/drivers/net/virtio/vhost.h
> @@ -0,0 +1,194 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2010-2016 Intel Corporation. All rights reserved.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + * * Redistributions of source code must retain the above copyright
> + *   notice, this list of conditions and the following disclaimer.
> + * * Redistributions in binary form must reproduce the above copyright
> + *   notice, this list of conditions and the following disclaimer in
> + *   the documentation and/or other materials provided with the
> + *   distribution.
> + * * Neither the name of Intel Corporation nor the names of its
> + *   contributors may be used to endorse or promote products derived
> + *   from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef _VHOST_NET_USER_H
> +#define _VHOST_NET_USER_H
> +
> +#include 
> +#include 
> +#include 
> +
> +#define VHOST_MEMORY_MAX_NREGIONS 8

Don't hard-code this, it's not nice.

> +
> +struct vhost_vring_state {
> + unsigned int index;
> + unsigned int num;
> +};
> +
> +struct vhost_vring_file {
> + unsigned int index;
> + int fd;
> +};
> +
> +struct vhost_vring_addr {
> + unsigned int index;
> + /* Option flags. */
> + unsigned int flags;
> + /* Flag values: */
> + /* Whether log address is valid. If set enables logging. */
> +#define VHOST_VRING_F_LOG 0
> +
> + /* Start of array of descriptors (virtually contiguous) */
> + uint64_t desc_user_addr;
> + /* Used structure address. Must be 32 bit aligned */
> + uint64_t used_user_addr;
> + /* Available structure address. Must be 16 bit aligned */
> + uint64_t avail_user_addr;
> + /* Logging support. */
> + /* Log writes to used structure, at offset calculated from specified
> +  * address. Address must be 32 bit aligned.
> +  */
> + uint64_t log_guest_addr;
> +};
> +
> +#define VIRTIO_CONFIG_S_DRIVER_OK   4
> +
> +enum vhost_user_request {
> + VHOST_USER_NONE = 0,
> + VHOST_USER_GET_FEATURES = 1,
> + VHOST_USER_SET_FEATURES = 2,
> + VHOST_USER_SET_OWNER = 3,
> + VHOST_USER_RESET_OWNER = 4,
> + VHOST_USER_SET_MEM_TABLE = 5,
> + VHOST_USER_SET_LOG_BASE = 6,
> + VHOST_USER_SET_LOG_FD = 7,
> + VHOST_USER_SET_VRING_NUM = 8,
> + VHOST_USER_SET_VRING_ADDR = 9,
> + VHOST_USER_SET_VRING_BASE = 10,
> + VHOST_USER_GET_VRING_BASE = 11,
> + VHOST_USER_SET_VRING_KICK = 12,
> + VHOST_USER_SET_VRING_CALL = 13,
> + VHOST_USER_SET_VRING_ERR = 14,
> + VHOST_USER_GET_PROTOCOL_FEATURES = 15,
> + VHOST_USER_SET_PROTOCOL_FEATURES = 16,
> + VHOST_USER_GET_QUEUE_NUM = 17,
> + VHOST_USER_SET_VRING_ENABLE = 18,
> + VHOST_USER_MAX
> +};
> +
> +struct vhost_memory_region {
> + uint64_t guest_phys_addr;
> + uint64_t memory_size; /* bytes */
> + uint64_t userspace_addr;
> + uint64_t mmap_offset;
> +};
> +
> +struct vhost_memory_kernel {
> + uint32_t nregions;
> + uint32_t padding;
> + struct vhost_memory_region regions[0];
> +};
> +
> +struct vhost_memory {
> + uint32_t nregions;
> + uint32_t padding;
> + struct vhost_memory_region regions[VHOST_MEMORY_MAX_NREGIONS];
> +};
> +
> +struct vhost_user_msg {
> + enum vhost_user_request request;
> +
> +#define VHOST_USER_VERSION_MASK 0x3
> +#define VHOST_USER_REPLY_MASK   (0x1 << 2)
> + uint32_t flags;
> + uint32_t size; /* the following payload size */
> + union {
> +#define VHOST_USER_VRING_IDX_MASK   0xff
> +#define VHOST_USER_VRING_NOFD_MASK  (0x1 << 8)
> + uint64_t u64;
> + struct vhost_vring_state state;
> + struct vhost_vring_addr addr;
> + struct vhost_memory 

[dpdk-dev] virtio ring layout changes for optimal single-stream performance

2016-01-21 Thread Michael S. Tsirkin
On Thu, Jan 21, 2016 at 04:38:36PM +0100, Cornelia Huck wrote:
> On Thu, 21 Jan 2016 15:39:26 +0200
> "Michael S. Tsirkin"  wrote:
> 
> > Hi all!
> > I have been experimenting with alternative virtio ring layouts,
> > in order to speed up single stream performance.
> > 
> > I have just posted a benchmark I wrote for the purpose, and a (partial)
> > alternative layout implementation.  This achieves 20-40% reduction in
> > virtio overhead in the (default) polling mode.
> > 
> > http://article.gmane.org/gmane.linux.kernel.virtualization/26889
> > 
> > The layout is trying to be as simple as possible, to reduce
> > the number of cache lines bouncing between CPUs.
> 
> Some kind of diagram or textual description would really help to review
> this.
> 
> > 
> > For benchmarking, the idea is to emulate virtio in user-space,
> > artificially adding overhead for e.g. signalling to match what happens
> > in case of a VM.
> 
> Hm... is this overhead comparable enough between different platforms so
> that you can get a halfway realistic scenario?

On x86 it seems pretty stable.
It's a question of setting VMEXIT_CYCLES and VMENTRY_CYCLES correctly.
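
For illustration, a rough sketch of how such a fixed cost could be charged
in a userspace benchmark (hypothetical helper name, x86 only):

#include <x86intrin.h>

/* Spin for roughly 'cycles' TSC cycles to emulate the cost of a VM
 * exit or entry inside a pure userspace benchmark. */
static inline void
charge_cycles(unsigned long long cycles)
{
	unsigned long long start = __rdtsc();

	while (__rdtsc() - start < cycles)
		;	/* busy wait */
}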

> What about things like
> endianness conversions?

I didn't bother with them yet.

> > 
> > I'd be very curious to get feedback on this, in particular, some people
> > discussed using vectored operations to format virtio ring - would it
> > conflict with this work?
> > 
> > You are all welcome to post enhancements or more layout alternatives as
> > patches.
> 
> Let me see if I can find time to experiment a bit.


[dpdk-dev] virtio ring layout changes for optimal single-stream performance

2016-01-21 Thread Michael S. Tsirkin
Hi all!
I have been experimenting with alternative virtio ring layouts,
in order to speed up single stream performance.

I have just posted a benchmark I wrote for the purpose, and a (partial)
alternative layout implementation.  This achieves 20-40% reduction in
virtio overhead in the (default) polling mode.

http://article.gmane.org/gmane.linux.kernel.virtualization/26889

The layout is trying to be as simple as possible, to reduce
the number of cache lines bouncing between CPUs.

For benchmarking, the idea is to emulate virtio in user-space,
artificially adding overhead for e.g. signalling to match what happens
in case of a VM.

I'd be very curious to get feedback on this, in particular, some people
discussed using vectored operations to format virtio ring - would it
conflict with this work?

You are all welcome to post enhancements or more layout alternatives as
patches.

TODO:
- documentation+discussion of interaction with CPU caching
- thorough benchmarking of different configurations/hosts
- experiment with event index replacements
- better emulate vmexit/vmentry cost overhead
- virtio spec proposal

Thanks!
-- 
MST


[dpdk-dev] [PATCH 1/4] vhost: handle VHOST_USER_SET_LOG_BASE request

2015-12-02 Thread Michael S. Tsirkin
On Wed, Dec 02, 2015 at 06:58:03PM +0200, Panu Matilainen wrote:
> On 12/02/2015 05:09 PM, Yuanhan Liu wrote:
> >On Wed, Dec 02, 2015 at 04:48:14PM +0200, Panu Matilainen wrote:
> >...
> >diff --git a/lib/librte_vhost/rte_virtio_net.h 
> >b/lib/librte_vhost/rte_virtio_net.h
> >index 5687452..416dac2 100644
> >--- a/lib/librte_vhost/rte_virtio_net.h
> >+++ b/lib/librte_vhost/rte_virtio_net.h
> >@@ -127,6 +127,8 @@ struct virtio_net {
> >  #define IF_NAME_SZ (PATH_MAX > IFNAMSIZ ? PATH_MAX : IFNAMSIZ)
> > charifname[IF_NAME_SZ]; /**< Name of 
> > the tap device or socket path. */
> > uint32_tvirt_qp_nb; /**< number of queue 
> > pair we have allocated */
> >+uint64_tlog_size;   /**< Size of log area */
> >+uint8_t *log_base;  /**< Where dirty pages 
> >are logged */
> > void*priv;  /**< private context */
> > struct vhost_virtqueue  *virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];  
> > /**< Contains all virtqueue information. */
> >  } __rte_cache_aligned;
> 
> This (and other changes in patch 2) break the librte_vhost ABI
> again, so you'd need to at least add a deprecation note to 2.2 to be
> able to do it in 2.3 at all, according to the ABI policy.
> >>>
> >>>I was thinking that adding a new field (instead of renaming it or
> >>>removing it) isn't an ABI break. So, I was wrong?
> >>
> >>Adding or removing a field in the middle of a public struct is
> >>always an ABI break. Adding to the end often is too, but not always.
> >>Renaming a field is an API break but not an ABI break - the compiler
> >>cares but the cpu does not.
> >
> >Good to know. Thanks.
> >
> >>
> 
> Perhaps a better option would be adding some padding to the structs
> now for 2.2 since the vhost ABI is broken there anyway. That would
> at least give a chance to keep it compatible from 2.2 to 2.3.
> >>>
> >>>It will not be compatible, unless we add the exact same fields (not
> >>>something like uint8_t pad[xx]). Otherwise, the pad field renaming
> >>>is also an ABI break, right?
> >>
> >>There's no ABI (or API) break in changing reserved unused fields to
> >>something else, as long as care is taken with sizes and alignment.
> >
> >as long as we don't reference the reserved unused fields?
> 
> That would be the definition of an unused field I think :)
> Call it "reserved" if you want; it doesn't really matter as long as it's
> clear it's something you shouldn't be using.
> 
> >
> >>In any case padding is best added to the end of a struct to minimize
> >>risks and keep things simple.
> >
> >The thing is, isn't it a bit awful to (always) add pads to
> >the end of a struct, especially when you don't know how many
> >bytes need to be padded?
> 
> Then you pad for what you think you need, plus a bit extra, and maybe some
> more for others who might want to extend it. What is a reasonable amount
> needs deciding case by case - if a struct is allocated in the millions, then be
> (very) conservative, but if there are one or 50 such structs within an app
> lifetime, then who cares if it's a bit larger?
> 
> And yeah padding may be annoying, but that's pretty much the only option in
> a project where most of the structs are out in the open.
> 
>   - Panu -

Function versioning is another option.
For a sufficiently widely used struct it's a lot of work; padding
is easier.  But it might be better than breaking the ABI
if, e.g., you didn't pad enough.
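
For illustration, a hypothetical example of the padding approach (struct and
field names are made up), which keeps sizes and offsets stable as long as
alignment is respected:

#include <stdint.h>

/* Release N ships the public struct with spare room at the end: */
struct example_public_struct {
	uint64_t flags;
	uint8_t  reserved[64];   /* padding for future extensions */
};

/* Release N+1 can carve a real field out of the reserved area without
 * changing the struct size or the offsets of existing members: */
struct example_public_struct_v2 {
	uint64_t flags;
	uint64_t log_size;       /* was reserved[0..7] */
	uint8_t  reserved[56];
};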

> >
> > --yliu
> >


[dpdk-dev] [PATCH] vfio: Include No-IOMMU mode

2015-12-02 Thread Michael S. Tsirkin
On Wed, Dec 02, 2015 at 05:19:18PM +0100, Thomas Monjalon wrote:
> Hi,
> 
> 2015-12-02 08:28, Alex Williamson:
> > On Mon, 2015-11-16 at 19:12 +0200, Avi Kivity wrote:
> > > On 11/16/2015 07:06 PM, Alex Williamson wrote:
> > > > FYI, this is now in v4.4-rc1 (the slightly modified v2 version).  I want
> > > > to give fair warning though that while we seem to agree on this idea, it
> > > > hasn't been proven with a userspace driver port.  I've opted to include
> > > > it in this merge window rather than delaying it until v4.5, but I really
> > > > need to see a user for this before the end of the v4.4 cycle or I think
> > > > we'll need to revert and revisit for v4.5 anyway.  I don't really have
> > > > any interest in adding and maintaining code that has no users.  Please
> > > > keep me informed of progress with a dpdk port.  Thanks,
> > > 
> > > Thanks Alex.  Copying the dpdk mailing list, where the users live.
> > > 
> > > dpdk-ers: vfio-noiommu is a replacement for uio_pci_generic and 
> > > uio_igb.  It supports MSI-X and so can be used on SR/IOV VF devices.  
> > > The intent is that you can use dpdk without an external module, using 
> > > vfio, whether you are on bare metal with an iommu, bare metal without an 
> > > iommu, or virtualized.  However, dpdk needs modification to support this.
> > 
> > Still no users for this that I'm aware of.  I'm going to revert this in
> > rc5 unless something changes.  Thanks,
> 
> Removing needs for out-of-tree modules is a really nice achievement.
> Yes, we (in the DPDK project) should check how to use this no-iommu VFIO
> and to replace igb_uio.
> 
> I'm sorry we failed to catch your email and follow up.
> Is it really too late? What is the risk of keeping it in Linux 4.4?
> Advertising a new feature and removing it would be frustrating.
> 
> Have you tried this VFIO mode with DPDK?
> How complex would be the patch to support it?
> 
> Thanks

These things need to be developed together; one can't be sure it meets
userspace needs if no one has tried it.  And then where would we be?
Supporting a broken interface forever.  If someone writes the userspace
code, then this feature can come back for 4.5.

-- 
MST


[dpdk-dev] [PATCH 3/4] vhost: log vring changes

2015-12-02 Thread Michael S. Tsirkin
On Wed, Dec 02, 2015 at 05:58:24PM +0200, Victor Kaplansky wrote:
> On Wed, Dec 02, 2015 at 10:38:02PM +0800, Yuanhan Liu wrote:
> > On Wed, Dec 02, 2015 at 04:07:02PM +0200, Victor Kaplansky wrote:
> > > On Wed, Dec 02, 2015 at 11:43:12AM +0800, Yuanhan Liu wrote:
> > > > Invoking vhost_log_write() to mark corresponding page as dirty while
> > > > updating used vring.
> > > 
> > > Looks good, thanks!
> > > 
> > > I didn't find where you log the dirty pages in result of data
> > > written to the buffers pointed by the descriptors in RX vring.
> > > AFAIU, the buffers of RX queue reside in guest's memory and have
> > > to be marked as dirty if they are written. What do you say?
> > 
> > Yeah, we should. I have a question then: why is log_guest_addr set
> > to the physical address of the used vring in the guest? I mean, apparently,
> > we need to log more changes than just the used vring.
> 
> The physical address of the used vring is sent to the back-end, since
> otherwise the back-end would have to perform virtual to physical
> translation, and we want to avoid this. The dirty buffers have to
> be marked as well, but their guest physical addresses are known
> directly from the descriptors.

Yes, people wanted to be able to do multiple physical
addresses to one virtual so you do not want to translate
virt to phys.
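
For reference, a rough sketch of how marking one guest page dirty could look,
given the log_base/log_size fields added in patch 1/4 (assumptions: 4 KiB
pages, one bit per page; the real helper also needs atomic updates and
memory barriers):

#define VHOST_LOG_PAGE_SZ	4096

static void
log_dirty_page(struct virtio_net *dev, uint64_t gpa)
{
	uint64_t page = gpa / VHOST_LOG_PAGE_SZ;

	if (dev->log_base == NULL || page / 8 >= dev->log_size)
		return;

	dev->log_base[page / 8] |= 1 << (page % 8);
}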

> > 
> > --yliu


[dpdk-dev] [PATCH v7 4/8] vhost: rxtx: use queue id instead of constant ring index

2015-11-17 Thread Michael S. Tsirkin
On Mon, Nov 16, 2015 at 02:20:57PM -0800, Flavio Leitner wrote:
> On Wed, Oct 28, 2015 at 11:12:25PM +0200, Michael S. Tsirkin wrote:
> > On Wed, Oct 28, 2015 at 06:30:41PM -0200, Flavio Leitner wrote:
> > > On Sat, Oct 24, 2015 at 08:47:10PM +0300, Michael S. Tsirkin wrote:
> > > > On Sat, Oct 24, 2015 at 12:34:08AM -0200, Flavio Leitner wrote:
> > > > > On Thu, Oct 22, 2015 at 02:32:31PM +0300, Michael S. Tsirkin wrote:
> > > > > > On Thu, Oct 22, 2015 at 05:49:55PM +0800, Yuanhan Liu wrote:
> > > > > > > On Wed, Oct 21, 2015 at 05:26:18PM +0300, Michael S. Tsirkin 
> > > > > > > wrote:
> > > > > > > > On Wed, Oct 21, 2015 at 08:48:15PM +0800, Yuanhan Liu wrote:
> > > > > > > > > > Please note that for virtio devices, guest is supposed to
> > > > > > > > > > control the placement of incoming packets in RX queues.
> > > > > > > > > 
> > > > > > > > > I may not follow you.
> > > > > > > > > 
> > > > > > > > > Enqueuing packets to a RX queue is done at vhost lib, outside 
> > > > > > > > > the
> > > > > > > > > guest, how could the guest take the control here?
> > > > > > > > > 
> > > > > > > > >   --yliu
> > > > > > > > 
> > > > > > > > vhost should do what guest told it to.
> > > > > > > > 
> > > > > > > > See virtio spec:
> > > > > > > > 5.1.6.5.5 Automatic receive steering in multiqueue mode
> > > > > > > 
> > > > > > > Spec says:
> > > > > > > 
> > > > > > > After the driver transmitted a packet of a flow on transmitqX,
> > > > > > > the device SHOULD cause incoming packets for that flow to be
> > > > > > > steered to receiveqX.
> > > > > > > 
> > > > > > > 
> > > > > > > Michael, I still have no idea how vhost could know the flow even
> > > > > > > after discussion with Huawei. Could you be more specific about
> > > > > > > this? Say, how could guest know that? And how could guest tell
> > > > > > > vhost which RX is gonna to use?
> > > > > > > 
> > > > > > > Thanks.
> > > > > > > 
> > > > > > >   --yliu
> > > > > > 
> > > > > > I don't really understand the question.
> > > > > > 
> > > > > > When guests transmits a packet, it makes a decision
> > > > > > about the flow to use, and maps that to a tx/rx pair of queues.
> > > > > > 
> > > > > > It sends packets out on the tx queue and expects device to
> > > > > > return packets from the same flow on the rx queue.
> > > > > 
> > > > > Why? I can understand that there should be a mapping between
> > > > > flows and queues in a way that there is no re-ordering, but
> > > > > I can't see the relation of receiving a flow with a TX queue.
> > > > > 
> > > > > fbl
> > > > 
> > > > That's the way virtio chose to program the rx steering logic.
> > > > 
> > > > It's low overhead (no special commands), and
> > > > works well for TCP when user is an endpoint since rx and tx
> > > > for tcp are generally tied (because of ack handling).
> 
> It is low overhead for the control plane, but not for the data plane.

Well, there's zero data plane overhead within the guest.
You can't go lower :)

> > > > We can discuss other ways, e.g. special commands for guests to
> > > > program steering.
> > > > We'd have to first see some data showing the current scheme
> > > > is problematic somehow.
> 
> The issue is that the spec assumes the packets are coming in
> a serialized way and that the distribution will be made by vhost-user,
> but that isn't necessarily true.
> 

Making the distribution guest controlled is obviously the right
thing to do if the guest is the endpoint: we need the guest scheduler to
make the decisions, as it's the only entity that knows
how tasks are distributed across VCPUs.

It's possible that this is not the right thing for when guest
is just doing bridging between two VNICs:
are you saying packets should just go from RX queue N
o

[dpdk-dev] [PATCH] vhost-user: enable virtio 1.0

2015-11-09 Thread Michael S. Tsirkin
On Fri, Oct 30, 2015 at 06:48:09PM +0100, Thomas Monjalon wrote:
> 2015-10-18 10:04, Michael S. Tsirkin:
> > On Fri, Oct 16, 2015 at 02:52:30PM +0100, Bruce Richardson wrote:
> > > On Thu, Oct 15, 2015 at 04:18:59PM +0300, Michael S. Tsirkin wrote:
> > > > On Thu, Oct 15, 2015 at 02:08:39PM +0300, Marcel Apfelbaum wrote:
> > > > > Make vhost-user virtio 1.0 compatible by adding it to the
> > > > > supported features and keeping the header length
> > > > > the same as for mergeable RX buffers.
> > > > > 
> > > > > Signed-off-by: Marcel Apfelbaum 
> > > > 
> > > > Looks good to me
> > > > 
> > > > Acked-by: Michael S. Tsirkin 
> > > > 
> > > > Just one question: dpdk is only supported on little-endian
> > > > platforms at the moment, right?
> > > 
> > > A recent release added in support for PPC (patches supplied by IBM). For 
> > > example, see: 
> > > http://dpdk.org/browse/dpdk/commit/?id=704ba3770032c5a901719d3837845581d5a56b58
> > > 
> > > /Bruce
> > 
> > This will require more work then as 1.0 is a different
> > endian-ness from 0.9. It's up to you guys to decide
> > whether correct BE support is now a requirement for all
> > new dpdk code. Let us know.
> 
> I'm not sure to understand.
> Yes DPDK must work on big endian platforms.
> Does this patch prevent from using virtio 0.9 with big endian?

No it doesn't. virtio 0.9 with big endian most likely does not work
reliably with vhost-user at the moment since qemu does not tell vhost-user
about guest endian-ness (though it's common to have it match host,
then it will work by luck).

> Does it work with old QEMU not supporting virtio 1.0?

Yes it does.

virtio 1 requires a bunch of fields to be little endian.
If someone uses this patch with a new QEMU on a BE machine,
things won't work: you need

	if (virtio_1)
		cpu_to_le(...);

On LE, cpu_to_le is a NOP, so things work by luck.

In other words this patch is fine, more is needed to make virtio 1
work on BE hosts.
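
For illustration, a hypothetical helper showing the conditional conversion
(not an existing DPDK API; it only assumes rte_byteorder.h):

#include <rte_byteorder.h>

/* Ring fields are always little endian for virtio 1.0; for legacy 0.9
 * they use the guest's native order (assumed equal to the host's here). */
static inline uint16_t
vhost_cpu_to_ring16(int is_virtio_1, uint16_t val)
{
	return is_virtio_1 ? rte_cpu_to_le_16(val) : val;
}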

-- 
MST


[dpdk-dev] [PATCH v7 4/8] vhost: rxtx: use queue id instead of constant ring index

2015-10-29 Thread Michael S. Tsirkin
On Wed, Oct 28, 2015 at 06:30:41PM -0200, Flavio Leitner wrote:
> On Sat, Oct 24, 2015 at 08:47:10PM +0300, Michael S. Tsirkin wrote:
> > On Sat, Oct 24, 2015 at 12:34:08AM -0200, Flavio Leitner wrote:
> > > On Thu, Oct 22, 2015 at 02:32:31PM +0300, Michael S. Tsirkin wrote:
> > > > On Thu, Oct 22, 2015 at 05:49:55PM +0800, Yuanhan Liu wrote:
> > > > > On Wed, Oct 21, 2015 at 05:26:18PM +0300, Michael S. Tsirkin wrote:
> > > > > > On Wed, Oct 21, 2015 at 08:48:15PM +0800, Yuanhan Liu wrote:
> > > > > > > > Please note that for virtio devices, guest is supposed to
> > > > > > > > control the placement of incoming packets in RX queues.
> > > > > > > 
> > > > > > > I may not follow you.
> > > > > > > 
> > > > > > > Enqueuing packets to a RX queue is done at vhost lib, outside the
> > > > > > > guest, how could the guest take the control here?
> > > > > > > 
> > > > > > >   --yliu
> > > > > > 
> > > > > > vhost should do what guest told it to.
> > > > > > 
> > > > > > See virtio spec:
> > > > > > 5.1.6.5.5 Automatic receive steering in multiqueue mode
> > > > > 
> > > > > Spec says:
> > > > > 
> > > > > After the driver transmitted a packet of a flow on transmitqX,
> > > > > the device SHOULD cause incoming packets for that flow to be
> > > > > steered to receiveqX.
> > > > > 
> > > > > 
> > > > > Michael, I still have no idea how vhost could know the flow even
> > > > > after discussion with Huawei. Could you be more specific about
> > > > > this? Say, how could guest know that? And how could guest tell
> > > > > vhost which RX is gonna to use?
> > > > > 
> > > > > Thanks.
> > > > > 
> > > > >   --yliu
> > > > 
> > > > I don't really understand the question.
> > > > 
> > > > When guests transmits a packet, it makes a decision
> > > > about the flow to use, and maps that to a tx/rx pair of queues.
> > > > 
> > > > It sends packets out on the tx queue and expects device to
> > > > return packets from the same flow on the rx queue.
> > > 
> > > Why? I can understand that there should be a mapping between
> > > flows and queues in a way that there is no re-ordering, but
> > > I can't see the relation of receiving a flow with a TX queue.
> > > 
> > > fbl
> > 
> > That's the way virtio chose to program the rx steering logic.
> > 
> > It's low overhead (no special commands), and
> > works well for TCP when user is an endpoint since rx and tx
> > for tcp are generally tied (because of ack handling).
> > 
> > We can discuss other ways, e.g. special commands for guests to
> > program steering.
> > We'd have to first see some data showing the current scheme
> > is problematic somehow.
> 
> The issue is that the restriction imposes operations to be done in the
> data path.  For instance, Open vSwitch has N number of threads to manage
> X RX queues. We distribute them in round-robin fashion.  So, the thread
> polling one RX queue will do all the packet processing and push it to the
> TX queue of the other device (vhost-user or not) using the same 'id'.
> 
> Doing so we can avoid locking between threads and TX queues and any other
> extra computation while still keeping the packet ordering/distribution fine.
> 
> However, if vhost-user has to send packets according to the guest mapping,
> it will require locking between queues and additional operations to select
> the appropriate queue.  Those actions will cause performance issues.


You only need to send updates if the guest moves a flow to another queue.
This is very rare since the guest must avoid reordering.

Oh, and you don't have to have locking.  Just update the table and let
the target pick up the new value at leisure; worst case, a packet ends up
in the wrong queue.


> I see no real benefit from enforcing the guest mapping outside the guest
> that would justify all the computation cost, so my suggestion is to change the
> spec to suggest that behavior, but not to require it for compliance.
> 
> Does that make sense?
> 
> Thanks,
> fbl

It's not a question of what the spec says; it's a question of the
quality of implementation: the guest needs to be able to balance load
between the CPUs serving the queues, and this means it needs a way to control
steering.

IMO having dpdk control it makes no sense in this scenario.

This is different from dpdk sending packets to real NIC
queues which all operate in parallel.

-- 
MST


[dpdk-dev] [PATCH v8 3/8] vhost: vring queue setup for multiple queue support

2015-10-27 Thread Michael S. Tsirkin
On Tue, Oct 27, 2015 at 10:51:14AM +0100, Thomas Monjalon wrote:
> 2015-10-27 11:42, Michael S. Tsirkin:
> > On Tue, Oct 27, 2015 at 05:30:41PM +0800, Yuanhan Liu wrote:
> > > On Tue, Oct 27, 2015 at 11:17:16AM +0200, Michael S. Tsirkin wrote:
> > > > Looking at that, at least when MQ is enabled, please don't key
> > > > stopping queues off GET_VRING_BASE.
> > > 
> > > Yes, that's only a workaround. I guess it has been there for quite a
> > > while, maybe at the time qemu doesn't send RESET_OWNER message.
> > 
> > RESET_OWNER was a bad idea since it basically closes
> > everything.
> > 
> > > > There are ENABLE/DISABLE messages for that.
> > > 
> > > That's something new,
> > 
> > That's part of multiqueue support. If you ignore them,
> > nothing works properly.
> > 
> > > though I have plan to use them instead, we still
> > > need to make sure our code work with old qemu, without ENABLE/DISABLE
> > > messages.
> > 
> > OK but don't rely on this for new code.
> > 
> > > And I will think more while enabling live migration: I should have
> > > more time to address issues like this at that time.
> > > 
> > > > Generally guys, don't take whatever QEMU happens to do for
> > > > granted! Look at the protocol spec under doc/specs directory,
> > > > if you are making more assumptions you must document them!
> > > 
> > > Indeed. And we will try to address them bit by bit in future.
> > > 
> > >   --yliu
> > 
> > But don't pile up these workarounds meanwhile.  I'm very worried.  The
> > way you are carrying on, each new QEMU is likely to break your
> > assumptions.
> 
> I think it may be saner to increase the minimum QEMU version supported in
> each DPDK release, dropping old stuff progressively.
> Michael, you are welcome to suggest how to move precisely.
> Thanks

This doesn't work for downstreams which need to backport fixes and
features.

Just go by the spec, and if you find issues, fix them at the
source instead of working around them - the code is open.

For new features, we have protocol feature bits.

-- 
MST


[dpdk-dev] [PATCH v8 3/8] vhost: vring queue setup for multiple queue support

2015-10-27 Thread Michael S. Tsirkin
On Tue, Oct 27, 2015 at 05:30:41PM +0800, Yuanhan Liu wrote:
> On Tue, Oct 27, 2015 at 11:17:16AM +0200, Michael S. Tsirkin wrote:
> > On Tue, Oct 27, 2015 at 03:20:40PM +0900, Tetsuya Mukawa wrote:
> > > On 2015/10/26 14:42, Yuanhan Liu wrote:
> > > > On Mon, Oct 26, 2015 at 02:24:08PM +0900, Tetsuya Mukawa wrote:
> > > >> On 2015/10/22 21:35, Yuanhan Liu wrote:
> > > > ...
> > > >>> @@ -292,13 +300,13 @@ user_get_vring_base(struct vhost_device_ctx ctx,
> > > >>>* sent and only sent in vhost_vring_stop.
> > > >>>* TODO: cleanup the vring, it isn't usable since here.
> > > >>>*/
> > > >>> - if ((dev->virtqueue[VIRTIO_RXQ]->kickfd) >= 0) {
> > > >>> - close(dev->virtqueue[VIRTIO_RXQ]->kickfd);
> > > >>> - dev->virtqueue[VIRTIO_RXQ]->kickfd = -1;
> > > >>> + if ((dev->virtqueue[state->index]->kickfd + VIRTIO_RXQ) >= 0) {
> > > >>> + close(dev->virtqueue[state->index + 
> > > >>> VIRTIO_RXQ]->kickfd);
> > > >>> + dev->virtqueue[state->index + VIRTIO_RXQ]->kickfd = -1;
> > > >>>   }
> > > >> Hi Yuanhan,
> > > >>
> > > >> Please let me make sure whether below is correct.
> > > >> if ((dev->virtqueue[state->index]->kickfd + VIRTIO_RXQ) >= 0) {
> > > >>
> > > >>> - if ((dev->virtqueue[VIRTIO_TXQ]->kickfd) >= 0) {
> > > >>> - close(dev->virtqueue[VIRTIO_TXQ]->kickfd);
> > > >>> - dev->virtqueue[VIRTIO_TXQ]->kickfd = -1;
> > > >>> + if ((dev->virtqueue[state->index]->kickfd + VIRTIO_TXQ) >= 0) {
> > > >>> + close(dev->virtqueue[state->index + 
> > > >>> VIRTIO_TXQ]->kickfd);
> > > >>> + dev->virtqueue[state->index + VIRTIO_TXQ]->kickfd = -1;
> > > >> Also, same question here.
> > > > Oops, silly typos... Thanks for catching it out!
> > > >
> > > > Here is an update patch (Thomas, please let me know if you prefer me
> > > > to send the whole patchset for you to apply):
> > > 
> > > Hi Yuanhan,
> > > 
> > > I may miss one more issue here.
> > > Could you please see below patch I've submitted today?
> > > (I may find a similar issue, so I've fixed it also in below patch.)
> > >  
> > > - http://dpdk.org/dev/patchwork/patch/8038/
> > >  
> > > Thanks,
> > > Tetsuya
> > 
> > Looking at that, at least when MQ is enabled, please don't key
> > stopping queues off GET_VRING_BASE.
> 
> Yes, that's only a workaround. I guess it has been there for quite a
> while, maybe at the time qemu doesn't send RESET_OWNER message.

RESET_OWNER was a bad idea since it basically closes
everything.

> > There are ENABLE/DISABLE messages for that.
> 
> That's something new,

That's part of multiqueue support. If you ignore them,
nothing works properly.

> though I have plan to use them instead, we still
> need to make sure our code work with old qemu, without ENABLE/DISABLE
> messages.

OK but don't rely on this for new code.

> And I will think more while enabling live migration: I should have
> more time to address issues like this at that time.
> 
> > Generally guys, don't take whatever QEMU happens to do for
> > granted! Look at the protocol spec under doc/specs directory,
> > if you are making more assumptions you must document them!
> 
> Indeed. And we will try to address them bit by bit in future.
> 
>   --yliu

But don't pile up these workarounds meanwhile.  I'm very worried.  The
way you are carrying on, each new QEMU is likely to break your
assumptions.

-- 
MST


[dpdk-dev] [PATCH v8 3/8] vhost: vring queue setup for multiple queue support

2015-10-27 Thread Michael S. Tsirkin
On Tue, Oct 27, 2015 at 03:20:40PM +0900, Tetsuya Mukawa wrote:
> On 2015/10/26 14:42, Yuanhan Liu wrote:
> > On Mon, Oct 26, 2015 at 02:24:08PM +0900, Tetsuya Mukawa wrote:
> >> On 2015/10/22 21:35, Yuanhan Liu wrote:
> > ...
> >>> @@ -292,13 +300,13 @@ user_get_vring_base(struct vhost_device_ctx ctx,
> >>>* sent and only sent in vhost_vring_stop.
> >>>* TODO: cleanup the vring, it isn't usable since here.
> >>>*/
> >>> - if ((dev->virtqueue[VIRTIO_RXQ]->kickfd) >= 0) {
> >>> - close(dev->virtqueue[VIRTIO_RXQ]->kickfd);
> >>> - dev->virtqueue[VIRTIO_RXQ]->kickfd = -1;
> >>> + if ((dev->virtqueue[state->index]->kickfd + VIRTIO_RXQ) >= 0) {
> >>> + close(dev->virtqueue[state->index + VIRTIO_RXQ]->kickfd);
> >>> + dev->virtqueue[state->index + VIRTIO_RXQ]->kickfd = -1;
> >>>   }
> >> Hi Yuanhan,
> >>
> >> Please let me make sure whether below is correct.
> >> if ((dev->virtqueue[state->index]->kickfd + VIRTIO_RXQ) >= 0) {
> >>
> >>> - if ((dev->virtqueue[VIRTIO_TXQ]->kickfd) >= 0) {
> >>> - close(dev->virtqueue[VIRTIO_TXQ]->kickfd);
> >>> - dev->virtqueue[VIRTIO_TXQ]->kickfd = -1;
> >>> + if ((dev->virtqueue[state->index]->kickfd + VIRTIO_TXQ) >= 0) {
> >>> + close(dev->virtqueue[state->index + VIRTIO_TXQ]->kickfd);
> >>> + dev->virtqueue[state->index + VIRTIO_TXQ]->kickfd = -1;
> >> Also, same question here.
> > Oops, silly typos... Thanks for catching it out!
> >
> > Here is an update patch (Thomas, please let me know if you prefer me
> > to send the whole patchset for you to apply):
> 
> Hi Yuanhan,
> 
> I may miss one more issue here.
> Could you please see below patch I've submitted today?
> (I may find a similar issue, so I've fixed it also in below patch.)
>  
> - http://dpdk.org/dev/patchwork/patch/8038/
>  
> Thanks,
> Tetsuya

Looking at that, at least when MQ is enabled, please don't key
stopping queues off GET_VRING_BASE.

There are ENABLE/DISABLE messages for that.

Generally guys, don't take whatever QEMU happens to do for
granted! Look at the protocol spec under doc/specs directory,
if you are making more assumptions you must document them!

-- 
MST


[dpdk-dev] [PATCH v7 4/8] vhost: rxtx: use queue id instead of constant ring index

2015-10-24 Thread Michael S. Tsirkin
On Sat, Oct 24, 2015 at 12:34:08AM -0200, Flavio Leitner wrote:
> On Thu, Oct 22, 2015 at 02:32:31PM +0300, Michael S. Tsirkin wrote:
> > On Thu, Oct 22, 2015 at 05:49:55PM +0800, Yuanhan Liu wrote:
> > > On Wed, Oct 21, 2015 at 05:26:18PM +0300, Michael S. Tsirkin wrote:
> > > > On Wed, Oct 21, 2015 at 08:48:15PM +0800, Yuanhan Liu wrote:
> > > > > > Please note that for virtio devices, guest is supposed to
> > > > > > control the placement of incoming packets in RX queues.
> > > > > 
> > > > > I may not follow you.
> > > > > 
> > > > > Enqueuing packets to a RX queue is done at vhost lib, outside the
> > > > > guest, how could the guest take the control here?
> > > > > 
> > > > >   --yliu
> > > > 
> > > > vhost should do what guest told it to.
> > > > 
> > > > See virtio spec:
> > > > 5.1.6.5.5 Automatic receive steering in multiqueue mode
> > > 
> > > Spec says:
> > > 
> > > After the driver transmitted a packet of a flow on transmitqX,
> > > the device SHOULD cause incoming packets for that flow to be
> > > steered to receiveqX.
> > > 
> > > 
> > > Michael, I still have no idea how vhost could know the flow even
> > > after discussion with Huawei. Could you be more specific about
> > > this? Say, how could guest know that? And how could guest tell
> > > vhost which RX is gonna to use?
> > > 
> > > Thanks.
> > > 
> > >   --yliu
> > 
> > I don't really understand the question.
> > 
> > When guests transmits a packet, it makes a decision
> > about the flow to use, and maps that to a tx/rx pair of queues.
> > 
> > It sends packets out on the tx queue and expects device to
> > return packets from the same flow on the rx queue.
> 
> Why? I can understand that there should be a mapping between
> flows and queues in a way that there is no re-ordering, but
> I can't see the relation of receiving a flow with a TX queue.
> 
> fbl

That's the way virtio chose to program the rx steering logic.

It's low overhead (no special commands), and
works well for TCP when user is an endpoint since rx and tx
for tcp are generally tied (because of ack handling).

We can discuss other ways, e.g. special commands for guests to
program steering.
We'd have to first see some data showing the current scheme
is problematic somehow.


> > During transmit, device needs to figure out the flow
> > of packets as they are received from guest, and track
> > which flows go on which tx queue.
> > When it selects the rx queue, it has to use the same table.
> > 
> > There is currently no provision for controlling
> > steering for uni-directional
> > flows which are possible e.g. with UDP.
> > 
> > We might solve this in a future spec - for example, set a flag notifying
> > guest that steering information is missing for a given flow, for example
> > by setting a flag in a packet, or using the command queue, and have
> > guest send a dummy empty packet to set steering rule for this flow.
> > 
> > 
> > -- 
> > MST
> > 


[dpdk-dev] [PATCH v7 4/8] vhost: rxtx: use queue id instead of constant ring index

2015-10-22 Thread Michael S. Tsirkin
On Thu, Oct 22, 2015 at 10:07:10PM +0800, Yuanhan Liu wrote:
> On Thu, Oct 22, 2015 at 02:32:31PM +0300, Michael S. Tsirkin wrote:
> > On Thu, Oct 22, 2015 at 05:49:55PM +0800, Yuanhan Liu wrote:
> > > On Wed, Oct 21, 2015 at 05:26:18PM +0300, Michael S. Tsirkin wrote:
> > > > On Wed, Oct 21, 2015 at 08:48:15PM +0800, Yuanhan Liu wrote:
> > > > > > Please note that for virtio devices, guest is supposed to
> > > > > > control the placement of incoming packets in RX queues.
> > > > > 
> > > > > I may not follow you.
> > > > > 
> > > > > Enqueuing packets to a RX queue is done at vhost lib, outside the
> > > > > guest, how could the guest take the control here?
> > > > > 
> > > > >   --yliu
> > > > 
> > > > vhost should do what guest told it to.
> > > > 
> > > > See virtio spec:
> > > > 5.1.6.5.5 Automatic receive steering in multiqueue mode
> > > 
> > > Spec says:
> > > 
> > > After the driver transmitted a packet of a flow on transmitqX,
> > > the device SHOULD cause incoming packets for that flow to be
> > > steered to receiveqX.
> > > 
> > > 
> > > Michael, I still have no idea how vhost could know the flow even
> > > after discussion with Huawei. Could you be more specific about
> > > this? Say, how could guest know that? And how could guest tell
> > > vhost which RX is gonna to use?
> > > 
> > > Thanks.
> > > 
> > >   --yliu
> > 
> > I don't really understand the question.
> > 
> > When guests transmits a packet, it makes a decision
> > about the flow to use, and maps that to a tx/rx pair of queues.
> > 
> > It sends packets out on the tx queue and expects device to
> > return packets from the same flow on the rx queue.
> > 
> > During transmit, device needs to figure out the flow
> > of packets as they are received from guest, and track
> > which flows go on which tx queue.
> > When it selects the rx queue, it has to use the same table.
> 
> Thanks for the lengthy explanation, Michael!
> 
> I guess the key is are we able to get the table inside vhost-user
> lib? And, are you looking for something like following?
> 
>   static int rte_vhost_enqueue_burst(pkts)
>   {
>   for_each_pkts(pkt) {
>   int rxq = get_rxq_from_table(pkt);
>   
>   queue_to_rxq(pkt, rxq);
>   }
>   }
> 
> BTW, there should be such an implementation somewhere, right?
> If so, would you please point me to it?

See tun_flow_update in drivers/net/tun.c in Linux.
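
A very rough sketch of the same idea on the vhost side (all names are made
up; it ignores hash collisions and flow aging, which tun_flow_update handles):

#include <stdint.h>

#define STEER_TABLE_SIZE 1024

static uint16_t steer_table[STEER_TABLE_SIZE];

/* Learn the flow -> queue mapping from packets the guest transmits. */
static void
steer_learn(uint32_t flow_hash, uint16_t txq)
{
	steer_table[flow_hash % STEER_TABLE_SIZE] = txq;
}

/* Use the same table to pick the RX queue of the pair on enqueue. */
static uint16_t
steer_select_rxq(uint32_t flow_hash)
{
	return steer_table[flow_hash % STEER_TABLE_SIZE];
}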


> In the meanwhile, I will read more doc/code to try to understand
> it.
> 
>   --yliu
> 
> > 
> > There is currently no provision for controlling
> > steering for uni-directional
> > flows which are possible e.g. with UDP.
> > 
> > We might solve this in a future spec - for example, set a flag notifying
> > guest that steering information is missing for a given flow, for example
> > by setting a flag in a packet, or using the command queue, and have
> > guest send a dummy empty packet to set steering rule for this flow.
> > 
> > 
> > -- 
> > MST


[dpdk-dev] [PATCH v7 4/8] vhost: rxtx: use queue id instead of constant ring index

2015-10-22 Thread Michael S. Tsirkin
On Thu, Oct 22, 2015 at 05:49:55PM +0800, Yuanhan Liu wrote:
> On Wed, Oct 21, 2015 at 05:26:18PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Oct 21, 2015 at 08:48:15PM +0800, Yuanhan Liu wrote:
> > > > Please note that for virtio devices, guest is supposed to
> > > > control the placement of incoming packets in RX queues.
> > > 
> > > I may not follow you.
> > > 
> > > Enqueuing packets to a RX queue is done at vhost lib, outside the
> > > guest, how could the guest take the control here?
> > > 
> > >   --yliu
> > 
> > vhost should do what guest told it to.
> > 
> > See virtio spec:
> > 5.1.6.5.5 Automatic receive steering in multiqueue mode
> 
> Spec says:
> 
> After the driver transmitted a packet of a flow on transmitqX,
> the device SHOULD cause incoming packets for that flow to be
> steered to receiveqX.
> 
> 
> Michael, I still have no idea how vhost could know the flow even
> after discussion with Huawei. Could you be more specific about
> this? Say, how could guest know that? And how could guest tell
> vhost which RX is gonna to use?
> 
> Thanks.
> 
>   --yliu

I don't really understand the question.

When guests transmits a packet, it makes a decision
about the flow to use, and maps that to a tx/rx pair of queues.

It sends packets out on the tx queue and expects device to
return packets from the same flow on the rx queue.

During transmit, device needs to figure out the flow
of packets as they are received from guest, and track
which flows go on which tx queue.
When it selects the rx queue, it has to use the same table.

There is currently no provision for controlling
steering for uni-directional
flows which are possible e.g. with UDP.

We might solve this in a future spec - for example, set a flag notifying
guest that steering information is missing for a given flow, for example
by setting a flag in a packet, or using the command queue, and have
guest send a dummy empty packet to set steering rule for this flow.


-- 
MST


[dpdk-dev] [PATCH v7 4/8] vhost: rxtx: use queue id instead of constant ring index

2015-10-21 Thread Michael S. Tsirkin
On Wed, Oct 21, 2015 at 11:48:10AM +0800, Yuanhan Liu wrote:
> From: Changchun Ouyang 
> 
> Do not use VIRTIO_RXQ or VIRTIO_TXQ anymore; use the queue_id
> instead, which will be set to a proper value for a specific queue
> when we have multiple queue support enabled.
> 
> For now, queue_id is still set with VIRTIO_RXQ or VIRTIO_TXQ,
> so it should not break anything.
> 
> Signed-off-by: Changchun Ouyang 
> Signed-off-by: Yuanhan Liu 
> Acked-by: Flavio Leitner 

I tried to figure out how queue_id is set and I couldn't.
Please note that for virtio devices, guest is supposed to
control the placement of incoming packets in RX queues.


> ---
> 
> v7: commit title fix
> ---
>  lib/librte_vhost/vhost_rxtx.c | 46 
> ++-
>  1 file changed, 32 insertions(+), 14 deletions(-)
> 
> diff --git a/lib/librte_vhost/vhost_rxtx.c b/lib/librte_vhost/vhost_rxtx.c
> index 7026bfa..14e00ef 100644
> --- a/lib/librte_vhost/vhost_rxtx.c
> +++ b/lib/librte_vhost/vhost_rxtx.c
> @@ -42,6 +42,16 @@
>  
>  #define MAX_PKT_BURST 32
>  
> +static inline int __attribute__((always_inline))
> +is_valid_virt_queue_idx(uint32_t virtq_idx, int is_tx, uint32_t max_qp_idx)
> +{
> + if ((is_tx ^ (virtq_idx & 0x1)) ||
> + (virtq_idx >= max_qp_idx * VIRTIO_QNUM))
> + return 0;
> +
> + return 1;
> +}
> +
>  /**
>   * This function adds buffers to the virtio devices RX virtqueue. Buffers can
>   * be received from the physical port or from another virtio device. A packet
> @@ -68,12 +78,14 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
>   uint8_t success = 0;
>  
>   LOG_DEBUG(VHOST_DATA, "(%"PRIu64") virtio_dev_rx()\n", dev->device_fh);
> - if (unlikely(queue_id != VIRTIO_RXQ)) {
> - LOG_DEBUG(VHOST_DATA, "mq isn't supported in this version.\n");
> + if (unlikely(!is_valid_virt_queue_idx(queue_id, 0, dev->virt_qp_nb))) {
> + RTE_LOG(ERR, VHOST_DATA,
> + "%s (%"PRIu64"): virtqueue idx:%d invalid.\n",
> + __func__, dev->device_fh, queue_id);
>   return 0;
>   }
>  
> - vq = dev->virtqueue[VIRTIO_RXQ];
> + vq = dev->virtqueue[queue_id];
>   count = (count > MAX_PKT_BURST) ? MAX_PKT_BURST : count;
>  
>   /*
> @@ -235,8 +247,9 @@ virtio_dev_rx(struct virtio_net *dev, uint16_t queue_id,
>  }
>  
>  static inline uint32_t __attribute__((always_inline))
> -copy_from_mbuf_to_vring(struct virtio_net *dev, uint16_t res_base_idx,
> - uint16_t res_end_idx, struct rte_mbuf *pkt)
> +copy_from_mbuf_to_vring(struct virtio_net *dev, uint32_t queue_id,
> + uint16_t res_base_idx, uint16_t res_end_idx,
> + struct rte_mbuf *pkt)
>  {
>   uint32_t vec_idx = 0;
>   uint32_t entry_success = 0;
> @@ -264,7 +277,7 @@ copy_from_mbuf_to_vring(struct virtio_net *dev, uint16_t 
> res_base_idx,
>* Convert from gpa to vva
>* (guest physical addr -> vhost virtual addr)
>*/
> - vq = dev->virtqueue[VIRTIO_RXQ];
> + vq = dev->virtqueue[queue_id];
>   vb_addr = gpa_to_vva(dev, vq->buf_vec[vec_idx].buf_addr);
>   vb_hdr_addr = vb_addr;
>  
> @@ -464,11 +477,14 @@ virtio_dev_merge_rx(struct virtio_net *dev, uint16_t 
> queue_id,
>  
>   LOG_DEBUG(VHOST_DATA, "(%"PRIu64") virtio_dev_merge_rx()\n",
>   dev->device_fh);
> - if (unlikely(queue_id != VIRTIO_RXQ)) {
> - LOG_DEBUG(VHOST_DATA, "mq isn't supported in this version.\n");
> + if (unlikely(!is_valid_virt_queue_idx(queue_id, 0, dev->virt_qp_nb))) {
> + RTE_LOG(ERR, VHOST_DATA,
> + "%s (%"PRIu64"): virtqueue idx:%d invalid.\n",
> + __func__, dev->device_fh, queue_id);
> + return 0;
>   }
>  
> - vq = dev->virtqueue[VIRTIO_RXQ];
> + vq = dev->virtqueue[queue_id];
>   count = RTE_MIN((uint32_t)MAX_PKT_BURST, count);
>  
>   if (count == 0)
> @@ -509,8 +525,8 @@ virtio_dev_merge_rx(struct virtio_net *dev, uint16_t 
> queue_id,
>   res_cur_idx);
>   } while (success == 0);
>  
> - entry_success = copy_from_mbuf_to_vring(dev, res_base_idx,
> - res_cur_idx, pkts[pkt_idx]);
> + entry_success = copy_from_mbuf_to_vring(dev, queue_id,
> + res_base_idx, res_cur_idx, pkts[pkt_idx]);
>  
>   rte_compiler_barrier();
>  
> @@ -562,12 +578,14 @@ rte_vhost_dequeue_burst(struct virtio_net *dev, 
> uint16_t queue_id,
>   uint16_t free_entries, entry_success = 0;
>   uint16_t avail_idx;
>  
> - if (unlikely(queue_id != VIRTIO_TXQ)) {
> - LOG_DEBUG(VHOST_DATA, "mq isn't supported in this version.\n");
> + if (unlikely(!is_valid_virt_queue_idx(queue_id, 1, dev->virt_qp_nb))) {
> + RTE_LOG(ERR, VHOST_DATA,
> + "%s (%"PRIu64"): virtqueue idx:%d invalid.\n",
> +  

[dpdk-dev] [PATCH] vhost-user: enable virtio 1.0

2015-10-18 Thread Michael S. Tsirkin
On Fri, Oct 16, 2015 at 02:52:30PM +0100, Bruce Richardson wrote:
> On Thu, Oct 15, 2015 at 04:18:59PM +0300, Michael S. Tsirkin wrote:
> > On Thu, Oct 15, 2015 at 02:08:39PM +0300, Marcel Apfelbaum wrote:
> > > Make vhost-user virtio 1.0 compatible by adding it to the
> > > supported features and keeping the header length
> > > the same as for mergeable RX buffers.
> > > 
> > > Signed-off-by: Marcel Apfelbaum 
> > 
> > Looks good to me
> > 
> > Acked-by: Michael S. Tsirkin 
> > 
> > Just one question: dpdk is only supported on little-endian
> > platforms at the moment, right?
> 
> A recent release added in support for PPC (patches supplied by IBM). For 
> example, see: 
> http://dpdk.org/browse/dpdk/commit/?id=704ba3770032c5a901719d3837845581d5a56b58
> 
> /Bruce

This will require more work then as 1.0 is a different
endian-ness from 0.9. It's up to you guys to decide
whether correct BE support is now a requirement for all
new dpdk code. Let us know.

-- 
MST


[dpdk-dev] [PATCH] vhost-user: enable virtio 1.0

2015-10-16 Thread Michael S. Tsirkin
On Fri, Oct 16, 2015 at 09:43:09AM +0200, Andriy Berestovskyy wrote:
> Hi guys,
> Just a minor note: ARM is in fact bi-endian. For instance, there are
> tool chains for both endians available from Linaro.
> 
> Andriy

Yea. BE support is around for legacy stuff.  So I'm not sure it's all
that important to support that mode in NFV/DPDK applications.

If it becomes important, for virtio 0.9 you won't be able to
support transparently in dpdk anyway - you'll need a new
vhost-user message to get the endian-ness of the guest,
and do byteswaps conditionally, adding overhead.

Maybe going virtio 1 only is the sane thing to do for these
applications.

-- 
MST


[dpdk-dev] [PATCH] vhost-user: enable virtio 1.0

2015-10-16 Thread Michael S. Tsirkin
On Fri, Oct 16, 2015 at 10:24:38AM +0800, Yuanhan Liu wrote:
> On Thu, Oct 15, 2015 at 04:18:59PM +0300, Michael S. Tsirkin wrote:
> > On Thu, Oct 15, 2015 at 02:08:39PM +0300, Marcel Apfelbaum wrote:
> > > Make vhost-user virtio 1.0 compatible by adding it to the
> > > supported features and keeping the header length
> > > the same as for mergeable RX buffers.
> > > 
> > > Signed-off-by: Marcel Apfelbaum 
> 
> Marcel, that's actually one of my TODOs in this quarter. So, thank
> you! :)
> 
> > 
> > Looks good to me
> > 
> > Acked-by: Michael S. Tsirkin 
> > 
> > Just one question: dpdk is only supported on little-endian
> > platforms at the moment, right?
> 
> AFAIK, yes. But you might also see that there are some patch to add
> ARM arch support showed up in the mailing list few weeks ago.

Luckily, that's also little-endian.

> > virtio 1 spec requires little endian.
> 
> I made a quick list of the difference between virtio v0.95 and v1.0
> months ago just by reading your kernel commits of adding v1.0 support:
> 
> +---+-+--+
> |   | v0.95   |  v1.0|
> +---+-+--+
> 1)  | features bits | 32  |  64  |
> +---+-+--+
> 2)  | Endianness| native  |  Little Endian   |
> +---+-+--+
> 3)  | vring space   | contiguous  |  avail and used buffer could |
> |   | memory  |  be on a separate memory |

And desc buffer, too.

> +---+-+--+
> 4)  | FEATURE_OK status | No  |   Yes|
> +---+-+--+
> 
> 
> 
> For 1), I guess we have been using 64 bits for storing feature bits
> for vhost for a long time already. So, there should be no extra effort.
> 
> For 2), as stated, there might be no issue as long as DPDK is little
> endian only. But we'd better add a wrapper for that, as it seems
> big endian support would come in the near future.

OK, but it probably doesn't

> For 3), are we actually doing that? I just saw that there is a kernel
> patch to introduce two functions for getting the avail and used buffer
> addresses, respectively.  But I didn't see that the two buffers are
> allocated in non-contiguous memory.

That will soon happen for anyone claiming support for virtio 1.

But vhost-user already sends each ring part separately.
Does dpdk assume they are contiguous?

> For 4), it's work we should do in the virtio PMD driver. And it seems
> that there is far more work to be done in the virtio PMD driver than
> in the vhost lib, say, adding virtio modern PCI support.
> 
> Besides those 4 differences, did I miss anything?


From the virtio PMD point of view? There are more
differences. The trick is to find the "legacy interface"
sections and go over them; they compare 0.9 to 1.0.

> 
> BTW, since we already have same TODOs, I guess it'd be better to
> share what we have in our TODO list. Here are what I got till the
> time writing this email (in order of priority):
> 
> - a vhost performance issue (it might last long; it might not).
> 
> - vhost-user live migration support
> 
> - virtio 1.0 support, including PMD and vhost lib (and you guys have
>   already done that :)
> 
> Thanks.

This patch only touches the vhost lib, though.

>   --yliu
> 
> 
> > > ---
> > > 
> > > To be applied on top of:
> > >[dpdk-dev] [PATCH v6 00/13] vhost-user multiple queues enabling
> > > 
> > > Thanks,
> > > Marcel
> > > 
> > >  lib/librte_vhost/virtio-net.c | 15 ---
> > >  1 file changed, 8 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
> > > index a51327d..ee4650e 100644
> > > --- a/lib/librte_vhost/virtio-net.c
> > > +++ b/lib/librte_vhost/virtio-net.c
> > > @@ -75,6 +75,7 @@ static struct virtio_net_config_ll *ll_root;
> > >   (1ULL << VIRTIO_NET_F_CTRL_VQ) | \
> > >   (1ULL << VIRTIO_NET_F_CTRL_RX) | \
> > >   (1ULL << VIRTIO_NET_F_MQ)  | \
> > > + (1ULL << VIRTIO_F_VERSION_1)   | \
> > >

[dpdk-dev] [PATCH] vhost-user: enable virtio 1.0

2015-10-15 Thread Michael S. Tsirkin
On Thu, Oct 15, 2015 at 02:08:39PM +0300, Marcel Apfelbaum wrote:
> Make vhost-user virtio 1.0 compatible by adding it to the
> supported features and keeping the header length
> the same as for mergeable RX buffers.
> 
> Signed-off-by: Marcel Apfelbaum 

Looks good to me

Acked-by: Michael S. Tsirkin 

Just one question: dpdk is only supported on little-endian
platforms at the moment, right?
virtio 1 spec requires little endian.

> ---
> 
> To be applied on top of:
>[dpdk-dev] [PATCH v6 00/13] vhost-user multiple queues enabling
> 
> Thanks,
> Marcel
> 
>  lib/librte_vhost/virtio-net.c | 15 ---
>  1 file changed, 8 insertions(+), 7 deletions(-)
> 
> diff --git a/lib/librte_vhost/virtio-net.c b/lib/librte_vhost/virtio-net.c
> index a51327d..ee4650e 100644
> --- a/lib/librte_vhost/virtio-net.c
> +++ b/lib/librte_vhost/virtio-net.c
> @@ -75,6 +75,7 @@ static struct virtio_net_config_ll *ll_root;
>   (1ULL << VIRTIO_NET_F_CTRL_VQ) | \
>   (1ULL << VIRTIO_NET_F_CTRL_RX) | \
>   (1ULL << VIRTIO_NET_F_MQ)  | \
> + (1ULL << VIRTIO_F_VERSION_1)   | \
>   (1ULL << VHOST_F_LOG_ALL)  | \
>   (1ULL << VHOST_USER_F_PROTOCOL_FEATURES))
>  static uint64_t VHOST_FEATURES = VHOST_SUPPORTED_FEATURES;
> @@ -477,17 +478,17 @@ set_features(struct vhost_device_ctx ctx, uint64_t *pu)
>   return -1;
>  
>   dev->features = *pu;
> - if (dev->features & (1 << VIRTIO_NET_F_MRG_RXBUF)) {
> - LOG_DEBUG(VHOST_CONFIG,
> - "(%"PRIu64") Mergeable RX buffers enabled\n",
> - dev->device_fh);
> + if (dev->features &
> + ((1 << VIRTIO_NET_F_MRG_RXBUF) | (1ULL << VIRTIO_F_VERSION_1))) {
>   vhost_hlen = sizeof(struct virtio_net_hdr_mrg_rxbuf);
>   } else {
> - LOG_DEBUG(VHOST_CONFIG,
> - "(%"PRIu64") Mergeable RX buffers disabled\n",
> - dev->device_fh);
>   vhost_hlen = sizeof(struct virtio_net_hdr);
>   }
> + LOG_DEBUG(VHOST_CONFIG,
> +  "(%"PRIu64") Mergeable RX buffers %s, virtio 1 %s\n",
> +   dev->device_fh,
> +  (dev->features & (1 << VIRTIO_NET_F_MRG_RXBUF)) ? "on" : "off",
> +  (dev->features & (1ULL << VIRTIO_F_VERSION_1)) ? "on" : "off");
>  
>   for (i = 0; i < dev->virt_qp_nb; i++) {
>   uint16_t base_idx = i * VIRTIO_QNUM;
> -- 
> 2.1.0


[dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X

2015-10-06 Thread Michael S. Tsirkin
On Tue, Oct 06, 2015 at 05:49:21PM +0300, Vlad Zolotarov wrote:
> >and read/write the config space.
> >This means that a single userspace bug is enough to corrupt kernel
> >memory.
> 
> Could u, pls., provide and example of this simple bug? Because it's
> absolutely not obvious...

Stick a value that happens to match a kernel address into the Msg Addr field
of an unmasked MSI-X entry.

-- 
MST


[dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X

2015-10-06 Thread Michael S. Tsirkin
On Tue, Oct 06, 2015 at 03:15:57PM +0300, Avi Kivity wrote:
> btw, (2) doesn't really add any insecurity.  The user could already poke at
> the msix tables (as well as perform DMA); they just couldn't get a useful
> interrupt out of them.

Poking at msix tables won't cause memory corruption unless msix and bus
mastering are enabled.  It's true root can enable msix and bus mastering
through sysfs - but that's easy to block or detect. Even if you don't
buy a security story, it seems less likely to trigger as a result
of a userspace bug.

-- 
MST


[dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X

2015-10-06 Thread Michael S. Tsirkin
On Tue, Oct 06, 2015 at 11:23:11AM +0300, Vlad Zolotarov wrote:
> Michael, how is this or any other related patch related to the problem you
> are describing?
> The above ability has been there for years, and if memory serves me
> well it was you who wrote uio_pci_generic with this "security flaw".  ;)

I answered all this already.

This patch enables bus mastering, enables MSI or MSI-X, and requires
userspace to map the MSI-X table and read/write the config space.
This means that a single userspace bug is enough to corrupt kernel
memory.

uio_pci_generic does not enable bus mastering or MSI, and
it might be a good idea to have uio_pci_generic block
access to MSI/MSI-X config.
-- 
MST


[dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X

2015-10-06 Thread Michael S. Tsirkin
On Tue, Oct 06, 2015 at 08:33:56AM +0100, Stephen Hemminger wrote:
> Other than implementation objections, so far the two main arguments
> against this reduce to:
>   1. If you allow UIO ioctl then it opens an API hook for all the crap out
>  of tree UIO drivers to do what they want.
>   2. If you allow UIO MSI-X then you are expanding the usage of userspace
>  device access in an insecure manner.

That's not all. Without MSI one can detect insecure usage by detecting
userspace enabling bus mastering.  This can be detected simply using
lspci.  Or one can also imagine a configuration where this ability is
disabled, is logged, or taints the kernel.  This seems like something that
might be worth having for some locked-down systems.

OTOH enabling MSI requires enabling bus mastering so suddenly we have no
idea whether device can be/is used in a safe way.

> 
> Another alternative which I explored was making a version of VFIO that
> works without IOMMU. It solves #1 but actually increases the likely negative
> response to argument #2.

No - because VFIO has limited protection against device misuse by
userspace, by limiting access to sub-ranges of device BARs and config
space.  For a device that doesn't do DMA, that will be enough to make it
secure to use.

That's a pretty weak excuse to support userspace drivers for PCI devices
without an IOMMU, but it's the best I heard so far.

Is that worth the security trade-off? I'm still not sure.

> This would keep same API, and avoid having to
> modify UIO. But we would still have the same (if not more resistance)
> from IOMMU developers who believe all systems have to be secure against
> root.

"Secure against root" is a confusing way to put it IMHO. We are talking
about memory protection.

So that's not IOMMU developers IIUC. I believe most kernel developers will
agree it's not a good idea to let userspace corrupt kernel memory.
Otherwise, the driver can't be supported, and maintaining upstream
drivers that can't be supported serves no useful purpose.  Anyone can
load out of tree ones just as well.

VFIO already supports MSI so VFIO developers already have a lot of
experience with these issues. Getting their input would be valuable.

-- 
MST


[dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X

2015-10-06 Thread Michael S. Tsirkin
On Tue, Oct 06, 2015 at 01:09:55AM +0300, Vladislav Zolotarov wrote:
> How about, instead of trying to reinvent the wheel, attacking the problem
> directly, as I've already proposed a few times in the last days: instead
> of limiting UIO, limit the users that are allowed to use UIO to privileged
> users only (e.g. root). This would solve all the clearly unresolvable issues
> you are raising here altogether, wouldn't it?

No - root or no root, if the user can modify the addresses in the MSI-X
table and make the chip corrupt random memory, this is IMHO a non-starter.

And tainting kernel is not a solution - your patch adds a pile of
code that either goes completely unused or taints the kernel.
Not just that - it's a dedicated userspace API that either
goes completely unused or taints the kernel.

> >
> > --
> > MST
> 


[dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X

2015-10-06 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 11:33:06AM +0300, Michael S. Tsirkin wrote:
> Just forwarding events is not enough to make a valid driver.
> What is missing is a way to access the device in a safe way.

Thinking about it some more, maybe some devices don't do DMA, and merely
signal events with MSI/MSI-X.

The fact you mention igb_uio in the cover letter seems to hint that this
isn't the case, and that the real intent is to abuse it for DMA-capable
devices, but still ...

If we assume such a simple device, we need to block userspace from
tweaking at least the MSI control and the MSI-X table.
And changing BARs might make someone else corrupt the MSI-X
table, so we need to block it from changing BARs, too.

Things like device reset will clear the table.  I guess this means we
need to track access to reset, too, and make sure we restore the
table to a sane config.

The PM capability can be used to reset things too, I think. Better be
careful about that.

And a bunch of devices could be doing weird things that need
to be special-cased.

All of this is what VFIO is already dealing with.

Maybe extending VFIO for this usecase, or finding another way to share
code might be a better idea than duplicating the code within uio?
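
To make the list above concrete, here is a minimal, hypothetical sketch of the kind of config-space write filter such a driver would need (offsets and capability sizes follow the PCI spec; none of this is an existing kernel API):

#include <stdbool.h>

#define PCI_COMMAND   0x04   /* command register: bus master enable lives here */
#define PCI_BAR0      0x10
#define PCI_BAR_END   0x28   /* one past BAR5 */

static bool overlaps(int off, int len, int start, int end)
{
    return off < end && off + len > start;
}

/* cap_msi/cap_msix/cap_pm are the config-space offsets of the MSI, MSI-X
 * and PM capability structures (0 if absent), found by walking the
 * capability list; the sizes used are the maximum structure sizes from
 * the PCI spec (24, 12 and 8 bytes). */
static bool cfg_write_allowed(int off, int len,
                              int cap_msi, int cap_msix, int cap_pm)
{
    if (overlaps(off, len, PCI_COMMAND, PCI_COMMAND + 2))
        return false;   /* bus mastering and friends */
    if (overlaps(off, len, PCI_BAR0, PCI_BAR_END))
        return false;   /* moving a BAR can alias the MSI-X table */
    if (cap_msi && overlaps(off, len, cap_msi, cap_msi + 24))
        return false;   /* MSI address/data/control */
    if (cap_msix && overlaps(off, len, cap_msix, cap_msix + 12))
        return false;   /* MSI-X control */
    if (cap_pm && overlaps(off, len, cap_pm, cap_pm + 8))
        return false;   /* PM can reset the function */
    return true;
}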

-- 
MST


[dpdk-dev] Unlinking hugepage backing file after initialiation

2015-10-06 Thread Michael S. Tsirkin
On Mon, Oct 05, 2015 at 01:08:52PM +, Xie, Huawei wrote:
> On 9/30/2015 5:36 AM, Michael S. Tsirkin wrote:
> > On Tue, Sep 29, 2015 at 05:50:00PM +, shesha Sreenivasamurthy (shesha) 
> > wrote:
> >> Sure. Then, is there any real reason why the backing files should not be
> >> unlinked ?
> > AFAIK qemu unlinks them already.
> Sorry, I didn't make it clear. Let us take the physical Ethernet
> controller in the host as an example:
> 
> 1)  DPDK app1 unlinked the huge page after initialization.
> 2)  DPDK app1 crashed or got killed unexpectedly.
> 3)  The NIC device is still DMAing to the buffer memory allocated from
> the huge page.
> 4)  Another app2 started, allocated memory from the hugetlbfs, and the
> memory allocated happened to be that buffer memory.
> Now the NIC device DMAs into the memory of app2, which corrupts app2.
> By the way, the window is very narrow, but we could avoid this
> corruption if we don't unlink the huge page immediately.  We could
> reinitialize the NIC through a binding operation and then remove the huge
> page.
> 
> I mentioned virtio the first time. In its case, the one doing the DMA
> is vhost, and I am talking about the guest huge pages, not the huge pages
> used to back guest memory.
> 
> So we had better not unlink huge pages unless we have another solution to
> avoid the corruption.

Oh, I get it now. It's when you (ab)use UIO to bypass all normal kernel
protections.  There's no problem when using VFIO.

So kernel doesn't protect you in case of a crash, but I guess you
can try to protect yourself.

For example, write a separate service that you can pass the hugepage FDs
and the device FDs to. Have it hold on to them, and when it detects your
app crashed, have it reset the device before closing the FDs.

Just make sure that one doesn't crash :).
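
A minimal sketch of such a watchdog (the device path, the FD-passing convention and the crash-detection protocol are all assumptions, not an existing tool): it keeps the hugepage and device FDs open, waits for the app's end of a socket to close, and resets the PCI function via sysfs before letting the FDs go.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static void reset_device(const char *bdf)
{
    char path[128];

    /* e.g. /sys/bus/pci/devices/0000:03:00.0/reset */
    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/reset", bdf);
    int fd = open(path, O_WRONLY);
    if (fd >= 0) {
        (void)write(fd, "1", 1);
        close(fd);
    }
}

int main(int argc, char **argv)
{
    char c;

    if (argc < 2)   /* argv[1]: PCI address of the NIC */
        return 1;

    /* fd 0 is assumed to be one end of a socketpair() shared with the
     * DPDK app; the hugepage and device FDs are assumed to have been
     * inherited and are simply kept open.  read() returning 0 means
     * the peer closed, i.e. the app exited or crashed. */
    while (read(0, &c, 1) > 0)
        ;   /* ignore keep-alive bytes */

    reset_device(argv[1]);   /* quiesce DMA first ... */
    return 0;                /* ... then exiting releases the held FDs */
}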

But really, people should just use VFIO.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-04 Thread Michael S. Tsirkin
On Fri, Oct 02, 2015 at 03:07:24PM +0100, Bruce Richardson wrote:
> On Fri, Oct 02, 2015 at 05:00:14PM +0300, Michael S. Tsirkin wrote:
> > On Thu, Oct 01, 2015 at 02:02:24PM -0700, Alexander Duyck wrote:
> > > validation and translation would add 10s if not 100s of nanoseconds to the
> > > time needed to process each packet.  In addition we are talking about 
> > > doing
> > > this in kernel space which means we wouldn't really be able to take
> > > advantage of things like SSE or AVX instructions.
> > 
> > Yes. But the nice thing is that it's rearming so it can happen on
> > a separate core, in parallel with packet processing.
> > It does not need to add to latency.
> > 
> > You will burn up more CPU, but again, all this for boxes/hypervisors
> > without an IOMMU.
> > 
> > I'm sure people can come up with even better approaches, once enough
> > people get it that kernel absolutely needs to be protected from
> > userspace.
> > 
> > Long term, the right thing to do is to focus on IOMMU support. This
> > gives you hardware-based memory protection without need to burn up CPU
> > cycles.
> > 
> > -- 
> > MST
> 
> Running it on another core will have its own problems. The main one that
> springs to mind for me is the performance impact of having all those cache
> lines shared between the two cores.
> 
> /Bruce

The cache line is currently invalidated by the device write:

    packet processing core -> device -> packet processing core

We are adding another stage:

    packet processing core -> rearming core -> device -> packet processing core

but the amount of sharing per core isn't increased.

This is something that can be tried immediately without kernel changes.
Who knows, maybe you will actually be able to push more pps this way.


Further, rearming is not doing a lot besides moving bits around in
memory, and it's in kernel so using very limited resources - maybe we
can efficiently use an HT logical core for this task.
This remains to be seen.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-02 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 02:02:24PM -0700, Alexander Duyck wrote:
> validation and translation would add 10s if not 100s of nanoseconds to the
> time needed to process each packet.  In addition we are talking about doing
> this in kernel space which means we wouldn't really be able to take
> advantage of things like SSE or AVX instructions.

Yes. But the nice thing is that it's rearming so it can happen on
a separate core, in parallel with packet processing.
It does not need to add to latency.

You will burn up more CPU, but again, all this for boxes/hypervisors
without an IOMMU.

I'm sure people can come up with even better approaches, once enough
people get it that kernel absolutely needs to be protected from
userspace.

Long term, the right thing to do is to focus on IOMMU support. This
gives you hardware-based memory protection without need to burn up CPU
cycles.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-02 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 02:17:49PM -0700, Alexander Duyck wrote:
> On 10/01/2015 02:42 AM, Michael S. Tsirkin wrote:
> >On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
> >>even when they are some users
> >>prefer to avoid the performance penalty.
> >I don't think there's a measureable penalty from passing through the
> >IOMMU, as long as mappings are mostly static (i.e. iommu=pt).  I sure
> >never saw any numbers that show such.
> 
> It depends on the IOMMU.  I believe Intel had a performance penalty on all
> CPUs prior to Ivy Bridge.  Since then things have improved to where they are
> comparable to bare metal.
> 
> The graph on page 5 of
> https://networkbuilders.intel.com/docs/Network_Builders_RA_vBRAS_Final.pdf
> shows the penalty clear as day.  Pretty much anything before Ivy Bridge w/
> small packets is slowed to a crawl with an IOMMU enabled.
> 
> - Alex

VMs are running with IOMMU enabled anyway.
Avi here tells us no one uses SRIOV on bare metal so ...
we don't need to argue about that.

-- 
MST


[dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 10:26:19AM -0700, Stephen Hemminger wrote:
> On Thu, 1 Oct 2015 19:31:08 +0300
> "Michael S. Tsirkin"  wrote:
> 
> > On Thu, Oct 01, 2015 at 11:33:06AM +0300, Michael S. Tsirkin wrote:
> > > On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote:
> > > > This driver allows using PCI device with Message Signalled Interrupt
> > > > from userspace. The API is similar to the igb_uio driver used by the 
> > > > DPDK.
> > > > Via ioctl it provides a mechanism to map MSI-X interrupts into event
> > > > file descriptors similar to VFIO.
> > > >
> > > > VFIO is a better choice if IOMMU is available, but often userspace 
> > > > drivers
> > > > have to work in environments where IOMMU support (real or emulated) is
> > > > not available.  All UIO drivers that support DMA are not secure against
> > > > rogue userspace applications programming DMA hardware to access
> > > > private memory; this driver is no less secure than existing code.
> > > > 
> > > > Signed-off-by: Stephen Hemminger 
> > > 
> > > I don't think copying the igb_uio interface is a good idea.
> > > What DPDK is doing with igb_uio (and indeed uio_pci_generic)
> > > is abusing the sysfs BAR access to provide unlimited
> > > access to hardware.
> > > 
> > > MSI messages are memory writes so any generic device capable
> > > of MSI is capable of corrupting kernel memory.
> > > This means that a bug in userspace will lead to kernel memory corruption
> > > and crashes.  This is something distributions can't support.
> > > 
> > > uio_pci_generic is already abused like that, mostly
> > > because when I wrote it, I didn't add enough protections
> > > against using it with DMA capable devices,
> > > and we can't go back and break working userspace.
> > > But at least it does not bind to VFs which all of
> > > them are capable of DMA.
> > > 
> > > The result of merging this driver will be userspace abusing the
> > > sysfs BAR access with VFs as well, and we do not want that.
> > > 
> > > 
> > > Just forwarding events is not enough to make a valid driver.
> > > What is missing is a way to access the device in a safe way.
> > > 
> > > On a more positive note:
> > > 
> > > What would be a reasonable interface? One that does the following
> > > in kernel:
> > > 
> > > 1. initializes device rings (can be in pinned userspace memory,
> > >but can not be writeable by userspace), brings up interface link
> > > 2. pins userspace memory (unless using e.g. hugetlbfs)
> > > 3. gets request, make sure it's valid and belongs to
> > >the correct task, put it in the ring
> > > 4. in the reverse direction, notify userspace when buffers
> > >are available in the ring
> > > 5. notify userspace about MSI (what this driver does)
> > > 
> > > What userspace can be allowed to do:
> > > 
> > >   format requests (e.g. transmit, receive) in userspace
> > >   read ring contents
> > > 
> > > What userspace can't be allowed to do:
> > > 
> > >   access BAR
> > >   write rings
> > > 
> > > 
> > > This means that the driver can not be a generic one,
> > > and there will be a system call overhead when you
> > > write the ring, but that's the price you have to
> > > pay for ability to run on systems without an IOMMU.
> > > 
> > 
> > 
> > The device specific parts can be taken from John Fastabend's patches
> > BTW:
> > 
> > https://patchwork.ozlabs.org/patch/396713/
> > 
> > IIUC what was missing there was exactly the memory protection
> > we are looking for here.
> 
> The bifuricated drivers are interesting from an architecture
> point of view, but do nothing to solve the immediate use case.
> The problem is not on bare metal environment, most of those already have 
> IOMMU.
> The issues are on environments like VMWare with SRIOV or vmxnet3,
> neither of those are really helped by bifirucated driver or VFIO.

Two points I tried to make (and apparently failed, so I'm trying again,
more verbosely):
- bifurcated drivers do DMA into both kernel and userspace
  memory from the same PCI address (bus/dev/fn). As the IOMMU uses this
  source address to validate accesses, there is no way to
  have the IOMMU prevent userspace from accessing kernel memory.
  If you are prepared to use dynamic mappings for kern

[dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 11:33:06AM +0300, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote:
> > This driver allows using PCI device with Message Signalled Interrupt
> > from userspace. The API is similar to the igb_uio driver used by the DPDK.
> > Via ioctl it provides a mechanism to map MSI-X interrupts into event
> > file descriptors similar to VFIO.
> >
> > VFIO is a better choice if IOMMU is available, but often userspace drivers
> > have to work in environments where IOMMU support (real or emulated) is
> > not available.  All UIO drivers that support DMA are not secure against
> > rogue userspace applications programming DMA hardware to access
> > private memory; this driver is no less secure than existing code.
> > 
> > Signed-off-by: Stephen Hemminger 
> 
> I don't think copying the igb_uio interface is a good idea.
> What DPDK is doing with igb_uio (and indeed uio_pci_generic)
> is abusing the sysfs BAR access to provide unlimited
> access to hardware.
> 
> MSI messages are memory writes so any generic device capable
> of MSI is capable of corrupting kernel memory.
> This means that a bug in userspace will lead to kernel memory corruption
> and crashes.  This is something distributions can't support.
> 
> uio_pci_generic is already abused like that, mostly
> because when I wrote it, I didn't add enough protections
> against using it with DMA capable devices,
> and we can't go back and break working userspace.
> But at least it does not bind to VFs which all of
> them are capable of DMA.
> 
> The result of merging this driver will be userspace abusing the
> sysfs BAR access with VFs as well, and we do not want that.
> 
> 
> Just forwarding events is not enough to make a valid driver.
> What is missing is a way to access the device in a safe way.
> 
> On a more positive note:
> 
> What would be a reasonable interface? One that does the following
> in kernel:
> 
> 1. initializes device rings (can be in pinned userspace memory,
>but can not be writeable by userspace), brings up interface link
> 2. pins userspace memory (unless using e.g. hugetlbfs)
> 3. gets request, make sure it's valid and belongs to
>the correct task, put it in the ring
> 4. in the reverse direction, notify userspace when buffers
>are available in the ring
> 5. notify userspace about MSI (what this driver does)
> 
> What userspace can be allowed to do:
> 
>   format requests (e.g. transmit, receive) in userspace
>   read ring contents
> 
> What userspace can't be allowed to do:
> 
>   access BAR
>   write rings
> 
> 
> This means that the driver can not be a generic one,
> and there will be a system call overhead when you
> write the ring, but that's the price you have to
> pay for ability to run on systems without an IOMMU.
> 


The device specific parts can be taken from John Fastabend's patches
BTW:

https://patchwork.ozlabs.org/patch/396713/

IIUC what was missing there was exactly the memory protection
we are looking for here.


> 
> 
> > ---
> >  drivers/uio/Kconfig  |   9 ++
> >  drivers/uio/Makefile |   1 +
> >  drivers/uio/uio_msi.c| 378 
> > +++
> >  include/uapi/linux/Kbuild|   1 +
> >  include/uapi/linux/uio_msi.h |  22 +++
> >  5 files changed, 411 insertions(+)
> >  create mode 100644 drivers/uio/uio_msi.c
> >  create mode 100644 include/uapi/linux/uio_msi.h
> > 
> > diff --git a/drivers/uio/Kconfig b/drivers/uio/Kconfig
> > index 52c98ce..04adfa0 100644
> > --- a/drivers/uio/Kconfig
> > +++ b/drivers/uio/Kconfig
> > @@ -93,6 +93,15 @@ config UIO_PCI_GENERIC
> >   primarily, for virtualization scenarios.
> >   If you compile this as a module, it will be called uio_pci_generic.
> >  
> > +config UIO_PCI_MSI
> > +   tristate "Generic driver supporting MSI-x on PCI Express cards"
> > +   depends on PCI
> > +   help
> > + Generic driver that provides Message Signalled IRQ events
> > + similar to VFIO. If IOMMMU is available please use VFIO
> > + instead since it provides more security.
> > + If you compile this as a module, it will be called uio_msi.
> > +
> >  config UIO_NETX
> > tristate "Hilscher NetX Card driver"
> > depends on PCI
> > diff --git a/drivers/uio/Makefile b/drivers/uio/Makefile
> > index 8560dad..62fc44b 100644
> > --- a/drivers/uio/Makefile
> > +++ b/drivers/uio/Makefile
> > @@ -9,3 +9,4 @@ obj-$(CONFIG_UIO_NETX)  += uio_netx.o
> >  obj-$(CONFIG_

[dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 01:37:12PM +0300, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 11:33:06AM +0300, Michael S. Tsirkin wrote:
> > On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote:
> > > This driver allows using PCI device with Message Signalled Interrupt
> > > from userspace. The API is similar to the igb_uio driver used by the DPDK.
> > > Via ioctl it provides a mechanism to map MSI-X interrupts into event
> > > file descriptors similar to VFIO.
> > >
> > > VFIO is a better choice if IOMMU is available, but often userspace drivers
> > > have to work in environments where IOMMU support (real or emulated) is
> > > not available.  All UIO drivers that support DMA are not secure against
> > > rogue userspace applications programming DMA hardware to access
> > > private memory; this driver is no less secure than existing code.
> > > 
> > > Signed-off-by: Stephen Hemminger 
> > 
> > I don't think copying the igb_uio interface is a good idea.
> > What DPDK is doing with igb_uio (and indeed uio_pci_generic)
> > is abusing the sysfs BAR access to provide unlimited
> > access to hardware.
> > 
> > MSI messages are memory writes so any generic device capable
> > of MSI is capable of corrupting kernel memory.
> > This means that a bug in userspace will lead to kernel memory corruption
> > and crashes.  This is something distributions can't support.
> > 
> > uio_pci_generic is already abused like that, mostly
> > because when I wrote it, I didn't add enough protections
> > against using it with DMA capable devices,
> > and we can't go back and break working userspace.
> > But at least it does not bind to VFs which all of
> > them are capable of DMA.
> > 
> > The result of merging this driver will be userspace abusing the
> > sysfs BAR access with VFs as well, and we do not want that.
> > 
> > 
> > Just forwarding events is not enough to make a valid driver.
> > What is missing is a way to access the device in a safe way.
> > 
> > On a more positive note:
> > 
> > What would be a reasonable interface? One that does the following
> > in kernel:
> > 
> > 1. initializes device rings (can be in pinned userspace memory,
> >but can not be writeable by userspace), brings up interface link
> > 2. pins userspace memory (unless using e.g. hugetlbfs)
> > 3. gets request, make sure it's valid and belongs to
> >the correct task, put it in the ring
> > 4. in the reverse direction, notify userspace when buffers
> >are available in the ring
> > 5. notify userspace about MSI (what this driver does)
> > 
> > What userspace can be allowed to do:
> > 
> > format requests (e.g. transmit, receive) in userspace
> > read ring contents
> > 
> > What userspace can't be allowed to do:
> > 
> > access BAR
> > write rings
> 
> Thinking about it some more, many devices
> 
> 
> Have separate rings for DMA: TX (device reads memory)
> and RX (device writes memory).
> With such devices, a mode where userspace can write TX ring
> but not RX ring might make sense.
> 
> This will mean userspace might read kernel memory
> through the device, but can not corrupt it.
> 
> That's already a big win!
> 
> And RX buffers do not have to be added one at a time.
> If we assume 0.2usec per system call, batching some 100 buffers per
> system call gives you 2 nano seconds overhead.  That seems quite
> reasonable.
> 

To add to that, there's no reason to do this on the same core.
Re-arming descriptors can happen in parallel with packet
processing, so this overhead won't affect PPS or latency at all:
only the CPU utilization.


> 
> 
> 
> > 
> > This means that the driver can not be a generic one,
> > and there will be a system call overhead when you
> > write the ring, but that's the price you have to
> > pay for ability to run on systems without an IOMMU.
> > 
> > 
> > 
> > 
> > > ---
> > >  drivers/uio/Kconfig  |   9 ++
> > >  drivers/uio/Makefile |   1 +
> > >  drivers/uio/uio_msi.c| 378 
> > > +++
> > >  include/uapi/linux/Kbuild|   1 +
> > >  include/uapi/linux/uio_msi.h |  22 +++
> > >  5 files changed, 411 insertions(+)
> > >  create mode 100644 drivers/uio/uio_msi.c
> > >  create mode 100644 include/uapi/linux/uio_msi.h
> > > 
> > > diff --git a/drivers/uio/Kconfig b/drivers/uio/Kconfig
> > > ind

[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 04:14:33PM +0300, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 01:07:13PM +0100, Bruce Richardson wrote:
> > > > This in itself is going to use up
> > > > a good proportion of the processing time, as well as that we have to 
> > > > spend cycles
> > > > copying the descriptors from one ring in memory to another. Given that 
> > > > right now
> > > > with the vector ixgbe driver, the cycle cost per packet of RX is just a 
> > > > few dozen
> > > > cycles on modern cores, every additional cycle (fraction of a 
> > > > nanosecond) has
> > > > an impact.
> > > > 
> > > > Regards,
> > > > /Bruce
> > > 
> > > See above.  There is no need for that on data path. Only re-adding
> > > buffers requires a system call.
> > > 
> > 
> > Re-adding buffers is a key part of the data path! Ok, the fact that its 
> > only on
> > descriptor rearm does allow somewhat bigger batches,
> 
> That was the point, yes.
> 
> > but the whole point of having
> > the kernel do this extra work you propose is to allow the kernel to scan and
> > sanitize the physical addresses - and that will take a lot of cycles, 
> > especially
> > if it has to handle all the different descriptor formats of all the 
> > different NICs,
> > as has already been pointed out.
> > 
> > /Bruce
> 
> Well the driver would be per NIC, so there's only need to support
> specific formats supported by a given NIC.
> 
> An alternative is to format the descriptors in kernel, based
> on just the list of addresses. This seems cleaner, but I don't
> know how efficient it would be.


Additionally, rearming descriptors can happen on another
core in parallel with processing packets on the first one.

This will use more CPU but help you stay within your PPS limits.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 11:42:23AM +0200, Vincent JARDIN wrote:
> There were some tentative to get it for other (older) drivers, named
> 'bifurcated drivers', but it is stalled.

That approach also has the advantage that userspace bugs can't do
silly things like reprogram device's EEPROM by mistake.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 07:55:20AM -0700, Stephen Hemminger wrote:
> On Thu, 1 Oct 2015 13:14:08 +0300
> "Michael S. Tsirkin"  wrote:
> 
> > On Thu, Oct 01, 2015 at 12:43:53PM +0300, Avi Kivity wrote:
> > > >There were some tentative to get it for other (older) drivers, named
> > > >'bifurcated drivers', but it is stalled.
> > > 
> > > IIRC they still exposed the ring to userspace.
> > 
> > How much would a ring write syscall cost? 1-2 microseconds, isn't it?
> 
> The per-packet budget at 10G is 62ns, a syscall just doesn't cut it.

If you give up on privacy and only insist on security
(can read all kernel memory, can't corrupt it), then
you only need the syscall to re-arm RX descriptors,
and these can be batched aggressively without impacting
latency.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 08:01:00AM -0700, Stephen Hemminger wrote:
> The per-driver ring method is what netmap did.

IIUC netmap has a standard format for descriptors, so it was slower: it
still had to do all networking in the kernel (it only bypasses
part of the networking stack), and it had to have a thread to
translate between software and hardware formats.

> The problem with that is that it forces infrastructure into already
> complex network driver.  It never was accepted.  There were also still
> security issues like time of check/time of use with the ring.

Right, because people do care about security.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 06:19:33PM +0300, Avi Kivity wrote:
> On 10/01/2015 06:11 PM, Michael S. Tsirkin wrote:
> >On Thu, Oct 01, 2015 at 02:32:19PM +0300, Avi Kivity wrote:
> >>>  We already agreed this kernel
> >>>is going to be tainted, and unsupportable.
> >>Yes.  So your only motivation in rejecting the patch is to get the author to
> >>write the ring translation patch and port it to all relevant drivers
> >>instead?
> >Not only that.
> >
> >To make sure users are aware they are doing insecure
> >things when using software poking at device BARs in sysfs.
> 
> I don't think you need to worry about that.  People who program DMA are
> aware of the damage it can cause.

People just install software and run it. They don't program DMA.

And I notice that no software (ab)using this seems to come with
documentation explaining the implications.

> If you want to be extra sure, have uio
> taint the kernel when bus mastering is enabled.

People don't notice that the kernel is tainted.  Denying module load will make
them notice.

-- 
MST


[dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 07:50:37AM -0700, Stephen Hemminger wrote:
> On Thu, 1 Oct 2015 11:33:06 +0300
> "Michael S. Tsirkin"  wrote:
> 
> > On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote:
> > > This driver allows using PCI device with Message Signalled Interrupt
> > > from userspace. The API is similar to the igb_uio driver used by the DPDK.
> > > Via ioctl it provides a mechanism to map MSI-X interrupts into event
> > > file descriptors similar to VFIO.
> > >
> > > VFIO is a better choice if IOMMU is available, but often userspace drivers
> > > have to work in environments where IOMMU support (real or emulated) is
> > > not available.  All UIO drivers that support DMA are not secure against
> > > rogue userspace applications programming DMA hardware to access
> > > private memory; this driver is no less secure than existing code.
> > > 
> > > Signed-off-by: Stephen Hemminger 
> > 
> > I don't think copying the igb_uio interface is a good idea.
> > What DPDK is doing with igb_uio (and indeed uio_pci_generic)
> > is abusing the sysfs BAR access to provide unlimited
> > access to hardware.
> > 
> > MSI messages are memory writes so any generic device capable
> > of MSI is capable of corrupting kernel memory.
> > This means that a bug in userspace will lead to kernel memory corruption
> > and crashes.  This is something distributions can't support.
> > 
> > uio_pci_generic is already abused like that, mostly
> > because when I wrote it, I didn't add enough protections
> > against using it with DMA capable devices,
> > and we can't go back and break working userspace.
> > But at least it does not bind to VFs which all of
> > them are capable of DMA.
> > 
> > The result of merging this driver will be userspace abusing the
> > sysfs BAR access with VFs as well, and we do not want that.
> > 
> > 
> > Just forwarding events is not enough to make a valid driver.
> > What is missing is a way to access the device in a safe way.
> > 
> > On a more positive note:
> > 
> > What would be a reasonable interface? One that does the following
> > in kernel:
> > 
> > 1. initializes device rings (can be in pinned userspace memory,
> >but can not be writeable by userspace), brings up interface link
> > 2. pins userspace memory (unless using e.g. hugetlbfs)
> > 3. gets request, make sure it's valid and belongs to
> >the correct task, put it in the ring
> > 4. in the reverse direction, notify userspace when buffers
> >are available in the ring
> > 5. notify userspace about MSI (what this driver does)
> > 
> > What userspace can be allowed to do:
> > 
> > format requests (e.g. transmit, receive) in userspace
> > read ring contents
> > 
> > What userspace can't be allowed to do:
> > 
> > access BAR
> > write rings
> > 
> > 
> > This means that the driver can not be a generic one,
> > and there will be a system call overhead when you
> > write the ring, but that's the price you have to
> > pay for ability to run on systems without an IOMMU.
> 
> I think I understand what you are proposing, but it really doesn't
> fit into the high speed userspace networking model.

I'm aware of the fact that currently the model does everything, including
bringing up the link, in user-space.
But there's really no justification for this.
Only data path things should be in userspace.

A userspace bug should not be able to do things like over-writing the
on-device EEPROM.


> 1. Device rings are device specific, can't be in a generic driver.

So that's more work, and it is not going to happen if people
can get by with insecure hacks.

> 2. DPDK uses huge memory.

Hugetlbfs? Don't see why this is an issue. Might make things simpler.

> 3. Performance requires all ring requests be done in pure userspace,
>(ie no syscalls)

Make only the TX ring writeable then. At least you won't be
able to corrupt the kernel memory.

> 4. Ditto, can't have kernel to userspace notification per packet

RX ring can be read-only, so userspace can read it directly.
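
A minimal userspace-side sketch of that split (the device node, offsets and sizes are hypothetical): the TX ring is mapped writable, the RX ring read-only, so descriptors can be scanned directly but RX refill has to go through a batched syscall.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define TX_RING_OFFSET 0x0000   /* assumed mmap offsets exposed by the */
#define RX_RING_OFFSET 0x1000   /* hypothetical driver                 */
#define RING_SIZE      0x1000

int main(void)
{
    int fd = open("/dev/uio_safe0", O_RDWR);   /* hypothetical device node */
    if (fd < 0) { perror("open"); return 1; }

    void *tx = mmap(NULL, RING_SIZE, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, TX_RING_OFFSET);
    void *rx = mmap(NULL, RING_SIZE, PROT_READ,   /* deliberately no PROT_WRITE */
                    MAP_SHARED, fd, RX_RING_OFFSET);
    if (tx == MAP_FAILED || rx == MAP_FAILED) { perror("mmap"); return 1; }

    /* userspace formats TX descriptors in place and scans RX descriptors
     * read-only; a stray write to rx faults instead of reaching the ring */

    munmap(tx, RING_SIZE);
    munmap(rx, RING_SIZE);
    close(fd);
    return 0;
}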

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 02:32:19PM +0300, Avi Kivity wrote:
> >  We already agreed this kernel
> >is going to be tainted, and unsupportable.
> 
> Yes.  So your only motivation in rejecting the patch is to get the author to
> write the ring translation patch and port it to all relevant drivers
> instead?

Not only that.

To make sure users are aware they are doing insecure
things when using software poking at device BARs in sysfs.
To avoid giving virtualization a bad name for security.
To get people to work on safe, maintainable solutions.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 01:07:13PM +0100, Bruce Richardson wrote:
> > > This in itself is going to use up
> > > a good proportion of the processing time, as well as that we have to 
> > > spend cycles
> > > copying the descriptors from one ring in memory to another. Given that 
> > > right now
> > > with the vector ixgbe driver, the cycle cost per packet of RX is just a 
> > > few dozen
> > > cycles on modern cores, every additional cycle (fraction of a nanosecond) 
> > > has
> > > an impact.
> > > 
> > > Regards,
> > > /Bruce
> > 
> > See above.  There is no need for that on data path. Only re-adding
> > buffers requires a system call.
> > 
> 
> Re-adding buffers is a key part of the data path! Ok, the fact that its only 
> on
> descriptor rearm does allow somewhat bigger batches,

That was the point, yes.

> but the whole point of having
> the kernel do this extra work you propose is to allow the kernel to scan and
> sanitize the physical addresses - and that will take a lot of cycles, 
> especially
> if it has to handle all the different descriptor formats of all the different 
> NICs,
> as has already been pointed out.
> 
> /Bruce

Well the driver would be per NIC, so there's only need to support
specific formats supported by a given NIC.

An alternative is to format the descriptors in kernel, based
on just the list of addresses. This seems cleaner, but I don't
know how efficient it would be.
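
A self-contained sketch of that alternative (the types and descriptor format are simplified and hypothetical): userspace supplies only offsets into a region it has already registered, and the kernel range-checks each one and emits the device-specific descriptor itself.

#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical state: a pre-pinned DMA region and a generic
 * descriptor ring, just enough to show the validation step. */
struct pinned_region { uint64_t dma_base; size_t size; size_t buf_len; };
struct nic_desc      { uint64_t addr; uint32_t len; uint32_t flags; };
struct nic_ring      { struct nic_desc *desc; unsigned int head, entries; };

static int refill_rx(const struct pinned_region *r, struct nic_ring *ring,
                     const uint64_t *offsets, unsigned int count)
{
    unsigned int i;

    for (i = 0; i < count; i++) {
        if (offsets[i] > r->size - r->buf_len)
            return -EINVAL;   /* reject anything outside the pinned region */

        /* the kernel, not userspace, picks the DMA address and
         * formats the device-specific descriptor */
        struct nic_desc *d = &ring->desc[ring->head++ % ring->entries];
        d->addr  = r->dma_base + offsets[i];
        d->len   = (uint32_t)r->buf_len;
        d->flags = 0;
    }
    return 0;
}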

Device vendors and dpdk developers are probably the best people to
figure out what's the best thing to do here.

But it looks like it's not going to happen unless security is made
a requirement for upstreaming code.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 02:20:37PM +0300, Avi Kivity wrote:
> 
> 
> On 10/01/2015 02:09 PM, Michael S. Tsirkin wrote:
> >On Thu, Oct 01, 2015 at 01:50:10PM +0300, Avi Kivity wrote:
> >>>>It's not just the lack of system calls, of course, the architecture is
> >>>>completely different.
> >>>Absolutely - I'm not saying move all of DPDK into kernel.
> >>>We just need to protect the RX rings so hardware does
> >>>not corrupt kernel memory.
> >>>
> >>>
> >>>Thinking about it some more, many devices
> >>>have separate rings for DMA: TX (device reads memory)
> >>>and RX (device writes memory).
> >>>With such devices, a mode where userspace can write TX ring
> >>>but not RX ring might make sense.
> >>I'm sure you can cause havoc just by reading, if you read from I/O memory.
> >Not talking about I/O memory here. These are device rings in RAM.
> 
> Right.  But you program them with DMA addresses, so the device can read
> another device's memory.

It can't if host has limited it to only DMA into guest RAM, which is
pretty common.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 02:20:37PM +0300, Avi Kivity wrote:
> People will just use out of tree drivers (dpdk has several already).  It's a
> pain, but nowhere near what you are proposing.

What's the issue with that? We already agreed this kernel
is going to be tainted, and unsupportable.


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 12:08:07PM +0100, Bruce Richardson wrote:
> On Thu, Oct 01, 2015 at 01:38:37PM +0300, Michael S. Tsirkin wrote:
> > On Thu, Oct 01, 2015 at 12:59:47PM +0300, Avi Kivity wrote:
> > > 
> > > 
> > > On 10/01/2015 12:55 PM, Michael S. Tsirkin wrote:
> > > >On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
> > > >>It's easy to claim that
> > > >>a solution is around the corner, only no one was looking for it, but the
> > > >>reality is that kernel bypass has been a solution for years for high
> > > >>performance users,
> > > >I never said that it's trivial.
> > > >
> > > >It's probably a lot of work. It's definitely more work than just abusing
> > > >sysfs.
> > > >
> > > >But it looks like a write system call into an eventfd is about 1.5
> > > >microseconds on my laptop. Even with a system call per packet, system
> > > >call overhead is not what makes DPDK drivers outperform Linux ones.
> > > >
> > > 
> > > 1.5 us = 0.6 Mpps per core limit.
> > 
> > Oh, I calculated it incorrectly. It's 0.15 us. So 6Mpps.
> > But for RX, you can batch a lot of packets.
> > 
> > You can see by now I'm not that good at benchmarking.
> > Here's what I wrote:
> > 
> > 
> > #include <sys/eventfd.h>
> > #include <stdint.h>
> > #include <unistd.h>
> > #include <stdio.h>
> > 
> > 
> > int main(int argc, char **argv)
> > {
> >     int e = eventfd(0, 0);
> >     uint64_t v = 1;
> > 
> >     int i;
> > 
> >     for (i = 0; i < 10000000; ++i) {
> >         write(e, &v, sizeof v);
> >     }
> > }
> > 
> > 
> > This takes 1.5 seconds to run on my laptop:
> > 
> > $ time ./a.out 
> > 
> > real0m1.507s
> > user0m0.179s
> > sys 0m1.328s
> > 
> > 
> > > dpdk performance is in the tens of
> > > millions of packets per system.
> > 
> > I think that's with a bunch of batching though.
> > 
> > > It's not just the lack of system calls, of course, the architecture is
> > > completely different.
> > 
> > Absolutely - I'm not saying move all of DPDK into kernel.
> > We just need to protect the RX rings so hardware does
> > not corrupt kernel memory.
> > 
> > 
> > Thinking about it some more, many devices
> > have separate rings for DMA: TX (device reads memory)
> > and RX (device writes memory).
> > With such devices, a mode where userspace can write TX ring
> > but not RX ring might make sense.
> > 
> > This will mean userspace might read kernel memory
> > through the device, but can not corrupt it.
> > 
> > That's already a big win!
> > 
> > And RX buffers do not have to be added one at a time.
> > If we assume 0.2usec per system call, batching some 100 buffers per
> > system call gives you 2 nano seconds overhead.  That seems quite
> > reasonable.
> > 
> Hi,
> 
> just to jump in a bit on this.
> 
> Batching of 100 packets is a very large batch, and will add to latency.



This is not on transmit or receive path!
This is only for re-adding buffers to the receive ring.
This batching should not add latency at all:


process rx:
    get packet
    packets[n] = alloc packet
    if (++n >= 100) {
        system call: add bufs(packets, n);
        n = 0;
    }
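
Roughly, in C (add_bufs() stands in for whatever batched re-arm syscall or ioctl such a driver would expose; it is not an existing interface):

#define BATCH 100

extern void *alloc_packet(void);                     /* hypothetical allocator */
extern long  add_bufs(void **bufs, unsigned int n);  /* hypothetical batched syscall */

static void *pending[BATCH];
static unsigned int npending;

static void on_rx_packet(void)
{
    pending[npending++] = alloc_packet();   /* replacement RX buffer */
    if (npending == BATCH) {
        add_bufs(pending, npending);        /* one syscall per BATCH packets */
        npending = 0;
    }
}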





> The
> standard batch size in DPDK right now is 32, and even that may be too high for
> applications in certain domains.
> 
> However, even with that 2ns of overhead calculation, I'd make a few additional
> points.
> * For DPDK, we are reasonably close to being able to do 40GB of IO - both RX 
> and TX on a single thread. 10GB of IO doesn't really stress a core any more. 
> For
> 40GB of small packet traffic, the packet arrival rate is 16.8ns, so even with 
> a
> huge batch size of 100 packets, your system call overhead on RX is taking 
> almost
> 12% of our processing time. For a batch size of 32 this overhead would rise to
> over 35% of our packet processing time.

As I said, yes, measureable, but not breaking the bank, and that's with
40GB which still are not widespread.
With 10GB and 100 packets, only 3% overhead.

> For 100G line rate, the packet arrival
> rate is just 6.7ns...

Hypervisors still have time get their act together and support IOMMUs
by the time 100G systems become widespread.

> * As well as this overhead from the system call itself, you are also omitting
> the overhead of scanning the RX descriptors.

I omit it because scanning descriptors can still be done in userspace,
just write-protect the RX ring page.


> This in itself is going to use up
> a good proportion of the processing time, as well as that we have to spend 
> cycles
> copying the descriptors from one ring in memory to another. Given that right 
> now
> with the vector ixgbe driver, the cycle cost per packet of RX is just a few 
> dozen
> cycles on modern cores, every additional cycle (fraction of a nanosecond) has
> an impact.
> 
> Regards,
> /Bruce

See above.  There is no need for that on data path. Only re-adding
buffers requires a system call.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 01:50:10PM +0300, Avi Kivity wrote:
> >
> >>It's not just the lack of system calls, of course, the architecture is
> >>completely different.
> >Absolutely - I'm not saying move all of DPDK into kernel.
> >We just need to protect the RX rings so hardware does
> >not corrupt kernel memory.
> >
> >
> >Thinking about it some more, many devices
> >have separate rings for DMA: TX (device reads memory)
> >and RX (device writes memory).
> >With such devices, a mode where userspace can write TX ring
> >but not RX ring might make sense.
> 
> I'm sure you can cause havoc just by reading, if you read from I/O memory.

Not talking about I/O memory here. These are device rings in RAM.

> >
> >This will mean userspace might read kernel memory
> >through the device, but can not corrupt it.
> >
> >That's already a big win!
> >
> >And RX buffers do not have to be added one at a time.
> >If we assume 0.2usec per system call, batching some 100 buffers per
> >system call gives you 2 nano seconds overhead.  That seems quite
> >reasonable.
> 
> You're ignoring the page table walk

Some caching strategy might work here.

> and other per-descriptor processing.

You probably can let userspace pre-format it all,
just validate addresses.

> Again^2, maybe this can work.  But it shouldn't block a patch enabling
> interrupt support of VFs.  After the ring proxy is available and proven for
> a few years, we can deprecate bus mastering from uio, and after a few more
> years remove it.

We are talking about DPDK patches posted in June 2015.  It's not some
software proven for years.  If Linux keeps enabling hacks, no one will
bother doing the right thing.  Upstream inclusion is the only carrot
Linux has to make people do the right thing.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 01:25:17PM +0300, Avi Kivity wrote:
> Why use a VF on a non-virtualized system?

1. So a userspace bug does not destroy your hardware
   (PFs generally assume trusted non-buggy drivers, VFs
generally don't).
2. So you can use a PF or another VF for regular networking.
3. So you can manage this system, to some level.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 12:59:47PM +0300, Avi Kivity wrote:
> 
> 
> On 10/01/2015 12:55 PM, Michael S. Tsirkin wrote:
> >On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
> >>It's easy to claim that
> >>a solution is around the corner, only no one was looking for it, but the
> >>reality is that kernel bypass has been a solution for years for high
> >>performance users,
> >I never said that it's trivial.
> >
> >It's probably a lot of work. It's definitely more work than just abusing
> >sysfs.
> >
> >But it looks like a write system call into an eventfd is about 1.5
> >microseconds on my laptop. Even with a system call per packet, system
> >call overhead is not what makes DPDK drivers outperform Linux ones.
> >
> 
> 1.5 us = 0.6 Mpps per core limit.

Oh, I calculated it incorrectly. It's 0.15 us. So 6Mpps.
But for RX, you can batch a lot of packets.

You can see by now I'm not that good at benchmarking.
Here's what I wrote:


#include <sys/eventfd.h>
#include <stdint.h>
#include <unistd.h>
#include <stdio.h>


int main(int argc, char **argv)
{
    int e = eventfd(0, 0);
    uint64_t v = 1;

    int i;

    for (i = 0; i < 10000000; ++i) {
        write(e, &v, sizeof v);
    }
}


This takes 1.5 seconds to run on my laptop:

$ time ./a.out 

real0m1.507s
user0m0.179s
sys 0m1.328s


> dpdk performance is in the tens of
> millions of packets per system.

I think that's with a bunch of batching though.

> It's not just the lack of system calls, of course, the architecture is
> completely different.

Absolutely - I'm not saying move all of DPDK into kernel.
We just need to protect the RX rings so hardware does
not corrupt kernel memory.


Thinking about it some more, many devices
have separate rings for DMA: TX (device reads memory)
and RX (device writes memory).
With such devices, a mode where userspace can write TX ring
but not RX ring might make sense.

This will mean userspace might read kernel memory
through the device, but can not corrupt it.

That's already a big win!

And RX buffers do not have to be added one at a time.
If we assume 0.2usec per system call, batching some 100 buffers per
system call gives you 2 nano seconds overhead.  That seems quite
reasonable.







-- 
MST


[dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 11:33:06AM +0300, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote:
> > This driver allows using PCI device with Message Signalled Interrupt
> > from userspace. The API is similar to the igb_uio driver used by the DPDK.
> > Via ioctl it provides a mechanism to map MSI-X interrupts into event
> > file descriptors similar to VFIO.
> >
> > VFIO is a better choice if IOMMU is available, but often userspace drivers
> > have to work in environments where IOMMU support (real or emulated) is
> > not available.  All UIO drivers that support DMA are not secure against
> > rogue userspace applications programming DMA hardware to access
> > private memory; this driver is no less secure than existing code.
> > 
> > Signed-off-by: Stephen Hemminger 
> 
> I don't think copying the igb_uio interface is a good idea.
> What DPDK is doing with igb_uio (and indeed uio_pci_generic)
> is abusing the sysfs BAR access to provide unlimited
> access to hardware.
> 
> MSI messages are memory writes so any generic device capable
> of MSI is capable of corrupting kernel memory.
> This means that a bug in userspace will lead to kernel memory corruption
> and crashes.  This is something distributions can't support.
> 
> uio_pci_generic is already abused like that, mostly
> because when I wrote it, I didn't add enough protections
> against using it with DMA capable devices,
> and we can't go back and break working userspace.
> But at least it does not bind to VFs which all of
> them are capable of DMA.
> 
> The result of merging this driver will be userspace abusing the
> sysfs BAR access with VFs as well, and we do not want that.
> 
> 
> Just forwarding events is not enough to make a valid driver.
> What is missing is a way to access the device in a safe way.
> 
> On a more positive note:
> 
> What would be a reasonable interface? One that does the following
> in kernel:
> 
> 1. initializes device rings (can be in pinned userspace memory,
>but can not be writeable by userspace), brings up interface link
> 2. pins userspace memory (unless using e.g. hugetlbfs)
> 3. gets request, make sure it's valid and belongs to
>the correct task, put it in the ring
> 4. in the reverse direction, notify userspace when buffers
>are available in the ring
> 5. notify userspace about MSI (what this driver does)
> 
> What userspace can be allowed to do:
> 
>   format requests (e.g. transmit, receive) in userspace
>   read ring contents
> 
> What userspace can't be allowed to do:
> 
>   access BAR
>   write rings

Thinking about it some more, many devices


Have separate rings for DMA: TX (device reads memory)
and RX (device writes memory).
With such devices, a mode where userspace can write TX ring
but not RX ring might make sense.

This will mean userspace might read kernel memory
through the device, but can not corrupt it.

That's already a big win!

And RX buffers do not have to be added one at a time.
If we assume 0.2usec per system call, batching some 100 buffers per
system call gives you 2 nano seconds overhead.  That seems quite
reasonable.





> 
> This means that the driver can not be a generic one,
> and there will be a system call overhead when you
> write the ring, but that's the price you have to
> pay for ability to run on systems without an IOMMU.
> 
> 
> 
> 
> > ---
> >  drivers/uio/Kconfig  |   9 ++
> >  drivers/uio/Makefile |   1 +
> >  drivers/uio/uio_msi.c| 378 
> > +++
> >  include/uapi/linux/Kbuild|   1 +
> >  include/uapi/linux/uio_msi.h |  22 +++
> >  5 files changed, 411 insertions(+)
> >  create mode 100644 drivers/uio/uio_msi.c
> >  create mode 100644 include/uapi/linux/uio_msi.h
> > 
> > diff --git a/drivers/uio/Kconfig b/drivers/uio/Kconfig
> > index 52c98ce..04adfa0 100644
> > --- a/drivers/uio/Kconfig
> > +++ b/drivers/uio/Kconfig
> > @@ -93,6 +93,15 @@ config UIO_PCI_GENERIC
> >   primarily, for virtualization scenarios.
> >   If you compile this as a module, it will be called uio_pci_generic.
> >  
> > +config UIO_PCI_MSI
> > +   tristate "Generic driver supporting MSI-x on PCI Express cards"
> > +   depends on PCI
> > +   help
> > + Generic driver that provides Message Signalled IRQ events
> > + similar to VFIO. If IOMMMU is available please use VFIO
> > + instead since it provides more security.
> > + If you compile this as a module, it will be called uio_msi.
> > +
> >  config UIO_NETX
> > tristate 

[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 12:53:14PM +0300, Avi Kivity wrote:
> Non-virtualized setups have an iommu available, but they can also use
> pci_uio_generic without patching if they like.

Not with VFs, they can't.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 12:43:53PM +0300, Avi Kivity wrote:
> >There were some tentative to get it for other (older) drivers, named
> >'bifurcated drivers', but it is stalled.
> 
> IIRC they still exposed the ring to userspace.

How much would a ring write syscall cost? 1-2 microseconds, isn't it?
Measureable, but it's not the end of the world.
ring read might be safe to allow.
Should buy us enough time until hypervisors support IOMMU.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 12:38:51PM +0300, Avi Kivity wrote:
> The sad thing is that you can do this since forever on a non-virtualized
> system, or on a virtualized system if you don't need interrupt support.  All
> you're doing is blocking interrupt support on virtualized systems.

True, Linux could do more to prevent this kind of abuse.
In fact IIRC, if you enable secureboot, it does exactly that.

A generic uio driver isn't a good interface because it relies on these
sysfs files. We are lucky it doesn't work for VFs; I don't think we
should do anything that relies on this interface in future applications.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
> It's easy to claim that
> a solution is around the corner, only no one was looking for it, but the
> reality is that kernel bypass has been a solution for years for high
> performance users,

I never said that it's trivial.

It's probably a lot of work. It's definitely more work than just abusing
sysfs.

But it looks like a write system call into an eventfd is about 1.5
microseconds on my laptop. Even with a system call per packet, system
call overhead is not what makes DPDK drivers outperform Linux ones.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
> even when they are some users
> prefer to avoid the performance penalty.

I don't think there's a measureable penalty from passing through the
IOMMU, as long as mappings are mostly static (i.e. iommu=pt).  I sure
never saw any numbers that show such.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 12:15:49PM +0300, Avi Kivity wrote:
> What userspace can't be allowed to do:
> 
> access BAR
> write rings
> 
> 
> 
> 
> It can access the BAR by mmap()ing the resourceN files under sysfs. You're
> not denying userspace the ability to oops the kernel, just the ability to
> do useful things with hardware.


This interface has to stay there to support existing applications.  A
variety of measures (selinux, secureboot) can be used to make sure
modern ones to not get to touch it. Most distributions enable
some or all of these by default.

And it doesn't mean modern drivers can do this kind of thing.

Look, without an IOMMU, sysfs can not be used securely:
you need some other interface. This is what this driver is missing.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 11:44:28AM +0300, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 11:40:16PM +0300, Michael S. Tsirkin wrote:
> > > And for what, to prevent
> > > root from touching memory via dma that they can access in a million other
> > > ways?
> > 
> > So one can be reasonably sure a kernel oops is not a result of a
> > userspace bug.
> 
> Actually, I thought about this overnight, and  it should be possible to
> drive it securely from userspace, without hypervisor changes.
> 
> See
> 
> https://mid.gmane.org/20151001104505-mutt-send-email-mst at redhat.com

Ouch, looks like gmane doesn't do https. Sorry, this is the correct
link:

http://mid.gmane.org/20151001104505-mutt-send-email-mst at redhat.com

> 
> 
> > -- 
> > MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 11:52:26AM +0300, Avi Kivity wrote:
> I still don't understand your objection to the patch:
> 
> 
> MSI messages are memory writes so any generic device capable
> of MSI is capable of corrupting kernel memory.
> This means that a bug in userspace will lead to kernel memory corruption
> and crashes.  This is something distributions can't support.
> 
> 
> If a distribution feels it can't support this configuration, it can disable
> the uio_pci_generic driver, or refuse to support tainted kernels. If it feels
> it can (and many distributions are starting to support dpdk), then you're just
> denying it the ability to serve its users.

I don't, and can't, deny users anything.  I merely think upstream should
avoid putting this driver in-tree.  By doing that, driver writers will
be pushed to develop solutions that can't crash the kernel.

I pointed out one way to build it, there are sure to be more.

As far as I could see, without this kind of motivation, people do not
even want to try.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 11:40:16PM +0300, Michael S. Tsirkin wrote:
> > And for what, to prevent
> > root from touching memory via dma that they can access in a million other
> > ways?
> 
> So one can be reasonably sure a kernel oops is not a result of a
> userspace bug.

Actually, I thought about this overnight, and  it should be possible to
drive it securely from userspace, without hypervisor changes.

See

https://mid.gmane.org/20151001104505-mutt-send-email-mst at redhat.com



> -- 
> MST


[dpdk-dev] [PATCH 0/2] uio_msi: device driver

2015-10-01 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 03:28:56PM -0700, Stephen Hemminger wrote:
> This is a new UIO device driver to allow supporting MSI-X and MSI devices
> in userspace.  It has been used in environments like VMware and older versions
> of QEMU/KVM where no IOMMU support is available.
> 
> Stephen Hemminger (2):
> 
> *** BLURB HERE ***

I think this needs some more work to make sure
userspace can drive devices without a risk of
crashing the kernel even when there's no IOMMU.

I sent some comments in response to the patch itself.

Please copy me on future versions of this driver.

Thanks!

> Stephen Hemminger (2):
>   uio: add support for ioctls
>   uio: new driver to support PCI MSI-X
> 
>  drivers/uio/Kconfig  |   9 ++
>  drivers/uio/Makefile |   1 +
>  drivers/uio/uio.c|  15 ++
>  drivers/uio/uio_msi.c| 378 +++
>  include/linux/uio_driver.h   |   3 +
>  include/uapi/linux/Kbuild|   1 +
>  include/uapi/linux/uio_msi.h |  22 +++
>  7 files changed, 429 insertions(+)
>  create mode 100644 drivers/uio/uio_msi.c
>  create mode 100644 include/uapi/linux/uio_msi.h
> 
> -- 
> 2.1.4
> 


[dpdk-dev] [PATCH 2/2] uio: new driver to support PCI MSI-X

2015-10-01 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 03:28:58PM -0700, Stephen Hemminger wrote:
> This driver allows using PCI device with Message Signalled Interrupt
> from userspace. The API is similar to the igb_uio driver used by the DPDK.
> Via ioctl it provides a mechanism to map MSI-X interrupts into event
> file descriptors similar to VFIO.
>
> VFIO is a better choice if IOMMU is available, but often userspace drivers
> have to work in environments where IOMMU support (real or emulated) is
> not available.  All UIO drivers that support DMA are not secure against
> rogue userspace applications programming DMA hardware to access
> private memory; this driver is no less secure than existing code.
> 
> Signed-off-by: Stephen Hemminger 

I don't think copying the igb_uio interface is a good idea.
What DPDK is doing with igb_uio (and indeed uio_pci_generic)
is abusing the sysfs BAR access to provide unlimited
access to hardware.

MSI messages are memory writes so any generic device capable
of MSI is capable of corrupting kernel memory.
This means that a bug in userspace will lead to kernel memory corruption
and crashes.  This is something distributions can't support.

uio_pci_generic is already abused like that, mostly
because when I wrote it, I didn't add enough protections
against using it with DMA-capable devices,
and we can't go back and break working userspace.
But at least it does not bind to VFs, all of which are
capable of DMA.

The result of merging this driver will be userspace abusing the
sysfs BAR access with VFs as well, and we do not want that.


Just forwarding events is not enough to make a valid driver.
What is missing is a way to access the device in a safe way.

On a more positive note:

What would be a reasonable interface? One that does the following
in the kernel:

1. initializes device rings (these can be in pinned userspace memory,
   but must not be writable by userspace), brings up the interface link
2. pins userspace memory (unless using e.g. hugetlbfs)
3. gets a request, makes sure it's valid and belongs to
   the correct task, puts it in the ring
4. in the reverse direction, notifies userspace when buffers
   are available in the ring
5. notifies userspace about MSI (what this driver does)

What userspace can be allowed to do:

format requests (e.g. transmit, receive) in userspace
read ring contents

What userspace can't be allowed to do:

access BAR
write rings


This means that the driver cannot be a generic one,
and there will be system call overhead when you
write the ring, but that's the price you have to
pay for the ability to run on systems without an IOMMU.
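
To make the shape of this concrete, below is a purely hypothetical sketch of
the userspace side. None of the device node, ioctl numbers or structures exist
anywhere; they are invented only to illustrate the flow described above
(format the request in userspace, submit it through a kernel-validated call,
wait for completion via an MSI-backed eventfd).

#include <fcntl.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical request descriptor; the kernel would validate it before
 * copying it into a ring that userspace cannot write directly. */
struct hyp_tx_req {
    uint64_t buf_addr;   /* must point into memory pinned for this task */
    uint32_t len;
    uint32_t flags;
};

/* Hypothetical ioctls on a hypothetical /dev/hyp-net0 character device. */
#define HYP_SUBMIT_TX  _IOW('H', 1, struct hyp_tx_req)
#define HYP_GET_IRQFD  _IOR('H', 2, int)

static int send_packet(int dev_fd, void *pkt, uint32_t len)
{
    struct hyp_tx_req req = {
        .buf_addr = (uintptr_t)pkt,
        .len      = len,
    };

    /* One system call per submission: the kernel checks that
     * buf_addr/len belong to the calling task before touching the
     * ring, which is the price paid for running without an IOMMU. */
    return ioctl(dev_fd, HYP_SUBMIT_TX, &req);
}

static int wait_for_irq(int dev_fd)
{
    int irqfd;
    uint64_t events;

    /* The MSI-to-eventfd part is the piece the posted driver already
     * provides; everything else here is made up for illustration. */
    if (ioctl(dev_fd, HYP_GET_IRQFD, &irqfd) < 0)
        return -1;
    if (read(irqfd, &events, sizeof(events)) != sizeof(events))
        return -1;
    return 0;
}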




> ---
>  drivers/uio/Kconfig  |   9 ++
>  drivers/uio/Makefile |   1 +
>  drivers/uio/uio_msi.c| 378 +++
>  include/uapi/linux/Kbuild|   1 +
>  include/uapi/linux/uio_msi.h |  22 +++
>  5 files changed, 411 insertions(+)
>  create mode 100644 drivers/uio/uio_msi.c
>  create mode 100644 include/uapi/linux/uio_msi.h
> 
> diff --git a/drivers/uio/Kconfig b/drivers/uio/Kconfig
> index 52c98ce..04adfa0 100644
> --- a/drivers/uio/Kconfig
> +++ b/drivers/uio/Kconfig
> @@ -93,6 +93,15 @@ config UIO_PCI_GENERIC
> primarily, for virtualization scenarios.
> If you compile this as a module, it will be called uio_pci_generic.
>  
> +config UIO_PCI_MSI
> +   tristate "Generic driver supporting MSI-x on PCI Express cards"
> + depends on PCI
> + help
> +   Generic driver that provides Message Signalled IRQ events
> +   similar to VFIO. If IOMMMU is available please use VFIO
> +   instead since it provides more security.
> +   If you compile this as a module, it will be called uio_msi.
> +
>  config UIO_NETX
>   tristate "Hilscher NetX Card driver"
>   depends on PCI
> diff --git a/drivers/uio/Makefile b/drivers/uio/Makefile
> index 8560dad..62fc44b 100644
> --- a/drivers/uio/Makefile
> +++ b/drivers/uio/Makefile
> @@ -9,3 +9,4 @@ obj-$(CONFIG_UIO_NETX)+= uio_netx.o
>  obj-$(CONFIG_UIO_PRUSS) += uio_pruss.o
>  obj-$(CONFIG_UIO_MF624) += uio_mf624.o
>  obj-$(CONFIG_UIO_FSL_ELBC_GPCM)  += uio_fsl_elbc_gpcm.o
> +obj-$(CONFIG_UIO_PCI_MSI)+= uio_msi.o
> diff --git a/drivers/uio/uio_msi.c b/drivers/uio/uio_msi.c
> new file mode 100644
> index 000..802b5c4
> --- /dev/null
> +++ b/drivers/uio/uio_msi.c
> @@ -0,0 +1,378 @@
> +/*-
> + *
> + * Copyright (c) 2015 by Brocade Communications Systems, Inc.
> + * Author: Stephen Hemminger 
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 only.
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define DRIVER_VERSION   "0.1.1"
> +#define MAX_MSIX_VECTORS 64
> +
> +/* MSI-X vector information */
> +struct uio_msi_pci_dev {
> + struct uio_info info;   /* UIO driver info */
> + 

[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 02:36:48PM -0700, Stephen Hemminger wrote:
> On Wed, 30 Sep 2015 23:09:33 +0300
> Vlad Zolotarov  wrote:
> 
> > 
> > 
> > On 09/30/15 22:39, Michael S. Tsirkin wrote:
> > > On Wed, Sep 30, 2015 at 10:06:52PM +0300, Vlad Zolotarov wrote:
> > >>>> How would iommu
> > >>>> virtualization change anything?
> > >>> Kernel can use an iommu to limit device access to memory of
> > >>> the controlling application.
> > >> Ok, this is obvious but what it has to do with enabling using MSI/MSI-X
> > >> interrupts support in uio_pci_generic? kernel may continue to limit the
> > >> above access with this support as well.
> > > It could maybe. So if you write a patch to allow MSI by at the same time
> > > creating an isolated IOMMU group and blocking DMA from device in
> > > question anywhere, that sounds reasonable.
> > 
> > No, I'm only planning to add MSI and MSI-X interrupts support for 
> > uio_pci_generic device.
> > The rest mentioned above should naturally be a matter of a different 
> > patch and writing it is orthogonal to the patch I'm working on as has 
> > been extensively discussed in this thread.
> > 
> > >
> > 
> 
> I have a generic MSI and MSI-X driver (posted earlier on this list).
> About to post to upstream kernel.

If Linux holds out and refuses to support insecure interfaces,
hypervisor vendors will add secure ones. If Linux lets them ignore guest
security, they will.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 06:36:17PM +0300, Avi Kivity wrote:
> As it happens, you're removing the functionality from the users who have no
> other option.  They can't use vfio because it doesn't work on virtualized
> setups.

...

> Root can already do anything.

I think there's a contradiction between the two claims above.

>  So what security issue is there?

A buggy userspace can and will corrupt kernel memory.

...

> And for what, to prevent
> root from touching memory via dma that they can access in a million other
> ways?

So one can be reasonably sure a kernel oops is not a result of a
userspace bug.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 11:00:49PM +0300, Gleb Natapov wrote:
> > You are increasing interrupt latency by a huge factor by channeling
> > interrupts through a scheduler.  Let user install an
> > interrupt handler function, and be done with it.
> > 
> Interrupt latency is not always hugely important. If you enter interrupt
> mode only when idle hundred more us on a first packet will not kill you.

It certainly affects worst-case latency.  And if you lower interrupt
latency, you can go idle faster, so it affects power too.

> If
> interrupt latency is important then uio may be not the right solution,
> but then neither is vfio.

That's what I'm saying, if you don't need memory isolation you can do
better than just slightly tweak existing drivers.

> --
>   Gleb.


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 10:06:52PM +0300, Vlad Zolotarov wrote:
> >>How would iommu
> >>virtualization change anything?
> >Kernel can use an iommu to limit device access to memory of
> >the controlling application.
> 
> Ok, this is obvious but what it has to do with enabling using MSI/MSI-X
> interrupts support in uio_pci_generic? kernel may continue to limit the
> above access with this support as well.

It could, maybe. So if you write a patch that allows MSI while at the same
time creating an isolated IOMMU group and blocking DMA from the device in
question anywhere, that sounds reasonable.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 09:15:56PM +0300, Vlad Zolotarov wrote:
> 
> 
> On 09/30/15 18:26, Michael S. Tsirkin wrote:
> >On Wed, Sep 30, 2015 at 03:50:09PM +0300, Vlad Zolotarov wrote:
> >>How not virtualizing iommu forces "all or nothing" approach?
> >Looks like you can't limit an assigned device to only access part of
> >guest memory that belongs to a given process.  Either let it access all
> >of guest memory ("all") or don't assign the device ("nothing").
> 
> Ok. A question then: can u limit the assigned device to only access part of
> the guest memory even if iommu was virtualized?

That's exactly what an iommu does: limit the device's I/O access to memory.

> How would iommu
> virtualization change anything?

Kernel can use an iommu to limit device access to memory of
the controlling application.

> And why do we care about an assigned device
> to be able to access all Guest memory?

Because we want to be reasonably sure a kernel memory corruption
is not a result of a bug in a userspace application.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 10:43:04AM -0700, Stephen Hemminger wrote:
> On Wed, 30 Sep 2015 20:39:43 +0300
> "Michael S. Tsirkin"  wrote:
> 
> > On Wed, Sep 30, 2015 at 10:28:07AM -0700, Stephen Hemminger wrote:
> > > On Wed, 30 Sep 2015 13:37:22 +0300
> > > Vlad Zolotarov  wrote:
> > > 
> > > > 
> > > > 
> > > > On 09/30/15 00:49, Michael S. Tsirkin wrote:
> > > > > On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote:
> > > > >> On Tue, 29 Sep 2015 23:54:54 +0300
> > > > >> "Michael S. Tsirkin"  wrote:
> > > > >>
> > > > >>> On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
> > > > >>>> The security breach motivation u brought in "[RFC PATCH] uio:
> > > > >>>> uio_pci_generic: Add support for MSI interrupts" thread seems a 
> > > > >>>> bit weak
> > > > >>>> since one u let the userland access to the bar it may do any funny 
> > > > >>>> thing
> > > > >>>> using the DMA engine of the device. This kind of stuff should be 
> > > > >>>> prevented
> > > > >>>> using the iommu and if it's enabled then any funny tricks using 
> > > > >>>> MSI/MSI-X
> > > > >>>> configuration will be prevented too.
> > > > >>>>
> > > > >>>> I'm about to send the patch to main Linux mailing list. Let's 
> > > > >>>> continue this
> > > > >>>> discussion there.
> > > > >>>>
> > > > >>> Basically UIO shouldn't be used with devices capable of DMA.
> > > > >>> Use VFIO for that (yes, this implies an emulated or PV IOMMU).
> > > > 
> > > > If there is an IOMMU in the picture there shouldn't be any problem to 
> > > > use UIO with DMA capable devices.
> > > > 
> > > > >>> I don't think this can change.
> > > > >> Given there is no PV IOMMU and even if there was it would be too 
> > > > >> slow for DPDK
> > > > >> use, I can't accept that.
> > > > > QEMU does allow emulating an iommu.
> > > > 
> > > > Amazon's EC2 xen HV doesn't. At least today. Therefore VFIO is not an 
> > > > option there. And again, it's a general issue not DPDK specific.
> > > > Today one has to develop some proprietary modules (like igb_uio) to 
> > > > workaround the issue and this is lame. IMHO uio_pci_generic should
> > > > be fixed to be able to properly work within any virtualized environment 
> > > > and not only with KVM.
> > > > 
> > > 
> > > Also VMware (bigger problem) has no IOMMU emulation.
> > > Other environments as well (Windriver, GCE) have no IOMMU.
> > 
> > Because the use-case of userspace drivers is not important enough?
> > Without an IOMMU, there's no way to have secure userspace drivers.
> 
> Look at Cloudius, there is no necessity of security in guest.

It's an interesting concept, isn't it?

So why not do what Cloudius does, and run this task code in ring 0 then,
allocating all memory in the kernel range?

You are increasing interrupt latency by a huge factor by channeling
interrupts through a scheduler.  Let the user install an interrupt
handler function, and be done with it.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 10:28:07AM -0700, Stephen Hemminger wrote:
> On Wed, 30 Sep 2015 13:37:22 +0300
> Vlad Zolotarov  wrote:
> 
> > 
> > 
> > On 09/30/15 00:49, Michael S. Tsirkin wrote:
> > > On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote:
> > >> On Tue, 29 Sep 2015 23:54:54 +0300
> > >> "Michael S. Tsirkin"  wrote:
> > >>
> > >>> On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
> > >>>> The security breach motivation u brought in "[RFC PATCH] uio:
> > >>>> uio_pci_generic: Add support for MSI interrupts" thread seems a bit 
> > >>>> weak
> > >>>> since one u let the userland access to the bar it may do any funny 
> > >>>> thing
> > >>>> using the DMA engine of the device. This kind of stuff should be 
> > >>>> prevented
> > >>>> using the iommu and if it's enabled then any funny tricks using 
> > >>>> MSI/MSI-X
> > >>>> configuration will be prevented too.
> > >>>>
> > >>>> I'm about to send the patch to main Linux mailing list. Let's continue 
> > >>>> this
> > >>>> discussion there.
> > >>>>
> > >>> Basically UIO shouldn't be used with devices capable of DMA.
> > >>> Use VFIO for that (yes, this implies an emulated or PV IOMMU).
> > 
> > If there is an IOMMU in the picture there shouldn't be any problem to 
> > use UIO with DMA capable devices.
> > 
> > >>> I don't think this can change.
> > >> Given there is no PV IOMMU and even if there was it would be too slow 
> > >> for DPDK
> > >> use, I can't accept that.
> > > QEMU does allow emulating an iommu.
> > 
> > Amazon's EC2 xen HV doesn't. At least today. Therefore VFIO is not an 
> > option there. And again, it's a general issue not DPDK specific.
> > Today one has to develop some proprietary modules (like igb_uio) to 
> > workaround the issue and this is lame. IMHO uio_pci_generic should
> > be fixed to be able to properly work within any virtualized environment 
> > and not only with KVM.
> > 
> 
> Also VMware (bigger problem) has no IOMMU emulation.
> Other environments as well (Windriver, GCE) have no IOMMU.

Because the use-case of userspace drivers is not important enough?
Without an IOMMU, there's no way to have secure userspace drivers.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 03:50:09PM +0300, Vlad Zolotarov wrote:
> How not virtualizing iommu forces "all or nothing" approach?

Looks like you can't limit an assigned device to only access part of
guest memory that belongs to a given process.  Either let it access all
of guest memory ("all") or don't assign the device ("nothing").

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 05:53:54PM +0300, Avi Kivity wrote:
> On 09/30/2015 05:39 PM, Michael S. Tsirkin wrote:
> >On Wed, Sep 30, 2015 at 04:05:40PM +0300, Avi Kivity wrote:
> >>
> >>On 09/30/2015 03:27 PM, Michael S. Tsirkin wrote:
> >>>On Wed, Sep 30, 2015 at 03:16:04PM +0300, Vlad Zolotarov wrote:
> >>>>On 09/30/15 15:03, Michael S. Tsirkin wrote:
> >>>>>On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote:
> >>>>>>On 09/30/15 14:41, Michael S. Tsirkin wrote:
> >>>>>>>On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
> >>>>>>>>The whole idea is to bypass kernel. Especially for networking...
> >>>>>>>... on dumb hardware that doesn't support doing that securely.
> >>>>>>On a very capable HW that supports whatever security requirements needed
> >>>>>>(e.g. 82599 Intel's SR-IOV VF devices).
> >>>>>Network card type is irrelevant as long as you do not have an IOMMU,
> >>>>>otherwise you would just use e.g. VFIO.
> >>>>Sorry, but I don't follow your logic here - Amazon EC2 environment is a
> >>>>example where there *is* iommu but it's not virtualized
> >>>>and thus VFIO is
> >>>>useless and there is an option to use directly assigned SR-IOV networking
> >>>>device there where using the kernel drivers impose a performance impact
> >>>>compared to user space UIO-based user space kernel bypass mode of usage. 
> >>>>How
> >>>>is it irrelevant? Could u, pls, clarify your point?
> >>>>
> >>>So it's not even dumb hardware, it's another piece of software
> >>>that forces an "all or nothing" approach where either
> >>>device has access to all VM memory, or none.
> >>>And this, unfortunately, leaves you with no secure way to
> >>>allow userspace drivers.
> >>Some setups don't need security (they are single-user, single application).
> >>But do need a lot of performance (like 5X-10X performance).  An example is
> >>OpenVSwitch, security doesn't help it at all and if you force it to use the
> >>kernel drivers you cripple it.
> >We'd have to see there are actual users that need this.  So far, dpdk
> >seems like the only one,
> 
> dpdk is a whole class of users.  It's not a specific application.
> 
> >  and it wants to use UIO for slow path stuff
> >like polling link status.  Why this needs kernel bypass support, I don't
> >know.  I asked, and got no answer.
> 
> First, it's more than link status.  dpdk also has an interrupt mode, which
> applications can fall back to when when the load is light in order to save
> power (and in order not to get support calls about 100% cpu when idle).

Aha, looks like it appeared in June. Interesting, thanks for the info.

> Even for link status, you don't want to poll for that, because accessing
> device registers is expensive.  An interrupt is the best approach for rare
> events like link changed.

Yea, but you probably can get by with a timer for that, even if it's ugly.
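
Something along these lines, for example; read_link_status() is a
hypothetical stand-in for whatever register read or library call actually
reports link state, the rest is plain timerfd.

#include <stdint.h>
#include <stdio.h>
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical stand-in for the real link-state query. */
static int read_link_status(void)
{
    return 1;
}

int main(void)
{
    int tfd = timerfd_create(CLOCK_MONOTONIC, 0);
    if (tfd < 0) {
        perror("timerfd_create");
        return 1;
    }

    /* Check once per second; device registers are only touched on the
     * tick, never in the packet path. */
    struct itimerspec its = {
        .it_interval = { .tv_sec = 1 },
        .it_value    = { .tv_sec = 1 },
    };
    timerfd_settime(tfd, 0, &its, NULL);

    for (;;) {
        uint64_t expirations;
        if (read(tfd, &expirations, sizeof(expirations)) != sizeof(expirations))
            break;
        printf("link %s\n", read_link_status() ? "up" : "down");
    }

    close(tfd);
    return 0;
}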

> >>Also, I'm root.  I can do anything I like, including loading a patched
> >>pci_uio_generic.  You're not providing _any_ security, you're simply making
> >>life harder for users.
> >Maybe that's true on your system. But I guess you know that's not true
> >for everyone, not in 2015.
> 
> Why is it not true?  if I'm root, I can do anything I like to my
> system, and everyone is root in 2015.  I can access the BARs directly
> and program DMA, how am I more secure by uio not allowing me to setup
> msix?

That's not the point.  The point always was that using uio for these
devices (capable of DMA, in particular of msix) isn't possible in a
secure way. And yes, if the same device happens to also do interrupts, UIO
does not reject it as it probably should, and we can't change this
without breaking some working setups.  But this doesn't mean we should
add more setups like this that we'll then be forced to maintain.


> Non-root users are already secured by their inability to load the module,
> and by the device permissions.
> 
> >
> >>>So it makes even less sense to add insecure work-arounds in the kernel.
> >>>It seems quite likely that by the time the new kernel reaches
> >>>production X years from now, EC2 will have a virtual iommu.
> >>I can adopt a new kernel tomorrow.  I have no influence on EC2.
> >>
> >>
> >Xen grant tables sound like they could be the right interface
> >for EC2.  google search for "grant tables iommu" immediately gives me:
> >http://lists.xenproject.org/archives/html/xen-devel/2014-04/msg00963.html
> >Maybe latest Xen is already doing the right thing, and it's just the
> >question of making VFIO use that.
> >
> 
> grant tables only work for virtual devices, not physical devices.

Why not? That's what the patches above seem to do.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 04:05:40PM +0300, Avi Kivity wrote:
> 
> 
> On 09/30/2015 03:27 PM, Michael S. Tsirkin wrote:
> >On Wed, Sep 30, 2015 at 03:16:04PM +0300, Vlad Zolotarov wrote:
> >>
> >>On 09/30/15 15:03, Michael S. Tsirkin wrote:
> >>>On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote:
> >>>>On 09/30/15 14:41, Michael S. Tsirkin wrote:
> >>>>>On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
> >>>>>>The whole idea is to bypass kernel. Especially for networking...
> >>>>>... on dumb hardware that doesn't support doing that securely.
> >>>>On a very capable HW that supports whatever security requirements needed
> >>>>(e.g. 82599 Intel's SR-IOV VF devices).
> >>>Network card type is irrelevant as long as you do not have an IOMMU,
> >>>otherwise you would just use e.g. VFIO.
> >>Sorry, but I don't follow your logic here - Amazon EC2 environment is a
> >>example where there *is* iommu but it's not virtualized
> >>and thus VFIO is
> >>useless and there is an option to use directly assigned SR-IOV networking
> >>device there where using the kernel drivers impose a performance impact
> >>compared to user space UIO-based user space kernel bypass mode of usage. How
> >>is it irrelevant? Could u, pls, clarify your point?
> >>
> >So it's not even dumb hardware, it's another piece of software
> >that forces an "all or nothing" approach where either
> >device has access to all VM memory, or none.
> >And this, unfortunately, leaves you with no secure way to
> >allow userspace drivers.
> 
> Some setups don't need security (they are single-user, single application).
> But do need a lot of performance (like 5X-10X performance).  An example is
> OpenVSwitch, security doesn't help it at all and if you force it to use the
> kernel drivers you cripple it.

We'd have to see there are actual users that need this.  So far, dpdk
seems like the only one, and it wants to use UIO for slow path stuff
like polling link status.  Why this needs kernel bypass support, I don't
know.  I asked, and got no answer.

> 
> Also, I'm root.  I can do anything I like, including loading a patched
> pci_uio_generic.  You're not providing _any_ security, you're simply making
> life harder for users.

Maybe that's true on your system. But I guess you know that's not true
for everyone, not in 2015.

> >So it makes even less sense to add insecure work-arounds in the kernel.
> >It seems quite likely that by the time the new kernel reaches
> >production X years from now, EC2 will have a virtual iommu.
> 
> I can adopt a new kernel tomorrow.  I have no influence on EC2.
> 
>

Xen grant tables sound like they could be the right interface
for EC2.  google search for "grant tables iommu" immediately gives me:
http://lists.xenproject.org/archives/html/xen-devel/2014-04/msg00963.html
Maybe latest Xen is already doing the right thing, and it's just the
question of making VFIO use that.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 03:16:04PM +0300, Vlad Zolotarov wrote:
> 
> 
> On 09/30/15 15:03, Michael S. Tsirkin wrote:
> >On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote:
> >>
> >>On 09/30/15 14:41, Michael S. Tsirkin wrote:
> >>>On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
> >>>>The whole idea is to bypass kernel. Especially for networking...
> >>>... on dumb hardware that doesn't support doing that securely.
> >>On a very capable HW that supports whatever security requirements needed
> >>(e.g. 82599 Intel's SR-IOV VF devices).
> >Network card type is irrelevant as long as you do not have an IOMMU,
> >otherwise you would just use e.g. VFIO.
> 
> Sorry, but I don't follow your logic here - Amazon EC2 environment is a
> example where there *is* iommu but it's not virtualized
> and thus VFIO is
> useless and there is an option to use directly assigned SR-IOV networking
> device there where using the kernel drivers impose a performance impact
> compared to user space UIO-based user space kernel bypass mode of usage. How
> is it irrelevant? Could u, pls, clarify your point?
> 

So it's not even dumb hardware, it's another piece of software
that forces an "all or nothing" approach where either
device has access to all VM memory, or none.
And this, unfortunately, leaves you with no secure way to
allow userspace drivers.

So it makes even less sense to add insecure work-arounds in the kernel.
It seems quite likely that by the time the new kernel reaches
production X years from now, EC2 will have a virtual iommu.


> >
> >>>Colour me unimpressed.
> >>>


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote:
> 
> 
> On 09/30/15 14:41, Michael S. Tsirkin wrote:
> >On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
> >>The whole idea is to bypass kernel. Especially for networking...
> >... on dumb hardware that doesn't support doing that securely.
> 
> On a very capable HW that supports whatever security requirements needed
> (e.g. 82599 Intel's SR-IOV VF devices).

Network card type is irrelevant as long as you do not have an IOMMU,
otherwise you would just use e.g. VFIO.

> >Colour me unimpressed.
> >


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
> The whole idea is to bypass kernel. Especially for networking...

... on dumb hardware that doesn't support doing that securely.
Colour me unimpressed.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 01:37:22PM +0300, Vlad Zolotarov wrote:
> 
> 
> On 09/30/15 00:49, Michael S. Tsirkin wrote:
> >On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote:
> >>On Tue, 29 Sep 2015 23:54:54 +0300
> >>"Michael S. Tsirkin"  wrote:
> >>
> >>>On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
> >>>>The security breach motivation u brought in "[RFC PATCH] uio:
> >>>>uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak
> >>>>since one u let the userland access to the bar it may do any funny thing
> >>>>using the DMA engine of the device. This kind of stuff should be prevented
> >>>>using the iommu and if it's enabled then any funny tricks using MSI/MSI-X
> >>>>configuration will be prevented too.
> >>>>
> >>>>I'm about to send the patch to main Linux mailing list. Let's continue 
> >>>>this
> >>>>discussion there.
> >>>Basically UIO shouldn't be used with devices capable of DMA.
> >>>Use VFIO for that (yes, this implies an emulated or PV IOMMU).
> 
> If there is an IOMMU in the picture there shouldn't be any problem to use
> UIO with DMA capable devices.

UIO doesn't enforce the IOMMU though. That's why it's not a good fit.

> >>>I don't think this can change.
> >>Given there is no PV IOMMU and even if there was it would be too slow for 
> >>DPDK
> >>use, I can't accept that.
> >QEMU does allow emulating an iommu.
> 
> Amazon's EC2 xen HV doesn't. At least today. Therefore VFIO is not an option
> there.

Not only that, a bunch of boxes have their IOMMU disabled.
So for such systems, you can't have userspace poking at
device registers. You need a kernel driver to validate
userspace requests before passing them on to devices.

> And again, it's a general issue not DPDK specific.
> Today one has to develop some proprietary modules (like igb_uio) to
> workaround the issue and this is lame.

Of course it is lame. So don't bypass the kernel then, use the upstream drivers.

> IMHO uio_pci_generic should
> be fixed to be able to properly work within any virtualized environment and
> not only with KVM.

The motivation for UIO is pretty clear:

For many types of devices, creating a Linux kernel driver is
overkill.  All that is really needed is some way to handle an
interrupt and provide access to the memory space of the
device.  The logic of controlling the device does not
necessarily have to be within the kernel, as the device does
not need to take advantage of any of other resources that the
kernel provides.  One such common class of devices that are
like this are for industrial I/O cards.

Devices doing DMA do need to take advantage of memory protection
that the kernel provides.
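
For comparison, this is roughly what the VFIO path looks like from userspace
when an IOMMU is present: the kernel only installs IOMMU mappings for memory
the calling process actually owns. A minimal sketch, with the group number
and the IOVA as placeholders and most error handling omitted:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/vfio.h>

int main(void)
{
    int container = open("/dev/vfio/vfio", O_RDWR);
    int group = open("/dev/vfio/26", O_RDWR);   /* placeholder group number */
    if (container < 0 || group < 0) {
        perror("open");
        return 1;
    }

    struct vfio_group_status status = { .argsz = sizeof(status) };
    ioctl(group, VFIO_GROUP_GET_STATUS, &status);
    if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE))
        return 1;   /* not all devices in the group are bound to vfio-pci */

    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

    /* DMA buffer owned by this process ... */
    size_t sz = 2 * 1024 * 1024;
    void *buf = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* ... and the only memory the device will be able to reach: the
     * kernel programs the IOMMU with exactly this mapping. */
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (uintptr_t)buf,
        .iova  = 0,          /* placeholder I/O virtual address */
        .size  = sz,
    };
    if (ioctl(container, VFIO_IOMMU_MAP_DMA, &map) < 0)
        perror("VFIO_IOMMU_MAP_DMA");

    return 0;
}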

> 
> >  DPDK uses static mappings, so I
> >doubt it's speed matters at all.
> >
> >Anyway, DPDK is doing polling all the time. I don't see why does it
> >insist on using interrupts to detect link up events. Just poll for that
> >too.
> >


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Michael S. Tsirkin
On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote:
> On Tue, 29 Sep 2015 23:54:54 +0300
> "Michael S. Tsirkin"  wrote:
> 
> > On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
> > > The security breach motivation u brought in "[RFC PATCH] uio:
> > > uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak
> > > since one u let the userland access to the bar it may do any funny thing
> > > using the DMA engine of the device. This kind of stuff should be prevented
> > > using the iommu and if it's enabled then any funny tricks using MSI/MSI-X
> > > configuration will be prevented too.
> > > 
> > > I'm about to send the patch to main Linux mailing list. Let's continue 
> > > this
> > > discussion there.
> > >   
> > 
> > Basically UIO shouldn't be used with devices capable of DMA.
> > Use VFIO for that (yes, this implies an emulated or PV IOMMU).
> > I don't think this can change.
> 
> Given there is no PV IOMMU and even if there was it would be too slow for DPDK
> use, I can't accept that. 

QEMU does allow emulating an iommu.  DPDK uses static mappings, so I
doubt its speed matters at all.

Anyway, DPDK is doing polling all the time. I don't see why it insists
on using interrupts to detect link-up events. Just poll for that too.

-- 
MST


[dpdk-dev] Unlinking hugepage backing file after initialiation

2015-09-30 Thread Michael S. Tsirkin
On Tue, Sep 29, 2015 at 05:50:00PM +, shesha Sreenivasamurthy (shesha) 
wrote:
> Sure. Then, is there any real reason why the backing files should not be
> unlinked ?

AFAIK qemu unlinks them already.
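
The usual pattern, for what it's worth, is to unlink the backing file right
after mapping it: the mapping stays valid, and nothing is left behind if the
process crashes. A small sketch, assuming a hugetlbfs mount at /dev/hugepages
and 2 MiB pages (both assumptions):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/dev/hugepages/example_map";   /* placeholder name */
    size_t sz = 2 * 1024 * 1024;                        /* one 2 MiB page */

    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (ftruncate(fd, sz) < 0) {
        perror("ftruncate");
        return 1;
    }

    void *va = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (va == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* The name disappears now; the hugepage is released automatically
     * when the last mapping and descriptor go away, even on a crash. */
    unlink(path);
    close(fd);

    /* ... use va as before ... */
    munmap(va, sz);
    return 0;
}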

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Michael S. Tsirkin
On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
> The security breach motivation u brought in "[RFC PATCH] uio:
> uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak
> since one u let the userland access to the bar it may do any funny thing
> using the DMA engine of the device. This kind of stuff should be prevented
> using the iommu and if it's enabled then any funny tricks using MSI/MSI-X
> configuration will be prevented too.
> 
> I'm about to send the patch to main Linux mailing list. Let's continue this
> discussion there.
> 

Basically UIO shouldn't be used with devices capable of DMA.
Use VFIO for that (yes, this implies an emulated or PV IOMMU).
I don't think this can change.

> >
> >I think that DPDK should be fixed to not require uio_pci_generic
> >for VF devices (or any devices without INT#x).
> >
> >If DPDK requires a place-holder driver, the pci-stub driver should
> >do this adequately. See ./drivers/pci/pci-stub.c
> >


[dpdk-dev] Unlinking hugepage backing file after initialiation

2015-09-29 Thread Michael S. Tsirkin
On Tue, Sep 29, 2015 at 03:48:08PM +, shesha Sreenivasamurthy (shesha) 
wrote:
> If huge pages are allocated for the guest and if the guest crashes there may 
> be
> a chance that the new guest may not be able to get huge pages again as some
> other guest or process on the host used it. But I am not able to understand
> memory corruption you are talking about. In my opinion, if a process using a
> piece of memory goes away, it should not re-attach to the same piece of memory
> without running a sanity check on it.

Guest memory is allocated and freed by the hypervisor, right?
I don't think it's dpdk's job.

-- 
MST

