Hi Jianfeng,

Thanks for raising the container issues and proposing some solutions.
General comments below.
2016-02-05 19:20, Jianfeng Tan:
> This patchset is to provide a high-performance networking interface
> (virtio) for container-based DPDK applications. Starting DPDK apps in
> containers with exclusive ownership of NIC devices is beyond the scope.
> The basic idea here is to present a new virtual device (named eth_cvio),
> which can be discovered and initialized in container-based DPDK apps
> using rte_eal_init(). To minimize the change, we reuse the existing
> virtio frontend driver code (driver/net/virtio/).
>
> Compared to the QEMU/VM case, the virtio device framework (which
> translates I/O port r/w operations into the unix socket/cuse protocol,
> and is originally provided by QEMU) is integrated into the virtio
> frontend driver. So this converged driver actually plays both the role
> of the original frontend driver and the role of the QEMU device
> framework.
>
> The major difference lies in how to calculate the relative address for
> vhost. The principle of virtio is: based on one or multiple shared
> memory segments, vhost maintains a reference system with the base
> address and length of each segment, so that an address coming from the
> VM (usually a GPA, Guest Physical Address) can be translated into a
> vhost-recognizable address (named VVA, Vhost Virtual Address). To
> decrease the overhead of address translation, we should maintain as
> few segments as possible. In the VM case, the GPA is always locally
> contiguous. In the container case, the CVA (Container Virtual Address)
> can be used. Specifically:
> a. when setting the base address, the CVA is used;
> b. when preparing RX descriptors, the CVA is used;
> c. when transmitting packets, the CVA is filled in TX descriptors;
> d. in TX and CQ headers, the CVA is used.
>
> How to share memory? In the VM case, QEMU always shares the whole
> physical layout with the backend. But it is not feasible for a
> container, as a process, to share all its virtual memory regions with
> the backend. So only specified virtual memory regions (of shared type)
> are sent to the backend. It is a limitation that only addresses in
> these areas can be used to transmit or receive packets.
>
> Known issues
>
> a. When used with vhost-net, root privilege is required to create the
>    tap device inside.
> b. Control queue and multi-queue are not supported yet.
> c. When the --single-file option is used, the socket_id of the memory
>    may be wrong. (Use "numactl -N x -m x" to work around this for now.)

There are 2 different topics in this patchset:
1/ How to provide networking in containers
2/ How to provide memory in containers

1/ You have decided to use the virtio spec to bridge the host with its
containers. But there is no virtio device in a container and no vhost
interface in the host (except the kernel one). So you are extending
virtio to work as a vdev inside the container.
Could you explain what the datapath between virtio and the host app is?
Does it need to use a fake device from QEMU, as Tetsuya has done?
Do you think there can be some alternatives to vhost/virtio in
containers?

2/ The memory management is already a mess and it is getting worse.
I think we need to think through the requirements first and then write
a proper implementation to cover every identified need.
I have started a new thread to cover this part:
http://thread.gmane.org/gmane.comp.networking.dpdk.devel/37445
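
To check my understanding of the address-translation part quoted above,
here is a minimal sketch (hypothetical names, not the patchset's actual
code) of the lookup the vhost backend must do for each address, whether
it is a GPA coming from a VM or a CVA coming from a container:

#include <stdint.h>

/* One shared memory segment registered with the vhost backend. */
struct mem_region {
	uint64_t guest_addr; /* base as seen by the VM (GPA) or container (CVA) */
	uint64_t host_addr;  /* base after mmap() in the vhost process (VVA) */
	uint64_t size;
};

/* Translate a GPA/CVA into a VVA by walking the region table.
 * Returns 0 if the address is not inside any shared region, which is
 * exactly the container limitation described in the cover letter. */
static uint64_t
to_vva(const struct mem_region *reg, int nregions, uint64_t addr)
{
	int i;

	for (i = 0; i < nregions; i++) {
		if (addr >= reg[i].guest_addr &&
		    addr < reg[i].guest_addr + reg[i].size)
			return addr - reg[i].guest_addr + reg[i].host_addr;
	}
	return 0;
}

This also shows why maintaining as few segments as possible matters:
with a single segment (e.g. with --single-file), the loop degenerates
to one range check per address.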
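
On the memory-sharing side, I assume the container has to back its
packet memory with an fd-backed MAP_SHARED mapping, so that the fd can
be passed to the backend over the unix socket (via SCM_RIGHTS, as
vhost-user does for VHOST_USER_SET_MEM_TABLE). Roughly (the path and
helper name are made up for illustration):

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map 'size' bytes of file-backed, MAP_SHARED memory (e.g. a hugetlbfs
 * file such as /dev/hugepages/cvio0). Anonymous private memory cannot
 * be handed to another process this way, hence the restriction to
 * shared-type regions in the cover letter. */
static void *
map_shared_seg(const char *path, size_t size, int *fdp)
{
	int fd = open(path, O_CREAT | O_RDWR, 0600);

	if (fd < 0)
		return NULL;
	if (ftruncate(fd, size) < 0) {
		close(fd);
		return NULL;
	}
	*fdp = fd; /* (fd, base, size) fills one entry of the region table */
	return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}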
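
For completeness, my understanding is that an application in the
container would then be started with something like the following (the
exact vdev arguments are an assumption on my side, pieced together from
the cover letter rather than from the code):

./app -c 0x3 -n 4 --no-pci --single-file \
	--vdev=eth_cvio0,path=/path/to/vhost-user.sock

i.e. no PCI scan, one shared memory file, and a unix socket path
pointing at the vhost backend. Please correct me if the device
arguments differ.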