RE: Guest bridge setup variations
-----Original Message-----
From: Arnd Bergmann [mailto:a...@arndb.de]
Sent: Wednesday, December 16, 2009 6:16 AM
To: virtualization@lists.linux-foundation.org
Cc: Leonid Grossman; qemu-de...@nongnu.org
Subject: Re: Guest bridge setup variations

On Wednesday 16 December 2009, Leonid Grossman wrote:

3. Doing the bridging in the NIC using macvlan in passthrough mode. This lowers the CPU utilization further compared to 2, at the expense of limiting throughput by the performance of the PCIe interconnect to the adapter. Whether or not this is a win is workload dependent.

This is certainly true today for PCIe 1.1 and 2.0 devices, but as NICs move to PCIe 3.0 (while remaining almost exclusively dual-port 10GbE for a long while), EVB internal bandwidth will significantly exceed external bandwidth. So, #3 can become a win for most inter-guest workloads.

Right, it's also hardware dependent, but it usually comes down to whether it's cheaper to spend CPU cycles or to spend IO bandwidth. I would be surprised if all future machines with PCIe 3.0 suddenly had a huge surplus of bandwidth but no CPU to keep up with it.

Access controls now happen in the NIC. Currently, this is not supported yet, due to lack of device drivers, but it will be an important scenario in the future according to some people.

Actually, x3100 10GbE drivers support this today via a sysfs interface to the host driver, which can choose to control the VEB tables (and therefore MAC addresses, vlan memberships, etc. for all passthru interfaces behind the VEB).

Ok, I didn't know about that.

Of course a more generic vendor-independent interface will be important in the future.

Right. I hope we can come up with something soon. I'll have a look at what your driver does and see if that can be abstracted in some way.

Sounds good, please let us know if looking at the code/documentation will suffice or you need a couple of cards to go along with the code.

I expect that if we can find an interface between the kernel and device driver for two or three NIC implementations, it will be good enough to adapt to everyone else as well.

The interface will likely evolve along with EVB standards and other developments, but the initial implementation can be pretty basic (and vendor-independent). Early IOV NIC deployments can benefit from an interface that sets a couple of VF parameters missing in the legacy NIC interface - things like a bandwidth limit and a list of MAC addresses (since setting a NIC in promisc mode doesn't work well for a VEB, it is currently forced to learn the addresses it is configured for). The interface can also include querying IOV NIC capabilities such as the number of VFs, support for VEB and/or VEPA mode, etc., as well as getting VF stats and MAC/VLAN tables - all in all, it is not a long list.

Arnd
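[Editorial note] To make the shape of such an interface concrete, here is a minimal sketch of what a vendor-independent set of VF control hooks on the PF's net_device could look like, covering roughly the parameters listed above (MAC address list, bandwidth limit, VEB/VEPA mode, capability query, per-VF stats). All names and signatures are illustrative assumptions, not an existing kernel API.

```c
#include <linux/netdevice.h>
#include <linux/if_ether.h>

/* Hypothetical per-VF counters; the field set is illustrative only. */
struct vf_stats {
	u64 rx_packets;
	u64 tx_packets;
	u64 rx_bytes;
	u64 tx_bytes;
};

/* Hypothetical vendor-independent VF control hooks, implemented by the
 * PF driver and invoked from the host's management interface. */
struct vf_control_ops {
	int (*get_num_vfs)(struct net_device *pf);
	int (*set_vf_mac_list)(struct net_device *pf, int vf,
			       const u8 (*macs)[ETH_ALEN], int count);
	int (*set_vf_vlan)(struct net_device *pf, int vf, u16 vlan);
	int (*set_vf_tx_rate)(struct net_device *pf, int vf, int mbit);
	int (*set_bridge_mode)(struct net_device *pf, bool vepa); /* VEB vs. VEPA */
	int (*get_vf_stats)(struct net_device *pf, int vf, struct vf_stats *st);
};
```

Hooks of this shape could sit behind either sysfs (as the x3100 driver does today) or a netlink-based tool; how they are exposed to userspace is a separate question from the hook definitions themselves.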
RE: Guest bridge setup variations
-----Original Message-----
From: virtualization-boun...@lists.linux-foundation.org [mailto:virtualization-boun...@lists.linux-foundation.org] On Behalf Of Arnd Bergmann
Sent: Tuesday, December 08, 2009 8:08 AM
To: virtualization@lists.linux-foundation.org
Cc: qemu-de...@nongnu.org
Subject: Guest bridge setup variations

As promised, here is my small writeup on which setups I feel are important in the long run for server-type guests. This does not cover -net user, which is really for desktop kinds of applications where you do not want to connect into the guest from another IP address.

I can see four separate setups that we may or may not want to support, the main difference being how the forwarding between guests happens:

1. The current setup, with a bridge and tun/tap devices on ports of the bridge. This is what Gerhard's work on access controls is focused on and the only option where the hypervisor actually is in full control of the traffic between guests. CPU utilization should be highest this way, and network management can be a burden, because the controls are done through a Linux, libvirt and/or Director specific interface.

2. Using macvlan as a bridging mechanism, replacing the bridge and tun/tap entirely. This should offer the best performance on inter-guest communication, both in terms of throughput and CPU utilization, but offers no access control for this traffic at all. Performance of guest-external traffic should be slightly better than bridge/tap.

3. Doing the bridging in the NIC using macvlan in passthrough mode. This lowers the CPU utilization further compared to 2, at the expense of limiting throughput by the performance of the PCIe interconnect to the adapter. Whether or not this is a win is workload dependent.

This is certainly true today for PCIe 1.1 and 2.0 devices, but as NICs move to PCIe 3.0 (while remaining almost exclusively dual-port 10GbE for a long while), EVB internal bandwidth will significantly exceed external bandwidth. So, #3 can become a win for most inter-guest workloads.

Access controls now happen in the NIC. Currently, this is not supported yet, due to lack of device drivers, but it will be an important scenario in the future according to some people.

Actually, x3100 10GbE drivers support this today via a sysfs interface to the host driver, which can choose to control the VEB tables (and therefore MAC addresses, vlan memberships, etc. for all passthru interfaces behind the VEB). Of course a more generic vendor-independent interface will be important in the future.

4. Using macvlan for actual VEPA on the outbound interface. This is mostly interesting because it makes the network access controls visible in an external switch that is already managed. CPU utilization and guest-external throughput should be identical to 3, but inter-guest latency can only be worse because all frames go through the external switch.

In cases 2 through 4, we have the choice between macvtap and the raw packet interface for connecting macvlan to qemu. Raw sockets are better tested right now, while macvtap has better permission management (i.e. it does not require CAP_NET_ADMIN). Neither one is upstream at the moment, though. The raw driver only requires qemu patches, while macvtap requires both a new kernel driver and a trivial change in qemu.

In all four cases, vhost-net could be used to move the workload from user space into the kernel, which may be an advantage. The decision for or against vhost-net is entirely independent of the other decisions.
Arnd
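[Editorial note] As an aside on the raw packet interface mentioned in the message above: the sketch below shows, under stated assumptions, roughly how a userspace backend would bind an AF_PACKET socket to a macvlan device and then exchange frames on the resulting file descriptor. The interface name and the trimmed error handling are illustrative; this is not the actual qemu patch.

```c
/* Minimal sketch: bind a raw packet socket to a macvlan interface so that
 * Ethernet frames for the guest can be read from and written to the fd.
 * Error handling is trimmed; "macvlan0" is an assumed interface name. */
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <net/if.h>
#include <string.h>
#include <sys/socket.h>

static int open_raw_backend(const char *ifname)
{
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	if (fd < 0)
		return -1;

	struct sockaddr_ll sll;
	memset(&sll, 0, sizeof(sll));
	sll.sll_family = AF_PACKET;
	sll.sll_protocol = htons(ETH_P_ALL);
	sll.sll_ifindex = if_nametoindex(ifname);   /* e.g. "macvlan0" */

	if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0)
		return -1;

	/* The emulator would now read()/write() Ethernet frames on fd. */
	return fd;
}
```

Creating such a socket needs elevated privileges; macvtap instead exposes a character device whose fd can be handed to qemu under ordinary file permissions, which is the permission-management advantage mentioned above.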
RE: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support
-----Original Message-----
From: Fischer, Anna [mailto:[EMAIL PROTECTED]
Sent: Saturday, November 08, 2008 3:10 AM
To: Greg KH; Yu Zhao
Cc: Matthew Wilcox; Anthony Liguori; H L; [EMAIL PROTECTED]; [EMAIL PROTECTED]; Chiang, Alexander; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; virtualization@lists.linux-foundation.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; Leonid Grossman; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: RE: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

But would such an API really take advantage of the new IOV interfaces that are exposed by the new device type?

I agree with what Yu says. The idea is to have hardware capabilities to virtualize a PCI device in a way that those virtual devices can represent full PCI devices. The advantage is that those virtual devices can then be used like any other standard PCI device, meaning we can use existing OS tools, configuration mechanisms etc. to start working with them. Also, when using a virtualization-based system, e.g. Xen or KVM, we do not need to introduce new mechanisms to make use of SR-IOV, because we can handle VFs as full PCI devices.

A virtual PCI device in hardware (a VF) can be as powerful or complex as you like, or it can be very simple. But the big advantage of SR-IOV is that hardware presents a complete PCI device to the OS - as opposed to some resources, or queues, that need specific new configuration and assignment mechanisms in order to use them with a guest OS (like, for example, VMDq or similar technologies).

Anna

Ditto. Taking the netdev interface as an example - a queue pair is a great way to scale across CPU cores in a single OS image, but it is just not a good way to share a device across multiple OS images. The best unit of virtualization is a VF that is implemented as a complete netdev PCI device (not a subset of a PCI device). This way, native netdev device drivers can work for direct hw access to a VF as-is, and most/all Linux networking features (including VMQ) will work in a guest.

Also, guest migration for netdev interfaces (both direct and virtual) can be supported via a native Linux mechanism (the bonding driver), while Dom0 can retain veto power over any guest direct interface operation it deems privileged (vlan, mac address, promisc mode, bandwidth allocation between VFs, etc.).

Leonid
RE: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Zhao, Yu
Sent: Thursday, November 06, 2008 11:06 PM
To: Chris Wright
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; Matthew Wilcox; Greg KH; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; virtualization@lists.linux-foundation.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

Chris Wright wrote:
* Greg KH ([EMAIL PROTECTED]) wrote:
On Thu, Nov 06, 2008 at 10:47:41AM -0700, Matthew Wilcox wrote:
On Thu, Nov 06, 2008 at 08:49:19AM -0800, Greg KH wrote:
On Thu, Nov 06, 2008 at 08:41:53AM -0800, H L wrote:

I have not modified any existing drivers, but instead I threw together a bare-bones module enabling me to make a call to pci_iov_register() and then poke at an SR-IOV adapter's /sys entries for which no driver was loaded. It appears from my perusal thus far that drivers using these new SR-IOV patches will require modification; i.e. the driver associated with the Physical Function (PF) will be required to make the pci_iov_register() call along with the requisite notify() function. Essentially this suggests to me a model for the PF driver to perform any global actions or setup on behalf of VFs before enabling them, after which VF drivers could be associated.

Where would the VF drivers have to be associated? On the pci_dev level or on a higher one? Will all drivers that want to bind to a VF device need to be rewritten?

The current model being implemented by my colleagues has separate drivers for the PF (aka native) and VF devices. I don't personally believe this is the correct path, but I'm reserving judgement until I see some code.

Hm, I would like to see that code before we can properly evaluate this interface. Especially as they are all tightly tied together.

I don't think we really know what the One True Usage model is for VF devices. Chris Wright has some ideas, I have some ideas and Yu Zhao has some ideas. I bet there's other people who have other ideas too.

I'd love to hear those ideas.

First there's the question of how to represent the VF on the host. Ideally (IMO) this would show up as a normal interface so that normal tools can configure the interface. This is not exactly how the first round of patches were designed.

Whether the VF can show up as a normal interface is decided by the VF driver. A VF is represented by a 'pci_dev' at the PCI level, so the VF driver can be loaded as a normal PCI device driver. The software representation (eth, framebuffer, etc.) created by the VF driver is not controlled by the SR-IOV framework. So you definitely can use normal tools to configure the VF if its driver supports that :-)

Second there's the question of reserving the BDF on the host such that we don't have two drivers (one in the host and one in a guest) trying to drive the same device (an issue that shows up for device assignment as well as VF assignment).

If we don't reserve a BDF for the device, it can't work in either the host or the guest. Without a BDF, we can't access the config space of the device, and the device also can't do DMA. Did I miss your point?

Third there's the question of whether the VF can be used in the host at all.

Why can't it? My VFs work well in the host as normal PCI devices :-)

Fourth there's the question of whether the VF and PF drivers are the same or separate.

As I mentioned in another email in this thread, we can't predict how a hardware vendor creates their SR-IOV device. The PCI SIG doesn't define device-specific logic.
So I think the answer to this question is up to the device driver developers. If the PF and VF in an SR-IOV device have similar logic, then they can combine the drivers. Otherwise - e.g., if the PF doesn't have real functionality at all and only has registers to control internal resource allocation for VFs - then the drivers should be separate, right?

Right, this really depends upon the functionality behind a VF. If a VF is done as a subset of a netdev interface (for example, a queue pair), then a split VF/PF driver model and a proprietary communication channel are in order. If each VF is done as a complete netdev interface (like in our 10GbE IOV controllers), then the PF and VF drivers could be the same. Each VF can be independently driven by such a native netdev driver; this includes the ability to run a native driver in a guest in passthru mode. A PF driver in a privileged domain doesn't even have to be present.

The typical use case is assigning the VF to the guest directly, so there's only enough functionality on the host side to allocate a VF, configure it, and assign it (and propagate AER). This is with separate PF and VF drivers.

As Anthony mentioned, we are interested in allowing the host to use the VF. This could be useful for
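[Editorial note] To illustrate the "same driver for PF and VF" point made earlier in this exchange: when each VF is a complete netdev PCI device, a single driver can claim both function types through its ID table and probe them through the same path. The sketch below uses placeholder vendor/device IDs and a hypothetical probe routine; it is not taken from any real driver.

```c
#include <linux/module.h>
#include <linux/pci.h>

/* Illustrative only: placeholder IDs, not a real vendor's. */
#define EXAMPLE_VENDOR_ID 0x1234
#define EXAMPLE_PF_DEV_ID 0x0010
#define EXAMPLE_VF_DEV_ID 0x0011

/* One ID table covering both the PF and the VF: because each VF is a
 * complete netdev device, the same probe path can bring either one up. */
static const struct pci_device_id example_ids[] = {
	{ PCI_DEVICE(EXAMPLE_VENDOR_ID, EXAMPLE_PF_DEV_ID) },
	{ PCI_DEVICE(EXAMPLE_VENDOR_ID, EXAMPLE_VF_DEV_ID) },
	{ }
};
MODULE_DEVICE_TABLE(pci, example_ids);

static int example_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	/* Common netdev setup for PF and VF would go here; the PF could
	 * additionally register its VF-enable/notify hooks with the SR-IOV
	 * core, as discussed above. */
	return 0;
}

static void example_remove(struct pci_dev *pdev)
{
}

static struct pci_driver example_driver = {
	.name     = "example_sriov_netdev",
	.id_table = example_ids,
	.probe    = example_probe,
	.remove   = example_remove,
};

static int __init example_init(void)
{
	return pci_register_driver(&example_driver);
}

static void __exit example_exit(void)
{
	pci_unregister_driver(&example_driver);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");
```

With a layout like this, the same binary can drive the PF in the host and a VF either in the host or assigned directly to a guest, which is the property argued for above.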