RE: Guest bridge setup variations

2009-12-16 Thread Leonid Grossman


 -Original Message-
 From: Arnd Bergmann [mailto:a...@arndb.de]
 Sent: Wednesday, December 16, 2009 6:16 AM
 To: virtualization@lists.linux-foundation.org
 Cc: Leonid Grossman; qemu-de...@nongnu.org
 Subject: Re: Guest bridge setup variations
 
 On Wednesday 16 December 2009, Leonid Grossman wrote:
3. Doing the bridging in the NIC using macvlan in passthrough
mode. This lowers the CPU utilization further compared to 2,
at the expense of limiting throughput by the performance of
the PCIe interconnect to the adapter. Whether or not this
is a win is workload dependent.
 
  This is certainly true today for PCIe 1.1 and 2.0 devices, but
  as NICs move to PCIe 3.0 (while remaining almost exclusively dual port
  10GbE for a long while), EVB internal bandwidth will significantly
  exceed external bandwidth. So, #3 can become a win for most inter-guest
  workloads.
 
 Right, it's also hardware dependent, but it usually comes down
 to whether it's cheaper to spend CPU cycles or to spend IO bandwidth.
 
 I would be surprised if all future machines with PCIe 3.0 suddenly have
 a huge surplus of bandwidth but no CPU to keep up with that.
 
Access controls now happen
in the NIC. Currently this is not supported, due to a lack of
device drivers, but it will be an important scenario in the future
according to some people.
 
  Actually, the x3100 10GbE drivers support this today via a sysfs
  interface to the host driver, which can choose to control the VEB tables
  (and therefore MAC addresses, vlan memberships, etc. for all passthru
  interfaces behind the VEB).
 
 Ok, I didn't know about that.
 
  Of course a more generic, vendor-independent interface will be
  important in the future.
 
 Right. I hope we can come up with something soon. I'll have a look at
 what your driver does and see if that can be abstracted in some way.

Sounds good; please let us know whether looking at the code/documentation
will suffice, or whether you need a couple of cards to go along with the
code.

 I expect that if we can find an interface between the kernel and device
 driver for two or three NIC implementations, it will be good enough
 to adapt to everyone else as well.

The interface will likely evolve along with EVB standards and other
developments, but the initial implementation can be pretty basic (and
vendor-independent).
Early IOV NIC deployments can benefit from an interface that sets a couple
of VF parameters missing from the legacy NIC interface - things like a
bandwidth limit and a list of MAC addresses (since setting a NIC in promisc
mode doesn't work well for a VEB, it is currently forced to learn the
addresses it is configured for).
The interface can also include querying IOV NIC capabilities like the
number of VFs, support for VEB and/or VEPA mode, etc., as well as getting
VF stats and MAC/VLAN tables - all in all, it is not a long list.
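
To make that short list concrete, here is a rough sketch of what such a
vendor-independent control interface could look like; every structure and
callback name below is made up for illustration and does not correspond to
an existing kernel API:

/*
 * Illustrative sketch only - none of these structures or callbacks exist
 * in the kernel; they just spell out the short list of operations above.
 */
#include <linux/types.h>
#include <linux/netdevice.h>
#include <linux/if_ether.h>

struct vf_mac_list {
    int count;
    u8 addr[8][ETH_ALEN];       /* addresses the VEB should accept for this VF */
};

struct iov_nic_caps {
    int num_vfs;                /* VFs exposed by the device */
    bool veb_mode;              /* internal bridging (VEB) supported */
    bool vepa_mode;             /* VEPA mode supported */
};

struct vf_stats {
    u64 rx_packets, tx_packets;
    u64 rx_bytes, tx_bytes;
};

/* Implemented by the PF/host driver, reached through some generic interface */
struct vf_control_ops {
    int (*set_vf_bw_limit)(struct net_device *pf_dev, int vf, int mbps);
    int (*set_vf_macs)(struct net_device *pf_dev, int vf,
                       const struct vf_mac_list *macs);
    int (*set_vf_vlan)(struct net_device *pf_dev, int vf, u16 vlan_id);
    int (*get_caps)(struct net_device *pf_dev, struct iov_nic_caps *caps);
    int (*get_vf_stats)(struct net_device *pf_dev, int vf,
                        struct vf_stats *stats);
};

A PF/host driver would fill in an instance of these ops, and management
tools would reach them through whatever generic mechanism (netlink, sysfs,
etc.) ends up being standardized.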


 
   Arnd
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


RE: Guest bridge setup variations

2009-12-15 Thread Leonid Grossman
  -Original Message-
  From: virtualization-boun...@lists.linux-foundation.org
  [mailto:virtualization-boun...@lists.linux-foundation.org] On Behalf
 Of
  Arnd Bergmann
  Sent: Tuesday, December 08, 2009 8:08 AM
  To: virtualization@lists.linux-foundation.org
  Cc: qemu-de...@nongnu.org
  Subject: Guest bridge setup variations
 
  As promised, here is my small writeup on which setups I feel
  are important in the long run for server-type guests. This
  does not cover -net user, which is really for desktop kinds
  of applications where you do not want to connect into the
  guest from another IP address.
 
  I can see four separate setups that we may or may not want to
  support, the main difference being how the forwarding between
  guests happens:
 
  1. The current setup, with a bridge and tun/tap devices on ports
  of the bridge. This is what Gerhard's work on access controls is
  focused on, and it is the only option where the hypervisor actually
  is in full control of the traffic between guests. CPU utilization should
  be highest this way, and network management can be a burden,
  because the controls are done through a Linux, libvirt and/or Director
  specific interface.
 
  2. Using macvlan as a bridging mechanism, replacing the bridge
  and tun/tap entirely. This should offer the best performance on
  inter-guest communication, both in terms of throughput and
  CPU utilization, but offers no access control for this traffic at all.
  Performance of guest-external traffic should be slightly better
  than bridge/tap.
 
  3. Doing the bridging in the NIC using macvlan in passthrough
  mode. This lowers the CPU utilization further compared to 2,
  at the expense of limiting throughput by the performance of
  the PCIe interconnect to the adapter. Whether or not this
  is a win is workload dependent. 

This is certainly true today for PCIe 1.1 and 2.0 devices, but as NICs
move to PCIe 3.0 (while remaining almost exclusively dual port 10GbE for
a long while), EVB internal bandwidth will significantly exceed external
bandwidth. So, #3 can become a win for most inter-guest workloads.

  Access controls now happen
  in the NIC. Currently this is not supported, due to a lack of
  device drivers, but it will be an important scenario in the future
  according to some people.

Actually, the x3100 10GbE drivers support this today via a sysfs interface
to the host driver, which can choose to control the VEB tables (and
therefore MAC addresses, vlan memberships, etc. for all passthru
interfaces behind the VEB).
Of course a more generic, vendor-independent interface will be important
in the future.
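
Purely as an illustration of that kind of host-side control (the sysfs
path below is invented for the example and is not the actual x3100
attribute layout), a management tool could program the MAC address a VF is
allowed to use with nothing more than a file write:

/* Example only: the sysfs path below is invented for illustration and is
 * not the real x3100 attribute layout. */
#include <stdio.h>

static int set_vf_mac(const char *pf_ifname, int vf, const char *mac)
{
    char path[128];
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/class/net/%s/device/vf%d/mac", pf_ifname, vf);
    f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%s\n", mac);
    return fclose(f);
}

int main(void)
{
    /* Tell the VEB which address VF 1 behind eth2 is allowed to use */
    return set_vf_mac("eth2", 1, "00:11:22:33:44:55") ? 1 : 0;
}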

 
  4. Using macvlan for actual VEPA on the outbound interface.
  This is mostly interesting because it makes the network access
  controls visible in an external switch that is already managed.
  CPU utilization and guest-external throughput should be
  identical to 3, but inter-guest latency can only be worse because
  all frames go through the external switch.
 
  In case 2 through 4, we have the choice between macvtap and
  the raw packet interface for connecting macvlan to qemu.
  Raw sockets are better tested right now, while macvtap has
  better permission management (i.e. it does not require
  CAP_NET_ADMIN). Neither one is upstream though at the
  moment. The raw driver only requires qemu patches, while
  macvtap requires both a new kernel driver and a trivial change
  in qemu.
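 
  For reference, the raw option boils down to a standard AF_PACKET socket
  bound to the macvlan device; a minimal userspace sketch (the interface
  name is just an example, and this path requires CAP_NET_RAW):

/* Minimal sketch of the raw-socket option: bind an AF_PACKET socket to a
 * macvlan device (name "macvlan0" is just an example).  Requires
 * CAP_NET_RAW, which is part of why macvtap's permission model is
 * attractive. */
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <arpa/inet.h>          /* htons */
#include <linux/if_ether.h>     /* ETH_P_ALL */
#include <linux/if_packet.h>    /* struct sockaddr_ll */
#include <net/if.h>             /* if_nametoindex */

int open_macvlan_raw(const char *ifname)
{
    struct sockaddr_ll sll;
    int fd;

    fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0)
        return -1;

    memset(&sll, 0, sizeof(sll));
    sll.sll_family   = AF_PACKET;
    sll.sll_protocol = htons(ETH_P_ALL);
    sll.sll_ifindex  = if_nametoindex(ifname);

    /* From here on, this fd carries raw frames on that macvlan port */
    if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}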
 
  In all four cases, vhost-net could be used to move the workload
  from user space into the kernel, which may be an advantage.
  The decision for or against vhost-net is entirely independent of
  the other decisions.
 
  Arnd
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


RE: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

2008-11-08 Thread Leonid Grossman


 -Original Message-
 From: Fischer, Anna [mailto:[EMAIL PROTECTED]
 Sent: Saturday, November 08, 2008 3:10 AM
 To: Greg KH; Yu Zhao
 Cc: Matthew Wilcox; Anthony Liguori; H L; [EMAIL PROTECTED];
 [EMAIL PROTECTED]; Chiang, Alexander;
[EMAIL PROTECTED];
 [EMAIL PROTECTED]; [EMAIL PROTECTED];
[EMAIL PROTECTED];
 virtualization@lists.linux-foundation.org; [EMAIL PROTECTED];
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; Leonid Grossman;
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
 Subject: RE: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support
 


  But would such an API really take advantage of the new IOV interfaces
  that are exposed by the new device type?
 
 I agree with what Yu says. The idea is to have hardware capabilities to
 virtualize a PCI device in a way that those virtual devices can represent
 full PCI devices. The advantage of that is that those virtual devices can
 then be used like any other standard PCI device, meaning we can use
 existing OS tools, configuration mechanisms etc. to start working with
 them. Also, when using a virtualization-based system, e.g. Xen or KVM, we
 do not need to introduce new mechanisms to make use of SR-IOV, because we
 can handle VFs as full PCI devices.
 
 A virtual PCI device in hardware (a VF) can be as powerful or complex as
 you like, or it can be very simple. But the big advantage of SR-IOV is
 that hardware presents a complete PCI device to the OS - as opposed to
 some resources, or queues, that need specific new configuration and
 assignment mechanisms in order to use them with a guest OS (like, for
 example, VMDq or similar technologies).
 
 Anna
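
As a rough sketch of how little the PF side needs for this, the snippet
below uses the pci_enable_sriov()/pci_disable_sriov() calls as they were
eventually merged into mainline; the patch series under discussion here
exposes a pci_iov_register() style interface instead:

/* Sketch of the PF-driver side: after its own global setup, the PF driver
 * asks the PCI core to create the VFs.  Function names follow the SR-IOV
 * support that was eventually merged into mainline; the patch series in
 * this thread uses a pci_iov_register() style call instead. */
#include <linux/pci.h>

#define EXAMPLE_NUM_VFS 8   /* illustrative value */

static int example_pf_enable_vfs(struct pci_dev *pdev)
{
    int err;

    /* Once this succeeds, each VF appears as its own pci_dev and can be
     * bound to a driver in the host or assigned to a guest. */
    err = pci_enable_sriov(pdev, EXAMPLE_NUM_VFS);
    if (err)
        dev_err(&pdev->dev, "enabling VFs failed: %d\n", err);
    return err;
}

static void example_pf_disable_vfs(struct pci_dev *pdev)
{
    pci_disable_sriov(pdev);
}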


Ditto.
Taking the netdev interface as an example - a queue pair is a great way to
scale across CPU cores in a single OS image, but it is just not a good
way to share a device across multiple OS images.
The best unit of virtualization is a VF that is implemented as a
complete netdev PCI device (not a subset of a PCI device).
This way, native netdev device drivers can work for direct hw access to
a VF as-is, and most/all Linux networking features (including VMQ)
will work in a guest.
Also, guest migration for netdev interfaces (both direct and virtual)
can be supported via a native Linux mechanism (the bonding driver), while
Dom0 can retain veto power over any guest direct-interface operation it
deems privileged (vlan, MAC address, promisc mode, bandwidth allocation
between VFs, etc.).
 
Leonid
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


RE: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support

2008-11-07 Thread Leonid Grossman


 -Original Message-
 From: [EMAIL PROTECTED]
 [mailto:[EMAIL PROTECTED] On Behalf
Of
 Zhao, Yu
 Sent: Thursday, November 06, 2008 11:06 PM
 To: Chris Wright
 Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED];
[EMAIL PROTECTED];
 Matthew Wilcox; Greg KH; [EMAIL PROTECTED];
[EMAIL PROTECTED];
 [EMAIL PROTECTED]; virtualization@lists.linux-foundation.org;
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
 Subject: Re: [PATCH 0/16 v6] PCI: Linux kernel SR-IOV support
 
 Chris Wright wrote:
  * Greg KH ([EMAIL PROTECTED]) wrote:
  On Thu, Nov 06, 2008 at 10:47:41AM -0700, Matthew Wilcox wrote:
  On Thu, Nov 06, 2008 at 08:49:19AM -0800, Greg KH wrote:
  On Thu, Nov 06, 2008 at 08:41:53AM -0800, H L wrote:
   I have not modified any existing drivers, but instead I threw together
   a bare-bones module enabling me to make a call to pci_iov_register()
   and then poke at an SR-IOV adapter's /sys entries for which no driver
   was loaded.
 
   It appears from my perusal thus far that drivers using these new
   SR-IOV patches will require modification; i.e. the driver associated
   with the Physical Function (PF) will be required to make the
   pci_iov_register() call along with the requisite notify() function.
   Essentially this suggests to me a model for the PF driver to perform
   any global actions or setup on behalf of VFs before enabling them,
   after which VF drivers could be associated.
   Where would the VF drivers have to be associated?  On the pci_dev
   level or on a higher one?
 
   Will all drivers that want to bind to a VF device need to be
   rewritten?
   The current model being implemented by my colleagues has separate
   drivers for the PF (aka native) and VF devices.  I don't personally
   believe this is the correct path, but I'm reserving judgement until I
   see some code.
   Hm, I would like to see that code before we can properly evaluate this
   interface.  Especially as they are all tightly tied together.
 
   I don't think we really know what the One True Usage model is for VF
   devices.  Chris Wright has some ideas, I have some ideas and Yu Zhao has
   some ideas.  I bet there's other people who have other ideas too.
   I'd love to hear those ideas.
 
   First there's the question of how to represent the VF on the host.
   Ideally (IMO) this would show up as a normal interface so that normal
   tools can configure the interface.  This is not exactly how the first
   round of patches were designed.
 
  Whether the VF can show up as a normal interface is decided by the VF
  driver. A VF is represented by a 'pci_dev' at the PCI level, so the VF
  driver can be loaded as a normal PCI device driver.

  The software representation (eth, framebuffer, etc.) created by the VF
  driver is not controlled by the SR-IOV framework.

  So you definitely can use normal tools to configure the VF if its driver
  supports that :-)
 
 
   Second there's the question of reserving the BDF on the host such that
   we don't have two drivers (one in the host and one in a guest) trying to
   drive the same device (an issue that shows up for device assignment as
   well as VF assignment).
 
  If we don't reserve a BDF for the device, it can't work in either the
  host or the guest.

  Without a BDF, we can't access the config space of the device, and the
  device can't do DMA.

  Did I miss your point?
 
 
   Third there's the question of whether the VF can be used in the host at
   all.

  Why not? My VFs work well in the host as normal PCI devices :-)
 
 
   Fourth there's the question of whether the VF and PF drivers are the
   same or separate.

  As I mentioned in another email in this thread, we can't predict how
  hardware vendors create their SR-IOV devices. The PCI SIG doesn't define
  device-specific logic.

  So I think the answer to this question is up to the device driver
  developers. If the PF and VFs in an SR-IOV device have similar logic,
  then they can combine the drivers. Otherwise, e.g., if the PF doesn't
  have real functionality at all -- it only has registers to control
  internal resource allocation for the VFs -- then the drivers should be
  separate, right?


Right, this really depends upon the functionality behind a VF. If a VF is
done as a subset of a netdev interface (for example, a queue pair), then a
split VF/PF driver model and a proprietary communication channel are in
order.

If each VF is done as a complete netdev interface (like in our 10GbE IOV
controllers), then the PF and VF drivers could be the same. Each VF can be
independently driven by such a native netdev driver; this includes the
ability to run a native driver in a guest in passthru mode.
A PF driver in a privileged domain doesn't even have to be present.
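
As a hedged sketch of what such a combined driver can look like (device
IDs and helper names below are made up; pdev->is_virtfn is the flag the
mainline SR-IOV core sets on virtual functions):

/* Sketch only: one pci_driver serving both the PF and its VFs.  Device
 * IDs and helpers are made up; pdev->is_virtfn comes from the mainline
 * SR-IOV core. */
#include <linux/module.h>
#include <linux/pci.h>

static const struct pci_device_id example_ids[] = {
    { PCI_DEVICE(0x1234, 0xabcd) },  /* made-up vendor/device IDs */
    { }
};
MODULE_DEVICE_TABLE(pci, example_ids);

/* Stub: allocate and register an ordinary net_device for this function */
static int example_setup_netdev(struct pci_dev *pdev)
{
    return 0;
}

/* Stub: PF-only housekeeping - VEB tables, per-VF limits, etc. */
static void example_setup_veb_control(struct pci_dev *pdev)
{
}

static int example_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
    int err;

    err = pci_enable_device(pdev);
    if (err)
        return err;

    if (!pdev->is_virtfn)
        example_setup_veb_control(pdev);

    /* PF and VF alike get the same native netdev setup */
    return example_setup_netdev(pdev);
}

static struct pci_driver example_driver = {
    .name     = "example_iov_nic",
    .id_table = example_ids,
    .probe    = example_probe,
};
module_pci_driver(example_driver);
MODULE_LICENSE("GPL");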

 
 
   The typical usecase is assigning the VF to the guest directly, so
   there's only enough functionality in the host side to allocate a VF,
   configure it, and assign it (and propagate AER).  This is with separate
   PF and VF drivers.

  As Anthony mentioned, we are interested in allowing the host to use the
  VF.  This could be useful for