[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-04 Thread Michael S. Tsirkin
On Fri, Oct 02, 2015 at 03:07:24PM +0100, Bruce Richardson wrote:
> On Fri, Oct 02, 2015 at 05:00:14PM +0300, Michael S. Tsirkin wrote:
> > On Thu, Oct 01, 2015 at 02:02:24PM -0700, Alexander Duyck wrote:
> > > validation and translation would add 10s if not 100s of nanoseconds to the
> > > time needed to process each packet.  In addition we are talking about 
> > > doing
> > > this in kernel space which means we wouldn't really be able to take
> > > advantage of things like SSE or AVX instructions.
> > 
> > Yes. But the nice thing is that it's rearming so it can happen on
> > a separate core, in parallel with packet processing.
> > It does not need to add to latency.
> > 
> > You will burn up more CPU, but again, all this for boxes/hypervisors
> > without an IOMMU.
> > 
> > I'm sure people can come up with even better approaches, once enough
> > people get it that kernel absolutely needs to be protected from
> > userspace.
> > 
> > Long term, the right thing to do is to focus on IOMMU support. This
> > gives you hardware-based memory protection without need to burn up CPU
> > cycles.
> > 
> > -- 
> > MST
> 
> Running it on another core will have its own problems. The main one that springs
> to mind for me is the performance impact of having all those cache lines shared
> between the two cores.
> 
> /Bruce

The cache line is currently invalidated by the device write:
	packet processing core -> device -> packet processing core
We are adding another stage:
	packet processing core -> rearming core -> device -> packet processing core
but the amount of sharing per core isn't increased.

This is something that can be tried immediately without kernel changes.
Who knows, maybe you will actually be able to push more pps this way.


Further, rearming is not doing a lot besides moving bits around in
memory, and it's in the kernel, so it uses very limited resources - maybe we
can efficiently use an HT logical core for this task.
This remains to be seen.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-02 Thread Gleb Natapov
On Fri, Oct 02, 2015 at 05:00:14PM +0300, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 02:02:24PM -0700, Alexander Duyck wrote:
> > validation and translation would add 10s if not 100s of nanoseconds to the
> > time needed to process each packet.  In addition we are talking about doing
> > this in kernel space which means we wouldn't really be able to take
> > advantage of things like SSE or AVX instructions.
> 
> Yes. But the nice thing is that it's rearming so it can happen on
> a separate core, in parallel with packet processing.
> It does not need to add to latency.
> 
Modern NICs have no fewer queues than most machines have cores. There is
no such thing as a free core to offload your processing to; otherwise you
designed your application wrong and are wasting CPU cycles.

> You will burn up more CPU, but again, all this for boxes/hypervisors
> without an IOMMU.
> 
> I'm sure people can come up with even better approaches, once enough
> people get it that kernel absolutely needs to be protected from
> userspace.
> 
People should not "get" things which are, let's be polite here, untrue.
The kernel never tried to protect itself from userspace running on
behalf of root. Secure boot, which is quite recent, is maybe the only
instance where the kernel tries to do so (unfortunately), and it does so by
disabling things if boot is secure. Linux was always a "jack of all
trades" and was suitable to run on a machine with secure boot, in a VM
that acts as an application container, or on an embedded device running packet
forwarding.

The only valid point is that nobody should have to debug crashes that may be
caused by buggy userspace, and tainting the kernel solves that.

> Long term, the right thing to do is to focus on IOMMU support. This
> gives you hardware-based memory protection without need to burn up CPU
> cycles.
> 
> -- 
> MST

--
Gleb.


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-02 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 02:02:24PM -0700, Alexander Duyck wrote:
> validation and translation would add 10s if not 100s of nanoseconds to the
> time needed to process each packet.  In addition we are talking about doing
> this in kernel space which means we wouldn't really be able to take
> advantage of things like SSE or AVX instructions.

Yes. But the nice thing is that it's rearming so it can happen on
a separate core, in parallel with packet processing.
It does not need to add to latency.

You will burn up more CPU, but again, all this for boxes/hypervisors
without an IOMMU.

I'm sure people can come up with even better approaches, once enough
people get it that kernel absolutely needs to be protected from
userspace.

Long term, the right thing to do is to focus on IOMMU support. This
gives you hardware-based memory protection without need to burn up CPU
cycles.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-02 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 02:17:49PM -0700, Alexander Duyck wrote:
> On 10/01/2015 02:42 AM, Michael S. Tsirkin wrote:
> >On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
> >>even when they are some users
> >>prefer to avoid the performance penalty.
> >I don't think there's a measureable penalty from passing through the
> >IOMMU, as long as mappings are mostly static (i.e. iommu=pt).  I sure
> >never saw any numbers that show such.
> 
> It depends on the IOMMU.  I believe Intel had a performance penalty on all
> CPUs prior to Ivy Bridge.  Since then things have improved to where they are
> comparable to bare metal.
> 
> The graph on page 5 of
> https://networkbuilders.intel.com/docs/Network_Builders_RA_vBRAS_Final.pdf
> shows the penalty clear as day.  Pretty much anything before Ivy Bridge w/
> small packets is slowed to a crawl with an IOMMU enabled.
> 
> - Alex

VMs are running with IOMMU enabled anyway.
Avi here tells us no one uses SRIOV on bare metal so ...
we don't need to argue about that.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-02 Thread Bruce Richardson
On Fri, Oct 02, 2015 at 05:00:14PM +0300, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 02:02:24PM -0700, Alexander Duyck wrote:
> > validation and translation would add 10s if not 100s of nanoseconds to the
> > time needed to process each packet.  In addition we are talking about doing
> > this in kernel space which means we wouldn't really be able to take
> > advantage of things like SSE or AVX instructions.
> 
> Yes. But the nice thing is that it's rearming so it can happen on
> a separate core, in parallel with packet processing.
> It does not need to add to latency.
> 
> You will burn up more CPU, but again, all this for boxes/hypervisors
> without an IOMMU.
> 
> I'm sure people can come up with even better approaches, once enough
> people get it that kernel absolutely needs to be protected from
> userspace.
> 
> Long term, the right thing to do is to focus on IOMMU support. This
> gives you hardware-based memory protection without need to burn up CPU
> cycles.
> 
> -- 
> MST

Running it on another core will have its own problems. The main one that springs to
mind for me is the performance impact of having all those cache lines shared
between the two cores.

/Bruce


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-02 Thread Alexander Duyck
On 10/02/2015 07:00 AM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 02:02:24PM -0700, Alexander Duyck wrote:
>> validation and translation would add 10s if not 100s of nanoseconds to the
>> time needed to process each packet.  In addition we are talking about doing
>> this in kernel space which means we wouldn't really be able to take
>> advantage of things like SSE or AVX instructions.
> Yes. But the nice thing is that it's rearming so it can happen on
> a separate core, in parallel with packet processing.
> It does not need to add to latency.

Moving it to another core is automatically going to add extra latency.  
You will have to evict the data out of the L1 cache for one core and 
into the L1 cache for another when you update it, and then reading it 
will force it to have to transition back out.  If you are lucky it is 
only evicted to L2, if not then to L3, or possibly even back to memory.  
Odds are that alone will add tens of nanoseconds to the process, and you 
would need three or more cores to do the same workload as running the 
process over multiple threads means having to add synchronization 
primitives to the whole mess. Then there is the NUMA factor on top of that.

> You will burn up more CPU, but again, all this for boxes/hypervisors
> without an IOMMU.

There are use cases this will make completely useless.  If, for example, 
you are running a workload that needs three cores with DPDK, bumping it 
to nine or more will likely push you out of being able to do the 
workload on some systems.

> I'm sure people can come up with even better approaches, once enough
> people get it that kernel absolutely needs to be protected from
> userspace.

I don't see that happening.  Many people don't care about kernel 
security that much.  If they did, something like DPDK wouldn't have 
gotten off the ground.  Once someone has the ability to load kernel 
modules any protection of the kernel from userspace pretty much goes 
right out the window.  You are just as much at risk from a buggy driver 
in userspace as you are from one that can be added to the kernel.

> Long term, the right thing to do is to focus on IOMMU support. This
> gives you hardware-based memory protection without need to burn up CPU
> cycles.

We have a solution that makes use of IOMMU support with vfio.  The 
problem is there are multiple cases where that support is either not 
available, or using the IOMMU provides excess overhead.

- Alex


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 04:14:33PM +0300, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 01:07:13PM +0100, Bruce Richardson wrote:
> > > > This in itself is going to use up
> > > > a good proportion of the processing time, as well as that we have to 
> > > > spend cycles
> > > > copying the descriptors from one ring in memory to another. Given that 
> > > > right now
> > > > with the vector ixgbe driver, the cycle cost per packet of RX is just a 
> > > > few dozen
> > > > cycles on modern cores, every additional cycle (fraction of a 
> > > > nanosecond) has
> > > > an impact.
> > > > 
> > > > Regards,
> > > > /Bruce
> > > 
> > > See above.  There is no need for that on data path. Only re-adding
> > > buffers requires a system call.
> > > 
> > 
> > Re-adding buffers is a key part of the data path! Ok, the fact that its 
> > only on
> > descriptor rearm does allow somewhat bigger batches,
> 
> That was the point, yes.
> 
> > but the whole point of having
> > the kernel do this extra work you propose is to allow the kernel to scan and
> > sanitize the physical addresses - and that will take a lot of cycles, 
> > especially
> > if it has to handle all the different descriptor formats of all the 
> > different NICs,
> > as has already been pointed out.
> > 
> > /Bruce
> 
> Well the driver would be per NIC, so there's only need to support
> specific formats supported by a given NIC.
> 
> An alternative is to format the descriptors in kernel, based
> on just the list of addresses. This seems cleaner, but I don't
> know how efficient it would be.


Additionally, rearming descriptors can happen on another
core in parallel with processing packets on the first one.

This will use more CPU but help you stay within your PPS limits.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 11:42:23AM +0200, Vincent JARDIN wrote:
> There were some tentative attempts to get it for other (older) drivers, named
> 'bifurcated drivers', but it is stalled.

That approach also has the advantage that userspace bugs can't do
silly things like reprogram device's EEPROM by mistake.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 07:55:20AM -0700, Stephen Hemminger wrote:
> On Thu, 1 Oct 2015 13:14:08 +0300
> "Michael S. Tsirkin"  wrote:
> 
> > On Thu, Oct 01, 2015 at 12:43:53PM +0300, Avi Kivity wrote:
> > > >There were some tentative attempts to get it for other (older) drivers, named
> > > >'bifurcated drivers', but it is stalled.
> > > 
> > > IIRC they still exposed the ring to userspace.
> > 
> > How much would a ring write syscall cost? 1-2 microseconds, isn't it?
> 
> The per-packet budget at 10G is 62ns, a syscall just doesn't cut it.

If you give up on privacy and only insist on security
(can read all kernel memory, can't corrupt it), then
you only need the syscall to re-arm RX descriptors,
and these can be batched aggressively without impacting
latency.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 08:01:00AM -0700, Stephen Hemminger wrote:
> The per-driver ring method is what netmap did.

IIUC netmap has a standard format for descriptors, so it was slower: it
still had to do all networking in the kernel (it only bypasses
part of the networking stack), and to have a thread
translate between software and hardware formats.

> The problem with that is that it forces infrastructure into already
> complex network driver.  It never was accepted.  There were also still
> security issues like time of check/time of use with the ring.

Right, because people do care about security.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 06:19:33PM +0300, Avi Kivity wrote:
> On 10/01/2015 06:11 PM, Michael S. Tsirkin wrote:
> >On Thu, Oct 01, 2015 at 02:32:19PM +0300, Avi Kivity wrote:
> >>>  We already agreed this kernel
> >>>is going to be tainted, and unsupportable.
> >>Yes.  So your only motivation in rejecting the patch is to get the author to
> >>write the ring translation patch and port it to all relevant drivers
> >>instead?
> >Not only that.
> >
> >To make sure users are aware they are doing insecure
> >things when using software poking at device BARs in sysfs.
> 
> I don't think you need to worry about that.  People who program DMA are
> aware of the damage it can cause.

People just install software and run it. They don't program DMA.

And I notice that no software (ab)using this seems to come with
documentation explaining the implications.

> If you want to be extra sure, have uio
> taint the kernel when bus mastering is enabled.

People don't notice the kernel is tainted.  Denying module load will make
them notice.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Avi Kivity
On 10/01/2015 06:11 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 02:32:19PM +0300, Avi Kivity wrote:
>>>   We already agreed this kernel
>>> is going to be tainted, and unsupportable.
>> Yes.  So your only motivation in rejecting the patch is to get the author to
>> write the ring translation patch and port it to all relevant drivers
>> instead?
> Not only that.
>
> To make sure users are aware they are doing insecure
> things when using software poking at device BARs in sysfs.

I don't think you need to worry about that.  People who program DMA are 
aware of the damage it can cause.  If you want to be extra sure, have 
uio taint the kernel when bus mastering is enabled.

> To avoid giving virtualization a bad name for security.

There is no security issue here.  Those VMs run a single application, 
and cannot attack the host or other VMs.

> To get people to work on safe, maintainable solutions.

That's a great goal but I don't think it can be achieved without 
sacrificing performance, which is the only reason for dpdk's existence.  
If safe and maintainable were the only requirements, people would not 
bypass the kernel.

The only thing you are really achieving by blocking this is causing pain.



[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 02:32:19PM +0300, Avi Kivity wrote:
> >  We already agreed this kernel
> >is going to be tainted, and unsupportable.
> 
> Yes.  So your only motivation in rejecting the patch is to get the author to
> write the ring translation patch and port it to all relevant drivers
> instead?

Not only that.

To make sure users are aware they are doing insecure
things when using software poking at device BARs in sysfs.
To avoid giving virtualization a bad name for security.
To get people to work on safe, maintainable solutions.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Avi Kivity
On 10/01/2015 06:01 PM, Stephen Hemminger wrote:
> On Thu, 1 Oct 2015 14:32:19 +0300
> Avi Kivity  wrote:
>
>> On 10/01/2015 02:27 PM, Michael S. Tsirkin wrote:
>>> On Thu, Oct 01, 2015 at 02:20:37PM +0300, Avi Kivity wrote:
 People will just use out of tree drivers (dpdk has several already).  It's 
 a
 pain, but nowhere near what you are proposing.
>>> What's the issue with that?
>> Out of tree drivers have to be compiled on the target system (cannot
>> ship a binary package), and occasionally break.
>>
>> dkms helps with that, as do distributions that promise binary
>> compatibility, but it is still a pain, compared to just shipping a
>> userspace package.
>>
>>>We already agreed this kernel
>>> is going to be tainted, and unsupportable.
>> Yes.  So your only motivation in rejecting the patch is to get the
>> author to write the ring translation patch and port it to all relevant
>> drivers instead?
> The per-driver ring method is what netmap did.
> The problem with that is that it forces infrastructure into already
> complex network driver. It never was accepted. There were also still
> security issues like time of check/time of use with the ring.

There would have to be two rings, with the driver picking up descriptors 
from the software ring, translating virtual addresses, and pushing them 
into the hardware ring.
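
A minimal sketch of that two-ring proxy, assuming hypothetical sw_desc/hw_desc
layouts and a placeholder translate() step standing in for the kernel's
validation; it illustrates the copy-and-translate loop only, not any actual
driver code:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical descriptor layouts -- placeholders, not a real NIC's formats. */
struct sw_desc { void *buf; uint32_t len; };          /* written by userspace */
struct hw_desc { uint64_t dma_addr; uint32_t len; };  /* read by the device   */

/* Placeholder for the kernel's validate-and-translate step. */
static uint64_t translate(void *buf, uint32_t len)
{
	(void)len;  /* a real driver would bounds-check buf/len and map to a DMA address */
	return (uint64_t)(uintptr_t)buf;
}

/* Copy one batch from the software ring into the hardware ring. */
static int proxy_rearm(const struct sw_desc *sw, struct hw_desc *hw, size_t count)
{
	size_t i;

	for (i = 0; i < count; i++) {
		uint64_t dma = translate(sw[i].buf, sw[i].len);
		if (!dma)
			return -1;            /* reject anything that fails validation */
		hw[i].dma_addr = dma;         /* only this code writes the hardware ring */
		hw[i].len = sw[i].len;
	}
	return 0;
}

int main(void)
{
	char buf[2048];
	struct sw_desc sw = { buf, sizeof buf };
	struct hw_desc hw;

	return proxy_rearm(&sw, &hw, 1);
}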

I'm not familiar enough with the truly high performance dpdk 
applications to estimate the performance impact.  Seastar/scylla gets a 
huge benefit from dpdk, but is still nowhere near line rate.



[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Vlad Zolotarov


On 10/01/15 17:47, Stephen Hemminger wrote:
> On Thu, 1 Oct 2015 11:00:28 +0300
> Vlad Zolotarov  wrote:
>
>>
>> On 10/01/15 00:36, Stephen Hemminger wrote:
>>> On Wed, 30 Sep 2015 23:09:33 +0300
>>> Vlad Zolotarov  wrote:
>>>
 On 09/30/15 22:39, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 10:06:52PM +0300, Vlad Zolotarov wrote:
 How would iommu
 virtualization change anything?
>>> Kernel can use an iommu to limit device access to memory of
>>> the controlling application.
>> Ok, this is obvious but what it has to do with enabling using MSI/MSI-X
>> interrupts support in uio_pci_generic? kernel may continue to limit the
>> above access with this support as well.
> It could maybe. So if you write a patch to allow MSI by at the same time
> creating an isolated IOMMU group and blocking DMA from device in
> question anywhere, that sounds reasonable.
 No, I'm only planning to add MSI and MSI-X interrupts support for
 uio_pci_generic device.
 The rest mentioned above should naturally be a matter of a different
 patch and writing it is orthogonal to the patch I'm working on as has
 been extensively discussed in this thread.

>>> I have a generic MSI and MSI-X driver (posted earlier on this list).
>>> About to post to upstream kernel.
>> Stephen, hi!
>>
>> I found the mentioned series and first thing I noticed was that it's
>> been sent in May so the first question is how far in your list of tasks
>> submitting it upstream is? We need it more or less yesterday and I'm
>> working on it right now. Therefore if u don't have time for it I'd like
>> to help... ;) However I'd like u to clarify a few small things. Pls.,
>> see below...
>>
>> I noticed that u've created a separate msi_msix driver and the second
>> question is what do u plan for the upstream? I was thinking of extending
>> the existing uio_pci_generic with the MSI-X functionality similar to
>> your code and preserving the INT#X functionality as it is now:
> The igb_uio has a bunch of other things I didn't want to deal with:
> the name (being specific to old Intel driver); compatibility with older
> kernels; legacy ABI support. Therefore in effect uio_msi is a rebase
> of igb_uio.
>
> The submission upstream yesterday is the first step, I expect lots
> of review feedback.

Sure, we have lots of feedback already even before the patch has been 
sent... ;)
So, I'm preparing the uio_pci_generic patch. Just wanted to make sure we 
are not working on the same patch at the same time... ;)

It's going to enable both MSI and MSI-X support.
For backward compatibility it'll enable INT#X by default.
It follows the concepts and uses some code pieces from your uio_msi 
patch. If u want I'll put u as a signed-off when I send it.


>
>>*   INT#X and MSI would provide the IRQ number to the UIO module while
>>  only MSI-X case would register with UIO_IRQ_CUSTOM.
> I wanted all IRQ's to be the same for the driver, ie all go through
> eventfd mechanism. This makes code on DPDK side consistent with less
> special cases.

Of course. The name (uio_msi) is a bit confusing since it only adds 
MSI-X support. I mistakenly thought that it adds both MSI and MSI-X but 
it seems to only add MSI-X and then there are no further questions... ;)

>
>> I also noticed that u enable MSI-X on a first open() call. I assume
>> there was a good reason (that I miss) for not doing it in probe(). Could
>> u, pls., clarify?

What about this?




[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 01:07:13PM +0100, Bruce Richardson wrote:
> > > This in itself is going to use up
> > > a good proportion of the processing time, as well as that we have to 
> > > spend cycles
> > > copying the descriptors from one ring in memory to another. Given that 
> > > right now
> > > with the vector ixgbe driver, the cycle cost per packet of RX is just a 
> > > few dozen
> > > cycles on modern cores, every additional cycle (fraction of a nanosecond) 
> > > has
> > > an impact.
> > > 
> > > Regards,
> > > /Bruce
> > 
> > See above.  There is no need for that on data path. Only re-adding
> > buffers requires a system call.
> > 
> 
> Re-adding buffers is a key part of the data path! Ok, the fact that its only 
> on
> descriptor rearm does allow somewhat bigger batches,

That was the point, yes.

> but the whole point of having
> the kernel do this extra work you propose is to allow the kernel to scan and
> sanitize the physical addresses - and that will take a lot of cycles, 
> especially
> if it has to handle all the different descriptor formats of all the different 
> NICs,
> as has already been pointed out.
> 
> /Bruce

Well the driver would be per NIC, so there's only need to support
specific formats supported by a given NIC.

An alternative is to format the descriptors in kernel, based
on just the list of addresses. This seems cleaner, but I don't
know how efficient it would be.

Device vendors and dpdk developers are probably the best people to
figure out what's the best thing to do here.

But it looks like it's not going to happen unless security is made
a requirement for upstreaming code.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Avi Kivity


On 10/01/2015 02:31 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 02:20:37PM +0300, Avi Kivity wrote:
>>
>> On 10/01/2015 02:09 PM, Michael S. Tsirkin wrote:
>>> On Thu, Oct 01, 2015 at 01:50:10PM +0300, Avi Kivity wrote:
>> It's not just the lack of system calls, of course, the architecture is
>> completely different.
> Absolutely - I'm not saying move all of DPDK into kernel.
> We just need to protect the RX rings so hardware does
> not corrupt kernel memory.
>
>
> Thinking about it some more, many devices
> have separate rings for DMA: TX (device reads memory)
> and RX (device writes memory).
> With such devices, a mode where userspace can write TX ring
> but not RX ring might make sense.
 I'm sure you can cause havoc just by reading, if you read from I/O memory.
>>> Not talking about I/O memory here. These are device rings in RAM.
>> Right.  But you program them with DMA addresses, so the device can read
>> another device's memory.
> It can't if host has limited it to only DMA into guest RAM, which is
> pretty common.
>

Ok.  So yes, the tx ring can be mapped R/W.



[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Avi Kivity
On 10/01/2015 02:27 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 02:20:37PM +0300, Avi Kivity wrote:
>> People will just use out of tree drivers (dpdk has several already).  It's a
>> pain, but nowhere near what you are proposing.
> What's the issue with that?

Out of tree drivers have to be compiled on the target system (cannot 
ship a binary package), and occasionally break.

dkms helps with that, as do distributions that promise binary 
compatibility, but it is still a pain, compared to just shipping a 
userspace package.

>   We already agreed this kernel
> is going to be tainted, and unsupportable.

Yes.  So your only motivation in rejecting the patch is to get the 
author to write the ring translation patch and port it to all relevant 
drivers instead?


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 02:20:37PM +0300, Avi Kivity wrote:
> 
> 
> On 10/01/2015 02:09 PM, Michael S. Tsirkin wrote:
> >On Thu, Oct 01, 2015 at 01:50:10PM +0300, Avi Kivity wrote:
> It's not just the lack of system calls, of course, the architecture is
> completely different.
> >>>Absolutely - I'm not saying move all of DPDK into kernel.
> >>>We just need to protect the RX rings so hardware does
> >>>not corrupt kernel memory.
> >>>
> >>>
> >>>Thinking about it some more, many devices
> >>>have separate rings for DMA: TX (device reads memory)
> >>>and RX (device writes memory).
> >>>With such devices, a mode where userspace can write TX ring
> >>>but not RX ring might make sense.
> >>I'm sure you can cause havoc just by reading, if you read from I/O memory.
> >Not talking about I/O memory here. These are device rings in RAM.
> 
> Right.  But you program them with DMA addresses, so the device can read
> another device's memory.

It can't if host has limited it to only DMA into guest RAM, which is
pretty common.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 02:20:37PM +0300, Avi Kivity wrote:
> People will just use out of tree drivers (dpdk has several already).  It's a
> pain, but nowhere near what you are proposing.

What's the issue with that? We already agreed this kernel
is going to be tainted, and unsupportable.


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 12:08:07PM +0100, Bruce Richardson wrote:
> On Thu, Oct 01, 2015 at 01:38:37PM +0300, Michael S. Tsirkin wrote:
> > On Thu, Oct 01, 2015 at 12:59:47PM +0300, Avi Kivity wrote:
> > > 
> > > 
> > > On 10/01/2015 12:55 PM, Michael S. Tsirkin wrote:
> > > >On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
> > > >>It's easy to claim that
> > > >>a solution is around the corner, only no one was looking for it, but the
> > > >>reality is that kernel bypass has been a solution for years for high
> > > >>performance users,
> > > >I never said that it's trivial.
> > > >
> > > >It's probably a lot of work. It's definitely more work than just abusing
> > > >sysfs.
> > > >
> > > >But it looks like a write system call into an eventfd is about 1.5
> > > >microseconds on my laptop. Even with a system call per packet, system
> > > >call overhead is not what makes DPDK drivers outperform Linux ones.
> > > >
> > > 
> > > 1.5 us = 0.6 Mpps per core limit.
> > 
> > Oh, I calculated it incorrectly. It's 0.15 us. So 6Mpps.
> > But for RX, you can batch a lot of packets.
> > 
> > You can see by now I'm not that good at benchmarking.
> > Here's what I wrote:
> > 
> > 
> > #include <sys/eventfd.h>
> > #include <stdint.h>
> > #include <unistd.h>
> > #include <stdio.h>
> > 
> > 
> > int main(int argc, char **argv)
> > {
> > int e = eventfd(0, 0);
> > uint64_t v = 1;
> > 
> > int i;
> > 
> > for (i = 0; i < 10000000; ++i) {
> > write(e, &v, sizeof v);
> > }
> > }
> > 
> > 
> > This takes 1.5 seconds to run on my laptop:
> > 
> > $ time ./a.out 
> > 
> > real0m1.507s
> > user0m0.179s
> > sys 0m1.328s
> > 
> > 
> > > dpdk performance is in the tens of
> > > millions of packets per system.
> > 
> > I think that's with a bunch of batching though.
> > 
> > > It's not just the lack of system calls, of course, the architecture is
> > > completely different.
> > 
> > Absolutely - I'm not saying move all of DPDK into kernel.
> > We just need to protect the RX rings so hardware does
> > not corrupt kernel memory.
> > 
> > 
> > Thinking about it some more, many devices
> > have separate rings for DMA: TX (device reads memory)
> > and RX (device writes memory).
> > With such devices, a mode where userspace can write TX ring
> > but not RX ring might make sense.
> > 
> > This will mean userspace might read kernel memory
> > through the device, but can not corrupt it.
> > 
> > That's already a big win!
> > 
> > And RX buffers do not have to be added one at a time.
> > If we assume 0.2usec per system call, batching some 100 buffers per
> > system call gives you 2 nano seconds overhead.  That seems quite
> > reasonable.
> > 
> Hi,
> 
> just to jump in a bit on this.
> 
> Batching of 100 packets is a very large batch, and will add to latency.



This is not on the transmit or receive path!
This is only for re-adding buffers to the receive ring.
This batching should not add latency at all:


process rx:
	get packet
	packets[n] = alloc packet
	if (++n > 100) {
		system call: add bufs(packets, n);
	}
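
A minimal standalone sketch of that loop in C, with alloc_packet() and
add_bufs() as stubs standing in for the real allocator and the proposed
re-arm syscall; it only illustrates that the batched re-add sits outside the
per-packet fast path:

#include <stdlib.h>
#include <stddef.h>

#define BATCH 100

struct pkt { char data[2048]; };   /* placeholder packet buffer */

/* Stand-in for the real allocator. */
static struct pkt *alloc_packet(void) { return malloc(sizeof(struct pkt)); }

/* Stand-in for the proposed re-arm syscall: in the proposal this would hand
 * the kernel a whole batch of buffers to validate and put on the RX ring.
 * Here it just frees them so the sketch runs standalone. */
static void add_bufs(struct pkt **pkts, size_t n)
{
	size_t i;
	for (i = 0; i < n; i++)
		free(pkts[i]);
}

int main(void)
{
	struct pkt *pending[BATCH];
	size_t n = 0;
	int received;

	for (received = 0; received < 1000; received++) {
		/* ... per-packet RX processing happens here, with no syscall ... */
		pending[n++] = alloc_packet();  /* replace the consumed buffer   */
		if (n == BATCH) {
			add_bufs(pending, n);   /* one syscall per BATCH packets */
			n = 0;
		}
	}
	if (n)
		add_bufs(pending, n);           /* flush the tail */
	return 0;
}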





> The
> standard batch size in DPDK right now is 32, and even that may be too high for
> applications in certain domains.
> 
> However, even with that 2ns of overhead calculation, I'd make a few additional
> points.
> * For DPDK, we are reasonably close to being able to do 40GB of IO - both RX 
> and TX on a single thread. 10GB of IO doesn't really stress a core any more. 
> For
> 40GB of small packet traffic, the packet arrival rate is 16.8ns, so even with 
> a
> huge batch size of 100 packets, your system call overhead on RX is taking 
> almost
> 12% of our processing time. For a batch size of 32 this overhead would rise to
> over 35% of our packet processing time.

As I said, yes, measurable, but not breaking the bank, and that's with
40GB, which is still not widespread.
With 10GB and 100 packets, only 3% overhead.
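
A quick check of these percentages, assuming the 0.2 us per syscall figure
above and minimum-size (64-byte) packet arrival times of roughly 16.8 ns at
40G and 67.2 ns at 10G:

% Per-packet syscall overhead is ~0.2 us amortized over the batch,
% compared against the per-packet arrival time.
\[
  \frac{0.2\,\mu\mathrm{s}/100}{16.8\,\mathrm{ns}} \approx 12\%, \qquad
  \frac{0.2\,\mu\mathrm{s}/32}{16.8\,\mathrm{ns}} \approx 37\%, \qquad
  \frac{0.2\,\mu\mathrm{s}/100}{67.2\,\mathrm{ns}} \approx 3\%
\]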

> For 100G line rate, the packet arrival
> rate is just 6.7ns...

Hypervisors still have time to get their act together and support IOMMUs
by the time 100G systems become widespread.

> * As well as this overhead from the system call itself, you are also omitting
> the overhead of scanning the RX descriptors.

I omit it because scanning descriptors can still be done in userspace,
just write-protect the RX ring page.
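
A toy standalone example of the write-protect mechanism being alluded to; an
anonymous page stands in for the ring, and how a real driver would expose the
ring read-only to userspace is a separate question:

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *ring = mmap(NULL, page, PROT_READ | PROT_WRITE,
	                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (ring == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	ring[0] = 0xab;                  /* writable while we "own" it  */
	mprotect(ring, page, PROT_READ); /* now it can only be scanned  */

	printf("first descriptor byte: 0x%02x\n", ring[0]);
	/* ring[0] = 0; would now fault with SIGSEGV */

	munmap(ring, page);
	return 0;
}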


> This in itself is going to use up
> a good proportion of the processing time, as well as that we have to spend 
> cycles
> copying the descriptors from one ring in memory to another. Given that right 
> now
> with the vector ixgbe driver, the cycle cost per packet of RX is just a few 
> dozen
> cycles on modern cores, every additional cycle (fraction of a nanosecond) has
> an impact.
> 
> Regards,
> /Bruce

See above.  There is no need for that on data path. Only re-adding
buffers requires a system call.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Avi Kivity


On 10/01/2015 02:09 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 01:50:10PM +0300, Avi Kivity wrote:
 It's not just the lack of system calls, of course, the architecture is
 completely different.
>>> Absolutely - I'm not saying move all of DPDK into kernel.
>>> We just need to protect the RX rings so hardware does
>>> not corrupt kernel memory.
>>>
>>>
>>> Thinking about it some more, many devices
>>> have separate rings for DMA: TX (device reads memory)
>>> and RX (device writes memory).
>>> With such devices, a mode where userspace can write TX ring
>>> but not RX ring might make sense.
>> I'm sure you can cause havoc just by reading, if you read from I/O memory.
> Not talking about I/O memory here. These are device rings in RAM.

Right.  But you program them with DMA addresses, so the device can read 
another device's memory.

>>> This will mean userspace might read kernel memory
>>> through the device, but can not corrupt it.
>>>
>>> That's already a big win!
>>>
>>> And RX buffers do not have to be added one at a time.
>>> If we assume 0.2usec per system call, batching some 100 buffers per
>>> system call gives you 2 nano seconds overhead.  That seems quite
>>> reasonable.
>> You're ignoring the page table walk
> Some caching strategy might work here.

It may, or it may not.  I'm not against this.  I'm against blocking 
users' access to their hardware, using an existing, established 
interface, for a small subset of setups.  It doesn't help you in any way 
(you can still get reports of oopses due to buggy userspace drivers on 
physical machines, or on virtual machines that don't require 
interrupts), and it harms them.

>> and other per-descriptor processing.
> You probably can let userspace pre-format it all,
> just validate addresses.

You have to figure out if the descriptor contains an address or not 
(many devices have several descriptor formats, some with addresses and 
some without, which are intermixed).  You also have to parse the 
descriptor size and see if it crosses a page boundary or not.
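
A rough illustration of the per-descriptor checks being described, assuming a
simplified, hypothetical descriptor layout (real formats differ per NIC and
are intermixed, as noted above):

#include <stdint.h>
#include <stdbool.h>

#define PAGE_SIZE 4096u
#define DESC_HAS_ADDR 0x1

/* Hypothetical simplified descriptor -- real NICs intermix several formats. */
struct rx_desc {
	uint64_t addr;     /* DMA address, unused for some descriptor types    */
	uint16_t len;
	uint16_t flags;    /* bit 0: this descriptor carries a buffer address  */
};

/* Roughly what a kernel-side validator would have to check per descriptor. */
static bool desc_is_safe(const struct rx_desc *d,
                         uint64_t region_start, uint64_t region_len)
{
	if (!(d->flags & DESC_HAS_ADDR))
		return true;                     /* nothing to validate            */
	if (d->addr < region_start ||
	    d->addr + d->len > region_start + region_len)
		return false;                    /* points outside allowed region  */
	if ((d->addr % PAGE_SIZE) + d->len > PAGE_SIZE)
		return false;                    /* buffer crosses a page boundary */
	return true;
}

int main(void)
{
	struct rx_desc d = { 0x1000, 512, DESC_HAS_ADDR };
	return desc_is_safe(&d, 0x1000, 4096) ? 0 : 1;
}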

>
>> Again^2, maybe this can work.  But it shouldn't block a patch enabling
>> interrupt support of VFs.  After the ring proxy is available and proven for
>> a few years, we can deprecate bus mastering from uio, and after a few more
>> years remove it.
> We are talking about DPDK patches posted in June 2015.  It's not some
> software proven for years.

dpdk has been used for years, it just won't work on VFs, if you need 
interrupt support.

>If Linux keeps enabling hacks, no one will
> bother doing the right thing.  Upstream inclusion is the only carrot
> Linux has to make people do the right thing.

It's not a carrot, it's a stick.  Implementing your scheme will take a 
huge effort, is not guaranteed to provide the performance needed, and 
will not be available for years.  Meanwhile exactly the same thing on 
physical machines is supported.

People will just use out of tree drivers (dpdk has several already).  
It's a pain, but nowhere near what you are proposing.



[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Alexander Duyck
On 10/01/2015 02:42 AM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
>> even when they are some users
>> prefer to avoid the performance penalty.
> I don't think there's a measureable penalty from passing through the
> IOMMU, as long as mappings are mostly static (i.e. iommu=pt).  I sure
> never saw any numbers that show such.

It depends on the IOMMU.  I believe Intel had a performance penalty on 
all CPUs prior to Ivy Bridge.  Since then things have improved to where 
they are comparable to bare metal.

The graph on page 5 of 
https://networkbuilders.intel.com/docs/Network_Builders_RA_vBRAS_Final.pdf 
shows the penalty clear as day.  Pretty much anything before Ivy Bridge 
w/ small packets is slowed to a crawl with an IOMMU enabled.

- Alex


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 01:50:10PM +0300, Avi Kivity wrote:
> >
> >>It's not just the lack of system calls, of course, the architecture is
> >>completely different.
> >Absolutely - I'm not saying move all of DPDK into kernel.
> >We just need to protect the RX rings so hardware does
> >not corrupt kernel memory.
> >
> >
> >Thinking about it some more, many devices
> >have separate rings for DMA: TX (device reads memory)
> >and RX (device writes memory).
> >With such devices, a mode where userspace can write TX ring
> >but not RX ring might make sense.
> 
> I'm sure you can cause havoc just by reading, if you read from I/O memory.

Not talking about I/O memory here. These are device rings in RAM.

> >
> >This will mean userspace might read kernel memory
> >through the device, but can not corrupt it.
> >
> >That's already a big win!
> >
> >And RX buffers do not have to be added one at a time.
> >If we assume 0.2usec per system call, batching some 100 buffers per
> >system call gives you 2 nano seconds overhead.  That seems quite
> >reasonable.
> 
> You're ignoring the page table walk

Some caching strategy might work here.

> and other per-descriptor processing.

You probably can let userspace pre-format it all,
just validate addresses.

> Again^2, maybe this can work.  But it shouldn't block a patch enabling
> interrupt support of VFs.  After the ring proxy is available and proven for
> a few years, we can deprecate bus mastering from uio, and after a few more
> years remove it.

We are talking about DPDK patches posted in June 2015.  It's not some
software proven for years.  If Linux keeps enabling hacks, no one will
bother doing the right thing.  Upstream inclusion is the only carrot
Linux has to make people do the right thing.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Alexander Duyck
On 10/01/2015 06:14 AM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 01:07:13PM +0100, Bruce Richardson wrote:
 This in itself is going to use up
 a good proportion of the processing time, as well as that we have to spend 
 cycles
 copying the descriptors from one ring in memory to another. Given that 
 right now
 with the vector ixgbe driver, the cycle cost per packet of RX is just a 
 few dozen
 cycles on modern cores, every additional cycle (fraction of a nanosecond) 
 has
 an impact.

 Regards,
 /Bruce
>>> See above.  There is no need for that on data path. Only re-adding
>>> buffers requires a system call.
>>>
>> Re-adding buffers is a key part of the data path! Ok, the fact that its only 
>> on
>> descriptor rearm does allow somewhat bigger batches,
> That was the point, yes.
>
>> but the whole point of having
>> the kernel do this extra work you propose is to allow the kernel to scan and
>> sanitize the physical addresses - and that will take a lot of cycles, 
>> especially
>> if it has to handle all the different descriptor formats of all the 
>> different NICs,
>> as has already been pointed out.
>>
>> /Bruce
> Well the driver would be per NIC, so there's only need to support
> specific formats supported by a given NIC.

One thing that seems to be overlooked in your discussion is the cost to 
translate these descriptors.  It isn't as if most systems running DPDK 
have the cycles to spare.  As I believe was brought up in another thread 
we are looking at a budget of something like 68ns at 10Gbps line rate.  
The overhead for having to go through and translate/parse/validate the 
descriptors would end up being pretty significant.  If you need proof of 
that just try running the ixgbe driver and route small packets.  We end 
up spending something like 40ns in ixgbe_clean_rx_irq and that is mostly 
just translating the descriptor bits into the correct sk_buff bits.  
Also, trying to maintain a user-space ring in addition to the 
kernel-space ring means that much more memory overhead, increasing 
the likelihood of things getting pushed out of the L1 cache.

As far as the descriptor validation itself goes, the overhead for that would 
guarantee that you cannot get any performance out of the device.  There 
are too many corner cases that would have to be addressed in validating 
user-space input to allow for us to process packets in any sort of 
timely fashion.  For starters we would have to validate the size, 
alignment, and ownership of a given buffer. If it is a transmit buffer 
we have to go through and validate any offloads being requested.  Likely 
just the validation and translation would add 10s if not 100s of 
nanoseconds to the time needed to process each packet.  In addition we 
are talking about doing this in kernel space which means we wouldn't 
really be able to take advantage of things like SSE or AVX instructions.

> An alternative is to format the descriptors in kernel, based
> on just the list of addresses. This seems cleaner, but I don't
> know how efficient it would be.
>
> Device vendors and dpdk developers are probably the best people to
> figure out what's the best thing to do here.

As far as the bifurcated driver approach goes, the only way something like 
that would ever work is if you could limit the access via an IOMMU. At 
least everything I have seen proposed for a bifurcated driver still 
involved one if they expected to get any performance.

> But it looks like it's not going to happen unless security is made
> a requirement for upstreaming code.

The fact is we already ship uio_pci_generic.  User space drivers are 
here to stay.  What is being asked for is an extension to the existing 
infrastructure to allow MSI-X interrupts to trigger an event on a file 
descriptor.  As far as I know that doesn't add any additional security 
risk, since it is the kernel PCIe subsystem itself that would be 
programming the address and data for said device; it wouldn't actually 
grant any more access other than the additional file descriptors to 
support MSI-X vectors.
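
For context, a minimal standalone sketch of what the userspace side of such
an eventfd-per-vector model could look like; the eventfd here is created
locally just so the example runs, whereas with the proposed driver it would
be handed out by the kernel for each MSI-X vector (error handling omitted):

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>

int main(void)
{
	int efd = eventfd(0, 0);              /* stands in for a per-vector fd */
	int ep = epoll_create1(0);
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = efd };

	epoll_ctl(ep, EPOLL_CTL_ADD, efd, &ev);

	uint64_t v = 1;
	write(efd, &v, sizeof v);             /* pretend the vector fired */

	struct epoll_event out;
	if (epoll_wait(ep, &out, 1, -1) == 1) {
		uint64_t count;
		read(out.data.fd, &count, sizeof count);  /* clears the event */
		printf("interrupt events: %llu\n", (unsigned long long)count);
	}
	return 0;
}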

Anyway that is just my $.02 on this.

- Alex




[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Avi Kivity


On 10/01/2015 01:44 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 01:25:17PM +0300, Avi Kivity wrote:
>> Why use a VF on a non-virtualized system?
> 1. So a userspace bug does not destroy your hardware
> (PFs generally assume trusted non-buggy drivers, VFs
>  generally don't).

People who use dpdk trust their drivers (those drivers are the reason 
for the system to exist in the first place).

> 2. So you can use a PF or another VF for regular networking.

This is valid, but usually those systems have a separate management 
network.  Unfortunately VFs limit the number of queues you can expose, 
making them less performant than PFs.

The "bifurcated drivers" were meant as a way of enabling this 
functionality without resorting to VFs, but it seems they are stalled.

> 3. So you can manage this system, to some level.
>

Again existing practice doesn't follow this.



[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Avi Kivity


On 10/01/2015 01:38 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 12:59:47PM +0300, Avi Kivity wrote:
>>
>> On 10/01/2015 12:55 PM, Michael S. Tsirkin wrote:
>>> On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
 It's easy to claim that
 a solution is around the corner, only no one was looking for it, but the
 reality is that kernel bypass has been a solution for years for high
 performance users,
>>> I never said that it's trivial.
>>>
>>> It's probably a lot of work. It's definitely more work than just abusing
>>> sysfs.
>>>
>>> But it looks like a write system call into an eventfd is about 1.5
>>> microseconds on my laptop. Even with a system call per packet, system
>>> call overhead is not what makes DPDK drivers outperform Linux ones.
>>>
>> 1.5 us = 0.6 Mpps per core limit.
> Oh, I calculated it incorrectly. It's 0.15 us. So 6Mpps.

You also trimmed the extra work that needs to be done, that I 
mentioned.  Maybe your ring proxy can work, maybe it can't.  In any case 
it's a hefty chunk of work.  Should this work block users from using 
their VFs, if they happen to need interrupt support?

> But for RX, you can batch a lot of packets.
>
> You can see by now I'm not that good at benchmarking.
> Here's what I wrote:
>
>
> #include <sys/eventfd.h>
> #include <stdint.h>
> #include <unistd.h>
> #include <stdio.h>
>
>
> int main(int argc, char **argv)
> {
>  int e = eventfd(0, 0);
>  uint64_t v = 1;
>
>  int i;
>
>  for (i = 0; i < 10000000; ++i) {
>  write(e, &v, sizeof v);
>  }
> }
>
>
> This takes 1.5 seconds to run on my laptop:
>
> $ time ./a.out
>
> real0m1.507s
> user0m0.179s
> sys 0m1.328s
>
>
>> dpdk performance is in the tens of
>> millions of packets per system.
> I think that's with a bunch of batching though.

Yes, it's also with their application code running as well.  They didn't 
reach this kind of performance by spending cycles unnecessarily.

I'm not saying that the ring proxy is not workable; just that we don't 
know whether it is or not, and I don't think that a patch that enables 
_existing functionality_ for VFs should be blocked in favor of a new and 
unproven approach.

>
>> It's not just the lack of system calls, of course, the architecture is
>> completely different.
> Absolutely - I'm not saying move all of DPDK into kernel.
> We just need to protect the RX rings so hardware does
> not corrupt kernel memory.
>
>
> Thinking about it some more, many devices
> have separate rings for DMA: TX (device reads memory)
> and RX (device writes memory).
> With such devices, a mode where userspace can write TX ring
> but not RX ring might make sense.

I'm sure you can cause havoc just by reading, if you read from I/O memory.

>
> This will mean userspace might read kernel memory
> through the device, but can not corrupt it.
>
> That's already a big win!
>
> And RX buffers do not have to be added one at a time.
> If we assume 0.2usec per system call, batching some 100 buffers per
> system call gives you 2 nano seconds overhead.  That seems quite
> reasonable.

You're ignoring the page table walk and other per-descriptor processing.

Again^2, maybe this can work.  But it shouldn't block a patch enabling 
interrupt support of VFs.  After the ring proxy is available and proven 
for a few years, we can deprecate bus mastering from uio, and after a 
few more years remove it.



[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 01:25:17PM +0300, Avi Kivity wrote:
> Why use a VF on a non-virtualized system?

1. So a userspace bug does not destroy your hardware
   (PFs generally assume trusted non-buggy drivers, VFs
   generally don't).
2. So you can use a PF or another VF for regular networking.
3. So you can manage this system, to some level.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 12:59:47PM +0300, Avi Kivity wrote:
> 
> 
> On 10/01/2015 12:55 PM, Michael S. Tsirkin wrote:
> >On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
> >>It's easy to claim that
> >>a solution is around the corner, only no one was looking for it, but the
> >>reality is that kernel bypass has been a solution for years for high
> >>performance users,
> >I never said that it's trivial.
> >
> >It's probably a lot of work. It's definitely more work than just abusing
> >sysfs.
> >
> >But it looks like a write system call into an eventfd is about 1.5
> >microseconds on my laptop. Even with a system call per packet, system
> >call overhead is not what makes DPDK drivers outperform Linux ones.
> >
> 
> 1.5 us = 0.6 Mpps per core limit.

Oh, I calculated it incorrectly. It's 0.15 us. So 6Mpps.
But for RX, you can batch a lot of packets.

You can see by now I'm not that good at benchmarking.
Here's what I wrote:


#include <sys/eventfd.h>
#include <stdint.h>
#include <unistd.h>
#include <stdio.h>


int main(int argc, char **argv)
{
	int e = eventfd(0, 0);
	uint64_t v = 1;

	int i;

	/* ~10 million writes; at ~1.5s total that is ~0.15us per call */
	for (i = 0; i < 10000000; ++i) {
		write(e, &v, sizeof v);
	}
}


This takes 1.5 seconds to run on my laptop:

$ time ./a.out 

real0m1.507s
user0m0.179s
sys 0m1.328s


> dpdk performance is in the tens of
> millions of packets per system.

I think that's with a bunch of batching though.

> It's not just the lack of system calls, of course, the architecture is
> completely different.

Absolutely - I'm not saying move all of DPDK into kernel.
We just need to protect the RX rings so hardware does
not corrupt kernel memory.


Thinking about it some more, many devices
have separate rings for DMA: TX (device reads memory)
and RX (device writes memory).
With such devices, a mode where userspace can write TX ring
but not RX ring might make sense.

This will mean userspace might read kernel memory
through the device, but can not corrupt it.

That's already a big win!

And RX buffers do not have to be added one at a time.
If we assume 0.2usec per system call, batching some 100 buffers per
system call gives you 2 nanoseconds overhead.  That seems quite
reasonable.







-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Avi Kivity


On 10/01/2015 01:24 PM, Avi Kivity wrote:
> On 10/01/2015 01:17 PM, Michael S. Tsirkin wrote:
>> On Thu, Oct 01, 2015 at 12:53:14PM +0300, Avi Kivity wrote:
>>> Non-virtualized setups have an iommu available, but they can also use
>>> pci_uio_generic without patching if they like.
>> Not with VFs, they can't.
>>
>
> They can and they do (I use it myself).

I mean with a PF.  Why use a VF on a non-virtualized system?


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Avi Kivity
On 10/01/2015 01:17 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 12:53:14PM +0300, Avi Kivity wrote:
>> Non-virtualized setups have an iommu available, but they can also use
>> pci_uio_generic without patching if they like.
> Not with VFs, they can't.
>

They can and they do (I use it myself).


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Avi Kivity


On 10/01/2015 01:14 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 12:43:53PM +0300, Avi Kivity wrote:
>>> There were some tentative attempts to get it for other (older) drivers, named
>>> 'bifurcated drivers', but it is stalled.
>> IIRC they still exposed the ring to userspace.
> How much would a ring write syscall cost? 1-2 microseconds, isn't it?
> Measureable, but it's not the end of the world.

Plus a page table walk per packet fragment (dpdk has the physical 
address prepared in the mbuf IIRC).  The 10Mpps+ users of dpdk should 
comment on whether the performance hit is acceptable (my use case is 
much more modest).

> ring read might be safe to allow.
> Should buy us enough time until hypervisors support IOMMU.

All the relevant drivers need to be converted to support ring 
translation and to expose the ring to userspace in the new API.  It 
shouldn't take more than 3-4 years.

Meanwhile, users of virtualized systems that need interrupt support 
cannot use their machines, while non-virtualized users are free to DMA 
wherever they like, in the name of security.

btw, an API like you describe already exists -- vhost.  Of course the 
virtio protocol is nowhere near fast enough, but at least it's an example.



[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 12:53:14PM +0300, Avi Kivity wrote:
> Non-virtualized setups have an iommu available, but they can also use
> pci_uio_generic without patching if they like.

Not with VFs, they can't.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 12:43:53PM +0300, Avi Kivity wrote:
> >There were some tentative attempts to get it for other (older) drivers, named
> >'bifurcated drivers', but it is stalled.
> 
> IIRC they still exposed the ring to userspace.

How much would a ring write syscall cost? 1-2 microseconds, isn't it?
Measurable, but it's not the end of the world.
ring read might be safe to allow.
Should buy us enough time until hypervisors support IOMMU.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Avi Kivity


On 10/01/2015 01:07 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 12:38:51PM +0300, Avi Kivity wrote:
>> The sad thing is that you can do this since forever on a non-virtualized
>> system, or on a virtualized system if you don't need interrupt support.  All
>> you're doing is blocking interrupt support on virtualized systems.
> True, Linux could do more to prevent this kind of abuse.
> In fact IIRC, if you enable secureboot, it does exactly that.
>
> A generic uio driver isn't a good interface because it relies on these
> sysfs files. We are lucky it doesn't work for VFs; I don't think we
> should do anything that relies on this interface in future applications.
>

I agree that uio is not a good solution.  But for some users, which we 
are discussing now, it is the only solution.

A bad solution is better than no solution.



[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 12:38:51PM +0300, Avi Kivity wrote:
> The sad thing is that you can do this since forever on a non-virtualized
> system, or on a virtualized system if you don't need interrupt support.  All
> you're doing is blocking interrupt support on virtualized systems.

True, Linux could do more to prevent this kind of abuse.
In fact IIRC, if you enable secureboot, it does exactly that.

A generic uio driver isn't a good interface because it relies on these
sysfs files. We are lucky it doesn't work for VFs; I don't think we
should do anything that relies on this interface in future applications.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Bruce Richardson
On Thu, Oct 01, 2015 at 02:23:17PM +0300, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 12:08:07PM +0100, Bruce Richardson wrote:
> > On Thu, Oct 01, 2015 at 01:38:37PM +0300, Michael S. Tsirkin wrote:
> > > On Thu, Oct 01, 2015 at 12:59:47PM +0300, Avi Kivity wrote:
> > > > 
> > > > 
> > > > On 10/01/2015 12:55 PM, Michael S. Tsirkin wrote:
> > > > >On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
> > > > >>It's easy to claim that
> > > > >>a solution is around the corner, only no one was looking for it, but 
> > > > >>the
> > > > >>reality is that kernel bypass has been a solution for years for high
> > > > >>performance users,
> > > > >I never said that it's trivial.
> > > > >
> > > > >It's probably a lot of work. It's definitely more work than just 
> > > > >abusing
> > > > >sysfs.
> > > > >
> > > > >But it looks like a write system call into an eventfd is about 1.5
> > > > >microseconds on my laptop. Even with a system call per packet, system
> > > > >call overhead is not what makes DPDK drivers outperform Linux ones.
> > > > >
> > > > 
> > > > 1.5 us = 0.6 Mpps per core limit.
> > > 
> > > Oh, I calculated it incorrectly. It's 0.15 us. So 6Mpps.
> > > But for RX, you can batch a lot of packets.
> > > 
> > > You can see by now I'm not that good at benchmarking.
> > > Here's what I wrote:
> > > 
> > > 
> > > #include <sys/eventfd.h>
> > > #include <stdint.h>
> > > #include <unistd.h>
> > > #include <stdio.h>
> > > 
> > > 
> > > int main(int argc, char **argv)
> > > {
> > > int e = eventfd(0, 0);
> > > uint64_t v = 1;
> > > 
> > > int i;
> > > 
> > > for (i = 0; i < 10000000; ++i) {
> > > write(e, &v, sizeof v);
> > > }
> > > }
> > > 
> > > 
> > > This takes 1.5 seconds to run on my laptop:
> > > 
> > > $ time ./a.out 
> > > 
> > > real0m1.507s
> > > user0m0.179s
> > > sys 0m1.328s
> > > 
> > > 
> > > > dpdk performance is in the tens of
> > > > millions of packets per system.
> > > 
> > > I think that's with a bunch of batching though.
> > > 
> > > > It's not just the lack of system calls, of course, the architecture is
> > > > completely different.
> > > 
> > > Absolutely - I'm not saying move all of DPDK into kernel.
> > > We just need to protect the RX rings so hardware does
> > > not corrupt kernel memory.
> > > 
> > > 
> > > Thinking about it some more, many devices
> > > have separate rings for DMA: TX (device reads memory)
> > > and RX (device writes memory).
> > > With such devices, a mode where userspace can write TX ring
> > > but not RX ring might make sense.
> > > 
> > > This will mean userspace might read kernel memory
> > > through the device, but can not corrupt it.
> > > 
> > > That's already a big win!
> > > 
> > > And RX buffers do not have to be added one at a time.
> > > If we assume 0.2usec per system call, batching some 100 buffers per
> > > system call gives you 2 nano seconds overhead.  That seems quite
> > > reasonable.
> > > 
> > Hi,
> > 
> > just to jump in a bit on this.
> > 
> > Batching of 100 packets is a very large batch, and will add to latency.
> 
> 
> 
> This is not on transmit or receive path!
> This is only for re-adding buffers to the receive ring.
> This batching should not add latency at all:
> 
> 
> process rx:
>   get packet
>   packets[n] = alloc packet
>   if (++n > 100) {
>   system call: add bufs(packets, n);
>   }
> 
> 
> 
> 
> 
> > The
> > standard batch size in DPDK right now is 32, and even that may be too high 
> > for
> > applications in certain domains.
> > 
> > However, even with that 2ns of overhead calculation, I'd make a few 
> > additional
> > points.
> > * For DPDK, we are reasonably close to being able to do 40GB of IO - both 
> > RX 
> > and TX on a single thread. 10GB of IO doesn't really stress a core any 
> > more. For
> > 40GB of small packet traffic, the packet arrival rate is 16.8ns, so even 
> > with a
> > huge batch size of 100 packets, your system call overhead on RX is taking 
> > almost
> > 12% of our processing time. For a batch size of 32 this overhead would rise 
> > to
> > over 35% of our packet processing time.
> 
> As I said, yes, measurable, but not breaking the bank, and that's with
> 40GB, which is still not widespread.
> With 10GB and 100 packets, only 3% overhead.
> 
> > For 100G line rate, the packet arrival
> > rate is just 6.7ns...
> 
> Hypervisors still have time to get their act together and support IOMMUs
> by the time 100G systems become widespread.
> 
> > * As well as this overhead from the system call itself, you are also 
> > omitting
> > the overhead of scanning the RX descriptors.
> 
> I omit it because scanning descriptors can still be done in userspace,
> just write-protect the RX ring page.
> 
> 
> > This in itself is going to use up
> > a good proportion of the processing time, as well as that we have to spend 
> > cycles
> > copying the descriptors from one ring in memory to another. Given that 
> > right now
> > with the vector ixgbe 
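
A minimal sketch of the "write-protect the RX ring page" idea quoted above,
as it might look in a UIO-style mmap hook: the driver only ever hands
userspace a read-only mapping of the RX descriptor ring, so the application
can still scan descriptors but cannot rewrite them. The driver wiring and the
ring_phys variable are hypothetical; error paths are omitted.

/* Sketch only: refuse writable mappings of the RX descriptor ring. */
#include <linux/errno.h>
#include <linux/mm.h>
#include <linux/uio_driver.h>

static phys_addr_t ring_phys;	/* physical address of the RX ring (hypothetical) */

static int rxring_mmap(struct uio_info *info, struct vm_area_struct *vma)
{
	unsigned long len = vma->vm_end - vma->vm_start;

	if (vma->vm_flags & VM_WRITE)
		return -EPERM;			/* no writable mapping of the ring */
	vma->vm_flags &= ~VM_MAYWRITE;		/* block a later mprotect(PROT_WRITE) */

	return remap_pfn_range(vma, vma->vm_start, ring_phys >> PAGE_SHIFT,
			       len, vma->vm_page_prot);
}

Userspace still sees the descriptors the device writes, which is enough for
the RX scan, while buffer re-posting goes through a kernel call as discussed
above.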

[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Avi Kivity


On 10/01/2015 12:55 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
>> It's easy to claim that
>> a solution is around the corner, only no one was looking for it, but the
>> reality is that kernel bypass has been a solution for years for high
>> performance users,
> I never said that it's trivial.
>
> It's probably a lot of work. It's definitely more work than just abusing
> sysfs.
>
> But it looks like a write system call into an eventfd is about 1.5
> microseconds on my laptop. Even with a system call per packet, system
> call overhead is not what makes DPDK drivers outperform Linux ones.
>

1.5 us = 0.6 Mpps per core limit.  dpdk performance is in the tens of 
millions of packets per system.

It's not just the lack of system calls, of course, the architecture is 
completely different.



[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
> It's easy to claim that
> a solution is around the corner, only no one was looking for it, but the
> reality is that kernel bypass has been a solution for years for high
> performance users,

I never said that it's trivial.

It's probably a lot of work. It's definitely more work than just abusing
sysfs.

But it looks like a write system call into an eventfd is about 1.5
microseconds on my laptop. Even with a system call per packet, system
call overhead is not what makes DPDK drivers outperform Linux ones.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Avi Kivity


On 10/01/2015 12:48 PM, Vincent JARDIN wrote:
> On 01/10/2015 11:43, Avi Kivity wrote:
>>
>> That is because the device itself contains an iommu.
>
> Yes.
>
> It could be an option:
>   - we could flag the Linux system unsafe when the device does not 
> have any IOMMU
>   - we flag the Linux system safe when the device has an IOMMU

This already exists, it's called the tainted flag.

I don't know if uio_pci_generic already taints the kernel; it certainly 
should with DMA-capable devices.
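
For reference, the kernel already has a mechanism for exactly this; a minimal
sketch of what tainting on bind could look like (the helper name is made up,
add_taint() and TAINT_USER are existing kernel interfaces, and whether
TAINT_USER or a dedicated flag is the right choice is a policy question):

#include <linux/kernel.h>
#include <linux/pci.h>

/* Sketch: record that a DMA-capable device was handed to userspace
 * without IOMMU protection, so later oopses are attributed correctly. */
static void uio_dma_taint(struct pci_dev *pdev)
{
	dev_warn(&pdev->dev,
		 "DMA-capable device bound to userspace driver without IOMMU, tainting kernel\n");
	add_taint(TAINT_USER, LOCKDEP_STILL_OK);
}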



[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Avi Kivity
On 10/01/2015 12:42 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
>> even when they are some users
>> prefer to avoid the performance penalty.
> I don't think there's a measurable penalty from passing through the
> IOMMU, as long as mappings are mostly static (i.e. iommu=pt).  I sure
> never saw any numbers that show such.
>

Maybe not.  But again, virtualized setups will not have a guest iommu 
and therefore can't use it; and those happen to be exactly the setups 
you're blocking.

Non-virtualized setups have an iommu available, but they can also use 
pci_uio_generic without patching if they like.

The virtualized setups have no other option; you're leaving them out in 
the cold.


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Avi Kivity


On 10/01/2015 12:42 PM, Vincent JARDIN wrote:
> On 01/10/2015 11:22, Avi Kivity wrote:
>>> As far as I could see, without this kind of motivation, people do not
>>> even want to try.
>>
>> You are mistaken.  The problem is a lot harder than you think.
>>
>> People didn't go and write userspace drivers because they were lazy.
>> They wrote them because there was no other way.
>
> I disagree, it is possible to write a 'partial' userspace driver.
>
> Here is an example:
>   http://dpdk.org/browse/dpdk/tree/drivers/net/mlx4
>
> It benefits from the kernel's capabilities while userland manages 
> only the I/O.
>

That is because the device itself contains an iommu.

> There were some tentative attempts to get it for other (older) drivers, 
> named 'bifurcated drivers', but it has stalled.

IIRC they still exposed the ring to userspace.


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
> even when they are some users
> prefer to avoid the performance penalty.

I don't think there's a measurable penalty from passing through the
IOMMU, as long as mappings are mostly static (i.e. iommu=pt).  I sure
never saw any numbers that show such.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 12:15:49PM +0300, Avi Kivity wrote:
> What userspace can't be allowed to do:
> 
> access BAR
> write rings
> 
> 
> 
> 
> It can access the BAR by mmap()ing the resourceN files under sysfs.  You're 
> not
> denying userspace the ability to oops the kernel, just the ability to do 
> useful
> things with hardware.


This interface has to stay there to support existing applications.  A
variety of measures (selinux, secureboot) can be used to make sure
modern ones do not get to touch it. Most distributions enable
some or all of these by default.

And it doesn't mean modern drivers can do this kind of thing.

Look, without an IOMMU, sysfs can not be used securely:
you need some other interface. This is what this driver is missing.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Avi Kivity


On 10/01/2015 12:15 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 11:52:26AM +0300, Avi Kivity wrote:
>> I still don't understand your objection to the patch:
>>
>>
>>  MSI messages are memory writes so any generic device capable
>>  of MSI is capable of corrupting kernel memory.
>>  This means that a bug in userspace will lead to kernel memory corruption
>>  and crashes.  This is something distributions can't support.
>>
>>
>> If a distribution feels it can't support this configuration, it can disable 
>> the
>> uio_pci_generic driver, or refuse to support tainted kernels.  If it feels it
>> can (and many distributions are starting to support dpdk), then you're just
>> denying it the ability to serve its users.
> I don't, and can't deny users anything.  I merely think upstream should
> avoid putting this driver in-tree.  By doing this, driver writers will
> be pushed to develop solutions that can't crash kernel.
>
> I pointed out one way to build it, there are sure to be more.

And I pointed out that your solution is unworkable.  It's easy to claim 
that a solution is around the corner, only no one was looking for it, 
but the reality is that kernel bypass has been a solution for years for 
high performance users, that it cannot be made safe without an iommu, 
and that iommus are not available everywhere; and even when they are, 
some users prefer to avoid the performance penalty.

> As far as I could see, without this kind of motivation, people do not
> even want to try.

You are mistaken.  The problem is a lot harder than you think.

People didn't go and write userspace drivers because they were lazy.  
They wrote them because there was no other way.




[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 11:44:28AM +0300, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 11:40:16PM +0300, Michael S. Tsirkin wrote:
> > > And for what, to prevent
> > > root from touching memory via dma that they can access in a million other
> > > ways?
> > 
> > So one can be reasonably sure a kernel oops is not a result of a
> > userspace bug.
> 
> Actually, I thought about this overnight, and  it should be possible to
> drive it securely from userspace, without hypervisor changes.
> 
> See
> 
> https://mid.gmane.org/20151001104505-mutt-send-email-mst at redhat.com

Ouch, looks like gmane doesn't do https. Sorry, this is the correct
link:

http://mid.gmane.org/20151001104505-mutt-send-email-mst at redhat.com

> 
> 
> > -- 
> > MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Avi Kivity


On 10/01/2015 11:52 AM, Avi Kivity wrote:
>
>
> On 10/01/2015 11:44 AM, Michael S. Tsirkin wrote:
>> On Wed, Sep 30, 2015 at 11:40:16PM +0300, Michael S. Tsirkin wrote:
 And for what, to prevent
 root from touching memory via dma that they can access in a million other
 ways?
>>> So one can be reasonably sure a kernel oops is not a result of a
>>> userspace bug.
>> Actually, I thought about this overnight, and  it should be possible to
>> drive it securely from userspace, without hypervisor changes.
>
> Also without the performance that was the whole reason for doing it 
> in userspace in the first place.
>
> I still don't understand your objection to the patch:
>
>> MSI messages are memory writes so any generic device capable
>> of MSI is capable of corrupting kernel memory.
>> This means that a bug in userspace will lead to kernel memory corruption
>> and crashes.  This is something distributions can't support.
>

And this:

> What userspace can't be allowed to do:
>
>   access BAR
>   write rings
>

It can access the BAR by mmap()ing the resourceN files under sysfs. 
You're not denying userspace the ability to oops the kernel, just the 
ability to do useful things with hardware.
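
To illustrate the point, this is roughly all it takes today with plain root
access; a sketch only - the PCI address, BAR size and register offset are
examples, and error handling is omitted:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* Example device; resource0 corresponds to BAR0. */
	int fd = open("/sys/bus/pci/devices/0000:00:04.0/resource0", O_RDWR);
	volatile uint32_t *bar;

	if (fd < 0)
		return 1;

	/* Map the first 4 KiB of the BAR; real code would use the BAR size. */
	bar = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (bar == MAP_FAILED)
		return 1;

	printf("register 0x0 reads 0x%x\n", (unsigned)bar[0]);
	munmap((void *)bar, 4096);
	close(fd);
	return 0;
}

The same mapping also lets a buggy or malicious process program DMA on the
device, which is the oops scenario being debated here.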





[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Thu, Oct 01, 2015 at 11:52:26AM +0300, Avi Kivity wrote:
> I still don't understand your objection to the patch:
> 
> 
> MSI messages are memory writes so any generic device capable
> of MSI is capable of corrupting kernel memory.
> This means that a bug in userspace will lead to kernel memory corruption
> and crashes.  This is something distributions can't support.
> 
> 
> If a distribution feels it can't support this configuration, it can disable 
> the
uio_pci_generic driver, or refuse to support tainted kernels.  If it feels it
> can (and many distributions are starting to support dpdk), then you're just
> denying it the ability to serve its users.

I don't, and can't deny users anything.  I merely think upstream should
avoid putting this driver in-tree.  By doing this, driver writers will
be pushed to develop solutions that can't crash kernel.

I pointed out one way to build it, there are sure to be more.

As far as I could see, without this kind of motivation, people do not
even want to try.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Bruce Richardson
On Thu, Oct 01, 2015 at 01:38:37PM +0300, Michael S. Tsirkin wrote:
> On Thu, Oct 01, 2015 at 12:59:47PM +0300, Avi Kivity wrote:
> > 
> > 
> > On 10/01/2015 12:55 PM, Michael S. Tsirkin wrote:
> > >On Thu, Oct 01, 2015 at 12:22:46PM +0300, Avi Kivity wrote:
> > >>It's easy to claim that
> > >>a solution is around the corner, only no one was looking for it, but the
> > >>reality is that kernel bypass has been a solution for years for high
> > >>performance users,
> > >I never said that it's trivial.
> > >
> > >It's probably a lot of work. It's definitely more work than just abusing
> > >sysfs.
> > >
> > >But it looks like a write system call into an eventfd is about 1.5
> > >microseconds on my laptop. Even with a system call per packet, system
> > >call overhead is not what makes DPDK drivers outperform Linux ones.
> > >
> > 
> > 1.5 us = 0.6 Mpps per core limit.
> 
> Oh, I calculated it incorrectly. It's 0.15 us. So 6Mpps.
> But for RX, you can batch a lot of packets.
> 
> You can see by now I'm not that good at benchmarking.
> Here's what I wrote:
> 
> 
> #include <sys/eventfd.h>
> #include <stdint.h>
> #include <unistd.h>
> 
> 
> int main(int argc, char **argv)
> {
> 	int e = eventfd(0, 0);
> 	uint64_t v = 1;
> 
> 	int i;
> 
> 	for (i = 0; i < 10000000; ++i) {
> 		write(e, &v, sizeof v);
> 	}
> }
> 
> 
> This takes 1.5 seconds to run on my laptop:
> 
> $ time ./a.out 
> 
> real0m1.507s
> user0m0.179s
> sys 0m1.328s
> 
> 
> > dpdk performance is in the tens of
> > millions of packets per system.
> 
> I think that's with a bunch of batching though.
> 
> > It's not just the lack of system calls, of course, the architecture is
> > completely different.
> 
> Absolutely - I'm not saying move all of DPDK into kernel.
> We just need to protect the RX rings so hardware does
> not corrupt kernel memory.
> 
> 
> Thinking about it some more, many devices
> have separate rings for DMA: TX (device reads memory)
> and RX (device writes memory).
> With such devices, a mode where userspace can write TX ring
> but not RX ring might make sense.
> 
> This will mean userspace might read kernel memory
> through the device, but can not corrupt it.
> 
> That's already a big win!
> 
> And RX buffers do not have to be added one at a time.
> If we assume 0.2usec per system call, batching some 100 buffers per
> system call gives you 2 nano seconds overhead.  That seems quite
> reasonable.
> 
Hi,

just to jump in a bit on this.

Batching of 100 packets is a very large batch, and will add to latency. The
standard batch size in DPDK right now is 32, and even that may be too high for
applications in certain domains.

However, even with that 2ns of overhead calculation, I'd make a few additional
points.
* For DPDK, we are reasonably close to being able to do 40GB of IO - both RX 
and TX on a single thread. 10GB of IO doesn't really stress a core any more. For
40GB of small packet traffic, the packet arrival rate is 16.8ns, so even with a
huge batch size of 100 packets, your system call overhead on RX is taking almost
12% of our processing time. For a batch size of 32 this overhead would rise to
over 35% of our packet processing time. For 100G line rate, the packet arrival
rate is just 6.7ns...

* As well as this overhead from the system call itself, you are also omitting
the overhead of scanning the RX descriptors. This in itself is going to use up
a good proportion of the processing time, as well as requiring us to spend
cycles copying the descriptors from one ring in memory to another. Given that
right now with the vector ixgbe driver, the cycle cost per packet of RX is just
a few dozen cycles on modern cores, every additional cycle (fraction of a
nanosecond) has an impact.

Regards,
/Bruce
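
To put some code behind the batched re-arm idea being debated above, here is
a userspace-side sketch. The add_rx_bufs() call is hypothetical - it stands
in for whatever kernel interface (syscall or ioctl) would accept a batch of
buffer addresses, validate them and write the RX descriptors itself - and the
batch size of 100 matches the number used in the discussion:

#include <stdint.h>

#define BATCH 100

/* Hypothetical kernel entry point: hand n buffer addresses to the kernel,
 * which owns the RX descriptor ring. */
extern int add_rx_bufs(int fd, const uint64_t *bufs, unsigned int n);

struct rearm_ctx {
	int fd;				/* device file descriptor (hypothetical) */
	uint64_t pending[BATCH];	/* buffers not yet handed to the ring */
	unsigned int n;
};

/* Called once per received packet from the polling RX loop. */
static inline int rearm_one(struct rearm_ctx *ctx, uint64_t new_buf)
{
	int rc = 0;

	ctx->pending[ctx->n++] = new_buf;
	if (ctx->n == BATCH) {
		rc = add_rx_bufs(ctx->fd, ctx->pending, ctx->n);
		ctx->n = 0;		/* one system call per BATCH packets */
	}
	return rc;
}

At 0.2 us per call and a batch of 100 this amortizes to the ~2 ns per packet
figure quoted above; Bruce's 12% and over-35% figures follow from dividing the
amortized per-packet cost (0.2 us / 100 ≈ 2 ns, 0.2 us / 32 ≈ 6.3 ns) by the
16.8 ns packet arrival time at 40G.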


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Avi Kivity


On 10/01/2015 11:44 AM, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 11:40:16PM +0300, Michael S. Tsirkin wrote:
>>> And for what, to prevent
>>> root from touching memory via dma that they can access in a million other
>>> ways?
>> So one can be reasonably sure a kernel oops is not a result of a
>> userspace bug.
> Actually, I thought about this overnight, and  it should be possible to
> drive it securely from userspace, without hypervisor changes.

Also without the performance that was the whole reason for doing it in 
userspace in the first place.

I still don't understand your objection to the patch:

> MSI messages are memory writes so any generic device capable
> of MSI is capable of corrupting kernel memory.
> This means that a bug in userspace will lead to kernel memory corruption
> and crashes.  This is something distributions can't support.

If a distribution feels it can't support this configuration, it can 
disable the uio_pci_generic driver, or refuse to support tainted 
kernels.  If it feels it can (and many distributions are starting to 
support dpdk), then you're just denying it the ability to serve its users.

> See
>
> https://mid.gmane.org/20151001104505-mutt-send-email-mst at redhat.com
>
>
>



[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Vincent JARDIN
On 01/10/2015 11:43, Avi Kivity wrote:
>
> That is because the device itself contains an iommu.

Yes.

It could be an option:
   - we could flag the Linux system unsafe when the device does not have 
any IOMMU
   - we flag the Linux system safe when the device has an IOMMU




[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Vlad Zolotarov


On 10/01/15 11:44, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 11:40:16PM +0300, Michael S. Tsirkin wrote:
>>> And for what, to prevent
>>> root from touching memory via dma that they can access in a million other
>>> ways?
>> So one can be reasonably sure a kernel oops is not a result of a
>> userspace bug.
> Actually, I thought about this overnight, and  it should be possible to
> drive it securely from userspace, without hypervisor changes.
>
> See
>
> https://mid.gmane.org/20151001104505-mutt-send-email-mst at redhat.com

Looks like a dead link.

>
>
>
>> -- 
>> MST



[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 11:40:16PM +0300, Michael S. Tsirkin wrote:
> > And for what, to prevent
> > root from touching memory via dma that they can access in a million other
> > ways?
> 
> So one can be reasonably sure a kernel oops is not a result of a
> userspace bug.

Actually, I thought about this overnight, and  it should be possible to
drive it securely from userspace, without hypervisor changes.

See

https://mid.gmane.org/20151001104505-mutt-send-email-mst at redhat.com



> -- 
> MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Vincent JARDIN
On 01/10/2015 11:22, Avi Kivity wrote:
>> As far as I could see, without this kind of motivation, people do not
>> even want to try.
>
> You are mistaken.  The problem is a lot harder than you think.
>
> People didn't go and write userspace drivers because they were lazy.
> They wrote them because there was no other way.

I disagree, it is possible to write a 'partial' userspace driver.

Here is an example:
   http://dpdk.org/browse/dpdk/tree/drivers/net/mlx4

It benefits from the kernel's capabilities while userland manages only 
the I/O.

There were some tentative attempts to get it for other (older) drivers, 
named 'bifurcated drivers', but it has stalled.

best regards,
   Vincent


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Gleb Natapov
On Wed, Sep 30, 2015 at 11:36:58PM +0300, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 11:00:49PM +0300, Gleb Natapov wrote:
> > > You are increasing interrupt latency by a huge factor by channeling
> > > interrupts through a scheduler.  Let user install an
> > > interrupt handler function, and be done with it.
> > > 
> > Interrupt latency is not always hugely important. If you enter interrupt
> > mode only when idle, a hundred more us on the first packet will not kill you.
> 
> It certainly affects worst-case latency.  And if you lower interrupt
> latency, you can go idle faster, so it affects power too.
> 
We are polling 100% of the time now. Going idle faster is the least of our concerns.

> > If
> > interrupt latency is important then uio may not be the right solution,
> > but then neither is vfio.
> 
> That's what I'm saying, if you don't need memory isolation you can do
> better than just slightly tweak existing drivers.
> 
No, you are forcing everyone to code in the kernel whether it makes
sense or not. You decide for everyone what is good for them. Believe me,
people here know about the trade-offs and have weighed them appropriately.

--
Gleb.


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Stephen Hemminger
On Thu, 1 Oct 2015 14:32:19 +0300
Avi Kivity  wrote:

> On 10/01/2015 02:27 PM, Michael S. Tsirkin wrote:
> > On Thu, Oct 01, 2015 at 02:20:37PM +0300, Avi Kivity wrote:
> >> People will just use out of tree drivers (dpdk has several already).  It's 
> >> a
> >> pain, but nowhere near what you are proposing.
> > What's the issue with that?
> 
> Out of tree drivers have to be compiled on the target system (cannot 
> ship a binary package), and occasionally break.
> 
> dkms helps with that, as do distributions that promise binary 
> compatibility, but it is still a pain, compared to just shipping a 
> userspace package.
> 
> >   We already agreed this kernel
> > is going to be tainted, and unsupportable.
> 
> Yes.  So your only motivation in rejecting the patch is to get the 
> author to write the ring translation patch and port it to all relevant 
> drivers instead?

The per-driver ring method is what netmap did.
The problem with that is that it forces infrastructure into already
complex network drivers. It was never accepted. There were also still
security issues, like time-of-check/time-of-use races with the ring.


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Stephen Hemminger
On Thu, 1 Oct 2015 13:14:08 +0300
"Michael S. Tsirkin"  wrote:

> On Thu, Oct 01, 2015 at 12:43:53PM +0300, Avi Kivity wrote:
> > >There were some tentative to get it for other (older) drivers, named
> > >'bifurcated drivers', but it is stalled.
> > 
> > IIRC they still exposed the ring to userspace.
> 
> How much would a ring write syscall cost? 1-2 microseconds, isn't it?

The per-packet budget at 10G is 62ns; a syscall just doesn't cut it.


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Stephen Hemminger
On Thu, 01 Oct 2015 11:42:23 +0200
Vincent JARDIN  wrote:

> On 01/10/2015 11:22, Avi Kivity wrote:
> >> As far as I could see, without this kind of motivation, people do not
> >> even want to try.
> >
> > You are mistaken.  The problem is a lot harder than you think.
> >
> > People didn't go and write userspace drivers because they were lazy.
> > They wrote them because there was no other way.
> 
> I disagree, it is possible to write a 'partial' userspace driver.
> 
> Here is an example:
>http://dpdk.org/browse/dpdk/tree/drivers/net/mlx4
> 
> It benefits from the kernel's capabilities while userland manages only 
> the I/O.
> 
> There were some tentative attempts to get it for other (older) drivers, 
> named 'bifurcated drivers', but it has stalled.
> 

And in our testing the mlx4 driver performance is terrible.
That may be because of the overhead of the InfiniBand library.


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Stephen Hemminger
On Thu, 1 Oct 2015 11:00:28 +0300
Vlad Zolotarov  wrote:

> 
> 
> On 10/01/15 00:36, Stephen Hemminger wrote:
> > On Wed, 30 Sep 2015 23:09:33 +0300
> > Vlad Zolotarov  wrote:
> >
> >>
> >> On 09/30/15 22:39, Michael S. Tsirkin wrote:
> >>> On Wed, Sep 30, 2015 at 10:06:52PM +0300, Vlad Zolotarov wrote:
> >> How would iommu
> >> virtualization change anything?
> > Kernel can use an iommu to limit device access to memory of
> > the controlling application.
>  Ok, this is obvious but what it has to do with enabling using MSI/MSI-X
>  interrupts support in uio_pci_generic? kernel may continue to limit the
>  above access with this support as well.
> >>> It could maybe. So if you write a patch to allow MSI by at the same time
> >>> creating an isolated IOMMU group and blocking DMA from device in
> >>> question anywhere, that sounds reasonable.
> >> No, I'm only planning to add MSI and MSI-X interrupts support for
> >> uio_pci_generic device.
> >> The rest mentioned above should naturally be a matter of a different
> >> patch and writing it is orthogonal to the patch I'm working on as has
> >> been extensively discussed in this thread.
> >>
> > I have a generic MSI and MSI-X driver (posted earlier on this list).
> > About to post to upstream kernel.
> 
> Stephen, hi!
> 
> I found the mentioned series and first thing I noticed was that it's 
> been sent in May so the first question is how far in your list of tasks 
> submitting it upstream is? We need it more or less yesterday and I'm 
> working on it right now. Therefore if u don't have time for it I'd like 
> to help... ;) However I'd like u to clarify a few small things. Pls., 
> see below...
> 
> I noticed that u've created a separate msi_msix driver and the second 
> question is what do u plan for the upstream? I was thinking of extending 
> the existing uio_pci_generic with the MSI-X functionality similar to 
> your code and preserving the INT#X functionality as it is now:

igb_uio has a bunch of other things I didn't want to deal with:
the name (being specific to an old Intel driver); compatibility with older
kernels; legacy ABI support. Therefore, in effect, uio_msi is a rebase
of igb_uio.

The submission upstream yesterday is the first step; I expect lots
of review feedback.

>   *   INT#X and MSI would provide the IRQ number to the UIO module while
> only MSI-X case would register with UIO_IRQ_CUSTOM.

I wanted all IRQs to be handled the same way by the driver, i.e. all go through
the eventfd mechanism. This makes the code on the DPDK side consistent, with fewer
special cases.
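
The application side of that eventfd path is tiny; a sketch (how the eventfd
is obtained from the driver is not shown and the function name is made up):

#include <stdint.h>
#include <unistd.h>

/* Block until the kernel-side interrupt handler signals the eventfd.
 * The returned count is the number of interrupts since the last read. */
int wait_for_irq(int efd)
{
	uint64_t count;

	if (read(efd, &count, sizeof count) != sizeof count)
		return -1;
	return (int)count;
}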

> I also noticed that u enable MSI-X on a first open() call. I assume 
> there was a good reason (that I miss) for not doing it in probe(). Could 
> u, pls., clarify?



[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Vlad Zolotarov


On 10/01/15 00:36, Stephen Hemminger wrote:
> On Wed, 30 Sep 2015 23:09:33 +0300
> Vlad Zolotarov  wrote:
>
>>
>> On 09/30/15 22:39, Michael S. Tsirkin wrote:
>>> On Wed, Sep 30, 2015 at 10:06:52PM +0300, Vlad Zolotarov wrote:
>> How would iommu
>> virtualization change anything?
> Kernel can use an iommu to limit device access to memory of
> the controlling application.
 Ok, this is obvious but what it has to do with enabling using MSI/MSI-X
 interrupts support in uio_pci_generic? kernel may continue to limit the
 above access with this support as well.
>>> It could maybe. So if you write a patch to allow MSI by at the same time
>>> creating an isolated IOMMU group and blocking DMA from device in
>>> question anywhere, that sounds reasonable.
>> No, I'm only planning to add MSI and MSI-X interrupts support for
>> uio_pci_generic device.
>> The rest mentioned above should naturally be a matter of a different
>> patch and writing it is orthogonal to the patch I'm working on as has
>> been extensively discussed in this thread.
>>
> I have a generic MSI and MSI-X driver (posted earlier on this list).
> About to post to upstream kernel.

Great! It would save me a few working days... ;) Thanks, Stephen!



[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 02:36:48PM -0700, Stephen Hemminger wrote:
> On Wed, 30 Sep 2015 23:09:33 +0300
> Vlad Zolotarov  wrote:
> 
> > 
> > 
> > On 09/30/15 22:39, Michael S. Tsirkin wrote:
> > > On Wed, Sep 30, 2015 at 10:06:52PM +0300, Vlad Zolotarov wrote:
> >  How would iommu
> >  virtualization change anything?
> > >>> Kernel can use an iommu to limit device access to memory of
> > >>> the controlling application.
> > >> Ok, this is obvious but what it has to do with enabling using MSI/MSI-X
> > >> interrupts support in uio_pci_generic? kernel may continue to limit the
> > >> above access with this support as well.
> > > It could maybe. So if you write a patch to allow MSI by at the same time
> > > creating an isolated IOMMU group and blocking DMA from device in
> > > question anywhere, that sounds reasonable.
> > 
> > No, I'm only planning to add MSI and MSI-X interrupts support for 
> > uio_pci_generic device.
> > The rest mentioned above should naturally be a matter of a different 
> > patch and writing it is orthogonal to the patch I'm working on as has 
> > been extensively discussed in this thread.
> > 
> > >
> > 
> 
> I have a generic MSI and MSI-X driver (posted earlier on this list).
> About to post to upstream kernel.

If Linux holds out and refuses to support insecure interfaces,
hypervisor vendors will add secure ones. If Linux lets them ignore guest
security, they will.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Avi Kivity
On 09/30/2015 11:40 PM, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 06:36:17PM +0300, Avi Kivity wrote:
>> As it happens, you're removing the functionality from the users who have no
>> other option.  They can't use vfio because it doesn't work on virtualized
>> setups.
> ...
>
>> Root can already do anything.
> I think there's a contradiction between the two claims above.

Yes, root can replace the current kernel with a patched kernel.  In that 
sense, root can do anything, and the kernel is complete.  Now let's stop 
playing word games.

>>   So what security issue is there?
> A buggy userspace can and will corrupt kernel memory.
>
> ...
>
>> And for what, to prevent
>> root from touching memory via dma that they can access in a million other
>> ways?
> So one can be reasonably sure a kernel oops is not a result of a
> userspace bug.
>

That's not security.  It's a legitimate concern though, one that is 
addressed by tainting the kernel.



[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 06:36:17PM +0300, Avi Kivity wrote:
> As it happens, you're removing the functionality from the users who have no
> other option.  They can't use vfio because it doesn't work on virtualized
> setups.

...

> Root can already do anything.

I think there's a contradiction between the two claims above.

>  So what security issue is there?

A buggy userspace can and will corrupt kernel memory.

...

> And for what, to prevent
> root from touching memory via dma that they can access in a million other
> ways?

So one can be reasonably sure a kernel oops is not a result of a
userspace bug.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 11:00:49PM +0300, Gleb Natapov wrote:
> > You are increasing interrupt latency by a huge factor by channeling
> > interrupts through a scheduler.  Let user install an
> > interrupt handler function, and be done with it.
> > 
> Interrupt latency is not always hugely important. If you enter interrupt
> mode only when idle, a hundred more us on the first packet will not kill you.

It certainly affects worst-case latency.  And if you lower interrupt
latency, you can go idle faster, so it affects power too.

> If
> interrupt latency is important then uio may not be the right solution,
> but then neither is vfio.

That's what I'm saying, if you don't need memory isolation you can do
better than just slightly tweak existing drivers.

> --
>   Gleb.


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Vlad Zolotarov


On 09/30/15 22:39, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 10:06:52PM +0300, Vlad Zolotarov wrote:
 How would iommu
 virtualization change anything?
>>> Kernel can use an iommu to limit device access to memory of
>>> the controlling application.
>> Ok, this is obvious but what it has to do with enabling using MSI/MSI-X
>> interrupts support in uio_pci_generic? kernel may continue to limit the
>> above access with this support as well.
> It could maybe. So if you write a patch to allow MSI by at the same time
> creating an isolated IOMMU group and blocking DMA from device in
> question anywhere, that sounds reasonable.

No, I'm only planning to add MSI and MSI-X interrupt support for 
the uio_pci_generic device.
The rest mentioned above should naturally be a matter of a different 
patch, and writing it is orthogonal to the patch I'm working on, as has 
been extensively discussed in this thread.

>
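
For orientation, the kernel side of such a patch would be shaped roughly like
the sketch below: enable a vector, hook it to a handler, and forward events to
userspace through UIO. Names are illustrative, only one vector is shown, and
cleanup, the plain-MSI case and error handling are omitted; this is not the
actual patch.

#include <linux/interrupt.h>
#include <linux/pci.h>
#include <linux/uio_driver.h>

struct sketch_uio_dev {
	struct uio_info info;
	struct msix_entry entry;
};

static irqreturn_t sketch_msix_handler(int irq, void *arg)
{
	struct sketch_uio_dev *udev = arg;

	uio_event_notify(&udev->info);		/* wake up the userspace reader */
	return IRQ_HANDLED;
}

static int sketch_enable_msix(struct pci_dev *pdev, struct sketch_uio_dev *udev)
{
	int err;

	udev->entry.entry = 0;			/* request vector 0 only */
	err = pci_enable_msix_range(pdev, &udev->entry, 1, 1);
	if (err < 0)
		return err;

	return request_irq(udev->entry.vector, sketch_msix_handler, 0,
			   "uio_msix_sketch", udev);
}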



[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-10-01 Thread Gleb Natapov
On Wed, Sep 30, 2015 at 09:50:08PM +0300, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 10:43:04AM -0700, Stephen Hemminger wrote:
> > On Wed, 30 Sep 2015 20:39:43 +0300
> > "Michael S. Tsirkin"  wrote:
> > 
> > > On Wed, Sep 30, 2015 at 10:28:07AM -0700, Stephen Hemminger wrote:
> > > > On Wed, 30 Sep 2015 13:37:22 +0300
> > > > Vlad Zolotarov  wrote:
> > > > 
> > > > > 
> > > > > 
> > > > > On 09/30/15 00:49, Michael S. Tsirkin wrote:
> > > > > > On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote:
> > > > > >> On Tue, 29 Sep 2015 23:54:54 +0300
> > > > > >> "Michael S. Tsirkin"  wrote:
> > > > > >>
> > > > > >>> On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
> > > > >  The security breach motivation u brought in "[RFC PATCH] uio:
> > > > >  uio_pci_generic: Add support for MSI interrupts" thread seems a 
> > > > >  bit weak
> > > > >  since once u let the userland access the bar it may do any 
> > > > >  funny thing
> > > > >  using the DMA engine of the device. This kind of stuff should be 
> > > > >  prevented
> > > > >  using the iommu and if it's enabled then any funny tricks using 
> > > > >  MSI/MSI-X
> > > > >  configuration will be prevented too.
> > > > > 
> > > > >  I'm about to send the patch to main Linux mailing list. Let's 
> > > > >  continue this
> > > > >  discussion there.
> > > > > 
> > > > > >>> Basically UIO shouldn't be used with devices capable of DMA.
> > > > > >>> Use VFIO for that (yes, this implies an emulated or PV IOMMU).
> > > > > 
> > > > > If there is an IOMMU in the picture there shouldn't be any problem to 
> > > > > use UIO with DMA capable devices.
> > > > > 
> > > > > >>> I don't think this can change.
> > > > > >> Given there is no PV IOMMU and even if there was it would be too 
> > > > > >> slow for DPDK
> > > > > >> use, I can't accept that.
> > > > > > QEMU does allow emulating an iommu.
> > > > > 
> > > > > Amazon's EC2 xen HV doesn't. At least today. Therefore VFIO is not an 
> > > > > option there. And again, it's a general issue not DPDK specific.
> > > > > Today one has to develop some proprietary modules (like igb_uio) to 
> > > > > workaround the issue and this is lame. IMHO uio_pci_generic should
> > > > > be fixed to be able to properly work within any virtualized 
> > > > > environment 
> > > > > and not only with KVM.
> > > > > 
> > > > 
> > > > Also VMware (bigger problem) has no IOMMU emulation.
> > > > Other environments as well (Windriver, GCE) have no IOMMU.
> > > 
> > > Because the use-case of userspace drivers is not important enough?
> > > Without an IOMMU, there's no way to have secure userspace drivers.
> > 
> > Look at Cloudius, there is no necessity of security in guest.
> 
> It's an interesting concept, isn't it?
> 
It is.

> So why not do what Cloudius does, and run this task code in ring 0 then,
> allocating all memory in the kernel range?
> 
Except this is not what Cloudius does. The idea of OSv is that it can
run your regular userspace application, but remove unneeded level of
indirection by bypassing userspace/kernelspace communication (among
other things).  Application still uses virtual, not directly mapped
physical memory like Linux ring 0 has.

You can achieve most of the benefits of kernel bypass on Linux too, but
unlike OSv you need to code for it. UIO is one of those things that
allows that.

> You are increasing interrupt latency by a huge factor by channeling
> interrupts through a scheduler.  Let user install an
> interrupt handler function, and be done with it.
> 
Interrupt latency is not always hugely important. If you enter interrupt
mode only when idle, a hundred more us on the first packet will not kill you. If
interrupt latency is important then uio may not be the right solution,
but then neither is vfio.

--
Gleb.


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 10:06:52PM +0300, Vlad Zolotarov wrote:
> >>How would iommu
> >>virtualization change anything?
> >Kernel can use an iommu to limit device access to memory of
> >the controlling application.
> 
> Ok, this is obvious but what it has to do with enabling using MSI/MSI-X
> interrupts support in uio_pci_generic? kernel may continue to limit the
> above access with this support as well.

It could maybe. So if you write a patch to allow MSI by at the same time
creating an isolated IOMMU group and blocking DMA from device in
question anywhere, that sounds reasonable.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Vlad Zolotarov


On 09/30/15 22:10, Vlad Zolotarov wrote:
>
>
> On 09/30/15 22:06, Vlad Zolotarov wrote:
>>
>>
>> On 09/30/15 21:55, Michael S. Tsirkin wrote:
>>> On Wed, Sep 30, 2015 at 09:15:56PM +0300, Vlad Zolotarov wrote:

 On 09/30/15 18:26, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 03:50:09PM +0300, Vlad Zolotarov wrote:
>> How not virtualizing iommu forces "all or nothing" approach?
> Looks like you can't limit an assigned device to only access part of
> guest memory that belongs to a given process.  Either let it 
> access all
> of guest memory ("all") or don't assign the device ("nothing").
 Ok. A question then: can u limit the assigned device to only access 
 part of
 the guest memory even if iommu was virtualized?
>>> That's exactly what an iommu does - limit the device io access to 
>>> memory.
>>
>> If it does - it will continue to do so with or without the patch and 
>> if it doesn't (for any reason) it won't do it even without the patch.
>> So, again, the above (rhetorical) question stands. ;)
>>
>> I think Avi has already explained quite in detail why security is 
>> absolutely a non issue in regard to this patch or in regard to UIO in 
>> general. Security has to be enforced by some other means like iommu.
>>
>>>
 How would iommu
 virtualization change anything?
>>> Kernel can use an iommu to limit device access to memory of
>>> the controlling application.
>>
>> Ok, this is obvious but what it has to do with enabling using 
>> MSI/MSI-X interrupts support in uio_pci_generic? kernel may continue 
>> to limit the above access with this support as well.
>>
>>>
 And why do we care about an assigned device
 to be able to access all Guest memory?
>>> Because we want to be reasonably sure a kernel memory corruption
>>> is not a result of a bug in a userspace application.
>>
>> Corrupting Guest's memory due to any SW misbehavior (including bugs) 
>> is a non-issue by design - this is what HV and Guest machines were 
>> invented for. So, like Avi also said, instead of trying to enforce 
>> nobody cares about 
>
> Let me rephrase: by pretending to enforce some security promise that u 
> don't actually fulfill... ;)

...the promise nobody really cares about...

>
>> we'd rather make the developers life easier instead (by applying the 
>> not-yet-completed patch I'm working on).
>>>
>>
>



[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Vlad Zolotarov


On 09/30/15 22:06, Vlad Zolotarov wrote:
>
>
> On 09/30/15 21:55, Michael S. Tsirkin wrote:
>> On Wed, Sep 30, 2015 at 09:15:56PM +0300, Vlad Zolotarov wrote:
>>>
>>> On 09/30/15 18:26, Michael S. Tsirkin wrote:
 On Wed, Sep 30, 2015 at 03:50:09PM +0300, Vlad Zolotarov wrote:
> How not virtualizing iommu forces "all or nothing" approach?
 Looks like you can't limit an assigned device to only access part of
 guest memory that belongs to a given process.  Either let it access 
 all
 of guest memory ("all") or don't assign the device ("nothing").
>>> Ok. A question then: can u limit the assigned device to only access 
>>> part of
>>> the guest memory even if iommu was virtualized?
>> That's exactly what an iommu does - limit the device io access to 
>> memory.
>
> If it does - it will continue to do so with or without the patch and 
> if it doesn't (for any reason) it won't do it even without the patch.
> So, again, the above (rhetorical) question stands. ;)
>
> I think Avi has already explained quite in detail why security is 
> absolutely a non issue in regard to this patch or in regard to UIO in 
> general. Security has to be enforced by some other  means like iommu.
>
>>
>>> How would iommu
>>> virtualization change anything?
>> Kernel can use an iommu to limit device access to memory of
>> the controlling application.
>
> Ok, this is obvious but what it has to do with enabling using 
> MSI/MSI-X interrupts support in uio_pci_generic? kernel may continue 
> to limit the above access with this support as well.
>
>>
>>> And why do we care about an assigned device
>>> to be able to access all Guest memory?
>> Because we want to be reasonably sure a kernel memory corruption
>> is not a result of a bug in a userspace application.
>
> Corrupting Guest's memory due to any SW misbehavior (including bugs) 
> is a non-issue by design - this is what HV and Guest machines were 
> invented for. So, like Avi also said, instead of trying to enforce 
> nobody cares about 

Let me rephrase: by pretending to enforce some security promise that u 
don't actually fulfill... ;)

> we'd rather make the developers life easier instead (by applying the 
> not-yet-completed patch I'm working on).
>>
>



[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Vlad Zolotarov


On 09/30/15 21:55, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 09:15:56PM +0300, Vlad Zolotarov wrote:
>>
>> On 09/30/15 18:26, Michael S. Tsirkin wrote:
>>> On Wed, Sep 30, 2015 at 03:50:09PM +0300, Vlad Zolotarov wrote:
 How not virtualizing iommu forces "all or nothing" approach?
>>> Looks like you can't limit an assigned device to only access part of
>>> guest memory that belongs to a given process.  Either let it access all
>>> of guest memory ("all") or don't assign the device ("nothing").
>> Ok. A question then: can u limit the assigned device to only access part of
>> the guest memory even if iommu was virtualized?
> That's exactly what an iommu does - limit the device io access to memory.

If it does - it will continue to do so with or without the patch and if 
it doesn't (for any reason) it won't do it even without the patch.
So, again, the above (rhetorical) question stands. ;)

I think Avi has already explained quite in detail why security is 
absolutely a non issue in regard to this patch or in regard to UIO in 
general. Security has to be enforced by some other  means like iommu.

>
>> How would iommu
>> virtualization change anything?
> Kernel can use an iommu to limit device access to memory of
> the controlling application.

Ok, this is obvious but what it has to do with enabling using MSI/MSI-X 
interrupts support in uio_pci_generic? kernel may continue to limit the 
above access with this support as well.

>
>> And why do we care about an assigned device
>> to be able to access all Guest memory?
> Because we want to be reasonably sure a kernel memory corruption
> is not a result of a bug in a userspace application.

Corrupting Guest's memory due to any SW misbehavior (including bugs) is 
a non-issue by design - this is what HV and Guest machines were invented 
for. So, like Avi also said, instead of trying to enforce something nobody 
cares about, we'd rather make the developers' lives easier (by applying 
the not-yet-completed patch I'm working on).
>



[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 09:15:56PM +0300, Vlad Zolotarov wrote:
> 
> 
> On 09/30/15 18:26, Michael S. Tsirkin wrote:
> >On Wed, Sep 30, 2015 at 03:50:09PM +0300, Vlad Zolotarov wrote:
> >>How not virtualizing iommu forces "all or nothing" approach?
> >Looks like you can't limit an assigned device to only access part of
> >guest memory that belongs to a given process.  Either let it access all
> >of guest memory ("all") or don't assign the device ("nothing").
> 
> Ok. A question then: can u limit the assigned device to only access part of
> the guest memory even if iommu was virtualized?

That's exactly what an iommu does - limit the device io access to memory.

> How would iommu
> virtualization change anything?

Kernel can use an iommu to limit device access to memory of
the controlling application.

> And why do we care about an assigned device
> to be able to access all Guest memory?

Because we want to be reasonably sure a kernel memory corruption
is not a result of a bug in a userspace application.

-- 
MST
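
For contrast, this is what "use an iommu to limit device access" looks like
through VFIO once a (possibly paravirtual) IOMMU is actually exposed; a
sketch only - the group number and buffer size are examples, and device setup
via VFIO_GROUP_GET_DEVICE_FD and all error handling are omitted:

#include <fcntl.h>
#include <linux/vfio.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

int main(void)
{
	int container = open("/dev/vfio/vfio", O_RDWR);
	int group = open("/dev/vfio/26", O_RDWR);	/* example IOMMU group */
	struct vfio_iommu_type1_dma_map map;
	void *buf;

	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

	buf = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	memset(&map, 0, sizeof map);
	map.argsz = sizeof map;
	map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
	map.vaddr = (uintptr_t)buf;
	map.iova  = 0;			/* address the device will use */
	map.size  = 1 << 20;
	ioctl(container, VFIO_IOMMU_MAP_DMA, &map);

	/* The device can now DMA only into this one 1 MiB region. */
	return 0;
}

Without an emulated or paravirtual IOMMU in the hypervisor, none of this is
available to the guest, which is the gap the whole thread is about.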


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 10:43:04AM -0700, Stephen Hemminger wrote:
> On Wed, 30 Sep 2015 20:39:43 +0300
> "Michael S. Tsirkin"  wrote:
> 
> > On Wed, Sep 30, 2015 at 10:28:07AM -0700, Stephen Hemminger wrote:
> > > On Wed, 30 Sep 2015 13:37:22 +0300
> > > Vlad Zolotarov  wrote:
> > > 
> > > > 
> > > > 
> > > > On 09/30/15 00:49, Michael S. Tsirkin wrote:
> > > > > On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote:
> > > > >> On Tue, 29 Sep 2015 23:54:54 +0300
> > > > >> "Michael S. Tsirkin"  wrote:
> > > > >>
> > > > >>> On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
> > > >  The security breach motivation u brought in "[RFC PATCH] uio:
> > > >  uio_pci_generic: Add support for MSI interrupts" thread seems a 
> > > >  bit weak
> > > >  since once u let the userland access the bar it may do any funny 
> > > >  thing
> > > >  using the DMA engine of the device. This kind of stuff should be 
> > > >  prevented
> > > >  using the iommu and if it's enabled then any funny tricks using 
> > > >  MSI/MSI-X
> > > >  configuration will be prevented too.
> > > > 
> > > >  I'm about to send the patch to main Linux mailing list. Let's 
> > > >  continue this
> > > >  discussion there.
> > > > 
> > > > >>> Basically UIO shouldn't be used with devices capable of DMA.
> > > > >>> Use VFIO for that (yes, this implies an emulated or PV IOMMU).
> > > > 
> > > > If there is an IOMMU in the picture there shouldn't be any problem to 
> > > > use UIO with DMA capable devices.
> > > > 
> > > > >>> I don't think this can change.
> > > > >> Given there is no PV IOMMU and even if there was it would be too 
> > > > >> slow for DPDK
> > > > >> use, I can't accept that.
> > > > > QEMU does allow emulating an iommu.
> > > > 
> > > > Amazon's EC2 xen HV doesn't. At least today. Therefore VFIO is not an 
> > > > option there. And again, it's a general issue not DPDK specific.
> > > > Today one has to develop some proprietary modules (like igb_uio) to 
> > > > workaround the issue and this is lame. IMHO uio_pci_generic should
> > > > be fixed to be able to properly work within any virtualized environment 
> > > > and not only with KVM.
> > > > 
> > > 
> > > Also VMware (bigger problem) has no IOMMU emulation.
> > > Other environments as well (Windriver, GCE) have no IOMMU.
> > 
> > Because the use-case of userspace drivers is not important enough?
> > Without an IOMMU, there's no way to have secure userspace drivers.
> 
> Look at Cloudius, there is no necessity of security in guest.

It's an interesting concept, isn't it?

So why not do what Cloudius does, and run this task code in ring 0 then,
allocating all memory in the kernel range?

You are increasing interrupt latency by a huge factor by channeling
interrupts through a scheduler.  Let user install an
interrupt handler function, and be done with it.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Vlad Zolotarov


On 09/30/15 18:26, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 03:50:09PM +0300, Vlad Zolotarov wrote:
>> How not virtualizing iommu forces "all or nothing" approach?
> Looks like you can't limit an assigned device to only access part of
> guest memory that belongs to a given process.  Either let it access all
> of guest memory ("all") or don't assign the device ("nothing").

Ok. A question then: can u limit the assigned device to only access part 
of the guest memory even if the iommu was virtualized? How would iommu 
virtualization change anything? And why do we care about an assigned 
device being able to access all Guest memory?

>



[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Gleb Natapov
On Wed, Sep 30, 2015 at 08:39:43PM +0300, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 10:28:07AM -0700, Stephen Hemminger wrote:
> > On Wed, 30 Sep 2015 13:37:22 +0300
> > Vlad Zolotarov  wrote:
> > 
> > > 
> > > 
> > > On 09/30/15 00:49, Michael S. Tsirkin wrote:
> > > > On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote:
> > > >> On Tue, 29 Sep 2015 23:54:54 +0300
> > > >> "Michael S. Tsirkin"  wrote:
> > > >>
> > > >>> On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
> > >  The security breach motivation u brought in "[RFC PATCH] uio:
> > >  uio_pci_generic: Add support for MSI interrupts" thread seems a bit 
> > >  weak
> > >  since once u let the userland access the bar it may do any funny 
> > >  thing
> > >  using the DMA engine of the device. This kind of stuff should be 
> > >  prevented
> > >  using the iommu and if it's enabled then any funny tricks using 
> > >  MSI/MSI-X
> > >  configuration will be prevented too.
> > > 
> > >  I'm about to send the patch to main Linux mailing list. Let's 
> > >  continue this
> > >  discussion there.
> > > 
> > > >>> Basically UIO shouldn't be used with devices capable of DMA.
> > > >>> Use VFIO for that (yes, this implies an emulated or PV IOMMU).
> > > 
> > > If there is an IOMMU in the picture there shouldn't be any problem to 
> > > use UIO with DMA capable devices.
> > > 
> > > >>> I don't think this can change.
> > > >> Given there is no PV IOMMU and even if there was it would be too slow 
> > > >> for DPDK
> > > >> use, I can't accept that.
> > > > QEMU does allow emulating an iommu.
> > > 
> > > Amazon's EC2 xen HV doesn't. At least today. Therefore VFIO is not an 
> > > option there. And again, it's a general issue not DPDK specific.
> > > Today one has to develop some proprietary modules (like igb_uio) to 
> > > workaround the issue and this is lame. IMHO uio_pci_generic should
> > > be fixed to be able to properly work within any virtualized environment 
> > > and not only with KVM.
> > > 
> > 
> > Also VMware (bigger problem) has no IOMMU emulation.
> > Other environments as well (Windriver, GCE) have no IOMMU.
> 
> Because the use-case of userspace drivers is not important enough?
Because "secure" userspace drivers are not important enough.

> Without an IOMMU, there's no way to have secure userspace drivers.
> 
People use VMs as application containers, not as machines that need
to be secured for a multiuser scenario.

--
Gleb.


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 10:28:07AM -0700, Stephen Hemminger wrote:
> On Wed, 30 Sep 2015 13:37:22 +0300
> Vlad Zolotarov  wrote:
> 
> > 
> > 
> > On 09/30/15 00:49, Michael S. Tsirkin wrote:
> > > On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote:
> > >> On Tue, 29 Sep 2015 23:54:54 +0300
> > >> "Michael S. Tsirkin"  wrote:
> > >>
> > >>> On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
> >  The security breach motivation u brought in "[RFC PATCH] uio:
> >  uio_pci_generic: Add support for MSI interrupts" thread seems a bit 
> >  weak
> >  since once u let the userland access the bar it may do any funny 
> >  thing
> >  using the DMA engine of the device. This kind of stuff should be 
> >  prevented
> >  using the iommu and if it's enabled then any funny tricks using 
> >  MSI/MSI-X
> >  configuration will be prevented too.
> > 
> >  I'm about to send the patch to main Linux mailing list. Let's continue 
> >  this
> >  discussion there.
> > 
> > >>> Basically UIO shouldn't be used with devices capable of DMA.
> > >>> Use VFIO for that (yes, this implies an emulated or PV IOMMU).
> > 
> > If there is an IOMMU in the picture there shouldn't be any problem to 
> > use UIO with DMA capable devices.
> > 
> > >>> I don't think this can change.
> > >> Given there is no PV IOMMU and even if there was it would be too slow 
> > >> for DPDK
> > >> use, I can't accept that.
> > > QEMU does allow emulating an iommu.
> > 
> > Amazon's EC2 xen HV doesn't. At least today. Therefore VFIO is not an 
> > option there. And again, it's a general issue not DPDK specific.
> > Today one has to develop some proprietary modules (like igb_uio) to 
> > workaround the issue and this is lame. IMHO uio_pci_generic should
> > be fixed to be able to properly work within any virtualized environment 
> > and not only with KVM.
> > 
> 
> Also VMware (bigger problem) has no IOMMU emulation.
> Other environments as well (Windriver, GCE) have no IOMMU.

Because the use-case of userspace drivers is not important enough?
Without an IOMMU, there's no way to have secure userspace drivers.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Avi Kivity
On 09/30/2015 06:21 PM, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 05:53:54PM +0300, Avi Kivity wrote:
>> On 09/30/2015 05:39 PM, Michael S. Tsirkin wrote:
>>> On Wed, Sep 30, 2015 at 04:05:40PM +0300, Avi Kivity wrote:
 On 09/30/2015 03:27 PM, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 03:16:04PM +0300, Vlad Zolotarov wrote:
>> On 09/30/15 15:03, Michael S. Tsirkin wrote:
>>> On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote:
 On 09/30/15 14:41, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
>> The whole idea is to bypass kernel. Especially for networking...
> ... on dumb hardware that doesn't support doing that securely.
 On a very capable HW that supports whatever security requirements 
 needed
 (e.g. 82599 Intel's SR-IOV VF devices).
>>> Network card type is irrelevant as long as you do not have an IOMMU,
>>> otherwise you would just use e.g. VFIO.
>> Sorry, but I don't follow your logic here - Amazon EC2 environment is a
>> example where there *is* iommu but it's not virtualized
>> and thus VFIO is
>> useless and there is an option to use directly assigned SR-IOV networking
>> device there where using the kernel drivers impose a performance impact
>> compared to user space UIO-based user space kernel bypass mode of usage. 
>> How
>> is it irrelevant? Could u, pls, clarify your point?
>>
> So it's not even dumb hardware, it's another piece of software
> that forces an "all or nothing" approach where either
> device has access to all VM memory, or none.
> And this, unfortunately, leaves you with no secure way to
> allow userspace drivers.
 Some setups don't need security (they are single-user, single application).
 But do need a lot of performance (like 5X-10X performance).  An example is
 OpenVSwitch, security doesn't help it at all and if you force it to use the
 kernel drivers you cripple it.
>>> We'd have to see there are actual users that need this.  So far, dpdk
>>> seems like the only one,
>> dpdk is a whole class of users.  It's not a specific application.
>>
>>>   and it wants to use UIO for slow path stuff
>>> like polling link status.  Why this needs kernel bypass support, I don't
>>> know.  I asked, and got no answer.
>> First, it's more than link status.  dpdk also has an interrupt mode, which
>> applications can fall back to when the load is light in order to save
>> power (and in order not to get support calls about 100% cpu when idle).
> Aha, looks like it appeared in June. Interesting, thanks for the info.
>
>> Even for link status, you don't want to poll for that, because accessing
>> device registers is expensive.  An interrupt is the best approach for rare
>> events like link changed.
> Yea, but you probably can get by with a timer for that, even if it's ugly.

Maybe you can, but (a) why increase link status change detection latency, 
and (b) link status change detection has not been the only user of the 
feature since June.

 Also, I'm root.  I can do anything I like, including loading a patched
 pci_uio_generic.  You're not providing _any_ security, you're simply making
 life harder for users.
>>> Maybe that's true on your system. But I guess you know that's not true
>>> for everyone, not in 2015.
>> Why is it not true?  if I'm root, I can do anything I like to my
>> system, and everyone is root in 2015.  I can access the BARs directly
>> and program DMA, how am I more secure by uio not allowing me to setup
>> msix?
> That's not the point.  The point always was that using uio for these
> devices (capable of DMA, in particular of msix) isn't possible in a
> secure way.

uio is used today for DMA-capable devices.  Some users are perfectly 
willing to give up security for functionality (that's all users who have 
root access to their machines, not just uio users).  You aren't adding 
any security by disallowing uio, you're just removing functionality.

As it happens, you're removing the functionality from the users who have 
no other option.  They can't use vfio because it doesn't work on 
virtualized setups.

(note even on a setup that does support vfio, high performance users 
will want to avoid it).

>   And yes, if same device happens to also do interrupts, UIO
> does not reject it as it probably should, and we can't change this
> without breaking some working setups.  But this doesn't mean we should
> add more setups like this that we'll then be forced to maintain.

pci_uio_generic is maybe the driver with the lowest maintenance burden 
in the entire kernel.  One driver supporting all pci devices, if you 
don't need msi/msix.  And with the patch, it will be one driver 
supporting all pci devices, msi/msix included.

I don't really understand the tradeoff.  By rejecting the patch you're 
denying users the ability to use their devices, except through the much 
slower kernel drivers.

[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 03:50:09PM +0300, Vlad Zolotarov wrote:
> How not virtualizing iommu forces "all or nothing" approach?

Looks like you can't limit an assigned device to only access part of
guest memory that belongs to a given process.  Either let it access all
of guest memory ("all") or don't assign the device ("nothing").
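
For contrast, that is roughly what VFIO plus an IOMMU buys you: the
controlling process exposes only the buffers it chooses to the device.
A rough sketch of the type1 flow, following Documentation/vfio.txt;
the group number 26 and the 1 MB buffer are placeholders only:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

int main(void)
{
	struct vfio_group_status gstatus = { .argsz = sizeof(gstatus) };
	struct vfio_iommu_type1_dma_map map = { .argsz = sizeof(map) };
	int container, group;
	void *buf;

	/* One container == one IOMMU address space. */
	container = open("/dev/vfio/vfio", O_RDWR);

	/* Group 26 is an assumption; the real number comes from
	 * /sys/bus/pci/devices/<BDF>/iommu_group. */
	group = open("/dev/vfio/26", O_RDWR);
	ioctl(group, VFIO_GROUP_GET_STATUS, &gstatus);
	if (!(gstatus.flags & VFIO_GROUP_FLAGS_VIABLE))
		return 1;	/* not all group members bound to vfio */

	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

	/* Expose exactly one 1 MB buffer to the device, nothing else. */
	buf = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	map.vaddr = (unsigned long)buf;
	map.iova  = 0;			/* device-visible address */
	map.size  = 1 << 20;
	map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
	if (ioctl(container, VFIO_IOMMU_MAP_DMA, &map))
		perror("VFIO_IOMMU_MAP_DMA");
	return 0;
}

Without the IOMMU there is simply no place to hang such a per-buffer
mapping, hence "all or nothing".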

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 05:53:54PM +0300, Avi Kivity wrote:
> On 09/30/2015 05:39 PM, Michael S. Tsirkin wrote:
> >On Wed, Sep 30, 2015 at 04:05:40PM +0300, Avi Kivity wrote:
> >>
> >>On 09/30/2015 03:27 PM, Michael S. Tsirkin wrote:
> >>>On Wed, Sep 30, 2015 at 03:16:04PM +0300, Vlad Zolotarov wrote:
> On 09/30/15 15:03, Michael S. Tsirkin wrote:
> >On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote:
> >>On 09/30/15 14:41, Michael S. Tsirkin wrote:
> >>>On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
> The whole idea is to bypass kernel. Especially for networking...
> >>>... on dumb hardware that doesn't support doing that securely.
> >>On a very capable HW that supports whatever security requirements needed
> >>(e.g. 82599 Intel's SR-IOV VF devices).
> >Network card type is irrelevant as long as you do not have an IOMMU,
> >otherwise you would just use e.g. VFIO.
> Sorry, but I don't follow your logic here - Amazon EC2 environment is a
> example where there *is* iommu but it's not virtualized
> and thus VFIO is
> useless and there is an option to use directly assigned SR-IOV networking
> device there where using the kernel drivers impose a performance impact
> compared to user space UIO-based user space kernel bypass mode of usage. 
> How
> is it irrelevant? Could u, pls, clarify your point?
> 
> >>>So it's not even dumb hardware, it's another piece of software
> >>>that forces an "all or nothing" approach where either
> >>>device has access to all VM memory, or none.
> >>>And this, unfortunately, leaves you with no secure way to
> >>>allow userspace drivers.
> >>Some setups don't need security (they are single-user, single application).
> >>But do need a lot of performance (like 5X-10X performance).  An example is
> >>OpenVSwitch, security doesn't help it at all and if you force it to use the
> >>kernel drivers you cripple it.
> >We'd have to see there are actual users that need this.  So far, dpdk
> >seems like the only one,
> 
> dpdk is a whole class of users.  It's not a specific application.
> 
> >  and it wants to use UIO for slow path stuff
> >like polling link status.  Why this needs kernel bypass support, I don't
> >know.  I asked, and got no answer.
> 
> First, it's more than link status.  dpdk also has an interrupt mode, which
> applications can fall back to when the load is light in order to save
> power (and in order not to get support calls about 100% cpu when idle).

Aha, looks like it appeared in June. Interesting, thanks for the info.

> Even for link status, you don't want to poll for that, because accessing
> device registers is expensive.  An interrupt is the best approach for rare
> events like link changed.

Yea, but you probably can get by with a timer for that, even if it's ugly.

> >>Also, I'm root.  I can do anything I like, including loading a patched
> >>pci_uio_generic.  You're not providing _any_ security, you're simply making
> >>life harder for users.
> >Maybe that's true on your system. But I guess you know that's not true
> >for everyone, not in 2015.
> 
> Why is it not true?  if I'm root, I can do anything I like to my
> system, and everyone is root in 2015.  I can access the BARs directly
> and program DMA, how am I more secure by uio not allowing me to setup
> msix?

That's not the point.  The point always was that using uio for these
devices (capable of DMA, in particular of msix) isn't possible in a
secure way. And yes, if the same device happens to also do interrupts, UIO
does not reject it as it probably should, and we can't change this
without breaking some working setups.  But this doesn't mean we should
add more setups like this that we'll then be forced to maintain.


> Non-root users are already secured by their inability to load the module,
> and by the device permissions.
> 
> >
> >>>So it makes even less sense to add insecure work-arounds in the kernel.
> >>>It seems quite likely that by the time the new kernel reaches
> >>>production X years from now, EC2 will have a virtual iommu.
> >>I can adopt a new kernel tomorrow.  I have no influence on EC2.
> >>
> >>
> >Xen grant tables sound like they could be the right interface
> >for EC2.  google search for "grant tables iommu" immediately gives me:
> >http://lists.xenproject.org/archives/html/xen-devel/2014-04/msg00963.html
> >Maybe latest Xen is already doing the right thing, and it's just the
> >question of making VFIO use that.
> >
> 
> grant tables only work for virtual devices, not physical devices.

Why not? That's what the patches above seem to do.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Avi Kivity
On 09/30/2015 05:39 PM, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 04:05:40PM +0300, Avi Kivity wrote:
>>
>> On 09/30/2015 03:27 PM, Michael S. Tsirkin wrote:
>>> On Wed, Sep 30, 2015 at 03:16:04PM +0300, Vlad Zolotarov wrote:
 On 09/30/15 15:03, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote:
>> On 09/30/15 14:41, Michael S. Tsirkin wrote:
>>> On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
 The whole idea is to bypass kernel. Especially for networking...
>>> ... on dumb hardware that doesn't support doing that securely.
>> On a very capable HW that supports whatever security requirements needed
>> (e.g. 82599 Intel's SR-IOV VF devices).
> Network card type is irrelevant as long as you do not have an IOMMU,
> otherwise you would just use e.g. VFIO.
 Sorry, but I don't follow your logic here - Amazon EC2 environment is a
 example where there *is* iommu but it's not virtualized
 and thus VFIO is
 useless and there is an option to use directly assigned SR-IOV networking
 device there where using the kernel drivers impose a performance impact
 compared to user space UIO-based user space kernel bypass mode of usage. 
 How
 is it irrelevant? Could u, pls, clarify your point?

>>> So it's not even dumb hardware, it's another piece of software
>>> that forces an "all or nothing" approach where either
>>> device has access to all VM memory, or none.
>>> And this, unfortunately, leaves you with no secure way to
>>> allow userspace drivers.
>> Some setups don't need security (they are single-user, single application).
>> But do need a lot of performance (like 5X-10X performance).  An example is
>> OpenVSwitch, security doesn't help it at all and if you force it to use the
>> kernel drivers you cripple it.
> We'd have to see there are actual users that need this.  So far, dpdk
> seems like the only one,

dpdk is a whole class of users.  It's not a specific application.

>   and it wants to use UIO for slow path stuff
> like polling link status.  Why this needs kernel bypass support, I don't
> know.  I asked, and got no answer.

First, it's more than link status.  dpdk also has an interrupt mode, 
which applications can fall back to when the load is light in order 
to save power (and in order not to get support calls about 100% cpu when 
idle).

Even for link status, you don't want to poll for that, because accessing 
device registers is expensive.  An interrupt is the best approach for 
rare events like link changed.
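
Concretely, once the interrupt is forwarded through UIO the slow path can
just sleep on the fd instead of touching device registers. A sketch only:
/dev/uio0 is an assumed node, the re-arm write applies only if the driver
masks the IRQ in its handler and implements irqcontrol, and DPDK itself
wraps this in its own interrupt/epoll layer:

#include <fcntl.h>
#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/uio0", O_RDWR);	/* assumed device node */
	uint32_t count, enable = 1;

	for (;;) {
		struct pollfd pfd = { .fd = fd, .events = POLLIN };

		/* Sleep until the (rare) event fires; no register polling. */
		if (poll(&pfd, 1, -1) <= 0)
			break;
		read(fd, &count, sizeof(count));	/* total IRQ count */
		printf("link event, irq count=%u\n", count);

		/* Re-arm via the UIO irqcontrol write (4-byte 1/0), if the
		 * driver masked the interrupt in its handler. */
		write(fd, &enable, sizeof(enable));
	}
	close(fd);
	return 0;
}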

>
>> Also, I'm root.  I can do anything I like, including loading a patched
>> pci_uio_generic.  You're not providing _any_ security, you're simply making
>> life harder for users.
> Maybe that's true on your system. But I guess you know that's not true
> for everyone, not in 2015.

Why is it not true?  If I'm root, I can do anything I like to my system, 
and everyone is root in 2015.  I can access the BARs directly and 
program DMA, how am I more secure by uio not allowing me to setup msix?
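
To make that point concrete, mapping a BAR as root takes a handful of
lines and no special driver at all. A sketch; the BDF and the 16K size
are taken from the lspci output earlier in this thread and are only an
example:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* BAR0 of the assumed 82599 VF at 0000:00:04.0; adjust the BDF. */
	int fd = open("/sys/bus/pci/devices/0000:00:04.0/resource0", O_RDWR);
	volatile uint32_t *bar;

	if (fd < 0)
		return 1;
	bar = mmap(NULL, 16384, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (bar == MAP_FAILED)
		return 1;

	/* From here on, any register write is possible, including ones
	 * that program the device's DMA engine at arbitrary addresses. */
	printf("reg 0x0 = 0x%08x\n", bar[0]);

	munmap((void *)bar, 16384);
	close(fd);
	return 0;
}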

Non-root users are already secured by their inability to load the 
module, and by the device permissions.

>
>>> So it makes even less sense to add insecure work-arounds in the kernel.
>>> It seems quite likely that by the time the new kernel reaches
>>> production X years from now, EC2 will have a virtual iommu.
>> I can adopt a new kernel tomorrow.  I have no influence on EC2.
>>
>>
> Xen grant tables sound like they could be the right interface
> for EC2.  google search for "grant tables iommu" immediately gives me:
> http://lists.xenproject.org/archives/html/xen-devel/2014-04/msg00963.html
> Maybe latest Xen is already doing the right thing, and it's just the
> question of making VFIO use that.
>

grant tables only work for virtual devices, not physical devices.




[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 04:05:40PM +0300, Avi Kivity wrote:
> 
> 
> On 09/30/2015 03:27 PM, Michael S. Tsirkin wrote:
> >On Wed, Sep 30, 2015 at 03:16:04PM +0300, Vlad Zolotarov wrote:
> >>
> >>On 09/30/15 15:03, Michael S. Tsirkin wrote:
> >>>On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote:
> On 09/30/15 14:41, Michael S. Tsirkin wrote:
> >On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
> >>The whole idea is to bypass kernel. Especially for networking...
> >... on dumb hardware that doesn't support doing that securely.
> On a very capable HW that supports whatever security requirements needed
> (e.g. 82599 Intel's SR-IOV VF devices).
> >>>Network card type is irrelevant as long as you do not have an IOMMU,
> >>>otherwise you would just use e.g. VFIO.
> >>Sorry, but I don't follow your logic here - Amazon EC2 environment is a
> >>example where there *is* iommu but it's not virtualized
> >>and thus VFIO is
> >>useless and there is an option to use directly assigned SR-IOV networking
> >>device there where using the kernel drivers impose a performance impact
> >>compared to user space UIO-based user space kernel bypass mode of usage. How
> >>is it irrelevant? Could u, pls, clarify your point?
> >>
> >So it's not even dumb hardware, it's another piece of software
> >that forces an "all or nothing" approach where either
> >device has access to all VM memory, or none.
> >And this, unfortunately, leaves you with no secure way to
> >allow userspace drivers.
> 
> Some setups don't need security (they are single-user, single application).
> But do need a lot of performance (like 5X-10X performance).  An example is
> OpenVSwitch, security doesn't help it at all and if you force it to use the
> kernel drivers you cripple it.

We'd have to see there are actual users that need this.  So far, dpdk
seems like the only one, and it wants to use UIO for slow path stuff
like polling link status.  Why this needs kernel bypass support, I don't
know.  I asked, and got no answer.

> 
> Also, I'm root.  I can do anything I like, including loading a patched
> pci_uio_generic.  You're not providing _any_ security, you're simply making
> life harder for users.

Maybe that's true on your system. But I guess you know that's not true
for everyone, not in 2015.

> >So it makes even less sense to add insecure work-arounds in the kernel.
> >It seems quite likely that by the time the new kernel reaches
> >production X years from now, EC2 will have a virtual iommu.
> 
> I can adopt a new kernel tomorrow.  I have no influence on EC2.
> 
>

Xen grant tables sound like they could be the right interface
for EC2.  google search for "grant tables iommu" immediately gives me:
http://lists.xenproject.org/archives/html/xen-devel/2014-04/msg00963.html
Maybe latest Xen is already doing the right thing, and it's just the
question of making VFIO use that.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Avi Kivity


On 09/30/2015 03:27 PM, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 03:16:04PM +0300, Vlad Zolotarov wrote:
>>
>> On 09/30/15 15:03, Michael S. Tsirkin wrote:
>>> On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote:
 On 09/30/15 14:41, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
>> The whole idea is to bypass kernel. Especially for networking...
> ... on dumb hardware that doesn't support doing that securely.
 On a very capable HW that supports whatever security requirements needed
 (e.g. 82599 Intel's SR-IOV VF devices).
>>> Network card type is irrelevant as long as you do not have an IOMMU,
>>> otherwise you would just use e.g. VFIO.
>> Sorry, but I don't follow your logic here - Amazon EC2 environment is a
>> example where there *is* iommu but it's not virtualized
>> and thus VFIO is
>> useless and there is an option to use directly assigned SR-IOV networking
>> device there where using the kernel drivers impose a performance impact
>> compared to user space UIO-based user space kernel bypass mode of usage. How
>> is it irrelevant? Could u, pls, clarify your point?
>>
> So it's not even dumb hardware, it's another piece of software
> that forces an "all or nothing" approach where either
> device has access to all VM memory, or none.
> And this, unfortunately, leaves you with no secure way to
> allow userspace drivers.

Some setups don't need security (they are single-user, single 
application), but they do need a lot of performance (like 5X-10X).  
An example is OpenVSwitch: security doesn't help it at 
all, and if you force it to use the kernel drivers you cripple it.

Also, I'm root.  I can do anything I like, including loading a patched 
pci_uio_generic.  You're not providing _any_ security, you're simply 
making life harder for users.

> So it makes even less sense to add insecure work-arounds in the kernel.
> It seems quite likely that by the time the new kernel reaches
> production X years from now, EC2 will have a virtual iommu.

I can adopt a new kernel tomorrow.  I have no influence on EC2.





[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Vlad Zolotarov


On 09/30/15 15:27, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 03:16:04PM +0300, Vlad Zolotarov wrote:
>>
>> On 09/30/15 15:03, Michael S. Tsirkin wrote:
>>> On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote:
 On 09/30/15 14:41, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
>>   The whole idea is to bypass kernel. Especially for networking...
> ... on dumb hardware that doesn't support doing that securely.
 On a very capable HW that supports whatever security requirements needed
 (e.g. 82599 Intel's SR-IOV VF devices).
>>> Network card type is irrelevant as long as you do not have an IOMMU,
>>> otherwise you would just use e.g. VFIO.
>> Sorry, but I don't follow your logic here - Amazon EC2 environment is a
>> example where there *is* iommu but it's not virtualized
>> and thus VFIO is
>> useless and there is an option to use directly assigned SR-IOV networking
>> device there where using the kernel drivers impose a performance impact
>> compared to user space UIO-based user space kernel bypass mode of usage. How
>> is it irrelevant? Could u, pls, clarify your point?
>>
> So it's not even dumb hardware, it's another piece of software
> that forces an "all or nothing" approach where either
> device has access to all VM memory, or none.
> And this, unfortunately, leaves you with no secure way to
> allow userspace drivers.
UIO is not secure even today, so what are we arguing about? ;)
Adding MSI/MSI-X support won't change this state, so please discard the 
security argument unless you think that UIO is a completely secure piece 
of software today. In the latter case, could you please clarify what would 
prevent a userspace program from configuring a DMA controller via its 
registers and doing whatever it wants?


How does not virtualizing the iommu force an "all or nothing" approach? 
What is insecure in relying on the HV to control the iommu and not letting 
the VF any access to it?
As far as I see it, there isn't any security problem here at all. The 
only problem I see is that the dumb current uio_pci_generic 
implementation forces people to go and invent workarounds instead of 
having proper MSI/MSI-X support implemented. And as I've mentioned 
above, it has nothing to do with security, because there is no such thing 
as security at the UIO driver level when we talk about UIO - it has to 
be ensured by some other entity like the HV.

>
> So it makes even less sense to add insecure work-arounds in the kernel.
> It seems quite likely that by the time the new kernel reaches
> production X years from now, EC2 will have a virtual iommu.

I'd bet that a new kernel would reach production long before Amazon does 
that... ;)

>
>
> Colour me unimpressed.
>



[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 03:16:04PM +0300, Vlad Zolotarov wrote:
> 
> 
> On 09/30/15 15:03, Michael S. Tsirkin wrote:
> >On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote:
> >>
> >>On 09/30/15 14:41, Michael S. Tsirkin wrote:
> >>>On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
> The whole idea is to bypass kernel. Especially for networking...
> >>>... on dumb hardware that doesn't support doing that securely.
> >>On a very capable HW that supports whatever security requirements needed
> >>(e.g. 82599 Intel's SR-IOV VF devices).
> >Network card type is irrelevant as long as you do not have an IOMMU,
> >otherwise you would just use e.g. VFIO.
> 
> Sorry, but I don't follow your logic here - Amazon EC2 environment is a
> example where there *is* iommu but it's not virtualized
> and thus VFIO is
> useless and there is an option to use directly assigned SR-IOV networking
> device there where using the kernel drivers impose a performance impact
> compared to user space UIO-based user space kernel bypass mode of usage. How
> is it irrelevant? Could u, pls, clarify your point?
> 

So it's not even dumb hardware, it's another piece of software
that forces an "all or nothing" approach where either
device has access to all VM memory, or none.
And this, unfortunately, leaves you with no secure way to
allow userspace drivers.

So it makes even less sense to add insecure work-arounds in the kernel.
It seems quite likely that by the time the new kernel reaches
production X years from now, EC2 will have a virtual iommu.


> >
> >>>Colour me unimpressed.
> >>>


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Vlad Zolotarov


On 09/30/15 15:03, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote:
>>
>> On 09/30/15 14:41, Michael S. Tsirkin wrote:
>>> On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
 The whole idea is to bypass kernel. Especially for networking...
>>> ... on dumb hardware that doesn't support doing that securely.
>> On a very capable HW that supports whatever security requirements needed
>> (e.g. 82599 Intel's SR-IOV VF devices).
> Network card type is irrelevant as long as you do not have an IOMMU,
> otherwise you would just use e.g. VFIO.

Sorry, but I don't follow your logic here - the Amazon EC2 environment is 
an example where there *is* an iommu, but it's not virtualized and thus 
VFIO is useless; there is also the option to use a directly assigned 
SR-IOV networking device there, where using the kernel drivers imposes a 
performance impact compared to user-space, UIO-based kernel-bypass usage. 
How is it irrelevant? Could you please clarify your point?

>
>>> Colour me unimpressed.
>>>



[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 02:53:19PM +0300, Vlad Zolotarov wrote:
> 
> 
> On 09/30/15 14:41, Michael S. Tsirkin wrote:
> >On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
> >>The whole idea is to bypass kernel. Especially for networking...
> >... on dumb hardware that doesn't support doing that securely.
> 
> On a very capable HW that supports whatever security requirements needed
> (e.g. 82599 Intel's SR-IOV VF devices).

Network card type is irrelevant as long as you do not have an IOMMU,
otherwise you would just use e.g. VFIO.

> >Colour me unimpressed.
> >


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Vlad Zolotarov


On 09/30/15 14:41, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
>> The whole idea is to bypass kernel. Especially for networking...
> ... on dumb hardware that doesn't support doing that securely.

On very capable HW that supports whatever security requirements are needed 
(e.g. Intel's 82599 SR-IOV VF devices).

> Colour me unimpressed.
>



[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 02:26:01PM +0300, Vlad Zolotarov wrote:
> The whole idea is to bypass kernel. Especially for networking...

... on dumb hardware that doesn't support doing that securely.
Colour me unimpressed.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Stephen Hemminger
On Wed, 30 Sep 2015 23:09:33 +0300
Vlad Zolotarov  wrote:

> 
> 
> On 09/30/15 22:39, Michael S. Tsirkin wrote:
> > On Wed, Sep 30, 2015 at 10:06:52PM +0300, Vlad Zolotarov wrote:
>  How would iommu
>  virtualization change anything?
> >>> Kernel can use an iommu to limit device access to memory of
> >>> the controlling application.
> >> Ok, this is obvious but what it has to do with enabling using MSI/MSI-X
> >> interrupts support in uio_pci_generic? kernel may continue to limit the
> >> above access with this support as well.
> > It could maybe. So if you write a patch to allow MSI by at the same time
> > creating an isolated IOMMU group and blocking DMA from device in
> > question anywhere, that sounds reasonable.
> 
> No, I'm only planning to add MSI and MSI-X interrupts support for 
> uio_pci_generic device.
> The rest mentioned above should naturally be a matter of a different 
> patch and writing it is orthogonal to the patch I'm working on as has 
> been extensively discussed in this thread.
> 
> >
> 

I have a generic MSI and MSI-X driver (posted earlier on this list).
About to post to upstream kernel.
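
For those who haven't seen that posting, the general shape of such a
driver is roughly the following. This is a sketch only, not the actual
patch: enable one MSI-X vector on the device and forward it to userspace
through the UIO event mechanism. The 8086:10ed ID is the 82599 VF
discussed in this thread and is only an example:

#include <linux/interrupt.h>
#include <linux/module.h>
#include <linux/pci.h>
#include <linux/uio_driver.h>

/* Sketch: a single static uio_info limits this to one device. */
static struct msix_entry sketch_msix = { .entry = 0 };

static struct uio_info sketch_info = {
	.name = "uio_msix_sketch",
	.version = "0.1",
	.irq = UIO_IRQ_CUSTOM,	/* we call uio_event_notify() ourselves */
};

static irqreturn_t sketch_handler(int irq, void *arg)
{
	uio_event_notify(arg);	/* wake up readers of /dev/uioX */
	return IRQ_HANDLED;
}

static int sketch_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	int err;

	err = pci_enable_device(pdev);
	if (err)
		return err;
	pci_set_master(pdev);	/* MSI-X messages are DMA writes */

	err = pci_enable_msix_range(pdev, &sketch_msix, 1, 1);
	if (err < 0)
		goto out_disable;

	err = request_irq(sketch_msix.vector, sketch_handler, 0,
			  sketch_info.name, &sketch_info);
	if (err)
		goto out_msix;

	err = uio_register_device(&pdev->dev, &sketch_info);
	if (err)
		goto out_irq;
	return 0;

out_irq:
	free_irq(sketch_msix.vector, &sketch_info);
out_msix:
	pci_disable_msix(pdev);
out_disable:
	pci_clear_master(pdev);
	pci_disable_device(pdev);
	return err;
}

static void sketch_remove(struct pci_dev *pdev)
{
	uio_unregister_device(&sketch_info);
	free_irq(sketch_msix.vector, &sketch_info);
	pci_disable_msix(pdev);
	pci_clear_master(pdev);
	pci_disable_device(pdev);
}

static const struct pci_device_id sketch_ids[] = {
	{ PCI_DEVICE(0x8086, 0x10ed) },	/* 82599 VF - example ID only */
	{ }
};
MODULE_DEVICE_TABLE(pci, sketch_ids);

static struct pci_driver sketch_driver = {
	.name = "uio_msix_sketch",
	.id_table = sketch_ids,
	.probe = sketch_probe,
	.remove = sketch_remove,
};
module_pci_driver(sketch_driver);
MODULE_LICENSE("GPL");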


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Vlad Zolotarov


On 09/30/15 13:58, Michael S. Tsirkin wrote:
> On Wed, Sep 30, 2015 at 01:37:22PM +0300, Vlad Zolotarov wrote:
>>
>> On 09/30/15 00:49, Michael S. Tsirkin wrote:
>>> On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote:
 On Tue, 29 Sep 2015 23:54:54 +0300
 "Michael S. Tsirkin"  wrote:

> On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
>> The security breach motivation u brought in "[RFC PATCH] uio:
>> uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak
>> since one u let the userland access to the bar it may do any funny thing
>> using the DMA engine of the device. This kind of stuff should be 
>> prevented
>> using the iommu and if it's enabled then any funny tricks using MSI/MSI-X
>> configuration will be prevented too.
>>
>> I'm about to send the patch to main Linux mailing list. Let's continue 
>> this
>> discussion there.
> Basically UIO shouldn't be used with devices capable of DMA.
> Use VFIO for that (yes, this implies an emulated or PV IOMMU).
>> If there is an IOMMU in the picture there shouldn't be any problem to use
>> UIO with DMA capable devices.
> UIO doesn't enforce the IOMMU though. That's why it's not a good fit.

Having said all that - does UIO refuse to work with DMA-capable devices 
today? Either I have missed that logic or it's not there.
So everything you are so worried about may already be done today. That's 
why I don't understand why adding support for MSI/MSI-X interrupts
would change anything here. You are right that UIO *today* has a security 
hole, however it should be addressed separately, and the same solution
that covers the security breach in the current code will cover 
the "MSI/MSI-X security vulnerability", because they are actually exactly 
the same issue.

>
> I don't think this can change.
 Given there is no PV IOMMU and even if there was it would be too slow for 
 DPDK
 use, I can't accept that.
>>> QEMU does allow emulating an iommu.
>> Amazon's EC2 xen HV doesn't. At least today. Therefore VFIO is not an option
>> there.
> Not only that, a bunch of boxes have their IOMMU disabled.
> So for such systems, you can't have userspace poking at
> device registers. You need a kernel driver to validate
> userspace requests before passing them on to devices.

I think you are describing HV functionality here. ;) And yes, you are 
absolutely right, the HV has to control the non-privileged userland.
For HV/non-virtualized boxes a possible solution could be to allow UIO 
only for some privileged group of processes.

>
>> And again, it's a general issue not DPDK specific.
>> Today one has to develop some proprietary modules (like igb_uio) to
>> workaround the issue and this is lame.
> Of course it is lame. So don't bypass the kernel then, use the upstream 
> drivers.

This would impose a heavy performance penalty. The whole idea is to 
bypass the kernel. Especially for networking...

>
>> IMHO uio_pci_generic should
>> be fixed to be able to properly work within any virtualized environment and
>> not only with KVM.
> The motivation for UIO is pretty clear:
>
>  For many types of devices, creating a Linux kernel driver is
>  overkill.  All that is really needed is some way to handle an
>  interrupt and provide access to the memory space of the
>  device.  The logic of controlling the device does not
>  necessarily have to be within the kernel, as the device does
>  not need to take advantage of any of other resources that the
>  kernel provides.  One such common class of devices that are
>  like this are for industrial I/O cards.
>
> Devices doing DMA do need to take advantage of memory protection
> that the kernel provides.
Well, yeah - but who said I have to be forbidden to work with the device 
if MSI-X interrupts are my only option?

The kernel may provide protection by checking the 
process permissions and denying UIO access to non-privileged processes.
I'm not sure that's the case today, and if it's not then, as 
mentioned above, it should rather be fixed ASAP exactly due to the reasons 
you bring up
here. And once that's done there shouldn't be any limitation on allowing 
MSI or MSI-X interrupts along with INT#x.

>
>>>   DPDK uses static mappings, so I
>>> doubt it's speed matters at all.
>>>
>>> Anyway, DPDK is doing polling all the time. I don't see why does it
>>> insist on using interrupts to detect link up events. Just poll for that
>>> too.
>>>



[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Michael S. Tsirkin
On Wed, Sep 30, 2015 at 01:37:22PM +0300, Vlad Zolotarov wrote:
> 
> 
> On 09/30/15 00:49, Michael S. Tsirkin wrote:
> >On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote:
> >>On Tue, 29 Sep 2015 23:54:54 +0300
> >>"Michael S. Tsirkin"  wrote:
> >>
> >>>On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
> The security breach motivation u brought in "[RFC PATCH] uio:
> uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak
> since one u let the userland access to the bar it may do any funny thing
> using the DMA engine of the device. This kind of stuff should be prevented
> using the iommu and if it's enabled then any funny tricks using MSI/MSI-X
> configuration will be prevented too.
> 
> I'm about to send the patch to main Linux mailing list. Let's continue 
> this
> discussion there.
> >>>Basically UIO shouldn't be used with devices capable of DMA.
> >>>Use VFIO for that (yes, this implies an emulated or PV IOMMU).
> 
> If there is an IOMMU in the picture there shouldn't be any problem to use
> UIO with DMA capable devices.

UIO doesn't enforce the IOMMU though. That's why it's not a good fit.

> >>>I don't think this can change.
> >>Given there is no PV IOMMU and even if there was it would be too slow for 
> >>DPDK
> >>use, I can't accept that.
> >QEMU does allow emulating an iommu.
> 
> Amazon's EC2 xen HV doesn't. At least today. Therefore VFIO is not an option
> there.

Not only that, a bunch of boxes have their IOMMU disabled.
So for such systems, you can't have userspace poking at
device registers. You need a kernel driver to validate
userspace requests before passing them on to devices.

> And again, it's a general issue not DPDK specific.
> Today one has to develop some proprietary modules (like igb_uio) to
> workaround the issue and this is lame.

Of course it is lame. So don't bypass the kernel then, use the upstream drivers.

> IMHO uio_pci_generic should
> be fixed to be able to properly work within any virtualized environment and
> not only with KVM.

The motivation for UIO is pretty clear:

For many types of devices, creating a Linux kernel driver is
overkill.  All that is really needed is some way to handle an
interrupt and provide access to the memory space of the
device.  The logic of controlling the device does not
necessarily have to be within the kernel, as the device does
not need to take advantage of any of other resources that the
kernel provides.  One such common class of devices that are
like this are for industrial I/O cards.

Devices doing DMA do need to take advantage of memory protection
that the kernel provides.
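
For reference, the userspace side of that quoted model is tiny. A sketch,
assuming a UIO driver that registers its device memory as map 0; note that
uio_pci_generic itself registers no maps, so with it the BARs are usually
mapped through the sysfs resource files instead:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/dev/uio0", O_RDWR);	/* assumed node */
	long pagesz = sysconf(_SC_PAGESIZE);
	volatile uint32_t *regs;
	uint32_t count;

	/* UIO convention: map N is at mmap offset N * page size;
	 * one page is enough for this sketch. */
	regs = mmap(NULL, pagesz, PROT_READ | PROT_WRITE, MAP_SHARED,
		    fd, 0 * pagesz);
	if (regs == MAP_FAILED)
		return 1;

	/* Block until the next interrupt; the 4 bytes read back are the
	 * running interrupt event count. */
	read(fd, &count, sizeof(count));
	printf("irq %u, reg0=0x%08x\n", count, regs[0]);
	return 0;
}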

> 
> >  DPDK uses static mappings, so I
> >doubt it's speed matters at all.
> >
> >Anyway, DPDK is doing polling all the time. I don't see why does it
> >insist on using interrupts to detect link up events. Just poll for that
> >too.
> >


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Vlad Zolotarov


On 09/30/15 00:49, Michael S. Tsirkin wrote:
> On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote:
>> On Tue, 29 Sep 2015 23:54:54 +0300
>> "Michael S. Tsirkin"  wrote:
>>
>>> On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
 The security breach motivation u brought in "[RFC PATCH] uio:
 uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak
 since one u let the userland access to the bar it may do any funny thing
 using the DMA engine of the device. This kind of stuff should be prevented
 using the iommu and if it's enabled then any funny tricks using MSI/MSI-X
 configuration will be prevented too.

 I'm about to send the patch to main Linux mailing list. Let's continue this
 discussion there.

>>> Basically UIO shouldn't be used with devices capable of DMA.
>>> Use VFIO for that (yes, this implies an emulated or PV IOMMU).

If there is an IOMMU in the picture there shouldn't be any problem using 
UIO with DMA-capable devices.

>>> I don't think this can change.
>> Given there is no PV IOMMU and even if there was it would be too slow for 
>> DPDK
>> use, I can't accept that.
> QEMU does allow emulating an iommu.

Amazon's EC2 xen HV doesn't. At least today. Therefore VFIO is not an 
option there. And again, it's a general issue, not DPDK-specific.
Today one has to develop some proprietary modules (like igb_uio) to 
work around the issue, and this is lame. IMHO uio_pci_generic should
be fixed to be able to work properly within any virtualized environment 
and not only with KVM.



>   DPDK uses static mappings, so I
> doubt it's speed matters at all.
>
> Anyway, DPDK is doing polling all the time. I don't see why does it
> insist on using interrupts to detect link up events. Just poll for that
> too.
>



[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Stephen Hemminger
On Wed, 30 Sep 2015 20:39:43 +0300
"Michael S. Tsirkin"  wrote:

> On Wed, Sep 30, 2015 at 10:28:07AM -0700, Stephen Hemminger wrote:
> > On Wed, 30 Sep 2015 13:37:22 +0300
> > Vlad Zolotarov  wrote:
> > 
> > > 
> > > 
> > > On 09/30/15 00:49, Michael S. Tsirkin wrote:
> > > > On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote:
> > > >> On Tue, 29 Sep 2015 23:54:54 +0300
> > > >> "Michael S. Tsirkin"  wrote:
> > > >>
> > > >>> On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
> > >  The security breach motivation u brought in "[RFC PATCH] uio:
> > >  uio_pci_generic: Add support for MSI interrupts" thread seems a bit 
> > >  weak
> > >  since one u let the userland access to the bar it may do any funny 
> > >  thing
> > >  using the DMA engine of the device. This kind of stuff should be 
> > >  prevented
> > >  using the iommu and if it's enabled then any funny tricks using 
> > >  MSI/MSI-X
> > >  configuration will be prevented too.
> > > 
> > >  I'm about to send the patch to main Linux mailing list. Let's 
> > >  continue this
> > >  discussion there.
> > > 
> > > >>> Basically UIO shouldn't be used with devices capable of DMA.
> > > >>> Use VFIO for that (yes, this implies an emulated or PV IOMMU).
> > > 
> > > If there is an IOMMU in the picture there shouldn't be any problem to 
> > > use UIO with DMA capable devices.
> > > 
> > > >>> I don't think this can change.
> > > >> Given there is no PV IOMMU and even if there was it would be too slow 
> > > >> for DPDK
> > > >> use, I can't accept that.
> > > > QEMU does allow emulating an iommu.
> > > 
> > > Amazon's EC2 xen HV doesn't. At least today. Therefore VFIO is not an 
> > > option there. And again, it's a general issue not DPDK specific.
> > > Today one has to develop some proprietary modules (like igb_uio) to 
> > > workaround the issue and this is lame. IMHO uio_pci_generic should
> > > be fixed to be able to properly work within any virtualized environment 
> > > and not only with KVM.
> > > 
> > 
> > Also VMware (bigger problem) has no IOMMU emulation.
> > Other environments as well (Windriver, GCE) have no IOMMU.
> 
> Because the use-case of userspace drivers is not important enough?
> Without an IOMMU, there's no way to have secure userspace drivers.

Look at Cloudius; there is no necessity for security in the guest.


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Stephen Hemminger
On Wed, 30 Sep 2015 13:37:22 +0300
Vlad Zolotarov  wrote:

> 
> 
> On 09/30/15 00:49, Michael S. Tsirkin wrote:
> > On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote:
> >> On Tue, 29 Sep 2015 23:54:54 +0300
> >> "Michael S. Tsirkin"  wrote:
> >>
> >>> On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
>  The security breach motivation u brought in "[RFC PATCH] uio:
>  uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak
>  since one u let the userland access to the bar it may do any funny thing
>  using the DMA engine of the device. This kind of stuff should be 
>  prevented
>  using the iommu and if it's enabled then any funny tricks using MSI/MSI-X
>  configuration will be prevented too.
> 
>  I'm about to send the patch to main Linux mailing list. Let's continue 
>  this
>  discussion there.
> 
> >>> Basically UIO shouldn't be used with devices capable of DMA.
> >>> Use VFIO for that (yes, this implies an emulated or PV IOMMU).
> 
> If there is an IOMMU in the picture there shouldn't be any problem to 
> use UIO with DMA capable devices.
> 
> >>> I don't think this can change.
> >> Given there is no PV IOMMU and even if there was it would be too slow for 
> >> DPDK
> >> use, I can't accept that.
> > QEMU does allow emulating an iommu.
> 
> Amazon's EC2 xen HV doesn't. At least today. Therefore VFIO is not an 
> option there. And again, it's a general issue not DPDK specific.
> Today one has to develop some proprietary modules (like igb_uio) to 
> workaround the issue and this is lame. IMHO uio_pci_generic should
> be fixed to be able to properly work within any virtualized environment 
> and not only with KVM.
> 

Also VMware (bigger problem) has no IOMMU emulation.
Other environments as well (Windriver, GCE) have no IOMMU.


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Michael S. Tsirkin
On Tue, Sep 29, 2015 at 02:46:16PM -0700, Stephen Hemminger wrote:
> On Tue, 29 Sep 2015 23:54:54 +0300
> "Michael S. Tsirkin"  wrote:
> 
> > On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
> > > The security breach motivation u brought in "[RFC PATCH] uio:
> > > uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak
> > > since one u let the userland access to the bar it may do any funny thing
> > > using the DMA engine of the device. This kind of stuff should be prevented
> > > using the iommu and if it's enabled then any funny tricks using MSI/MSI-X
> > > configuration will be prevented too.
> > > 
> > > I'm about to send the patch to main Linux mailing list. Let's continue 
> > > this
> > > discussion there.
> > >   
> > 
> > Basically UIO shouldn't be used with devices capable of DMA.
> > Use VFIO for that (yes, this implies an emulated or PV IOMMU).
> > I don't think this can change.
> 
> Given there is no PV IOMMU and even if there was it would be too slow for DPDK
> use, I can't accept that. 

QEMU does allow emulating an iommu.  DPDK uses static mappings, so I
doubt its speed matters at all.

Anyway, DPDK is doing polling all the time. I don't see why it
insists on using interrupts to detect link up events. Just poll for that
too.

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-30 Thread Michael S. Tsirkin
On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
> The security breach motivation u brought in "[RFC PATCH] uio:
> uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak
> since one u let the userland access to the bar it may do any funny thing
> using the DMA engine of the device. This kind of stuff should be prevented
> using the iommu and if it's enabled then any funny tricks using MSI/MSI-X
> configuration will be prevented too.
> 
> I'm about to send the patch to main Linux mailing list. Let's continue this
> discussion there.
> 

Basically UIO shouldn't be used with devices capable of DMA.
Use VFIO for that (yes, this implies an emulated or PV IOMMU).
I don't think this can change.

> >
> >I think that DPDK should be fixed to not require uio_pci_generic
> >for VF devices (or any devices without INT#x).
> >
> >If DPDK requires a place-holder driver, the pci-stub driver should
> >do this adequately. See ./drivers/pci/pci-stub.c
> >


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-29 Thread Vlad Zolotarov


On 09/27/15 12:43, Michael S. Tsirkin wrote:
> On Sun, Sep 27, 2015 at 10:05:11AM +0300, Vlad Zolotarov wrote:
>> Hi,
>> I was trying to use uio_pci_generic with Intel's 10G SR-IOV devices on
>> Amazon EC2 instances with Enhanced Networking enabled.
>> The idea is to create a DPDK environment that doesn't require compiling
>> kernel modules (igb_uio).
>> However I was surprised to discover that uio_pci_generic refuses to work
>> with EN device on AWS:
>>
>> $ lspci
>> 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
>> 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
>> 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
>> 00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
>> 00:02.0 VGA compatible controller: Cirrus Logic GD 5446
>> 00:03.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller 
>> Virtual Function (rev 01)
>> 00:04.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller 
>> Virtual Function (rev 01)
>> 00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
>>
>> $ sudo ./dpdk/tools/dpdk_nic_bind.py -b uio_pci_generic 00:04.0
>> Error: bind failed for :00:04.0 - Cannot bind to driver uio_pci_generic
>> $dmesg
>>
>> --> snip <---
>> [  816.655575] uio_pci_generic :00:04.0: No IRQ assigned to device: no 
>> support for interrupts?
>>
>> $ sudo lspci -s 00:04.0 -vvv
>> 00:04.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller 
>> Virtual Function (rev 01)
>>  Physical Slot: 4
>>  Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- 
>> Stepping- SERR- FastB2B- DisINTx-
>>  Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>>  Region 0: Memory at f3008000 (64-bit, prefetchable) [size=16K]
>>  Region 3: Memory at f300c000 (64-bit, prefetchable) [size=16K]
>>  Capabilities: [70] MSI-X: Enable- Count=3 Masked-
>>  Vector table: BAR=3 offset=
>>  PBA: BAR=3 offset=2000
>>  Kernel modules: ixgbevf
>>
>> So, as we may see the PCI device doesn't have an INTX interrupt line
>> assigned indeed. It has an MSI-X capability however.
>> Looking at the uio_pci_generic code it seems to require the INTX:
>>
>> uio_pci_generic.c: line 74: probe():
>>
>>  if (!pdev->irq) {
>>  dev_warn(&pdev->dev, "No IRQ assigned to device: "
>>   "no support for interrupts?\n");
>>  pci_disable_device(pdev);
>>  return -ENODEV;
>>  }
>>
>> Is it a known limitation? Michael, could u, pls., comment on this?
>>
>> thanks,
>> vlad

Michael, I took a look at the pci_stub driver and the reason why DPDK 
uses uio in the first place, and I have some comments below.

> This is expected. uio_pci_generic forwards INT#x interrupts from device
> to userspace, but VF devices never assert INT#x.
>
> So it doesn't seem to make any sense to bind uio_pci_generic there.

Well, it's not completely correct to put it this way. The thing is that 
DPDK (and it could be any other framework/developer)
uses uio_pci_generic to actually get interrupts from the device, and it 
makes perfect sense to be able to do so
with SR-IOV devices too. The problem is, as you've described above, 
that the current implementation of uio_pci_generic
won't let them do so, and that seems like bogus behavior to me. There 
is no reason why uio_pci_generic couldn't work
the same way it does today but with MSI-X interrupts - to keep things 
simple, forwarding just the first vector as an initial implementation.

The security breach motivation you brought up in the "[RFC PATCH] uio: 
uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak,
since once you let userland access the BAR it may do any funny thing 
using the DMA engine of the device. This kind of stuff should be prevented
using the iommu, and if it's enabled then any funny tricks using the 
MSI/MSI-X configuration will be prevented too.

I'm about to send the patch to main Linux mailing list. Let's continue 
this discussion there.

>
> I think that DPDK should be fixed to not require uio_pci_generic
> for VF devices (or any devices without INT#x).
>
> If DPDK requires a place-holder driver, the pci-stub driver should
> do this adequately. See ./drivers/pci/pci-stub.c
>



[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-29 Thread Stephen Hemminger
On Tue, 29 Sep 2015 23:54:54 +0300
"Michael S. Tsirkin"  wrote:

> On Tue, Sep 29, 2015 at 07:41:09PM +0300, Vlad Zolotarov wrote:
> > The security breach motivation u brought in "[RFC PATCH] uio:
> > uio_pci_generic: Add support for MSI interrupts" thread seems a bit weak
> > since one u let the userland access to the bar it may do any funny thing
> > using the DMA engine of the device. This kind of stuff should be prevented
> > using the iommu and if it's enabled then any funny tricks using MSI/MSI-X
> > configuration will be prevented too.
> > 
> > I'm about to send the patch to main Linux mailing list. Let's continue this
> > discussion there.
> >   
> 
> Basically UIO shouldn't be used with devices capable of DMA.
> Use VFIO for that (yes, this implies an emulated or PV IOMMU).
> I don't think this can change.

Given there is no PV IOMMU and even if there was it would be too slow for DPDK
use, I can't accept that. 


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-27 Thread Vladislav Zolotarov
On Sep 27, 2015 12:43 PM, "Michael S. Tsirkin"  wrote:
>
> On Sun, Sep 27, 2015 at 10:05:11AM +0300, Vlad Zolotarov wrote:
> > Hi,
> > I was trying to use uio_pci_generic with Intel's 10G SR-IOV devices on
> > Amazon EC2 instances with Enhanced Networking enabled.
> > The idea is to create a DPDK environment that doesn't require compiling
> > kernel modules (igb_uio).
> > However I was surprised to discover that uio_pci_generic refuses to work
> > with EN device on AWS:
> >
> > $ lspci
> > 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
(rev 02)
> > 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton
II]
> > 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE
[Natoma/Triton II]
> > 00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
> > 00:02.0 VGA compatible controller: Cirrus Logic GD 5446
> > 00:03.0 Ethernet controller: Intel Corporation 82599 Ethernet
Controller Virtual Function (rev 01)
> > 00:04.0 Ethernet controller: Intel Corporation 82599 Ethernet
Controller Virtual Function (rev 01)
> > 00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device
(rev 01)
> >
> > $ sudo ./dpdk/tools/dpdk_nic_bind.py -b uio_pci_generic 00:04.0
> > Error: bind failed for :00:04.0 - Cannot bind to driver
uio_pci_generic
>
> > $dmesg
> >
> > --> snip <---
> > [  816.655575] uio_pci_generic :00:04.0: No IRQ assigned to device:
no support for interrupts?
> >
> > $ sudo lspci -s 00:04.0 -vvv
> > 00:04.0 Ethernet controller: Intel Corporation 82599 Ethernet
Controller Virtual Function (rev 01)
> >   Physical Slot: 4
> >   Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx-
> >   Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
> >   Region 0: Memory at f3008000 (64-bit, prefetchable) [size=16K]
> >   Region 3: Memory at f300c000 (64-bit, prefetchable) [size=16K]
> >   Capabilities: [70] MSI-X: Enable- Count=3 Masked-
> >   Vector table: BAR=3 offset=
> >   PBA: BAR=3 offset=2000
> >   Kernel modules: ixgbevf
> >
> > So, as we may see the PCI device doesn't have an INTX interrupt line
> > assigned indeed. It has an MSI-X capability however.
> > Looking at the uio_pci_generic code it seems to require the INTX:
> >
> > uio_pci_generic.c: line 74: probe():
> >
> >   if (!pdev->irq) {
> >   dev_warn(&pdev->dev, "No IRQ assigned to device: "
> >"no support for interrupts?\n");
> >   pci_disable_device(pdev);
> >   return -ENODEV;
> >   }
> >
> > Is it a known limitation? Michael, could u, pls., comment on this?
> >
> > thanks,
> > vlad
>
> This is expected. uio_pci_generic forwards INT#x interrupts from device
> to userspace, but VF devices never assert INT#x.
>
> So it doesn't seem to make any sense to bind uio_pci_generic there.
>
> I think that DPDK should be fixed to not require uio_pci_generic
> for VF devices (or any devices without INT#x).
>
> If DPDK requires a place-holder driver, the pci-stub driver should
> do this adequately. See ./drivers/pci/pci-stub.c

Thanks for the clarification, Michael. I'll take a look.

>
> --
> MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-27 Thread Michael S. Tsirkin
On Sun, Sep 27, 2015 at 10:05:11AM +0300, Vlad Zolotarov wrote:
> Hi,
> I was trying to use uio_pci_generic with Intel's 10G SR-IOV devices on
> Amazon EC2 instances with Enhanced Networking enabled.
> The idea is to create a DPDK environment that doesn't require compiling
> kernel modules (igb_uio).
> However I was surprised to discover that uio_pci_generic refuses to work
> with EN device on AWS:
> 
> $ lspci
> 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
> 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
> 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
> 00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
> 00:02.0 VGA compatible controller: Cirrus Logic GD 5446
> 00:03.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller 
> Virtual Function (rev 01)
> 00:04.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller 
> Virtual Function (rev 01)
> 00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
> 
> $ sudo ./dpdk/tools/dpdk_nic_bind.py -b uio_pci_generic 00:04.0
> Error: bind failed for :00:04.0 - Cannot bind to driver uio_pci_generic

> $dmesg
> 
> --> snip <---
> [  816.655575] uio_pci_generic :00:04.0: No IRQ assigned to device: no 
> support for interrupts?
> 
> $ sudo lspci -s 00:04.0 -vvv
> 00:04.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller 
> Virtual Function (rev 01)
>   Physical Slot: 4
>   Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- 
> Stepping- SERR- FastB2B- DisINTx-
>   Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>   Region 0: Memory at f3008000 (64-bit, prefetchable) [size=16K]
>   Region 3: Memory at f300c000 (64-bit, prefetchable) [size=16K]
>   Capabilities: [70] MSI-X: Enable- Count=3 Masked-
>   Vector table: BAR=3 offset=
>   PBA: BAR=3 offset=2000
>   Kernel modules: ixgbevf
> 
> So, as we may see the PCI device doesn't have an INTX interrupt line
> assigned indeed. It has an MSI-X capability however.
> Looking at the uio_pci_generic code it seems to require the INTX:
> 
> uio_pci_generic.c: line 74: probe():
> 
>   if (!pdev->irq) {
>   dev_warn(&pdev->dev, "No IRQ assigned to device: "
>"no support for interrupts?\n");
>   pci_disable_device(pdev);
>   return -ENODEV;
>   }
> 
> Is it a known limitation? Michael, could u, pls., comment on this?
> 
> thanks,
> vlad

This is expected. uio_pci_generic forwards INT#x interrupts from device
to userspace, but VF devices never assert INT#x.

So it doesn't seem to make any sense to bind uio_pci_generic there.

I think that DPDK should be fixed to not require uio_pci_generic
for VF devices (or any devices without INT#x).

If DPDK requires a place-holder driver, the pci-stub driver should
do this adequately. See ./drivers/pci/pci-stub.c
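
Binding to pci-stub needs nothing beyond two sysfs writes (the usual
"echo" commands). A sketch in C, assuming the 82599 VF ID 8086:10ed and
the 00:04.0 device from the original report, and that the device has
already been unbound from its previous driver (ixgbevf):

#include <stdio.h>

/* Equivalent to:
 *   echo "8086 10ed"   > /sys/bus/pci/drivers/pci-stub/new_id
 *   echo 0000:00:04.0  > /sys/bus/pci/drivers/pci-stub/bind
 */
static int sysfs_write(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	if (sysfs_write("/sys/bus/pci/drivers/pci-stub/new_id", "8086 10ed"))
		return 1;
	/* May be a no-op if new_id already auto-claimed the device. */
	return sysfs_write("/sys/bus/pci/drivers/pci-stub/bind",
			   "0000:00:04.0") ? 1 : 0;
}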

-- 
MST


[dpdk-dev] Having troubles binding an SR-IOV VF to uio_pci_generic on Amazon instance

2015-09-27 Thread Vlad Zolotarov
Hi,
I was trying to use uio_pci_generic with Intel's 10G SR-IOV devices on 
Amazon EC2 instances with Enhanced Networking enabled.
The idea is to create a DPDK environment that doesn't require compiling 
kernel modules (igb_uio).
However I was surprised to discover that uio_pci_generic refuses to work 
with an EN device on AWS:

$ lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller 
Virtual Function (rev 01)
00:04.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller 
Virtual Function (rev 01)
00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)

$ sudo ./dpdk/tools/dpdk_nic_bind.py -b uio_pci_generic 00:04.0
Error: bind failed for :00:04.0 - Cannot bind to driver uio_pci_generic

$dmesg

--> snip <---
[  816.655575] uio_pci_generic :00:04.0: No IRQ assigned to device: no 
support for interrupts?

$ sudo lspci -s 00:04.0 -vvv
00:04.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller 
Virtual Function (rev 01)
Physical Slot: 4
Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- 
Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Region 0: Memory at f3008000 (64-bit, prefetchable) [size=16K]
Region 3: Memory at f300c000 (64-bit, prefetchable) [size=16K]
Capabilities: [70] MSI-X: Enable- Count=3 Masked-
Vector table: BAR=3 offset=
PBA: BAR=3 offset=2000
Kernel modules: ixgbevf

So, as we may see the PCI device doesn't have an INTX interrupt line 
assigned indeed. It has an MSI-X capability however.
Looking at the uio_pci_generic code it seems to require the INTX:

uio_pci_generic.c: line 74: probe():

	if (!pdev->irq) {
		dev_warn(&pdev->dev, "No IRQ assigned to device: "
			 "no support for interrupts?\n");
		pci_disable_device(pdev);
		return -ENODEV;
	}

Is it a known limitation? Michael, could you please comment on this?

thanks,
vlad

