Re: [E1000-devel] ixgbe/linux/sparc perf issues

2014-12-19 Thread Sowmini Varadhan
On (12/12/14 11:16), Sowmini Varadhan wrote:
 
> But getting back to linux, 3 Gbps is a far cry from 10 Gbps.
> I need to spend some time collecting data to convince myself that
> this is purely because of HV/IOMMU inefficiency.
 

[e1000-devel has been Bcc'ed]

I collected the stats, and I have evidence that the HV is not the
bottleneck at this point:

I am running linux as the Tx side (TCP client) with 10 threads 
(iperf -c addr -P 10) against an iperf server that can handle
9-9.5 Gbps. 
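
(The receiving end is nothing exotic: any stock iperf server, e.g.
iperf -s, on a peer fast enough to sink 9-9.5 Gbps; the exact
server-side flags are not important here.)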

  Baseline:
   with default settings (TSO enabled):            9-9.5 Gbps
   with TSO disabled via ethtool
    (ethtool -K <dev> tso off), it drops badly to: 2-3 Gbps  (!)

  With the iommu patch to break up the monolithic lock: 8.5 Gbps
  (note: still with no TSO!)
  
I'll share the iommu patch as an RFC in a separate email to sparclinux.
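
For anyone not on sparclinux: the rough shape of the change is
sketched below. This is illustrative only -- the names are invented
and the actual RFC differs in detail -- but the idea is to carve the
IOMMU arena into pools, each with its own lock, instead of serializing
every map/unmap on one global lock.

#include <linux/spinlock.h>
#include <linux/bitmap.h>
#include <linux/smp.h>

#define NPOOLS 16

struct iommu_pool {
        spinlock_t    lock;   /* protects [start, end) of the map */
        unsigned long start;  /* first arena slot owned by this pool */
        unsigned long end;    /* one past the last slot */
        unsigned long hint;   /* next-fit search hint */
};

struct iommu_arena {
        unsigned long    *map;           /* one bit per IOMMU page */
        struct iommu_pool pools[NPOOLS];
};

/* Spread concurrent mappers across pools by CPU. */
static unsigned long arena_alloc(struct iommu_arena *arena,
                                 unsigned long npages)
{
        struct iommu_pool *p;
        unsigned long n;

        p = &arena->pools[raw_smp_processor_id() % NPOOLS];
        spin_lock(&p->lock);
        n = bitmap_find_next_zero_area(arena->map, p->end, p->hint,
                                       npages, 0);
        if (n >= p->end)        /* hit pool end: retry from pool start */
                n = bitmap_find_next_zero_area(arena->map, p->end,
                                               p->start, npages, 0);
        if (n < p->end) {
                bitmap_set(arena->map, n, npages);
                p->hint = n + npages;
        } else {
                n = -1UL;       /* pool full; real code would try others */
        }
        spin_unlock(&p->lock);
        return n;
}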

But the Rx side may have other bottlenecks: even with the iommu
patch, it is stuck at 3 Gbps, though I can get something a bit
better merely by disabling GRO (ethtool -K <dev> gro off, as
recommended by the intel.com documentation), so 3 Gbps is probably
not the ceiling here.

I am willing to believe that you can't do much better than
approx 8.5 Gbps without additional churn to the DMA design.
But 3 Gbps Rx out of a max of 10 Gbps suggests that something 
other than the HV is holding linux/sparc/Rx back. 

And it might not even be the DMA overhead, since Tx can pull 8.5 Gbps
even with a map/unmap for each packet. I'm still investigating the Rx
side, but there are a lot of factors here, with RPS, qdisc, etc. all
coming into play. Suggestions for things to investigate are welcome.
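
(One more knob on my list: RPS is programmed per rx queue via
/sys/class/net/<dev>/queues/rx-<N>/rps_cpus, so spreading the Rx
processing across a wider CPU mask there is an easy experiment to
see whether the 3 Gbps ceiling moves.)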

--Sowmini





Re: [E1000-devel] ixgbe/linux/sparc perf issues

2014-12-12 Thread Sowmini Varadhan
On (12/11/14 15:27), David Miller wrote:
 
> BTW, Solaris also does things which are remotely exploitable, so
> these optimizations that get them line rate have a serious cost.
>
> In their NIU driver, they recycle all buffers in an RX queue rather
> than allocating new buffers.
>
> This means that a malicious TCP application can read a lot
> of data from a bulk sender, then simply stop reading completely.

Just to set the record straight, without digressing too much into
Solaris internals...

Solaris follows the common practice in such algorithms of putting a
threshold on the number of loaned (recycled) buffers, precisely to
avoid DoS attacks from malicious applications. When that threshold is
crossed, the driver falls back to the slower allocate-new-buffers path,
so there is no stalling.
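
In pseudo-C, the pattern looks roughly like this (invented names and
a deliberately simplified free list; this is not Solaris source):

#include <stdlib.h>

struct rx_buf { struct rx_buf *next; char data[2048]; };

struct rxq {
        struct rx_buf *recycle;  /* buffers the stack has returned */
        int loaned;              /* buffers still held up the stack */
        int loan_limit;          /* threshold, e.g. half the ring */
};

static struct rx_buf *rxq_get_buffer(struct rxq *q)
{
        if (q->recycle && q->loaned < q->loan_limit) {
                /* Fast path: loan out an already-mapped buffer. */
                struct rx_buf *b = q->recycle;

                q->recycle = b->next;
                q->loaned++;
                return b;
        }
        /* Slow path: too many buffers in flight (or none returned),
         * so allocate fresh -- a reader that stops consuming cannot
         * pin the whole ring and stall the queue. */
        return malloc(sizeof(struct rx_buf));
}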

But getting back to linux, 3 Gbps is a far cry from 10 Gbps.
I need to spend some time collecting data to convince myself that
this is purely because of HV/IOMMU inefficiency.

Thanks,
--Sowmini






Re: [E1000-devel] ixgbe/linux/sparc perf issues

2014-12-11 Thread David Miller
From: Sowmini Varadhan sowmini.varad...@oracle.com
Date: Thu, 11 Dec 2014 14:45:42 -0500

> 1. lockstat and perf report that the iommu lock is the hot lock (in
> a typical instance, I get about 21M contentions out of 27M
> acquisitions, 25 us avg wait time). Even if I fix this issue (see
> below), I see:
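
(Back of the envelope: 21M contentions at a ~25 us average wait is
roughly 500 CPU-seconds of cumulative spinning in one run, i.e. the
mapping path is essentially serialized on that lock.)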

The real overhead is unavoidable due to the way the hypervisor access
to the IOMMU is implemented in sun4v.

If we had direct access to the hardware, we could avoid all of the
real overhead in 99% of all IOMMU mappings, as we do for pre-sun4v
systems.

On sun4u systems, we never flush the IOMMU until we wrap around
the end of the IOMMU arena to the beginning in order to service
an allocation.
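
In outline it is something like the sketch below (simplified for
illustration: invented names, and the actual allocator also searches
a free map, which this skips):

struct arena {
        unsigned long hint;   /* next slot to hand out */
        unsigned long size;   /* total slots in the arena */
};

/* iommu_flush_all() stands in for the direct hardware access that
 * sun4v guests do not get; the name is hypothetical. */
static long arena_alloc(struct arena *a, unsigned long npages)
{
        if (a->hint + npages > a->size) {
                /* Only on wrap-around can stale IOTLB entries from
                 * the previous pass be reused, so flush once here
                 * instead of on every unmap. */
                iommu_flush_all();
                a->hint = 0;
        }
        a->hint += npages;
        return a->hint - npages;   /* first slot of the new range */
}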

Such an optimization is impossible with the hypervisor call interface
in sun4v.

I've known about this issue for a decade and I do not think there is
anything we can really do about this.



Re: [E1000-devel] ixgbe/linux/sparc perf issues

2014-12-11 Thread Sowmini Varadhan
On (12/11/14 15:09), David Miller wrote:
 
> The real overhead is unavoidable due to the way the hypervisor access
> to the IOMMU is implemented in sun4v.
>
> If we had direct access to the hardware, we could avoid all of the
> real overhead in 99% of all IOMMU mappings, as we do for pre-sun4v
> systems.
>
> On sun4u systems, we never flush the IOMMU until we wrap around
> the end of the IOMMU arena to the beginning in order to service
> an allocation.
>
> Such an optimization is impossible with the hypervisor call interface
> in sun4v.
>
> I've known about this issue for a decade and I do not think there is
> anything we can really do about this.

All this may be true, but it would also be true for Solaris, which
manages to run at line rate for the exact same setup, so there must
be some other bottleneck in play?

And FWIW, eliminating the iommu lock contention (so that it no longer
shows up in lockstat) did not make any difference to the throughput,
which seems to indicate that the bottleneck is elsewhere. Hence the
question about the ixgbe stats, and tuning that I may be missing.
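
(For the stats themselves: ethtool -S <dev> dumps the per-queue ixgbe
counters, which should at least separate drops in the NIC from losses
further up the stack.)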

--Sowmini




Re: [E1000-devel] ixgbe/linux/sparc perf issues

2014-12-11 Thread David Miller
From: Sowmini Varadhan sowmini.varad...@oracle.com
Date: Thu, 11 Dec 2014 15:21:00 -0500

> All this may be true, but it would also be true for Solaris, which
> manages to run at line rate for the exact same setup, so there must
> be some other bottleneck in play?

They have DMA mapping interfaces which pre-allocate large batches
of mappings at a time.
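
Schematically, something like this (hypothetical names, not the
Solaris DDI):

struct map_cache {
        unsigned long base;    /* first slot of the current batch */
        unsigned long count;   /* slots obtained in the batch */
        unsigned long used;    /* slots handed out so far */
};

/* iommu_alloc_batch() stands in for the expensive allocation call;
 * the point is that its cost is paid once per batch rather than
 * once per packet. */
static long map_one(struct map_cache *mc)
{
        if (mc->used == mc->count) {
                mc->base = iommu_alloc_batch(&mc->count);
                mc->used = 0;
        }
        return mc->base + mc->used++;
}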

> And FWIW, eliminating the iommu lock contention (so that it no longer
> shows up in lockstat) did not make any difference to the throughput,
> which seems to indicate that the bottleneck is elsewhere.

Like I said, it's in the hypervisor IOMMU interfaces that implement
the hardware accesses to flush the hardware and adjust the DMA
mappings.

The lock just shows up because the overhead bubbles up to the closest
non-hypervisor code.



Re: [E1000-devel] ixgbe/linux/sparc perf issues

2014-12-11 Thread David Miller
From: David Miller da...@davemloft.net
Date: Thu, 11 Dec 2014 15:24:17 -0500 (EST)

> From: Sowmini Varadhan sowmini.varad...@oracle.com
> Date: Thu, 11 Dec 2014 15:21:00 -0500
>
>> All this may be true, but it would also be true for Solaris, which
>> manages to run at line rate for the exact same setup, so there must
>> be some other bottleneck in play?
>
> They have DMA mapping interfaces which pre-allocate large batches
> of mappings at a time.

BTW, Solaris also does things which are remotely exploitable, so
these optimizations that get them line rate have a serious cost.

In their NIU driver, they recycle all buffers in an RX queue rather
than allocating new buffers.

This means that a malicious TCP application can read a lot
of data from a bulk sender, then simply stop reading completely.

This will leave the entire RX queue's packets in limbo in the TCP
stack, never to be recycled back to the NIU driver, thus completely
stalling all traffic that is steered to that RX queue.

So that is how Solaris gets line rate with this kind of hardware.
