Re: [E1000-devel] ixgbe/linux/sparc perf issues
On (12/12/14 11:16), Sowmini Varadhan wrote:
> But getting back to linux, 3 Gbps is a far cry from 10 Gbps. I need to
> spend some time collecting data to convince myself that this is purely
> because of HV/IOMMU inefficiency.

[e1000-devel has been Bcc'ed]

I collected the stats, and I have evidence that the HV is not the bottleneck at this point. I am running Linux as the Tx side (TCP client) with 10 threads (iperf -c addr -P 10) against an iperf server that can handle 9-9.5 Gbps:

- Baseline, with default settings (TSO enabled): 9-9.5 Gbps
- TSO disabled via ethtool: drops badly to 2-3 Gbps (!)
- TSO disabled, with an IOMMU patch to break up the monolithic lock: 8.5 Gbps

I'll share the IOMMU patch as an RFC in a separate email to sparclinux.

But the Rx side may have other bottlenecks: even with the IOMMU patch, it is stuck at 3 Gbps, though I can do somewhat better merely by disabling GRO (as recommended by intel.com documentation), so 3 Gbps is probably not the ceiling here.

I am willing to believe that you can't do much better than approximately 8.5 Gbps without additional churn to the DMA design. But 3 Gbps Rx out of a maximum of 10 Gbps suggests that something other than the HV is holding Linux/sparc Rx back. And it might not even be the DMA overhead, since Tx can reach 8.5 Gbps even with a map/unmap for each packet.

I'm still investigating the Rx side, but there are a lot of factors here, with RPS, qdisc, etc. all coming into play. Suggestions for things to investigate are welcome.

--Sowmini
___
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired
Re: [E1000-devel] ixgbe/linux/sparc perf issues
On (12/11/14 15:27), David Miller wrote:
> BTW, Solaris also does things which are remotely exploitable, so these
> optimizations that get them line rate have a serious cost.
>
> In their NIU driver, they recycle all buffers in an RX queue rather than
> allocating new buffers. This means that a malicious TCP application can
> read a lot of data from a bulk sender, then simply stop reading
> completely.

Just to set the record straight, without digressing too much into Solaris internals: Solaris follows the common practice used in such algorithms of having thresholds on the number of loaned (recycled) buffers, precisely to avoid DoS attacks from malicious applications. When that threshold is crossed, the driver falls back to the slower allocate-new-buffers path, so there is no stalling.

But getting back to linux, 3 Gbps is a far cry from 10 Gbps. I need to spend some time collecting data to convince myself that this is purely because of HV/IOMMU inefficiency.

Thanks,
--Sowmini
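To illustrate the threshold scheme being described (a generic sketch of the technique, not actual Solaris driver code — the struct and function names are made up): the driver loans RX buffers up the stack only while the outstanding-loan count stays under a limit; past that, it copies into freshly allocated buffers, so a reader that stops consuming cannot pin the whole ring.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-ring loan accounting. */
struct rx_ring {
    int loaned;     /* buffers currently held by the stack     */
    int loan_limit; /* threshold before allocate-fresh fallback */
};

/* Returns true if the driver may loan (zero-copy) this buffer;
 * false means take the slower path: allocate a new buffer and
 * copy, leaving the ring buffer free for reuse. */
bool rx_can_loan(struct rx_ring *r)
{
    if (r->loaned >= r->loan_limit)
        return false;   /* threshold hit: don't loan */
    r->loaned++;
    return true;
}

/* Called when the stack hands a loaned buffer back to the driver. */
void rx_buf_returned(struct rx_ring *r)
{
    if (r->loaned > 0)
        r->loaned--;
}
```

The malicious-reader scenario then costs only throughput for that one flow (it falls onto the copy path), rather than stalling every flow steered to the RX queue.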
Re: [E1000-devel] ixgbe/linux/sparc perf issues
From: Sowmini Varadhan sowmini.varad...@oracle.com
Date: Thu, 11 Dec 2014 14:45:42 -0500

> 1. lockstat and perf report that iommu-lock is the hot-lock (in a
>    typical instance, I get about 21M contentions out of 27M
>    acquisitions, 25 us avg wait time). Even if I fix this issue
>    (see below), I see:

The real overhead is unavoidable due to the way the hypervisor access to the IOMMU is implemented in sun4v. If we had direct access to the hardware, we could avoid all of the real overhead in 99% of all IOMMU mappings, as we do for pre-sun4v systems.

On sun4u systems, we never flush the IOMMU until we wrap around from the end of the IOMMU arena back to the beginning in order to service an allocation. Such an optimization is impossible with the hypervisor call interface in sun4v.

I've known about this issue for a decade and I do not think there is anything we can really do about it.
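The sun4u scheme described above can be sketched as follows (an illustrative model, not the actual arch/sparc code — the names and arena size are hypothetical): the allocator walks the arena with a next-fit cursor and issues a full IOTLB flush only when the cursor wraps, amortizing one flush across thousands of mappings. Under sun4v, every mapping change goes through a hypervisor call instead, so there is no equivalent place to batch the flush.

```c
#include <assert.h>

#define ARENA_ENTRIES 4096  /* hypothetical IOMMU arena size */

struct iommu_arena {
    unsigned long hint;    /* next-fit allocation cursor      */
    unsigned long flushes; /* count of full-IOTLB flush events */
};

static void iommu_flush_all(struct iommu_arena *a)
{
    /* On real sun4u hardware this would write the flush register;
     * here we just count how often it happens. */
    a->flushes++;
}

/* Allocate one entry; flush only on wraparound, so the flush cost
 * is paid once per ARENA_ENTRIES allocations. */
unsigned long arena_alloc(struct iommu_arena *a)
{
    if (a->hint == ARENA_ENTRIES) {
        iommu_flush_all(a);
        a->hint = 0;
    }
    return a->hint++;
}
```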
Re: [E1000-devel] ixgbe/linux/sparc perf issues
On (12/11/14 15:09), David Miller wrote:
> The real overhead is unavoidable due to the way the hypervisor access
> to the IOMMU is implemented in sun4v. If we had direct access to the
> hardware, we could avoid all of the real overhead in 99% of all IOMMU
> mappings, as we do for pre-sun4v systems.
>
> On sun4u systems, we never flush the IOMMU until we wrap around from
> the end of the IOMMU arena back to the beginning in order to service an
> allocation. Such an optimization is impossible with the hypervisor call
> interface in sun4v.
>
> I've known about this issue for a decade and I do not think there is
> anything we can really do about this.

All this may be true, but it would also be true for Solaris, which manages to do line speed (for the exact same setup), so there must be some other bottleneck going on.

And fwiw, removing the iommu lock contention from lockstat did not make any difference to the throughput, which seems to indicate that the bottleneck is elsewhere. Hence the question about the ixgbe stats, and any tuning that I may be missing.

--Sowmini
Re: [E1000-devel] ixgbe/linux/sparc perf issues
From: Sowmini Varadhan sowmini.varad...@oracle.com
Date: Thu, 11 Dec 2014 15:21:00 -0500

> All this may be true, but it would also be true for Solaris, which
> manages to do line speed (for the exact same setup), so there must be
> some other bottleneck going on?

They have DMA mapping interfaces which pre-allocate large batches of mappings at a time.

> And fwiw, removing the iommu lock contention from lockstat did not
> make any difference to the throughput, which seems to indicate that
> the bottleneck is elsewhere.

Like I said, it's in the hypervisor IOMMU interfaces implementing the hardware accesses to flush the hardware and adjust the DMA mappings. The lock just shows up because the overhead bubbles up to the closest non-hypervisor code.
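The batching approach being attributed to Solaris can be sketched generically (this is an illustration of the amortization idea, not Solaris code — `hv_reserve_batch`, the struct, and the batch size are all invented for the example): pay the expensive per-mapping cost once per batch, then hand out mappings from the cached batch with no further hypervisor traffic.

```c
#include <assert.h>

#define BATCH 64  /* hypothetical mappings reserved per expensive call */

struct map_cache {
    unsigned long next, avail; /* cached range: [next, next + avail) */
    unsigned long cursor;      /* global arena cursor                */
    unsigned long hv_calls;    /* expensive trips to the hypervisor  */
};

/* Stand-in for an expensive hypervisor call that reserves a
 * contiguous batch of IOMMU mappings. */
static unsigned long hv_reserve_batch(struct map_cache *c)
{
    unsigned long base = c->cursor;
    c->cursor += BATCH;
    c->hv_calls++;
    return base;
}

/* Hand out one mapping; only one hypervisor call per BATCH mappings,
 * instead of one per packet. */
unsigned long map_alloc(struct map_cache *c)
{
    if (c->avail == 0) {
        c->next  = hv_reserve_batch(c);
        c->avail = BATCH;
    }
    c->avail--;
    return c->next++;
}
```

With per-packet map/unmap, 27M acquisitions would mean 27M hypervisor round trips; batched, the same load would cost roughly 27M/BATCH, which is the kind of gap that could explain Solaris reaching line rate on identical hardware.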
Re: [E1000-devel] ixgbe/linux/sparc perf issues
From: David Miller da...@davemloft.net
Date: Thu, 11 Dec 2014 15:24:17 -0500 (EST)

> From: Sowmini Varadhan sowmini.varad...@oracle.com
> Date: Thu, 11 Dec 2014 15:21:00 -0500
>
>> All this may be true, but it would also be true for Solaris, which
>> manages to do line speed (for the exact same setup), so there must be
>> some other bottleneck going on?
>
> They have DMA mapping interfaces which pre-allocate large batches of
> mappings at a time.

BTW, Solaris also does things which are remotely exploitable, so these optimizations that get them line rate have a serious cost.

In their NIU driver, they recycle all buffers in an RX queue rather than allocating new buffers. This means that a malicious TCP application can read a lot of data from a bulk sender, then simply stop reading completely. This will put the entire RX queue of packets in limbo in the TCP stack, never to be recycled back to the NIU driver, thus completely stalling all traffic that steers to that RX queue.

So that is how Solaris gets line rate with this kind of hardware.