On 02/27/2013 09:55 AM, Eliezer Tamir wrote:
> This patchset adds the ability for the socket layer code to poll directly
> on an Ethernet device's RX queue. This eliminates the cost of the interrupt
> and context switch and with proper tuning allows us to get very close
> to the HW latency.
>
> This is a follow up to Jesse Brandeburg's Kernel Plumbers talk from last year
> http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/09/2012-lpc-Low-Latency-Sockets-slides-brandeburg.pdf
>
> Patch 1 adds ndo_ll_poll and the IP code to use it.
> Patch 2 is an example of how TCP can use ndo_ll_poll.
> Patch 3 shows how this method would be implemented for the ixgbe driver.
> Patch 4 adds statistics to the ixgbe driver for ndo_ll_poll events.
> (Optional) Patch 5 is a handy kprobes module to measure detailed latency
> numbers.
>
> this patchset is also available in the following git branch
> git://github.com/jbrandeb/lls.git rfc
>
> Performance numbers:
> Kernel   Config     C3/6  rx-usecs  TCP  UDP
> 3.8rc6   typical    off   adaptive  37k  40k
> 3.8rc6   typical    off   0*        50k  56k
> 3.8rc6   optimized  off   0*        61k  67k
> 3.8rc6   optimized  on    adaptive  26k  29k
> patched  typical    off   adaptive  70k  78k
> patched  optimized  off   adaptive  79k  88k
> patched  optimized  off   100       84k  92k
> patched  optimized  on    adaptive  83k  91k
> *rx-usecs=0 is usually not useful in a production environment.
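
Just to check that I'm reading the cover letter correctly - my mental
model of the mechanism is that the socket receive path spins on the
driver's new poll hook for a bounded time instead of sleeping until the
RX interrupt fires.  Something like the sketch below, where everything
apart from the ndo_ll_poll name is my own guess rather than anything
taken from the patches:

/*
 * Mental-model sketch only, not code from the patches: the driver
 * exposes a low-latency poll hook (ndo_ll_poll per the cover letter)
 * and the socket receive path spins on it for a bounded number of
 * iterations instead of sleeping until the RX interrupt fires.
 */
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* stand-in for whatever the driver's hook really looks like */
extern int example_ll_poll(struct napi_struct *napi);

/* hypothetical helper the IP/TCP receive code might call before blocking */
static bool busy_poll_rx(struct napi_struct *napi,
                         struct sk_buff_head *rxq,
                         unsigned int max_spins)
{
        unsigned int i;

        for (i = 0; i < max_spins && skb_queue_empty(rxq); i++) {
                /* harvest completed descriptors straight off the RX ring */
                example_ll_poll(napi);
                cpu_relax();
        }
        return !skb_queue_empty(rxq);
}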

I would think that latency-sensitive folks would be using rx-usecs=0 in 
production - at least if the NIC in use didn't have low enough latency 
with its default interrupt coalescing/avoidance heuristics.
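(By rx-usecs=0 I assume the usual ethtool -C ethN rx-usecs 0, i.e. the
interrupt coalescing timer disabled.)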

If I take the first "pure" A/B comparison, it seems the change as
benchmarked takes TCP latency from ~27 usec (37k transactions/sec) to
~14 usec (70k).  At what request/response size does the benefit taper
off?  13 usec is roughly 16250 bytes on the wire at 10 GbE.
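
For anyone who wants to check that arithmetic - assuming the usual
single-transaction-at-a-time netperf TCP_RR setup, so transactions/sec
inverts straight to usec per transaction:

/* back-of-the-envelope check of the numbers quoted above */
#include <stdio.h>

int main(void)
{
        const double base_tps    = 37000.0;  /* 3.8rc6, adaptive rx-usecs */
        const double patched_tps = 70000.0;  /* patched, adaptive rx-usecs */
        const double line_bps    = 10e9;     /* 10 GbE */

        double base_us    = 1e6 / base_tps;        /* ~27.0 usec/transaction */
        double patched_us = 1e6 / patched_tps;     /* ~14.3 usec/transaction */
        double saved_us   = base_us - patched_us;  /* ~12.7 usec saved */

        /* bytes that take the same time to serialize at line rate */
        double bytes = saved_us * 1e-6 * line_bps / 8.0;  /* ~15.9 kB */

        printf("%.1f -> %.1f usec, %.1f usec saved ~= %.0f bytes at 10 GbE\n",
               base_us, patched_us, saved_us, bytes);
        return 0;
}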

The last time I looked at netperf TCP_RR performance where something
similar could happen, I believe it was IPoIB, where things could be set
up so that polling happened rather than wakeups (perhaps via a shim
library that converted netperf's socket calls to "native" IB).  My
recollection is that the spinning "did a number" on the netperf service
demands.  It would be good to include those figures in any subsequent
rounds of benchmarking.
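
For what it's worth, adding -c and -C to the netperf command line turns
on local and remote CPU measurement, and the TCP_RR output then reports
service demand per transaction, so any cost of the spinning would show
up right next to the transaction rate.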

Am I correct in assuming this is a mechanism which would not be used in 
a high aggregate PPS situation?

happy benchmarking,

rick jones
