[I sent this to the e1000-devel folks, and they suggested netdev might
have opinions too. the text below has changed a little bit to reflect
feedback from Auke Kok]
attached is a small patch for e1000 that dynamically changes the Interrupt
Throttle Rate (ITR) for best performance - both latency and bandwidth.
it makes e1000 look really good on netpipe with a ~28 us latency and
890 Mbit/s bandwidth.
the basic idea is that high InterruptThrottleRate (~200k) is best for
small messages, whilst low ITR (~15k) is best for large messages.
leaving the ITR high for large messages burns outrageous amounts of cpu,
and any less than ~15k ITR is bad for bandwidth.
so this patch creates a new "performance dynamic" mode
InterruptThrottleRate=2 (2,2 for dual NICs)
which changes the ITR on the fly. the patch is based on the existing
"dynamic" mode (ITR=1) which seems to be optimised for low cpu usage
with little concern for performance.
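
for a sense of scale, here is a small standalone sketch (not part of the patch) of what those ITR values mean as time between interrupts. the helper names are made up for illustration, and the 256 ns register granularity is an assumption taken from the 1000000000 / (itr * 256) expression the driver uses when it writes the ITR register.

```c
#include <stdint.h>

/* Hypothetical helper: minimum gap between interrupts, in nanoseconds,
 * for a given throttle rate in interrupts per second. */
static uint32_t itr_gap_ns(uint32_t ints_per_sec)
{
    return 1000000000u / ints_per_sec;
}

/* Hypothetical helper: the value the driver would write to the ITR
 * register, assuming the register counts in units of 256 ns (this
 * mirrors the expression used in e1000_main.c). Integer division
 * truncates, so the real rate ends up slightly above the request. */
static uint32_t itr_reg_value(uint32_t ints_per_sec)
{
    return 1000000000u / (ints_per_sec * 256u);
}
```

so 200k ITR allows an interrupt every 5 us (register value 19), and 15k ITR allows one roughly every 66 us (register value 260).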
hopefully the thresholds chosen for ITR changeovers will be ok on other
people's hardware too, but I really have no idea how universal it'll be.
we've been running it for a few months on our cluster and it appears stable.
the 10M, 20M and 100M thresholds for changing between the 200k, 90k, 30k
and 15k ITRs were set pretty much by eye - by doing a bunch of netpipe runs
and trying to minimise cpu usage (ITR) for a target latency/bandwidth.
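
the changeover logic amounts to the following, written out as a standalone function instead of the nested ternary used in the patch. the function name is hypothetical, and the byte counts are assumed to be whatever the driver has accumulated in gotcl/gorcl since the previous watchdog run.

```c
#include <stdint.h>

/* Hypothetical standalone version of the "performance dynamic" choice:
 * pick a throttle rate (interrupts/sec) from the busier direction's
 * byte count. Thresholds are the by-eye values described above. */
static uint32_t perf_dynamic_itr(uint32_t tx_bytes, uint32_t rx_bytes)
{
    uint32_t busiest = tx_bytes > rx_bytes ? tx_bytes : rx_bytes;
    uint32_t mbytes = busiest / 1000000; /* truncating, as in the patch */

    if (mbytes > 100)
        return 15000;   /* heavy traffic: favour bandwidth and low cpu */
    if (mbytes > 20)
        return 30000;
    if (mbytes > 10)
        return 90000;
    return 200000;      /* light traffic: favour latency */
}
```

note that because of the truncating divide, a count has to reach 11M, 21M or 101M before the rate actually steps down.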
I've done an analysis of performance on this page:
http://www.cita.utoronto.ca/mediawiki/index.php/E1000_performance_patch
our hardware details are there too.
there's also a link to another analysis of how the patch affects routing
performance and cpu usage (surprisingly better).
despite the netpipe improvements, I haven't seen much in the way of real
world code differences (either +ve or -ve) from a regular 15k ITR. I've
seen an improvement in one code, and a slight degradation (~1%) in HPL
(top500.org benchmark). it should probably make the most difference for
codes that consistently send small (< 1k) messages.
one possible improvement would be if the watchdog routine was called
more than once every 2 seconds - that would allow the ITR to adapt more
often.
ideally (I think) for traffic with mixed packet sizes the ITR would be
adapted hundreds of times a second, but I'm not sure how practical that is.
cheers,
robin
diff -ru e1000-7.0.33/src/e1000_main.c e1000-7.0.33-rjh-performance/src/e1000_main.c
--- e1000-7.0.33/src/e1000_main.c 2006-02-03 16:53:41.000000000 -0500
+++ e1000-7.0.33-rjh-performance/src/e1000_main.c 2006-04-01 21:44:21.000000000 -0500
@@ -1732,7 +1732,7 @@
if (hw->mac_type >= e1000_82540) {
E1000_WRITE_REG(hw, RADV, adapter->rx_abs_int_delay);
- if (adapter->itr > 1)
+ if (adapter->itr > 2)
E1000_WRITE_REG(hw, ITR,
1000000000 / (adapter->itr * 256));
}
@@ -2394,17 +2394,30 @@
}
}
- /* Dynamic mode for Interrupt Throttle Rate (ITR) */
- if (adapter->hw.mac_type >= e1000_82540 && adapter->itr == 1) {
- /* Symmetric Tx/Rx gets a reduced ITR=2000; Total
- * asymmetrical Tx or Rx gets ITR=8000; everyone
- * else is between 2000-8000. */
- uint32_t goc = (adapter->gotcl + adapter->gorcl) / 10000;
- uint32_t dif = (adapter->gotcl > adapter->gorcl ?
- adapter->gotcl - adapter->gorcl :
- adapter->gorcl - adapter->gotcl) / 10000;
- uint32_t itr = goc > 0 ? (dif * 6000 / goc + 2000) : 8000;
- E1000_WRITE_REG(&adapter->hw, ITR, 1000000000 / (itr * 256));
+ /* Dynamic modes for Interrupt Throttle Rate (ITR) */
+ if (adapter->hw.mac_type >= e1000_82540) {
+ if (adapter->itr == 1) {
+ /* Symmetric Tx/Rx gets a reduced ITR=2000; Total
+ * asymmetrical Tx or Rx gets ITR=8000; everyone
+ * else is between 2000-8000. */
+ uint32_t goc = (adapter->gotcl + adapter->gorcl) / 10000;
+ uint32_t dif = (adapter->gotcl > adapter->gorcl ?
+ adapter->gotcl - adapter->gorcl :
+ adapter->gorcl - adapter->gotcl) / 10000;
+ uint32_t itr = goc > 0 ? (dif * 6000 / goc + 2000) : 8000;
+ E1000_WRITE_REG(&adapter->hw, ITR, 1000000000 / (itr * 256));
+ }
+ else if (adapter->itr == 2) { /* low latency, high bandwidth, moderate cpu usage */
+ /* range from high itr at low cl, to low itr at high cl
+ * < 10M => large itr
+ * 10M to 20M => 90k itr
+ * 20M to 100M => 30k itr
+ * > 100M => 15k itr */
+ uint32_t goc = max(adapter->gotcl, adapter->gorcl) / 1000000;
+ uint32_t itr = goc > 10 ? (goc > 20 ? (goc > 100 ? 15000: 30000): 90000): 200000;
+ /* DPRINTK(PROBE, INFO, "e1000 ITR %d - [tr]cl min/ave/max %dm / %dm/ %dm\n", itr, min(adapter->gotcl, adapter->gorcl) / 1000000, (adapter->gotcl + adapter->gorcl) / 2000000, max(adapter->gotcl, adapter->gorcl) / 1000000 ); */
+ E1000_WRITE_REG(&adapter->hw, ITR, 1000000000 / (itr * 256));
+ }
}
/* Cause software interrupt to ensure rx ring is cleaned */
diff -ru e1000-7.0.33/src/e1000_param.c e1000-7.0.33-rjh-performance/src/e1000_param.c
--- e1000-7.0.33/src/e1000_param.c 2006-02-03 16:53:41.000000000 -0500
+++ e1000-7.0.33-rjh-performance/src/e1000_param.c 2006-03-29 21:42:00.000000000 -0500
@@ -538,6 +538,10 @@
DPRINTK(PROBE, INFO, "%s set to dynamic mode\n",
opt.name);
break;
+ case 2:
+ DPRINTK(PROBE, INFO, "%s set to performance dynamic mode\n",
+ opt.name);
+ break;
default:
e1000_validate_option(&adapter->itr, &opt,
adapter);