Re: [PATCH] LRO ack aggregation
From: Andrew Gallatin [EMAIL PROTECTED]
Date: Tue, 20 Nov 2007 06:47:57 -0500

> David Miller wrote:
>> From: Herbert Xu [EMAIL PROTECTED]
>> Date: Tue, 20 Nov 2007 14:09:18 +0800
>>
>>> David Miller [EMAIL PROTECTED] wrote:
>>>> Fundamentally, I really don't like this change, it batches to the
>>>> point where it begins to erode the natural ACK clocking of TCP,
>>>> and I therefore am very likely to revert it before merging to
>>>> Linus.
>>>
>>> Perhaps make it a tunable that defaults to off?
>>
>> That's one idea.
>
> I'd certainly prefer the option to have a tunable to having our
> customers see performance regressions when they switch to the
> kernel's LRO.

Please qualify this because by itself it's an inaccurate statement.

It would cause a performance regression in situations where there is
nearly no packet loss, no packet reordering, and the receiver has
strong enough cpu power.

Because in fact, some customers might see performance regressions by
using this ack aggregation code.  In particular if they are talking
to the real internet at all.

> If vendors are going to pick this up, there is the risk of them just
> applying this patch (which currently has no tunable to disable it),
> leaving their users stuck with it enabled.

I will watch out for this and make sure to advise them strongly not
to do this.  So you can be sure this won't happen. :-)

> FWIW, we've seen TCP perform well in a WAN setting using our NICs
> and our LRO which does this ack aggregation.  For example, the last
> 2 Supercomputing Bandwidth Challenges (making the most of a 10Gb/s
> WAN connection) were won by teams using our NICs, with drivers that
> did this sort of ack aggregation.

And basically nearly no packet loss, which just supports my
objections to this even more.

And this doesn't even begin to consider the RX cpu limited cases,
where again ACK stretching hurts a lot.

The bandwidth challenge cases are not very realistic at all, and are
about as far from the realities of real internet traffic as you can
get.

Show me something over real backbones, talking to hundreds or
thousands of clients scattered all over the world.  That's what
people will be using these high end NICs for, front facing services,
and that's where loss happens and stretch ACKs hurt performance.

ACK stretching is bad bad bad for everything outside of some well
controlled test network bubble.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH] LRO ack aggregation
David Miller wrote:
> From: Herbert Xu [EMAIL PROTECTED]
> Date: Tue, 20 Nov 2007 14:09:18 +0800
>
>> David Miller [EMAIL PROTECTED] wrote:
>>> Fundamentally, I really don't like this change, it batches to the
>>> point where it begins to erode the natural ACK clocking of TCP,
>>> and I therefore am very likely to revert it before merging to
>>> Linus.
>>
>> Perhaps make it a tunable that defaults to off?
>
> That's one idea.

I'd certainly prefer the option to have a tunable to having our
customers see performance regressions when they switch to the
kernel's LRO.

> But if it's there the risk is to have it end up being turned on
> always by distribution vendors, making our off-default pointless and
> the internet gets crapped up with misbehaving Linux TCP stacks
> anyways.

If vendors are going to pick this up, there is the risk of them just
applying this patch (which currently has no tunable to disable it),
leaving their users stuck with it enabled.  At least with a tunable,
it would be easy for them to turn it off.  And the comments
surrounding it could make it clear why it should default to off.

FWIW, we've seen TCP perform well in a WAN setting using our NICs and
our LRO which does this ack aggregation.  For example, the last 2
Supercomputing Bandwidth Challenges (making the most of a 10Gb/s WAN
connection) were won by teams using our NICs, with drivers that did
this sort of ack aggregation.

Drew
Re: [PATCH] LRO ack aggregation
Hi.

On Tue, Nov 20, 2007 at 08:27:05AM -0500, Andrew Gallatin ([EMAIL PROTECTED]) wrote:
> Hmm.. rather than a global tunable, what if it was a network driver
> managed tunable which toggled a flag in the lro_mgr features?  Would
> that be better?

What about ethtool control to set LRO_simple and LRO_ACK_aggregation?

-- 
	Evgeniy Polyakov
Re: [PATCH] LRO ack aggregation
David Miller wrote:
> From: Andrew Gallatin [EMAIL PROTECTED]
> Date: Tue, 20 Nov 2007 06:47:57 -0500
>
>> David Miller wrote:
>>> From: Herbert Xu [EMAIL PROTECTED]
>>> Date: Tue, 20 Nov 2007 14:09:18 +0800
>>>
>>>> David Miller [EMAIL PROTECTED] wrote:
>>>>> Fundamentally, I really don't like this change, it batches to
>>>>> the point where it begins to erode the natural ACK clocking of
>>>>> TCP, and I therefore am very likely to revert it before merging
>>>>> to Linus.
>>>>
>>>> Perhaps make it a tunable that defaults to off?
>>>
>>> That's one idea.
>>
>> I'd certainly prefer the option to have a tunable to having our
>> customers see performance regressions when they switch to the
>> kernel's LRO.
>
> Please qualify this because by itself it's an inaccurate statement.
>
> It would cause a performance regression in situations where there is
> nearly no packet loss, no packet reordering, and the receiver has
> strong enough cpu power.

Yes, a regression of nearly 1Gb/s in some cases as I mentioned when I
submitted the patch.

> Show me something over real backbones, talking to hundreds or
> thousands of clients scattered all over the world.  That's what
> people will be using these high end NICs for, front facing services,
> and that's where loss happens and stretch ACKs hurt performance.

I can't.  I think most 10GbE on endstations is used either in the
server room, or on dedicated links.

My experience with 10GbE users is limited to my interactions with
people using our NICs who contact our support.  Of those, I can
recall only a tiny handful who were using 10GbE on a normal internet
facing connection (and the ones I dealt with were actually running a
different OS).  The vast majority were in a well controlled, lossless
environment.

It is quite ironic.  The very fact that I cannot provide you with
examples of internet facing people using LRO (w/ack aggr) in more
normal applications tends to support my point that most 10GbE users
seem to be in lossless environments.

> ACK stretching is bad bad bad for everything outside of some well
> controlled test network bubble.

I just want those in the bubble to continue to have the best
performance possible in their situation.  If it is a tunable that
defaults to off, that is great.

Hmm.. rather than a global tunable, what if it was a network driver
managed tunable which toggled a flag in the lro_mgr features?  Would
that be better?

Drew
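[A sketch of the driver-managed tunable Drew floats above: the driver would toggle a bit in lro_mgr->features, and the core inet_lro code would test it before aggregating pure acks.  LRO_F_ACK_AGGR is a hypothetical flag name and the struct is pared down for illustration; only LRO_F_NAPI and LRO_F_EXTRACT_VLAN_ID exist in inet_lro.h at this point.]

```c
/* Hypothetical driver-managed tunable, modeled on the existing
 * lro_mgr feature bits.  LRO_F_ACK_AGGR does not exist in the
 * kernel; the struct is simplified to just the features word. */
#define LRO_F_NAPI            1  /* existing: hand packets to stack via NAPI */
#define LRO_F_EXTRACT_VLAN_ID 2  /* existing: vlan id already extracted */
#define LRO_F_ACK_AGGR        4  /* hypothetical: aggregate pure acks */

struct net_lro_mgr {
	unsigned long features;
};

/* The core LRO path would consult the flag before merging a pure
 * ack into a descriptor; default off preserves today's behavior. */
static int lro_may_aggregate_ack(const struct net_lro_mgr *mgr)
{
	return (mgr->features & LRO_F_ACK_AGGR) != 0;
}
```

A driver (or an ethtool hook, per Evgeniy's suggestion) would simply set or clear the bit in its lro_mgr at configuration time.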
Re: [PATCH] LRO ack aggregation
On Tue, Nov 20, 2007 at 09:50:56PM +0800, Herbert Xu ([EMAIL PROTECTED]) wrote:
> On Tue, Nov 20, 2007 at 04:35:09PM +0300, Evgeniy Polyakov wrote:
>> On Tue, Nov 20, 2007 at 08:27:05AM -0500, Andrew Gallatin ([EMAIL PROTECTED]) wrote:
>>> Hmm.. rather than a global tunable, what if it was a network
>>> driver managed tunable which toggled a flag in the lro_mgr
>>> features?  Would that be better?
>>
>> What about ethtool control to set LRO_simple and
>> LRO_ACK_aggregation?
>
> I have two concerns about this:
>
> 1) That same option can still be turned on by distros.

FC and Debian turn on hardware checksum offloading in e1000, and I
have a card where this results in more than 10% performance
_decrease_.  I do not know why, but I'm able to run a script which
disables it via ethtool.

> 2) This doesn't make sense because the code is actually in the core
>    networking stack.

It depends.  Software LRO can be controlled by a simple procfs
switch, but a hardware one?  I recall it was pointed out a number of
times that hardware LRO is possible and likely being implemented in
some asics.

> I'm particularly unhappy about 2) because I don't want to be in a
> situation down the track where every driver is going to add this
> option so that they're not left behind in the arms race.

For software LRO I agree, but this looks exactly like the gso/tso
case and an additional tweak for software gso.  Having it per-system
is fine, and I believe no one should ever care that some distro will
do bad/good things with it.  Actually we do have so many tricky
options in procfs already which can kill performance...

-- 
	Evgeniy Polyakov
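[The ethtool workaround Evgeniy describes would look something like the following CLI fragment.  The interface name eth0 is an assumption, it requires root and real e1000 hardware, and it is a sketch of the idea rather than his actual script.]

```shell
# Show current offload settings for the NIC (hypothetical eth0):
ethtool -k eth0

# Disable rx/tx checksum offloading, working around the e1000
# performance decrease Evgeniy mentions:
ethtool -K eth0 rx off tx off
```

The same -k/-K mechanism is what an ethtool-controlled LRO_ACK_aggregation knob would presumably hang off, which is why the "distros can still flip it" concern applies equally here.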
Re: [PATCH] LRO ack aggregation
On Tue, Nov 20, 2007 at 04:35:09PM +0300, Evgeniy Polyakov wrote:
> On Tue, Nov 20, 2007 at 08:27:05AM -0500, Andrew Gallatin ([EMAIL PROTECTED]) wrote:
>> Hmm.. rather than a global tunable, what if it was a network driver
>> managed tunable which toggled a flag in the lro_mgr features?
>> Would that be better?
>
> What about ethtool control to set LRO_simple and LRO_ACK_aggregation?

I have two concerns about this:

1) That same option can still be turned on by distros.

2) This doesn't make sense because the code is actually in the core
   networking stack.

I'm particularly unhappy about 2) because I don't want to be in a
situation down the track where every driver is going to add this
option so that they're not left behind in the arms race.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] LRO ack aggregation
On Tue, Nov 20, 2007 at 05:03:12PM +0300, Evgeniy Polyakov wrote:
> For software LRO I agree, but this looks exactly like the gso/tso
> case and an additional tweak for software gso.  Having it per-system
> is fine, and I believe no one should ever care that some distro will
> do bad/good things with it.  Actually we do have so many tricky
> options in procfs already which can kill performance...

Right, if you're doing it such that the same option automatically
shows up for every driver that uses software LRO then my second
concern goes away.

Of course we still have the problem with the option in general that
Dave raised.  That is, this may cause the proliferation of TCP
receiver behaviour that may be undesirable.

Cheers,
Herbert
Re: [PATCH] LRO ack aggregation
On Tue, Nov 20, 2007 at 10:08:31PM +0800, Herbert Xu ([EMAIL PROTECTED]) wrote:
> Of course we still have the problem with the option in general that
> Dave raised.  That is, this may cause the proliferation of TCP
> receiver behaviour that may be undesirable.

Yes, it results in bursts of traffic because of delayed acks
accumulated in the sender's LRO engine.  But on the other hand, if
the receiver is slow, then it will send acks slowly and they will be
accumulated slowly, thus changing not only seq/ack numbers, but also
timings, which is equal to increasing the length of the pipe between
users.  TCP is able to balance on this edge.

I'm sure it depends on the workload, but heavy bulk transfers, where
only LRO with and without ack aggregation can win, are quite usual on
long pipes with high performance numbers.

Until it is tested, I doubt it is possible to say it is 100% good or
bad, so my proposal is to write the code, make it tunable from
userspace, turn it off by default and allow people to test the
change.

-- 
	Evgeniy Polyakov
Re: [PATCH] LRO ack aggregation
David Miller wrote:
> From: Andrew Gallatin [EMAIL PROTECTED]
> Date: Tue, 23 Oct 2007 11:11:55 -0400
>
>> I've attached a patch which adds support to inet_lro for
>> aggregating pure acks.
>
> I've applied this patch to net-2.6.25... but!
>
> This needs some serious thinking.
>
> What this patch ends up doing is creating big stretch-ACKs, and
> those can hurt performance.
>
> Stretch ACKs are particularly harmful when either the receiver is
> cpu weak (lacking enough cpu power to fill the pipe completely no
> matter what optimizations are applied) or when there is packet loss
> (less feedback information and ACK clocking).
>
> It also means that the sender will be more bursty, because it will
> now swallow ACKs covering huge portions of the send window, and then
> have large chunks of its send queue it can send out all at once.
>
> Fundamentally, I really don't like this change, it batches to the
> point where it begins to erode the natural ACK clocking of TCP, and
> I therefore am very likely to revert it before merging to Linus.

Sounds like one might as well go ahead and implement
HP-UX/Solaris-like ACK sending avoidance at the receiver and not
bother with LRO-ACK on the sender.

In some experiments a while back I thought I saw that LRO on the
receiver was causing him to send fewer ACKs already?  IIRC that was
with a Myricom card; perhaps I was fooled by its own ACK LRO it was
doing.

rick jones
Re: [PATCH] LRO ack aggregation
From: Rick Jones [EMAIL PROTECTED]
Date: Tue, 20 Nov 2007 11:45:54 -0800

> Sounds like one might as well go ahead and implement
> HP-UX/Solaris-like ACK sending avoidance at the receiver and not
> bother with LRO-ACK on the sender.
>
> In some experiments a while back I thought I saw that LRO on the
> receiver was causing him to send fewer ACKs already?  IIRC that was
> with a Myricom card; perhaps I was fooled by its own ACK LRO it was
> doing.

Linux used to do aggressive ACK deferral, especially when the ucopy
code paths triggered.

I removed that code because I had several scenarios where it hurt
more than it helped performance, and the IETF has explicitly stated
in several documents the (proven) perils of such stretch ACKs.
Re: [PATCH] LRO ack aggregation
On Tue, 20 Nov 2007, Andrew Gallatin wrote:
> David Miller wrote:
>> From: Andrew Gallatin [EMAIL PROTECTED]
>> Date: Tue, 20 Nov 2007 06:47:57 -0500
>>
>>> David Miller wrote:
>>>> From: Herbert Xu [EMAIL PROTECTED]
>>>> Date: Tue, 20 Nov 2007 14:09:18 +0800
>>>>
>>>>> David Miller [EMAIL PROTECTED] wrote:
>>>>>> Fundamentally, I really don't like this change, it batches to
>>>>>> the point where it begins to erode the natural ACK clocking of
>>>>>> TCP, and I therefore am very likely to revert it before merging
>>>>>> to Linus.

I have mixed feelings about this topic.  In general I agree with the
importance of maintaining the natural ACK clocking of TCP for normal
usage.  But there may also be some special cases that could benefit
significantly from such a new LRO pure ACK aggregation feature.  The
rest of my comments are in support of such a new feature, although I
haven't completely made up my own mind yet about the tradeoffs
involved in implementing such a new capability (good arguments are
being made on both sides).

>>>>> Perhaps make it a tunable that defaults to off?
>>>>
>>>> That's one idea.
>>>
>>> I'd certainly prefer the option to have a tunable to having our
>>> customers see performance regressions when they switch to the
>>> kernel's LRO.
>>
>> Please qualify this because by itself it's an inaccurate statement.
>>
>> It would cause a performance regression in situations where there
>> is nearly no packet loss, no packet reordering, and the receiver
>> has strong enough cpu power.

You are basically describing the HPC universe, which while not the
multitudes of the general Internet, is a very real and valid special
community of interest where maximum performance is critical.  For
example, we're starting to see dynamic provisioning of dedicated
10-GigE lambda paths to meet various HPC requirements, just for the
purpose of insuring nearly no packet loss, no packet reordering.  See
for example Internet2's Dynamic Circuit Network (DCN).

In the general Internet case, many smaller flows tend to be
aggregated together up to perhaps a 10-GigE interface, while in the
HPC universe, there tend to be fewer, but much higher individual
bandwidth flows.  But both are totally valid usage scenarios.  So a
tunable that defaults to off for the general case makes sense to me.

> Yes, a regression of nearly 1Gb/s in some cases as I mentioned when
> I submitted the patch.

Which is a significant performance penalty.  But the CPU savings may
be an even more important benefit.

>> Show me something over real backbones, talking to hundreds or
>> thousands of clients scattered all over the world.  That's what
>> people will be using these high end NICs for, front facing
>> services, and that's where loss happens and stretch ACKs hurt
>> performance.

The HPC universe uses real backbones, just not the general Internet
backbones.  Their backbones are engineered to have the
characteristics required for enabling very high performance
applications.

And if performance would take a hit in the general Internet 10-GigE
server case, and that's clearly documented and understood, I don't
see what incentive the distros would have to enable the tunable for
their normal users, since why would they want to cause poorer
performance relative to other distros that stuck with the recommended
default.  The special HPC users could easily enable the option if it
was desired and proven beneficial in their environment.

> I can't.  I think most 10GbE on endstations is used either in the
> server room, or on dedicated links.
>
> My experience with 10GbE users is limited to my interactions with
> people using our NICs who contact our support.  Of those, I can
> recall only a tiny handful who were using 10GbE on a normal internet
> facing connection (and the ones I dealt with were actually running a
> different OS).  The vast majority were in a well controlled,
> lossless environment.
>
> It is quite ironic.  The very fact that I cannot provide you with
> examples of internet facing people using LRO (w/ack aggr) in more
> normal applications tends to support my point that most 10GbE users
> seem to be in lossless environments.

Most use of 10-GigE that I'm familiar with is related to the HPC
universe, but then that's the environment I work in.  I'm sure that
over time the use of 10-GigE in general Internet facing servers will
predominate, since that's where the great mass of users is.  But I
would argue that that doesn't make it the sole usage arena that
matters.

>> ACK stretching is bad bad bad for everything outside of some well
>> controlled test network bubble.

It's not just for network bubbles.  That's where the technology tends
to first be shaken out, but the real goal is use in real-world,
production HPC environments.

> I just want those in the bubble to continue to have the best
> performance possible in their situation.  If it is a tunable that
> defaults to off, that is great.

I totally agree, and think that the tunable (defaulting to off),
allows both the general
Re: [PATCH] LRO ack aggregation
From: Andrew Gallatin [EMAIL PROTECTED]
Date: Tue, 23 Oct 2007 11:11:55 -0400

> I've attached a patch which adds support to inet_lro for aggregating
> pure acks.

I've applied this patch to net-2.6.25... but!

This needs some serious thinking.

What this patch ends up doing is creating big stretch-ACKs, and those
can hurt performance.

Stretch ACKs are particularly harmful when either the receiver is cpu
weak (lacking enough cpu power to fill the pipe completely no matter
what optimizations are applied) or when there is packet loss (less
feedback information and ACK clocking).

It also means that the sender will be more bursty, because it will
now swallow ACKs covering huge portions of the send window, and then
have large chunks of its send queue it can send out all at once.

Fundamentally, I really don't like this change, it batches to the
point where it begins to erode the natural ACK clocking of TCP, and I
therefore am very likely to revert it before merging to Linus.
Re: [PATCH] LRO ack aggregation
From: Herbert Xu [EMAIL PROTECTED]
Date: Tue, 20 Nov 2007 14:09:18 +0800

> David Miller [EMAIL PROTECTED] wrote:
>> Fundamentally, I really don't like this change, it batches to the
>> point where it begins to erode the natural ACK clocking of TCP, and
>> I therefore am very likely to revert it before merging to Linus.
>
> Perhaps make it a tunable that defaults to off?

That's one idea.

But if it's there the risk is to have it end up being turned on
always by distribution vendors, making our off-default pointless and
the internet gets crapped up with misbehaving Linux TCP stacks
anyways.
Re: [PATCH] LRO ack aggregation
David Miller [EMAIL PROTECTED] wrote:
> Fundamentally, I really don't like this change, it batches to the
> point where it begins to erode the natural ACK clocking of TCP, and
> I therefore am very likely to revert it before merging to Linus.

Perhaps make it a tunable that defaults to off?

Cheers,
Herbert
[PATCH] LRO ack aggregation
Hi,

We recently did some performance comparisons between the new inet_lro
LRO support in the kernel, and our Myri10GE in-driver LRO.  For
receive, we found they were nearly identical.  However, for transmit,
we found that Myri10GE's LRO shows much lower CPU utilization.  We
traced the CPU utilization difference to our driver LRO aggregating
TCP acks, and the inet_lro module not doing so.

I've attached a patch which adds support to inet_lro for aggregating
pure acks.  Aggregating pure acks (segments with TCP_PAYLOAD_LENGTH
== 0) entails freeing the skb (or putting the page in the frags
case).  The patch also handles trimming (typical for 54-byte pure ack
frames which have been padded to the ethernet minimum 60 byte frame
size).  In the frags case, I tried to keep things simple by only
doing the trim when the entire frame fits in the first frag.  To be
safe, I ensure that the padding is all 0 (or, more exactly, was some
pattern whose checksum is -0) so that it doesn't impact hardware
checksums.

This patch also fixes a small bug in the skb LRO path dealing with
vlans that I found when doing my own testing.  Specifically, in the
desc->active case, the existing code always fails the
lro_tcp_ip_check() for NICs without LRO_F_EXTRACT_VLAN_ID, because it
fails to subtract the vlan_hdr_len from the skb->len.

Jan-Bernd Themann ([EMAIL PROTECTED]) has tested the patch using the
eHEA driver (skb codepath), and I have tested it using Myri10GE (both
frags and skb codepath).

Using a pair of identical low-end 2.0GHz Athlon 64 x2 3800+ machines
with Myri10GE 10GbE NICs, I ran 10 iterations of netperf TCP_SENDFILE
tests, taking the median run for comparison purposes.
The receiver was running Myri10GE + patched inet_lro:

TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to rome-my
(192.168.1.16) port 0 AF_INET : cpu bind

Recv   Send    Send                       Utilization     Service Demand
Socket Socket  Message  Elapsed           Send    Recv    Send    Recv
Size   Size    Size     Time   Throughput local   remote  local   remote
bytes  bytes   bytes    secs.  10^6bits/s % S     % S     us/KB   us/KB

Myri10GE driver-specific LRO:
87380  65536   65536    60.02  9442.65    16.24   69.31   0.282   1.203

Myri10GE + unpatched inet_lro:
87380  65536   65536    60.02  9442.88    20.10   69.11   0.349   1.199

Myri10GE + patched inet_lro:
87380  65536   65536    60.02  9443.30    16.95   68.97   0.294   1.197

The important bit here is the sender's CPU utilization, and service
demand (cost per byte).  As you can see, without aggregating acks,
the overhead on the sender is roughly 20% higher, even when sending
to a receiver which uses LRO.

The differences are even more dramatic when sending to a receiver
which does not use LRO (and hence sends more frequent acks).  Below
is the same benchmark, run between a pair of 4-way 3.0GHz Xeon 5160
machines (Dell 2950) with Myri10GE NICs.  The receiver is running
Solaris 10U4, which does not do LRO, and is acking at approximately
8:1 (or ~100K acks/sec):

Myri10GE driver-specific LRO:
196712 65536   65536    60.01  9280.09    7.14    45.37   0.252   1.602

Myri10GE + unpatched inet_lro:
196712 65536   65536    60.01  8530.80    10.51   44.60   0.404   1.713

Myri10GE + patched inet_lro:
196712 65536   65536    60.00  9249.65    7.21    45.90   0.255   1.626

Signed off by: Andrew Gallatin [EMAIL PROTECTED]

Andrew Gallatin
Myricom Inc.
diff --git a/net/ipv4/inet_lro.c b/net/ipv4/inet_lro.c
index ac3b1d3..eba145b 100644
--- a/net/ipv4/inet_lro.c
+++ b/net/ipv4/inet_lro.c
@@ -58,9 +58,6 @@ static int lro_tcp_ip_check(struct iphdr
 	if (ntohs(iph->tot_len) != len)
 		return -1;

-	if (TCP_PAYLOAD_LENGTH(iph, tcph) == 0)
-		return -1;
-
 	if (iph->ihl != IPH_LEN_WO_OPTIONS)
 		return -1;

@@ -223,6 +220,11 @@ static void lro_add_packet(struct net_lr
 	lro_add_common(lro_desc, iph, tcph, tcp_data_len);

+	if (tcp_data_len == 0) {
+		dev_kfree_skb_any(skb);
+		return;
+	}
+
 	skb_pull(skb, (skb->len - tcp_data_len));
 	parent->truesize += skb->truesize;

@@ -244,6 +246,11 @@ static void lro_add_frags(struct net_lro
 	lro_add_common(lro_desc, iph, tcph, tcp_data_len);

+	if (tcp_data_len == 0) {
+		put_page(skb_frags[0].page);
+		return;
+	}
+
 	skb->truesize += truesize;
 	skb_frags[0].page_offset += hlen;

@@ -338,6 +345,8 @@ static int __lro_proc_skb(struct net_lro
 	struct tcphdr *tcph;
 	u64 flags;
 	int vlan_hdr_len = 0;
+	int pkt_len;
+	int trim;

 	if (!lro_mgr->get_skb_header
 	    || lro_mgr->get_skb_header(skb, (void *)&iph, (void *)&tcph,
@@ -355,6 +364,17 @@ static int __lro_proc_skb(struct net_lro
 	    !test_bit(LRO_F_EXTRACT_VLAN_ID, lro_mgr->features))
 		vlan_hdr_len = VLAN_HLEN;

+	/* strip padding
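[The padding-trim arithmetic the patch is adding (the hunk is truncated above) can be sketched in isolation.  A pure ack is 54 bytes on the wire (14 ethernet + 20 IP + 20 TCP) but arrives padded to the 60-byte ethernet minimum, so the excess must be stripped before aggregation.  The helper name and standalone form are illustrative; the real code works directly on skb fields.]

```c
/* Illustrative trim computation for padded pure-ack frames.
 * lro_pad_trim is a hypothetical helper, not a kernel function. */
#define ETH_HLEN 14  /* ethernet header length, as in <linux/if_ether.h> */

/* ip_tot_len:   host-order IP total length, i.e. ntohs(iph->tot_len)
 * vlan_hdr_len: 4 if a vlan tag is still present in the frame, else 0
 * frame_len:    received frame length (excluding FCS)
 * Returns the number of trailing padding bytes to strip (0 if none). */
static int lro_pad_trim(int ip_tot_len, int vlan_hdr_len, int frame_len)
{
	int pkt_len = ETH_HLEN + vlan_hdr_len + ip_tot_len;
	int trim = frame_len - pkt_len;

	return trim > 0 ? trim : 0;
}
```

For the common case in the patch description, a 40-byte IP datagram (20 IP + 20 TCP, no options) in a 60-byte minimum frame yields a 6-byte trim, while a full-sized data frame needs none.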