Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On Wed, Feb 20, 2013 at 11:59 AM, Lawrence Stewart lstew...@freebsd.org wrote: Hi Sephe, On 02/20/13 13:37, Sepherosa Ziehau wrote: On Wed, Feb 20, 2013 at 9:46 AM, Lawrence Stewart lstew...@room52.net wrote: *crickets chirping* Time to move this discussion forward... If any robust counter-arguments exist, now is the time for us to hear them. I haven't read anything thus far which convinces me that we should not provide knobs to tune our stack's dynamics. In the absence of any compelling counter-arguments, I would like to propose the following:

- We rename the net.inet.tcp.experimental sysctl node introduced in r242266 for IW10 support to net.inet.tcp.nonstandard, and re-parent the initcwnd10 sysctl under this node. I should also add that I think initcwnd10 should be changed to initcwnd and take the number of segments as a value. Yeah, I would suggest the same.

- We introduce a new net.inet.tcp.nonstandard.allowed sysctl variable and default it to 0. Only when it is changed to 1 will we allow starkly non-standards-compliant behaviour to be enabled in the stack. As a more complex but expressive alternative, we can make the sysctl take a bit mask or CSV string which specifies which non-standard options the sysadmin permits (I'd prefer this as we can easily test non-standard options like IW10 in head without blanket enabling all non-standard behaviour). To be clear, my proposal is that specifying an allowed option in net.inet.tcp.nonstandard.allowed would not enable it as the default on all connections, but would allow the per-application mechanism we define to set the option. Setting net.inet.tcp.nonstandard.option_x to 1 would enable the option as default for all connections.

- We introduce a new net.inet.tcp.nonstandard.noidlereset sysctl variable, and use it to enable/disable window-reset-after-idle behaviour as proposed by John.

- We don't introduce a TF_IGNOREIDLE sockopt, and instead introduce a more generic sockopt and/or mechanism for per-application tuning of all options which affect stack dynamics (both standard and non-standard options). I'm open to suggestions on what this could/should look like.

Lawrence, A route metric? BTW, as for IW10, it could also become a route metric (as proposed by the draft author's presentation http://www.ietf.org/proceedings/79/slides/tcpm-0.pdf) Are you suggesting having the ability to set knobs as route metrics in addition to sysctl and a per-app mechanism? If so then I am very much in favour of this. Assuming an option has been allowed in net.inet.tcp.nonstandard.allowed, it should be able to be set by an application or on a route, perhaps with a precedence hierarchy of app request trumps route metric trumps system default setting?

I suggest using route metrics in addition to the global sysctls; route metrics take precedence over global sysctls. I don't object to the per-socket settings though. However, IMHO, these options (IW10 and ignoring idle restart, and probably others) are administrative, so applications probably should not mess with them.

Best Regards, sephe -- Tomorrow Will Never Die
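To make the proposed two-level gating concrete, here is a minimal sketch in C of the check a connection might perform. All names are hypothetical (neither tcp_nonstd_allowed nor the option bits exist in the tree; the bit-mask variant is the one Lawrence says he would prefer):

    /*
     * Hypothetical sketch of the proposed two-level gating.  The admin's
     * allow list (a bit mask here) merely unlocks an option; it becomes
     * active on a connection only if the application asked for it or the
     * per-option sysctl made it the system-wide default.
     */
    #define TCP_NONSTD_NOIDLERESET  0x01    /* skip the idle cwnd reset */
    #define TCP_NONSTD_INITCWND     0x02    /* non-standard initial cwnd */

    static int tcp_nonstd_allowed;  /* net.inet.tcp.nonstandard.allowed */

    static int
    tcp_nonstd_enabled(int opt_bit, int sysctl_default, int app_requested)
    {
            if ((tcp_nonstd_allowed & opt_bit) == 0)
                    return (0);     /* the admin has not unlocked it */
            return (app_requested || sysctl_default);
    }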
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 02/21/13 20:20, Sepherosa Ziehau wrote: On Wed, Feb 20, 2013 at 11:59 AM, Lawrence Stewart lstew...@freebsd.org wrote: Hi Sephe, On 02/20/13 13:37, Sepherosa Ziehau wrote: On Wed, Feb 20, 2013 at 9:46 AM, Lawrence Stewart lstew...@room52.net wrote: *crickets chirping* Time to move this discussion forward... If any robust counter-arguments exist, now is the time for us to hear them. I haven't read anything thus far which convinces me that we should not provide knobs to tune our stack's dynamics. In the absence of any compelling counter-arguments, I would like to propose the following:

- We rename the net.inet.tcp.experimental sysctl node introduced in r242266 for IW10 support to net.inet.tcp.nonstandard, and re-parent the initcwnd10 sysctl under this node. I should also add that I think initcwnd10 should be changed to initcwnd and take the number of segments as a value. Yeah, I would suggest the same.

- We introduce a new net.inet.tcp.nonstandard.allowed sysctl variable and default it to 0. Only when it is changed to 1 will we allow starkly non-standards-compliant behaviour to be enabled in the stack. As a more complex but expressive alternative, we can make the sysctl take a bit mask or CSV string which specifies which non-standard options the sysadmin permits (I'd prefer this as we can easily test non-standard options like IW10 in head without blanket enabling all non-standard behaviour). To be clear, my proposal is that specifying an allowed option in net.inet.tcp.nonstandard.allowed would not enable it as the default on all connections, but would allow the per-application mechanism we define to set the option. Setting net.inet.tcp.nonstandard.option_x to 1 would enable the option as default for all connections.

- We introduce a new net.inet.tcp.nonstandard.noidlereset sysctl variable, and use it to enable/disable window-reset-after-idle behaviour as proposed by John.

- We don't introduce a TF_IGNOREIDLE sockopt, and instead introduce a more generic sockopt and/or mechanism for per-application tuning of all options which affect stack dynamics (both standard and non-standard options). I'm open to suggestions on what this could/should look like.

Lawrence, A route metric? BTW, as for IW10, it could also become a route metric (as proposed by the draft author's presentation http://www.ietf.org/proceedings/79/slides/tcpm-0.pdf) Are you suggesting having the ability to set knobs as route metrics in addition to sysctl and a per-app mechanism? If so then I am very much in favour of this. Assuming an option has been allowed in net.inet.tcp.nonstandard.allowed, it should be able to be set by an application or on a route, perhaps with a precedence hierarchy of app request trumps route metric trumps system default setting?

I suggest using route metrics in addition to the global sysctls; Agreed. route metrics take precedence over global sysctls. Agreed. I don't object to the per-socket settings though. However, IMHO, these options (IW10 and ignoring idle restart, and probably others) are administrative, so applications probably should not mess with them.

Messing with individual options like IW10 on a per-socket basis is definitely in the "generally should not" basket, but I would not want to stop an application from doing so subject to the option being specified by the administrator in the net.inet.tcp.nonstandard.allowed option list. What I expect applications would want to do more frequently is hint the socket with a higher level goal e.g. "I want maximum throughput", "I want low latency", etc.
This can come later though. I think we have enough agreement on the basic infrastructure to move forward at this point with some patches. I would initially like to get the basic sysctl infrastructure to support all this sorted, then look at supporting these options as route metrics, and finally look at the higher level API. Anyone else with further input, please speak up! Cheers, Lawrence
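As a strawman for the precedence hierarchy agreed on above (application request trumps route metric trumps system default), the resolution could look something like the following sketch. The structure and all names are invented for illustration, and how a knob would actually be stored in a route metric is exactly the part still to be designed:

    /* Illustrative only: one tunable knob and where its value may come from. */
    struct tcp_knob {
            int     app_set, app_value;     /* set by the application */
            int     route_set, route_value; /* set via a route metric */
            int     sysctl_default;         /* global sysctl value */
    };

    static int
    tcp_knob_value(const struct tcp_knob *k)
    {
            if (k->app_set)
                    return (k->app_value);          /* highest precedence */
            if (k->route_set)
                    return (k->route_value);
            return (k->sysctl_default);             /* lowest precedence */
    }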
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On Tuesday, February 19, 2013 9:37:54 pm Sepherosa Ziehau wrote: John, I came across this draft several days ago, you may be interested: http://tools.ietf.org/html/draft-ietf-tcpm-newcwv-00 Yes, that is extremely relevant. My application does use its own rate-limiting. And now that I've read this in full, this does seem to very much be what I want and is a better solution than ignoring idle handling entirely. Ironic that this was posted a few weeks after my patch. :) Clearly this is not an isolated workflow. -- John Baldwin
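The drafts John and Sephe reference take a middle road between a full reset and ignoring idle entirely: the congestion window is wound down gradually across the idle period. A rough sketch of the draft-hughes-restart idea (halve cwnd once per RTO of idle time, flooring at the restart window); the helper is hypothetical and simplified from the draft:

    #include <sys/types.h>

    /*
     * Decay the congestion window across an idle period instead of
     * resetting it outright: one halving per RTO elapsed while idle,
     * never going below the restart window.  Illustrative only.
     */
    static uint32_t
    tcp_decayed_cwnd(uint32_t cwnd, uint32_t restart_win, u_int idle_ticks,
        u_int rto_ticks)
    {
            u_int n;

            if (rto_ticks == 0)                     /* defensive */
                    return (restart_win);
            for (n = idle_ticks / rto_ticks; n > 0 && cwnd > restart_win; n--)
                    cwnd /= 2;
            return (cwnd > restart_win ? cwnd : restart_win);
    }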
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On Wed, Feb 20, 2013 at 9:46 AM, Lawrence Stewart lstew...@room52.net wrote: *crickets chirping* Time to move this discussion forward... If any robust counter-arguments exist, now is the time for us to hear them. I haven't read anything thus far which convinces me that we should not provide knobs to tune our stack's dynamics. In the absence of any compelling counter-arguments, I would like to propose the following:

- We rename the net.inet.tcp.experimental sysctl node introduced in r242266 for IW10 support to net.inet.tcp.nonstandard, and re-parent the initcwnd10 sysctl under this node.

- We introduce a new net.inet.tcp.nonstandard.allowed sysctl variable and default it to 0. Only when it is changed to 1 will we allow starkly non-standards-compliant behaviour to be enabled in the stack. As a more complex but expressive alternative, we can make the sysctl take a bit mask or CSV string which specifies which non-standard options the sysadmin permits (I'd prefer this as we can easily test non-standard options like IW10 in head without blanket enabling all non-standard behaviour).

- We introduce a new net.inet.tcp.nonstandard.noidlereset sysctl variable, and use it to enable/disable window-reset-after-idle behaviour as proposed by John.

- We don't introduce a TF_IGNOREIDLE sockopt, and instead introduce a more generic sockopt and/or mechanism for per-application tuning of all options which affect stack dynamics (both standard and non-standard options). I'm open to suggestions on what this could/should look like.

Lawrence, A route metric? BTW, as for IW10, it could also become a route metric (as proposed by the draft author's presentation http://www.ietf.org/proceedings/79/slides/tcpm-0.pdf) John, I came across this draft several days ago, you may be interested: http://tools.ietf.org/html/draft-ietf-tcpm-newcwv-00 This one is a bit old, but it is still interesting to read (cited by the above draft): http://tools.ietf.org/html/draft-hughes-restart-00 Best Regards, sephe -- Tomorrow Will Never Die
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
FYI I've read the whole thread as of this reply and plan to follow up to a few of the other posts separately, but first for my initial thoughts... On 01/23/13 07:11, John Baldwin wrote: As I mentioned in an earlier thread, I recently had to debug an issue we were seeing across a link with a high bandwidth-delay product (both high bandwidth and high RTT). Our specific use case was to use a TCP connection to reliably forward a latency-sensitive datagram stream across a WAN connection. We would often see spikes in the latency of individual datagrams. I eventually tracked this down to the connection entering slow start when it would transmit data after being idle. The data stream was quite bursty and would often attempt to transmit a burst of data after being idle for far longer than a retransmit timeout. Got it. In 7.x we had worked around this in the past by disabling RFC 3390 and jacking the slow start window size up via a sysctl. On 8.x this no longer worked. I can't think of, nor have I read, any convincing argument why we shouldn't support your use case out of the box. You're not the only user of FreeBSD over dedicated lines who knows what you're doing. We should provide some way to support this use case. We're therefore left with the question of how to implement this. As noted in the "Some questions about the new TCP congestion control code" thread [1], it was always my intention to axe the ss_flightsize variables and replace them with a better mechanism. Andre swung the axe before I did and 10.x is looming so it's a good time to discuss all of this. The solution I came up with was to add a new socket option to disable idle handling completely. That is, when an idle connection restarts with this new option enabled, it keeps its current congestion window and doesn't enter slow start. rwatson@ mentioned an idea in private discussion which I've also thought about over the years. The real goal here should be to subsume your use case (and others) into a much richer framework for hinting desired behaviour/tradeoff preferences (some aspects of which relate to parts of my PhD work, which will hopefully be coming to a kernel near you in 2013 ;). My main concern with your patch is that I'm a bit uneasy about enshrining a socket option in a public API and documentation that is so specific. I suspect apps probably want to set higher level goals like low latency *at any cost* and have the stack opaquely interpret that as "this guy is willing to blow his foot off, so let's disable idle window reset, tweak X, disable Y and hand the man his loaded shotgun". TCP_IGNOREIDLE as currently proposed misses this bigger picture, though doesn't preclude it either. I would also echo Kevin/Grenville's thoughts about keying the socket option's activation off a tunable (sysctl or kernel option is up for discussion, though I'd be leaning towards sysctl) that is disabled by default i.e. only skip the after-idle window reset if the app sets the option *and* the sysadmin has pulled the "I like me some bursty network" lever. There are only a few cases where such an option is useful, but if anyone else thinks this might be useful I'd be happy to add the option to FreeBSD. The idea is useful. I'd just like to discuss the implementation specifics a little further before recommending whether the patch should go in as is to provide a stop gap, or we rework the patch to be a little less specific in readiness for the future work I have in mind.
Cheers, Lawrence [1] http://lists.freebsd.org/pipermail/freebsd-net/2013-January/034297.html
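For reference, the interface John's patch adds is a single boolean socket option. An application opting in would look roughly like this; TCP_IGNOREIDLE is defined only by the patched netinet/tcp.h, so the value below is a placeholder to keep the sketch self-contained:

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    #ifndef TCP_IGNOREIDLE
    #define TCP_IGNOREIDLE  100     /* placeholder; the real value comes from the patch */
    #endif

    /* Ask the stack to keep cwnd across idle periods on this connection. */
    int
    enable_ignoreidle(int s)
    {
            int on = 1;

            return (setsockopt(s, IPPROTO_TCP, TCP_IGNOREIDLE,
                &on, sizeof(on)));
    }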
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 13.02.2013 09:25, Lawrence Stewart wrote: FYI I've read the whole thread as of this reply and plan to follow up to a few of the other posts separately, but first for my initial thoughts... On 01/23/13 07:11, John Baldwin wrote: As I mentioned in an earlier thread, I recently had to debug an issue we were seeing across a link with a high bandwidth-delay product (both high bandwidth and high RTT). Our specific use case was to use a TCP connection to reliably forward a latency-sensitive datagram stream across a WAN connection. We would often see spikes in the latency of individual datagrams. I eventually tracked this down to the connection entering slow start when it would transmit data after being idle. The data stream was quite bursty and would often attempt to transmit a burst of data after being idle for far longer than a retransmit timeout. Got it. In 7.x we had worked around this in the past by disabling RFC 3390 and jacking the slow start window size up via a sysctl. On 8.x this no longer worked. I can't think of, nor have I read any convincing argument why we shouldn't support your use case out of the box. You're not the only user of FreeBSD over dedicated lines who knows what you're doing. We should provide some way to support this use case. We're therefore left with the question of how to implement this. As noted in the Some questions about the new TCP congestion control code thread [1], it was always my intention to axe the ss_flightsize variables and replace them with a better mechanism. Andre swung the axe before I did and 10.x is looming so it's a good time to discuss all of this. The solution I came up with was to add a new socket option to disable idle handling completely. That is, when an idle connection restarts with this new option enabled, it keeps its current congestion window and doesn't enter slow start. rwatson@ mentioned an idea in private discussion which I've also thought about over the years. The real goal here should be to subsume your use case (and others) into a much richer framework for hinting desired behaviour/tradeoff preferences (some aspects of which relate to parts of my PhD work, which will hopefully be coming to a kernel near you in 2013 ;). My main concern with your patch is that I'm a bit uneasy about enshrining a socket option in a public API and documentation that is so specific. I suspect apps probably want to set higher level goals like low latency *at any cost* and have the stack opaquely interpret that as this guy is willing to blow his foot off, so let's disable idle window reset, tweak X, disable Y and hand the man his loaded shotgun. TCP_IGNOREIDLE as currently proposed misses this bigger picture, though doesn't preclude it either. I would also echo Kevin/Grenville's thoughts about keying the socket option's activation off a tunable (sysctl or kernel option is up for discussion, though I'd be leaning towards sysctl) that is disabled by default i.e. only skip after idle window reset if the app sets the option *and* the sysadmin has pulled the I like me some bursty network lever. There are only a few cases where such an option is useful, but if anyone else thinks this might be useful I'd be happy to add the option to FreeBSD. The idea is useful. I'd just like to discuss the implementation specifics a little further before recommending whether the patch should go in as is to provide a stop gap, or we rework the patch to be a little less specific in readiness for the future work I have in mind. 
Again I'd like to point out that this sort of modification should be implemented as a congestion control module. All the hook points are already there and can readily be used instead of adding more special cases to the generic part of TCP. The CC algorithm can be selected per socket. For such a special CC module it'd get a nice fat warning that it is not suitable for Internet use. Additionally I speculate that for the use-case of John he may also be willing to forgo congestion avoidance and always operate in (ill-named) slow start mode. With a special CC module this can easily be tweaked. -- Andre
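For the record, this is the shape of what Andre is suggesting, sketched against the cc(9) framework's hook points. It is illustrative, not a vetted module: the handler bodies are elided, and a real module would have to supply NewReno-equivalent behaviour for the remaining hooks:

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/module.h>
    #include <netinet/cc.h>
    #include <netinet/cc/cc_module.h>

    /*
     * A CC module that never resets cwnd after idle: the after_idle hook
     * is simply left unset, so the idle-restart path in tcp_output() has
     * nothing to call.  The ack_received/cong_signal/post_recovery
     * handlers (elided here) would provide the normal NewReno-like
     * dynamics.
     */
    static struct cc_algo noidle_cc_algo = {
            .name = "noidle",
            /* .after_idle deliberately omitted */
    };

    DECLARE_CC_MODULE(noidle, &noidle_cc_algo);

A socket would then opt in with the TCP_CONGESTION socket option (selecting "noidle"), which is the per-socket selection mechanism Andre mentions.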
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 02/08/13 07:04, George Neville-Neil wrote: On Feb 6, 2013, at 12:28, Alfred Perlstein bri...@mu.org wrote: On 2/6/13 4:46 AM, John Baldwin wrote: On Wednesday, February 06, 2013 6:27:04 am Randall Stewart wrote: John: A burst at line rate will *often* cause drops. This is because router queues are at a finite size. Also such a burst (especially on a long delay bandwidth network) causes your RTT to increase even if there is no drop, which is going to hurt you as well. A SHOULD in an RFC says you really really really really need to do it unless there is something that makes you willing to override it. It is slight wiggle room. In this I agree with Andre, we should not be *not* doing it. Otherwise folks will be turning this on and it is plain wrong. It may be fine for your network but I would not want to see it in FreeBSD. In my testing here at home I have put back into our stack max-burst. This uses Mark Allman's version (not Kacheong Poon's) where you clamp the cwnd at no more than 4 packets larger than your flight. All of my testing, high-bw-delay or LAN, has shown this to improve TCP performance. This is because it helps you avoid bursting out so many packets that you overflow a queue. In your long-delay bw link if you do burst out too many (and you never know how many that is since you can not predict how full all those MPLS queues are or how big they are) you will really hurt yourself even worse. Note that generally in Cisco routers the default queue size is somewhere between 100-300 packets depending on the router. Due to the way our application works this never happens, but I am fine with just keeping this patch private. If there are other shops that need this they can always dig the patch up from the archives. This is yet another time when I'm sad about how things happen in FreeBSD. A developer comes forward with a non-default option that's very useful for some specific workloads, specifically one that contributes much time and $$$ to the project, and the community rejects the patches even though it's been successful in other OSes. It makes zero sense. John, can you repost the patch? Maybe there is a way to refactor this somehow so it's like accept filters where we can plug in a hook for TCP? I am very disappointed, but not surprised. I take away the complete opposite feeling. This is how we work through these issues. It's clear from the discussion that this need not be a default in the system, and is a special case. We had a reasoned discussion of what would be best to do and at least two experts in TCP weighed in on the effect this change might have. Not everything proposed by a developer need go into the tree, in particular since these discussions are archived we can always revisit this later. This is exactly how collaborative development should look, whether or not the patch is integrated now, next week, next year, or ever. +1 Whilst I would argue that some red herrings have been put forward in this thread, its progression is far from disappointing IMO. This is a sensitive area that requires careful scrutiny, independent of what our peers working on other OSes have decided is best for them. Cheers, Lawrence
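For clarity, the max-burst scheme Randall describes (Mark Allman's variant) amounts to never letting the usable window exceed the current flight size by more than four segments. A hypothetical helper, not code from the tree:

    #include <sys/types.h>

    /*
     * Allman-style max-burst: clamp the window actually used for sending
     * to at most four segments beyond what is already in flight.
     */
    static uint32_t
    tcp_maxburst_clamp(uint32_t cwnd, uint32_t flight, uint32_t maxseg)
    {
            uint32_t limit = flight + 4 * maxseg;

            return (cwnd < limit ? cwnd : limit);
    }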
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 02/10/13 16:05, Kevin Oberman wrote: On Sat, Feb 9, 2013 at 6:41 AM, Alfred Perlstein bri...@mu.org wrote: On 2/7/13 12:04 PM, George Neville-Neil wrote: On Feb 6, 2013, at 12:28, Alfred Perlstein bri...@mu.org wrote: On 2/6/13 4:46 AM, John Baldwin wrote: On Wednesday, February 06, 2013 6:27:04 am Randall Stewart wrote: John: A burst at line rate will *often* cause drops. This is because router queues are at a finite size. Also such a burst (especially on a long delay bandwidth network) causes your RTT to increase even if there is no drop, which is going to hurt you as well. A SHOULD in an RFC says you really really really really need to do it unless there is something that makes you willing to override it. It is slight wiggle room. In this I agree with Andre, we should not be *not* doing it. Otherwise folks will be turning this on and it is plain wrong. It may be fine for your network but I would not want to see it in FreeBSD. In my testing here at home I have put back into our stack max-burst. This uses Mark Allman's version (not Kacheong Poon's) where you clamp the cwnd at no more than 4 packets larger than your flight. All of my testing, high-bw-delay or LAN, has shown this to improve TCP performance. This is because it helps you avoid bursting out so many packets that you overflow a queue. In your long-delay bw link if you do burst out too many (and you never know how many that is since you can not predict how full all those MPLS queues are or how big they are) you will really hurt yourself even worse. Note that generally in Cisco routers the default queue size is somewhere between 100-300 packets depending on the router. Due to the way our application works this never happens, but I am fine with just keeping this patch private. If there are other shops that need this they can always dig the patch up from the archives. This is yet another time when I'm sad about how things happen in FreeBSD. A developer comes forward with a non-default option that's very useful for some specific workloads, specifically one that contributes much time and $$$ to the project, and the community rejects the patches even though it's been successful in other OSes. It makes zero sense. John, can you repost the patch? Maybe there is a way to refactor this somehow so it's like accept filters where we can plug in a hook for TCP? I am very disappointed, but not surprised. I take away the complete opposite feeling. This is how we work through these issues. It's clear from the discussion that this need not be a default in the system, and is a special case. We had a reasoned discussion of what would be best to do and at least two experts in TCP weighed in on the effect this change might have. Not everything proposed by a developer need go into the tree, in particular since these discussions are archived we can always revisit this later. This is exactly how collaborative development should look, whether or not the patch is integrated now, next week, next year, or ever. I agree that discussion is great, we have all learned quite a bit from it, about TCP and the dangers of adjusting buffering without considerable thought. I would not be involved in FreeBSD had this type of discussion and information not been discussed on the lists so readily. However, the end result must be far different than what has occurred so far. If the code was deemed unacceptable for general inclusion, then we must find a way to provide a light framework to accomplish the needs of the community member.
Take for instance someone who is starting a company that needs this facility. Which OS will they choose? One who has integrated a useful feature? Or one who has rejected it and left that code in the mailing list archives? As much as expert opinion is valuable, it must include understanding and need of handling special cases and the ability to facilitate those special cases for our users and developers. This is a subject rather near to my heart, having fought battles with congestion back in the dark days of Windows when it essentially defaulted to TCP_IGNOREIDLE. It was a huge pain, but it was the only way Windows did TCP in the early days. It simply did not implement slow-start. This was really evil, but in the days when lots of links were 56K and T-1 was mostly used for network core links, the Internet, small as it was back then, did not melt, though it glowed a frightening shade of red fairly often. Today too many systems running like this would melt things very quickly. OTOH, I can certainly see cases, like John's, where it would be very beneficial. And, yes, Linux has it. (I don't see this as relevant in any way except as proof that not enough people have turned it on to cause serious problems... yet!) It seems a shame to make everyone who really has a need develop their own patches or dig through old mail to find John's. What I would like to see is a way to have it available, but make it unlikely to be enabled except in a way that would put up flashing red warnings and sound sirens to warn people that it is very dangerous and can be a way to blow off a few of one's own toes.
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 02/13/13 21:27, Andre Oppermann wrote: On 13.02.2013 09:25, Lawrence Stewart wrote: FYI I've read the whole thread as of this reply and plan to follow up to a few of the other posts separately, but first for my initial thoughts... On 01/23/13 07:11, John Baldwin wrote: As I mentioned in an earlier thread, I recently had to debug an issue we were seeing across a link with a high bandwidth-delay product (both high bandwidth and high RTT). Our specific use case was to use a TCP connection to reliably forward a latency-sensitive datagram stream across a WAN connection. We would often see spikes in the latency of individual datagrams. I eventually tracked this down to the connection entering slow start when it would transmit data after being idle. The data stream was quite bursty and would often attempt to transmit a burst of data after being idle for far longer than a retransmit timeout. Got it. In 7.x we had worked around this in the past by disabling RFC 3390 and jacking the slow start window size up via a sysctl. On 8.x this no longer worked. I can't think of, nor have I read any convincing argument why we shouldn't support your use case out of the box. You're not the only user of FreeBSD over dedicated lines who knows what you're doing. We should provide some way to support this use case. We're therefore left with the question of how to implement this. As noted in the Some questions about the new TCP congestion control code thread [1], it was always my intention to axe the ss_flightsize variables and replace them with a better mechanism. Andre swung the axe before I did and 10.x is looming so it's a good time to discuss all of this. The solution I came up with was to add a new socket option to disable idle handling completely. That is, when an idle connection restarts with this new option enabled, it keeps its current congestion window and doesn't enter slow start. rwatson@ mentioned an idea in private discussion which I've also thought about over the years. The real goal here should be to subsume your use case (and others) into a much richer framework for hinting desired behaviour/tradeoff preferences (some aspects of which relate to parts of my PhD work, which will hopefully be coming to a kernel near you in 2013 ;). My main concern with your patch is that I'm a bit uneasy about enshrining a socket option in a public API and documentation that is so specific. I suspect apps probably want to set higher level goals like low latency *at any cost* and have the stack opaquely interpret that as this guy is willing to blow his foot off, so let's disable idle window reset, tweak X, disable Y and hand the man his loaded shotgun. TCP_IGNOREIDLE as currently proposed misses this bigger picture, though doesn't preclude it either. I would also echo Kevin/Grenville's thoughts about keying the socket option's activation off a tunable (sysctl or kernel option is up for discussion, though I'd be leaning towards sysctl) that is disabled by default i.e. only skip after idle window reset if the app sets the option *and* the sysadmin has pulled the I like me some bursty network lever. There are only a few cases where such an option is useful, but if anyone else thinks this might be useful I'd be happy to add the option to FreeBSD. The idea is useful. I'd just like to discuss the implementation specifics a little further before recommending whether the patch should go in as is to provide a stop gap, or we rework the patch to be a little less specific in readiness for the future work I have in mind. 
Again I'd like to point out that this sort of modification should be implemented as a congestion control module. All the hook points are already there and can readily be used instead of adding more special cases to the generic part of TCP. The CC algorithm can be selected per socket. For such a special CC module it'd get a nice fat warning that it is not suitable for Internet use. As a local hack, sure, a CC module would do the job assuming you were happy to use a single algorithm as the base. John's patch transcends the algorithm in use on a particular connection, so it has wider applicability than a CC module. I would also strongly oppose the inclusion of such a module in FreeBSD proper - it's the wrong way to implement the functionality. The patch as posted is technically appropriate, though I'm interested in discussing whether the public API should be tweaked to capture higher level goals instead e.g. "low delay at all costs" or "maximum throughput". We could initially map "low delay at all costs" to a TCP stack meaning of "disable idle window reset" and expand the meaning later (e.g. relaxing the silly window checks as briefly discussed in the other thread). Additionally I speculate that for the use-case of John he may also be willing to forgo congestion avoidance and always operate in (ill-named) slow start mode. With a special CC module this can easily be tweaked. John already has the functionality he needs in this local tree - this discussion is no longer about John per se, but rather about other people who may want the functionality John has implemented. We need to figure out how to provide the functionality in FreeBSD proper, and a CC module is not the answer. Cheers, Lawrence
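One possible shape for the higher-level API floated above, with every name invented for illustration: the application states a goal, and the stack translates it into whichever individual behaviours policy permits:

    /*
     * Hypothetical goal-based hint: TCP_PROFILE_* and TF_NOIDLERESET do
     * not exist anywhere; they only illustrate the mapping described in
     * the text ("low delay at all costs" initially means just "disable
     * the idle window reset", with room to grow).
     */
    #define TCP_PROFILE_DEFAULT     0
    #define TCP_PROFILE_LOWDELAY    1       /* "low delay at all costs" */
    #define TCP_PROFILE_THROUGHPUT  2       /* "maximum throughput" */

    #define TF_NOIDLERESET          0x01    /* skip idle window reset */

    static void
    tcp_apply_profile(int profile, int *t_flags)
    {
            switch (profile) {
            case TCP_PROFILE_LOWDELAY:
                    *t_flags |= TF_NOIDLERESET;
                    /* later: relax silly window avoidance, etc. */
                    break;
            default:
                    break;
            }
    }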
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 13.02.2013 15:26, Lawrence Stewart wrote: On 02/13/13 21:27, Andre Oppermann wrote: On 13.02.2013 09:25, Lawrence Stewart wrote: The idea is useful. I'd just like to discuss the implementation specifics a little further before recommending whether the patch should go in as is to provide a stop gap, or we rework the patch to be a little less specific in readiness for the future work I have in mind. Again I'd like to point out that this sort of modification should be implemented as a congestion control module. All the hook points are already there and can readily be used instead of adding more special cases to the generic part of TCP. The CC algorithm can be selected per socket. For such a special CC module it'd get a nice fat warning that it is not suitable for Internet use. As a local hack, sure, a CC module would do the job assuming you were happy to use a single algorithm as the base. John's patch transcends the algorithm in use on a particular connection, so it has wider applicability than a CC module. The algorithm is becoming somewhat meaningless when your goal is to have an open pipe and push data as fast as possible without regard to other traffic. NewReno, Cubic and what have you are becoming meaningless. I would also strongly oppose the inclusion of such a module in FreeBSD proper - it's the wrong way to implement the functionality. The patch as posted is technically appropriate, though I'm interested in discussing whether the public API should be tweaked to capture higher level goals instead e.g. "low delay at all costs" or "maximum throughput". I strongly disagree. The patch is a hack. From the description John gave on his use-case I read that he would actually take more than just ignoring idle-cwnd-reset. And actually if I were in his situation I would use a very aggressive congestion control algorithm doing away with more than idle-cwnd-reset. We could initially map "low delay at all costs" to a TCP stack meaning of "disable idle window reset" and expand the meaning later (e.g. relaxing the silly window checks as briefly discussed in the other thread). Ugh, if you go that far fork it, obtain a fresh protocol number and don't call it TCP anymore. Additionally I speculate that for the use-case of John he may also be willing to forgo congestion avoidance and always operate in (ill-named) slow start mode. With a special CC module this can easily be tweaked. John already has the functionality he needs in this local tree - this discussion is no longer about John per se, but rather about other people who may want the functionality John has implemented. That's what I'm worried most about. So far no real other people have spoken out, only cheering from the sidelines. We need to figure out how to provide the functionality in FreeBSD proper, and a CC module is not the answer. I totally disagree. This functionality (removal) is not at all a part of TCP and should not be supported directly. -- Andre
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 13 February 2013 02:27, Andre Oppermann an...@freebsd.org wrote: Again I'd like to point out that this sort of modification should be implemented as a congestion control module. All the hook points are already there and can readily be used instead of adding more special cases to the generic part of TCP. The CC algorithm can be selected per socket. For such a special CC module it'd get a nice fat warning that it is not suitable for Internet use. Additionally I speculate that for the use-case of John he may also be willing to forgo congestion avoidance and always operate in (ill-named) slow start mode. With a special CC module this can easily be tweaked. There are some cute things that could be done here - eg, having an L3 route table entry map to a congestion control (like having an MSS in the L3 entry too.) But I'd love to see some modelling / data showing competing congestion control algorithms on the same set of congested pipes. Doubly so on multiple congested pipes (ie, modelling a handful of parallel user-last-mile-IX-various transit feeds with different levels of congestion/RTT-IX-last mile-user connections.) You all know much more about this than I do. :-) Thanks, Adrian
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
.. and I should say, competing / parallel congestion algorithms. Ie - how multiple CC's work for/against each other on the same internet at the same time. Adrian
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 02/14/13 01:48, Andre Oppermann wrote: On 13.02.2013 15:26, Lawrence Stewart wrote: On 02/13/13 21:27, Andre Oppermann wrote: On 13.02.2013 09:25, Lawrence Stewart wrote: The idea is useful. I'd just like to discuss the implementation specifics a little further before recommending whether the patch should go in as is to provide a stop gap, or we rework the patch to be a little less specific in readiness for the future work I have in mind. Again I'd like to point out that this sort of modification should be implemented as a congestion control module. All the hook points are already there and can readily be used instead of adding more special cases to the generic part of TCP. The CC algorithm can be selected per socket. For such a special CC module it'd get a nice fat warning that it is not suitable for Internet use. As a local hack, sure, a CC module would do the job assuming you were happy to use a single algorithm as the base. John's patch transcends the algorithm in use on a particular connection, so it has wider applicability than a CC module. The algorithm is becoming somewhat meaningless when your goal is to have an open pipe and push data as fast as possible without regard to other traffic. NewReno, Cubic and what have you are becoming meaningless. But that's not the goal. We're not discussing unbounded or unreactive congestion windows. If a burst causes drops, we still back off. The algorithm does still matter. I would also strongly oppose the inclusion of such a module in FreeBSD proper - it's the wrong way to implement the functionality. The patch as posted is technically appropriate, though I'm interested in discussing whether the public API should be tweaked to capture higher level goals instead e.g. "low delay at all costs" or "maximum throughput". I strongly disagree. The patch is a hack. I agree it's hacky in its current form, but for different reasons to you as outlined in my previous email. You are arguing that idle window resetting is an intrinsic and non-negotiable part of TCP. This is demonstrably not true. As long as something doesn't change the wire format, then it is fair game for being tunable. How we make something tunable and what we set as defaults are where we need to be conservative. From the description John gave on his use-case I read that he would actually take more than just ignoring idle-cwnd-reset. And actually if I were in his situation I would use a very aggressive congestion control algorithm doing away with more than idle-cwnd-reset. Congestion control is only one aspect of what we're discussing. We could initially map "low delay at all costs" to a TCP stack meaning of "disable idle window reset" and expand the meaning later (e.g. relaxing the silly window checks as briefly discussed in the other thread). Ugh, if you go that far fork it, obtain a fresh protocol number and don't call it TCP anymore. You're channelling Joe Touch ;) What exactly is TCP? As far as interop is concerned, it's just a wire protocol - as long as I format my headers/segments correctly and ignore options I don't understand, I can communicate with other TCP stacks, many of which implement a different set of TCP features and options. The dynamics of the protocol have evolved significantly over time and continue to do so because of its ubiquity - it flows freely across the public internet and gets used for all manner of things it wasn't initially designed to handle (well). A lot of the dynamics are also controlled by optional parameters. So no, we don't need a new protocol number.
We need to provide knobs that allow people to tune TCP dynamics to their particular use case. Additionally I speculate that for the use-case of John he may also be willing to forgo congestion avoidance and always operate in (ill-named) slow start mode. With a special CC module this can easily be tweaked. John already has the functionality he needs in this local tree - this discussion is no longer about John per se, but rather about other people who may want the functionality John has implemented. That's what I'm worried most about. So far no real other people have spoken out, only cheering from the sidelines. We surely don't need them to speak out explicitly - the use case is not obscure enough that I am having difficulty imagining other places it would be useful. We need to figure out how to provide the functionality in FreeBSD proper, and a CC module is not the answer. I totally disagree. This functionality (removal) is not at all a part of TCP and should not be supported directly. I don't understand how you can argue that idle window resetting is an intrinsic and non-negotiable part of TCP. There is no one true set of options and features that is TCP. It is not only one idea. Let's work on providing a rich set of knobs to tune every aspect of our TCP stack's dynamics and operation that don't break wire format, set conservative defaults
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 02/14/13 05:37, Adrian Chadd wrote: On 13 February 2013 02:27, Andre Oppermann an...@freebsd.org wrote: Again I'd like to point out that this sort of modification should be implemented as a congestion control module. All the hook points are already there and can readily be used instead of adding more special cases to the generic part of TCP. The CC algorithm can be selected per socket. For such a special CC module it'd get a nice fat warning that it is not suitable for Internet use. Additionally I speculate that for the use-case of John he may also be willing to forgo congestion avoidance and always operate in (ill-named) slow start mode. With a special CC module this can easily be tweaked. There are some cute things that could be done here - eg, having an L3 route table entry map to a congestion control (like having an MSS in the L3 entry too.) This is an area I've thought about and would form the basis for an interesting applied research project. On a related tangent, we (CAIA) also have some ongoing research looking at using different CC algorithms per subflow of a multipath TCP connection. But I'd love to see some modelling / data showing competing congestion control algorithms on the same set of congested pipes. Doubly so on multiple congested pipes (ie, modelling a handful of parallel user-last-mile-IX-various transit feeds with different levels of congestion/RTT-IX-last mile-user connections.) You all know much more about this than I do. :-) There is quite a bit of relevant literature out there. You could start with some of the stuff CAIA has had a hand in (e.g. [1]) and follow the citation trail from there... Cheers, Lawrence [1] http://caia.swin.edu.au/urp/newtcp/papers.html
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 2/11/13 3:18 PM, Andre Oppermann wrote: Smaller RTO (1s) has become an RFC so there was very broad consensus in TCPM that it is a good thing. We don't have it yet because we were not fully compliant in one case (loss of first segment). I've fixed that a while back and will bring 1s RTO soon to HEAD. They use 300ms at least for me/my link/ISP/etc. -- Andrey Zonov
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 12.02.2013 11:55, Andrey Zonov wrote: On 2/11/13 3:18 PM, Andre Oppermann wrote: Smaller RTO (1s) has become an RFC so there was very broad consensus in TCPM that it is a good thing. We don't have it yet because we were not fully compliant in one case (loss of first segment). I've fixed that a while back and will bring 1s RTO soon to HEAD. They use 300ms at least for me/my link/ISP/etc. Let me be more precise: An initial RTO of 1s was published as an RFC. This is what I'm referring to. It affects the setup phase of a connection. A separate issue is the minimum RTO during a connection. According to the RFC the RTO during the lifetime of the connection should also not be less than 1s. The RTO is determined based on the RTT measurement done using timestamps or Karn's algorithm. However on fast links this has been shown to be too long to wait for. So FreeBSD decreased the allowed lower bound to hz/33. This is only effective if your RTO was actually calculated to be equal or lower than that. The result is a quicker re-probing and discovery of the current line conditions. Since the RTO was measured to be less than or equal to hz/33, the possible negative downside is very limited. -- Andre
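To put numbers on the clamping Andre describes: the computed RTO is bounded below by hz/33 ticks (about 30 ms at hz=1000) instead of the RFC's 1 second, and above by the usual 64 second ceiling. A simplified sketch; the tree expresses this with the TCPT_RANGESET() macro and the TCPTV_MIN/TCPTV_REXMTMAX constants rather than a helper like this:

    extern int hz;                          /* kernel tick rate */

    /* Clamp a computed RTO (in ticks) to FreeBSD's bounds.  Sketch only. */
    static int
    tcp_clamped_rto(int computed_rto)
    {
            int rto_min = hz / 33;          /* cf. TCPTV_MIN, ~30ms */
            int rto_max = 64 * hz;          /* cf. TCPTV_REXMTMAX */

            if (computed_rto < rto_min)
                    return (rto_min);
            if (computed_rto > rto_max)
                    return (rto_max);
            return (computed_rto);
    }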
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 11.02.2013 19:56, Adrian Chadd wrote: On 11 February 2013 03:18, Andre Oppermann an...@freebsd.org wrote: In general Google does provide quite a bit of data with their experiments showing that it isn't harmful and that it helps the case. Smaller RTO (1s) has become an RFC so there was very broad consensus in TCPM that it is a good thing. We don't have it yet because we were not fully compliant in one case (loss of first segment). I've fixed that a while back and will bring 1s RTO soon to HEAD. I'm pretty sure that Google doesn't ignore idle on their Internet facing servers. They may have proposed a decay mechanism in the past. I'd have to check the TCPM archives for that. Argh, the "if Google does it, it must be fine" argument. Please. You removed what I was replying to. There is no doubt IW10 originated from Google. However Google took it to TCPM and provided measurement data with it. After some back and forth they provided more data which began to convince more people on TCPM. Eventually the proposal was adopted as an official TCPM working group draft and likely will become an RFC later this year. If you want to argue against RTO1s (RFC6298) then the lead authors are from ICSI/UC Berkeley. Google did participate in that one by providing additional measurement data. Does Google publish the data for these experiments with the international and local links broken down? Yes. Have you followed the evolution and discussion of IW10 on TCPM? Google run a highly distributed infrastructure (this isn't news for anyone, I know) and thus the link distance, RTT, number of hops, etc. may not accurately reflect the internet. It may accurately reflect the internet from the perspective of being roughly within the same city or state in a lot of cases. The TCP congestion algorithms aren't just for avoiding congestion over a peering fabric and last-mile ISP infrastructure. IW10 is not a congestion control algorithm. It is a change to the initial state of it at the beginning of a connection when not much other data is available. Many years ago the same thing happened with RFC3390 which increased the IW to 3 segments. The effects of tweaking congestion algorithms for delivery over a local peering infrastructure where you try to run things as un-congested as possible (where congestion is now "The ISP's Problem") where you maintain tight control over as much of the network infrastructure as you can are likely going to be very different to the congestion algorithm behaviour needed for some end-node speaking to a variety of end-nodes over a longer, more varying set of international links. You know, what TCP congestion algorithms are also trying to play fair with. I agree but not relevant to this case. Please - as much as I applaud Google for what they do, please don't generalise their results to the greater internet without looking at the many caveats/assumptions. Well, that's exactly what I'm trying to do here. Except not only for ideas sourced from Google but also other places. Like "it's in Linux and the Internet hasn't broken down yet", without any measurement data whatsoever. -- Andre
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On Feb 10, 2013, at 11:36, Andrey Zonov z...@freebsd.org wrote: Google made many many TCP tweaks. Increased initial window, small RTO, enabled ignore after idle and others. They published that, other people just blindly applied these tunings and the Internet still works. MANY people are experimenting with the changes Google is proposing, in order to evaluate if and how well they work. Sure, some folks may blindly apply them, but please don't generalize. Lars
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 09.02.2013 15:41, Alfred Perlstein wrote: However, the end result must be far different than what has occurred so far. If the code was deemed unacceptable for general inclusion, then we must find a way to provide a light framework to accomplish the needs of the community member. We've got pluggable congestion control modules thanks to lstewart. You can implement any non-standard congestion control method by adding your own module. They can be compiled into the kernel or loaded as KLD. I consider implementing this as a CC module the correct approach instead of adding yet another sysctl. Doing a CC module like this is very easy. -- Andre
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 05.02.2013 22:40, John Baldwin wrote: On Tuesday, February 05, 2013 12:44:27 pm Andre Oppermann wrote: I would prefer to encapsulate it into its own not-so-much-congestion-management algorithm so you can eventually do other tweaks as well like more aggressive loss recovery which would fit your objective as well. Since you have to modify your app anyways to do the sockopt call this seems a more complete solution to me. At least better than to do a non-portable hack that violates one of the most fundamental TCP concepts. This is real rich from the guy pushing the increased IW that came from Linux. :) IW10 came from Google and obviously was implemented in Linux first because that is what they use. However, and this is the big difference, they also provided significant real-world data on the effects of their changes. TCPM was very skeptical at first but the data from the experiments has convinced many that it is not harmful first and actually beneficial second. Tools not policy yadda yadda, but I digress. -- Andre
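For concreteness, the IW10 proposal (draft-ietf-tcpm-initcwnd) sets the initial window to min(10*MSS, max(2*MSS, 14600)) bytes, versus RFC 3390's min(4*MSS, max(2*MSS, 4380)). A small helper to show the arithmetic:

    #include <sys/types.h>

    /*
     * Initial window in bytes per the IW10 draft.  With a typical MSS of
     * 1460 this comes to 14600 bytes, i.e. ten full-sized segments.
     */
    static uint32_t
    tcp_iw10_bytes(uint32_t mss)
    {
            uint32_t iw = 10 * mss;
            uint32_t cap = (2 * mss > 14600) ? 2 * mss : 14600;

            return (iw < cap ? iw : cap);
    }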
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 10.02.2013 11:36, Andrey Zonov wrote: On 2/10/13 9:05 AM, Kevin Oberman wrote: This is a subject rather near to my heart, having fought battles with congestion back in the dark days of Windows when it essentially defaulted to TCP_IGNOREIDLE. It was a huge pain, but it was the only way Windows did TCP in the early days. It simply did not implement slow-start. This was really evil, but in the days when lots of links were 56K and T-1 was mostly used for network core links, the Internet, small as it was back then, did not melt, though it glowed a frightening shade of red fairly often. Today too many systems running like this would melt things very quickly. Google made many many TCP tweaks. Increased initial window, small RTO, enabled ignore after idle and others. They published that, other people just blindly applied these tunings and the Internet still works. In general Google does provide quite a bit of data with their experiments showing that it isn't harmful and that it helps the case. Smaller RTO (1s) has become an RFC so there was very broad consensus in TCPM that it is a good thing. We don't have it yet because we were not fully compliant in one case (loss of first segment). I've fixed that a while back and will bring 1s RTO soon to HEAD. I'm pretty sure that Google doesn't ignore idle on their Internet facing servers. They may have proposed a decay mechanism in the past. I'd have to check the TCPM archives for that. -- Andre
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 11 February 2013 03:18, Andre Oppermann an...@freebsd.org wrote: In general Google does provide quite a bit of data with their experiments showing that it isn't harmful and that it helps the case. Smaller RTO (1s) has become an RFC so there was very broad consensus in TCPM that it is a good thing. We don't have it yet because we were not fully compliant in one case (loss of first segment). I've fixed that a while back and will bring 1s RTO soon to HEAD. I'm pretty sure that Google doesn't ignore idle on their Internet facing servers. They may have proposed a decay mechanism in the past. I'd have to check the TCPM archives for that. Argh, the "if Google does it, it must be fine" argument. Does Google publish the data for these experiments with the international and local links broken down? Google run a highly distributed infrastructure (this isn't news for anyone, I know) and thus the link distance, RTT, number of hops, etc. may not accurately reflect the internet. It may accurately reflect the internet from the perspective of being roughly within the same city or state in a lot of cases. The TCP congestion algorithms aren't just for avoiding congestion over a peering fabric and last-mile ISP infrastructure. The effects of tweaking congestion algorithms for delivery over a local peering infrastructure where you try to run things as un-congested as possible (where congestion is now "The ISP's Problem") where you maintain tight control over as much of the network infrastructure as you can are likely going to be very different to the congestion algorithm behaviour needed for some end-node speaking to a variety of end-nodes over a longer, more varying set of international links. You know, what TCP congestion algorithms are also trying to play fair with. Please - as much as I applaud Google for what they do, please don't generalise their results to the greater internet without looking at the many caveats/assumptions. Adrian
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 2/11/13 3:10 AM, Andre Oppermann wrote: On 09.02.2013 15:41, Alfred Perlstein wrote: However, the end result must be far different than what has occurred so far. If the code was deemed unacceptable for general inclusion, then we must find a way to provide a light framework to accomplish the needs of the community member. We've got pluggable congestion control modules thanks to lstewart. You can implement any non-standard congestion control method by adding your own module. They can be compiled into the kernel or loaded as KLD. I consider implementing this as a CC module the correct approach instead of adding yet another sysctl. Doing a CC module like this is very easy. That sounds like a win. -Alfred
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 02/10/2013 18:30, Eggert, Lars wrote: On Feb 10, 2013, at 6:05, Kevin Oberman kob6...@gmail.com wrote: One idea that popped into my head (and may be completely ridiculous) is to make its availability dependent on a kernel option and have a warning in NOTES about it contravening normal and accepted practice and that it can cause serious problems both for yourself and for others using the network. Also, if it gets merged, don't call it TCP_IGNOREIDLE. Call it TCP_BLAST_DANGEROUSLY_AFTER_IDLE. TCP_AVALANCHE cheers, gja
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
I'm somewhat sympathetic to the purity of TCP. Nevertheless... On 02/10/2013 16:05, Kevin Oberman wrote: [..] What I would like to see is a way to have it available, but make it unlikely to be enabled except in a way that would put up flashing red warnings and sound sirens to warn people that it is very dangerous and can be a way to blow off a few of one's own toes. +1 I rather doubt the Internet will be crushed by adding a non-default option that allows FreeBSD TCP to behave More Aggressively Than It Really Should(tm) under certain circumstances. I'm certainly not denying that the sky would likely fall if everyone turned on John's proposed socket option all the time. (Such might also be said of allowing UDP applications to be free of any CC at all, or allowing new TCP CC algorithms that deviate from the prevalent norm.) But I think that FreeBSD benefits from adding more special-case knobs for the cognoscenti to twiddle, on the basis that most end-users won't bother. One idea that popped into my head (and may be completely ridiculous) is to make its availability dependent on a kernel option and have a warning in NOTES about it contravening normal and accepted practice and that it can cause serious problems both for yourself and for others using the network. Perhaps also require a sysctl to be set before John's per-socket TCP_IGNOREIDLE option has any effect. (Thus requiring a sending host's administrator to at least be complicit in enabling any subsequent ruination of their nearest bottleneck.) cheers, gja ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
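[Editor's note: the sysctl gate suggested above was never spelled out in the thread. The fragment below is a minimal sketch of what it might look like, written in the style of the era's netinet code; the sysctl name, the V_tcp_allow_ignoreidle variable, and the TF_IGNOREIDLE flag are all hypothetical, and this is illustrative pseudo-code rather than committed FreeBSD source.]

    /* Hypothetical gate: the administrator must opt in before the sockopt works. */
    static VNET_DEFINE(int, tcp_allow_ignoreidle) = 0;
    #define V_tcp_allow_ignoreidle  VNET(tcp_allow_ignoreidle)
    SYSCTL_VNET_INT(_net_inet_tcp, OID_AUTO, allow_ignoreidle, CTLFLAG_RW,
        &VNET_NAME(tcp_allow_ignoreidle), 0,
        "Allow applications to set TCP_IGNOREIDLE");

    /* In the option-handling switch of tcp_ctloutput(): */
            case TCP_IGNOREIDLE:
                    if (optval && !V_tcp_allow_ignoreidle) {
                            error = EPERM;  /* gate is closed */
                            break;
                    }
                    if (optval)
                            tp->t_flags |= TF_IGNOREIDLE;
                    else
                            tp->t_flags &= ~TF_IGNOREIDLE;
                    break;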
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 2/10/13 9:05 AM, Kevin Oberman wrote: This is a subject rather near to my heart, having fought battles with congestion back in the dark days of Windows when it essentially defaulted to TCP_IGNOREIDLE. It was a huge pain, but it was the only way Windows did TCP in the early days. It simply did not implement slow-start. This was really evil, but in the days when lots of links were 56K and T-1 was mostly used for network core links, the Internet, small as it was back then, did not melt, though it glowed a frightening shade of red fairly often. Today too many systems running like this would melt things very quickly. Google made many, many TCP tweaks: increased initial window, small RTO, enabled ignore after idle, and others. They published that; other people just blindly applied these tunings, and the Internet still works. -- Andrey Zonov
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 2/7/13 12:04 PM, George Neville-Neil wrote: On Feb 6, 2013, at 12:28 , Alfred Perlstein bri...@mu.org wrote: On 2/6/13 4:46 AM, John Baldwin wrote: On Wednesday, February 06, 2013 6:27:04 am Randall Stewart wrote: John: A burst at line rate will *often* cause drops. This is because router queues are of finite size. Also such a burst (especially on a long delay bandwidth network) causes your RTT to increase even if there is no drop, which is going to hurt you as well. A SHOULD in an RFC says you really really really really need to do it unless there is something that makes you willing to override it. It is slight wiggle room. In this I agree with Andre, we should not be *not* doing it. Otherwise folks will be turning this on and it is plain wrong. It may be fine for your network but I would not want to see it in FreeBSD. In my testing here at home I have put back into our stack max-burst. This uses Mark Allman's version (not Kacheong Poon's) where you clamp the cwnd at no more than 4 packets larger than your flight. All of my testing, high-bw-delay or lan, has shown this to improve TCP performance. This is because it helps you avoid bursting out so many packets that you overflow a queue. On your long-delay bw link, if you do burst out too many (and you never know how many that is, since you can not predict how full all those MPLS queues are or how big they are) you will really hurt yourself even worse. Note that generally in Cisco routers the default queue size is somewhere between 100-300 packets depending on the router. Due to the way our application works this never happens, but I am fine with just keeping this patch private. If there are other shops that need this they can always dig the patch up from the archives. This is yet another time when I'm sad about how things happen in FreeBSD. A developer comes forward with a non-default option that's very useful for some specific workloads, specifically one that contributes much time and $$$ to the project, and the community rejects the patches even though it's been successful in other OSes. It makes zero sense. John, can you repost the patch? Maybe there is a way to refactor this somehow so it's like accept filters where we can plug in a hook for TCP? I am very disappointed, but not surprised. I take away the complete opposite feeling. This is how we work through these issues. It's clear from the discussion that this need not be a default in the system, and is a special case. We had a reasoned discussion of what would be best to do and at least two experts in TCP weighed in on the effect this change might have. Not everything proposed by a developer need go into the tree, in particular since these discussions are archived we can always revisit this later. This is exactly how collaborative development should look, whether or not the patch is integrated now, next week, next year, or ever. I agree that discussion is great; we have all learned quite a bit from it about TCP and the dangers of adjusting buffering without considerable thought. I would not be involved in FreeBSD had this type of discussion and information not been discussed on the lists so readily. However, the end result must be far different than what has occurred so far. If the code was deemed unacceptable for general inclusion, then we must find a way to provide a light framework to accomplish the needs of the community member. Take for instance someone who is starting a company that needs this facility. Which OS will they choose? One that has integrated a useful feature, or one that has rejected it and left that code in the mailing list archives? As much as expert opinion is valuable, it must be paired with an understanding of special cases and the ability to facilitate those special cases for our users and developers. -Alfred ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On Sat, Feb 9, 2013 at 6:41 AM, Alfred Perlstein bri...@mu.org wrote: On 2/7/13 12:04 PM, George Neville-Neil wrote: On Feb 6, 2013, at 12:28 , Alfred Perlstein bri...@mu.org wrote: On 2/6/13 4:46 AM, John Baldwin wrote: On Wednesday, February 06, 2013 6:27:04 am Randall Stewart wrote: John: A burst at line rate will *often* cause drops. This is because router queues are of finite size. Also such a burst (especially on a long delay bandwidth network) causes your RTT to increase even if there is no drop, which is going to hurt you as well. A SHOULD in an RFC says you really really really really need to do it unless there is something that makes you willing to override it. It is slight wiggle room. In this I agree with Andre, we should not be *not* doing it. Otherwise folks will be turning this on and it is plain wrong. It may be fine for your network but I would not want to see it in FreeBSD. In my testing here at home I have put back into our stack max-burst. This uses Mark Allman's version (not Kacheong Poon's) where you clamp the cwnd at no more than 4 packets larger than your flight. All of my testing, high-bw-delay or lan, has shown this to improve TCP performance. This is because it helps you avoid bursting out so many packets that you overflow a queue. On your long-delay bw link, if you do burst out too many (and you never know how many that is, since you can not predict how full all those MPLS queues are or how big they are) you will really hurt yourself even worse. Note that generally in Cisco routers the default queue size is somewhere between 100-300 packets depending on the router. Due to the way our application works this never happens, but I am fine with just keeping this patch private. If there are other shops that need this they can always dig the patch up from the archives. This is yet another time when I'm sad about how things happen in FreeBSD. A developer comes forward with a non-default option that's very useful for some specific workloads, specifically one that contributes much time and $$$ to the project, and the community rejects the patches even though it's been successful in other OSes. It makes zero sense. John, can you repost the patch? Maybe there is a way to refactor this somehow so it's like accept filters where we can plug in a hook for TCP? I am very disappointed, but not surprised. I take away the complete opposite feeling. This is how we work through these issues. It's clear from the discussion that this need not be a default in the system, and is a special case. We had a reasoned discussion of what would be best to do and at least two experts in TCP weighed in on the effect this change might have. Not everything proposed by a developer need go into the tree, in particular since these discussions are archived we can always revisit this later. This is exactly how collaborative development should look, whether or not the patch is integrated now, next week, next year, or ever. I agree that discussion is great; we have all learned quite a bit from it about TCP and the dangers of adjusting buffering without considerable thought. I would not be involved in FreeBSD had this type of discussion and information not been discussed on the lists so readily. However, the end result must be far different than what has occurred so far. If the code was deemed unacceptable for general inclusion, then we must find a way to provide a light framework to accomplish the needs of the community member. Take for instance someone who is starting a company that needs this facility. 
Which OS will they choose? One that has integrated a useful feature, or one that has rejected it and left that code in the mailing list archives? As much as expert opinion is valuable, it must be paired with an understanding of special cases and the ability to facilitate those special cases for our users and developers. This is a subject rather near to my heart, having fought battles with congestion back in the dark days of Windows when it essentially defaulted to TCP_IGNOREIDLE. It was a huge pain, but it was the only way Windows did TCP in the early days. It simply did not implement slow-start. This was really evil, but in the days when lots of links were 56K and T-1 was mostly used for network core links, the Internet, small as it was back then, did not melt, though it glowed a frightening shade of red fairly often. Today too many systems running like this would melt things very quickly. OTOH, I can certainly see cases, like John's, where it would be very beneficial. And, yes, Linux has it. (I don't see this as relevant in any way except as proof that not enough people have turned it on to cause serious problems... yet!) It seems a shame to make everyone who really has a need develop their own patches or dig through old mail to find John's. What I would like to see is a way to have it available, but make it unlikely
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On Feb 10, 2013, at 6:05, Kevin Oberman kob6...@gmail.com wrote: One idea that popped into my head (and may be completely ridiculous) is to make its availability dependent on a kernel option and have a warning in NOTES about it contravening normal and accepted practice and that it can cause serious problems both for yourself and for others using the network. Also, if it gets merged, don't call it TCP_IGNOREIDLE. Call it TCP_BLAST_DANGEROUSLY_AFTER_IDLE. Lars ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On Feb 6, 2013, at 12:28 , Alfred Perlstein bri...@mu.org wrote: On 2/6/13 4:46 AM, John Baldwin wrote: On Wednesday, February 06, 2013 6:27:04 am Randall Stewart wrote: John: A burst at line rate will *often* cause drops. This is because router queues are of finite size. Also such a burst (especially on a long delay bandwidth network) causes your RTT to increase even if there is no drop, which is going to hurt you as well. A SHOULD in an RFC says you really really really really need to do it unless there is something that makes you willing to override it. It is slight wiggle room. In this I agree with Andre, we should not be *not* doing it. Otherwise folks will be turning this on and it is plain wrong. It may be fine for your network but I would not want to see it in FreeBSD. In my testing here at home I have put back into our stack max-burst. This uses Mark Allman's version (not Kacheong Poon's) where you clamp the cwnd at no more than 4 packets larger than your flight. All of my testing, high-bw-delay or lan, has shown this to improve TCP performance. This is because it helps you avoid bursting out so many packets that you overflow a queue. On your long-delay bw link, if you do burst out too many (and you never know how many that is, since you can not predict how full all those MPLS queues are or how big they are) you will really hurt yourself even worse. Note that generally in Cisco routers the default queue size is somewhere between 100-300 packets depending on the router. Due to the way our application works this never happens, but I am fine with just keeping this patch private. If there are other shops that need this they can always dig the patch up from the archives. This is yet another time when I'm sad about how things happen in FreeBSD. A developer comes forward with a non-default option that's very useful for some specific workloads, specifically one that contributes much time and $$$ to the project, and the community rejects the patches even though it's been successful in other OSes. It makes zero sense. John, can you repost the patch? Maybe there is a way to refactor this somehow so it's like accept filters where we can plug in a hook for TCP? I am very disappointed, but not surprised. I take away the complete opposite feeling. This is how we work through these issues. It's clear from the discussion that this need not be a default in the system, and is a special case. We had a reasoned discussion of what would be best to do and at least two experts in TCP weighed in on the effect this change might have. Not everything proposed by a developer need go into the tree, in particular since these discussions are archived we can always revisit this later. This is exactly how collaborative development should look, whether or not the patch is integrated now, next week, next year, or ever. Best, George ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
John: A burst at line rate will *often* cause drops. This is because router queues are of finite size. Also such a burst (especially on a long delay bandwidth network) causes your RTT to increase even if there is no drop, which is going to hurt you as well. A SHOULD in an RFC says you really really really really need to do it unless there is something that makes you willing to override it. It is slight wiggle room. In this I agree with Andre, we should not be *not* doing it. Otherwise folks will be turning this on and it is plain wrong. It may be fine for your network but I would not want to see it in FreeBSD. In my testing here at home I have put back into our stack max-burst. This uses Mark Allman's version (not Kacheong Poon's) where you clamp the cwnd at no more than 4 packets larger than your flight. All of my testing, high-bw-delay or lan, has shown this to improve TCP performance. This is because it helps you avoid bursting out so many packets that you overflow a queue. On your long-delay bw link, if you do burst out too many (and you never know how many that is, since you can not predict how full all those MPLS queues are or how big they are) you will really hurt yourself even worse. Note that generally in Cisco routers the default queue size is somewhere between 100-300 packets depending on the router. Bottom line: IMO this is a bad idea. If you want to really improve that link, let me get with you offline and we can see about getting you a couple of our boxes again :-D. R On Jan 22, 2013, at 4:37 PM, Andre Oppermann wrote: On 22.01.2013 21:35, Alfred Perlstein wrote: On 1/22/13 12:11 PM, John Baldwin wrote: As I mentioned in an earlier thread, I recently had to debug an issue we were seeing across a link with a high bandwidth-delay product (both high bandwidth and high RTT). Our specific use case was to use a TCP connection to reliably forward a latency-sensitive datagram stream across a WAN connection. We would often see spikes in the latency of individual datagrams. I eventually tracked this down to the connection entering slow start when it would transmit data after being idle. The data stream was quite bursty and would often attempt to transmit a burst of data after being idle for far longer than a retransmit timeout. In 7.x we had worked around this in the past by disabling RFC 3390 and jacking the slow start window size up via a sysctl. On 8.x this no longer worked. The solution I came up with was to add a new socket option to disable idle handling completely. That is, when an idle connection restarts with this new option enabled, it keeps its current congestion window and doesn't enter slow start. There are only a few cases where such an option is useful, but if anyone else thinks this might be useful I'd be happy to add the option to FreeBSD. This looks good, but it almost sounds like a bug for TCP to be doing this anyhow. It's not a bug. It's by design. It's required by the RFC. Why would one want this behavior? Network conditions change all the time. Traffic and congestion comes and goes. Connections can go idle for milliseconds to minutes to hours. Whenever enough time has passed network capacity probing has to start anew. Wouldn't it make sense to keep the window large until there was a problem rather than unconditionally chop it down? I almost think TCP is afraid that you might wind up swapping out a 10gig interface for a modem? I'm just not getting it. (probably simple oversight on my part). The very real fear is congestion meltdown. 
That is the reason we ended up with TCP's AIMD mechanism in the first place. If everybody were to blast into the network everyone will suffer. The bufferbloat issue identified recently makes things even worse. What do you think about also making this a sysctl for global on/off by default? Please don't. The correct fix is either a) to use the initial window as the restart window (up to 10 MSS nowadays); b) to use a decay mechanism based on the time since the last network condition probe. Even the latter must decay to initCWND within at most 1 MSL. -- Andre ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org -- Randall Stewart 803-317-4952 (cell) ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
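[Editor's note: the max-burst clamp Randall describes is simple enough to sketch. Under Mark Allman's scheme the sender, at transmit time, never allows the usable congestion window to exceed the amount of data currently in flight by more than four segments, which caps the size of any single burst regardless of how large cwnd has grown. The helper below is a hedged illustration with hypothetical names, not the thread's actual patch.]

    /* Clamp cwnd to at most four segments beyond the current flight size. */
    static u_long
    maxburst_clamp(u_long cwnd, u_long flightsize, u_int maxseg)
    {
            u_long limit;

            limit = flightsize + 4 * (u_long)maxseg;
            return (cwnd < limit ? cwnd : limit);
    }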
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
John: In-line On Jan 24, 2013, at 11:14 AM, John Baldwin wrote: On Thursday, January 24, 2013 3:03:31 am Andre Oppermann wrote: On 24.01.2013 03:31, Sepherosa Ziehau wrote: On Thu, Jan 24, 2013 at 12:15 AM, John Baldwin j...@freebsd.org wrote: On Wednesday, January 23, 2013 1:33:27 am Sepherosa Ziehau wrote: On Wed, Jan 23, 2013 at 4:11 AM, John Baldwin j...@freebsd.org wrote: As I mentioned in an earlier thread, I recently had to debug an issue we were seeing across a link with a high bandwidth-delay product (both high bandwidth and high RTT). Our specific use case was to use a TCP connection to reliably forward a latency-sensitive datagram stream across a WAN connection. We would often see spikes in the latency of individual datagrams. I eventually tracked this down to the connection entering slow start when it would transmit data after being idle. The data stream was quite bursty and would often attempt to transmit a burst of data after being idle for far longer than a retransmit timeout. In 7.x we had worked around this in the past by disabling RFC 3390 and jacking the slow start window size up via a sysctl. On 8.x this no longer worked. The solution I came up with was to add a new socket option to disable idle handling completely. That is, when an idle connection restarts with this new option enabled, it keeps its current congestion window and doesn't enter slow start. There are only a few cases where such an option is useful, but if anyone else thinks this might be useful I'd be happy to add the option to FreeBSD. I think what you need is the RFC2861, however, you probably should ignore the application-limited period part of RFC2861. Hummm. It appears btw, that Linux uses RFC 2861, but has a global knob to disable it due to applications having problems. When it is disabled, it doesn't decay the congestion window at all during idle handling. That is, it appears to act the same as if TCP_IGNOREIDLE were enabled. From http://www.kernel.org/doc/man-pages/online/pages/man7/tcp.7.html: tcp_slow_start_after_idle (Boolean; default: enabled; since Linux 2.6.18) If enabled, provide RFC 2861 behavior and time out the congestion window after an idle period. An idle period is defined as the current RTO (retransmission timeout). If disabled, the congestion window will not be timed out after an idle period. Also, in this thread on tcp-m it appears no one on that list realizes that there are any implementations which follow the SHOULD in RFC 2581 for idle handling (which is what we do currently): Nah, I don't think the idle detection in FreeBSD follows the RFC2581/RFC5681 4.1 (the paragraph before the SHOULD). IMHO, that's probably why the author in the following email questioned the implementation of the SHOULD in RFC2581/RFC5681. http://www.ietf.org/mail-archive/web/tcpm/current/msg02864.html So if we were to implement RFC 2861, the new socket option would be equivalent to setting Linux's 'tcp_slow_start_after_idle' to false, but on a per-socket basis rather than globally. Agree, a per-socket option could be more useful than global sysctls under certain situations. However, in addition to the per-socket option, could global sysctl nodes to disable idle_restart/idle_cwv help too? No. This is far too dangerous once it makes it into some tuning guide. The threat of congestion breakdown is real. The Internet, or any packet network, can only survive in the long term if almost all follow the rules and self-constrain to remain fair to the others. 
What would happen if nobody would respect the traffic lights anymore? The problem with this argument is Linux has already had this as a tunable option for years and the Internet hasn't melted as a result. Just because Linux does bad behaviour does *not* mean that we have to. They also put Bic CC in by default, and in the buffer-bloat sense this makes things even worse for users than RFC2581 does. The buffer-bloat problems reported by Jim Gettys would not have been nearly as bad (they still would have existed) if he had been using standard RFC2581 CC. There are much better (and safer) ways to handle this type of network. Putting this in is not a good idea IMO. Besides that bursting into unknown network conditions is very likely to result in burst losses as well. TCP isn't good at recovering from it. In the end you most likely come out ahead if you decay the restartCWND. We have two cases primarily: a) long distance, medium to high RTT, and wildly varying bandwidth (a.k.a. the Internet); b) short distance, low RTT and mostly plenty of bandwidth (a.k.a. Datacenter). The former absolutely definitely requires a decayed restartCWND. The latter less so but even there bursting at 10Gig TSO assisted wirespeed isn't going to end too happy more often than not. You
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On Wednesday, February 06, 2013 6:27:04 am Randall Stewart wrote: John: A burst at line rate will *often* cause drops. This is because router queues are of finite size. Also such a burst (especially on a long delay bandwidth network) causes your RTT to increase even if there is no drop, which is going to hurt you as well. A SHOULD in an RFC says you really really really really need to do it unless there is something that makes you willing to override it. It is slight wiggle room. In this I agree with Andre, we should not be *not* doing it. Otherwise folks will be turning this on and it is plain wrong. It may be fine for your network but I would not want to see it in FreeBSD. In my testing here at home I have put back into our stack max-burst. This uses Mark Allman's version (not Kacheong Poon's) where you clamp the cwnd at no more than 4 packets larger than your flight. All of my testing, high-bw-delay or lan, has shown this to improve TCP performance. This is because it helps you avoid bursting out so many packets that you overflow a queue. On your long-delay bw link, if you do burst out too many (and you never know how many that is, since you can not predict how full all those MPLS queues are or how big they are) you will really hurt yourself even worse. Note that generally in Cisco routers the default queue size is somewhere between 100-300 packets depending on the router. Due to the way our application works this never happens, but I am fine with just keeping this patch private. If there are other shops that need this they can always dig the patch up from the archives. -- John Baldwin ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 2/6/13 4:46 AM, John Baldwin wrote: On Wednesday, February 06, 2013 6:27:04 am Randall Stewart wrote: John: A burst at line rate will *often* cause drops. This is because router queues are of finite size. Also such a burst (especially on a long delay bandwidth network) causes your RTT to increase even if there is no drop, which is going to hurt you as well. A SHOULD in an RFC says you really really really really need to do it unless there is something that makes you willing to override it. It is slight wiggle room. In this I agree with Andre, we should not be *not* doing it. Otherwise folks will be turning this on and it is plain wrong. It may be fine for your network but I would not want to see it in FreeBSD. In my testing here at home I have put back into our stack max-burst. This uses Mark Allman's version (not Kacheong Poon's) where you clamp the cwnd at no more than 4 packets larger than your flight. All of my testing, high-bw-delay or lan, has shown this to improve TCP performance. This is because it helps you avoid bursting out so many packets that you overflow a queue. On your long-delay bw link, if you do burst out too many (and you never know how many that is, since you can not predict how full all those MPLS queues are or how big they are) you will really hurt yourself even worse. Note that generally in Cisco routers the default queue size is somewhere between 100-300 packets depending on the router. Due to the way our application works this never happens, but I am fine with just keeping this patch private. If there are other shops that need this they can always dig the patch up from the archives. This is yet another time when I'm sad about how things happen in FreeBSD. A developer comes forward with a non-default option that's very useful for some specific workloads, specifically one that contributes much time and $$$ to the project, and the community rejects the patches even though it's been successful in other OSes. It makes zero sense. John, can you repost the patch? Maybe there is a way to refactor this somehow so it's like accept filters where we can plug in a hook for TCP? I am very disappointed, but not surprised. -Alfred ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On Wednesday, January 30, 2013 12:26:17 pm Andre Oppermann wrote: You can simply create your own congestion control algorithm with only the restart window changed. See (pseudo) code below. BTW, I just noticed that the other cc algos don't reset the idle window. *sigh* I am fully competent at maintaining my own local changes. The point was to share this so that other people with similar workloads could make use of it. Also, a custom CC algo is not the right approach as we would want this change regardless of the CC algo used for handling non-idle periods (so that this is an orthogonal knob). Linux also makes this an orthogonal knob rather than requiring a separate CC algo. -- John Baldwin ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 05.02.2013 18:11, John Baldwin wrote: On Wednesday, January 30, 2013 12:26:17 pm Andre Oppermann wrote: You can simply create your own congestion control algorithm with only the restart window changed. See (pseudo) code below. BTW, I just noticed that the other cc algos don't reset the idle window. *sigh* I am fully competent at maintaining my own local changes. The point was to share this so that other people with similar workloads could make use of it. Also, a custom CC algo is not the right approach as we would want this change regardless of the CC algo used for handling non-idle periods (so that this is an orthogonal knob). Linux also makes this an orthogonal knob rather than requiring a separate CC algo. If everything Linux does is good, then go ahead and commit it. Discussing this change further is then pointless. I don't mind too much and I have stated my case why I think it's the wrong thing to do. I would prefer to encapsulate it into its own not-so-much-congestion-management algorithm so you can eventually do other tweaks, like more aggressive loss recovery, which would fit your objective as well. Since you have to modify your app anyway to do the sockopt call this seems a more complete solution to me. At least it is better than a non-portable hack that violates one of the most fundamental TCP concepts. -- Andre ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On Tuesday, February 05, 2013 12:44:27 pm Andre Oppermann wrote: On 05.02.2013 18:11, John Baldwin wrote: On Wednesday, January 30, 2013 12:26:17 pm Andre Oppermann wrote: You can simply create your own congestion control algorithm with only the restart window changed. See (pseudo) code below. BTW, I just noticed that the other cc algos don't reset the idle window. *sigh* I am fully competent at maintaining my own local changes. The point was to share this so that other people with similar workloads could make use of it. Also, a custom CC algo is not the right approach as we would want this change regardless of the CC algo used for handling non-idle periods (so that this is an orthogonal knob). Linux also makes this an orthogonal knob rather than requiring a separate CC algo. If everything Linux does is good, then go ahead and commit it. Discussing this change further is then pointless. I don't mind too much and I have stated my case why I think it's the wrong thing to do. Not everything Linux does is good, nor is everything Linux does bad. I would prefer to encapsulate it into its own not-so-much-congestion-management algorithm so you can eventually do other tweaks, like more aggressive loss recovery, which would fit your objective as well. Since you have to modify your app anyway to do the sockopt call this seems a more complete solution to me. At least it is better than a non-portable hack that violates one of the most fundamental TCP concepts. This is real rich from the guy pushing the increased IW that came from Linux. :) Tools not policy, yadda yadda, but I digress. -- John Baldwin ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On Tuesday, January 29, 2013 6:07:22 pm Andre Oppermann wrote: On 29.01.2013 19:50, John Baldwin wrote: On Thursday, January 24, 2013 11:14:40 am John Baldwin wrote: Agree, a per-socket option could be more useful than global sysctls under certain situations. However, in addition to the per-socket option, could global sysctl nodes to disable idle_restart/idle_cwv help too? No. This is far too dangerous once it makes it into some tuning guide. The threat of congestion breakdown is real. The Internet, or any packet network, can only survive in the long term if almost all follow the rules and self-constrain to remain fair to the others. What would happen if nobody would respect the traffic lights anymore? The problem with this argument is Linux has already had this as a tunable option for years and the Internet hasn't melted as a result. Since this seems to be a burning issue I'll come up with a patch in the next days to add a decaying restartCWND that'll be fair and allow a very quick ramp up if no loss occurs. I think this could be useful. OTOH, I still think the TCP_IGNOREIDLE option is useful both with and without a decaying restartCWND? *ping* Andre, do you object to adding the new socket option? Yes, unfortunately I do object. This option, combined with the inflated CWND at the end of a burst, effectively removes much, if not all, of the congestion control mechanisms originally put in place to allow multiple [TCP] streams to co-exist on the same pipe. Not having any decay or timeout makes it even worse by doing this burst after an arbitrary amount of time when network conditions and the congestion situation have certainly changed. You have completely ignored the fact that Linux has had this as a global option for years and the Internet has not melted. A socket option is far more fine-grained than their tunable (and requires code changes, not something a random sysadmin can just toggle as tuning). The primary principle of TCP is to be cooperative with competing streams and fairly share bandwidth on a given link. Whenever the ACK clock came to a halt for some time we must re-probe (slowstart from a restartCWND) the link to compensate for our lack of knowledge of the current link and congestion situation. Doing that with a decay function and a floor equaling the IW (10 segments nowadays) gives a rapid ramp up, especially on LAN RTTs, while avoiding a blind burst and subsequent loss cycle. I understand all that, but it isn't applicable to my use case. I'm not sharing the bandwidth with anyone but other connections of my own (and they are all lower priority than this one). Also, I have idle periods of hundreds of milliseconds (larger than the RTT on this cross-continental link that also has high bandwidth), so it seems that even a decayed restartCWND will be useless to me as it will have decayed down to nothing before I finally restart after long idle periods. If you absolutely know that you're the only one on that network and you want pure wirespeed then a TCP cc_null module doing away with all congestion control may be the right answer. The infrastructure is in place and it can be selected per socket. Plus it can be loaded as a module and thus doesn't have to be part of the base system. No, I do not think that doing away with all congestion control will work for my case. Even though we have a dedicated line, etc. that doesn't mean congestion is impossible and that I don't want the normal feedback to apply during the non-restart cases. 
BTW, I looked at using alternate congestion control algorithms (cc_cubic and some of the others) first before resorting to adding this option and they either did not fix the issue or were buggy. -- John Baldwin ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 1/30/13 11:58 AM, John Baldwin wrote: On Tuesday, January 29, 2013 6:07:22 pm Andre Oppermann wrote: Yes, unfortunately I do object. This option, combined with the inflated CWND at the end of a burst, effectively removes much, if not all, of the congestion control mechanisms originally put in place to allow multiple [TCP] streams co-exist on the same pipe. Not having any decay or timeout makes it even worse by doing this burst after an arbitrary amount of time when network conditions and the congestion situation have certainly changed. You have completely ignored the fact that Linux has had this as a global option for years and the Internet has not melted. A socket option is far more fine-grained than their tunable (and requires code changes, not something a random sysadmin can just toggle as tuning). I agree with John here. While Andre's objection makes sense, since the majority of Linux/Unix hosts now have this as a global option I can't think of why you would force FreeBSD to be a final holdout. -Alfred ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 30.01.2013 17:58, John Baldwin wrote: On Tuesday, January 29, 2013 6:07:22 pm Andre Oppermann wrote: On 29.01.2013 19:50, John Baldwin wrote: On Thursday, January 24, 2013 11:14:40 am John Baldwin wrote: Agree, a per-socket option could be more useful than global sysctls under certain situations. However, in addition to the per-socket option, could global sysctl nodes to disable idle_restart/idle_cwv help too? No. This is far too dangerous once it makes it into some tuning guide. The threat of congestion breakdown is real. The Internet, or any packet network, can only survive in the long term if almost all follow the rules and self-constrain to remain fair to the others. What would happen if nobody would respect the traffic lights anymore? The problem with this argument is Linux has already had this as a tunable option for years and the Internet hasn't melted as a result. Since this seems to be a burning issue I'll come up with a patch in the next days to add a decaying restartCWND that'll be fair and allow a very quick ramp up if no loss occurs. I think this could be useful. OTOH, I still think the TCP_IGNOREIDLE option is useful both with and without a decaying restartCWND? *ping* Andre, do you object to adding the new socket option? Yes, unfortunately I do object. This option, combined with the inflated CWND at the end of a burst, effectively removes much, if not all, of the congestion control mechanisms originally put in place to allow multiple [TCP] streams to co-exist on the same pipe. Not having any decay or timeout makes it even worse by doing this burst after an arbitrary amount of time when network conditions and the congestion situation have certainly changed. You have completely ignored the fact that Linux has had this as a global option for years and the Internet has not melted. Sure. A friend of mine does free climbing and he hasn't crashed yet. He also runs all filesystems async with disk write cache enabled, no backup, and hasn't lost a file yet. ;-) A socket option is far more fine-grained than their tunable (and requires code changes, not something a random sysadmin can just toggle as tuning). Agreed that a socket option is much more difficult to use. The primary principle of TCP is to be cooperative with competing streams and fairly share bandwidth on a given link. Whenever the ACK clock came to a halt for some time we must re-probe (slowstart from a restartCWND) the link to compensate for our lack of knowledge of the current link and congestion situation. Doing that with a decay function and a floor equaling the IW (10 segments nowadays) gives a rapid ramp up, especially on LAN RTTs, while avoiding a blind burst and subsequent loss cycle. I understand all that, but it isn't applicable to my use case. I'm not sharing the bandwidth with anyone but other connections of my own (and they are all lower priority than this one). Also, I have idle periods of hundreds of milliseconds (larger than the RTT on this cross-continental link that also has high bandwidth), so it seems that even a decayed restartCWND will be useless to me as it will have decayed down to nothing before I finally restart after long idle periods. OK. If you absolutely know that you're the only one on that network and you want pure wirespeed then a TCP cc_null module doing away with all congestion control may be the right answer. The infrastructure is in place and it can be selected per socket. Plus it can be loaded as a module and thus doesn't have to be part of the base system. 
No, I do not think that doing away with all congestion control will work for my case. Even though we have a dedicated line, etc. that doesn't mean congestion is impossible and that I don't want the normal feedback to apply during the non-restart cases. BTW, I looked at using alternate congestion control algorithms (cc_cubic and some of the others) first before resorting to adding this option and they either did not fix the issue or were buggy. You can simply create your own congestion control algorithm with only the restart window changed. See (pseudo) code below. BTW, I just noticed that the other cc algos don't reset the idle window. -- Andre

    /* boilerplate from netinet/cc/cc_newreno.c here. */

    struct cc_algo jhb_cc_algo = {
            .name = "jhb_full_restartCWND",
            .ack_received = newreno_ack_received,
            .after_idle = jhb_after_idle,
            .cong_signal = newreno_cong_signal,
            .post_recovery = newreno_post_recovery,
    };

    static void
    jhb_after_idle(struct cc_var *ccv)
    {

            return;
    }

___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 30.01.2013 18:11, Alfred Perlstein wrote: On 1/30/13 11:58 AM, John Baldwin wrote: On Tuesday, January 29, 2013 6:07:22 pm Andre Oppermann wrote: Yes, unfortunately I do object. This option, combined with the inflated CWND at the end of a burst, effectively removes much, if not all, of the congestion control mechanisms originally put in place to allow multiple [TCP] streams to co-exist on the same pipe. Not having any decay or timeout makes it even worse by doing this burst after an arbitrary amount of time when network conditions and the congestion situation have certainly changed. You have completely ignored the fact that Linux has had this as a global option for years and the Internet has not melted. A socket option is far more fine-grained than their tunable (and requires code changes, not something a random sysadmin can just toggle as tuning). I agree with John here. While Andre's objection makes sense, since the majority of Linux/Unix hosts now have this as a global option I can't think of why you would force FreeBSD to be a final holdout. Unless OpenBSD, NetBSD, Solaris/Illumos also support this it is hardly a majority of Linux/Unix hosts. And this isn't something a sysadmin should tune at all. -- Andre ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 1/30/13 12:29 PM, Andre Oppermann wrote: On 30.01.2013 18:11, Alfred Perlstein wrote: On 1/30/13 11:58 AM, John Baldwin wrote: On Tuesday, January 29, 2013 6:07:22 pm Andre Oppermann wrote: Yes, unfortunately I do object. This option, combined with the inflated CWND at the end of a burst, effectively removes much, if not all, of the congestion control mechanisms originally put in place to allow multiple [TCP] streams to co-exist on the same pipe. Not having any decay or timeout makes it even worse by doing this burst after an arbitrary amount of time when network conditions and the congestion situation have certainly changed. You have completely ignored the fact that Linux has had this as a global option for years and the Internet has not melted. A socket option is far more fine-grained than their tunable (and requires code changes, not something a random sysadmin can just toggle as tuning). I agree with John here. While Andre's objection makes sense, since the majority of Linux/Unix hosts now have this as a global option I can't think of why you would force FreeBSD to be a final holdout. Unless OpenBSD, NetBSD, Solaris/Illumos also support this it is hardly a majority of Linux/Unix hosts. And this isn't something a sysadmin should tune at all. My apologies, I should have been more clear. I was speaking of the majority of the install base, not the majority of distros. -Alfred ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On Thursday, January 24, 2013 11:14:40 am John Baldwin wrote: Agree, a per-socket option could be more useful than global sysctls under certain situations. However, in addition to the per-socket option, could global sysctl nodes to disable idle_restart/idle_cwv help too? No. This is far too dangerous once it makes it into some tuning guide. The threat of congestion breakdown is real. The Internet, or any packet network, can only survive in the long term if almost all follow the rules and self-constrain to remain fair to the others. What would happen if nobody would respect the traffic lights anymore? The problem with this argument is Linux has already had this as a tunable option for years and the Internet hasn't melted as a result. Since this seems to be a burning issue I'll come up with a patch in the next days to add a decaying restartCWND that'll be fair and allow a very quick ramp up if no loss occurs. I think this could be useful. OTOH, I still think the TCP_IGNOREIDLE option is useful both with and without a decaying restartCWND? *ping* Andre, do you object to adding the new socket option? -- John Baldwin ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 29.01.2013 19:50, John Baldwin wrote: On Thursday, January 24, 2013 11:14:40 am John Baldwin wrote: Agree, a per-socket option could be more useful than global sysctls under certain situations. However, in addition to the per-socket option, could global sysctl nodes to disable idle_restart/idle_cwv help too? No. This is far too dangerous once it makes it into some tuning guide. The threat of congestion breakdown is real. The Internet, or any packet network, can only survive in the long term if almost all follow the rules and self-constrain to remain fair to the others. What would happen if nobody would respect the traffic lights anymore? The problem with this argument is Linux has already had this as a tunable option for years and the Internet hasn't melted as a result. Since this seems to be a burning issue I'll come up with a patch in the next days to add a decaying restartCWND that'll be fair and allow a very quick ramp up if no loss occurs. I think this could be useful. OTOH, I still think the TCP_IGNOREIDLE option is useful both with and without a decaying restartCWND? *ping* Andre, do you object to adding the new socket option? Yes, unfortunately I do object. This option, combined with the inflated CWND at the end of a burst, effectively removes much, if not all, of the congestion control mechanisms originally put in place to allow multiple [TCP] streams to co-exist on the same pipe. Not having any decay or timeout makes it even worse by doing this burst after an arbitrary amount of time when network conditions and the congestion situation have certainly changed. The primary principle of TCP is to be cooperative with competing streams and fairly share bandwidth on a given link. Whenever the ACK clock came to a halt for some time we must re-probe (slowstart from a restartCWND) the link to compensate for our lack of knowledge of the current link and congestion situation. Doing that with a decay function and a floor equaling the IW (10 segments nowadays) gives a rapid ramp up, especially on LAN RTTs, while avoiding a blind burst and subsequent loss cycle. If you absolutely know that you're the only one on that network and you want pure wirespeed then a TCP cc_null module doing away with all congestion control may be the right answer. The infrastructure is in place and it can be selected per socket. Plus it can be loaded as a module and thus doesn't have to be part of the base system. I'm currently re-emerging from the startup and auto-scaling rabbit-hole, finishing up, and will post patches for review shortly. After that I'm looking after the restartCWND issue. A first quick patch (untested) to update the restartCWND to the IW is below. -- Andre

    $ svn diff netinet/cc/cc_newreno.c
    Index: netinet/cc/cc_newreno.c
    ===================================================================
    --- netinet/cc/cc_newreno.c	(revision 246082)
    +++ netinet/cc/cc_newreno.c	(working copy)
    @@ -166,12 +166,21 @@
      	 *
      	 * See RFC5681 Section 4.1. Restarting Idle Connections.
      	 */
    -	if (V_tcp_do_rfc3390)
    +	if (V_tcp_do_initcwnd10)
    +		rw = min(10 * CCV(ccv, t_maxseg),
    +		    max(2 * CCV(ccv, t_maxseg), 14600));
    +	else if (V_tcp_do_rfc3390)
     		rw = min(4 * CCV(ccv, t_maxseg),
     		    max(2 * CCV(ccv, t_maxseg), 4380));
    -	else
    -		rw = CCV(ccv, t_maxseg) * 2;
    -
    +	else {
    +		/* Per RFC5681 Section 3.1 */
    +		if (CCV(ccv, t_maxseg) > 2190)
    +			rw = 2 * CCV(ccv, t_maxseg);
    +		else if (CCV(ccv, t_maxseg) > 1095)
    +			rw = 3 * CCV(ccv, t_maxseg);
    +		else
    +			rw = 4 * CCV(ccv, t_maxseg);
    +	}
     	CCV(ccv, snd_cwnd) = min(rw, CCV(ccv, snd_cwnd));
     }

___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
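[Editor's note: for context on where Andre's patch takes effect, the after_idle hook is invoked from tcp_output() when a connection restarts after an idle period. In FreeBSD of this era the check looks roughly like the fragment below; it is paraphrased from memory and is not part of Andre's diff.]

    /* A connection counts as idle when everything sent has been ACKed... */
    idle = (tp->t_flags & TF_LASTIDLE) || (tp->snd_max == tp->snd_una);
    /* ...and after_idle fires only if the idle time exceeds the current RTO. */
    if (idle && ticks - tp->t_rcvtime >= tp->t_rxtcur)
            cc_after_idle(tp);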
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 24.01.2013 03:31, Sepherosa Ziehau wrote: On Thu, Jan 24, 2013 at 12:15 AM, John Baldwin j...@freebsd.org wrote: On Wednesday, January 23, 2013 1:33:27 am Sepherosa Ziehau wrote: On Wed, Jan 23, 2013 at 4:11 AM, John Baldwin j...@freebsd.org wrote: As I mentioned in an earlier thread, I recently had to debug an issue we were seeing across a link with a high bandwidth-delay product (both high bandwidth and high RTT). Our specific use case was to use a TCP connection to reliably forward a latency-sensitive datagram stream across a WAN connection. We would often see spikes in the latency of individual datagrams. I eventually tracked this down to the connection entering slow start when it would transmit data after being idle. The data stream was quite bursty and would often attempt to transmit a burst of data after being idle for far longer than a retransmit timeout. In 7.x we had worked around this in the past by disabling RFC 3390 and jacking the slow start window size up via a sysctl. On 8.x this no longer worked. The solution I came up with was to add a new socket option to disable idle handling completely. That is, when an idle connection restarts with this new option enabled, it keeps its current congestion window and doesn't enter slow start. There are only a few cases where such an option is useful, but if anyone else thinks this might be useful I'd be happy to add the option to FreeBSD. I think what you need is the RFC2861, however, you probably should ignore the application-limited period part of RFC2861. Hummm. It appears btw, that Linux uses RFC 2861, but has a global knob to disable it due to applications having problems. When it is disabled, it doesn't decay the congestion window at all during idle handling. That is, it appears to act the same as if TCP_IGNOREIDLE were enabled. From http://www.kernel.org/doc/man-pages/online/pages/man7/tcp.7.html: tcp_slow_start_after_idle (Boolean; default: enabled; since Linux 2.6.18) If enabled, provide RFC 2861 behavior and time out the congestion window after an idle period. An idle period is defined as the current RTO (retransmission timeout). If disabled, the congestion window will not be timed out after an idle period. Also, in this thread on tcp-m it appears no one on that list realizes that there are any implementations which follow the SHOULD in RFC 2581 for idle handling (which is what we do currently): Nah, I don't think the idle detection in FreeBSD follows the RFC2581/RFC5681 4.1 (the paragraph before the SHOULD). IMHO, that's probably why the author in the following email questioned the implementation of the SHOULD in RFC2581/RFC5681. http://www.ietf.org/mail-archive/web/tcpm/current/msg02864.html So if we were to implement RFC 2861, the new socket option would be equivalent to setting Linux's 'tcp_slow_start_after_idle' to false, but on a per-socket basis rather than globally. Agree, a per-socket option could be more useful than global sysctls under certain situations. However, in addition to the per-socket option, could global sysctl nodes to disable idle_restart/idle_cwv help too? No. This is far too dangerous once it makes it into some tuning guide. The threat of congestion breakdown is real. The Internet, or any packet network, can only survive in the long term if almost all follow the rules and self-constrain to remain fair to the others. What would happen if nobody would respect the traffic lights anymore? Besides that bursting into unknown network conditions is very likely to result in burst losses as well. 
TCP isn't good at recovering from it. In the end you most likely come out ahead if you decay the restartCWND. We have two cases primarily: a) long distance, medium to high RTT, and wildly varying bandwidth (a.k.a. the Internet); b) short distance, low RTT and mostly plenty of bandwidth (a.k.a. Datacenter). The former absolutely definitely requires a decayed restartCWND. The latter less so but even there bursting at 10Gig TSO assisted wirespeed isn't going to end too happy more often than not. Since this seems to be a burning issue I'll come up with a patch in the next days to add a decaying restartCWND that'll be fair and allow a very quick ramp up if no loss occurs. -- Andre ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
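[Editor's note: Andre's decayed-restartCWND patch does not appear in this part of the thread. The helper below is a minimal sketch of one plausible interpretation, in the spirit of RFC 2861: halve the congestion window once per retransmit-timeout interval spent idle, never dropping below the initial window. All names are hypothetical, and rto_ticks is assumed to be non-zero.]

    /* Decay cwnd by half per idle RTO, with the initial window as the floor. */
    static u_long
    decayed_restart_cwnd(u_long cwnd, u_long initcwnd, u_int idle_ticks,
        u_int rto_ticks)
    {
            u_int n;

            for (n = idle_ticks / rto_ticks; n > 0 && cwnd / 2 >= initcwnd; n--)
                    cwnd /= 2;
            return (cwnd);
    }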
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On Thursday, January 24, 2013 3:03:31 am Andre Oppermann wrote: On 24.01.2013 03:31, Sepherosa Ziehau wrote: On Thu, Jan 24, 2013 at 12:15 AM, John Baldwin j...@freebsd.org wrote: On Wednesday, January 23, 2013 1:33:27 am Sepherosa Ziehau wrote: On Wed, Jan 23, 2013 at 4:11 AM, John Baldwin j...@freebsd.org wrote: As I mentioned in an earlier thread, I recently had to debug an issue we were seeing across a link with a high bandwidth-delay product (both high bandwidth and high RTT). Our specific use case was to use a TCP connection to reliably forward a latency-sensitive datagram stream across a WAN connection. We would often see spikes in the latency of individual datagrams. I eventually tracked this down to the connection entering slow start when it would transmit data after being idle. The data stream was quite bursty and would often attempt to transmit a burst of data after being idle for far longer than a retransmit timeout. In 7.x we had worked around this in the past by disabling RFC 3390 and jacking the slow start window size up via a sysctl. On 8.x this no longer worked. The solution I came up with was to add a new socket option to disable idle handling completely. That is, when an idle connection restarts with this new option enabled, it keeps its current congestion window and doesn't enter slow start. There are only a few cases where such an option is useful, but if anyone else thinks this might be useful I'd be happy to add the option to FreeBSD. I think what you need is the RFC2861, however, you probably should ignore the application-limited period part of RFC2861. Hummm. It appears btw, that Linux uses RFC 2861, but has a global knob to disable it due to applictions having problems. When it is disabled, it doesn't decay the congestion window at all during idle handling. That is, it appears to act the same as if TCP_IGNOREIDLE were enabled. From http://www.kernel.org/doc/man-pages/online/pages/man7/tcp.7.html: tcp_slow_start_after_idle (Boolean; default: enabled; since Linux 2.6.18) If enabled, provide RFC 2861 behavior and time out the congestion window after an idle period. An idle period is defined as the current RTO (retransmission timeout). If disabled, the congestion window will not be timed out after an idle period. Also, in this thread on tcp-m it appears no one on that list realizes that there are any implementations which follow the SHOULD in RFC 2581 for idle handling (which is what we do currently): Nah, I don't think the idle detection in FreeBSD follows the RFC2581/RFC5681 4.1 (the paragraph before the SHOULD). IMHO, that's probably why the author in the following email requestioned about the implementation of SHOULD in RFC2581/RFC5681. http://www.ietf.org/mail-archive/web/tcpm/current/msg02864.html So if we were to implement RFC 2861, the new socket option would be equivalent to setting Linux's 'tcp_slow_start_after_idle' to false, but on a per-socket basis rather than globally. Agree, per-socket option could be useful than global sysctls under certain situation. However, in addition to the per-socket option, could global sysctl nodes to disable idle_restart/idle_cwv help too? No. This is far too dangerous once it makes it into some tuning guide. The threat of congestion breakdown is real. The Internet, or any packet network, can only survive in the long term if almost all follow the rules and self-constrain to remain fair to the others. What would happen if nobody would respect the traffic lights anymore? 
The problem with this argument is that Linux has already had this as a tunable option for years and the Internet hasn't melted as a result.

Besides that, bursting into unknown network conditions is very likely to result in burst losses as well. TCP isn't good at recovering from it. In the end you most likely come out ahead if you decay the restartCWND. We have two cases primarily: a) long distance, medium to high RTT, and wildly varying bandwidth (a.k.a. the Internet); b) short distance, low RTT and mostly plenty of bandwidth (a.k.a. the datacenter). The former absolutely requires a decayed restartCWND. The latter less so, but even there bursting at 10Gig TSO-assisted wire speed isn't going to end well more often than not.

You forgot my case: c) dedicated long distance links with high bandwidth.

Since this seems to be a burning issue I'll come up with a patch in the next few days to add a decaying restartCWND that'll be fair and allow a very quick ramp up if no loss occurs.

I think this could be useful. OTOH, I still think the TCP_IGNOREIDLE option is useful both with and without a decaying restartCWND.

-- John Baldwin
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 1/24/13 11:14 AM, John Baldwin wrote: On Thursday, January 24, 2013 3:03:31 am Andre Oppermann wrote: On 24.01.2013 03:31, Sepherosa Ziehau wrote: On Thu, Jan 24, 2013 at 12:15 AM, John Baldwin j...@freebsd.org wrote: On Wednesday, January 23, 2013 1:33:27 am Sepherosa Ziehau wrote: On Wed, Jan 23, 2013 at 4:11 AM, John Baldwin j...@freebsd.org wrote:

As I mentioned in an earlier thread, I recently had to debug an issue we were seeing across a link with a high bandwidth-delay product (both high bandwidth and high RTT). Our specific use case was to use a TCP connection to reliably forward a latency-sensitive datagram stream across a WAN connection. We would often see spikes in the latency of individual datagrams. I eventually tracked this down to the connection entering slow start when it would transmit data after being idle. The data stream was quite bursty and would often attempt to transmit a burst of data after being idle for far longer than a retransmit timeout. In 7.x we had worked around this in the past by disabling RFC 3390 and jacking the slow start window size up via a sysctl. On 8.x this no longer worked. The solution I came up with was to add a new socket option to disable idle handling completely. That is, when an idle connection restarts with this new option enabled, it keeps its current congestion window and doesn't enter slow start. There are only a few cases where such an option is useful, but if anyone else thinks this might be useful I'd be happy to add the option to FreeBSD.

I think what you need is RFC 2861; however, you probably should ignore the application-limited period part of RFC 2861.

Hummm. It appears, btw, that Linux uses RFC 2861, but has a global knob to disable it due to applications having problems. When it is disabled, it doesn't decay the congestion window at all during idle handling. That is, it appears to act the same as if TCP_IGNOREIDLE were enabled. From http://www.kernel.org/doc/man-pages/online/pages/man7/tcp.7.html: tcp_slow_start_after_idle (Boolean; default: enabled; since Linux 2.6.18) If enabled, provide RFC 2861 behavior and time out the congestion window after an idle period. An idle period is defined as the current RTO (retransmission timeout). If disabled, the congestion window will not be timed out after an idle period. Also, in this thread on tcpm it appears no one on that list realizes that there are any implementations which follow the SHOULD in RFC 2581 for idle handling (which is what we do currently):

Nah, I don't think the idle detection in FreeBSD follows RFC2581/RFC5681 4.1 (the paragraph before the SHOULD). IMHO, that's probably why the author in the following email questioned the implementation of the SHOULD in RFC2581/RFC5681. http://www.ietf.org/mail-archive/web/tcpm/current/msg02864.html

So if we were to implement RFC 2861, the new socket option would be equivalent to setting Linux's 'tcp_slow_start_after_idle' to false, but on a per-socket basis rather than globally.

Agreed, a per-socket option could be more useful than global sysctls in certain situations. However, in addition to the per-socket option, could global sysctl nodes that disable idle_restart/idle_cwv help too?

No. This is far too dangerous once it makes it into some tuning guide. The threat of congestion breakdown is real. The Internet, or any packet network, can only survive in the long term if almost all follow the rules and self-constrain to remain fair to the others. What would happen if nobody respected the traffic lights anymore?
The problem with this argument is that Linux has already had this as a tunable option for years and the Internet hasn't melted as a result.

Besides that, bursting into unknown network conditions is very likely to result in burst losses as well. TCP isn't good at recovering from it. In the end you most likely come out ahead if you decay the restartCWND. We have two cases primarily: a) long distance, medium to high RTT, and wildly varying bandwidth (a.k.a. the Internet); b) short distance, low RTT and mostly plenty of bandwidth (a.k.a. the datacenter). The former absolutely requires a decayed restartCWND. The latter less so, but even there bursting at 10Gig TSO-assisted wire speed isn't going to end well more often than not.

You forgot my case: c) dedicated long distance links with high bandwidth.

Since this seems to be a burning issue I'll come up with a patch in the next few days to add a decaying restartCWND that'll be fair and allow a very quick ramp up if no loss occurs.

I think this could be useful. OTOH, I still think the TCP_IGNOREIDLE option is useful both with and without a decaying restartCWND.

Linux seems to have been doing just fine with it for a long while. Can we get this committed?

-Alfred
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On Wednesday, January 23, 2013 1:33:27 am Sepherosa Ziehau wrote: On Wed, Jan 23, 2013 at 4:11 AM, John Baldwin j...@freebsd.org wrote:

As I mentioned in an earlier thread, I recently had to debug an issue we were seeing across a link with a high bandwidth-delay product (both high bandwidth and high RTT). Our specific use case was to use a TCP connection to reliably forward a latency-sensitive datagram stream across a WAN connection. We would often see spikes in the latency of individual datagrams. I eventually tracked this down to the connection entering slow start when it would transmit data after being idle. The data stream was quite bursty and would often attempt to transmit a burst of data after being idle for far longer than a retransmit timeout. In 7.x we had worked around this in the past by disabling RFC 3390 and jacking the slow start window size up via a sysctl. On 8.x this no longer worked. The solution I came up with was to add a new socket option to disable idle handling completely. That is, when an idle connection restarts with this new option enabled, it keeps its current congestion window and doesn't enter slow start. There are only a few cases where such an option is useful, but if anyone else thinks this might be useful I'd be happy to add the option to FreeBSD.

I think what you need is RFC 2861; however, you probably should ignore the application-limited period part of RFC 2861.

Hummm. It appears, btw, that Linux uses RFC 2861, but has a global knob to disable it due to applications having problems. When it is disabled, it doesn't decay the congestion window at all during idle handling. That is, it appears to act the same as if TCP_IGNOREIDLE were enabled. From http://www.kernel.org/doc/man-pages/online/pages/man7/tcp.7.html: tcp_slow_start_after_idle (Boolean; default: enabled; since Linux 2.6.18) If enabled, provide RFC 2861 behavior and time out the congestion window after an idle period. An idle period is defined as the current RTO (retransmission timeout). If disabled, the congestion window will not be timed out after an idle period.

Also, in this thread on tcpm it appears no one on that list realizes that there are any implementations which follow the SHOULD in RFC 2581 for idle handling (which is what we do currently): http://www.ietf.org/mail-archive/web/tcpm/current/msg02864.html

So if we were to implement RFC 2861, the new socket option would be equivalent to setting Linux's 'tcp_slow_start_after_idle' to false, but on a per-socket basis rather than globally.

-- John Baldwin
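For reference, the Linux knob mentioned above is a normal sysctl, so it can be flipped globally with "sysctl -w net.ipv4.tcp_slow_start_after_idle=0". A minimal C sketch that does the same by writing the standard procfs node (requires root) might look like this:

    #include <stdio.h>

    /*
     * Disable Linux's global slow-start-after-idle knob by writing its
     * procfs node; equivalent to
     * "sysctl -w net.ipv4.tcp_slow_start_after_idle=0".
     */
    int
    main(void)
    {
            const char *node =
                "/proc/sys/net/ipv4/tcp_slow_start_after_idle";
            FILE *f = fopen(node, "w");

            if (f == NULL) {
                    perror(node);
                    return (1);
            }
            fputs("0\n", f);        /* 0 = keep cwnd across idle periods */
            fclose(f);
            return (0);
    }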
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On Thu, Jan 24, 2013 at 12:15 AM, John Baldwin j...@freebsd.org wrote: On Wednesday, January 23, 2013 1:33:27 am Sepherosa Ziehau wrote: On Wed, Jan 23, 2013 at 4:11 AM, John Baldwin j...@freebsd.org wrote:

As I mentioned in an earlier thread, I recently had to debug an issue we were seeing across a link with a high bandwidth-delay product (both high bandwidth and high RTT). Our specific use case was to use a TCP connection to reliably forward a latency-sensitive datagram stream across a WAN connection. We would often see spikes in the latency of individual datagrams. I eventually tracked this down to the connection entering slow start when it would transmit data after being idle. The data stream was quite bursty and would often attempt to transmit a burst of data after being idle for far longer than a retransmit timeout. In 7.x we had worked around this in the past by disabling RFC 3390 and jacking the slow start window size up via a sysctl. On 8.x this no longer worked. The solution I came up with was to add a new socket option to disable idle handling completely. That is, when an idle connection restarts with this new option enabled, it keeps its current congestion window and doesn't enter slow start. There are only a few cases where such an option is useful, but if anyone else thinks this might be useful I'd be happy to add the option to FreeBSD.

I think what you need is RFC 2861; however, you probably should ignore the application-limited period part of RFC 2861.

Hummm. It appears, btw, that Linux uses RFC 2861, but has a global knob to disable it due to applications having problems. When it is disabled, it doesn't decay the congestion window at all during idle handling. That is, it appears to act the same as if TCP_IGNOREIDLE were enabled. From http://www.kernel.org/doc/man-pages/online/pages/man7/tcp.7.html: tcp_slow_start_after_idle (Boolean; default: enabled; since Linux 2.6.18) If enabled, provide RFC 2861 behavior and time out the congestion window after an idle period. An idle period is defined as the current RTO (retransmission timeout). If disabled, the congestion window will not be timed out after an idle period. Also, in this thread on tcpm it appears no one on that list realizes that there are any implementations which follow the SHOULD in RFC 2581 for idle handling (which is what we do currently):

Nah, I don't think the idle detection in FreeBSD follows RFC2581/RFC5681 4.1 (the paragraph before the SHOULD). IMHO, that's probably why the author in the following email questioned the implementation of the SHOULD in RFC2581/RFC5681. http://www.ietf.org/mail-archive/web/tcpm/current/msg02864.html

So if we were to implement RFC 2861, the new socket option would be equivalent to setting Linux's 'tcp_slow_start_after_idle' to false, but on a per-socket basis rather than globally.

Agreed, a per-socket option could be more useful than global sysctls in certain situations. However, in addition to the per-socket option, could global sysctl nodes that disable idle_restart/idle_cwv help too?

Best Regards, sephe -- Tomorrow Will Never Die
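If such global knobs were added, they would presumably follow the VNET sysctl pattern the stack already uses. A hypothetical sketch only: neither knob exists in the tree, and the variable and node names are invented here.

    /*
     * Hypothetical global knob for idle restart; the name and the
     * variable are invented for illustration and exist nowhere.
     */
    static VNET_DEFINE(int, tcp_do_idle_restart) = 1;
    #define V_tcp_do_idle_restart   VNET(tcp_do_idle_restart)
    SYSCTL_VNET_INT(_net_inet_tcp, OID_AUTO, idle_restart, CTLFLAG_RW,
        &VNET_NAME(tcp_do_idle_restart), 0,
        "Reset the congestion window after an idle period (RFC 5681 4.1)");

The idle check in tcp_output.c would then be gated on V_tcp_do_idle_restart in the same way the patch below gates it on TF_IGNOREIDLE.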
[PATCH] Add a new TCP_IGNOREIDLE socket option
As I mentioned in an earlier thread, I recently had to debug an issue we were seeing across a link with a high bandwidth-delay product (both high bandwidth and high RTT). Our specific use case was to use a TCP connection to reliably forward a latency-sensitive datagram stream across a WAN connection. We would often see spikes in the latency of individual datagrams. I eventually tracked this down to the connection entering slow start when it would transmit data after being idle. The data stream was quite bursty and would often attempt to transmit a burst of data after being idle for far longer than a retransmit timeout. In 7.x we had worked around this in the past by disabling RFC 3390 and jacking the slow start window size up via a sysctl. On 8.x this no longer worked. The solution I came up with was to add a new socket option to disable idle handling completely. That is, when an idle connection restarts with this new option enabled, it keeps its current congestion window and doesn't enter slow start. There are only a few cases where such an option is useful, but if anyone else thinks this might be useful I'd be happy to add the option to FreeBSD.

Index: share/man/man4/tcp.4
===================================================================
--- share/man/man4/tcp.4	(revision 245742)
+++ share/man/man4/tcp.4	(working copy)
@@ -205,6 +205,18 @@
 in the
 .Sx MIB Variables
 section further down.
+.It Dv TCP_IGNOREIDLE
+If a TCP connection is idle for more than one retransmit timeout,
+it enters slow start when new data is available to transmit.
+This avoids flooding the network with a full window of traffic at line rate.
+It also allows the connection to adjust to changes to network conditions
+that occurred while the connection was idle. A connection that sends
+bursts of data separated by large idle periods can be permanently stuck in
+slow start as a result.
+The boolean option
+.Dv TCP_IGNOREIDLE
+disables the idle connection handling, allowing connections to maintain the
+existing congestion window when restarting after an idle period.
 .It Dv TCP_NODELAY
 Under most circumstances,
 .Tn TCP

Index: sys/netinet/tcp_var.h
===================================================================
--- sys/netinet/tcp_var.h	(revision 245742)
+++ sys/netinet/tcp_var.h	(working copy)
@@ -230,6 +230,7 @@
 #define	TF_NEEDFIN	0x000800	/* send FIN (implicit state) */
 #define	TF_NOPUSH	0x001000	/* don't push */
 #define	TF_PREVVALID	0x002000	/* saved values for bad rxmit valid */
+#define	TF_IGNOREIDLE	0x004000	/* connection is never idle */
 #define	TF_MORETOCOME	0x010000	/* More data to be appended to sock */
 #define	TF_LQ_OVERFLOW	0x020000	/* listen queue overflow */
 #define	TF_LASTIDLE	0x040000	/* connection was previously idle */

Index: sys/netinet/tcp_output.c
===================================================================
--- sys/netinet/tcp_output.c	(revision 245742)
+++ sys/netinet/tcp_output.c	(working copy)
@@ -206,7 +206,8 @@
 	 * to send, then transmit; otherwise, investigate further.
 	 */
 	idle = (tp->t_flags & TF_LASTIDLE) || (tp->snd_max == tp->snd_una);
-	if (idle && ticks - tp->t_rcvtime >= tp->t_rxtcur)
+	if (!(tp->t_flags & TF_IGNOREIDLE) &&
+	    idle && ticks - tp->t_rcvtime >= tp->t_rxtcur)
 		cc_after_idle(tp);
 	tp->t_flags &= ~TF_LASTIDLE;
 	if (idle) {

Index: sys/netinet/tcp.h
===================================================================
--- sys/netinet/tcp.h	(revision 245823)
+++ sys/netinet/tcp.h	(working copy)
@@ -156,6 +156,7 @@
 #define	TCP_NODELAY	1	/* don't delay send to coalesce packets */
 #if __BSD_VISIBLE
 #define	TCP_MAXSEG	2	/* set maximum segment size */
+#define	TCP_IGNOREIDLE	3	/* disable idle connection handling */
 #define	TCP_NOPUSH	4	/* don't push last block of write */
 #define	TCP_NOOPT	8	/* don't use TCP options */
 #define	TCP_MD5SIG	16	/* use MD5 digests (RFC2385) */

Index: sys/netinet/tcp_usrreq.c
===================================================================
--- sys/netinet/tcp_usrreq.c	(revision 245742)
+++ sys/netinet/tcp_usrreq.c	(working copy)
@@ -1354,6 +1354,7 @@
 	case TCP_NODELAY:
 	case TCP_NOOPT:
+	case TCP_IGNOREIDLE:
 		INP_WUNLOCK(inp);
 		error = sooptcopyin(sopt, &optval, sizeof optval,
 		    sizeof optval);
@@ -1368,6 +1369,9 @@
 		case TCP_NOOPT:
 			opt = TF_NOOPT;
 			break;
+		case TCP_IGNOREIDLE:
+			opt = TF_IGNOREIDLE;
+			break;
 		default:
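With the patch above applied, an application would request the behaviour the same way it sets TCP_NODELAY. A minimal usage sketch (TCP_IGNOREIDLE exists only with the patch, and the helper name is invented for illustration):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <err.h>

    /*
     * Open a TCP socket that keeps its congestion window across idle
     * periods.  Assumes the TCP_IGNOREIDLE patch is applied; the
     * option is not in any released FreeBSD.
     */
    int
    open_ignoreidle_socket(void)
    {
            int s, on = 1;

            s = socket(AF_INET, SOCK_STREAM, 0);
            if (s == -1)
                    err(1, "socket");
            /* Keep the current congestion window across idle periods. */
            if (setsockopt(s, IPPROTO_TCP, TCP_IGNOREIDLE, &on,
                sizeof(on)) == -1)
                    err(1, "setsockopt(TCP_IGNOREIDLE)");
            return (s);
    }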
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 1/22/13 12:11 PM, John Baldwin wrote:

As I mentioned in an earlier thread, I recently had to debug an issue we were seeing across a link with a high bandwidth-delay product (both high bandwidth and high RTT). Our specific use case was to use a TCP connection to reliably forward a latency-sensitive datagram stream across a WAN connection. We would often see spikes in the latency of individual datagrams. I eventually tracked this down to the connection entering slow start when it would transmit data after being idle. The data stream was quite bursty and would often attempt to transmit a burst of data after being idle for far longer than a retransmit timeout. In 7.x we had worked around this in the past by disabling RFC 3390 and jacking the slow start window size up via a sysctl. On 8.x this no longer worked. The solution I came up with was to add a new socket option to disable idle handling completely. That is, when an idle connection restarts with this new option enabled, it keeps its current congestion window and doesn't enter slow start. There are only a few cases where such an option is useful, but if anyone else thinks this might be useful I'd be happy to add the option to FreeBSD.

This looks good, but it almost sounds like a bug for TCP to be doing this anyhow. Why would one want this behavior? Wouldn't it make sense to keep the window large until there was a problem rather than unconditionally chopping it down? I almost think TCP is afraid that you might wind up swapping out a 10gig interface for a modem? I'm just not getting it (probably a simple oversight on my part). What do you think about also making this a sysctl for global on/off by default?

-Alfred
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On 22.01.2013 21:35, Alfred Perlstein wrote: On 1/22/13 12:11 PM, John Baldwin wrote:

As I mentioned in an earlier thread, I recently had to debug an issue we were seeing across a link with a high bandwidth-delay product (both high bandwidth and high RTT). Our specific use case was to use a TCP connection to reliably forward a latency-sensitive datagram stream across a WAN connection. We would often see spikes in the latency of individual datagrams. I eventually tracked this down to the connection entering slow start when it would transmit data after being idle. The data stream was quite bursty and would often attempt to transmit a burst of data after being idle for far longer than a retransmit timeout. In 7.x we had worked around this in the past by disabling RFC 3390 and jacking the slow start window size up via a sysctl. On 8.x this no longer worked. The solution I came up with was to add a new socket option to disable idle handling completely. That is, when an idle connection restarts with this new option enabled, it keeps its current congestion window and doesn't enter slow start. There are only a few cases where such an option is useful, but if anyone else thinks this might be useful I'd be happy to add the option to FreeBSD.

This looks good, but it almost sounds like a bug for TCP to be doing this anyhow.

It's not a bug. It's by design. It's required by the RFC.

Why would one want this behavior?

Network conditions change all the time. Traffic and congestion come and go. Connections can go idle for milliseconds to minutes to hours. Whenever enough time has passed, network capacity probing has to start anew.

Wouldn't it make sense to keep the window large until there was a problem rather than unconditionally chopping it down? I almost think TCP is afraid that you might wind up swapping out a 10gig interface for a modem? I'm just not getting it (probably a simple oversight on my part).

The very real fear is congestion meltdown. That is the reason we ended up with TCP's AIMD mechanism in the first place. If everybody were to blast into the network, everyone would suffer. The bufferbloat issue identified recently makes things even worse.

What do you think about also making this a sysctl for global on/off by default?

Please don't. The correct fix is either a) to use the initial window as the restart window (up to 10 MSS nowadays); or b) to use a decay mechanism based on the time since the last network condition probe. Even the latter must decay to initCWND within at most 1 MSL.

-- Andre
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On Tuesday, January 22, 2013 3:35:40 pm Alfred Perlstein wrote: On 1/22/13 12:11 PM, John Baldwin wrote:

As I mentioned in an earlier thread, I recently had to debug an issue we were seeing across a link with a high bandwidth-delay product (both high bandwidth and high RTT). Our specific use case was to use a TCP connection to reliably forward a latency-sensitive datagram stream across a WAN connection. We would often see spikes in the latency of individual datagrams. I eventually tracked this down to the connection entering slow start when it would transmit data after being idle. The data stream was quite bursty and would often attempt to transmit a burst of data after being idle for far longer than a retransmit timeout. In 7.x we had worked around this in the past by disabling RFC 3390 and jacking the slow start window size up via a sysctl. On 8.x this no longer worked. The solution I came up with was to add a new socket option to disable idle handling completely. That is, when an idle connection restarts with this new option enabled, it keeps its current congestion window and doesn't enter slow start. There are only a few cases where such an option is useful, but if anyone else thinks this might be useful I'd be happy to add the option to FreeBSD.

This looks good, but it almost sounds like a bug for TCP to be doing this anyhow. Why would one want this behavior? Wouldn't it make sense to keep the window large until there was a problem rather than unconditionally chopping it down? I almost think TCP is afraid that you might wind up swapping out a 10gig interface for a modem? I'm just not getting it (probably a simple oversight on my part). What do you think about also making this a sysctl for global on/off by default?

No, I think this is the proper default, and RFC 5681 makes this a SHOULD. The burst at line rate argument is a very good one. Normally if you have a stream of data, your data rate is clocked by the arrival of return ACKs (once you have filled the window), and slow start keeps you throttled at the beginning from flooding the pipe. However, if your connection becomes idle then you will accumulate a large number of ACKs and be able to spend them all at once when you get a burst of data to send. This burst can then use a higher effective bandwidth than the normal flow of traffic and could overwhelm a switch. Also, for the cases where this is most useful (high RTT), it is not at all unimaginable for network conditions to change dramatically. In my use case we have dedicated lines and control what goes across them, so we don't have to worry about that, but the general use case certainly needs to take that into account.

-- John Baldwin
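To put rough numbers on the burst argument (illustrative assumptions, not measurements): an ACK-clocked window drains over one RTT, while a post-idle window drains at line rate.

    #include <stdio.h>

    /*
     * Back-of-envelope for the "burst at line rate" concern.  The
     * window, link speed, and RTT are assumed figures chosen only to
     * illustrate the contrast.
     */
    int
    main(void)
    {
            const double window = 1e6;              /* 1 MB congestion window */
            const double line_rate = 10e9 / 8;      /* 10 Gb/s in bytes/s */
            const double rtt = 0.1;                 /* 100 ms WAN RTT */

            printf("ACK-clocked average rate: %.0f Mb/s\n",
                window / rtt * 8 / 1e6);
            printf("post-idle burst duration at line rate: %.2f ms\n",
                window / line_rate * 1e3);
            return (0);
    }

Under these assumptions the same megabyte that averages 80 Mb/s when ACK-clocked arrives at the bottleneck as a single 0.8 ms line-rate burst after idle, which is exactly the kind of burst that can overwhelm a shallow switch buffer.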
Re: [PATCH] Add a new TCP_IGNOREIDLE socket option
On Wed, Jan 23, 2013 at 4:11 AM, John Baldwin j...@freebsd.org wrote:

As I mentioned in an earlier thread, I recently had to debug an issue we were seeing across a link with a high bandwidth-delay product (both high bandwidth and high RTT). Our specific use case was to use a TCP connection to reliably forward a latency-sensitive datagram stream across a WAN connection. We would often see spikes in the latency of individual datagrams. I eventually tracked this down to the connection entering slow start when it would transmit data after being idle. The data stream was quite bursty and would often attempt to transmit a burst of data after being idle for far longer than a retransmit timeout. In 7.x we had worked around this in the past by disabling RFC 3390 and jacking the slow start window size up via a sysctl. On 8.x this no longer worked. The solution I came up with was to add a new socket option to disable idle handling completely. That is, when an idle connection restarts with this new option enabled, it keeps its current congestion window and doesn't enter slow start. There are only a few cases where such an option is useful, but if anyone else thinks this might be useful I'd be happy to add the option to FreeBSD.

I think what you need is RFC 2861; however, you probably should ignore the application-limited period part of RFC 2861.

Best Regards, sephe