Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-21 Thread Sepherosa Ziehau
On Wed, Feb 20, 2013 at 11:59 AM, Lawrence Stewart lstew...@freebsd.org wrote:
 Hi Sephe,

 On 02/20/13 13:37, Sepherosa Ziehau wrote:
 On Wed, Feb 20, 2013 at 9:46 AM, Lawrence Stewart lstew...@room52.net 
 wrote:
 *crickets chirping*

 Time to move this discussion forward...


 If any robust counter-arguments exist, now is the time for us to hear
 them. I haven't read anything thus far which convinces me that we should
 not provide knobs to tune our stack's dynamics.

 In the absence of any compelling counter-arguments, I would like to
 propose the following:

 - We rename the net.inet.tcp.experimental sysctl node introduced in
 r242266 for IW10 support to net.inet.tcp.nonstandard, and re-parent the
 initcwnd10 sysctl under this node.

 I should also add that I think initcwnd10 should be changed to initcwnd
 and take the number of segments as a value.

Yeah, I would suggest the same.


 - We introduce a new net.inet.tcp.nonstandard.allowed sysctl variable
 and default it to 0. Only when it is changed to 1 will we allow starkly
 non-standards-compliant behaviour to be enabled in the stack. As a more
 complex but expressive alternative, we can make the sysctl take a bitmask
 or CSV string which specifies which non-standard options the sysadmin
 permits (I'd prefer this as we can easily test non-standard options like
 IW10 in head without blanket enabling all non-standard behaviour).

 To be clear, my proposal is that specifying an allowed option in
 net.inet.tcp.nonstandard.allowed would not enable it as the default on
 all connections, but would allow the per-application mechanism we define
 to set the option. Setting net.inet.tcp.nonstandard.option_x to 1 would
 enable the option as default for all connections.

 - We introduce a new net.inet.tcp.nonstandard.noidlereset sysctl
 variable, and use it to enable/disable window-reset-after-idle behaviour
 as proposed by John.

 - We don't introduce a TCP_IGNOREIDLE sockopt, and instead introduce a
 more generic sockopt and/or mechanism for per-application tuning of all
 options which affect stack dynamics (both standard and non-standard
 options). I'm open to suggestions on what this could/should look like.
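
As a concrete illustration of the proposal, here is a minimal userland
sketch that queries and flips the gate. It assumes the
net.inet.tcp.nonstandard.allowed knob lands as a plain integer sysctl; the
name comes from the proposal above and exists in no tree yet:

#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdio.h>

int
main(void)
{
    int allowed, one = 1;
    size_t len = sizeof(allowed);

    /* Hypothetical knob from the proposal; not in any tree yet. */
    if (sysctlbyname("net.inet.tcp.nonstandard.allowed",
        &allowed, &len, NULL, 0) == -1)
        err(1, "sysctlbyname(read)");
    printf("non-standard options %sallowed\n", allowed ? "" : "not ");

    /* Flipping the gate requires root. */
    if (sysctlbyname("net.inet.tcp.nonstandard.allowed",
        NULL, NULL, &one, sizeof(one)) == -1)
        err(1, "sysctlbyname(write)");
    return (0);
}

The same sysctlbyname(3) calls would work against the per-option variables
(e.g. a renamed net.inet.tcp.nonstandard.initcwnd), should they materialise.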

 Lawrence,

 A route metric?  BTW, as for IW10, it could also become a route metric
 (as proposed by the draft author's presentation
 http://www.ietf.org/proceedings/79/slides/tcpm-0.pdf)

 Are you suggesting having the ability to set knobs as route metrics in
 addition to sysctl and a per-app mechanism? If so then I am very much in
 favour of this. Assuming an option has been allowed in
 net.inet.tcp.nonstandard.allowed, it should be able to be set by an
 application or on a route, perhaps with a precedence hierarchy of app
 request trumps route metric trumps system default setting?

I suggest using route metrics in addition to the global sysctls; route
metrics take precedence over global sysctls.  I don't object to the
per-socket settings though.  However, IMHO, these options (IW10 and
ignoring idle restart, and probably others) are administrative, so
applications probably should not mess with them.

Best Regards,
sephe

-- 
Tomorrow Will Never Die


Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-21 Thread Lawrence Stewart
On 02/21/13 20:20, Sepherosa Ziehau wrote:
 On Wed, Feb 20, 2013 at 11:59 AM, Lawrence Stewart lstew...@freebsd.org 
 wrote:
 Hi Sephe,

 On 02/20/13 13:37, Sepherosa Ziehau wrote:
 On Wed, Feb 20, 2013 at 9:46 AM, Lawrence Stewart lstew...@room52.net 
 wrote:
 *crickets chirping*

 Time to move this discussion forward...


 If any robust counter-arguments exist, now is the time for us to hear
 them. I haven't read anything thus far which convinces me that we should
 not provide knobs to tune our stack's dynamics.

 In the absence of any compelling counter-arguments, I would like to
 propose the following:

 - We rename the net.inet.tcp.experimental sysctl node introduced in
 r242266 for IW10 support to net.inet.tcp.nonstandard, and re-parent the
 initcwnd10 sysctl under this node.

 I should also add that I think initcwnd10 should be changed to initcwnd
 and take the number of segments as a value.
 
 Yeah, I would suggest the same.
 

 - We introduce a new net.inet.tcp.nonstandard.allowed sysctl variable
 and default it to 0. Only when it is changed to 1 will we allow starkly
 non-standards-compliant behaviour to be enabled in the stack. As a more
 complex but expressive alternative, we can make the sysctl take a bitmask
 or CSV string which specifies which non-standard options the sysadmin
 permits (I'd prefer this as we can easily test non-standard options like
 IW10 in head without blanket enabling all non-standard behaviour).

 To be clear, my proposal is that specifying an allowed option in
 net.inet.tcp.nonstandard.allowed would not enable it as the default on
 all connections, but would allow the per-application mechanism we define
 to set the option. Setting net.inet.tcp.nonstandard.option_x to 1 would
 enable the option as default for all connections.

 - We introduce a new net.inet.tcp.nonstandard.noidlereset sysctl
 variable, and use it to enable/disable window-reset-after-idle behaviour
 as proposed by John.

 - We don't introduce a TCP_IGNOREIDLE sockopt, and instead introduce a
 more generic sockopt and/or mechanism for per-application tuning of all
 options which affect stack dynamics (both standard and non-standard
 options). I'm open to suggestions on what this could/should look like.

 Lawrence,

 A route metric?  BTW, as for IW10, it could also become a route metric
 (as proposed by the draft author's presentation
 http://www.ietf.org/proceedings/79/slides/tcpm-0.pdf)

 Are you suggesting having the ability to set knobs as route metrics in
 addition to sysctl and a per-app mechanism? If so then I am very much in
 favour of this. Assuming an option has been allowed in
 net.inet.tcp.nonstandard.allowed, it should be able to be set by an
 application or on a route, perhaps with a precedence hierarchy of app
 request trumps route metric trumps system default setting?
 
 I suggest using route metrics in addition to the global sysctls;

Agreed.

 route metrics take precedence over global sysctls. 

Agreed.

 I don't object to the per-socket settings though.  However, IMHO, these
 options (IW10 and ignoring idle restart, and probably others) are
 administrative, so applications probably should not mess with them.

Messing with individual options like IW10 on a per-socket basis is
definitely in the "generally should not" basket, but I would not want to
stop an application from doing so, subject to the option being specified
by the administrator in the net.inet.tcp.nonstandard.allowed option list.

What I expect applications would want to do more frequently is hint the
socket with a higher-level goal, e.g. "I want maximum throughput", "I
want low latency", etc. This can come later though. I think we have
enough agreement on the basic infrastructure to move forward at this
point with some patches.

I would initially like to get the basic sysctl infrastructure to support
all this sorted, then look at supporting these options as route metrics,
and finally look at the higher level API.
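
To make the intended precedence concrete, here is a sketch of how a single
knob lookup might resolve. The helper and its names are purely illustrative
and come from no posted patch:

#define KNOB_UNSET    (-1)    /* no explicit setting at this level */

/*
 * Resolve one knob using the precedence discussed above: application
 * request trumps route metric trumps system default, all gated on the
 * administrator's "allowed" setting.
 */
int
knob_resolve(int sockopt_val, int route_val, int sysctl_default,
    int admin_allowed)
{
    if (!admin_allowed)
        return (0);             /* non-standard behaviour stays off */
    if (sockopt_val != KNOB_UNSET)
        return (sockopt_val);   /* per-application request wins */
    if (route_val != KNOB_UNSET)
        return (route_val);     /* then the route metric */
    return (sysctl_default);    /* finally the global sysctl */
}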

Anyone else with further input, please speak up!

Cheers,
Lawrence


Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-20 Thread John Baldwin
On Tuesday, February 19, 2013 9:37:54 pm Sepherosa Ziehau wrote:
 John,
 
 I came across this draft several days ago, you may be interested:
 http://tools.ietf.org/html/draft-ietf-tcpm-newcwv-00

Yes, that is extremely relevant.  My application does use its own
rate-limiting.  And now that I've read this in full, this does seem
very much to be what I want and is a better solution than ignoring
idle handling entirely.  Ironic that this was posted a few weeks after my
patch. :)  Clearly this is not an isolated workflow.

-- 
John Baldwin


Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-19 Thread Sepherosa Ziehau
On Wed, Feb 20, 2013 at 9:46 AM, Lawrence Stewart lstew...@room52.net wrote:
 *crickets chirping*

 Time to move this discussion forward...


 If any robust counter-arguments exist, now is the time for us to hear
 them. I haven't read anything thus far which convinces me that we should
 not provide knobs to tune our stack's dynamics.

 In the absence of any compelling counter-arguments, I would like to
 propose the following:

 - We rename the net.inet.tcp.experimental sysctl node introduced in
 r242266 for IW10 support to net.inet.tcp.nonstandard, and re-parent the
 initcwnd10 sysctl under this node.

 - We introduce a new net.inet.tcp.nonstandard.allowed sysctl variable
 and default it to 0. Only when it is changed to 1 will we allow starkly
 non-standards-compliant behaviour to be enabled in the stack. As a more
 complex but expressive alternative, we can make the sysctl take a bitmask
 or CSV string which specifies which non-standard options the sysadmin
 permits (I'd prefer this as we can easily test non-standard options like
 IW10 in head without blanket enabling all non-standard behaviour).

 - We introduce a new net.inet.tcp.nonstandard.noidlereset sysctl
 variable, and use it to enable/disable window-reset-after-idle behaviour
 as proposed by John.

 - We don't introduce a TCP_IGNOREIDLE sockopt, and instead introduce a
 more generic sockopt and/or mechanism for per-application tuning of all
 options which affect stack dynamics (both standard and non-standard
 options). I'm open to suggestions on what this could/should look like.

Lawrence,

A route metric?  BTW, as for IW10, it could also become a route metric
(as proposed by the draft author's presentation
http://www.ietf.org/proceedings/79/slides/tcpm-0.pdf)


John,

I came across this draft several days ago, you may be interested:
http://tools.ietf.org/html/draft-ietf-tcpm-newcwv-00

This one is a bit old, but it is still interesting to read (cited by
the above draft):
http://tools.ietf.org/html/draft-hughes-restart-00


Best Regards,
sephe

--
Tomorrow Will Never Die


Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-13 Thread Lawrence Stewart
FYI I've read the whole thread as of this reply and plan to follow up to
a few of the other posts separately, but first for my initial thoughts...

On 01/23/13 07:11, John Baldwin wrote:
 As I mentioned in an earlier thread, I recently had to debug an issue we
 were seeing across a link with a high bandwidth-delay product (both high
 bandwidth and high RTT).  Our specific use case was to use a TCP connection
 to reliably forward a latency-sensitive datagram stream across a WAN
 connection.  We would often see spikes in the latency of individual
 datagrams.  I eventually tracked this down to the connection entering slow
 start when it would transmit data after being idle.  The data stream was
 quite bursty and would often attempt to transmit a burst of data after
 being idle for far longer than a retransmit timeout.

Got it.

 In 7.x we had worked around this in the past by disabling RFC 3390 and
 jacking the slow start window size up via a sysctl.  On 8.x this no longer
 worked.

I can't think of, nor have I read any convincing argument why we
shouldn't support your use case out of the box. You're not the only user
of FreeBSD over dedicated lines who knows what you're doing. We should
provide some way to support this use case.

We're therefore left with the question of how to implement this.

As noted in the "Some questions about the new TCP congestion control
code" thread [1], it was always my intention to axe the ss_flightsize
variables and replace them with a better mechanism. Andre swung the axe
before I did and 10.x is looming so it's a good time to discuss all of this.

 The solution I came up with was to add a new socket option to disable idle 
 handling completely.  That is, when an idle connection restarts with this new 
 option enabled, it keeps its current congestion window and doesn't enter slow 
 start.
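
For reference, the application-facing side of the patch is a simple boolean
socket option. A usage sketch follows; note that TCP_IGNOREIDLE is not
defined in stock FreeBSD headers, so the value below is only a placeholder
for whatever the patch actually assigns:

#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <err.h>

#ifndef TCP_IGNOREIDLE
#define TCP_IGNOREIDLE    0x8000    /* placeholder; the real value is the patch's */
#endif

int
open_bursty_stream(void)
{
    int s, one = 1;

    if ((s = socket(AF_INET, SOCK_STREAM, 0)) == -1)
        err(1, "socket");
    /* Keep cwnd across idle periods instead of re-entering slow start. */
    if (setsockopt(s, IPPROTO_TCP, TCP_IGNOREIDLE, &one,
        sizeof(one)) == -1)
        err(1, "setsockopt(TCP_IGNOREIDLE)");
    return (s);
}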

rwatson@ mentioned an idea in private discussion which I've also thought
about over the years. The real goal here should be to subsume your use
case (and others) into a much richer framework for hinting desired
behaviour/tradeoff preferences (some aspects of which relate to parts of
my PhD work, which will hopefully be coming to a kernel near you in 2013 ;).

My main concern with your patch is that I'm a bit uneasy about
enshrining a socket option in a public API and documentation that is so
specific. I suspect apps probably want to set higher level goals like
"low latency *at any cost*" and have the stack opaquely interpret that
as "this guy is willing to blow his foot off, so let's disable idle
window reset, tweak X, disable Y and hand the man his loaded shotgun".
TCP_IGNOREIDLE as currently proposed misses this bigger picture, though
doesn't preclude it either.

I would also echo Kevin/Grenville's thoughts about keying the socket
option's activation off a tunable (sysctl or kernel option is up for
discussion, though I'd be leaning towards sysctl) that is disabled by
default, i.e. only skip the after-idle window reset if the app sets the
option *and* the sysadmin has pulled the "I like me some bursty network"
lever.

 There are only a few cases where such an option is useful, but if anyone else 
 thinks this might be useful I'd be happy to add the option to FreeBSD.

The idea is useful. I'd just like to discuss the implementation
specifics a little further before recommending whether the patch should
go in as is to provide a stop gap, or we rework the patch to be a little
less specific in readiness for the future work I have in mind.

Cheers,
Lawrence

[1] http://lists.freebsd.org/pipermail/freebsd-net/2013-January/034297.html


Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-13 Thread Andre Oppermann

On 13.02.2013 09:25, Lawrence Stewart wrote:

FYI I've read the whole thread as of this reply and plan to follow up to
a few of the other posts separately, but first for my initial thoughts...

On 01/23/13 07:11, John Baldwin wrote:

As I mentioned in an earlier thread, I recently had to debug an issue we were
seeing across a link with a high bandwidth-delay product (both high bandwidth
and high RTT).  Our specific use case was to use a TCP connection to reliably
forward a latency-sensitive datagram stream across a WAN connection.  We would
often see spikes in the latency of individual datagrams.  I eventually tracked
this down to the connection entering slow start when it would transmit data
after being idle.  The data stream was quite bursty and would often attempt to
transmit a burst of data after being idle for far longer than a retransmit
timeout.


Got it.


In 7.x we had worked around this in the past by disabling RFC 3390 and jacking
the slow start window size up via a sysctl.  On 8.x this no longer worked.


I can't think of, nor have I read any convincing argument why we
shouldn't support your use case out of the box. You're not the only user
of FreeBSD over dedicated lines who knows what you're doing. We should
provide some way to support this use case.

We're therefore left with the question of how to implement this.

As noted in the "Some questions about the new TCP congestion control
code" thread [1], it was always my intention to axe the ss_flightsize
variables and replace them with a better mechanism. Andre swung the axe
before I did and 10.x is looming so it's a good time to discuss all of this.


The solution I came up with was to add a new socket option to disable idle
handling completely.  That is, when an idle connection restarts with this new
option enabled, it keeps its current congestion window and doesn't enter slow
start.


rwatson@ mentioned an idea in private discussion which I've also thought
about over the years. The real goal here should be to subsume your use
case (and others) into a much richer framework for hinting desired
behaviour/tradeoff preferences (some aspects of which relate to parts of
my PhD work, which will hopefully be coming to a kernel near you in 2013 ;).

My main concern with your patch is that I'm a bit uneasy about
enshrining a socket option in a public API and documentation that is so
specific. I suspect apps probably want to set higher level goals like
"low latency *at any cost*" and have the stack opaquely interpret that
as "this guy is willing to blow his foot off, so let's disable idle
window reset, tweak X, disable Y and hand the man his loaded shotgun".
TCP_IGNOREIDLE as currently proposed misses this bigger picture, though
doesn't preclude it either.

I would also echo Kevin/Grenville's thoughts about keying the socket
option's activation off a tunable (sysctl or kernel option is up for
discussion, though I'd be leaning towards sysctl) that is disabled by
default, i.e. only skip the after-idle window reset if the app sets the
option *and* the sysadmin has pulled the "I like me some bursty network"
lever.


There are only a few cases where such an option is useful, but if anyone else
thinks this might be useful I'd be happy to add the option to FreeBSD.


The idea is useful. I'd just like to discuss the implementation
specifics a little further before recommending whether the patch should
go in as is to provide a stop gap, or we rework the patch to be a little
less specific in readiness for the future work I have in mind.


Again I'd like to point out that this sort of modification should
be implemented as a congestion control module.  All the hook points
are already there and can readily be used instead of adding more special
cases to the generic part of TCP.  The CC algorithm can be selected per
socket.  For such a special CC module it'd get a nice fat warning that
it is not suitable for Internet use.

Additionally I speculate that for the use-case of John he may also be
willing to forgo congestion avoidance and always operate in (ill-named)
slow start mode.  With a special CC module this can easily be tweaked.
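
A minimal sketch of what such a module could look like against the cc(9)
framework: it borrows NewReno's hooks and makes after_idle a no-op, so the
congestion window survives idle periods. Hook names follow the 9.x-era
headers; a real module would need proper review:

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/module.h>

#include <netinet/cc.h>
#include <netinet/cc/cc_module.h>

static int    ignoreidle_mod_init(void);
static void   ignoreidle_after_idle(struct cc_var *ccv);

struct cc_algo ignoreidle_cc_algo = {
    .name = "ignoreidle",
    .mod_init = ignoreidle_mod_init,
    .after_idle = ignoreidle_after_idle,
};

static int
ignoreidle_mod_init(void)
{
    /* Inherit NewReno's behaviour for everything except idle restarts. */
    ignoreidle_cc_algo.ack_received = newreno_cc_algo.ack_received;
    ignoreidle_cc_algo.cong_signal = newreno_cc_algo.cong_signal;
    ignoreidle_cc_algo.post_recovery = newreno_cc_algo.post_recovery;
    return (0);
}

static void
ignoreidle_after_idle(struct cc_var *ccv)
{
    /* Deliberately empty: do not reset cwnd after an idle period. */
}

DECLARE_CC_MODULE(ignoreidle, &ignoreidle_cc_algo);

A connection could then opt in with the existing TCP_CONGESTION socket
option, passing the name "ignoreidle".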

--
Andre



Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-13 Thread Lawrence Stewart
On 02/08/13 07:04, George Neville-Neil wrote:
 
 On Feb 6, 2013, at 12:28 , Alfred Perlstein bri...@mu.org wrote:
 
 On 2/6/13 4:46 AM, John Baldwin wrote:
 On Wednesday, February 06, 2013 6:27:04 am Randall Stewart wrote:
 John:

 A burst at line rate will *often* cause drops. This is because
 router queues are at a finite size. Also such a burst (especially
 on a long-delay bandwidth network) causes your RTT to increase even
 if there is no drop, which is going to hurt you as well.

 A SHOULD in an RFC says you really really really really need to do it
 unless there is something that makes you willing to override it. It is
 slight wiggle room.

 In this I agree with Andre, we should not be *not* doing it. Otherwise
 folks will be turning this on and it is plain wrong. It may be fine
 for your network but I would not want to see it in FreeBSD.

 In my testing here at home I have put back into our stack max-burst. This
 uses Mark Allman's version (not Kacheong Poon's) where you clamp the cwnd
 at no more than 4 packets larger than your flight. All of my testing,
 high-bw-delay or LAN, has shown this to improve TCP performance. This is
 because it helps you avoid bursting out so many packets that you overflow
 a queue.

 In your long-delay bw link if you do burst out too many (and you never
 know how many that is since you can not predict how full all those MPLS
 queues are or how big they are) you will really hurt yourself even worse.
 Note that generally in Cisco routers the default queue size is somewhere
 between 100-300 packets depending on the router.
 Due to the way our application works this never happens, but I am fine with
 just keeping this patch private.  If there are other shops that need this
 they can always dig the patch up from the archives.

 This is yet another time when I'm sad about how things happen in FreeBSD.

 A developer comes forward with a non-default option that's very useful for
 some specific workloads, specifically one who contributes much time and $$$
 to the project, and the community rejects the patches even though the
 approach has been successful in other OSes.

 It makes zero sense.

 John, can you repost the patch?  Maybe there is a way to refactor this 
 somehow so it's like accept filters where we can plug in a hook for TCP?

 I am very disappointed, but not surprised.

 
 I take away the complete opposite feeling.  This is how we work through
 these issues.  It's clear from the discussion that this need not be a
 default in the system, and is a special case.  We had a reasoned discussion
 of what would be best to do, and at least two experts in TCP weighed in on
 the effect this change might have.

 Not everything proposed by a developer need go into the tree; in
 particular, since these discussions are archived we can always revisit
 this later.

 This is exactly how collaborative development should look, whether or not
 the patch is integrated now, next week, next year, or ever.

+1

Whilst I would argue that some red herrings have been put forward in
this thread, its progression is far from disappointing IMO. This is a
sensitive area that requires careful scrutiny, independent of what our
peers working on other OSes have decided is best for them.

Cheers,
Lawrence


Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-13 Thread Lawrence Stewart
On 02/10/13 16:05, Kevin Oberman wrote:
 On Sat, Feb 9, 2013 at 6:41 AM, Alfred Perlstein bri...@mu.org wrote:
 On 2/7/13 12:04 PM, George Neville-Neil wrote:

 On Feb 6, 2013, at 12:28 , Alfred Perlstein bri...@mu.org wrote:

 On 2/6/13 4:46 AM, John Baldwin wrote:

 On Wednesday, February 06, 2013 6:27:04 am Randall Stewart wrote:

 John:

 A burst at line rate will *often* cause drops. This is because
 router queues are at a finite size. Also such a burst (especially
 on a long-delay bandwidth network) causes your RTT to increase even
 if there is no drop, which is going to hurt you as well.

 A SHOULD in an RFC says you really really really really need to do it
 unless there is something that makes you willing to override it. It is
 slight wiggle room.

 In this I agree with Andre, we should not be *not* doing it. Otherwise
 folks will be turning this on and it is plain wrong. It may be fine
 for your network but I would not want to see it in FreeBSD.

 In my testing here at home I have put back into our stack max-burst. This
 uses Mark Allman's version (not Kacheong Poon's) where you clamp the cwnd
 at no more than 4 packets larger than your flight. All of my testing,
 high-bw-delay or LAN, has shown this to improve TCP performance. This is
 because it helps you avoid bursting out so many packets that you overflow
 a queue.

 In your long-delay bw link if you do burst out too many (and you never
 know how many that is since you can not predict how full all those MPLS
 queues are or how big they are) you will really hurt yourself even worse.
 Note that generally in Cisco routers the default queue size is somewhere
 between 100-300 packets depending on the router.

 Due to the way our application works this never happens, but I am fine
 with just keeping this patch private.  If there are other shops that need
 this they can always dig the patch up from the archives.

 This is yet another time when I'm sad about how things happen in FreeBSD.

 A developer comes forward with a non-default option that's very useful for
 some specific workloads, specifically one who contributes much time and $$$
 to the project, and the community rejects the patches even though the
 approach has been successful in other OSes.

 It makes zero sense.

 John, can you repost the patch?  Maybe there is a way to refactor this
 somehow so it's like accept filters where we can plug in a hook for TCP?

 I am very disappointed, but not surprised.

 I take away the complete opposite feeling.  This is how we work through
 these issues.  It's clear from the discussion that this need not be a
 default in the system, and is a special case.  We had a reasoned discussion
 of what would be best to do, and at least two experts in TCP weighed in on
 the effect this change might have.

 Not everything proposed by a developer need go into the tree; in
 particular, since these discussions are archived we can always revisit
 this later.

 This is exactly how collaborative development should look, whether or not
 the patch is integrated now, next week, next year, or ever.


 I agree that discussion is great, we have all learned quite a bit from it
 about TCP and the dangers of adjusting buffering without considerable
 thought.  I would not be involved in FreeBSD had this type of discussion
 and information not been discussed on the lists so readily.

 However, the end result must be far different than what has occurred so far.

 If the code was deemed unacceptable for general inclusion, then we must find
 a way to provide a light framework to accomplish the needs of the community
 member.

 Take for instance someone who is starting a company that needs this
 facility.  Which OS will they choose?  One who has integrated a useful
 feature?  Or one who has rejected it and left that code in the mailing list
 archives?

 As much as expert opinion is valuable, it must include understanding and
 need of handling special cases and the ability to facilitate those special
 cases for our users and developers.
 
 This is a subject rather near to my heart, having fought battles with
 congestion back in the dark days of Windows when it essentially
 defaulted to TCP_IGNOREIDLE. It was a huge pain, but it was the only
 way Windows did TCP in the early days. It simply did not implement
 slow-start. This was really evil, but in the days when lots of links
 were 56K and T-1 was mostly used for network core links, the Internet,
 small as it was back then, did not melt, though it glowed a
 frightening shade of red fairly often. Today too many systems running
 like this would melt things very quickly.
 
 OTOH, I can certainly see cases, like John's, where it would be very
 beneficial. And, yes, Linux has it. (I don't see this as relevant in
 any way except as proof that not enough people have turned it on to
 cause serious problems... yet!) It seems a shame to make everyone who
 really has a need develop their own patches or dig through old mail to
 find John's.
 
 What I would like to see is a way to have it available, but make it
 unlikely to be enabled except in a way that would put up flashing red
 warnings and sound sirens to warn people that it is very dangerous and
 can be a way to blow off a few of one's own toes.

Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-13 Thread Lawrence Stewart
On 02/13/13 21:27, Andre Oppermann wrote:
 On 13.02.2013 09:25, Lawrence Stewart wrote:
 FYI I've read the whole thread as of this reply and plan to follow up to
 a few of the other posts separately, but first for my initial thoughts...

 On 01/23/13 07:11, John Baldwin wrote:
 As I mentioned in an earlier thread, I recently had to debug an issue we
 were seeing across a link with a high bandwidth-delay product (both high
 bandwidth and high RTT).  Our specific use case was to use a TCP connection
 to reliably forward a latency-sensitive datagram stream across a WAN
 connection.  We would often see spikes in the latency of individual
 datagrams.  I eventually tracked this down to the connection entering slow
 start when it would transmit data after being idle.  The data stream was
 quite bursty and would often attempt to transmit a burst of data after
 being idle for far longer than a retransmit timeout.

 Got it.

 In 7.x we had worked around this in the past by disabling RFC 3390 and
 jacking the slow start window size up via a sysctl.  On 8.x this no longer
 worked.

 I can't think of, nor have I read any convincing argument why we
 shouldn't support your use case out of the box. You're not the only user
 of FreeBSD over dedicated lines who knows what you're doing. We should
 provide some way to support this use case.

 We're therefore left with the question of how to implement this.

 As noted in the "Some questions about the new TCP congestion control
 code" thread [1], it was always my intention to axe the ss_flightsize
 variables and replace them with a better mechanism. Andre swung the axe
 before I did and 10.x is looming so it's a good time to discuss all of
 this.

 The solution I came up with was to add a new socket option to disable idle
 handling completely.  That is, when an idle connection restarts with this
 new option enabled, it keeps its current congestion window and doesn't
 enter slow start.

 rwatson@ mentioned an idea in private discussion which I've also thought
 about over the years. The real goal here should be to subsume your use
 case (and others) into a much richer framework for hinting desired
 behaviour/tradeoff preferences (some aspects of which relate to parts of
 my PhD work, which will hopefully be coming to a kernel near you in
 2013 ;).

 My main concern with your patch is that I'm a bit uneasy about
 enshrining a socket option in a public API and documentation that is so
 specific. I suspect apps probably want to set higher level goals like
 "low latency *at any cost*" and have the stack opaquely interpret that
 as "this guy is willing to blow his foot off, so let's disable idle
 window reset, tweak X, disable Y and hand the man his loaded shotgun".
 TCP_IGNOREIDLE as currently proposed misses this bigger picture, though
 doesn't preclude it either.

 I would also echo Kevin/Grenville's thoughts about keying the socket
 option's activation off a tunable (sysctl or kernel option is up for
 discussion, though I'd be leaning towards sysctl) that is disabled by
 default, i.e. only skip the after-idle window reset if the app sets the
 option *and* the sysadmin has pulled the "I like me some bursty network"
 lever.

 There are only a few cases where such an option is useful, but if anyone
 else thinks this might be useful I'd be happy to add the option to FreeBSD.

 The idea is useful. I'd just like to discuss the implementation
 specifics a little further before recommending whether the patch should
 go in as is to provide a stop gap, or we rework the patch to be a little
 less specific in readiness for the future work I have in mind.
 
 Again I'd like to point out that this sort of modification should
 be implemented as a congestion control module.  All the hook points
 are already there and can readily be used instead of adding more special
 cases to the generic part of TCP.  The CC algorithm can be selected per
 socket.  For such a special CC module it'd get a nice fat warning that
 it is not suitable for Internet use.

As a local hack, sure, a CC module would do the job assuming you were
happy to use a single algorithm as the base. John's patch transcends the
algorithm in use on a particular connection, so it has wider
applicability than a CC module.

I would also strongly oppose the inclusion of such a module in FreeBSD
proper - it's the wrong way to implement the functionality. The patch as
posted is technically appropriate, though I'm interested in discussing
whether the public API should be tweaked to capture higher level goals
instead, e.g. "low delay at all costs" or "maximum throughput".

We could initially map "low delay at all costs" to a TCP stack meaning
of "disable idle window reset" and expand the meaning later (e.g.
relaxing the silly window checks as briefly discussed in the other thread).
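
Sketching that hypothetical higher-level API in C, purely for discussion
(no such option or helper exists in any tree; every name here is invented):

/* Invented goal-hint option and values; nothing like this exists yet. */
#define TCP_GOALHINT        0x9000
#define TCP_GOAL_DEFAULT    0
#define TCP_GOAL_LOWDELAY   1   /* e.g. maps to "no idle cwnd reset" */
#define TCP_GOAL_THROUGHPUT 2

/*
 * In-stack mapping sketch: translate the application's stated goal into
 * the individual knobs the administrator has allowed.
 */
void
tcp_apply_goalhint(int goal, int *noidlereset, int *initcwnd_segs)
{
    switch (goal) {
    case TCP_GOAL_LOWDELAY:
        *noidlereset = 1;       /* keep cwnd across idle periods */
        break;
    case TCP_GOAL_THROUGHPUT:
        *initcwnd_segs = 10;    /* e.g. opt in to IW10 */
        break;
    default:
        break;                  /* leave stack defaults alone */
    }
}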

 Additionally I speculate that for the use-case of John he may also be
 willing to forgo congestion avoidance and always operate in (ill-named)
 slow start mode.  With a special CC module this can easily be tweaked.

Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-13 Thread Andre Oppermann

On 13.02.2013 15:26, Lawrence Stewart wrote:

On 02/13/13 21:27, Andre Oppermann wrote:

On 13.02.2013 09:25, Lawrence Stewart wrote:

The idea is useful. I'd just like to discuss the implementation
specifics a little further before recommending whether the patch should
go in as is to provide a stop gap, or we rework the patch to be a little
less specific in readiness for the future work I have in mind.


Again I'd like to point out that this sort of modification should
be implemented as a congestion control module.  All the hook points
are already there and can readily be used instead of adding more special
cases to the generic part of TCP.  The CC algorithm can be selected per
socket.  For such a special CC module it'd get a nice fat warning that
it is not suitable for Internet use.


As a local hack, sure, a CC module would do the job assuming you were
happy to use a single algorithm as the base. John's patch transcends the
algorithm in use on a particular connection, so it has wider
applicability than a CC module.


The algorithm is becoming somewhat meaningless when your goal is to
have an open pipe and push data as fast as possible without regard
to other traffic.  NewReno, Cubic and what have you are becoming
meaningless.


I would also strongly oppose the inclusion of such a module in FreeBSD
proper - it's the wrong way to implement the functionality. The patch as
posted is technically appropriate, though I'm interested in discussing
whether the public API should be tweaked to capture higher level goals
instead, e.g. "low delay at all costs" or "maximum throughput".


I strongly disagree.  The patch is a hack.  From the description John
gave on his use-case I read that he would actually take more than just
ignoring idle-cwnd-reset.  And actually if I were in his situation I
would use a very aggressive congestion control algorithm doing away with
more than idle-cwnd-reset.


We could initially map "low delay at all costs" to a TCP stack meaning
of "disable idle window reset" and expand the meaning later (e.g.
relaxing the silly window checks as briefly discussed in the other thread).


Ugh, if you go that far fork it, obtain a fresh protocol number and don't
call it TCP anymore.


Additionally I speculate that for the use-case of John he may also be
willing to forgo congestion avoidance and always operate in (ill-named)
slow start mode.  With a special CC module this can easily be tweaked.


John already has the functionality he needs in this local tree - this
discussion is no longer about John per se, but rather about other people
who may want the functionality John has implemented.


That's what I'm worried most about.  So far no other real people have
spoken out, only cheering from the sidelines.


We need to figure out how to provide the functionality in FreeBSD
proper, and a CC module is not the answer.


I totally disagree.  This functionality (removal) is not at all a part
of TCP and should not be supported directly.

--
Andre



Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-13 Thread Adrian Chadd
On 13 February 2013 02:27, Andre Oppermann an...@freebsd.org wrote:

 Again I'd like to point out that this sort of modification should
 be implemented as a congestion control module.  All the hook points
 are already there and can readily be used instead of adding more special
 cases to the generic part of TCP.  The CC algorithm can be selected per
 socket.  For such a special CC module it'd get a nice fat warning that
 it is not suitable for Internet use.

 Additionally I speculate that for the use-case of John he may also be
 willing to forgo congestion avoidance and always operate in (ill-named)
 slow start mode.  With a special CC module this can easily be tweaked.

There are some cute things that could be done here - e.g., having an L3
route table entry map to a congestion control algorithm (like having an
MSS in the L3 entry too).

But I'd love to see some modelling / data showing competing congestion
control algorithms on the same set of congested pipes. Doubly so on
multiple congested pipes (i.e., modelling a handful of parallel
user-last-mile-IX-various transit feeds with different levels of
congestion/RTT-IX-last mile-user connections). You all know much
more about this than I do. :-)

Thanks,


Adrian


Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-13 Thread Adrian Chadd
.. and I should say, competing / parallel congestion algorithms, i.e.
how multiple CCs work for/against each other on the same internet
at the same time.


Adrian


Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-13 Thread Lawrence Stewart
On 02/14/13 01:48, Andre Oppermann wrote:
 On 13.02.2013 15:26, Lawrence Stewart wrote:
 On 02/13/13 21:27, Andre Oppermann wrote:
 On 13.02.2013 09:25, Lawrence Stewart wrote:
 The idea is useful. I'd just like to discuss the implementation
 specifics a little further before recommending whether the patch should
 go in as is to provide a stop gap, or we rework the patch to be a
 little
 less specific in readiness for the future work I have in mind.

 Again I'd like to point out that this sort of modification should
 be implemented as a congestion control module.  All the hook points
 are already there and can readily be used instead of adding more special
 cases to the generic part of TCP.  The CC algorithm can be selected per
 socket.  For such a special CC module it'd get a nice fat warning that
 it is not suitable for Internet use.

 As a local hack, sure, a CC module would do the job assuming you were
 happy to use a single algorithm as the base. John's patch transcends the
 algorithm in use on a particular connection, so it has wider
 applicability than a CC module.
 
 The algorithm is becoming somewhat meaningless when your goal is to
 have an open pipe and push data as fast as possible without regard
 to other traffic.  NewReno, Cubic and what have you are becoming
 meaningless.

But that's not the goal. We're not discussing unbounded or unreactive
congestion windows. If a burst causes drops, we still back off. The
algorithm does still matter.

 I would also strongly oppose the inclusion of such a module in FreeBSD
 proper - it's the wrong way to implement the functionality. The patch as
 posted is technically appropriate, though I'm interested in discussing
 whether the public API should be tweaked to capture higher level goals
 instead, e.g. "low delay at all costs" or "maximum throughput".
 
 I strongly disagree.  The patch is a hack. 

I agree it's hacky in its current form, but for different reasons than
yours, as outlined in my previous email. You are arguing that idle window
resetting is an intrinsic and non-negotiable part of TCP. This is
demonstrably not true.

As long as something doesn't change the wire format, then it is fair
game for being tunable. How we make something tunable and what we set as
defaults are where we need to be conservative.

 From the description John gave on his use-case I read that he would actually 
 take more than just
 ignoring idle-cwnd-reset.  And actually if I were in his situation I
 would use a very aggressive congestion control algorithm doing away with
 more than idle-cwnd-reset.

Congestion control is only one aspect of what we're discussing.

 We could initially map "low delay at all costs" to a TCP stack meaning
 of "disable idle window reset" and expand the meaning later (e.g.
 relaxing the silly window checks as briefly discussed in the other
 thread).
 
 Ugh, if you go that far fork it, obtain a fresh protocol number and don't
 call it TCP anymore.

You're channelling Joe Touch ;)

What exactly is TCP? As far as interop is concerned, it's just a wire
protocol - as long as I format my headers/segments correctly and ignore
options I don't understand, I can communicate with other TCP stacks,
many of which implement a different set of TCP features and options.

The dynamics of the protocol have evolved significantly over time and
continue to do so because of its ubiquity - it flows freely across the
public internet and gets used for all manner of things it wasn't
initially designed to handle (well). A lot of the dynamics are also
controlled by optional parameters.

So no, we don't need a new protocol number. We need to provide knobs
that allow people to tune TCP dynamics to their particular use case.

 Additionally I speculate that for the use-case of John he may also be
 willing to forgo congestion avoidance and always operate in (ill-named)
 slow start mode.  With a special CC module this can easily be tweaked.

 John already has the functionality he needs in this local tree - this
 discussion is no longer about John per se, but rather about other people
 who may want the functionality John has implemented.
 
 That's what I'm worried most about.  So far no other real people have
 spoken out, only cheering from the sidelines.

We surely don't need them to speak out explicitly - the use case is not
so obscure that I have difficulty imagining other places it would be
useful.

 We need to figure out how to provide the functionality in FreeBSD
 proper, and a CC module is not the answer.
 
 I totally disagree.  This functionality (removal) is not at all a part
 of TCP and should not be supported directly.

I don't understand how you can argue that idle window resetting is an
intrinsic and non-negotiable part of TCP. There is no one true set of
options and features that "is TCP". It is not only one idea.

Let's work on providing a rich set of knobs to tune every aspect of our
TCP stack's dynamics and operation that don't break wire format, set
conservative defaults 

Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-13 Thread Lawrence Stewart
On 02/14/13 05:37, Adrian Chadd wrote:
 On 13 February 2013 02:27, Andre Oppermann an...@freebsd.org wrote:
 
 Again I'd like to point out that this sort of modification should
 be implemented as a congestion control module.  All the hook points
 are already there and can readily be used instead of adding more special
 cases to the generic part of TCP.  The CC algorithm can be selected per
 socket.  For such a special CC module it'd get a nice fat warning that
 it is not suitable for Internet use.

 Additionally I speculate that for the use-case of John he may also be
 willing to forgo congestion avoidance and always operate in (ill-named)
 slow start mode.  With a special CC module this can easily be tweaked.
 
 There are some cute things that could be done here - e.g., having an L3
 route table entry map to a congestion control algorithm (like having an
 MSS in the L3 entry too).

This is an area I've thought about and would form the basis for an
interesting applied research project. On a related tangent, we (CAIA)
also have some ongoing research looking at using different CC algorithms
per subflow of a multipath TCP connection.

 But I'd love to see some modelling / data showing competing congestion
 control algorithms on the same set of congested pipes. Doubly so on
 multiple congested pipes (i.e., modelling a handful of parallel
 user-last-mile-IX-various transit feeds with different levels of
 congestion/RTT-IX-last mile-user connections). You all know much
 more about this than I do. :-)

There is quite a bit of relevant literature out there. You could start
with some of the stuff CAIA has had a hand in (e.g. [1]) and follow the
citation trail from there...

Cheers,
Lawrence

[1] http://caia.swin.edu.au/urp/newtcp/papers.html


Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-12 Thread Andrey Zonov
On 2/11/13 3:18 PM, Andre Oppermann wrote:
 
 Smaller RTO (1s) has become an RFC, so there was very broad consensus in
 TCPM that it is a good thing.  We don't have it yet because we were not fully
 compliant in one case (loss of first segment).  I've fixed that a while
 back and will bring 1s RTO soon to HEAD.
 

They use 300ms, at least for me/my link/ISP/etc.

-- 
Andrey Zonov





Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-12 Thread Andre Oppermann

On 12.02.2013 11:55, Andrey Zonov wrote:

On 2/11/13 3:18 PM, Andre Oppermann wrote:


Smaller RTO (1s) has become an RFC, so there was very broad consensus in
TCPM that it is a good thing.  We don't have it yet because we were not fully
compliant in one case (loss of first segment).  I've fixed that a while
back and will bring 1s RTO soon to HEAD.



They use 300ms at least for me/my link/ISP/etc.


Let me be more precise: an initial RTO of 1s was published as an RFC.  This
is what I'm referring to.  It affects the setup phase of a connection.

A separate issue is the minimum RTO during a connection.  According to
the RFC, the RTO during the lifetime of the connection should also not be
less than 1s.  The RTO is determined based on the RTT measurement done
using timestamps or Karn's algorithm.  However, on fast links this has been
shown to be too long to wait for.  So FreeBSD decreased the allowed lower
bound to hz/33.  This is only effective if your RTO was actually calculated
to be equal to or lower than that.  The result is quicker re-probing and
discovery of the current line conditions.  Since the RTO was measured to be
less than or equal to hz/33, the possible negative downside is very limited.
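
In rough RFC 6298 terms the clamping described above looks like the sketch
below. The constants and names are illustrative, not the kernel's actual
macros (FreeBSD keeps the floor in the net.inet.tcp.rexmit_min tunable):

#include <stdio.h>

#define HZ          1000        /* assume 1 ms ticks for readability */
#define REXMIT_MIN  (HZ / 33)   /* FreeBSD's lowered floor: ~30 ms */
#define REXMIT_MAX  (64 * HZ)   /* RFC 6298 allows an upper bound >= 60 s */

static int
rto_ms(int srtt, int rttvar)
{
    int rto = srtt + 4 * rttvar;    /* RFC 6298: RTO = SRTT + 4*RTTVAR */

    if (rto < REXMIT_MIN)
        rto = REXMIT_MIN;           /* quicker re-probing on fast links */
    if (rto > REXMIT_MAX)
        rto = REXMIT_MAX;
    return (rto);
}

int
main(void)
{
    /* LAN-ish path: 2 ms SRTT, 1 ms variance -> clamped up to the floor. */
    printf("fast link RTO: %d ms\n", rto_ms(2, 1));
    /* WAN path: 200 ms SRTT, 50 ms variance -> 400 ms, no clamping. */
    printf("WAN RTO: %d ms\n", rto_ms(200, 50));
    return (0);
}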

--
Andre



Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-12 Thread Andre Oppermann

On 11.02.2013 19:56, Adrian Chadd wrote:

On 11 February 2013 03:18, Andre Oppermann an...@freebsd.org wrote:


In general Google does provide quite a bit of data with their experiments
showing that it isn't harmful and that it helps the case.

Smaller RTO (1s) has become an RFC, so there was very broad consensus in
TCPM that it is a good thing.  We don't have it yet because we were not fully
compliant in one case (loss of first segment).  I've fixed that a while
back and will bring 1s RTO soon to HEAD.

I'm pretty sure that Google doesn't ignore idle on their Internet facing
servers.  They may have proposed a decay mechanism in the past.  I'd have
to check the TCPM archives for that.


Argh, the "if Google does it, it must be fine" argument.


Please.  You removed what I was replying to.  There is no doubt IW10
originated from Google.  However Google took it to TCPM and provided
measurement data with it.  After some back and forth they provided more
data, which began to convince more people on TCPM.  Eventually the
proposal was adopted as an official TCPM working group draft and will
likely become an RFC later this year.

If you want to argue against RTO 1s (RFC 6298), the lead authors are
from ICSI/UC Berkeley.  Google did participate in that one by providing
additional measurement data.


Does Google publish the data for these experiments with the
international and local links broken down?


Yes.  Have you followed the evolution and discussion of IW10 on TCPM?


Google run a highly distributed infrastructure (this isn't news for
anyone, I know) and thus the link distance, RTT, number of hops, etc.
may not accurately reflect "the internet". It may accurately reflect
"the internet" from the perspective of being roughly within the same
city or state in a lot of cases.



The TCP congestion algorithms aren't just for avoiding congestion over
a peering fabric and last-mile ISP infrastructure.


IW10 is not a congestion control algorithm.  It is a change to the
initial state of one at the beginning of a connection, when not much
other data is available.  Many years ago the same thing happened with
RFC 3390, which increased the IW to 3 segments.
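
For comparison, the two initial-window rules being discussed, as a small
sketch (RFC 3390's published formula and IW10's flat ten segments; the
code itself is illustrative, not the kernel's):

#include <stdio.h>

/* RFC 3390: IW = min(4*MSS, max(2*MSS, 4380 bytes)). */
static unsigned int
iw_rfc3390(unsigned int mss)
{
    unsigned int cap = 4380 > 2 * mss ? 4380 : 2 * mss;

    return (4 * mss < cap ? 4 * mss : cap);
}

/* draft-ietf-tcpm-initcwnd (IW10): a flat ten segments. */
static unsigned int
iw10(unsigned int mss)
{
    return (10 * mss);
}

int
main(void)
{
    /* With a 1460-byte MSS: 4380 bytes (3 segments) vs 14600 bytes. */
    printf("RFC 3390: %u bytes, IW10: %u bytes\n",
        iw_rfc3390(1460), iw10(1460));
    return (0);
}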


The effects of tweaking congestion algorithms for delivery over a
local peering infrastructure where you try to run things as
un-congested as possible (where congestion is now The ISP's Problem)
and where you maintain tight control over as much of the network
infrastructure as you can, are likely going to be very different from
the congestion algorithm behaviour needed for some end-node speaking to
a variety of end-nodes over a longer, more varying set of international
links. You know, what TCP congestion algorithms are also trying to
play fair with.


I agree, but that's not relevant to this case.


Please - as much as I applaud Google for what they do, please don't
generalise their results to the greater internet without looking at
the many caveats/assumptions.


Well, that's exactly what I'm trying to do here, except not only for
ideas sourced from Google but also from other places.  Like "it's in
Linux and the Internet hasn't broken down yet", without any measurement
data whatsoever.

--
Andre



Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-11 Thread Eggert, Lars
On Feb 10, 2013, at 11:36, Andrey Zonov z...@freebsd.org wrote:
 Google made many many TCP tweaks.  Increased initial window, small RTO,
 enabled ignore after idle and others.  They published that, other people
 just blindly applied these tunings and the Internet still works.

MANY people are experimenting with the changes Google is proposing, in order to 
evaluate if and how well they work. Sure, some folks may blindly apply them, 
but please don't generalize.

Lars



Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-11 Thread Andre Oppermann

On 09.02.2013 15:41, Alfred Perlstein wrote:

However, the end result must be far different than what has occurred so far.

If the code was deemed unacceptable for general inclusion, then we must
find a way to provide a light framework to accomplish the needs of the
community member.


We've got pluggable congestion control modules thanks to lstewart.

You can implement any non-standard congestion control method by adding
your own module.  They can be compiled into the kernel or loaded as KLD.

I consider implementing this as a CC module the correct approach instead
of adding yet another sysctl.  Doing a CC module like this is very easy.

--
Andre



Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-11 Thread Andre Oppermann

On 05.02.2013 22:40, John Baldwin wrote:

On Tuesday, February 05, 2013 12:44:27 pm Andre Oppermann wrote:

I would prefer to encapsulate it into its own not-so-much-congestion-management
algorithm so you can eventually do other tweaks as well like more aggressive
loss recovery which would fit your objective as well.  Since you have to modify
your app anyways to do the sockopt call this seems a more complete solution to
me.  At least better than to do a non-portable hack that violates one of the
most fundamental TCP concepts.


This is real rich from the guy pushing the increased IW that came from Linux. :)


IW10 came from Google and obviously was implemented in Linux first because that
is what they use.  However, and this is the big difference, they also provided
significant real-world data on the effects of their changes.  TCPM was very
skeptical at first, but the data from the experiments has convinced many
that it is, first, not harmful and, second, actually beneficial.


Tools not policy yadda yadda, but I digress.


--
Andre



Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-11 Thread Andre Oppermann

On 10.02.2013 11:36, Andrey Zonov wrote:

On 2/10/13 9:05 AM, Kevin Oberman wrote:


This is a subject rather near to my heart, having fought battles with
congestion back in the dark days of Windows when it essentially
defaulted to TCP_IGNOREIDLE. It was a huge pain, but it was the only
way Windows did TCP in the early days. It simply did not implement
slow-start. This was really evil, but in the days when lots of links
were 56K and T-1 was mostly used for network core links, the Internet,
small as it was back then, did not melt, though it glowed a
frightening shade of red fairly often. Today too many systems running
like this would melt things very quickly.



Google made many many TCP tweaks.  Increased initial window, small RTO,
enabled ignore after idle and others.  They published that, other people
just blindly applied these tunings and the Internet still works.


In general Google does provide quite a bit of data with their experiments
showing that it isn't harmful and that it helps the case.

Smaller RTO (1s) has become an RFC, so there was very broad consensus in
TCPM that it is a good thing.  We don't have it yet because we were not fully
compliant in one case (loss of first segment).  I've fixed that a while
back and will bring 1s RTO soon to HEAD.

I'm pretty sure that Google doesn't ignore idle on their Internet facing
servers.  They may have proposed a decay mechanism in the past.  I'd have
to check the TCPM archives for that.

--
Andre



Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-11 Thread Adrian Chadd
On 11 February 2013 03:18, Andre Oppermann an...@freebsd.org wrote:

 In general Google does provide quite a bit of data with their experiments
 showing that it isn't harmful and that it helps the case.

 Smaller RTO (1s) has become an RFC, so there was very broad consensus in
 TCPM that it is a good thing.  We don't have it yet because we were not fully
 compliant in one case (loss of first segment).  I've fixed that a while
 back and will bring 1s RTO soon to HEAD.

 I'm pretty sure that Google doesn't ignore idle on their Internet facing
 servers.  They may have proposed a decay mechanism in the past.  I'd have
 to check the TCPM archives for that.

Argh, the "if Google does it, it must be fine" argument.

Does Google publish the data for these experiments with the
international and local links broken down?

Google run a highly distributed infrastructure (this isn't news for
anyone, I know) and thus the link distance, RTT, number of hops, etc.
may not accurately reflect "the internet". It may accurately reflect
"the internet" from the perspective of being roughly within the same
city or state in a lot of cases.

The TCP congestion algorithms aren't just for avoiding congestion over
a peering fabric and last-mile ISP infrastructure.

The effects of tweaking congestion algorithms for delivery over a
local peering infrastructure where you try to run things as
un-congested as possible (where congestion is now The ISP's Problem)
and where you maintain tight control over as much of the network
infrastructure as you can, are likely going to be very different from
the congestion algorithm behaviour needed for some end-node speaking to
a variety of end-nodes over a longer, more varying set of international
links. You know, what TCP congestion algorithms are also trying to
play fair with.

Please - as much as I applaud Google for what they do, please don't
generalise their results to the greater internet without looking at
the many caveats/assumptions.



Adrian


Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-11 Thread Alfred Perlstein

On 2/11/13 3:10 AM, Andre Oppermann wrote:

On 09.02.2013 15:41, Alfred Perlstein wrote:
However, the end result must be far different than what has occurred 
so far.


If the code was deemed unacceptable for general inclusion, then we
must find a way to provide a light framework to accomplish the needs
of the community member.


We've got pluggable congestion control modules thanks to lstewart.

You can implement any non-standard congestion control method by adding
your own module.  They can be compiled into the kernel or loaded as KLD.

I consider implementing this as a CC module the correct approach instead
of adding yet another sysctl.  Doing a CC module like this is very easy.
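
For illustration, a minimal sketch of such a module (names invented, and
only after_idle overridden; a real module would also wire up ack_received,
cong_signal and friends, e.g. to the NewReno handlers):

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/module.h>
#include <netinet/cc.h>
#include <netinet/cc/cc_module.h>

static void
example_after_idle(struct cc_var *ccv)
{
        /* Deliberately leave snd_cwnd untouched after an idle period. */
}

static struct cc_algo example_cc_algo = {
        .name = "example",
        .after_idle = example_after_idle,
};

/* Registers the algorithm so it can be built in or loaded as a KLD. */
DECLARE_CC_MODULE(example, &example_cc_algo);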


That sounds like a win.

-Alfred


Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-10 Thread grenville armitage



On 02/10/2013 18:30, Eggert, Lars wrote:

On Feb 10, 2013, at 6:05, Kevin Oberman kob6...@gmail.com wrote:

One idea that popped into my head (and it may be completely ridiculous)
is to make its availability dependent on a kernel option and have a
warning in NOTES about it contravening normal and accepted practice
and about how it can cause serious problems both for yourself and for
others using the network.


Also, if it gets merged, don't call it TCP_IGNOREIDLE. Call it 
TCP_BLAST_DANGEROUSLY_AFTER_IDLE.


TCP_AVALANCHE

cheers,
gja



Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-10 Thread grenville armitage


I'm somewhat sympathetic to the purity of TCP. Nevertheless...

On 02/10/2013 16:05, Kevin Oberman wrote:
[..]

What I would like to see is a way to have it available, but make it
unlikely to be enabled except in a way that would put up flashing red
warnings and sound sirens to warn people that it is very dangerous and
can be a way to blow off a few of one's own toes.


+1

I rather doubt the Internet will be crushed by adding a non-default
option that allows FreeBSD TCP to behave More Aggressively Than It
Really Should(tm) under certain circumstances.

I'm certainly not denying that the sky would likely fall if everyone
turned on John's proposed socket option all the time. (Such might
also be said of allowing UDP applications to be free of any CC at
all, or allowing new TCP CC algorithms that deviate from the prevalent
norm.) But I think that FreeBSD benefits from adding more special-case
knobs for the cognoscenti to twiddle, on the basis that most end-users
won't bother.


One idea that popped into my head (and it may be completely ridiculous)
is to make its availability dependent on a kernel option and have a
warning in NOTES about it contravening normal and accepted practice
and about how it can cause serious problems both for yourself and for
others using the network.


Perhaps also require a sysctl to be set before John's per-socket
TCP_IGNOREIDLE option has any effect. (Thus requiring a sending host's
administrator to at least be complicit in enabling any subsequent
ruination of their nearest bottleneck.)
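
A sketch of what that gate could look like in the kernel; the sysctl name,
variable and placement are invented here, purely for illustration:

/* Hypothetical knob, default off, so TCP_IGNOREIDLE is refused. */
static VNET_DEFINE(int, tcp_allow_ignoreidle) = 0;
#define V_tcp_allow_ignoreidle  VNET(tcp_allow_ignoreidle)

SYSCTL_VNET_INT(_net_inet_tcp, OID_AUTO, allow_ignoreidle, CTLFLAG_RW,
    &VNET_NAME(tcp_allow_ignoreidle), 0,
    "Allow applications to set TCP_IGNOREIDLE");

/* ...and in the TCP_IGNOREIDLE case of tcp_ctloutput(): */
        if (optval && !V_tcp_allow_ignoreidle)
                return (EPERM);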

cheers,
gja



Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-10 Thread Andrey Zonov
On 2/10/13 9:05 AM, Kevin Oberman wrote:
 
 This is a subject rather near to my heart, having fought battles with
 congestion back in the dark days of Windows when it essentially
 defaulted to TCP_IGNOREIDLE. It was a huge pain, but it was the only
 way Windows did TCP in the early days. It simply did not implement
 slow-start. This was really evil, but in the days when lots of links
 were 56K and T-1 was mostly used for network core links, the Internet,
 small as it was back then, did not melt, though it glowed a
 frightening shade of red fairly often. Today too many systems running
 like this would melt things very quickly.
 

Google made many, many TCP tweaks: an increased initial window, a smaller
RTO, ignoring slow start after idle, and others.  They published that; other
people just blindly applied these tunings and the Internet still works.

-- 
Andrey Zonov





Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-09 Thread Alfred Perlstein

On 2/7/13 12:04 PM, George Neville-Neil wrote:

On Feb 6, 2013, at 12:28 , Alfred Perlstein bri...@mu.org wrote:


On 2/6/13 4:46 AM, John Baldwin wrote:

On Wednesday, February 06, 2013 6:27:04 am Randall Stewart wrote:

John:

A burst at line rate will *often* cause drops. This is because
router queues are of finite size. Also, such a burst (especially
on a long-delay, high-bandwidth network) causes your RTT to increase
even if there is no drop, which is going to hurt you as well.

A SHOULD in an RFC says you really really really really need to do it
unless there is something that makes you willing to override it. It is
slight wiggle room.

In this I agree with Andre, we should not be *not* doing it. Otherwise
folks will be turning this on and it is plain wrong. It may be fine
for your network but I would not want to see it in FreeBSD.

In my testing here at home I have put back into our stack max-burst. This
uses Mark Allman's version (not Kacheong Poon's) where you clamp the cwnd at
no more than 4 packets larger than your flight. All of my testing
high-bw-delay or lan has shown this to improve TCP performance. This
is because it helps you avoid bursting out so many packets that you overflow
a queue.

In your long-delay bw link if you do burst out too many (and you never
know how many that is since you can not predict how full all those
MPLS queues are or how big they are) you will really hurt yourself even worse.
Note that generally in Cisco routers the default queue size is somewhere between
100-300 packets depending on the router.

Due to the way our application works this never happens, but I am fine with
just keeping this patch private.  If there are other shops that need this they
can always dig the patch up from the archives.


This is yet another time when I'm sad about how things happen in FreeBSD.

A developer comes forward with a non-default option that's very useful for some
specific workloads (specifically, a developer who contributes much time and $$$
to the project), and the community rejects the patches even though it's been
successful in other OSes.

It makes zero sense.

John, can you repost the patch?  Maybe there is a way to refactor this somehow 
so it's like accept filters where we can plug in a hook for TCP?

I am very disappointed, but not surprised.


I take away the complete opposite feeling.  This is how we work through these 
issues.
It's clear from the discussion that this need not be a default in the system,
and is a special case.  We had a reasoned discussion of what would be best to do
and at least two experts in TCP weighed in on the effect this change might have.

Not everything proposed by a developer need go into the tree, in particular 
since these
discussions are archived we can always revisit this later.

This is exactly how collaborative development should look, whether or not the 
patch
is integrated now, next week, next year, or ever.


I agree that discussion is great; we have all learned quite a bit from
it, about TCP and the dangers of adjusting buffering without
considerable thought.  I would not be involved in FreeBSD had this type
of discussion and information not been shared on the lists so readily.


However, the end result must be far different than what has occurred so far.

If the code was deemed unacceptable for general inclusion, then we must 
find a way to provide a light framework to accomplish the needs of the 
community member.


Take for instance someone who is starting a company that needs this
facility.  Which OS will they choose?  One that has integrated a useful
feature?  Or one that has rejected it and left that code in the mailing
list archives?


As much as expert opinion is valuable, it must include an understanding
of the need to handle special cases and the ability to facilitate those
special cases for our users and developers.


-Alfred


Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-09 Thread Kevin Oberman
On Sat, Feb 9, 2013 at 6:41 AM, Alfred Perlstein bri...@mu.org wrote:
 On 2/7/13 12:04 PM, George Neville-Neil wrote:

 On Feb 6, 2013, at 12:28 , Alfred Perlstein bri...@mu.org wrote:

 On 2/6/13 4:46 AM, John Baldwin wrote:

 On Wednesday, February 06, 2013 6:27:04 am Randall Stewart wrote:

 John:

 A burst at line rate will *often* cause drops. This is because
 router queues are of finite size. Also, such a burst (especially
 on a long-delay, high-bandwidth network) causes your RTT to increase
 even if there is no drop, which is going to hurt you as well.

 A SHOULD in an RFC says you really really really really need to do it
 unless there is something that makes you willing to override it. It is
 slight wiggle room.

 In this I agree with Andre, we should not be *not* doing it. Otherwise
 folks will be turning this on and it is plain wrong. It may be fine
 for your network but I would not want to see it in FreeBSD.

 In my testing here at home I have put back into our stack max-burst.
 This
 uses Mark Allman's version (not Kacheong Poon's) where you clamp the
 cwnd at
 no more than 4 packets larger than your flight. All of my testing
 high-bw-delay or lan has shown this to improve TCP performance. This
 is because it helps you avoid bursting out so many packets that you
 overflow
 a queue.

 In your long-delay bw link if you do burst out too many (and you never
 know how many that is since you can not predict how full all those
 MPLS queues are or how big they are) you will really hurt yourself even
 worse.
 Note that generally in Cisco routers the default queue size is
 somewhere between
 100-300 packets depending on the router.

 Due to the way our application works this never happens, but I am fine
 with
 just keeping this patch private.  If there are other shops that need
 this they
 can always dig the patch up from the archives.

 This is yet another time when I'm sad about how things happen in FreeBSD.

 A developer comes forward with a non-default option that's very useful for
 some specific workloads (specifically, a developer who contributes much time
 and $$$ to the project), and the community rejects the patches even though
 it's been successful in other OSes.

 It makes zero sense.

 John, can you repost the patch?  Maybe there is a way to refactor this
 somehow so it's like accept filters where we can plug in a hook for TCP?

 I am very disappointed, but not surprised.

 I take away the complete opposite feeling.  This is how we work through
 these issues.
 It's clear from the discussion that this need not be a default in the
 system,
 and is a special case.  We had a reasoned discussion of what would be best
 to do
 and at least two experts in TCP weighed in on the effect this change might
 have.

 Not everything proposed by a developer need go into the tree, in
 particular since these
 discussions are archived we can always revisit this later.

 This is exactly how collaborative development should look, whether or not
 the patch
 is integrated now, next week, next year, or ever.


 I agree that discussion is great; we have all learned quite a bit from it,
 about TCP and the dangers of adjusting buffering without considerable
 thought.  I would not be involved in FreeBSD had this type of discussion
 and information not been shared on the lists so readily.

 However, the end result must be far different than what has occurred so far.

 If the code was deemed unacceptable for general inclusion, then we must find
 a way to provide a light framework to accomplish the needs of the community
 member.

 Take for instance someone who is starting a company that needs this
 facility.  Which OS will they choose?  One that has integrated a useful
 feature?  Or one that has rejected it and left that code in the mailing list
 archives?

 As much as expert opinion is valuable, it must include an understanding of
 the need to handle special cases and the ability to facilitate those special
 cases for our users and developers.

This is a subject rather near to my heart, having fought battles with
congestion back in the dark days of Windows when it essentially
defaulted to TCP_IGNOREIDLE. It was a huge pain, but it was the only
way Windows did TCP in the early days. It simply did not implement
slow-start. This was really evil, but in the days when lots of links
were 56K and T-1 was mostly used for network core links, the Internet,
small as it was back then, did not melt, though it glowed a
frightening shade of red fairly often. Today too many systems running
like this would melt things very quickly.

OTOH, I can certainly see cases, like John's, where it would be very
beneficial. And, yes, Linux has it. (I don't see this as relevant in
any way except as proof that not enough people have turned it on to
cause serious problems... yet!) It seems a shame to make everyone who
really has a need develop their own patches or dig through old mail to
find John's.

What I would like to see is a way to have it available, but make it
unlikely to be enabled except in a way that would put up flashing red
warnings and sound sirens to warn people that it is very dangerous and
can be a way to blow off a few of one's own toes.

Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-09 Thread Eggert, Lars
On Feb 10, 2013, at 6:05, Kevin Oberman kob6...@gmail.com wrote:
 One idea that popped into my head (and it may be completely ridiculous)
 is to make its availability dependent on a kernel option and have a
 warning in NOTES about it contravening normal and accepted practice
 and about how it can cause serious problems both for yourself and for
 others using the network.

Also, if it gets merged, don't call it TCP_IGNOREIDLE. Call it 
TCP_BLAST_DANGEROUSLY_AFTER_IDLE.

Lars


Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-07 Thread George Neville-Neil

On Feb 6, 2013, at 12:28 , Alfred Perlstein bri...@mu.org wrote:

 On 2/6/13 4:46 AM, John Baldwin wrote:
 On Wednesday, February 06, 2013 6:27:04 am Randall Stewart wrote:
 John:
 
 A burst at line rate will *often* cause drops. This is because
 router queues are of finite size. Also, such a burst (especially
 on a long-delay, high-bandwidth network) causes your RTT to increase
 even if there is no drop, which is going to hurt you as well.

 A SHOULD in an RFC says you really really really really need to do it
 unless there is something that makes you willing to override it. It is
 slight wiggle room.
 
 In this I agree with Andre, we should not be *not* doing it. Otherwise
 folks will be turning this on and it is plain wrong. It may be fine
 for your network but I would not want to see it in FreeBSD.
 
 In my testing here at home I have put back into our stack max-burst. This
 uses Mark Allman's version (not Kacheong Poon's) where you clamp the cwnd at
 no more than 4 packets larger than your flight. All of my testing
 high-bw-delay or lan has shown this to improve TCP performance. This
 is because it helps you avoid bursting out so many packets that you overflow
 a queue.
 
 In your long-delay bw link if you do burst out too many (and you never
 know how many that is since you can not predict how full all those
 MPLS queues are or how big they are) you will really hurt yourself even 
 worse.
 Note that generally in Cisco routers the default queue size is somewhere 
 between
 100-300 packets depending on the router.
 Due to the way our application works this never happens, but I am fine with
 just keeping this patch private.  If there are other shops that need this 
 they
 can always dig the patch up from the archives.
 
 This is yet another time when I'm sad about how things happen in FreeBSD.
 
 A developer comes forward with a non-default option that's very useful for
 some specific workloads (specifically, a developer who contributes much time
 and $$$ to the project), and the community rejects the patches even though
 it's been successful in other OSes.
 
 It makes zero sense.
 
 John, can you repost the patch?  Maybe there is a way to refactor this 
 somehow so it's like accept filters where we can plug in a hook for TCP?
 
 I am very disappointed, but not surprised.
 

I take away the complete opposite feeling.  This is how we work through these 
issues.
It's clear from the discussion that this need not be a default in the system,
and is a special case.  We had a reasoned discussion of what would be best to do
and at least two experts in TCP weighed in on the effect this change might have.

Not everything proposed by a developer need go into the tree, in particular 
since these
discussions are archived we can always revisit this later.

This is exactly how collaborative development should look, whether or not the 
patch
is integrated now, next week, next year, or ever.

Best,
George




Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-06 Thread Randall Stewart
John:

A burst at line rate will *often* cause drops. This is because
router queues are of finite size. Also, such a burst (especially
on a long-delay, high-bandwidth network) causes your RTT to increase
even if there is no drop, which is going to hurt you as well.

A SHOULD in an RFC says you really really really really need to do it
unless there is something that makes you willing to override it. It is
slight wiggle room.

In this I agree with Andre, we should not be *not* doing it. Otherwise
folks will be turning this on and it is plain wrong. It may be fine
for your network but I would not want to see it in FreeBSD.

In my testing here at home I have put back into our stack max-burst. This
uses Mark Allman's version (not Kacheong Poon's) where you clamp the cwnd at
no more than 4 packets larger than your flight. All of my testing
high-bw-delay or lan has shown this to improve TCP performance. This
is because it helps you avoid bursting out so many packets that you overflow
a queue.
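
(In code terms that clamp amounts to roughly the sketch below, using
FreeBSD tcpcb field names; this is an illustration of the idea, not the
actual patch being tested.)

/*
 * Allman-style max-burst: never let cwnd admit more than 4 segments
 * beyond what is currently in flight.
 */
static void
tcp_maxburst_clamp(struct tcpcb *tp)
{
        u_long flight = tp->snd_max - tp->snd_una;  /* bytes in flight */
        u_long limit = flight + 4 * tp->t_maxseg;

        if (tp->snd_cwnd > limit)
                tp->snd_cwnd = limit;
}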

In your long-delay bw link if you do burst out too many (and you never
know how many that is since you can not predict how full all those
MPLS queues are or how big they are) you will really hurt yourself even worse.
Note that generally in Cisco routers the default queue size is somewhere between
100-300 packets depending on the router.

Bottom line: IMO this is a bad idea.

If you want to really improve that link, let me get with you off line and we can
see about getting you a couple of our boxes again :-D.

R
On Jan 22, 2013, at 4:37 PM, Andre Oppermann wrote:

 On 22.01.2013 21:35, Alfred Perlstein wrote:
 On 1/22/13 12:11 PM, John Baldwin wrote:
 As I mentioned in an earlier thread, I recently had to debug an issue we 
 were
 seeing across a link with a high bandwidth-delay product (both high 
 bandwidth
 and high RTT).  Our specific use case was to use a TCP connection to 
 reliably
 forward a latency-sensitive datagram stream across a WAN connection.  We 
 would
 often see spikes in the latency of individual datagrams.  I eventually 
 tracked
 this down to the connection entering slow start when it would transmit data
 after being idle.  The data stream was quite bursty and would often attempt 
 to
 transmit a burst of data after being idle for far longer than a retransmit
 timeout.
 
 In 7.x we had worked around this in the past by disabling RFC 3390 and 
 jacking
 the slow start window size up via a sysctl.  On 8.x this no longer worked.
 The solution I came up with was to add a new socket option to disable idle
 handling completely.  That is, when an idle connection restarts with this 
 new
 option enabled, it keeps its current congestion window and doesn't enter 
 slow
 start.
 
 There are only a few cases where such an option is useful, but if anyone 
 else
 thinks this might be useful I'd be happy to add the option to FreeBSD.
 
 This looks good, but it almost sounds like a bug for TCP to be doing this 
 anyhow.
 
 It's not a bug.  It's by design.  It's required by the RFC.
 
 Why would one want this behavior?
 
 Network conditions change all the time.  Traffic and congestion comes and 
 goes.
 Connections can go idle for milliseconds to minutes to hours.  Whenever 
 enough
 time has passed network capacity probing has to start anew.
 
 Wouldn't it make sense to keep the window large until there was a problem 
 rather than
 unconditionally chop it down?  I almost think TCP is afraid that you might 
 wind up swapping out a
 10gig interface for a modem?  I'm just not getting it.  (probably simple 
 oversight on my part).
 
 The very real fear is congestion meltdown.  That is the reason we ended up 
 with
 TCP's AIMD mechanism in the first place.  If everybody were to blast into the
 network, everyone will suffer.  The bufferbloat issue identified recently makes
 things
 even worse.
 
 What do you think about also making this a sysctl for global on/off by 
 default?
 
 Please don't.  The correct fix is either a) to use the initial window as the 
 restart
 window (up to 10 MSS nowadays); b) to use a decay mechanism based on the time 
 since
 the last network condition probe.  Even the latter must decay to initCWND 
 within at
 most 1MSL.
 
 -- 
 Andre
 
 

--
Randall Stewart
803-317-4952 (cell)



Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-06 Thread Randall Stewart
John:

In-line

On Jan 24, 2013, at 11:14 AM, John Baldwin wrote:

 On Thursday, January 24, 2013 3:03:31 am Andre Oppermann wrote:
 On 24.01.2013 03:31, Sepherosa Ziehau wrote:
 On Thu, Jan 24, 2013 at 12:15 AM, John Baldwin j...@freebsd.org wrote:
 On Wednesday, January 23, 2013 1:33:27 am Sepherosa Ziehau wrote:
 On Wed, Jan 23, 2013 at 4:11 AM, John Baldwin j...@freebsd.org wrote:
 As I mentioned in an earlier thread, I recently had to debug an issue we 
 were
 seeing across a link with a high bandwidth-delay product (both high 
 bandwidth
 and high RTT).  Our specific use case was to use a TCP connection to 
 reliably
 forward a latency-sensitive datagram stream across a WAN connection.  We 
 would
 often see spikes in the latency of individual datagrams.  I eventually 
 tracked
 this down to the connection entering slow start when it would transmit 
 data
 after being idle.  The data stream was quite bursty and would often 
 attempt to
 transmit a burst of data after being idle for far longer than a 
 retransmit
 timeout.
 
 In 7.x we had worked around this in the past by disabling RFC 3390 and 
 jacking
 the slow start window size up via a sysctl.  On 8.x this no longer 
 worked.
 The solution I came up with was to add a new socket option to disable 
 idle
 handling completely.  That is, when an idle connection restarts with 
 this new
 option enabled, it keeps its current congestion window and doesn't enter 
 slow
 start.
 
 There are only a few cases where such an option is useful, but if anyone 
 else
 thinks this might be useful I'd be happy to add the option to FreeBSD.
 
 I think what you need is RFC2861; however, you probably should
 ignore the application-limited-period part of RFC2861.
 
 Hummm.  It appears, btw, that Linux uses RFC 2861, but has a global knob to
 disable it due to applications having problems.  When it is disabled, it
 doesn't decay the congestion window at all during idle handling.  That is,
 it appears to act the same as if TCP_IGNOREIDLE were enabled.
 
 From http://www.kernel.org/doc/man-pages/online/pages/man7/tcp.7.html:
 
    tcp_slow_start_after_idle (Boolean; default: enabled; since Linux 2.6.18)
           If enabled, provide RFC 2861 behavior and time out the congestion
           window after an idle period.  An idle period is defined as the
           current RTO (retransmission timeout).  If disabled, the congestion
           window will not be timed out after an idle period.
 
 Also, in this thread on tcpm it appears no one on that list realizes that
 any implementations follow the SHOULD in RFC 2581 for idle
 handling (which is what we do currently):
 
 Nah, I don't think the idle detection in FreeBSD follows
 RFC2581/RFC5681 4.1 (the paragraph before the SHOULD).  IMHO, that's
 probably why the author in the following email questioned the
 implementation of the SHOULD in RFC2581/RFC5681.
 
 
 http://www.ietf.org/mail-archive/web/tcpm/current/msg02864.html
 
 So if we were to implement RFC 2861, the new socket option would be 
 equivalent
 to setting Linux's 'tcp_slow_start_after_idle' to false, but on a 
 per-socket
 basis rather than globally.
 
 Agree, a per-socket option could be more useful than global sysctls in
 certain situations.  However, in addition to the per-socket option,
 could global sysctl nodes to disable idle_restart/idle_cwv help too?
 
 No.  This is far too dangerous once it makes it into some tuning guide.
 The threat of congestion breakdown is real.  The Internet, or any packet
 network, can only survive in the long term if almost all follow the rules
 and self-constrain to remain fair to the others.  What would happen if
 nobody would respect the traffic lights anymore?
 
 The problem with this argument is Linux has already had this as a tunable
 option for years and the Internet hasn't melted as a result.

Just because Linux does bad behaviour does *not* mean that we have to.
They also put BIC CC in by default, and this makes things bad for users
even more so than RFC2581 in the buffer-bloat sense. The buffer-bloat
problems reported by Jim Gettys would not have been nearly as bad (they
still would have existed) if he had been using standard RFC2581 CC.

There are much better (and safer) ways to handle this type of network.
Putting this in is not a good idea IMO.




 
 Besides that, bursting into unknown network conditions is very likely to
 result in burst losses as well.  TCP isn't good at recovering from it.
 In the end you most likely come out ahead if you decay the restartCWND.
 
 We have two cases primarily: a) long distance, medium to high RTT, and
 wildly varying bandwidth (a.k.a. the Internet); b) short distance, low
 RTT and mostly plenty of bandwidth (a.k.a. Datacenter).  The former
 absolutely definitely requires a decayed restartCWND.  The latter less
 so but even there bursting at 10Gig TSO assisted wirespeed isn't going
 to end too happy more often than not.
 
 You 

Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-06 Thread John Baldwin
On Wednesday, February 06, 2013 6:27:04 am Randall Stewart wrote:
 John:
 
 A burst at line rate will *often* cause drops. This is because
 router queues are of finite size. Also, such a burst (especially
 on a long-delay, high-bandwidth network) causes your RTT to increase
 even if there is no drop, which is going to hurt you as well.

 A SHOULD in an RFC says you really really really really need to do it
 unless there is something that makes you willing to override it. It is
 slight wiggle room.
 
 In this I agree with Andre, we should not be *not* doing it. Otherwise
 folks will be turning this on and it is plain wrong. It may be fine
 for your network but I would not want to see it in FreeBSD.
 
 In my testing here at home I have put back into our stack max-burst. This
 uses Mark Allman's version (not Kacheong Poon's) where you clamp the cwnd at
 no more than 4 packets larger than your flight. All of my testing
 high-bw-delay or lan has shown this to improve TCP performance. This
 is because it helps you avoid bursting out so many packets that you overflow
 a queue.
 
 In your long-delay bw link if you do burst out too many (and you never
 know how many that is since you can not predict how full all those
 MPLS queues are or how big they are) you will really hurt yourself even worse.
 Note that generally in Cisco routers the default queue size is somewhere 
 between
 100-300 packets depending on the router.

Due to the way our application works this never happens, but I am fine with
just keeping this patch private.  If there are other shops that need this they
can always dig the patch up from the archives.

-- 
John Baldwin


Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-06 Thread Alfred Perlstein

On 2/6/13 4:46 AM, John Baldwin wrote:

On Wednesday, February 06, 2013 6:27:04 am Randall Stewart wrote:

John:

A burst at line rate will *often* cause drops. This is because
router queues are of finite size. Also, such a burst (especially
on a long-delay, high-bandwidth network) causes your RTT to increase
even if there is no drop, which is going to hurt you as well.

A SHOULD in an RFC says you really really really really need to do it
unless there is something that makes you willing to override it. It is
slight wiggle room.

In this I agree with Andre, we should not be *not* doing it. Otherwise
folks will be turning this on and it is plain wrong. It may be fine
for your network but I would not want to see it in FreeBSD.

In my testing here at home I have put back into our stack max-burst. This
uses Mark Allman's version (not Kacheong Poon's) where you clamp the cwnd at
no more than 4 packets larger than your flight. All of my testing
high-bw-delay or lan has shown this to improve TCP performance. This
is because it helps you avoid bursting out so many packets that you overflow
a queue.

In your long-delay bw link if you do burst out too many (and you never
know how many that is since you can not predict how full all those
MPLS queues are or how big they are) you will really hurt yourself even worse.
Note that generally in Cisco routers the default queue size is somewhere between
100-300 packets depending on the router.

Due to the way our application works this never happens, but I am fine with
just keeping this patch private.  If there are other shops that need this they
can always dig the patch up from the archives.


This is yet another time when I'm sad about how things happen in FreeBSD.

A developer comes forward with a non-default option that's very useful
for some specific workloads (specifically, a developer who contributes
much time and $$$ to the project), and the community rejects the patches
even though it's been successful in other OSes.


It makes zero sense.

John, can you repost the patch?  Maybe there is a way to refactor this 
somehow so it's like accept filters where we can plug in a hook for TCP?


I am very disappointed, but not surprised.

-Alfred




Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-05 Thread John Baldwin
On Wednesday, January 30, 2013 12:26:17 pm Andre Oppermann wrote:
 You can simply create your own congestion control algorithm with only the
 restart window changed.  See (pseudo) code below.  BTW, I just noticed that
the other cc algos do not reset the idle window.

*sigh*  I am fully competent at maintaining my own local changes.  The point
was to share this so that other people with similar workloads could make use 
of it.  Also, a custom CC algo is not the right approach as we would want this
change regardless of the CC algo used for handling non-idle periods (so that
this is an orthogonal knob).  Linux also makes this an orthogonal knob rather 
than requiring a separate CC algo.

-- 
John Baldwin


Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-05 Thread Andre Oppermann

On 05.02.2013 18:11, John Baldwin wrote:

On Wednesday, January 30, 2013 12:26:17 pm Andre Oppermann wrote:

You can simply create your own congestion control algorithm with only the
restart window changed.  See (pseudo) code below.  BTW, I just noticed that
the other cc algos do not reset the idle window.


*sigh*  I am fully competent at maintaining my own local changes.  The point
was to share this so that other people with similar workloads could make use
of it.  Also, a custom CC algo is not the right approach as we would want this
change regardless of the CC algo used for handling non-idle periods (so that
this is an orthogonal knob).  Linux also makes this an orthogonal knob rather
than requiring a separate CC algo.


If everything Linux does is good, then go ahead and commit it.  Discussing
this change further then is pointless.  I don't mind too much and I have
stated my case why I think it's the wrong thing to do.

I would prefer to encapsulate it into its own not-so-much-congestion-management
algorithm so you can eventually do other tweaks as well like more aggressive
loss recovery which would fit your objective as well.  Since you have to modify
your app anyway to do the sockopt call, this seems a more complete solution to
me.  At least better than to do a non-portable hack that violates one of the
most fundamental TCP concepts.

--
Andre



Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-02-05 Thread John Baldwin
On Tuesday, February 05, 2013 12:44:27 pm Andre Oppermann wrote:
 On 05.02.2013 18:11, John Baldwin wrote:
  On Wednesday, January 30, 2013 12:26:17 pm Andre Oppermann wrote:
  You can simply create your own congestion control algorithm with only the
  restart window changed.  See (pseudo) code below.  BTW, I just noticed that
  the other cc algos do not reset the idle window.
 
  *sigh*  I am fully competent at maintaining my own local changes.  The point
  was to share this so that other people with similar workloads could make use
  of it.  Also, a custom CC algo is not the right approach as we would want 
  this
  change regardless of the CC algo used for handling non-idle periods (so that
  this is an orthogonal knob).  Linux also makes this an orthogonal knob 
  rather
  than requiring a separate CC algo.
 
 If everything Linux does is good, then go ahead and commit it.  Discussing
 this change further then is pointless.  I don't mind too much and I have
 stated my case why I think it's the wrong thing to do.

Not everything Linux does is good, nor is everything Linux does bad.

 I would prefer to encapsulate it into its own 
 not-so-much-congestion-management
 algorithm so you can eventually do other tweaks as well like more aggressive
 loss recovery which would fit your objective as well.  Since you have to 
 modify
 your app anyway to do the sockopt call, this seems a more complete solution to
 me.  At least better than to do a non-portable hack that violates one of the
 most fundamental TCP concepts.

This is real rich from the guy pushing the increased IW that came from Linux. :)

Tools not policy yadda yadda, but I digress.

-- 
John Baldwin


Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-01-30 Thread John Baldwin
On Tuesday, January 29, 2013 6:07:22 pm Andre Oppermann wrote:
 On 29.01.2013 19:50, John Baldwin wrote:
  On Thursday, January 24, 2013 11:14:40 am John Baldwin wrote:
  Agree, per-socket option could be useful than global sysctls under
  certain situation.  However, in addition to the per-socket option,
  could global sysctl nodes to disable idle_restart/idle_cwv help too?
 
  No.  This is far too dangerous once it makes it into some tuning guide.
  The threat of congestion breakdown is real.  The Internet, or any packet
  network, can only survive in the long term if almost all follow the rules
  and self-constrain to remain fair to the others.  What would happen if
  nobody would respect the traffic lights anymore?
 
  The problem with this argument is Linux has already had this as a tunable
  option for years and the Internet hasn't melted as a result.
 
  Since this seems to be a burning issue I'll come up with a patch in the
  next days to add a decaying restartCWND that'll be fair and allow a very
  quick ramp up if no loss occurs.
 
  I think this could be useful.  OTOH, I still think the TCP_IGNOREIDLE 
  option
  is useful both with and without a decaying restartCWND?
 
  *ping*
 
  Andre, do you object to adding the new socket option?
 
 Yes, unfortunately I do object.  This option, combined with the inflated
 CWND at the end of a burst, effectively removes much, if not all, of the
 congestion control mechanisms originally put in place to allow multiple
  [TCP] streams to co-exist on the same pipe.  Not having any decay or timeout
 makes it even worse by doing this burst after an arbitrary amount of time
 when network conditions and the congestion situation have certainly changed.

You have completely ignored the fact that Linux has had this as a global
option for years and the Internet has not melted.  A socket option is far more
fine-grained than their tunable (and requires code changes, not something a
random sysadmin can just toggle as tuning).

 The primary principle of TCP is to be cooperative with competing streams and
 fairly share bandwidth on a given link.  Whenever the ACK clock came to a
 halt for some time we must re-probe (slowstart from a restartCWND) the link
 to compensate for our lack of knowledge of the current link and congestion
 situation.  Doing that with a decay function and floor equaling the IW (10
 segments nowadays) gives a rapid ramp up especially on LAN RTTs while avoiding
 a blind burst and subsequent loss cycle.

I understand all that, but it isn't applicable to my use case.  I'm not sharing
the bandwidth with anyone but other connections of my own (and they are all
lower priority than this one).  Also, I have idle periods of hundreds of
milliseconds (larger than an RTT on this cross-continental link that also has
high bandwidth), so it seems that even a decayed restartCWND will be useless to
me as it will have decayed down to nothing before I finally restart after long
idle periods.

 If you absolutely know that you're the only one on that network and you want
 pure wirespeed then a TCP cc_null module doing away with all congestion 
 control
 may be the right answer.  The infrastructure is in place and it can be 
 selected
 per socket.  Plus it can be loaded as a module and thus doesn't have to be 
 part
 of the base system.

No, I do not think that doing away with all congestion control will work for
my case.  Even though we have a dedicated line, etc. that doesn't mean
congestion is impossible and that I don't want the normal feedback to apply
during the non-restart cases.  BTW, I looked at using alternate congestion
control algorithms (cc_cubic and some of the others) first before resorting to
adding this option and they either did not fix the issue or were buggy.

-- 
John Baldwin


Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-01-30 Thread Alfred Perlstein

On 1/30/13 11:58 AM, John Baldwin wrote:

On Tuesday, January 29, 2013 6:07:22 pm Andre Oppermann wrote:


Yes, unfortunately I do object.  This option, combined with the inflated
CWND at the end of a burst, effectively removes much, if not all, of the
congestion control mechanisms originally put in place to allow multiple
[TCP] streams to co-exist on the same pipe.  Not having any decay or timeout
makes it even worse by doing this burst after an arbitrary amount of time
when network conditions and the congestion situation have certainly changed.

You have completely ignored the fact that Linux has had this as a global
option for years and the Internet has not melted.  A socket option is far more
fine-grained than their tunable (and requires code changes, not something a
random sysadmin can just toggle as tuning).


I agree with John here.

While Andre's objection makes sense, since the majority of Linux/Unix 
hosts now have this as a global option I can't think of why you would 
force FreeBSD to be a final holdout.


-Alfred


Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-01-30 Thread Andre Oppermann

On 30.01.2013 17:58, John Baldwin wrote:

On Tuesday, January 29, 2013 6:07:22 pm Andre Oppermann wrote:

On 29.01.2013 19:50, John Baldwin wrote:

On Thursday, January 24, 2013 11:14:40 am John Baldwin wrote:

Agree, per-socket option could be useful than global sysctls under
certain situation.  However, in addition to the per-socket option,
could global sysctl nodes to disable idle_restart/idle_cwv help too?


No.  This is far too dangerous once it makes it into some tuning guide.
The threat of congestion breakdown is real.  The Internet, or any packet
network, can only survive in the long term if almost all follow the rules
and self-constrain to remain fair to the others.  What would happen if
nobody would respect the traffic lights anymore?


The problem with this argument is Linux has already had this as a tunable
option for years and the Internet hasn't melted as a result.


Since this seems to be a burning issue I'll come up with a patch in the
next days to add a decaying restartCWND that'll be fair and allow a very
quick ramp up if no loss occurs.


I think this could be useful.  OTOH, I still think the TCP_IGNOREIDLE option
is useful both with and without a decaying restartCWND?


*ping*

Andre, do you object to adding the new socket option?


Yes, unfortunately I do object.  This option, combined with the inflated
CWND at the end of a burst, effectively removes much, if not all, of the
congestion control mechanisms originally put in place to allow multiple
[TCP] streams to co-exist on the same pipe.  Not having any decay or timeout
makes it even worse by doing this burst after an arbitrary amount of time
when network conditions and the congestion situation have certainly changed.


You have completely ignored the fact that Linux has had this as a global
option for years and the Internet has not melted.


Sure.  A friend of mine does free climbing and he hasn't crashed yet.
He also runs all filesystems async with disk write cache enabled, no
backup and hasn't lost a file yet.  ;-)


A socket option is far more
fine-grained than their tunable (and requires code changes, not something a
random sysadmin can just toggle as tuning).


Agreed that a socket option is much more difficult to use.


The primary principle of TCP is to be cooperative with competing streams and
fairly share bandwidth on a given link.  Whenever the ACK clock came to a
halt for some time we must re-probe (slowstart from a restartCWND) the link
to compensate for our lack of knowledge of the current link and congestion
situation.  Doing that with a decay function and floor equaling the IW (10
segments nowadays) gives a rapid ramp up especially on LAN RTTs while avoiding
a blind burst and subsequent loss cycle.


I understand all that, but it isn't applicable to my use case.  I'm not sharing
the bandwidth with anyone but other connections of my own (and they are all
lower priority than this one).  Also, I have idle periods of hundreds of
milliseconds (larger than an RTT on this cross-continental link that also has
high bandwidth), so it seems that even a decayed restartCWND will be useless to
me as it will have decayed down to nothing before I finally restart after long
idle periods.


OK.


If you absolutely know that you're the only one on that network and you want
pure wirespeed then a TCP cc_null module doing away with all congestion control
may be the right answer.  The infrastructure is in place and it can be selected
per socket.  Plus it can be loaded as a module and thus doesn't have to be part
of the base system.


No, I do not think that doing away with all congestion control will work for
my case.  Even though we have a dedicated line, etc. that doesn't mean
congestion is impossible and that I don't want the normal feedback to apply
during the non-restart cases.  BTW, I looked at using alternate congestion
control algorithms (cc_cubic and some of the others) first before resorting to
adding this option and they either did not fix the issue or were buggy.


You can simply create your own congestion control algorithm with only the
restart window changed.  See (pseudo) code below.  BTW, I just noticed that
the other cc algos do not reset the idle window.

--
Andre

/* boilerplate from netinet/cc/cc_newreno.c here. */

static void
jhb_after_idle(struct cc_var *ccv)
{
        /* Deliberately keep snd_cwnd unchanged across idle periods. */
        return;
}

struct cc_algo jhb_cc_algo = {
        .name = "jhb_restartCWND",      /* must fit in TCP_CA_NAME_MAX */
        .ack_received = newreno_ack_received,
        .after_idle = jhb_after_idle,
        .cong_signal = newreno_cong_signal,
        .post_recovery = newreno_post_recovery,
};

DECLARE_CC_MODULE(jhb_restartCWND, &jhb_cc_algo);



Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-01-30 Thread Andre Oppermann

On 30.01.2013 18:11, Alfred Perlstein wrote:

On 1/30/13 11:58 AM, John Baldwin wrote:

On Tuesday, January 29, 2013 6:07:22 pm Andre Oppermann wrote:


Yes, unfortunately I do object.  This option, combined with the inflated
CWND at the end of a burst, effectively removes much, if not all, of the
congestion control mechanisms originally put in place to allow multiple
[TCP] streams to co-exist on the same pipe.  Not having any decay or timeout
makes it even worse by doing this burst after an arbitrary amount of time
when network conditions and the congestion situation have certainly changed.

You have completely ignored the fact that Linux has had this as a global
option for years and the Internet has not melted.  A socket option is far more
fine-grained than their tunable (and requires code changes, not something a
random sysadmin can just toggle as tuning).


I agree with John here.

While Andre's objection makes sense, since the majority of Linux/Unix hosts now 
have this as a
global option I can't think of why you would force FreeBSD to be a final 
holdout.


Unless OpenBSD, NetBSD, Solaris/Illumos also support this it is hardly a
majority of Linux/Unix hosts.  And this isn't something a sysadmin should
tune at all.

--
Andre



Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-01-30 Thread Alfred Perlstein

On 1/30/13 12:29 PM, Andre Oppermann wrote:

On 30.01.2013 18:11, Alfred Perlstein wrote:

On 1/30/13 11:58 AM, John Baldwin wrote:

On Tuesday, January 29, 2013 6:07:22 pm Andre Oppermann wrote:


Yes, unfortunately I do object.  This option, combined with the inflated
CWND at the end of a burst, effectively removes much, if not all, of the
congestion control mechanisms originally put in place to allow multiple
[TCP] streams to co-exist on the same pipe.  Not having any decay or
timeout makes it even worse by doing this burst after an arbitrary amount
of time when network conditions and the congestion situation have
certainly changed.

You have completely ignored the fact that Linux has had this as a global
option for years and the Internet has not melted.  A socket option is far
more fine-grained than their tunable (and requires code changes, not
something a random sysadmin can just toggle as tuning).


I agree with John here.

While Andre's objection makes sense, since the majority of Linux/Unix 
hosts now have this as a
global option I can't think of why you would force FreeBSD to be a 
final holdout.


Unless OpenBSD, NetBSD, Solaris/Illumos also support this it is hardly a
majority of Linux/Unix hosts.  And this isn't something a sysadmin should
tune at all.

My apologies, I should have been clearer.  I was speaking of the majority
of the install base, not the majority of distros.


-Alfred


Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-01-29 Thread John Baldwin
On Thursday, January 24, 2013 11:14:40 am John Baldwin wrote:
   Agree, a per-socket option could be more useful than global sysctls in
   certain situations.  However, in addition to the per-socket option,
   could global sysctl nodes to disable idle_restart/idle_cwv help too?
  
  No.  This is far too dangerous once it makes it into some tuning guide.
  The threat of congestion breakdown is real.  The Internet, or any packet
  network, can only survive in the long term if almost all follow the rules
  and self-constrain to remain fair to the others.  What would happen if
  nobody would respect the traffic lights anymore?
 
 The problem with this argument is Linux has already had this as a tunable
 option for years and the Internet hasn't melted as a result.
  
  Since this seems to be a burning issue I'll come up with a patch in the
  next days to add a decaying restartCWND that'll be fair and allow a very
  quick ramp up if no loss occurs.
 
 I think this could be useful.  OTOH, I still think the TCP_IGNOREIDLE option
 is useful both with and without a decaying restartCWND?

*ping*

Andre, do you object to adding the new socket option?

-- 
John Baldwin


Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-01-29 Thread Andre Oppermann

On 29.01.2013 19:50, John Baldwin wrote:

On Thursday, January 24, 2013 11:14:40 am John Baldwin wrote:

Agree, a per-socket option could be more useful than global sysctls in
certain situations.  However, in addition to the per-socket option,
could global sysctl nodes to disable idle_restart/idle_cwv help too?


No.  This is far too dangerous once it makes it into some tuning guide.
The threat of congestion breakdown is real.  The Internet, or any packet
network, can only survive in the long term if almost all follow the rules
and self-constrain to remain fair to the others.  What would happen if
nobody would respect the traffic lights anymore?


The problem with this argument is Linux has already had this as a tunable
option for years and the Internet hasn't melted as a result.


Since this seems to be a burning issue I'll come up with a patch in the
next days to add a decaying restartCWND that'll be fair and allow a very
quick ramp up if no loss occurs.


I think this could be useful.  OTOH, I still think the TCP_IGNOREIDLE option
is useful both with and without a decaying restartCWND?


*ping*

Andre, do you object to adding the new socket option?


Yes, unfortunately I do object.  This option, combined with the inflated
CWND at the end of a burst, effectively removes much, if not all, of the
congestion control mechanisms originally put in place to allow multiple
[TCP] streams to co-exist on the same pipe.  Not having any decay or timeout
makes it even worse by doing this burst after an arbitrary amount of time
when network conditions and the congestion situation have certainly changed.

The primary principle of TCP is to be cooperative with competing streams and
fairly share bandwidth on a given link.  Whenever the ACK clock came to a
halt for some time we must re-probe (slowstart from a restartCWND) the link
to compensate for our lack of knowledge of the current link and congestion
situation.  Doing that with a decay function and floor equaling the IW (10
segments nowadays) gives a rapid ramp up especially on LAN RTTs while avoiding
a blind burst and subsequent loss cycle.

If you absolutely know that you're the only one on that network and you want
pure wirespeed then a TCP cc_null module doing away with all congestion control
may be the right answer.  The infrastructure is in place and it can be selected
per socket.  Plus it can be loaded as a module and thus doesn't have to be part
of the base system.

I'm currently re-emerging from the startup and auto-scaling rabbit-hole,
finishing things up, and will post patches for review shortly.

After that I'm looking after the restartCWND issue.  A first quick patch
(untested) to update the restartCWND to the IW is below.

--
Andre

$ svn diff netinet/cc/cc_newreno.c
Index: netinet/cc/cc_newreno.c
===================================================================
--- netinet/cc/cc_newreno.c (revision 246082)
+++ netinet/cc/cc_newreno.c (working copy)
@@ -166,12 +166,21 @@
 	 *
 	 * See RFC5681 Section 4.1. Restarting Idle Connections.
 	 */
-	if (V_tcp_do_rfc3390)
+	if (V_tcp_do_initcwnd10)
+		rw = min(10 * CCV(ccv, t_maxseg),
+		    max(2 * CCV(ccv, t_maxseg), 14600));
+	else if (V_tcp_do_rfc3390)
 		rw = min(4 * CCV(ccv, t_maxseg),
 		    max(2 * CCV(ccv, t_maxseg), 4380));
-	else
-		rw = CCV(ccv, t_maxseg) * 2;
-
+	else {
+		/* Per RFC5681 Section 3.1 */
+		if (CCV(ccv, t_maxseg) > 2190)
+			rw = 2 * CCV(ccv, t_maxseg);
+		else if (CCV(ccv, t_maxseg) > 1095)
+			rw = 3 * CCV(ccv, t_maxseg);
+		else
+			rw = 4 * CCV(ccv, t_maxseg);
+	}
 	CCV(ccv, snd_cwnd) = min(rw, CCV(ccv, snd_cwnd));
 }




Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-01-24 Thread Andre Oppermann

On 24.01.2013 03:31, Sepherosa Ziehau wrote:

On Thu, Jan 24, 2013 at 12:15 AM, John Baldwin j...@freebsd.org wrote:

On Wednesday, January 23, 2013 1:33:27 am Sepherosa Ziehau wrote:

On Wed, Jan 23, 2013 at 4:11 AM, John Baldwin j...@freebsd.org wrote:

As I mentioned in an earlier thread, I recently had to debug an issue we were
seeing across a link with a high bandwidth-delay product (both high bandwidth
and high RTT).  Our specific use case was to use a TCP connection to reliably
forward a latency-sensitive datagram stream across a WAN connection.  We would
often see spikes in the latency of individual datagrams.  I eventually tracked
this down to the connection entering slow start when it would transmit data
after being idle.  The data stream was quite bursty and would often attempt to
transmit a burst of data after being idle for far longer than a retransmit
timeout.

In 7.x we had worked around this in the past by disabling RFC 3390 and jacking
the slow start window size up via a sysctl.  On 8.x this no longer worked.
The solution I came up with was to add a new socket option to disable idle
handling completely.  That is, when an idle connection restarts with this new
option enabled, it keeps its current congestion window and doesn't enter slow
start.
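
Mechanically it is a small change: the idle-restart check in tcp_output()
just grows one more condition, roughly as sketched below (TF_IGNOREIDLE
being the proposed flag, not one in stock FreeBSD):

        /* Sketch of tcp_output()'s idle check with the proposed flag. */
        idle = (tp->t_flags & TF_LASTIDLE) || (tp->snd_max == tp->snd_una);
        if (idle && ticks - tp->t_rcvtime >= tp->t_rxtcur &&
            !(tp->t_flags & TF_IGNOREIDLE))
                cc_after_idle(tp);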

There are only a few cases where such an option is useful, but if anyone else
thinks this might be useful I'd be happy to add the option to FreeBSD.


I think what you need is RFC2861; however, you probably should
ignore the application-limited-period part of RFC2861.


Hummm.  It appears, btw, that Linux uses RFC 2861, but has a global knob to
disable it due to applications having problems.  When it is disabled,
it doesn't decay the congestion window at all during idle handling.  That is,
it appears to act the same as if TCP_IGNOREIDLE were enabled.

 From http://www.kernel.org/doc/man-pages/online/pages/man7/tcp.7.html:

tcp_slow_start_after_idle (Boolean; default: enabled; since Linux 2.6.18)
       If enabled, provide RFC 2861 behavior and time out the congestion
       window after an idle period.  An idle period is defined as the
       current RTO (retransmission timeout).  If disabled, the congestion
       window will not be timed out after an idle period.

Also, in this thread on tcpm it appears no one on that list realizes that
there are any implementations which follow the SHOULD in RFC 2581 for idle
handling (which is what we do currently):


Nah, I don't think the idle detection in FreeBSD follows
RFC2581/RFC5681 4.1 (the paragraph before the SHOULD).  IMHO, that's
probably why the author in the following email questioned the
implementation of the SHOULD in RFC2581/RFC5681.



http://www.ietf.org/mail-archive/web/tcpm/current/msg02864.html

So if we were to implement RFC 2861, the new socket option would be equivalent
to setting Linux's 'tcp_slow_start_after_idle' to false, but on a per-socket
basis rather than globally.


Agreed, a per-socket option could be more useful than global sysctls in
certain situations.  However, in addition to the per-socket option,
could global sysctl nodes to disable idle_restart/idle_cwv help too?


No.  This is far too dangerous once it makes it into some tuning guide.
The threat of congestion breakdown is real.  The Internet, or any packet
network, can only survive in the long term if almost all follow the rules
and self-constrain to remain fair to the others.  What would happen if
nobody respected the traffic lights anymore?

Besides that, bursting into unknown network conditions is very likely to
result in burst losses as well.  TCP isn't good at recovering from it.
In the end you most likely come out ahead if you decay the restartCWND.

We have two cases primarily: a) long distance, medium to high RTT, and
wildly varying bandwidth (a.k.a. the Internet); b) short distance, low
RTT and mostly plenty of bandwidth (a.k.a. Datacenter).  The former
absolutely definitely requires a decayed restartCWND.  The latter less
so, but even there bursting at 10Gig TSO-assisted wirespeed isn't going
to end well more often than not.

Since this seems to be a burning issue I'll come up with a patch in the
next few days to add a decaying restartCWND that'll be fair and allow a very
quick ramp up if no loss occurs.

--
Andre

___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org


Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-01-24 Thread John Baldwin
On Thursday, January 24, 2013 3:03:31 am Andre Oppermann wrote:
 On 24.01.2013 03:31, Sepherosa Ziehau wrote:
  On Thu, Jan 24, 2013 at 12:15 AM, John Baldwin j...@freebsd.org wrote:
  On Wednesday, January 23, 2013 1:33:27 am Sepherosa Ziehau wrote:
  On Wed, Jan 23, 2013 at 4:11 AM, John Baldwin j...@freebsd.org wrote:
  As I mentioned in an earlier thread, I recently had to debug an issue we 
  were
  seeing across a link with a high bandwidth-delay product (both high 
  bandwidth
  and high RTT).  Our specific use case was to use a TCP connection to 
  reliably
  forward a latency-sensitive datagram stream across a WAN connection.  We 
  would
  often see spikes in the latency of individual datagrams.  I eventually 
  tracked
  this down to the connection entering slow start when it would transmit 
  data
  after being idle.  The data stream was quite bursty and would often 
  attempt to
  transmit a burst of data after being idle for far longer than a 
  retransmit
  timeout.
 
  In 7.x we had worked around this in the past by disabling RFC 3390 and 
  jacking
  the slow start window size up via a sysctl.  On 8.x this no longer 
  worked.
  The solution I came up with was to add a new socket option to disable 
  idle
  handling completely.  That is, when an idle connection restarts with 
  this new
  option enabled, it keeps its current congestion window and doesn't enter 
  slow
  start.
 
  There are only a few cases where such an option is useful, but if anyone 
  else
  thinks this might be useful I'd be happy to add the option to FreeBSD.
 
  I think what you need is RFC 2861; however, you should probably
  ignore the application-limited period part of RFC 2861.
 
  Hummm.  It appears, btw, that Linux uses RFC 2861, but has a global knob to
  disable it due to applications having problems.  When it is disabled,
  it doesn't decay the congestion window at all during idle handling.  That 
  is,
  it appears to act the same as if TCP_IGNOREIDLE were enabled.
 
   From http://www.kernel.org/doc/man-pages/online/pages/man7/tcp.7.html:
 
  tcp_slow_start_after_idle (Boolean; default: enabled; since Linux 2.6.18)
         If enabled, provide RFC 2861 behavior and time out the congestion
         window after an idle period.  An idle period is defined as the
         current RTO (retransmission timeout).  If disabled, the congestion
         window will not be timed out after an idle period.
 
  Also, in this thread on tcpm it appears no one on that list realizes that
  there are any implementations which follow the SHOULD in RFC 2581 for 
  idle
  handling (which is what we do currently):
 
  Nah, I don't think the idle detection in FreeBSD follows
  RFC2581/RFC5681 4.1 (the paragraph before the SHOULD).  IMHO, that's
  probably why the author in the following email questioned the
  implementation of the SHOULD in RFC2581/RFC5681.
 
 
  http://www.ietf.org/mail-archive/web/tcpm/current/msg02864.html
 
  So if we were to implement RFC 2861, the new socket option would be 
  equivalent
  to setting Linux's 'tcp_slow_start_after_idle' to false, but on a 
  per-socket
  basis rather than globally.
 
  Agreed, a per-socket option could be more useful than global sysctls in
  certain situations.  However, in addition to the per-socket option,
  could global sysctl nodes to disable idle_restart/idle_cwv help too?
 
 No.  This is far too dangerous once it makes it into some tuning guide.
 The threat of congestion breakdown is real.  The Internet, or any packet
 network, can only survive in the long term if almost all follow the rules
 and self-constrain to remain fair to the others.  What would happen if
 nobody respected the traffic lights anymore?

The problem with this argument is Linux has already had this as a tunable
option for years and the Internet hasn't melted as a result.
 
 Besides that, bursting into unknown network conditions is very likely to
 result in burst losses as well.  TCP isn't good at recovering from it.
 In the end you most likely come out ahead if you decay the restartCWND.
 
 We have two cases primarily: a) long distance, medium to high RTT, and
 wildly varying bandwidth (a.k.a. the Internet); b) short distance, low
 RTT and mostly plenty of bandwidth (a.k.a. Datacenter).  The former
 absolutely definitely requires a decayed restartCWND.  The latter less
 so, but even there bursting at 10Gig TSO-assisted wirespeed isn't going
 to end well more often than not.

You forgot my case: c) dedicated long distance links with high bandwidth.

 Since this seems to be a burning issue I'll come up with a patch in the
 next few days to add a decaying restartCWND that'll be fair and allow a very
 quick ramp up if no loss occurs.

I think this could be useful.  OTOH, I still think the TCP_IGNOREIDLE option
is useful both with and without a decaying restartCWND?

-- 
John Baldwin
___

Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-01-24 Thread Alfred Perlstein

On 1/24/13 11:14 AM, John Baldwin wrote:

On Thursday, January 24, 2013 3:03:31 am Andre Oppermann wrote:

On 24.01.2013 03:31, Sepherosa Ziehau wrote:

On Thu, Jan 24, 2013 at 12:15 AM, John Baldwin j...@freebsd.org wrote:

On Wednesday, January 23, 2013 1:33:27 am Sepherosa Ziehau wrote:

On Wed, Jan 23, 2013 at 4:11 AM, John Baldwin j...@freebsd.org wrote:

As I mentioned in an earlier thread, I recently had to debug an issue we were
seeing across a link with a high bandwidth-delay product (both high bandwidth
and high RTT).  Our specific use case was to use a TCP connection to reliably
forward a latency-sensitive datagram stream across a WAN connection.  We would
often see spikes in the latency of individual datagrams.  I eventually tracked
this down to the connection entering slow start when it would transmit data
after being idle.  The data stream was quite bursty and would often attempt to
transmit a burst of data after being idle for far longer than a retransmit
timeout.

In 7.x we had worked around this in the past by disabling RFC 3390 and jacking
the slow start window size up via a sysctl.  On 8.x this no longer worked.
The solution I came up with was to add a new socket option to disable idle
handling completely.  That is, when an idle connection restarts with this new
option enabled, it keeps its current congestion window and doesn't enter slow
start.

There are only a few cases where such an option is useful, but if anyone else
thinks this might be useful I'd be happy to add the option to FreeBSD.

I think what you need is RFC 2861; however, you should probably
ignore the application-limited period part of RFC 2861.

Hummm.  It appears, btw, that Linux uses RFC 2861, but has a global knob to
disable it due to applications having problems.  When it is disabled,
it doesn't decay the congestion window at all during idle handling.  That is,
it appears to act the same as if TCP_IGNOREIDLE were enabled.

  From http://www.kernel.org/doc/man-pages/online/pages/man7/tcp.7.html:

 tcp_slow_start_after_idle (Boolean; default: enabled; since Linux 2.6.18)
        If enabled, provide RFC 2861 behavior and time out the congestion
        window after an idle period.  An idle period is defined as the
        current RTO (retransmission timeout).  If disabled, the congestion
        window will not be timed out after an idle period.

Also, in this thread on tcpm it appears no one on that list realizes that
there are any implementations which follow the SHOULD in RFC 2581 for idle
handling (which is what we do currently):

Nah, I don't think the idle detection in FreeBSD follows
RFC2581/RFC5681 4.1 (the paragraph before the SHOULD).  IMHO, that's
probably why the author in the following email questioned the
implementation of the SHOULD in RFC2581/RFC5681.


http://www.ietf.org/mail-archive/web/tcpm/current/msg02864.html

So if we were to implement RFC 2861, the new socket option would be equivalent
to setting Linux's 'tcp_slow_start_after_idle' to false, but on a per-socket
basis rather than globally.

Agreed, a per-socket option could be more useful than global sysctls in
certain situations.  However, in addition to the per-socket option,
could global sysctl nodes to disable idle_restart/idle_cwv help too?

No.  This is far too dangerous once it makes it into some tuning guide.
The threat of congestion breakdown is real.  The Internet, or any packet
network, can only survive in the long term if almost all follow the rules
and self-constrain to remain fair to the others.  What would happen if
nobody respected the traffic lights anymore?

The problem with this argument is Linux has already had this as a tunable
option for years and the Internet hasn't melted as a result.
  

Besides that, bursting into unknown network conditions is very likely to
result in burst losses as well.  TCP isn't good at recovering from it.
In the end you most likely come out ahead if you decay the restartCWND.

We have two cases primarily: a) long distance, medium to high RTT, and
wildly varying bandwidth (a.k.a. the Internet); b) short distance, low
RTT and mostly plenty of bandwidth (a.k.a. Datacenter).  The former
absolutely definitely requires a decayed restartCWND.  The latter less
so, but even there bursting at 10Gig TSO-assisted wirespeed isn't going
to end well more often than not.

You forgot my case: c) dedicated long distance links with high bandwidth.


Since this seems to be a burning issue I'll come up with a patch in the
next few days to add a decaying restartCWND that'll be fair and allow a very
quick ramp up if no loss occurs.

I think this could be useful.  OTOH, I still think the TCP_IGNOREIDLE option
is useful both with and without a decaying restartCWND?

Linux seems to be doing just fine with it, and has for quite a while now.
Can we get this committed?


-Alfred
___
freebsd-net@freebsd.org mailing list

Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-01-23 Thread John Baldwin
On Wednesday, January 23, 2013 1:33:27 am Sepherosa Ziehau wrote:
 On Wed, Jan 23, 2013 at 4:11 AM, John Baldwin j...@freebsd.org wrote:
  As I mentioned in an earlier thread, I recently had to debug an issue we 
  were
  seeing across a link with a high bandwidth-delay product (both high 
  bandwidth
  and high RTT).  Our specific use case was to use a TCP connection to 
  reliably
  forward a latency-sensitive datagram stream across a WAN connection.  We 
  would
  often see spikes in the latency of individual datagrams.  I eventually 
  tracked
  this down to the connection entering slow start when it would transmit data
  after being idle.  The data stream was quite bursty and would often attempt 
  to
  transmit a burst of data after being idle for far longer than a retransmit
  timeout.
 
  In 7.x we had worked around this in the past by disabling RFC 3390 and 
  jacking
  the slow start window size up via a sysctl.  On 8.x this no longer worked.
  The solution I came up with was to add a new socket option to disable idle
  handling completely.  That is, when an idle connection restarts with this 
  new
  option enabled, it keeps its current congestion window and doesn't enter 
  slow
  start.
 
  There are only a few cases where such an option is useful, but if anyone 
  else
  thinks this might be useful I'd be happy to add the option to FreeBSD.
 
 I think what you need is RFC 2861; however, you should probably
 ignore the application-limited period part of RFC 2861.

Hummm.  It appears, btw, that Linux uses RFC 2861, but has a global knob to
disable it due to applications having problems.  When it is disabled,
it doesn't decay the congestion window at all during idle handling.  That is,
it appears to act the same as if TCP_IGNOREIDLE were enabled.

From http://www.kernel.org/doc/man-pages/online/pages/man7/tcp.7.html:

   tcp_slow_start_after_idle (Boolean; default: enabled; since Linux 2.6.18)
          If enabled, provide RFC 2861 behavior and time out the congestion
          window after an idle period.  An idle period is defined as the
          current RTO (retransmission timeout).  If disabled, the congestion
          window will not be timed out after an idle period.

Also, in this thread on tcpm it appears no one on that list realizes that
there are any implementations which follow the SHOULD in RFC 2581 for idle
handling (which is what we do currently):

http://www.ietf.org/mail-archive/web/tcpm/current/msg02864.html

So if we were to implement RFC 2861, the new socket option would be equivalent
to setting Linux's 'tcp_slow_start_after_idle' to false, but on a per-socket
basis rather than globally.
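
For reference, the Linux knob in question is backed by a procfs node and can
be flipped from userland.  A minimal sketch (not from the thread; error
handling trimmed, root required):

	#include <stdio.h>

	int
	main(void)
	{
		FILE *fp = fopen(
		    "/proc/sys/net/ipv4/tcp_slow_start_after_idle", "w");

		if (fp == NULL)
			return (1);
		fputs("0\n", fp);	/* 0 = keep cwnd across idle periods */
		fclose(fp);
		return (0);
	}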

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org


Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-01-23 Thread Sepherosa Ziehau
On Thu, Jan 24, 2013 at 12:15 AM, John Baldwin j...@freebsd.org wrote:
 On Wednesday, January 23, 2013 1:33:27 am Sepherosa Ziehau wrote:
 On Wed, Jan 23, 2013 at 4:11 AM, John Baldwin j...@freebsd.org wrote:
  As I mentioned in an earlier thread, I recently had to debug an issue we 
  were
  seeing across a link with a high bandwidth-delay product (both high 
  bandwidth
  and high RTT).  Our specific use case was to use a TCP connection to 
  reliably
  forward a latency-sensitive datagram stream across a WAN connection.  We 
  would
  often see spikes in the latency of individual datagrams.  I eventually 
  tracked
  this down to the connection entering slow start when it would transmit data
  after being idle.  The data stream was quite bursty and would often 
  attempt to
  transmit a burst of data after being idle for far longer than a retransmit
  timeout.
 
  In 7.x we had worked around this in the past by disabling RFC 3390 and 
  jacking
  the slow start window size up via a sysctl.  On 8.x this no longer worked.
  The solution I came up with was to add a new socket option to disable idle
  handling completely.  That is, when an idle connection restarts with this 
  new
  option enabled, it keeps its current congestion window and doesn't enter 
  slow
  start.
 
  There are only a few cases where such an option is useful, but if anyone 
  else
  thinks this might be useful I'd be happy to add the option to FreeBSD.

 I think what you need is RFC 2861; however, you should probably
 ignore the application-limited period part of RFC 2861.

 Hummm.  It appears, btw, that Linux uses RFC 2861, but has a global knob to
 disable it due to applications having problems.  When it is disabled,
 it doesn't decay the congestion window at all during idle handling.  That is,
 it appears to act the same as if TCP_IGNOREIDLE were enabled.

 From http://www.kernel.org/doc/man-pages/online/pages/man7/tcp.7.html:

 tcp_slow_start_after_idle (Boolean; default: enabled; since Linux 2.6.18)
        If enabled, provide RFC 2861 behavior and time out the congestion
        window after an idle period.  An idle period is defined as the
        current RTO (retransmission timeout).  If disabled, the congestion
        window will not be timed out after an idle period.

 Also, in this thread on tcpm it appears no one on that list realizes that
 there are any implementations which follow the SHOULD in RFC 2581 for idle
 handling (which is what we do currently):

Nah, I don't think the idle detection in FreeBSD follows
RFC2581/RFC5681 4.1 (the paragraph before the SHOULD).  IMHO, that's
probably why the author in the following email questioned the
implementation of the SHOULD in RFC2581/RFC5681.


 http://www.ietf.org/mail-archive/web/tcpm/current/msg02864.html

 So if we were to implement RFC 2861, the new socket option would be equivalent
 to setting Linux's 'tcp_slow_start_after_idle' to false, but on a per-socket
 basis rather than globally.

Agreed, a per-socket option could be more useful than global sysctls in
certain situations.  However, in addition to the per-socket option,
could global sysctl nodes to disable idle_restart/idle_cwv help too?
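
Concretely, the kind of global knob asked about here could be a VNET sysctl
consulted by tcp_output() before it calls cc_after_idle().  A hedged sketch
(the tcp_do_idle_restart name is invented for illustration, not committed
code):

	VNET_DEFINE(int, tcp_do_idle_restart) = 1;
	#define	V_tcp_do_idle_restart	VNET(tcp_do_idle_restart)
	SYSCTL_VNET_INT(_net_inet_tcp, OID_AUTO, idle_restart, CTLFLAG_RW,
	    &VNET_NAME(tcp_do_idle_restart), 0,
	    "Reset the congestion window after an idle period");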

Best Regards,
sephe

--
Tomorrow Will Never Die
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org


[PATCH] Add a new TCP_IGNOREIDLE socket option

2013-01-22 Thread John Baldwin
As I mentioned in an earlier thread, I recently had to debug an issue we were 
seeing across a link with a high bandwidth-delay product (both high bandwidth 
and high RTT).  Our specific use case was to use a TCP connection to reliably 
forward a latency-sensitive datagram stream across a WAN connection.  We would 
often see spikes in the latency of individual datagrams.  I eventually tracked 
this down to the connection entering slow start when it would transmit data 
after being idle.  The data stream was quite bursty and would often attempt to 
transmit a burst of data after being idle for far longer than a retransmit 
timeout.

In 7.x we had worked around this in the past by disabling RFC 3390 and jacking 
the slow start window size up via a sysctl.  On 8.x this no longer worked.  
The solution I came up with was to add a new socket option to disable idle 
handling completely.  That is, when an idle connection restarts with this new 
option enabled, it keeps its current congestion window and doesn't enter slow 
start.

There are only a few cases where such an option is useful, but if anyone else 
thinks this might be useful I'd be happy to add the option to FreeBSD.

Index: share/man/man4/tcp.4
===
--- share/man/man4/tcp.4(revision 245742)
+++ share/man/man4/tcp.4(working copy)
@@ -205,6 +205,18 @@
 in the
 .Sx MIB Variables
 section further down.
+.It Dv TCP_IGNOREIDLE
+If a TCP connection is idle for more than one retransmit timeout,
+it enters slow start when new data is available to transmit.
+This avoids flooding the network with a full window of traffic at line rate.
+It also allows the connection to adjust to changes to network conditions
+that occurred while the connection was idle.  A connection that sends
+bursts of data separated by large idle periods can be permanently stuck in
+slow start as a result.
+The boolean option
+.Dv TCP_IGNOREIDLE
+disables the idle connection handling, allowing connections to maintain the
+existing congestion window when restarting after an idle period.
 .It Dv TCP_NODELAY
 Under most circumstances,
 .Tn TCP
Index: sys/netinet/tcp_var.h
===
--- sys/netinet/tcp_var.h   (revision 245742)
+++ sys/netinet/tcp_var.h   (working copy)
@@ -230,6 +230,7 @@
 #define	TF_NEEDFIN	0x000800	/* send FIN (implicit state) */
 #define	TF_NOPUSH	0x001000	/* don't push */
 #define	TF_PREVVALID	0x002000	/* saved values for bad rxmit valid */
+#define	TF_IGNOREIDLE	0x004000	/* connection is never idle */
 #define	TF_MORETOCOME	0x010000	/* More data to be appended to sock */
 #define	TF_LQ_OVERFLOW	0x020000	/* listen queue overflow */
 #define	TF_LASTIDLE	0x040000	/* connection was previously idle */
Index: sys/netinet/tcp_output.c
===
--- sys/netinet/tcp_output.c(revision 245742)
+++ sys/netinet/tcp_output.c(working copy)
@@ -206,7 +206,8 @@
 	 * to send, then transmit; otherwise, investigate further.
 	 */
 	idle = (tp->t_flags & TF_LASTIDLE) || (tp->snd_max == tp->snd_una);
-	if (idle && ticks - tp->t_rcvtime >= tp->t_rxtcur)
+	if (!(tp->t_flags & TF_IGNOREIDLE) &&
+	    idle && ticks - tp->t_rcvtime >= tp->t_rxtcur)
 		cc_after_idle(tp);
 	tp->t_flags &= ~TF_LASTIDLE;
 	if (idle) {
Index: sys/netinet/tcp.h
===
--- sys/netinet/tcp.h   (revision 245823)
+++ sys/netinet/tcp.h   (working copy)
@@ -156,6 +156,7 @@
 #define	TCP_NODELAY	1	/* don't delay send to coalesce packets */
 #if __BSD_VISIBLE
 #define	TCP_MAXSEG	2	/* set maximum segment size */
+#define	TCP_IGNOREIDLE	3	/* disable idle connection handling */
 #define	TCP_NOPUSH	4	/* don't push last block of write */
 #define	TCP_NOOPT	8	/* don't use TCP options */
 #define	TCP_MD5SIG	16	/* use MD5 digests (RFC2385) */
Index: sys/netinet/tcp_usrreq.c
===
--- sys/netinet/tcp_usrreq.c(revision 245742)
+++ sys/netinet/tcp_usrreq.c(working copy)
@@ -1354,6 +1354,7 @@
 
 	case TCP_NODELAY:
 	case TCP_NOOPT:
+	case TCP_IGNOREIDLE:
 		INP_WUNLOCK(inp);
 		error = sooptcopyin(sopt, &optval, sizeof optval,
 		    sizeof optval);
@@ -1368,6 +1369,9 @@
 		case TCP_NOOPT:
 			opt = TF_NOOPT;
 			break;
+		case TCP_IGNOREIDLE:
+			opt = TF_IGNOREIDLE;
+			break;
 		default:
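
For completeness, an application would request the proposed behaviour like any
other boolean TCP option.  A hedged usage sketch (TCP_IGNOREIDLE exists only
on a kernel carrying this patch):

	#include <sys/socket.h>
	#include <netinet/in.h>
	#include <netinet/tcp.h>
	#include <stdio.h>

	static int
	set_ignoreidle(int sock)
	{
		int one = 1;

		if (setsockopt(sock, IPPROTO_TCP, TCP_IGNOREIDLE,
		    &one, sizeof(one)) == -1) {
			perror("setsockopt(TCP_IGNOREIDLE)");
			return (-1);
		}
		return (0);
	}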
 

Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-01-22 Thread Alfred Perlstein

On 1/22/13 12:11 PM, John Baldwin wrote:

As I mentioned in an earlier thread, I recently had to debug an issue we were
seeing across a link with a high bandwidth-delay product (both high bandwidth
and high RTT).  Our specific use case was to use a TCP connection to reliably
forward a latency-sensitive datagram stream across a WAN connection.  We would
often see spikes in the latency of individual datagrams.  I eventually tracked
this down to the connection entering slow start when it would transmit data
after being idle.  The data stream was quite bursty and would often attempt to
transmit a burst of data after being idle for far longer than a retransmit
timeout.

In 7.x we had worked around this in the past by disabling RFC 3390 and jacking
the slow start window size up via a sysctl.  On 8.x this no longer worked.
The solution I came up with was to add a new socket option to disable idle
handling completely.  That is, when an idle connection restarts with this new
option enabled, it keeps its current congestion window and doesn't enter slow
start.

There are only a few cases where such an option is useful, but if anyone else
thinks this might be useful I'd be happy to add the option to FreeBSD.


This looks good, but it almost sounds like a bug for TCP to be doing 
this anyhow.


Why would one want this behavior?

Wouldn't it make sense to keep the window large until there was a 
problem rather than unconditionally chop it down?  I almost think TCP is 
afraid that you might wind up swapping out a 10gig interface for a 
modem?  I'm just not getting it.  (probably simple oversight on my part).


What do you think about also making this a sysctl for global on/off by 
default?


-Alfred



Index: share/man/man4/tcp.4
===
--- share/man/man4/tcp.4(revision 245742)
+++ share/man/man4/tcp.4(working copy)
@@ -205,6 +205,18 @@
  in the
  .Sx MIB Variables
  section further down.
+.It Dv TCP_IGNOREIDLE
+If a TCP connection is idle for more than one retransmit timeout,
+it enters slow start when new data is available to transmit.
+This avoids flooding the network with a full window of traffic at line rate.
+It also allows the connection to adjust to changes to network conditions
+that occurred while the connection was idle.  A connection that sends
+bursts of data separated by large idle periods can be permanently stuck in
+slow start as a result.
+The boolean option
+.Dv TCP_IGNOREIDLE
+disables the idle connection handling, allowing connections to maintain the
+existing congestion window when restarting after an idle period.
  .It Dv TCP_NODELAY
  Under most circumstances,
  .Tn TCP
Index: sys/netinet/tcp_var.h
===
--- sys/netinet/tcp_var.h   (revision 245742)
+++ sys/netinet/tcp_var.h   (working copy)
@@ -230,6 +230,7 @@
 #define	TF_NEEDFIN	0x000800	/* send FIN (implicit state) */
 #define	TF_NOPUSH	0x001000	/* don't push */
 #define	TF_PREVVALID	0x002000	/* saved values for bad rxmit valid */
+#define	TF_IGNOREIDLE	0x004000	/* connection is never idle */
 #define	TF_MORETOCOME	0x010000	/* More data to be appended to sock */
 #define	TF_LQ_OVERFLOW	0x020000	/* listen queue overflow */
 #define	TF_LASTIDLE	0x040000	/* connection was previously idle */
Index: sys/netinet/tcp_output.c
===
--- sys/netinet/tcp_output.c(revision 245742)
+++ sys/netinet/tcp_output.c(working copy)
@@ -206,7 +206,8 @@
 	 * to send, then transmit; otherwise, investigate further.
 	 */
 	idle = (tp->t_flags & TF_LASTIDLE) || (tp->snd_max == tp->snd_una);
-	if (idle && ticks - tp->t_rcvtime >= tp->t_rxtcur)
+	if (!(tp->t_flags & TF_IGNOREIDLE) &&
+	    idle && ticks - tp->t_rcvtime >= tp->t_rxtcur)
 		cc_after_idle(tp);
 	tp->t_flags &= ~TF_LASTIDLE;
 	if (idle) {
Index: sys/netinet/tcp.h
===
--- sys/netinet/tcp.h   (revision 245823)
+++ sys/netinet/tcp.h   (working copy)
@@ -156,6 +156,7 @@
 #define	TCP_NODELAY	1	/* don't delay send to coalesce packets */
 #if __BSD_VISIBLE
 #define	TCP_MAXSEG	2	/* set maximum segment size */
+#define	TCP_IGNOREIDLE	3	/* disable idle connection handling */
 #define	TCP_NOPUSH	4	/* don't push last block of write */
 #define	TCP_NOOPT	8	/* don't use TCP options */
 #define	TCP_MD5SIG	16	/* use MD5 digests (RFC2385) */
Index: sys/netinet/tcp_usrreq.c
===
--- sys/netinet/tcp_usrreq.c(revision 245742)
+++ sys/netinet/tcp_usrreq.c(working copy)
@@ -1354,6 +1354,7 @@
  
  		case TCP_NODELAY:

case 

Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-01-22 Thread Andre Oppermann

On 22.01.2013 21:35, Alfred Perlstein wrote:

On 1/22/13 12:11 PM, John Baldwin wrote:

As I mentioned in an earlier thread, I recently had to debug an issue we were
seeing across a link with a high bandwidth-delay product (both high bandwidth
and high RTT).  Our specific use case was to use a TCP connection to reliably
forward a latency-sensitive datagram stream across a WAN connection.  We would
often see spikes in the latency of individual datagrams.  I eventually tracked
this down to the connection entering slow start when it would transmit data
after being idle.  The data stream was quite bursty and would often attempt to
transmit a burst of data after being idle for far longer than a retransmit
timeout.

In 7.x we had worked around this in the past by disabling RFC 3390 and jacking
the slow start window size up via a sysctl.  On 8.x this no longer worked.
The solution I came up with was to add a new socket option to disable idle
handling completely.  That is, when an idle connection restarts with this new
option enabled, it keeps its current congestion window and doesn't enter slow
start.

There are only a few cases where such an option is useful, but if anyone else
thinks this might be useful I'd be happy to add the option to FreeBSD.


This looks good, but it almost sounds like a bug for TCP to be doing this 
anyhow.


It's not a bug.  It's by design.  It's required by the RFC.


Why would one want this behavior?


Network conditions change all the time.  Traffic and congestion come and go.
Connections can go idle for milliseconds to minutes to hours.  Whenever enough
time has passed, network capacity probing has to start anew.


Wouldn't it make sense to keep the window large until there was a problem
rather than unconditionally chop it down?  I almost think TCP is afraid that
you might wind up swapping out a 10gig interface for a modem?  I'm just not
getting it.  (probably simple oversight on my part).


The very real fear is congestion meltdown.  That is the reason we ended up with
TCP's AIMD mechanism in the first place.  If everybody were to blast into the
network, everyone would suffer.  The recently identified bufferbloat issue
makes things even worse.


What do you think about also making this a sysctl for global on/off by default?


Please don't.  The correct fix is either a) to use the initial window as the
restart window (up to 10 MSS nowadays), or b) to use a decay mechanism based
on the time since the last network condition probe.  Even the latter must
decay to initCWND within at most 1 MSL.

--
Andre

___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org


Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-01-22 Thread John Baldwin
On Tuesday, January 22, 2013 3:35:40 pm Alfred Perlstein wrote:
 On 1/22/13 12:11 PM, John Baldwin wrote:
  As I mentioned in an earlier thread, I recently had to debug an issue we 
  were
  seeing across a link with a high bandwidth-delay product (both high 
  bandwidth
  and high RTT).  Our specific use case was to use a TCP connection to 
  reliably
  forward a latency-sensitive datagram stream across a WAN connection.  We 
  would
  often see spikes in the latency of individual datagrams.  I eventually 
  tracked
  this down to the connection entering slow start when it would transmit data
  after being idle.  The data stream was quite bursty and would often attempt 
  to
  transmit a burst of data after being idle for far longer than a retransmit
  timeout.
 
  In 7.x we had worked around this in the past by disabling RFC 3390 and 
  jacking
  the slow start window size up via a sysctl.  On 8.x this no longer worked.
  The solution I came up with was to add a new socket option to disable idle
  handling completely.  That is, when an idle connection restarts with this 
  new
  option enabled, it keeps its current congestion window and doesn't enter 
  slow
  start.
 
  There are only a few cases where such an option is useful, but if anyone 
  else
  thinks this might be useful I'd be happy to add the option to FreeBSD.
 
 This looks good, but it almost sounds like a bug for TCP to be doing 
 this anyhow.
 
 Why would one want this behavior?
 
 Wouldn't it make sense to keep the window large until there was a 
 problem rather than unconditionally chop it down?  I almost think TCP is 
 afraid that you might wind up swapping out a 10gig interface for a 
 modem?  I'm just not getting it.  (probably simple oversight on my part).
 
 What do you think about also making this a sysctl for global on/off by 
 default?

No, I think this is the proper default and RFC 5681 makes this a SHOULD.  The
burst at line rate argument is a very good one.  Normally if you have a stream
of data your data rate is clocked by the arrival of return ACKs (once you have
filled the window), and slow start keeps you throttled at the beginning from
flooding the pipe.  However, if your connection becomes idle then you will
accumulate a large number of ACKs and be able to spend them all at once when
you get a burst of data to send.  This burst can then use a higher effective
bandwidth than the normal flow of traffic and could overwhelm a switch.
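
To put rough, illustrative numbers on that (not from the thread): a restart
window of 100 segments of 1448 bytes is about 145 kB, which drains onto a
10 Gb/s link in roughly 116 microseconds, whereas ACK clocking would have
spread the same data over a full RTT (say 100 ms):

	#include <stdio.h>

	int
	main(void)
	{
		double cwnd = 100 * 1448.0;	/* bytes in flight after idle */
		double rate = 10e9 / 8;		/* 10 Gb/s in bytes/sec */
		double rtt = 0.100;		/* 100 ms WAN RTT */

		printf("burst: %.0f us vs. paced: %.0f ms\n",
		    cwnd / rate * 1e6, rtt * 1e3);
		return (0);
	}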

Also, for the cases where this is most useful (high RTT), it is not at all
unimaginable for network conditions to change dramatically.  In my use case we
have dedicated lines and control what goes across them so we don't have to
worry about that, but the general use case certainly needs to take that into
account.

-- 
John Baldwin
___
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org


Re: [PATCH] Add a new TCP_IGNOREIDLE socket option

2013-01-22 Thread Sepherosa Ziehau
On Wed, Jan 23, 2013 at 4:11 AM, John Baldwin j...@freebsd.org wrote:
 As I mentioned in an earlier thread, I recently had to debug an issue we were
 seeing across a link with a high bandwidth-delay product (both high bandwidth
 and high RTT).  Our specific use case was to use a TCP connection to reliably
 forward a latency-sensitive datagram stream across a WAN connection.  We would
 often see spikes in the latency of individual datagrams.  I eventually tracked
 this down to the connection entering slow start when it would transmit data
 after being idle.  The data stream was quite bursty and would often attempt to
 transmit a burst of data after being idle for far longer than a retransmit
 timeout.

 In 7.x we had worked around this in the past by disabling RFC 3390 and jacking
 the slow start window size up via a sysctl.  On 8.x this no longer worked.
 The solution I came up with was to add a new socket option to disable idle
 handling completely.  That is, when an idle connection restarts with this new
 option enabled, it keeps its current congestion window and doesn't enter slow
 start.

 There are only a few cases where such an option is useful, but if anyone else
 thinks this might be useful I'd be happy to add the option to FreeBSD.

I think what you need is RFC 2861; however, you should probably
ignore the application-limited period part of RFC 2861.

Best Regards,
sephe


 Index: share/man/man4/tcp.4
 ===
 --- share/man/man4/tcp.4(revision 245742)
 +++ share/man/man4/tcp.4(working copy)
 @@ -205,6 +205,18 @@
  in the
  .Sx MIB Variables
  section further down.
 +.It Dv TCP_IGNOREIDLE
 +If a TCP connection is idle for more than one retransmit timeout,
 +it enters slow start when new data is available to transmit.
 +This avoids flooding the network with a full window of traffic at line rate.
 +It also allows the connection to adjust to changes to network conditions
 +that occurred while the connection was idle.  A connection that sends
 +bursts of data separated by large idle periods can be permanently stuck in
 +slow start as a result.
 +The boolean option
 +.Dv TCP_IGNOREIDLE
 +disables the idle connection handling, allowing connections to maintain the
 +existing congestion window when restarting after an idle period.
  .It Dv TCP_NODELAY
  Under most circumstances,
  .Tn TCP
 Index: sys/netinet/tcp_var.h
 ===
 --- sys/netinet/tcp_var.h   (revision 245742)
 +++ sys/netinet/tcp_var.h   (working copy)
 @@ -230,6 +230,7 @@
  #define	TF_NEEDFIN	0x000800	/* send FIN (implicit state) */
  #define	TF_NOPUSH	0x001000	/* don't push */
  #define	TF_PREVVALID	0x002000	/* saved values for bad rxmit valid */
 +#define	TF_IGNOREIDLE	0x004000	/* connection is never idle */
  #define	TF_MORETOCOME	0x010000	/* More data to be appended to sock */
  #define	TF_LQ_OVERFLOW	0x020000	/* listen queue overflow */
  #define	TF_LASTIDLE	0x040000	/* connection was previously idle */
 Index: sys/netinet/tcp_output.c
 ===
 --- sys/netinet/tcp_output.c(revision 245742)
 +++ sys/netinet/tcp_output.c(working copy)
 @@ -206,7 +206,8 @@
 	 * to send, then transmit; otherwise, investigate further.
 	 */
 	idle = (tp->t_flags & TF_LASTIDLE) || (tp->snd_max == tp->snd_una);
 -	if (idle && ticks - tp->t_rcvtime >= tp->t_rxtcur)
 +	if (!(tp->t_flags & TF_IGNOREIDLE) &&
 +	    idle && ticks - tp->t_rcvtime >= tp->t_rxtcur)
 		cc_after_idle(tp);
 	tp->t_flags &= ~TF_LASTIDLE;
 	if (idle) {
 Index: sys/netinet/tcp.h
 ===
 --- sys/netinet/tcp.h   (revision 245823)
 +++ sys/netinet/tcp.h   (working copy)
 @@ -156,6 +156,7 @@
  #define	TCP_NODELAY	1	/* don't delay send to coalesce packets */
  #if __BSD_VISIBLE
  #define	TCP_MAXSEG	2	/* set maximum segment size */
 +#define	TCP_IGNOREIDLE	3	/* disable idle connection handling */
  #define	TCP_NOPUSH	4	/* don't push last block of write */
  #define	TCP_NOOPT	8	/* don't use TCP options */
  #define	TCP_MD5SIG	16	/* use MD5 digests (RFC2385) */
 Index: sys/netinet/tcp_usrreq.c
 ===
 --- sys/netinet/tcp_usrreq.c(revision 245742)
 +++ sys/netinet/tcp_usrreq.c(working copy)
 @@ -1354,6 +1354,7 @@

 	case TCP_NODELAY:
 	case TCP_NOOPT:
 +	case TCP_IGNOREIDLE:
 		INP_WUNLOCK(inp);
 		error = sooptcopyin(sopt, &optval, sizeof optval,
 		    sizeof optval);
 @@ -1368,6 +1369,9 @@