Dave,

At 04:22 28/10/2013, Dave Taht wrote:
As is probably well known at this point, I make a clear distinction
between networking problems in the data center and ones in the wild
and wooly world.

The two are converging. It will gradually become less sensible to think of the two as separate.


The data center portion of the universe is a couple hundred meters in
diameter, the other, I dunno, let's say 8x from here to the moon
(3,075,200,000m).

DCTCP was fixed ages ago for global RTTs. Download the Linux version from Stanford.


There are all sorts of things that work and are needed in the data
center that probably won't work outside it. People run straighter
cables, do microwave, use pause frames at layer 2, etc, etc, in order
to wring out the last nanosecond of performance, and certainly running
without loss is important in that world.

Not relevant. Don't be confused by the name DCTCP. It's only a few lines of code different from TCP, and it's not only relevant in data centres. It's for low queuing delay.

It's only confined to data centres because we haven't yet worked out how it co-exists with current Internet traffic - that's what we're working on.


On Sun, Oct 27, 2013 at 6:36 PM, Bob Briscoe <[email protected]> wrote:
> John, inline...
>
> At 12:16 26/10/2013, John Leslie wrote:
>>
>> Bob Briscoe <[email protected]> wrote:
>> >
>> > Exec summary
>> > * Early tests show promise that we may have found a way to make the

I'm failing to get excited, lacking a paper, testable code, and other proof.

I'll send the paper, but agreed we've only just got started, so you don't have to be excited yet. A lot more testing to do.


>> > ultra-low queuing delay of data centre TCP incrementally deployable
>> > on the public Internet

At the moment the edge of the internet is using things like cable and
gpon which have 2ms or more of inherent latency built into their
grant/response structures.

You're saying you're not interested in low queuing delay solutions because there's often fluff in badly designed access networks anyway? That sounds like a poor excuse for not aspiring to lower delay. What excuse is there for the 100ms of signalling delay on ECN packets in CoDel then? That's 50 times more than this 2ms (currently) inherent access network delay (that some of us are working to remove as well).

If you put an AQM in a host that is the bottleneck for a transfer to your media server in the next room of your house, at the same time as transfers to the wide area, you really don't want to introduce a nominal global RTT into both feedback loops.


>> > * For rtcweb, we need to address
>> >   a) cc for r-t media [rmcat w-g in progress]
>> >   b) Making TCP nicer
>> >   c) minimise ability of TCP to bloat queues [AQM w-g now in progress]
>> >   This addresses b) & c)

I do strongly feel that webrtc needs good aqm and packet scheduling in
order to succeed, and that the webrtc folk should be testing out what
has already been developed (red, sfqred, codel, fq_codel, pie) and the
aqm folk testing the webrtc code.

The AQM for DCTCP was developed 2 years before all of those (except RED). Why decide to build on CoDel in 2012, when DCTCP had already been out there and deployed for 2 years with excellent results? And tcp-rcv-cheat had been out there for 5 years.


I look forward to some interesting conference calls.

I gave the webrtc codebases a whirl against what aqm and packet
scheduling techniques we have already this past summer, the initial
results were encouraging, but I was able to easily crash most of the
browsers sooner than I could take a good, repeatable set of
measurements. That said, things over there are moving along smartly.

>> >
>> > The problem
>> > * All AQMs delay dropping for about one (hard-coded) worst-case RTT,
>> > in case a burst dissipates (allegedly a 'good queue' according to Van
>> > Jacobson)
>>
>>    This assertion is going to need a lot of support.
>>
>>    Bob is a man after my own heart suggesting that an ECN notification
>> may be sent earlier than a packet drop would be indicated. I don't know
>> if we can get there; but IMHO that is essential to getting ECN deployed
>> and used.
>>
>>    I don't think I agree with Bob that what's hard-coded is necessarily
>> a "worst-case" RTT -- and I'm quite sure I'm not willing to make any
>> pronouncement about "all AQMs".
>>
>>    I suggest the talk might be more useful if Bob outlined the AQMs
>> currently in widespread use and detailed _how_ they delay dropping
>> for an estimated RTT.
>
>
> You're right. The 'about' in my sentence was meant to indicate some leeway.
> The specifics depend on each AQM...
>
> The only AQM I know of that doesn't smooth over some nominal RTT is DCTCP
> itself.

DCTCP fits uncomfortably into the aqm category.

Oh dear. Is this Dave Taht saying "I don't like it cos it's different to tradition"?

DCTCP shows just how low you can get queuing delay even with a brain-dead simple step AQM as long as you fix the right problem (smooth in TCP, not in the AQM). Doesn't this tell you something about where the problem probably was?
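
For anyone who hasn't looked inside DCTCP, here's a rough sketch of how little there is to it (illustrative Python, not the real code; K and g are just example values):

K = 20                    # switch marking threshold, in packets (example)
g = 1.0 / 16              # sender's EWMA gain

def switch_should_mark(queue_len):
    # The 'brain-dead simple' step AQM: mark CE whenever the
    # instantaneous queue exceeds K. No averaging, no interval.
    return queue_len > K

def sender_per_rtt_update(state, pkts_acked, pkts_marked):
    # The smoothing happens here, in the sender, roughly once per RTT.
    frac = pkts_marked / float(pkts_acked)
    state['alpha'] = (1 - g) * state['alpha'] + g * frac
    if pkts_marked > 0:
        # Reduce in proportion to the smoothed marking fraction,
        # rather than halving at the first sign of congestion.
        state['cwnd'] *= (1 - state['alpha'] / 2)

All the delay-critical smoothing has moved out of the network and into the end-system, which knows its own RTT.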


>
> * CoDel was designed for 'interval' to be a worst-case (largest) RTT, which
> it recommends to be set to 100ms. Once the queue has exceeded the threshold,

Which is 5ms of delay.

That's 105ms of delay, on top of the RTT of feedback delay, before it can possibly do anything.

By far the most common case on an access link is a flow arriving when the link's empty.


> CoDel delays for time 'interval' before starting to signal congestion.

Which it then decreases until it finds something approximating the
RTT. For a system in a reasonably steady state, it will find a decent
value and stay there.

Until the queue empties (which it should, if the control system is working); then it has to start over, because the next flow may have a different RTT.
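
To spell out both halves of that (simplified from the published CoDel pseudocode, so only a sketch):

from math import sqrt

TARGET = 0.005      # 5 ms
INTERVAL = 0.100    # 100 ms

def next_signal_time(now, count):
    # While the sojourn time stays above TARGET, successive drops or
    # marks are scheduled roughly INTERVAL/sqrt(count) apart, so the
    # effective interval does home in on something RTT-like...
    return now + INTERVAL / sqrt(count)

# ...but the first signal still waits a full INTERVAL after the queue
# first exceeds TARGET, and once the queue drains and stays drained
# for a while, the count is forgotten, so the next flow's burst starts
# the 100ms wait again.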


For bursty traffic it is not necessarily helpful, and slow start is
interesting...

> - I already said how sluggish CoDel would be for a flow with a much shorter

But you've never published a measurement, unless there is one in your new paper?

Lots of experiments in the paper, but not the dynamics yet - we deliberately focused on the long-running behaviour first, to check the critical starvation issues.


> RTT than CoDel (others have made this point, e.g. for data centres :
> https://lists.bufferbloat.net/pipermail/codel/2012-August/000448.html).

You might want to read the full thread... a stumbling block on fixing
the srtt was the stochastic hashing... so now there is a full blown
non-approximated fq scheduler that implements pacing.

http://www.ietf.org/mail-archive/web/aqm/current/msg00259.html

As I understand it, this only works for an AQM on a host that has the benefit of being told the RTT from higher up the stack.


And I look forward to a certain upcoming presentation in ICCRG with
great anticipation as to additional fallout from this....

>  - And in the other direction, we already know that utilisation suffers
> fairly badly for flows with RTT significantly larger than 100ms.

"target" and "interval" have always been variables in the fq_codel and
codel codebase

It's no use having variables you don't know how to set. That was the lesson from RED that DCTCP started to solve (concerning variable RTT), and CoDel continued to solve (concerning variable line rate). However, unfortunately CoDel chose the not-invented-here route.

and ecn is supported.

tc qdisc add dev your_device root fq_codel target 500us interval 10ms ecn

AFAICT, ECN is only supported in CoDel in the sense that it is treated as equivalent to drop. The factors where ECN is not equivalent to drop have not been catered for.

And ECN certainly hasn't been considered as something to exploit beyond what drop can do. For instance, to take the approach I'm suggesting, but using CoDel rather than RED, for ECN-capable packets you would set interval to zero. Then the end-system would smooth out bursts of ECN marks instead of CoDel doing it.
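
To be concrete, something like this (a hypothetical sketch, not existing fq_codel behaviour):

TARGET = 0.005   # 5 ms

def codel_classic(pkt, sojourn, state):
    # Stand-in for the normal CoDel law, which waits out 'interval'
    # before it starts dropping.
    return 'keep'

def on_dequeue(pkt, sojourn, state):
    if pkt.get('ect'):
        # ECN-capable: signal the instant the sojourn time exceeds
        # target, i.e. interval is effectively zero. The transport
        # smooths any burst of CE marks over its own RTT.
        if sojourn > TARGET:
            pkt['ce'] = True
        return 'keep'
    # Not ECN-capable: drop is an impairment as well as a signal, so
    # keep the normal smoothed/delayed CoDel behaviour.
    return codel_classic(pkt, sojourn, state)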


I don't think the pain of adding that one line of configuration would
affect a data center deployment any. I would certainly like to see
benchmarks in a data center environment. I personally lack 10Gig hw.

I note that once you get below 1ms on a typical box today, you start
hitting other bottlenecks in (for example) the cpu scheduler.

As above, not a good reason. We have to knock down each source of delay, one after the other.


> * PIE suppresses all drops for time max_burst (set to 100ms by default) from
> when the drop probability it calculates (but doesn't necessarily use) first
> rises above zero. This is very similar to CoDel, and similar comments are
> applicable.

This too has a variable target value, although there are several other
magic constants that I don't fully understand.

It takes a few minutes of reading to understand all the PIE variables (about the same complexity as CoDel). I think PIE has the following constants (defaults in []):
* target_del (the target queuing delay [20ms] as you say)
* max_burst (has a similar role to interval in CoDel [100ms])
* deq_threshold (whether there are enough packets in the queue to measure the line rate [10KB])
* Tupdate, which determines how often the drop probability is updated (and shouldn't affect performance unless the machine can't cope with a high enough update rate)

All the other variables are dependents of these, or autotuned (in the case of alpha and beta).
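
For reference, the core of it is just this (a minimal sketch based on my reading of the draft; the 30ms Tupdate is only an example value):

TARGET_DEL = 0.020     # 20 ms
MAX_BURST  = 0.100     # 100 ms
TUPDATE    = 0.030     # update period (example value)

p = 0.0                # drop/mark probability
qdelay_old = 0.0
burst_allowance = MAX_BURST

def pie_update(qdelay, alpha, beta):
    # Called every TUPDATE. alpha and beta are the autotuned gains.
    global p, qdelay_old, burst_allowance
    p += alpha * (qdelay - TARGET_DEL) + beta * (qdelay - qdelay_old)
    p = min(max(p, 0.0), 1.0)
    qdelay_old = qdelay
    if burst_allowance > 0:
        # No drops at all until max_burst has elapsed since the
        # calculated probability first went positive - the 100ms
        # suppression I was referring to.
        burst_allowance -= TUPDATE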

I would certainly like
to see some benchmarks in a data center environment.

Just to be clear, for this conversation I'm not focused on a data centre environment - I'm making DCTCP applicable to the public Internet.

The pie code I
have for linux shoots for a target of 20ms by default, for some
reason.

> * RED requires the constant for its exponentially weighted moving average
> (w_q) to be set taking into account how many packets are likely to arrive at
> the link in a 'typical' RTT. Reverse engineering the values recommended by
> Sally Floyd in the RED paper and in her famous RED parameters Web page
> <http://www.icir.org/floyd/REDparameters.txt>, she recommended a 'typical'
> RTT of about 130ms.
>
> [BTW, I know of people who don't calculate w_q, but just use the value of
> "0.002" that Sally recommended for her 45Mb/s link in the original RED paper > simulations (and repeated at <http://www.icir.org/floyd/REDparameters.txt>).
> This was calculated assuming about 500 packets arrive at a link (from all
> flows) in a typical RTT. Links have got a lot faster since 1993.
> Nonetheless, she was considering 45Mb/s for an aggregated link in those
> days, and it happens to be about right for a single user today.]
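
For anyone who wants to check the reverse engineering above, the arithmetic is roughly this (assuming 1500-byte packets):

pkts_per_rtt = 1 / 0.002      # w_q roughly averages over ~500 packets
link_rate    = 45e6           # 45 Mb/s
pkt_bits     = 1500 * 8

implied_rtt = pkts_per_rtt * pkt_bits / link_rate
print(implied_rtt)            # ~0.133 s, i.e. the ~130ms 'typical' RTT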
>
>
>> > * For a flow with 1/10 or 1/100 of this RTT (e.g. from a CDN or your
>> > home media server), any congestion signal is delayed tens or hundreds
>> > of its own RTTs by these AQMs.

Congestion takes time to occur. Consider tcp dynamics as flows also
only increase relative to RTT, as well.

Er, no. As a queue builds, it causes delay. Regular TCP makes this worse than it has to be by using large saw-teeth, so they become a large inherent source of variable delay. VJ calls this a 'good queue'. But if a variant of TCP can remove this source of variable delay, it can't really be called good.


A core question is what do you want response time to congestion to be under?

100ms? 5ms? 1ms? 500us?

Wrong question. The answer depends on the RTT. That's the whole point. Consider that you may have been brain-washed.


>>    Clearly, RTTs differing by a factor of ten are quite common at most
>> nodes traversed in a typical path; and it seems _very_ suboptimal to

2.4 billion folk connect to the internet generally at RTTs between 20 and 80ms.

The Internet includes your home network, potentially a cache in your home gateway, I/O lanes within your own machine, etc etc. No-one "connects to the Internet at an RTT of" anything. Today, all these 2.4B folk use connections over the Internet with a very wide range of RTTs.


In a data center environment I have no data.

Typically 100us or 200us base RTT. And all that cheap data centre kit can be and will be re-used in the public Internet too.


>> have the responsibility for guessing the RTT at the node which must
>> drop packets.

I certainly favor the ongoing development of end to end cc capable of
monitoring the RTT and doing the right thing.

Good.


> For packets that do not support ECN, the dropping node has to make a guess
> at the RTT, so as not to drop packets unnecessarily, because drop is an
> impairment as well as a congestion signal. So a transport cannot 'undrop'
> packets.
>
> Our point though is that a network node doesn't have to mimic this behaviour
> for ECN packets, because ECN is not an impairment. So a transport can
> un-ECN-mark packets (by smoothing out bursts itself).
>
>
>> > * A TCP flow in slow-start doesn't need the burst smoothed anyway
>> >   - delaying the signal just makes slow-start overshoot more
>> >   - a TCP in slow-start knows that it won't allow the burst to
>> > dissipate anyway

The end of slow start is an ECN notification or a packet drop. What am
I missing here?

A traditional AQM (RED, CoDel, PIE) suppresses any signal (ECN or drop) for another 100ms or so after slow-start has pushed the queue over the delay threshold. That's on top of the feedback delay of 1 RTT, because the AQM doesn't know this is slow-start, so it waits to see if the queue goes away.

The transport knows whether it's in slow-start, so it doesn't need to do any smoothing here - it can drop straight out of slow-start without smoothing.

The transport knows when it's in congestion avoidance too, of course, then it can smooth out bursts of ECN by doing its own EWMA over the actual RTT (that it knows properly).

I.e. the AQM can't be selective about when it smooths bursts, because it doesn't know what phase each transport flow is in; but each transport does know, and therefore can.
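
In code terms, the transport-side logic we have in mind is roughly this (an illustrative sketch only; the names are made up):

class Flow:
    def __init__(self, cwnd, rtt):
        self.cwnd = cwnd
        self.rtt = rtt              # the transport knows this properly
        self.in_slow_start = True
        self.alpha = 0.0            # smoothed fraction of CE marks

    def per_rtt_ecn_update(self, frac_marked, g=1.0/16):
        # Called roughly once per RTT with the fraction of CE-marked
        # packets seen in that RTT.
        if self.in_slow_start:
            if frac_marked > 0:
                # The transport knows it caused the burst and won't let
                # it dissipate, so it exits slow-start on the very first
                # mark, with no smoothing at all.
                self.in_slow_start = False
                self.cwnd /= 2
            return
        # In congestion avoidance it smooths the marks itself, over its
        # own real RTT, then responds proportionately (DCTCP-style).
        self.alpha = (1 - g) * self.alpha + g * frac_marked
        if frac_marked > 0:
            self.cwnd *= (1 - self.alpha / 2)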


>>    A critical point! (It seems obvious to me, but is it obvious to
>> everyone?)

>>
>> > The solution: make ECN also mean "Immediate Congestion Notification"?
>> > * For ECN-capable packets, shift the job of hiding bursts from network
>> > to
>> > host:
>> >   - the network signals ECN with no smoothing delay
>> >   - then the transport can hide bursts of ECN signals from itself
>>
>>    But can we get there from here?
>>
>>    The node doing the ECN notification _can't_ know how the transport
>> will react; and the transport receiving an ECN notification can't know
>> whether the forwarding node has "smoothed" the signal. (It is truly a
>> shame we haven't left any bits for signals like this!)
>
>
> Well, we do have ECT(1) still only assigned experimentally and never used,
> which we could decide to use for this immediate ECN. However, first I want
> to see whether people think it might be feasible to just redefine the
> meaning of CE.

A lot of the NONCE logic has been discussed in other RFCs.

>
> Rationale: So few buffers have ECN support turned on anyway that we should
> be able to redefine ECN so that many more will want to turn it on.
>
> For those AQMs that already support ECN, we believe this retrospective
> change will make them only a little worse than they are already (and the
> operator can update them by simple reconfiguration anyway, and is more
> likely to do so, given these are clearly early-adopter networks).
>
>
>> >   - the transport knows
>> >     o whether it's TCP or RTP etc,
>> >     o whether its in congestion avoidance or slow-start,
>> >     o and it knows its RTT,
>> >     o so it can know whether to respond immediately or to smooth the
>> >     signals,
>> >     o and if so, over what time
>>
>>    Yes, but it can't know what smoothing may already have been applied.
>
>
> Yes. If this is a problem, we will have to consider using ECT(1) not CE.
> But it's pretty academic when so few buffers support ECN.
>
> The tiny proportion that do support ECN will already smooth by a 'typical
> RTT' of about 100ms.
>
> If a 20ms RTT flow adds smoothing over its own RTT to this, it will be
> smooth over 120ms.
> The main problem there is not the extra 20ms, it's the original 100ms, which
> we won't lose unless we make this change somehow.
>
>
>> >   - then short RTT flows can smooth the signals with only the delay
>> > of their /own/ RTT
>> >     o so they can fill troughs and absorb peaks that longer RTT flows
>> > cannot
>> >   - a TCP only needs to smooth the signals if in congestion avoidance
>> >     o in slow start, it can respond immediately, thus reducing overshoot
>>
>>    This would, IMHO, improve "slow start".
>>
>> > Incremental Deployment:
>> > * Immediate congestion notification doesn't need new AQM implementation
>> >   - it can use the widely implemented WRED algorithm with an
>> > unexpected configuration
>>
>>    Bob is beginning to lose me here. Does he mean that a forwarding node
>> would apply WRED for both drop and ECN, but with different parameters?
>>
>> > * The network classifies packets for this AQM treatment based on
>> > their ECN-capability
>> >   - Without ECN, it smoothes the queue before signalling drops
>>
>>    Bob has lost me now -- apparently he doesn't mean different
>> parameters... and I don't recognize this "smoothing" step in WRED.
>
>
> I do mean that a forwarding node would apply WRED for both drop and ECN, but
> with different parameters.
>
> Each WRED policy-map includes a setting for this smoothing parameter, which
> Cisco calls the exponential-weighting-constant. Many people don't notice
> it's there and they just leave it at the default. For instance, Cisco set it
> to
> 2^(-9) ~ 0.002 by default for each of the WRED policy-maps (see
> http://www.cisco.com/en/US/docs/ios/12_0s/feature/guide/fswfq26.html#wp1039982).
>
>
>> >   - With ECN, it signals immediately, without any smoothing delay
>> >   - (as today, the operator can still use WRED with the Diffserv field
>> > too)
>>
>>    (Do we need to confuse this discussion by adding diffserv?)
>
>
> A non-Diffserv network still doesn't need to worry about Diffserv.
>
> I put this in parentheses because, if WRED is used today, it is usually used
> with Diffserv, and I didn't want anyone to worry that they wouldn't be able
> to continue to do this (e.g. BT use WRED with Diffserv in enterprise
> networks, as do many other carriers).
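
To make the 'different parameters' point above concrete, here's the kind of thing I mean (an illustrative sketch with invented thresholds, not vendor syntax):

import random

MIN_TH, MAX_TH, MAX_P = 5, 15, 0.1   # example RED thresholds (packets)
W_DROP = 0.002                       # the usual weighting constant (~2^-9)

avg_q = 0.0                          # smoothed queue, for drop only

def red_prob(q):
    # The usual RED ramp between the min and max thresholds.
    if q < MIN_TH:
        return 0.0
    if q >= MAX_TH:
        return 1.0
    return MAX_P * (q - MIN_TH) / (MAX_TH - MIN_TH)

def on_packet(qlen, ect):
    global avg_q
    avg_q = (1 - W_DROP) * avg_q + W_DROP * qlen
    if ect:
        # ECN-capable: decide on the instantaneous queue, so the CE
        # signal carries no network smoothing delay at all.
        return 'mark' if random.random() < red_prob(qlen) else 'keep'
    # Not ECN-capable: drop is an impairment, so decide on the smoothed
    # average queue as WRED does today.
    return 'drop' if random.random() < red_prob(avg_q) else 'keep'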

Perhaps we need a taxonomy so we can all talk to areas of the network
we care about? RED as currently defined will not work correctly on
variable rate networks, like wireless...

I have never thought we'd end up with one aqm or tcp to rule them all..

DC: Data Center
WI: Wireless
FL: Fixed Line

Again, this immediate ECN stuff is for the public Internet (not just data centres).


>
>> > * For TCP apps, the stack will use 'DCTCP' (we've tweaked it), if the
>> > ends negotiate ECN with the accurate feedback capability.
>>
>>    Have we settled on "accurate feedback" already? I thought that was
>> still under discussion. (I don't follow exactly what it adds...)
>
>
> See response from Richard Scheffenegger. Essentially the TCPM WG has
> accepted the requirements doc, but not decided between the mechanisms on
> offer.
>
>
>> > * It should 'just work' if an RTP app or a Reno TCP uses ECN.
>>
>>    I don't see any way for a Reno transport using ECN to avoid being
>> starved if ECN arrives earlier (without notice).
>
>
> We haven't tested legacy Reno with ECN yet (we figured legacy Reno without
> ECN is a lot more prevalent, so focused on this first). Nonetheless,
> Reno-ECN is unlikely to starve, because starvation is about long-running
> behaviour, and once a flow has run for more than a couple of 100ms RTTs, the
> immediate ECN signals should be no different from a smoothed ECN. I suspect
> Reno-ECN might be worse in its short-term dynamics. But remember Reno-ECN is
> likely to be a tiny corner-case.
>
>
>> > The request:
>> > * Much more evaluation to do, but first we want to know:
>> >   - if the idea works, would the IETF have an appetite for tweaking
>> > the definition of ECN so it is merely equivalent to drop in the long
>> > term, but the dynamics need not be equivalent.
>>
>>    There's a good question there; but I don't think we're ready for it.
>
>
> At this stage, even we haven't got many answers. So I'm not asking the IETF
> to answer the question right now. I'm merely saying, /if/ our idea works, is there at least an /appetite/ in the IETF for reconsidering the definition of
> ECN?

I wouldn't mind it, but frankly my own concern was addressing the
security issue, not the current definition.

It seems strange that you weren't aware of solutions to this then.


> We wanted to make the IETF aware of this research early, because it might
> want to at least hold off on any actions that would otherwise close off this
> option.


>
> And if we find that any change is completely out of the question, we have to
> try a different tack (e.g. ECT(1)).
>
>
>>    I'd really like to discuss the dynamics of responding more quickly
>> but perhaps less drastically for almost any real-time flow.
>>
>>    But proving "equivalence in the long term" seems too hard.
>
>
> This should be the easy part, because the longer that conditions are stable,
> a smoothed signal should tend towards an unsmoothed signal, all other
> factors being equal.
>
> Equivalence during dynamics is the hard part, and I'm suggesting we don't
> sweat too much about that, as long as the performance evaluations are not
> too far apart.
>
>
>> > Much better than the ECN that didn't get deployed
>> > * This is Explicit and Immediate Congestion Notification (EICN?)
>> >   - same wire protocol, much greater benefits
>> > * The advantage of the original ECN (avoiding congestive loss) was
>> > too small to be worth the deployment hassle
>>
>>    Actually, I don't agree that was the problem -- instead I believe
>> the code has been deployed but administratively suppressed because
>> the operators don't trust the transports. There _is_ a significant
>> improvement from one-RTT reaction instead of several (to detect a
>> drop), but the whole process is just too complicated, while the
>> opportunity for abuse remains obvious.
>
>
> I agree. That's the 'deployment hassle' side of my sentence - the extra
> trust-enhancing mechanisms that seemed necessary were too much pain for the
> small gain.


>
>
>> > * Predictable ultra-low latency without loss too (similar to
>> > DCTCP-ECN) would be worth deploying
>>
>>    I'm optimistic that latency will become an easier argument.
>>
>> > * But we all thought DCTCP could only be deployed in isolation (e.g.
>> > data centres)
>> >   - we all thought DCTCP traffic would starve alongside today's TCP
>> > traffic
>> >   - because in a DCTCP queue, the ECN threshold is lower than you
>> > would trigger drop
>> >   - and we thought ECN & drop had to be equivalent.

I'm looking forward to being convinced otherwise.

Yup, I don't want to take up too much of anyone's time with this until we've proved it. But we decided a heads up was important. That's all.


Bob


>>    (I'm not sure we'll succeed at breaking that "equivalence"...)
>>
>> > * We believe we've found a way to ensure DCTCP-ECN traffic doesn't
>> > starve
>> >   - we still make DCTCP-ECN equivalent to drop in the long-run, but
>> > not in its dynamics
>>
>>    (I'm still not sure it's worth arguing the "long-run".)
>
>
> I mean competing long-running ECN & non-ECN flows stabilise at predictable
> rates, rather than one ratchetting itself down to nothing over time
> (starvation).
>
> That's the primary concern of congestion control 'fairness', before anyone
> starts worrying about what the relative rates are. Given apps get different
> relative rates with different RTTs, with different size objects or by
> opening multiple flows, we don't need to sweat so much about precisely equal
> flow rates; but we must sweat about stable convergence.
>
> Results so far show that the proposed idea is at least very robust against
> starvation.
>
>
> Bob
>
>
>> --
>> John Leslie <[email protected]>
>
>
> ________________________________________________________________
> Bob Briscoe,                                                  BT
> _______________________________________________
> aqm mailing list
> [email protected]
> https://www.ietf.org/mailman/listinfo/aqm



--
Dave Täht

Fixing bufferbloat with cerowrt: http://www.teklibre.com/cerowrt/subscribe.html

________________________________________________________________
Bob Briscoe, BT
_______________________________________________
aqm mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/aqm
