Re: [aqm] think once to mark, think twice to drop: draft-ietf-aqm-ecn-benefits-02

Bob Briscoe Mon, 13 Apr 2015 08:12:08 -0700

David,

Returning from a fortnight offlist...

I think your conception of how ECN works is incorrect. You describe ECN as if the AQM marks one packet when it drops another packet. You say that the ECN-mark speeds up the retransmission of the dropped packet. On the contrary, the idea of classic ECN [RFC3168] is that the ECN marks replace the drops. In all known testing (except pathological cases), classic ECN effectively eliminates drops for all ECN-capable packets.
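
To make the distinction concrete, here is a minimal sketch of the per-packet decision in a classic ECN-capable AQM. The function name, the probability argument p and the injectable rng are illustrative assumptions, not taken from any particular AQM implementation. It only shows that a mark is sent instead of a drop, never alongside one.

```python
import random


def signal_congestion(packet_is_ect: bool, p: float, rng=random.random) -> str:
    """Decide the fate of one arriving packet (hypothetical helper).

    packet_is_ect: True if the packet declares an ECN-capable transport.
    p: probability with which the AQM currently wants to signal congestion.
    rng: callable returning a float in [0, 1); injectable for testing.
    """
    if rng() >= p:
        # No congestion signal needed for this packet.
        return "forward"
    # Classic ECN: the mark takes the place of the drop.
    return "mark" if packet_is_ect else "drop"
```

A real queue would still drop any packet, ECN-capable or not, once the buffer itself overflows.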

Nonetheless, I do agree with your sentiment that the perfect is the enemy of the good. We can remove most of the really bloat-induced latency without ECN. So the message must be clear: Deploy AQM now. No need to wait for ECN. But implementations SHOULD allow ECN packets to be classified into a separately configurable instance of the AQM algo. {Note 1}

Similarly, this WG made sure we did not deprecate RED in the AQM recommendations, because, in existing equipment, even a poorly tuned RED is usually much better than a bloated buffer with no AQM.

Just as the WG mustn't confuse messages, you mustn't get confused by a discussion about the potential for more ambitious reductions in latency. Koen started the thread with reference to our presentation in the ICCRG in the IRTF, where 'R' in both cases stands for Research.

And I believe it is valid for the ECN benefits draft (in the IETF AQM WG) to point to the potential of ECN, by using insights from research in progress.

There are a number of different contributions to unnecessary latency, not just the one popularised as bufferbloat:

[In the following, I will use the term 'short message(s)' as shorthand for either a short interactive flow or a long flow consisting of short interactive messages (like in a game), or video frames or voice datagrams or anything where the perceived latency depends on the latency of each 'message', especially a string of messages with serial dependency as is typical in Web.]

#1 The well-known bufferbloat problem, where a long-running flow fills a bloated buffer and delays short messages.

  - AQM and/or flow queuing can remove this delay, without needing ECN.

#2 A loss causes head-of-line blocking for short message(s) while waiting for the retransmission.

  - AQM without ECN cannot remove this delay.
  - Flow queuing cannot remove this delay, if the losses are self-induced.
  - FEC can remove this delay without needing ECN, but the increased redundancy is equivalent to poorer utilisation (although only the short flows need the redundancy).
  - ECN can remove this delay.

#3 A loss near the end of a short flow can lead to multi-RTT delay.

  - AQM without ECN cannot remove this delay.
  - Techniques like tail-loss probe and RTO-restart can mitigate this delay, without ECN, but not remove it.
  - ECN can remove this delay.

#4 The Reno/Cubic sawtooth causes variation in delay between 1 and 2 base RTTs. This can affect short messages.

  - AQM without ECN cannot remove this delay unless configured to sacrifice utilisation.
  - Flow queuing removes this delay variation if caused by a separate long-running flow. {Note 2}
  - A change to the TCP algo (e.g. DCTCP) can remove this delay variation. Smaller sawteeth imply a much higher signalling rate, which in turn requires ECN, otherwise drop probability would be excessive. (This was the main point Koen was making.)
  - Therefore, ECN can remove this delay.
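
The "1 to 2 base RTTs" figure in #4 can be checked with a little arithmetic, under assumptions that are mine rather than from the thread: a single Reno-like flow, a buffer sized at one bandwidth-delay product (BDP), and a window that halves on each congestion signal. The function and parameter names are hypothetical.

```python
def rtt_range(base_rtt_ms: float, bdp_pkts: float, buffer_pkts: float,
              backoff: float = 0.5) -> tuple:
    """Return (min_rtt_ms, max_rtt_ms) over one sawtooth cycle.

    Assumes the window peaks when pipe and buffer are both full,
    then is multiplied by `backoff` (0.5 for a Reno-style halving).
    """
    w_max = bdp_pkts + buffer_pkts   # window at the moment of the signal
    w_min = w_max * backoff          # window just after the reduction

    def rtt(window: float) -> float:
        # Packets beyond one BDP sit in the queue and add delay.
        queue = max(0.0, window - bdp_pkts)
        return base_rtt_ms * (1 + queue / bdp_pkts)

    return rtt(w_min), rtt(w_max)


# 20 ms base RTT, 100-packet BDP, 100-packet buffer
print(rtt_range(20, 100, 100))  # (20.0, 40.0)
```

With these numbers the queue swings between empty and one full BDP, so the RTT seen by any short message sharing the queue swings between 1 and 2 base RTTs.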

#5 Slow-starts{Note 3} cause spikes in delay.

  - AQM without ECN cannot remove this delay, and typically AQM is designed to allow such bursts of delay in the hope they will disappear of their own accord.
  - Flow queuing can remove the effect of these delay bursts on other flows, but only if it gives all flows a separate queue from the start. {Note 2}
  - Delay-based softening of slow-start, such as Hybrid Slow-Start in Linux, can mitigate these variations, but with increased risk of coming out of SS early, causing significantly longer completion time.
  - ECN with AQM based on the instantaneous queue limits this delay, without the risk of longer completion time.

#6 Slow-starts{Note 3} can cause runs of losses, which in turn cause delays.

  - AQM without ECN cannot remove these delays.
  - Flow queuing cannot remove these losses, if self-induced.
  - Delay-based SS like HSS can mitigate these losses, with increased risk of longer completion time.
  - ECN can remove these losses, and the consequent delays.

Summary:
* AQM alone solves the main problem.
* Flow queuing solves or mitigates most of the remaining secondary problems.
* ECN has the potential to solve all the remaining secondary problems (pending further research to prove some of them).

Whether flow queuing is applicable depends on the scale. The work I'm doing with Koen is to reduce the cost of the queuing mechanisms on our BNGs (broadband network gateways). We're trying to reduce the cost of per-customer queuing at scale, so per-flow queuing is simply out of the question, whereas ECN requires no more processing than drop.

ECN has potential cheating problems, but we have per-customer queues anyway.

Using flow as the unit of allocation also has its own problems, with no proposed solutions.

Bob

{Note 1}: Your general point is that the perfect can be the enemy of the good. Here's the presentation I gave in TSVAREA straight after VJ presented CoDel in 2012 entitled "DCTCP & CoDel; the Best is the Friend of the Good."

<http://www.bobbriscoe.net/presents/1207ietf/1207-tsvarea-dctcp.pdf>

{Note 2}: A lone flow can cause this delay variation to itself, but that's irrelevant because, if the delay were not in the network it would be at the sender.

{Note 3}: Delay and loss spikes can equivalently be caused when Cubic's window rises to seek out newly available capacity after another flow finishes or the link rate varies.


At 05:16 30/03/2015, David Lang wrote:

On Sat, 28 Mar 2015, Scheffenegger, Richard wrote:
David,
Perhaps you would care to provide some text to address the misconception that you pointed out? (To wait for a 100% fix as a 90% fix appears much less appealing, while the current state of art is at 0%)
Ok, you put me on the spot :-) Here goes.
If you think that aqm-recommendations is not strongly enough worded, I think this particular discussion (to aqm or not) really belongs there. The other document (ecn benefits) has a different target in arguing for going those last 10%...
so here is my "elevator pitch" on the problem. Feel free to take anything I say here for any purpose, and I'm sure I'll get corrected for anything I am wrong on
Problem statement: Transmit buffers are needed to keep the network layer fully utilized, but excessive buffers result in poor latency for all traffic. This latency is frequently bad enough to cause some types of traffic to fail entirely.
<link to more background goes here, including how separate benchmarks for throughput and latency have misled people, "packet loss considered evil", cheaper memory encouraging larger buffers, etc. Include tests like netperf-wrapper and ping latency while under load, etc. Include examples where buffers have resulted in latencies so long that packets are retransmitted before the first copy gets to the destination>
Traditionally, transmit buffers have been sized to handle a fixed number of packets. Due to the variation in packet sizes, it is impossible to tune this value so that it both keeps the link fully utilized when small packets dominate the traffic and keeps the queue small enough to avoid latency problems when large packets dominate the traffic.
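
A rough worked example of why a packet-count limit cannot be tuned: the time to drain a full queue depends on packet size. The queue length, packet sizes and link rate below are made-up numbers for illustration, and the helper name is hypothetical.

```python
def drain_time_ms(n_packets: int, packet_bytes: int, link_bps: int) -> float:
    """Time in milliseconds to transmit a full queue of equal-sized packets."""
    bits = n_packets * packet_bytes * 8
    return bits * 1000 / link_bps


LINK_BPS = 10_000_000   # assumed 10 Mb/s link
QUEUE_PKTS = 1000       # assumed fixed packet-count limit

print(drain_time_ms(QUEUE_PKTS, 64, LINK_BPS))    # 51.2   (small packets)
print(drain_time_ms(QUEUE_PKTS, 1500, LINK_BPS))  # 1200.0 (large packets)
```

The same 1000-packet limit gives roughly 50 ms of worst-case queuing with 64-byte packets but over a second with 1500-byte packets, which is the mismatch described above.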
Shifting to Byte Queue Lengths where queues are allowed to hold a variable number of packets depending on how large they are makes it possible to manually tune the transmit buffer size to get good latency under all traffic conditions at a given speed. However, this step forward revealed two additional problems.
1. whenever the data rate changes, this value needs to be manually changed (multi-link paths lose a link, noise degrades max throughput on a link, etc)
2. high volume flows (e.g. bulk downloads) can starve other flows (DNS lookups, VoIP, Gaming, etc). This happens because space in the queue is on a first-come-first-served basis, so the high-volume traffic fills the queue (at which point it starts to be dropped), but all other traffic that tries to arrive is also dropped. It turns out that these light flows tend to have a larger effect on the user experience than heavier flows, because things tend to be serialized behind the lighter flows (DNS lookup before doing a large download, retrieving a small HTML page to find what additional resources need to be fetched to display a page), or the user experience is directly affected by light flows (gaming lag, VoIP drops, etc)
Active Queue Management addresses these problems by adapting the amount of data that is buffered to match the data transmission capacity, and prevents high volume flows from starving low-volume flows without the need to implement QoS classifications.
<insert link about how you can't trust QoS tags that are made by other organizations, ways that it can be abused, etc>
This is possible because AQM algorithms don't have to drop the new packet that arrives; the algorithm can decide to drop the packet for one of the heavy flows rather than for one of the lightweight flows.
<insert references to currently favored AQM options here, PIE, fq_codel, cake, ???. Also links to failed approaches>
Turning on aqm on every bottleneck link makes the Internet usable for everyone, no matter what sort of application they are using.
<insert link on how to deal with equipment you can't configure by throttling bandwidth before the bottleneck and/or doing ingress shaping of traffic>
While AQM makes the network usable, there is still additional room for improvement. While dropping packets does result in the TCP senders slowing down, and eventually stabilizing at around the right speed to keep the link fully utilized, the only way that senders have been able to detect problems is to discover that they have not received an ack for the traffic within the allowed time. This causes a 'bubble' in the flow as the dropped packet must be retransmitted (and sometimes a significant amount of data after the dropped packet that did make it to the destination, but could not be acked because of the missing packet).
This "bubble" in the data flow can be greatly compressed by configuring the AQM algorithm to send an ECN packet to the sender when it drops a packet in a flow. The sender can then adapt faster, slowing down its new data, and re-sending the dropped packet without having to wait for the timeout. This has two major effects: by allowing the sender to retransmit the packet sooner the delay on the dropped data is not as long, and because the replacement data can arrive before the timeout of the following packets, they may not need to be re-sent. By configuring the AQM algorithm to send the ECN notification to the sender only when the packet is being dropped, the effect of failure of the ECN packet to get through to the sender (the notification packet runs into congestion and gets dropped, some network device blocks it, etc) is that the ECN enabled case devolves to match the non-ECN case in that the sender will still detect the dropped packet via the timeout waiting for the ack as if ECN was not enabled.
<insert link to possible problems that can happen here, including the potential for an app to 'game' things if packets are marked at a different level than when they are dropped.>
So a very strong recommendation to enable Active Queue Management: while the different algorithms have different advantages and levels of testing, even the 'worst' of the set results in a night-and-day improvement for usability compared to unmanaged buffers.
Enabling ECN at the same point as dropping packets as part of enabling any AQM algorithm results in a noticeable improvement over the base algorithm without ECN. When compared to the baseline, the improvement added by ECN is tiny compared to the improvement from enabling AQM.
Is it fair to say that plain aqm vs aqm+ecn variation is on the same order of difference as the differences between the different AQM algorithms?
Future research items (which others here may already have done, and would not be part of my 'elevator pitch')
I believe that currently ECN triggers the exact same slowdown that a missed packet does, and it may be appropriate to have the sender do a less drastic slowdown.
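
As a hedged illustration of what a "less drastic slowdown" could look like, the sketch below contrasts the classic response (treat a mark like a loss and halve the window) with the proportional reduction used by DCTCP, which is mentioned elsewhere in this thread. The function names are mine and the code is a simplification of the published DCTCP update rules, not an implementation of any real TCP stack.

```python
def classic_ecn_response(cwnd: float) -> float:
    """Classic ECN: react to a mark exactly as to a loss (halve the window)."""
    return cwnd / 2


def dctcp_alpha(alpha: float, marked_fraction: float, g: float = 1 / 16) -> float:
    """Update the running estimate of the fraction of marked packets.

    marked_fraction: share of packets marked in the last window of data.
    g: smoothing gain (1/16 is the commonly quoted default).
    """
    return (1 - g) * alpha + g * marked_fraction


def dctcp_response(cwnd: float, alpha: float) -> float:
    """DCTCP: reduce the window in proportion to the extent of marking."""
    return cwnd * (1 - alpha / 2)


print(classic_ecn_response(100))   # 50.0
print(dctcp_response(100, 1.0))    # 50.0, every packet marked
```

When only a small fraction of packets is marked, alpha is small and the window shrinks only slightly; when every packet is marked the two responses coincide.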
It would be very interesting to provide some way for the application sending the traffic to detect dropped packets and ECN responses. For example, a streaming media source (especially an interactive one like video conferencing) could adjust the bitrate that it's sending.
David Lang

_______________________________________________
aqm mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/aqm


________________________________________________________________

Bob Briscoe, BT
