On Wed, Dec 11, 2013 at 1:11 PM, Ilpo Järvinen <[email protected]> wrote:
> On Wed, 11 Dec 2013, Dave Taht wrote:
>
>> On Wed, Dec 11, 2013 at 11:21 AM, Bob Briscoe <[email protected]> wrote:
>> > Jim,
>> >
>> > At 16:55 11/12/2013, Jim Gettys wrote:
>> > On Tue, Dec 10, 2013 at 10:04 PM, Bob Briscoe <[email protected]> wrote:
>> > Jim,
>> >
>> > I'm just checking we're not talking past each other. I'll repeat two
>> > quotes from each of us, then comment.
>> >
>> > On Thu, Dec 5, 2013 at 1:13 PM, Bob Briscoe <[email protected]> wrote:
>> >
>> > 3{New}. It SHOULD be possible to make different instances of an AQM
>> > algorithm apply to different subsets of packets that share the same
>> > queue. It SHOULD be possible to classify packets into these subsets at
>> > least by ECN codepoint [RFC3168] and Diffserv codepoint [RFC2474] (or
>> > the equivalent of these fields at lower layers),
>> >
>> > At 19:50 05/12/2013, Jim Gettys wrote:
>> >
>> > "Certainly, it may be the same instance of an AQM algorithm, rather
>> > than different instances, for example."
>> >
>> > That's true of course, but the case with one AQM handling all packets
>> > within a queue is the norm. I want to check you're happy with the
>> > converse:
>> > 1) A set-up more like WRED, which was based on Dave Clark's RIO (RED
>> > with in and out of contract). So we can have WPIE, WCoDel, etc., where
>> > the differentiation between aggregates is provided by different AQM
>> > instances in the same queue, not by different queues with different
>> > scheduling priorities.
>> > 2) Extending this so that AQM differentiation can be between
>> > ECN-capable and Not-ECN-capable aggregates, not just between Diffserv
>> > classes (an example being CoDel with a lower 'interval' for ECN-capable
>> > packets).
>> >
>> > I presented the evaluations of this last idea in tsvwg on the final
>> > Friday of the Vancouver IETF - I don't think you were there.
>> > <http://www.ietf.org/proceedings/88/slides/slides-88-tsvwg-20.pdf>
>> >
>> > Yes, unfortunately I had to leave before the Friday session.
>> >
>> > This is my primary motivation for this wordsmithing - I'm trying to
>> > allow us to move towards zero signalling delays in CoDel, PIE and RED
>> > (currently defaults of 200ms, 100ms and 512 packets respectively, which
>> > are not good for dynamics).
>> >
>> > Certainly signalling delays are very important: this is why I'm
>> > favorably inclined to "head mark/drop", as it signals TCP as quickly as
>> > possible, keeping the response of the TCP feedback loop as tight as
>> > possible (and part of why I like CoDel so much for the highly variable
>> > bandwidth problem we face at the edge of the net).
>> >
>> > It's *really* important that when the bandwidth drops suddenly,
>> > everyone gets told to slow down quickly (exactly how quickly probably
>> > depends on the propagation change characteristics of the medium), or
>> > packets can pile up in a big way.
>> >
>> > How quickly the mark/drop algorithm can figure out that signalling is
>> > appropriate is the *other* piece of getting good dynamics. Here I don't
>> > doubt in the slightest that something better than CoDel may be
>> > discovered.
>> >
>> > It takes a CoDel instance (within an fq structure) 200ms from its queue
>> > first passing 'threshold' before it will ever drop the first packet
>> > (unless the queue hits taildrop before that). So if the RTT is 20ms,
>> > that's 220ms signalling delay. In fq_codel this creates considerable
>> > self-delay for short flows or r-t apps, which kill their own latency
>> > before they get any loss signal to tell them to slow down. Even for
>> > elastic flows, with congestion signals delayed by so much, they risk
>> > hitting themselves with a huge train of overshoot loss. This would be
>> > the same for fq_pie, except the number is 100ms + RTT.
>>
>> People have expressed things this way so consistently that I began to
>> doubt the reality myself. It seems like a large number of folk on this
>> list don't get it either, so I am going to try to explain it in a new way.
>>
>> Tackling codel first:
>>
>> The first phase of codel effectively has a "training" period, where on a
>> link going from unloaded to loaded for the first time ever, the very
>> first drop with the default interval will happen in 200ms, yes. IF it
>> stays loaded and over the target delay after the first drop/mark, it will
>> then tune to ever smaller intervals to approximate an ideal drop rate,
>> until the latency on the link drops below the target. At that point the
>> algorithm saves that rate, and stops doing anything until the next time
>> the target delay is exceeded.
>>
>> Some keep asserting that that is all there is to codel, saying things
>> like "there is a linear increase in drop probability" using the invsqrt
>> mechanism,
>>
>> *which is true during the training phase*.
>>
>> After that approximation of the ideal drop/mark rate is obtained, the
>> algorithm goes quiescent until the next time the target delay is
>> consistently exceeded, at which point it schedules the next drop at a
>> little more than the stored previous drop rate. It then continually seeks
>> around that point, up and down.
>>
>> If the delay drops below the target in this phase, the algorithm stops
>> again and decreases the drop rate, as it's too high. If the delay stays
>> above target after the drop for the current value of the interval, the
>> drop rate increases.
>>
>> This is an interesting solution to Kleinrock's formulation of "power":
>> where he once said an average of one packet should be in the queue, codel
>> aims to never have less than one packet in the queue.
>>
>> And the switch into and out of drop mode on going above target is
>> entirely dependent on the characteristics over time of the flows on the
>> system, completely nonlinear, and that is where codel spends 99.999% of
>> its time on a loaded link.
>>
>> As Debussy said: "Music is the *space* between the notes".
>>
>> I wish I had a name for this second "seeking" phase that makes as much
>> sense as "congestion avoidance".
>>
>> So asserting that you'll always have a 200ms interval on a codel'd link
>> is just blatantly incorrect. On first boot, yes. On a busy network,
>> never again.[1]
>
> No no. The queue empties after CoDel overshoots the marking probability
> and then CoDel stops and starts from scratch.
No it doesn't, usually. When the queue length drops below 5ms, codel stops
dropping packets, and the queue does not drain to zero.

There have been several variants of the control law so far, with varying
degrees of success across different ranges of bandwidths.
http://www.pollere.net/CoDel.html documents some of them. If you would like
to back up your assertion with data taken against the linux codel variant,
the more advanced ns2_codel, or fq_codel, please do so. Patches are
available for linux here for the more advanced stuff, and please be aware
of:

http://www.bufferbloat.net/projects/codel/wiki/Best_practices_for_benchmarking_Codel_and_FQ_Codel

One of the biggest problems we've seen with overshoot has been dealing with
TSO, GSO and GRO offloads, which were not modeled in ns2 - and as of linux
3.12 much of the TSO and GSO problem was solved (these problems don't
happen in the real world of routers in the first place, only on testbeds
not emulating the real world).

Another problem we've seen is that scheduling latency in virtualized
environments can exceed the target. Don't do that.

Hmm… what else. The original paper contained an error that was corrected
shortly after it hit dead tree land. The as-deployed-in-linux variant has a
tendency to overshoot in some cases, and has issues at really high numbers
of flows, but in the fq_codel universe it is usually fine.

One problem we know exists in the real world is that the 5ms target is
unachievable at very low (<4mbit) bandwidths, in which case we have been
increasing the target delay for that rate, and there is also a patch under
test that eliminates the maxpacket check, which helps in that case as well.
The largest known fq_codel deployment (free.fr) uses a simple formula to
set the target at those rates. I don't think it's correct, but the packet
scheduling portion hides the issues here.
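Since "control law" keeps getting misread in RED terms, here is a rough
python sketch of the two phases I described above. This is my own
simplified model, NOT the shipping code (see net/sched/sch_codel.c for
that); the names are mine, and details like the count recall vary between
the variants:

```python
import math

class CoDelSketch:
    """Simplified two-phase CoDel control law. A sketch, not sch_codel.c."""

    def __init__(self, target=0.005, interval=0.100):
        self.target = target          # 5ms default
        self.interval = interval      # 100ms default in linux
        self.dropping = False         # are we in the drop/"seeking" phase?
        self.count = 0                # drops scheduled this cycle
        self.first_above_time = 0.0   # when sojourn first exceeded target
        self.drop_next = 0.0          # time of the next scheduled drop

    def control_law(self, now):
        # Drops spaced by interval/sqrt(count): as count climbs during the
        # "training" phase, the effective drop rate rises.
        return now + self.interval / math.sqrt(self.count)

    def should_drop(self, sojourn, now):
        if sojourn < self.target:
            # Queue under control: go quiescent, but keep count so the
            # learned rate is recalled next time (the hysteresis).
            self.first_above_time = 0.0
            self.dropping = False
            return False
        if self.first_above_time == 0.0:
            # Sojourn must stay above target a full interval before acting.
            self.first_above_time = now + self.interval
            return False
        if not self.dropping:
            if now >= self.first_above_time:
                self.dropping = True
                # Recall the stored rate, backed off a little ("-2" in
                # some variants), rather than retraining from scratch.
                self.count = max(self.count - 2, 1)
                self.drop_next = self.control_law(now)
            return False
        if now >= self.drop_next:
            self.count += 1
            self.drop_next = self.control_law(now)
            return True
        return False
```

Feed it the per-packet sojourn time at dequeue: on a link that stays loaded
the gaps between drops shrink toward the ideal rate, and the moment sojourn
falls under target it stops dropping entirely - it does not drain the queue
to zero.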
No, codel did not fully achieve the parameterless goal, and in its initial
release it also was not targeted at data center environments, where
different rules of physics apply. Development continues, but has mostly
been focused on the *fq_codel variants, which are superior in every
benchmark we've tried against all other comers.

> And yes, CoDel will always
> overshoot for sure because it _controls_ until the queue is in its
> control, i.e., below the threshold. How big the overshoot is, of course,
> depends.

I don't understand what you mean by overshoot; it will reduce the delay to
the 5ms target. It can certainly reduce it well below 5ms, but inducing
only that much jitter while still keeping utilization high is a goodness.

I don't have a problem if people shoot for a higher target when using codel
alone - pie as submitted to the lkml has a target of 20ms and a very large
estimation window that I have not yet tested at lower bandwidths. And
nobody uses codel by itself, although given pie's problems I am thinking of
taking a harder thwack at making codel itself better, abstracting out the
hysteresis variable for example.

>
> What this means in terms of TCP:
>
> The network/or your fq-queue (in the case of fq_codel, if you so want)
> won't be all that busy according to CoDel once CoDel kindly "coddled" the
> TCPs; that's the whole point of CoDel, I'm told :-). This is because the
> queue happens to be "under control" only if TCP backs off to below
> 5ms + RTT level of utilization. ...Now remember the effect of beta here.
> If the network remains more than "5ms busy", CoDel thinks that the queue
> is not in control and keeps shooting again and again until eventually the
> network is no longer "busy". The worst case beta * (5ms + RTT) is quite a
> small utilization and it takes time for a TCP to recover network busyness.

You are trying to explain something to me in RED terms that doesn't happen
in the real world.
Packets get smoothed into the "RTT", which serves as in-flight storage.

Go take a measurement, please. Supplying a packet capture would help
explain whatever you are seeing. Tell us what kernel you are using, and
follow the guidelines.

>
>> In the event of a link going completely idle, and staying
>> idle, there is hysteresis built into the code so it will retain that
>> drop rate for a few hundred milliseconds (it's 8*interval in some
>> versions of the code, 4 in others), before resetting count to 1 and the
>> resulting estimation window to interval.
>
> True, however, it only means that you'll overshoot more, or the time was
> too short to retain the trained count in memory (and in that case CoDel
> forgets it, like you admit). Or do you think that the magic number
> applied to count on the recall (was it "-2"?) works for all traffic?

You are not looking at the current as-shipped code.

> --
> i.

--
Dave Täht

Fixing bufferbloat with cerowrt:
http://www.teklibre.com/cerowrt/subscribe.html

_______________________________________________
aqm mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/aqm
