On Wed, Dec 11, 2013 at 1:11 PM, Ilpo Järvinen <[email protected]> wrote:
> On Wed, 11 Dec 2013, Dave Taht wrote:
>
>> On Wed, Dec 11, 2013 at 11:21 AM, Bob Briscoe <[email protected]> wrote:
>> > Jim,
>> >
>> > At 16:55 11/12/2013, Jim Gettys wrote:
>> > On Tue, Dec 10, 2013 at 10:04 PM, Bob Briscoe <[email protected]> wrote:
>> > Jim,
>> >
>> > I'm just checking we're not talking past each other. I'll repeat two
>> > quotes from each of us, then comment.
>> >
>> > On Thu, Dec 5, 2013 at 1:13 PM, Bob Briscoe <[email protected]> wrote:
>> >
>> > 3{New}. It SHOULD be possible to make different instances of an AQM
>> > algorithm apply to different subsets of packets that share the same
>> > queue. It SHOULD be possible to classify packets into these subsets at
>> > least by ECN codepoint [RFC3168] and Diffserv codepoint [RFC2474] (or
>> > the equivalent of these fields at lower layers),
>> >
>> > At 19:50 05/12/2013, Jim Gettys wrote:
>> >
>> > "Certainly, it may be the same instance of an AQM algorithm, rather
>> > than different instances, for example."
>> >
>> > That's true of course, but the case with one AQM handling all packets
>> > within a queue is the norm. I want to check you're happy with the
>> > converse:
>> > 1) A set-up more like WRED, which was based on Dave Clark's RIO (RED
>> > with in and out of contract). So we can have WPIE, WCoDel, etc., where
>> > the differentiation between aggregates is provided by different AQM
>> > instances in the same queue, not by different queues with different
>> > scheduling priorities.
>> > 2) Extending this so that AQM differentiation can be between
>> > ECN-capable and Not-ECN-capable aggregates, not just between Diffserv
>> > classes (an example being CoDel with a lower 'interval' for ECN-capable
>> > packets).
>> >
>> > I presented the evaluations of this last idea in tsvwg on the final
>> > Friday of the Vancouver IETF - I don't think you were there.
>> > <http://www.ietf.org/proceedings/88/slides/slides-88-tsvwg-20.pdf>
>> >
>> > Yes, unfortunately I had to leave before the Friday session.
>> >
>> > This is my primary motivation for this wordsmithing - I'm trying to
>> > allow us to move towards zero signalling delays in CoDel, PIE and RED
>> > (currently defaults of 200ms, 100ms and 512 packets respectively, which
>> > are not good for dynamics).
>> >
>> > Certainly signalling delays are very important: this is why I'm
>> > favorably inclined to "head mark/drop", as it signals TCP as quickly as
>> > possible, keeping the response of the TCP feedback loop as tight as
>> > possible (and part of why I like CoDel so much for the highly variable
>> > bandwidth problem we face at the edge of the net).
>> >
>> > It's *really* important that when the bandwidth drops suddenly,
>> > everyone gets told to slow down quickly (exactly how quickly probably
>> > depends on the propagation change characteristics of the medium), or
>> > packets can pile up in a big way.
>> >
>> > How quickly the mark/drop algorithm can figure out that signalling is
>> > appropriate is the *other* piece of getting good dynamics. Here I don't
>> > doubt in the slightest that something better than CoDel may be
>> > discovered.
>> >
>> > It takes a CoDel instance (within an fq structure) 200ms from its queue
>> > first passing 'threshold' before it will ever drop the first packet
>> > (unless the queue hits taildrop before that). So if the RTT is 20ms,
>> > that's 220ms signalling delay. In fq_codel this creates considerable
>> > self-delay for short flows or r-t apps, which kill their own latency
>> > before they get any loss signal to tell them to slow down. Even for
>> > elastic flows, with congestion signals delayed by so much, they risk
>> > hitting themselves with a huge train of overshoot loss. This would be
>> > the same for fq_pie, except the number is 100ms + RTT.
>>
>> People have expressed things this way so consistently that I began to
>> doubt the reality myself. It seems like a large number of folk on this
>> list don't get it either, so I am going to try to explain it in a new way.
>>
>> Tackling codel first:
>>
>> The first phase of codel effectively has a "training" period, where on a
>> link going from unloaded to loaded for the first time ever, the very
>> first drop with the default interval will happen in 200ms, yes. IF it
>> stays loaded and over the target delay after the first drop/mark, it will
>> then tune to ever smaller intervals to approximate an ideal drop rate,
>> until the latency on the link drops below the target. At that point the
>> algorithm saves that rate, and stops doing anything until the next time
>> the target delay is exceeded.
>>
>> Some keep asserting that that is all there is to codel, saying things
>> like "there is a linear increase in drop probability" using the invsqrt
>> mechanism,
>>
>> *which is true during the training phase*.
>>
>> After that approximation of the ideal drop/mark rate is obtained, the
>> algorithm goes quiescent until the next time the target delay is
>> consistently exceeded, at which point it schedules the next drop at a
>> little more than the stored previous drop rate. It then continually seeks
>> around that point, up and down.
>>
>> If the delay drops below the target in this phase, the algorithm stops
>> again and decreases the drop rate, as it's too high. If the delay stays
>> above target after the drop for the current value of the interval, the
>> drop rate increases.
>>
>> This is an interesting solution to Kleinrock's formulation of "power":
>> where he once said an average of one packet should be in the queue, codel
>> aims to never have less than one packet in the queue.
>>
>> And the switch into and out of drop mode on going above target is
>> entirely dependent on the characteristics over time of the flows on the
>> system, completely nonlinear, and that is where codel spends 99.999% of
>> its time on a loaded link.
>>
>> As Debussy said: "Music is the *space* between the notes".
>>
>> I wish I had a name for this second "seeking" phase that makes as much
>> sense as "congestion avoidance".
>>
>> So asserting that you'll always have a 200ms interval on a codel'd link
>> is just blatantly incorrect. On first boot, yes. On a busy network,
>> never again.[1]
>
> No no. The queue empties after CoDel overshoots the marking probability
> and then CoDel stops and starts from scratch.
No it doesn't, usually. When the queue length drops below 5ms, codel stops
dropping packets, and the queue does not drain to zero.

There have been several variants of the control law so far, with varying
degrees of success across different ranges of bandwidths.
http://www.pollere.net/CoDel.html documents some of them. If you would like
to back up your assertion with data taken against the linux codel variant,
the more advanced ns2_codel, or fq_codel, please do so. Patches are
available for linux here for the more advanced stuff, and please be aware
of:

http://www.bufferbloat.net/projects/codel/wiki/Best_practices_for_benchmarking_Codel_and_FQ_Codel

One of the biggest problems we've seen with overshoot has been dealing with
TSO, GSO and GRO offloads, which were not modeled in ns2 - and as of linux
3.12 much of the TSO and GSO problem was solved (these problems don't
happen in the real world of routers in the first place, only on testbeds
not emulating the real world).

Another problem we've seen is that scheduling latency in virtualized
environments can exceed the target. Don't do that.

Hmm… what else. The original paper contained an error that was corrected
shortly after it hit dead tree land. The as-deployed-in-linux variant has a
tendency to overshoot in some cases, and has issues at really high numbers
of flows, but in the fq_codel universe it is usually fine.

One problem we know exists in the real world is that the 5ms target is
unachievable at very low (<4mbit) bandwidths, in which case we have been
increasing the target delay for that rate, and there is also a patch under
test that eliminates the maxpacket check, which helps in that case as well.
The largest known fq_codel deployment (free.fr) uses a simple formula to
set the target at those rates. I don't think it's correct, but the packet
scheduling portion hides the issues here.
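Since "control law" keeps getting misread in RED terms, here is a rough
python sketch of the two phases I described above. This is my own
simplified model, NOT the shipping code (see net/sched/sch_codel.c for
that); the names are mine, and details like the count recall vary between
the variants:

```python
import math

class CoDelSketch:
    """Simplified two-phase CoDel control law. A sketch, not sch_codel.c."""

    def __init__(self, target=0.005, interval=0.100):
        self.target = target          # 5ms default
        self.interval = interval      # 100ms default in linux
        self.dropping = False         # are we in the drop/"seeking" phase?
        self.count = 0                # drops scheduled this cycle
        self.first_above_time = 0.0   # when sojourn first exceeded target
        self.drop_next = 0.0          # time of the next scheduled drop

    def control_law(self, now):
        # Drops spaced by interval/sqrt(count): as count climbs during the
        # "training" phase, the effective drop rate rises.
        return now + self.interval / math.sqrt(self.count)

    def should_drop(self, sojourn, now):
        if sojourn < self.target:
            # Queue under control: go quiescent, but keep count so the
            # learned rate is recalled next time (the hysteresis).
            self.first_above_time = 0.0
            self.dropping = False
            return False
        if self.first_above_time == 0.0:
            # Sojourn must stay above target a full interval before acting.
            self.first_above_time = now + self.interval
            return False
        if not self.dropping:
            if now >= self.first_above_time:
                self.dropping = True
                # Recall the stored rate, backed off a little ("-2" in
                # some variants), rather than retraining from scratch.
                self.count = max(self.count - 2, 1)
                self.drop_next = self.control_law(now)
            return False
        if now >= self.drop_next:
            self.count += 1
            self.drop_next = self.control_law(now)
            return True
        return False
```

Feed it the per-packet sojourn time at dequeue: on a link that stays loaded
the gaps between drops shrink toward the ideal rate, and the moment sojourn
falls under target it stops dropping entirely - it does not drain the queue
to zero.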
No, codel did not fully achieve the parameterless goal, and in its initial
release it also was not targeted at data center environments, where
different rules of physics apply. Development continues, but has mostly
been focused on the *fq_codel variants, which are superior in every
benchmark we've tried against all other comers.

> And yes, CoDel will always
> overshoot for sure because it _controls_ until the queue is in its
> control, i.e., below the threshold. How big the overshoot is, of course,
> depends.

I don't understand what you mean by overshoot; it will reduce the delay to
the 5ms target. It can certainly reduce it well below 5ms, but inducing
only that much jitter while still keeping utilization high is a goodness.

I don't have a problem if people shoot for a higher target when using codel
alone - pie as submitted to the lkml has a target of 20ms and a very large
estimation window that I have not yet tested at lower bandwidths. And
nobody uses codel by itself, although given pie's problems I am thinking of
taking a harder thwack at making codel itself better, abstracting out the
hysteresis variable for example.

>
> What this means in terms of TCP:
>
> The network/or your fq-queue (in the case of fq_codel, if you so want)
> won't be all that busy according to CoDel once CoDel kindly "coddled" the
> TCPs; that's the whole point of CoDel, I'm told :-). This is because the
> queue happens to be "under control" only if TCP backs off to below
> 5ms + RTT level of utilization. ...Now remember the effect of beta here.
> If the network remains more than "5ms busy", CoDel thinks that the queue
> is not in control and keeps shooting again and again until eventually the
> network is no longer "busy". The worst case beta * (5ms + RTT) is quite a
> small utilization and it takes time for a TCP to recover network busyness.

You are trying to explain something to me in RED terms that doesn't happen
in the real world.
Packets get smoothed into the "RTT", which serves as in-flight storage.

Go take a measurement, please. Supplying a packet capture would help
explain whatever you are seeing. Tell us what kernel you are using, and
follow the guidelines.

>
>> In the event of a link going completely idle, and staying
>> idle, there is hysteresis built into the code so it will retain that
>> drop rate for a few hundred milliseconds (it's 8*interval in some
>> versions of the code, 4 in others), before resetting count to 1 and the
>> resulting estimation window to interval.
>
> True, however, it only means that you'll overshoot more, or the time was
> too short to retain the trained count in memory (and in that case CoDel
> forgets it, like you admit). Or do you think that the magic number
> applied to count on the recall (was it "-2"?) works for all traffic?

You are not looking at the current as-shipped code.

> --
> i.

--
Dave Täht

Fixing bufferbloat with cerowrt:
http://www.teklibre.com/cerowrt/subscribe.html

_______________________________________________
aqm mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/aqm
