Re: [aqm] Text for aqm-recommendation on independent ECN config

Dave Taht Wed, 11 Dec 2013 12:01:59 -0800

On Wed, Dec 11, 2013 at 11:21 AM, Bob Briscoe <[email protected]> wrote:
> Jim,
>
>
> At 16:55 11/12/2013, Jim Gettys wrote:
>
>
>
> On Tue, Dec 10, 2013 at 10:04 PM, Bob Briscoe <[email protected]> wrote:
> Jim,
>
> I'm just checking we're not talking past each other. I'll repeat two quotes
> from each of us, then comment.
>
> On Thu, Dec 5, 2013 at 1:13 PM, Bob Briscoe <[email protected]> wrote:
>
> 3{New}. It SHOULD be possible to make different instances of an AQM
> algorithm apply to different subsets of packets that share the same queue.
> It SHOULD be possible to classify packets into these subsets at least by ECN
> codepoint [RFC3168] and Diffserv codepoint [RFC2474] (or the equivalent of
> these fields at lower layers),
>
>
> At 19:50 05/12/2013, Jim Gettys wrote:
>
> "Certainly, it may be the same instance of an AQM algorithm, rather than
> different instances, for example."
>
>
> That's true of course, but the case with one AQM handling all packets within
> a queue is the norm. I want to check you're happy with the converse:
> 1) A set-up more like WRED which was based on Dave Clark's RIO (RED with in
> and out of contract). So we can have WPIE, WCoDel etc where the
> differentiation between aggregates is provided by different AQM instances in
> the same queue, not by different queues with different scheduling
> priorities.
> 2) Extending this so that AQM differentiation can be between ECN-capable and
> Not-ECN-capable aggregates, not just between Diffserv classes (an example
> being CoDel with a lower 'interval' for ECN-capable packets).
>
> I presented the evaluations of this last idea in tsvwg on the final Friday
> of the Vancouver IETF - I don't think you were there. <
> http://www.ietf.org/proceedings/88/slides/slides-88-tsvwg-20.pdf >
>
>
> Yes, unfortunately I had to leave before the Friday session.
> This is my primary motivation for this wordsmithing - I'm trying to allow us
> to move towards zero signalling delays in CoDel, PIE and RED (currently
> defaults of 200ms, 100ms and 512 packets respectively, which are not good for
> dynamics).
>
>
> Certainly signalling delays are very important: this is why I'm favorably
> inclined to "head mark/drop", as it signals TCP as quickly as possible,
> keeping the response of the TCP feedback loop as tight as possible (and part
> of why I like CoDel so much for the highly variable bandwidth problem we
> face at the edge of the net).
>
> It's *really* important that when the bandwidth drops suddenly that everyone
> gets told to slow down quickly (exactly how quickly probably depends on the
> propagation change characteristics of the medium), or packets can pile up in
> a big way.
>
> How quickly the mark/drop algorithm can figure out that signalling is
> appropriate is the *other* piece of getting good dynamics.  Here I don't
> doubt that something may be discovered that is better than CoDel in the
> slightest.
> It takes a CoDel instance (within an fq structure) 200ms from its queue
> first passing 'threshold' before it will ever drop the first packet (unless
> the queue hits taildrop before that). So if the RTT is 20ms, that's 220ms
> signalling delay. In fq_codel this creates considerable self-delay for short
> flows or r-t apps, which kill their own latency before they get any loss
> signal to tell them to slow down. Even for elastic flows, with congestion
> signals delayed by so much, they risk hitting themselves with a huge train
> of overshoot loss. This would be the same for fq_pie, except the number is
> 100ms + RTT.


People have so consistently expressed things this way that I began to
doubt the reality myself. It seems like a large number of folk on this list
don't get it either, so I am going to try to explain in a new way.

Tackling codel first:

The first phase of codel is effectively a "training" period, where a link goes
from unloaded to loaded for the first time ever. The very first drop with the
default interval will happen in 200ms, yes. IF it stays loaded and over the
target delay after the first drop/mark, it will then tune to ever smaller
intervals to approximate an ideal drop rate until the latency on the link drops
below the target. At that point the algorithm saves that rate, and stops doing
anything until the next time the target delay is exceeded.

Some keep asserting that that is all there is to codel, saying things like
 "there is a linear increase in drop probability" using the invsqrt mechanism,

*which is true during the training phase*.
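
A minimal sketch of that training phase, assuming the usual CoDel control
law in which the gap to the next drop is interval / sqrt(count). The function
names and the 100ms interval in the usage line are illustrative choices, not
taken from any particular implementation:

```python
import math


def control_law(t_ms, interval_ms, count):
    """Time of the next drop/mark: the gap shrinks as 1/sqrt(count)."""
    return t_ms + interval_ms / math.sqrt(count)


def drop_schedule(interval_ms, n_drops):
    """Drop times for a queue that stays above target the whole time.

    Assumption: the first drop lands one interval after the queue first
    goes above target, and count increases by one on every drop.
    """
    t = interval_ms
    times = [t]
    for count in range(1, n_drops):
        t = control_law(t, interval_ms, count)
        times.append(t)
    return times


if __name__ == "__main__":
    times = drop_schedule(interval_ms=100.0, n_drops=6)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    print([round(g, 1) for g in gaps])
```

Each gap is shorter than the one before it, which is the "ever smaller
intervals" behaviour while the delay stays above target.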

After that approximation of ideal drop/mark rate is obtained, the algorithm
goes quiescent until the next time the target delay is consistently exceeded,
at which point it schedules the next drop at a little more than the stored
previous drop rate. It then continually seeks around that point up and down.

If the delay drops below the target in this phase, the algorithm stops
again and decreases the drop rate again, as it's too high. If the delay
stays above target after the drop for the current value of the interval,
the drop rate increases.

This is an interesting solution to Kleinrock's formulation of "power": where
he once said an average of one packet should be in the queue, codel aims
to never have less than one packet in the queue.

And the switch into and out of drop mode going above target is entirely
dependent on the characteristics over time of the flows on the system,
completely nonlinear, and where codel spends 99.999% of its time on a
loaded link.

As Debussy said: "Music is the *space* between the notes".

I wish I had a name for this second "seeking" phase that makes as much sense as
"congestion avoidance".

So asserting that you'll always have a 200ms interval on a codel'd link is just
blatantly incorrect. On first boot, yes. On a busy network, never again.[1]

In the event of a link going completely idle, and staying idle, there is
hysteresis built into the code so it will retain that drop rate for a few
hundred milliseconds (it's 8*interval in some versions of the code, 4 in
others), before resetting count to 1 and the resulting estimation window
to interval.
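
The re-entry and hysteresis behaviour can be sketched roughly as follows.
This is a simplified illustration, not the code of any shipping qdisc: the
function name, the `hold_intervals` parameter (8 or 4, per the versions
mentioned above) and the "resume at previous count minus 2" step are
assumptions made for the example.

```python
def reentry_count(now_ms, last_drop_next_ms, prev_count, interval_ms,
                  hold_intervals=8):
    """Pick the starting count when the queue goes back above target.

    Within hold_intervals * interval of the last scheduled drop, resume
    close to the remembered count; after that the memory has expired and
    the algorithm restarts from count = 1 (estimation window = interval).
    The "- 2" back-off is an assumption for illustration.
    """
    if now_ms - last_drop_next_ms < hold_intervals * interval_ms:
        return max(prev_count - 2, 1)
    return 1


if __name__ == "__main__":
    # Busy link: congestion returns 500ms after the last scheduled drop.
    print(reentry_count(1500, 1000, prev_count=20, interval_ms=100))
    # Link idle for a full second: the stored rate is forgotten.
    print(reentry_count(2000, 1000, prev_count=20, interval_ms=100))
```

With these illustrative numbers the first call resumes near the stored rate
and the second falls back to count = 1, which is the "On first boot, yes. On
a busy network, never again" distinction.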

It is certainly possible to come up with a codel variant that more
closely matches the preconditions of the network, and the speeds on
the link.

In the data center case there are two knobs: set the target delay and
interval to the desired time width of that network. I keep hoping
someone in a datacenter with resources to play will merely set them,
and report the results. I'd suggest target 500us and interval 10ms as
numbers most modern x86 server hardware can reliably accomplish
without invoking other scheduling delays from cpu schedulers, etc.
Even lower values are possible, but these two are fairly close to a
data center width, and I hope someone will try them. [2]

PLEASE? I simply don't have data center resources, yet the experiment
is EASY if you do. (note: ECN is on by default in fq_codel and off by
default in codel.)
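
For anyone who wants to try it, here is a small sketch that builds the
corresponding Linux tc command line. The helper name and the "eth0" device
are hypothetical, and the sketch only prints the command rather than running
it; it assumes the codel qdisc's `target`, `interval` and `ecn`/`noecn`
options as found in iproute2.

```python
def codel_tc_command(dev, target="500us", interval="10ms", ecn=True):
    """Build (but do not run) a tc command that installs codel on dev.

    target and interval default to the values suggested above; ecn is
    set explicitly because plain codel leaves it off by default.
    """
    cmd = ["tc", "qdisc", "replace", "dev", dev, "root", "codel",
           "target", target, "interval", interval]
    cmd.append("ecn" if ecn else "noecn")
    return cmd


if __name__ == "__main__":
    # "eth0" is a placeholder; substitute the interface under test.
    print(" ".join(codel_tc_command("eth0")))
```

Running the printed command needs root, and the interface name will differ
per machine.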

:whew: hopefully that explains that.

NEXT is tackling what has been said about the behavior of fq_codel in this
environment, but to avoid cluttering up the issue, I'll stop here at
getting codel described right, today. And go back to sigcomm for a while.

[1] On fq_codel the values are cached, too, but… well… I will try to explain
later. Yes, more work is needed in this area.

[2] It might make sense to modify the hysteresis in the data center case.

> Yes, the e2e transport could measure delay growth, but it doesn't know
> whether the delay is coming from a queue that is isolated from others or
> not. So it doesn't want to slow down too quickly in response to delay growth
> in case it gets screwed by other traffic. Ie. using delay growth as a signal
> entails considerable signalling delay due to all the uncertainty.
>
> The proposal you missed in tsvwg was to define ECN as an immediate signal
> from the network, 'interval'=0 in CoDel terms, so the host always gets
> congestion signals as fast as possible, and if it needs bursts of signals
> smoothed out, it can do that itself.
>
> The suggested wording ensures all AQM implementations will allow operators,
> vendors and users to configure such a mechanism. But I've generalised it
> from ECN to Diffserv too (because the implementation would be no different).
>
>
>
> My basic issue is one of terminology: people have talked about "best effort"
> queues.  In reality, this is a "class" of service, rather than a single
> queue, and when you get into the mental model of BE being a single queue,
> (rather than a set of queues) it can lead one astray quickly and easily.
>
>
> Yeah, I know this. I suspected we were talking past each other.
>
> I need you to allow the other case into your mind for this conversation. The
> wording is specifically about the case where "different subsets of packets
> ... share the same queue".
>
> We can talk about an fq structure for this another time, but it's a really
> complicated way of doing it. Given simple looks like it could work, why get
> complicated already?
>
>
> It's really easy to fall into the idea of a single software queue mapping to
> some single hardware supported queue, and that's a cognitive mistake, as
> aggregating MACs are showing us; transmit ops are often the scarcest
> resource...
>
>
> It's only a cognitive mistake if one is not aware of all the options. I'm
> fully aware of all the options.
>
> To be specific, a queue into a wireless medium should be configured so it
> holds some 'good queue' in reserve for transmit ops, but the queues on top
> of this that TCP self-inflicts even briefly are not 'good queues' even if
> they are isolated from other flows by fq - VJ was wrong to generalise the
> phrase 'good queue' to all bursts of queue - it is only necessary to hold
> back from signalling and allow a burst of queue if the only possible signal
> is a drop. With ECN, you don't have this dilemma. This is the key to rapid
> dynamics.
>
>
> Diffserv marking has the potential to give a "hint" to distinguish how
> particular flows should be handled (scheduled) in a service class, and as my
> previous example shows, that hint may be very useful in channel access
> decisions (e.g. voip on 802.11).
>
> But fq_codel teaches the lesson that packet scheduling combined with keeping
> TCP sane is a key improvement over handling either problem apart... In
> particular, the first packets of new flows/reappearing flows are vastly more
> "important" than other packets in terms of the latency cost to users of that
> service. Each flow has in essence its own queue in this service class, and
> we're using information from that to help schedule the packets in ways that
> minimize latency to the user.
>
>
> I know all this. Please can we keep to the conversation about how to avoid
> the 200ms signalling delay that fq_codel inflicts on each flow (and the
> similar signalling delays that other AQMs inflict).
>
>
>
> So in this case, a single algorithm is acting over a bunch of flows in a
> single class of service, and both scheduling packets among the flows, and
> signalling TCP flows appropriately when they should "slow down".
>
>
> Yup, I know this.
>
>
>
> So I think you and I are on close to the same page (but have been burned
> badly in the past by terminology issues getting in the way).  On HTTP/1.1 we
> wasted probably > 2 years talking past each other because we didn't have
> clear and concise terminology that we all understood the same way.
>
>
> As I thought, we are talking past each other. We need to be able to have a
> conversation that is not always "Hmm, that sounds like it might be
> interesting. Can I tell you about fq_codel now?"
>
>
>
> Bob
>
>
>
> And I don't claim I have the right terminology for all this stuff, either
> (even in this mail).
>
> Which is why I was loath to suggest exact text...
>                            - Jim
>
>
>
> At 19:50 05/12/2013, Jim Gettys wrote:
>
>
>
> On Thu, Dec 5, 2013 at 1:13 PM, Bob Briscoe <[email protected]> wrote:
> Fred, Gorry, all,
> I promised to suggest text for draft-ietf-aqm-recommendation about allowing
> the AQM's behaviour to be independent for ECN and non-ECN packets. In the
> process, I realised we can't talk about independent AQMs for ECN without
> also including Diffserv.
> This gets messy, because I believe a good AQM for BE traffic with and
> without ECN, should remove much if not all the need for Diffserv. But we
> can't ignore Diffserv.
>
>
> I agree in principle with what Bob is trying to say here (and is very much
> what I've been saying in my blog entry of last summer).
>
> Once you have things under control, the need for Diffserv diminishes
> dramatically (but does not go away).
>
> But as Bob notes, there is still a good use for Diffserv: suitably marked
> traffic may want to contend for access to the channel differently: your
> marked VOIP packets may want to change the priority with which you request
> channel access, so that you get more timely access to the medium. This
> conserves transmit opportunities, which is often the scarcest resource in
> many systems (e.g. 802.11, DOCSIS, etc.). This can be the difference between
> your VOIP working well, and not working well, on a busy 802.11 network as
> well as using the channel as efficiently as possible.
>
> Similarly, if you have packets you know are background, it's helpful to know
> that to ensure that they never contend for access to the medium but will
> always defer to other traffic, and just scavenge available space in other
> transmit opportunities where possible.
>
> I'm a bit loath though to tie the behavior to queues, however; in
> particular, best effort traffic may want to be sent in the same aggregate as
> higher (or lower) priority traffic, if there is remaining space in the
> aggregate.
>
> In short, the mental model we've had that there is a one-to-one model of
> hardware and software queues (not to mention flows in a given software
> queue) is often incorrect (or at least seriously sub-optimal) in today's
> systems (even if the hardware queues "work" properly, which it appears they
> do not in 802.11).
>
> So I'm not sure Bob's new section 3 here is how best to state this (or to
> deal with the terminology problem).  Certainly, it may be the same instance
> of an AQM algorithm, rather than different instances, for example.  And "
> It SHOULD be possible" is more a pious wish than anything else.  But I agree
> in spirit with what Bob's trying to say.
>                                - Jim
>
> _________________________________________________________________________________________
> {In Section 4: add another bullet between recommendations 2 & 3:}
> 3{New}. It SHOULD be possible to make different instances of an AQM
> algorithm apply to different subsets of packets that share the same queue.
> It SHOULD be possible to classify packets into these subsets at least by ECN
> codepoint [RFC3168] and Diffserv codepoint [RFC2474] (or the equivalent of
> these fields at lower layers).
> {Then a new section to expand on this before the current Section 4.3.}
> 4.3{New}. Independent AQM Instances for ECN and Diffserv
> The recommendation to provide a separate instance of the AQM for ECN packets
> goes beyond the assumptions of RFC 3168, which assumed that only one
> instance of an AQM will handle both ECN-capable and non-ECN-capable packets.
>
>
> Bob
>
>
> ________________________________________________________________
> Bob Briscoe,                                                  BT
> _______________________________________________
> aqm mailing list
> [email protected]
> https://www.ietf.org/mailman/listinfo/aqm
>
>



-- 
Dave Täht

Fixing bufferbloat with cerowrt: http://www.teklibre.com/cerowrt/subscribe.html
_______________________________________________
aqm mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/aqm
