Re: [Bloat] when does the CoDel part of fq_codel help in the real world?
On Wed, Nov 28, 2018 at 11:45 PM Jonathan Morton wrote:
>> On 29 Nov, 2018, at 9:39 am, Dave Taht wrote:
>>
>> …when it is nearly certain that more than one flow exists, means aiming
>> for the BDP in a single flow is generally foolish.
>
> It might be more accurate to say that the BDP of the fair-share of the
> path is the cwnd to aim for. Plus epsilon for probing.

OK, much better, thanks.

--
Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740

___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat
Re: [Bloat] incremental deployment, transport and L4S (Re: when does the CoDel part of fq_codel help in the real world?)
On Thu, 29 Nov 2018, Jonathan Morton wrote:

> You are essentially proposing using ECT(1) to take over an intended
> function of Diffserv.

Well, I am not proposing anything; I am giving people a heads-up that the
L4S authors are proposing this. But yes, you're right. Diffserv has shown
itself to be really hard to deploy incrementally across the Internet, so
it's generally bleached mid-path.

> In my view, that is the wrong approach. Better to improve Diffserv to
> the point where it becomes useful in practice.

I agree, but unfortunately nobody has made me king of the Internet yet, so
I can't just decree it into existence.

> Cake has taken steps in that direction, by implementing some reasonable
> interpretation of some Diffserv codepoints.

Great. I don't know if I've asked this before, but is Cake easily
implementable in hardware? From what I can tell, Marvell is still the only
vendor putting CPUs into HGWs fast enough to do forwarding in the CPU
(which can run Cake); all the others rely on packet accelerators to reach
the desired speeds.

> My alternative use of ECT(1) is more in keeping with the other
> codepoints represented by those two bits, to allow ECN to provide more
> fine-grained information about congestion than it presently does. The
> main challenge is communicating the relevant information back to the
> sender upon receipt, ideally without increasing overhead in the TCP/IP
> headers.

You need to go into the IETF process and voice this opinion, because if
nobody opposes soon, ECT(1) may be assigned the L4S interpretation. They
do have ECN feedback mechanisms in their proposal -- have you read it?
It's a whole suite of documents: architecture, AQM proposal, transport
proposal, the entire thing.

On the other hand, what you want to do and what L4S tries to do might be
closely related; it doesn't sound too far off. Also, Bob Briscoe works for
CableLabs now, so he will have silicon behind him. This silicon might go
into other things, not just DOCSIS equipment, so if you have use-cases
that L4S doesn't cover but could with minor modification, it might be
better to join him than to fight him.

--
Mikael Abrahamsson    email: swm...@swm.pp.se
Re: [Bloat] when does the CoDel part of fq_codel help in the real world?
> On 29 Nov, 2018, at 9:39 am, Dave Taht wrote:
>
> …when it is nearly certain that more than one flow exists, means aiming
> for the BDP in a single flow is generally foolish.

It might be more accurate to say that the BDP of the fair-share of the
path is the cwnd to aim for. Plus epsilon for probing.

 - Jonathan Morton
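[Editorial note: Jonathan's rule of thumb can be put in numbers. The
following Python back-of-the-envelope is purely illustrative -- the
function name, the 5% probing epsilon, and the example figures are
assumptions of this sketch, not anything stated in the thread.]

```python
def fair_share_cwnd_bytes(link_rate_bps, rtt_s, n_flows, epsilon=0.05):
    """cwnd target = BDP of this flow's fair share, plus epsilon for probing."""
    bdp_bytes = (link_rate_bps / 8) * rtt_s   # full-path BDP in bytes
    fair_share = bdp_bytes / n_flows          # this flow's share of the path
    return fair_share * (1 + epsilon)         # small headroom for probing

# 100 Mbit/s path, 20 ms RTT, 4 competing flows:
# full BDP = 250000 bytes, fair share = 62500, target ~65625 bytes
print(fair_share_cwnd_bytes(100e6, 0.020, 4))
```

The point of the division by `n_flows` is exactly Dave's observation above:
aiming a single flow's cwnd at the whole-path BDP is only right when the
flow is alone, which it almost never is.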
Re: [Bloat] when does the CoDel part of fq_codel help in the real world?
Mikael Abrahamsson writes:

> On Tue, 27 Nov 2018, Luca Muscariello wrote:
>
>> link fully utilized is defined as Q>0, unless you don't include the
>> packet currently being transmitted. I do, so the transmitter is never
>> idle. But that's a detail.
>
> As someone who works with moving packets, it's perplexing to me to
> interact with transport peeps who seem enormously focused on "goodput".
> My personal opinion is that most people would be better off with 80% of
> their available bandwidth being in use without any noticeable
> buffer-induced delay, as opposed to the transport protocol doing its
> damndest to fill up the link to 100%, sometimes failing and inducing
> delay instead.

+1

I came up with a new analogy today. Some really like to build dragsters --
which go fast but might explode at the end of the strip, or even during
the race! I like to build churches -- which will stand for a thousand
years. You can reason about stable, deterministic systems, and build other
beautiful structures on top of them. I have faith in churches, not
dragsters.

> Could someone perhaps comment on the thinking in the transport protocol
> design "crowd" when it comes to this?
Re: [Bloat] incremental deployment, transport and L4S (Re: when does the CoDel part of fq_codel help in the real world?)
> On 29 Nov, 2018, at 9:28 am, Mikael Abrahamsson wrote:
>
> This is one thing about L4S: ECT(1) is the last "codepoint" in the
> header not used, that can statelessly identify something. If anyone sees
> a better way to use it compared to "let's put it in a separate queue and
> CE-mark it aggressively at very low queue depths, and also not care
> about re-ordering so an ARQ L2 can re-order all it wants", then they
> need to speak up, soon.

You are essentially proposing using ECT(1) to take over an intended
function of Diffserv.

In my view, that is the wrong approach. Better to improve Diffserv to the
point where it becomes useful in practice. Cake has taken steps in that
direction, by implementing some reasonable interpretation of some Diffserv
codepoints.

My alternative use of ECT(1) is more in keeping with the other codepoints
represented by those two bits: to allow ECN to provide more fine-grained
information about congestion than it presently does. The main challenge is
communicating the relevant information back to the sender upon receipt,
ideally without increasing overhead in the TCP/IP headers.

 - Jonathan Morton
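[Editorial note: for readers following along, the two ECN bits under
discussion encode four codepoints, defined in RFC 3168. A small Python
table makes the scarcity concrete; the parenthetical annotations are a
summary, not text from the thread.]

```python
# The two ECN bits in the IP header (RFC 3168) encode four codepoints.
# ECT(1) is the one that both L4S and Jonathan's finer-grained-ECN idea
# are competing for, since classic ECN treats ECT(0) and ECT(1) the same.
ECN_CODEPOINTS = {
    0b00: "Not-ECT  (transport is not ECN-capable)",
    0b10: "ECT(0)   (ECN-capable transport)",
    0b01: "ECT(1)   (ECN-capable; the contested codepoint)",
    0b11: "CE       (Congestion Experienced, set by an AQM)",
}

def ecn_bits(tos_byte):
    """Extract the ECN field: the two low-order bits of the TOS/Traffic Class byte."""
    return tos_byte & 0b11

print(ECN_CODEPOINTS[ecn_bits(0x03)])  # 0b11 -> CE
```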
Re: [Bloat] when does the CoDel part of fq_codel help in the real world?
"Bless, Roland (TM)" writes:

> Hi Luca,
>
> Am 27.11.18 um 10:24 schrieb Luca Muscariello:
>> A congestion controlled protocol such as TCP or others, including QUIC,
>> LEDBAT and so on, need at least the BDP in the transmission queue to
>> get full link efficiency, i.e. the queue never empties out.
>
> This is not true. There are congestion control algorithms
> (e.g., TCP LoLa [1] or BBRv2) that can fully utilize the bottleneck link
> capacity without filling the buffer to its maximum capacity. The BDP

Just to stay cynical, I would rather like the BBR and LoLa folk to look
closely at asymmetric networks, ack-path delay, and rates lower than
1 Gbit/s. And, what the heck... wifi. :)

BBRv1, for example, is hard-coded to reduce cwnd to 4 and no lower --
because that works in the data center. LoLa, so far as I know, achieves
its tested results at 1-10 Gbit/s. My world, and much of the rest of the
world, barely gets to a gigabit on a good day, with a tail-wind. If either
of these TCPs could be tuned to work well without saturating 5 Mbit links,
I would be a happier person. RRUL benchmarks, anyone?

I did, honestly, want to run LoLa (the codebase was broken), and I am
patiently waiting for BBRv2 to escape (while hoping that the googlers
actually run some flent tests at edge bandwidths before I tear into it).

Personally, I'd settle for SFQ on the CMTSes, fq_codel on the home
routers, and then let the tcp-ers decide how much delay and loss they can
tolerate.

Another thought... I mean... can't we all just agree to make cubic more
gentle, go fix that, and not have a flag day? "From Linux 5.0 forward,
cubic shall: stop increasing its window at 250 ms of delay greater than
the initial RTT." Have it occasionally RTT-probe a bit, more like BBR?

> rule of thumb basically stems from the older loss-based congestion
> control variants that profit from the standing queue that they built
> over time when they detect a loss:
> while they back off and stop sending, the queue keeps the bottleneck
> output busy and you'll not see underutilization of the link. Moreover,
> once you get good loss de-synchronization, the buffer size requirement
> for multiple long-lived flows decreases.
>
>> This gives rules of thumb to size buffers, which is also very practical
>> and thanks to flow isolation becomes very accurate.
>
> The positive effect of buffers is merely their role to absorb
> short-term bursts (i.e., mismatch in arrival and departure rates)
> instead of dropping packets. One does not need a big buffer to
> fully utilize a link (with perfect knowledge you can keep the link
> saturated even without a single packet waiting in the buffer).
> Furthermore, large buffers (e.g., using the BDP rule of thumb)
> are not useful/practical anymore at very high speeds such as 100 Gbit/s:
> memory is also quite costly at such high speeds...
>
> Regards,
>  Roland
>
> [1] M. Hock, F. Neumeister, M. Zitterbart, R. Bless.
> TCP LoLa: Congestion Control for Low Latencies and High Throughput.
> Local Computer Networks (LCN), 2017 IEEE 42nd Conference on, pp.
> 215-218, Singapore, Singapore, October 2017.
> http://doc.tm.kit.edu/2017-LCN-lola-paper-authors-copy.pdf

This whole thread, although divisive... well, I'd really like everybody to
get together and try to write a joint paper on the best stuff to do,
worldwide, to make bufferbloat go away.

>> Which is:
>>
>> 1) find a way to keep the number of backlogged flows at a reasonable
>> value. This largely depends on the minimum fair rate an application may
>> need in the long term. We discussed a little bit of available
>> mechanisms to achieve that in the literature.
>>
>> 2) fix the largest RTT you want to serve at full utilization and size
>> the buffer using BDP * N_backlogged.
>> Or the other way round: check how much memory you can use
>> in the router/line card/device and, for a fixed N, compute the largest
>> RTT you can serve at full utilization.
>>
>> 3) there is still some memory to dimension for sparse flows in addition
>> to that, but this is not based on BDP.
>> It is just enough to compute the total utilization of sparse flows and
>> use the same simple model Toke has used
>> to compute the (de)prioritization probability.
>>
>> This procedure would allow sizing fq_codel but also SFQ.
>> It would be interesting to compare the two under this buffer sizing.
>> It would also be interesting to compare another mechanism that we have
>> mentioned during the defense,
>> which is AFD + a sparse flow queue. This is, BTW, already available in
>> Cisco Nexus switches for data centres.
>>
>> I think that the codel part would still provide the ECN feature,
>> which all the others cannot have.
>> However the others, especially the last one, can be implemented in
>> silicon at reasonable cost.
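[Editorial note: the sizing procedure quoted above can be sketched
numerically. Everything here -- the function names and the 10% sparse-flow
allowance -- is an illustrative assumption layered on steps 2 and 3; the
thread gives the formula BDP * N_backlogged but no concrete numbers.]

```python
def shared_buffer_bytes(link_rate_bps, max_rtt_s, n_backlogged,
                        sparse_overhead=0.10):
    """Step 2: size the shared buffer as BDP * N_backlogged;
    step 3: plus a small allowance for sparse flows (not BDP-based)."""
    bdp = (link_rate_bps / 8) * max_rtt_s   # BDP at the largest RTT served
    backlog_pool = bdp * n_backlogged
    return backlog_pool * (1 + sparse_overhead)

def max_rtt_at_full_utilization(mem_bytes, link_rate_bps, n_backlogged):
    """Or the other way round: fixed memory and fixed N give the
    largest RTT that can be served at full utilization."""
    return mem_bytes / n_backlogged / (link_rate_bps / 8)

# 1 Gbit/s link, 100 ms worst-case RTT, 8 backlogged flows:
print(shared_buffer_bytes(1e9, 0.100, 8))          # 110 MB (100 MB + 10% sparse)
print(max_rtt_at_full_utilization(100e6, 1e9, 8))  # 0.1 s: the inverse check
```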
[Bloat] incremental deployment, transport and L4S (Re: when does the CoDel part of fq_codel help in the real world?)
On Wed, 28 Nov 2018, Dave Taht wrote:

> see ecn-sane. Please try to write a position paper as to where and why
> ecn is good and bad. if one day we could merely establish a talmud of
> commentary around this religion it would help.

From my viewpoint it seems to be all about incremental deployment. We have
30 years of "crud" that things need to work with, and the worst case must
not be a disaster for anything that wants to deploy.

This is one thing about L4S: ECT(1) is the last "codepoint" in the header
not used, that can statelessly identify something. If anyone sees a better
way to use it compared to "let's put it in a separate queue and CE-mark it
aggressively at very low queue depths, and also not care about re-ordering
so an ARQ L2 can re-order all it wants", then they need to speak up, soon.

I actually think the "let's not care about re-ordering" would be a
brilliant thing; it would help quite a lot of packet-network types become
less costly and more efficient, while at the same time not blocking
subsequent packets just because some earlier packet needed to be
retransmitted. Brilliant for QUIC, for instance, which already handles
this (at least per-stream).

--
Mikael Abrahamsson    email: swm...@swm.pp.se
Re: [Bloat] when does the CoDel part of fq_codel help in the real world?
"Bless, Roland (TM)" writes:

> Hi Luca,
>
> Am 28.11.18 um 11:48 schrieb Luca Muscariello:
>
>> And for BBR, I would say that one thing is the design principles,
>> another is the implementations, and we had better distinguish between
>> them. The key design principles are all valid.
>
> While the goal is certainly right to operate around the optimal point
> where the buffer is nearly empty, BBR's model is only valid from either
> the viewpoint of the bottleneck or that of a single sender.

I think I agree with this, from my own experimental data.

> In BBR, one of the key design principles is to observe the achieved
> delivery rate. One assumption in BBRv1 is that if the delivery rate can
> still be increased, then the bottleneck isn't saturated. This doesn't
> necessarily hold if you have multiple BBR flows present at the
> bottleneck. Every BBR flow can (nearly always) increase its delivery
> rate while probing: it will simply decrease other flows' shares. This is
> not an _implementation_ issue of BBRv1 and has been explained in section
> III of our BBR evaluation paper.

Haven't re-read it yet.

> This section also shows that BBRv1 will (by concept) increase its amount
> of inflight data to the maximum of 2 * estimated_BDP if multiple flows
> are present. A BBR sender could also use packet loss or RTT increase as

Carnage!

> indicators that it is probably operating right of the optimal point, but
> this is not done in BBRv1. BBRv2 will thus be an improvement over BBRv1
> in several ways.

I really really really want a sane response to ecn in bbr.

> Regards,
>  Roland
Re: [Bloat] when does the CoDel part of fq_codel help in the real world?
Luca Muscariello writes:

> On Wed, Nov 28, 2018 at 11:40 AM Dave Taht wrote:
>
>> On Wed, Nov 28, 2018 at 1:56 AM Luca Muscariello wrote:
>>>
>>> Dave,
>>>
>>> The single BDP inflight is a rule of thumb that does not account for
>>> fluctuations of the RTT. And I am not talking about random
>>> fluctuations and noise. I am talking about fluctuations from a
>>> control-theoretic point of view, to stabilise the system, e.g. the
>>> trajectory of the system variable that gets to the optimal point no
>>> matter the initial conditions (Lyapunov).
>>
>> I have been trying all day to summon the gumption to make this
>> argument:
>>
>> IF you have a good idea of the actual RTT...
>>
>> it is also nearly certain that there will be *at least* one other flow
>> you will be competing with...
>> therefore the fluctuations from every point of view are dominated by
>> the interaction between these flows, and
>> the goal, in general, is not to take up a full BDP for your single
>> flow.
>>
>> And BBR aims for some tiny percentage less than what it thinks it can
>> get, when, well, everybody's seen it battle it out with itself and
>> with cubic. I hand it FQ at the bottleneck link and it works well.
>>
>> single flows exist only in the minds of theorists and labs.
>>
>> There's a relevant passage worth citing in the kleinrock paper, I
>> thought (did he write two recently?) that talked about this problem...
>> I *swear* when I first read it it had a deeper discussion of the
>> second sentence below, and had two paragraphs that went into the
>> issues with multiple flows:
>>
>> "…ch earlier and led to the Flow Deviation algorithm [28]. 17 The
>> reason that the early work of 40 years ago took so long to make its
>> current impact is because in [31] it was shown that the mechanism
>> presented in [2] and [3] could not be implemented in a decentralized
>> algorithm. This delayed the application of Power until the recent work
>> by the Google team in [1] demonstrated that the key elements of
>> response time and bandwidth could indeed be estimated using a
>> distributed control loop sliding window spanning approximately 10
>> round-trip times."
>>
>> but I can't find it today.
>
> Here it is:
>
> https://www.lk.cs.ucla.edu/data/files/Kleinrock/Internet%20Congestion%20Control%20Using%20the%20Power%20Metric-Keep%20the%20Pipe%20Just%20Full%2C%20But%20No%20Fuller%20July%202018.pdf

Thank you, that is more like what I remember reading. That said, I still
remember a two-paragraph thing that went into footnote 17 of the 40+
years of history behind all this, that clicked with me about why we're
still going wrong... and I can't remember what it is. I'll go deeper into
the past and read more of the references off of this.
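[Editorial note: the "Power" metric in the Kleinrock paper cited above is
power = throughput / delay, maximized when the data in flight equals the
BDP -- "keep the pipe just full, but no fuller". The minimal fluid model
below is a sketch under simplifying assumptions (single flow, fixed
capacity), not the paper's formulation.]

```python
def power(inflight, capacity_Bps, base_rtt_s):
    """Kleinrock power = throughput / delay for a single-link fluid model."""
    bdp = capacity_Bps * base_rtt_s
    if inflight <= bdp:                     # pipe not yet full
        throughput = inflight / base_rtt_s
        delay = base_rtt_s
    else:                                   # excess inflight sits in the queue
        throughput = capacity_Bps
        delay = base_rtt_s + (inflight - bdp) / capacity_Bps
    return throughput / delay

C, rtt = 12.5e6, 0.020                      # 100 Mbit/s, 20 ms -> BDP = 250 kB
bdp = C * rtt
assert power(bdp, C, rtt) > power(0.5 * bdp, C, rtt)  # underfilled pipe loses
assert power(bdp, C, rtt) > power(2.0 * bdp, C, rtt)  # bloated queue loses
```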
Re: [Bloat] when does the CoDel part of fq_codel help in the real world?
Michael Welzl writes:

> Just a small clarification:
>
>> To me the switch to head dropping essentially killed the tail-loss RTO
>> problem, eliminated most of the need for ecn.
>
> I doubt that: TCP will need to retransmit that packet at the head, and
> that takes an RTT -- all the packets after it will need to wait in the
> receiver buffer before the application gets them. But I don't have
> measurements to prove my point, so I'm just hand-waving…
>
> I don't doubt that this kills the tail-loss RTO problem.

Yea! I wish we had more data on it, though. We haven't really ever looked
at RTOs in our (enormous) data sets; it's just an assumption that we don't
see them. There's terabytes of captures.

> I doubt that it eliminates the need for ECN.

A specific example that burned me was Stuart's demo showing screen sharing
"just working", with ecn, on what was about a 20 ms path. GREAT demo! Very
real result from codel. Ship it! The audience applauded madly. fq_codel
went into OSX earlier this year.

Thing was, there was a 16 ms frame rate (at best, probably closer to
64 ms), at least a 32 ms jitter buffer (probably in the 100s of ms,
actually), an encoder that took at least a frame's worth of time... and
having the flow retransmit a lost packet vs ecn -- within a 15 ms rtt,
with a jitter buffer already there -- was utterly invisible to the
application and user.

Sooo, see ecn-sane. Please try to write a position paper as to where and
why ecn is good and bad. if one day we could merely establish a talmud of
commentary around this religion it would help.
Re: [Bloat] [Codel] found another good use for a queue today, possibly
Jonathan Morton writes:

>>> "polylog(n)-wise Independent Hash Function". OK, my google-fu fails
>>> me: the authors use sha1; would something lighter-weight suit?
>>
>> The current favorite in DPDK land seems to be Cuckoo hashing.
>> It has better cache behavior than typical chaining.
>
> That paper describes an improved variant of cuckoo hashing, using a
> queue to help resolve collisions with better time complexity. The proof
> relies on (among other things) a particular grade of hash function
> being used. SHA1 is described as being suitable since it offers
> cryptographic-level performance… We actually need two hashes with
> independent behaviour on the same input, one for each table.
>
> If we were to assume table sizes up to 64K, using both halves of a good
> 32-bit hash might be suitable. It may be that plain old Jenkins hash
> would work in that context. Supplement that with a 64-entry queue with
> linear search (in software) or constant-time CAM search (in hardware).

I was aiming for 2 million routes. I gave up trying to wade through it
and pinged the authors. Fiddling with blake at the moment.

> - Jonathan Morton
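[Editorial note: Jonathan's "both halves of a good 32-bit hash" idea can
be sketched as below. This is hypothetical illustration only: zlib.crc32
stands in for "a good 32-bit hash" (Jenkins, say), and whether the two
halves are independent *enough* for the paper's proof is exactly the open
question in the thread.]

```python
import zlib

TABLE_BITS = 16  # two 64K-entry cuckoo tables, matching the "up to 64K" assumption

def two_indices(key: bytes):
    """Derive the two table indices a cuckoo hash needs from the two
    halves of a single 32-bit hash of the flow key."""
    h = zlib.crc32(key) & 0xFFFFFFFF        # one 32-bit hash...
    return h >> TABLE_BITS, h & 0xFFFF      # ...split into two 16-bit indices

# A flow key might be the 5-tuple serialized to bytes (illustrative):
i1, i2 = two_indices(b"10.0.0.1:443->192.168.1.2:51000 tcp")
assert 0 <= i1 < 2**TABLE_BITS and 0 <= i2 < 2**TABLE_BITS
```

An entry is then stored at `i1` in table one or `i2` in table two, with
displaced entries pushed into the paper's pending queue rather than
relocated immediately.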
Re: [Bloat] known buffer sizes on switches
On Wed, Nov 28, 2018 at 8:55 AM Dave Taht wrote:
>
> Bruno George Moraes writes:
>
>> Nice resource, thanks.
>>
>> If someone wonders why things look the way they do, it's all about
>> on-die and off-die memory. Either you use off-die or on-die memory,
>> often SRAM, which requires 6 gates per bit. So spending half a billion
>> gates gives you ~10MB buffer on-die. If you're doing off-die memory
>> (DRAM or similar) then you'll get the gigabytes of memory seen in some
>> equipment. There basically is nothing in between. As soon as you go
>> off-die you might as well put at least 2-6 GB in there.
>>
>> There is some research on new memory devices with unexpected
>> results...
>> https://ieeexplore.ieee.org/document/8533260
>>
>> The HMC memory allows improvements in execution time and consumed
>> energy. In some situations, this memory type permits removing the
>> L2 cache from the memory hierarchy.
>>
>> HMC parts start at 2GB

That effort actually looks pretty promising. I liked the support for
atomic ops too, offloaded. There are also so many useful operations that
I'd like to see offloaded to ram -- like zeroing memory regions, as one
example. http://www.hybridmemorycube.org/

Will probably run hot. But, grump: I still don't "get" why the
traditional division between memory and cpu makers hasn't collapsed yet.
A package like that with a cpu *in it*, and we're done. 4GB "ought to be
enough for everybody".

27? years ago, back when I was attempting to write an SF novel, I had an
idea for a more efficient way to pack cores and memory together.
Basically: shrink the Cray-1 design down to about the size of a nickel
(or dime!). The Cray had that rough shape for optimum routing and
cooling, but... the overall shape of the package becomes a hexagonal
(https://en.wikipedia.org/wiki/Hexagon) cylinder. That gives you 6 or 12
vertical flat surfaces to mount chips on (or just let them stand in slots
on the package). There's one natural crossbar bus at the center,
connecting the 6 "core" chips more rapidly than the edges. Top, bottom
and sides of the package can be used for I/O, power and so on, and each
hexagonal component can be wedged tightly together (instead of today's
north-south east-west architectures you get 2 more dimensions
horizontally). Fill the package with some sort of coolant. Seal it up
tight. Test the module as a whole and ship 'em in palletloads. I'm pretty
sure the heat circulates from the center out naturally, in every
orientation, but what the heck, stick some MEMS fans in there to keep
things pumping along.

That design naturally led to 2 cpu chips and 4 memories. Or 4 cpu chips
and 2 memories. Or 2 cpus, 2 mems and 2 IOs. Before you started coming up
with things to do with the outer 6 sides.

I never thought separating ram from cpu by more than a millimeter was a
good idea. It's quite a jump to envision going from the Cray-1 (115 kW!!!)
down to the size of a nickel! But everybody has a Cray-1 now. They just
run too hot. And are often not suited to task, just like the Cray was.
https://en.wikipedia.org/wiki/Cray-1

Don't know if anyone's ever tried to pattern circuits on a cylinder,
though! We are certainly seeing a lot of multi-package modules now (like
in Epyc), but I'd like 'em to be taller and not need so many darn pins. A
full-blown wifi router on a chip wouldn't need more than... oh... this
many pins:
https://www.amazon.com/Makerfocus-ESP8266-Wireless-Transceiver-Compatible/dp/B01EA3UJJ4/

> Thank you for that. I do have a long standing dream of a single chip
> wifi router, with the lowest SNR possible, and the minimum number of
> pins coming off of it. I'd settle for 32MB of (static?) ram on chip as
> that has proven sufficient to date to drive 802.11n
>
> which would let you get rid of both the L2 and L1 cache. That said, I
> think the cost of 32MB of on-chip static ram remains a bit high, and
> plugging it into a mips cpu, kind of silly. Someday there will be a
> case for just doing everything on a single chip, but...

--
Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740
Re: [Bloat] known buffer sizes on switches
That would be really cool: I loved the MIPS we had at YorkU.ca.

--dave

On 2018-11-28 2:02 p.m., Dave Taht wrote:
> I really don't know a whole heck of a lot about where mips is going.
> Certainly they remain strong in the embedded market (I do like the
> edgerouter X a lot), but as for their current direction or future
> product lines, not a clue.
>
> I used to know someone over there, maybe he's restored new directions.
> Last I recall he was busy obsoleting a whole lot of instruction space
> in order to make room for "new stuff". He'd even asked me if adding an
> invsqrt to the instruction set would help, and I sadly replied that
> that bit of codel was totally invisible on a trace.
>
> I really like(d) mips. Ton of registers, better instruction set than
> arm (IMHO), no foolish processor extensions.

--
David Collier-Brown,         | Always do right. This will gratify
System Programmer and Author | some people and astonish the rest
dav...@spamcop.net           |                      -- Mark Twain
Re: [Bloat] known buffer sizes on switches
I really don't know a whole heck of a lot about where mips is going.
Certainly they remain strong in the embedded market (I do like the
edgerouter X a lot), but as for their current direction or future product
lines, not a clue.

I used to know someone over there, maybe he's restored new directions.
Last I recall he was busy obsoleting a whole lot of instruction space in
order to make room for "new stuff". He'd even asked me if adding an
invsqrt to the instruction set would help, and I sadly replied that that
bit of codel was totally invisible on a trace.

I really like(d) mips. Ton of registers, better instruction set than arm
(IMHO), no foolish processor extensions.

On Wed, Nov 28, 2018 at 10:26 AM David Collier-Brown wrote:
>
> On 2018-11-28 11:55 a.m., Dave Taht wrote:
>
>> Thank you for that. I do have a long standing dream of a single chip
>> wifi router, with the lowest SNR possible, and the minimum number of
>> pins coming off of it. I'd settle for 32MB of (static?) ram on chip as
>> that has proven sufficient to date to drive 802.11n
>>
>> which would let you get rid of both the L2 and L1 cache. That said, I
>> think the cost of 32MB of on-chip static ram remains a bit high, and
>> plugging it into a mips cpu, kind of silly. Someday there will be a
>> case for just doing everything on a single chip, but...
>
> I could see 32MB or more of fast memory on-chip as being attractive
> when one is fighting with diminishing returns in CPU speed and program
> parallelizability.
>
> In the past that might have excited MIPS, but these days less so. Maybe
> ARM? IBM?
>
> --dave

--
Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740
Re: [Bloat] known buffer sizes on switches
On 2018-11-28 11:55 a.m., Dave Taht wrote:
> Thank you for that. I do have a long standing dream of a single chip
> wifi router, with the lowest SNR possible, and the minimum number of
> pins coming off of it. I'd settle for 32MB of (static?) ram on chip as
> that has proven sufficient to date to drive 802.11n
>
> which would let you get rid of both the L2 and L1 cache. That said, I
> think the cost of 32MB of on-chip static ram remains a bit high, and
> plugging it into a mips cpu, kind of silly. Someday there will be a
> case for just doing everything on a single chip, but...

I could see 32MB or more of fast memory on-chip as being attractive when
one is fighting with diminishing returns in CPU speed and program
parallelizability.

In the past that might have excited MIPS, but these days less so. Maybe
ARM? IBM?

--dave

--
David Collier-Brown,         | Always do right. This will gratify
System Programmer and Author | some people and astonish the rest
dav...@spamcop.net           |                      -- Mark Twain
Re: [Bloat] known buffer sizes on switches
Bruno George Moraes writes:

> Nice resource, thanks.
>
> If someone wonders why things look the way they do, it's all about
> on-die and off-die memory. Either you use off-die or on-die memory,
> often SRAM, which requires 6 gates per bit. So spending half a billion
> gates gives you ~10MB buffer on-die. If you're doing off-die memory
> (DRAM or similar) then you'll get the gigabytes of memory seen in some
> equipment. There basically is nothing in between. As soon as you go
> off-die you might as well put at least 2-6 GB in there.
>
> There is some research on new memory devices with unexpected results...
> https://ieeexplore.ieee.org/document/8533260
>
> The HMC memory allows improvements in execution time and consumed
> energy. In some situations, this memory type permits removing the
> L2 cache from the memory hierarchy.
>
> HMC parts start at 2GB

Thank you for that. I do have a long standing dream of a single chip wifi
router, with the lowest SNR possible, and the minimum number of pins
coming off of it. I'd settle for 32MB of (static?) ram on chip, as that
has proven sufficient to date to drive 802.11n,

which would let you get rid of both the L2 and L1 cache. That said, I
think the cost of 32MB of on-chip static ram remains a bit high, and
plugging it into a mips cpu, kind of silly. Someday there will be a case
for just doing everything on a single chip, but...
[Bloat] known buffer sizes on switches
> > Nice resource, thanks. > > If someone wonders why things look the way they do, it's all about > on-die and off-die memory. Either you use off-die or on-die memory, often > SRAM, which requires 6 gates per bit. So spending half a billion gates > gives you ~10MB buffer on-die. If you're doing off-die memory (DRAM or > similar) then you'll get the gigabytes of memory seen in some equipment. > There is basically nothing in between. As soon as you go off-die you might > as well put at least 2-6 GB in there. > > There is some research on new memory devices with unexpected results... https://ieeexplore.ieee.org/document/8533260 The HMC memory allows improvements in execution time and consumed energy. > In some situations, this memory type permits removing the L2 cache from the > memory hierarchy. > HMC parts start at 2GB
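The on-die/off-die gate arithmetic in the message above can be sanity-checked with a quick sketch (assuming, per the message, ~6 gates per SRAM bit):

```python
# Back-of-the-envelope check of the figures quoted above: an SRAM cell
# costs ~6 gates per bit, so ~half a billion gates buys roughly 10 MB on-die.
GATES_PER_BIT = 6                     # 6T SRAM cell, per the message
buffer_bytes = 10 * 1024 * 1024       # ~10 MB on-die buffer
gates = buffer_bytes * 8 * GATES_PER_BIT
print(f"{gates / 1e9:.2f} billion gates")   # → 0.50 billion gates
```

That matches the "half a billion gates for ~10MB" figure, and makes clear why the next step up is off-die DRAM measured in gigabytes.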
Re: [Bloat] when does the CoDel part of fq_codel help in the real world?
Hi Luca, On 28.11.18 at 11:48, Luca Muscariello wrote: > And for BBR, I would say that one thing is the design principles, another > is the implementations, and we had better distinguish between them. The key design principles are > all valid. While the goal is certainly right to operate around the optimal point where the buffer is nearly empty, BBR's model is only valid from either the viewpoint of the bottleneck or that of a single sender. In BBR, one of the key design principles is to observe the achieved delivery rate. One assumption in BBRv1 is that if the delivery rate can still be increased, then the bottleneck isn't saturated. This doesn't necessarily hold if you have multiple BBR flows present at the bottleneck: every BBR flow can (nearly always) increase its delivery rate while probing, since it will simply decrease the other flows' shares. This is not an _implementation_ issue of BBRv1 and has been explained in section III of our BBR evaluation paper. That section also shows that BBRv1 will (by design) increase its amount of inflight data to the maximum of 2 * estimated_BDP if multiple flows are present. A BBR sender could also use packet loss or RTT increase as indicators that it is probably operating past the optimal point, but this is not done in BBRv1. BBRv2 will thus be an improvement over BBRv1 in several ways. Regards, Roland
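For concreteness, a small sketch of the quantities Roland refers to; the bandwidth and RTT values are assumed examples, not numbers from the thread:

```python
# BDP = bottleneck bandwidth * min RTT. Per the discussion above, BBRv1
# caps its inflight data at 2 * estimated_BDP, which is where the standing
# queue comes from when multiple BBR flows share a bottleneck.
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product in bytes."""
    return bandwidth_bps / 8 * rtt_s

bw, rtt = 100e6, 0.040    # assumed: 100 Mbit/s bottleneck, 40 ms min RTT
bdp = bdp_bytes(bw, rtt)
print(f"BDP = {bdp/1e3:.0f} kB; BBRv1 inflight cap = {2*bdp/1e3:.0f} kB")
# → BDP = 500 kB; BBRv1 inflight cap = 1000 kB
```

With N BBR flows each willing to hold up to twice its estimated BDP in flight, the aggregate standing queue at the bottleneck can approach one extra BDP, which is Roland's point about the multi-flow case.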
Re: [Bloat] when does the CoDel part of fq_codel help in the real world?
On Wed, Nov 28, 2018 at 11:40 AM Dave Taht wrote: > On Wed, Nov 28, 2018 at 1:56 AM Luca Muscariello > wrote: > > > > Dave, > > > > The single BDP inflight is a rule of thumb that does not account for > fluctuations of the RTT. > > And I am not talking about random fluctuations and noise. I am talking > about fluctuations > > from a control theoretic point of view to stabilise the system, e.g. the > trajectory of the system variable that > > gets to the optimal point no matter the initial conditions (Lyapunov). > > I have been trying all day to summon the gumption to make this argument: > > IF you have a good idea of the actual RTT... > > it is also nearly certain that there will be *at least* one other flow > you will be competing with... > therefore the fluctuations from every point of view are dominated by > the interaction between these flows and > the goal, in general, is not to take up a full BDP for your single flow. > > And BBR aims for some tiny percentage less than what it thinks it can > get, when, well, everybody's seen it battle it out with itself and > with cubic. I hand it FQ at the bottleneck link and it works well. > > single flows exist only in the minds of theorists and labs. > > There's a relevant passage worth citing in the Kleinrock paper, I > thought (did he write two recently?) that talked about this problem... > I *swear* when I first read it it had a deeper discussion of the > second sentence below and had two paragraphs that went into the issues > with multiple flows: > > "ch earlier and led to the Flow Deviation algorithm [28]. 17 The > reason that the early work of 40 years ago took so long to make its > current impact is because in [31] it was shown that the mechanism > presented in [2] and [3] could not be implemented in a decentralized > algorithm. 
This delayed the application of Power until the recent work > by the Google team in [1] demonstrated that the key elements of > response time and bandwidth could indeed be estimated using a > distributed control loop sliding window spanning approximately 10 > round-trip times." > > but I can't find it today. > > Here it is https://www.lk.cs.ucla.edu/data/files/Kleinrock/Internet%20Congestion%20Control%20Using%20the%20Power%20Metric-Keep%20the%20Pipe%20Just%20Full%2C%20But%20No%20Fuller%20July%202018.pdf > > The ACM queue paper talking about Codel gives a fairly intuitive and > accessible explanation of that. > > I haven't re-read the lola paper. I just wanted to make the assertion > above. And then duck. :) > > Also, when I last looked at BBR, it made a false assumption that 200ms > was "long enough" to probe the actual RTT, when my comcast links and > others are measured at 680ms+ of buffering. > This is essentially the same paper I cited, which is Part I. > > And I always liked the stanford work, here, which tried to assert that > a link with n flows requires no more than B = (RTT × C) / √n. > > http://yuba.stanford.edu/techreports/TR04-HPNG-060800.pdf That paper does not say that that rule ALWAYS applies. It does under certain conditions. But my point is about optimality. It does NOT mean that the system HAS to work ALWAYS at that point, because things change. And for BBR, I would say that one thing is the design principles, another is the implementations, and we had better distinguish between them. The key design principles are all valid. > > > night! > > night ;)
DOI= > http://dx.doi.org/10.1145/1070873.1070884 > > http://klamath.stanford.edu/~nickm/papers/BufferSizing.pdf.pdf > > > > and > > > > Gaurav Raina, Don Towsley, and Damon Wischik. 2005. > > Part II: control theory for buffer sizing. > > SIGCOMM Comput. Commun. Rev. 35, 3 (July 2005), 79-82. > > DOI=http://dx.doi.org/10.1145/1070873.1070885 > > http://www.statslab.cam.ac.uk/~gr224/PAPERS/Control_Theory_Buffers.pdf > > > > One of the things that Frank Kelly has brought to the literature is about > optimal control. > > From a pure optimization point of view we know since Robert Gallager > (and Bertsekas, 1981) that > > the optimal sending rate is a function of the shadow price at the > bottleneck. > > This shadow price is nothing more than the Lagrange multiplier of the > capacity constraint > > at the bottleneck. Some protocols such as XCP or RCP propose to carry > something > > very close to a shadow price in the ECN but that's not that simple. > > And currently we have a 0/1 "shadow price", which is way insufficient. > > > > Optimal control as developed by Frank Kelly since 1998 tells you that > you have > > a stability region that is needed to get to the optimum. > > > > Wischik's work, IMO, helps quite a lot to understand tradeoffs while > designing AQM > > and CC. I
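The Stanford rule cited in the exchange above, B = (RTT × C) / √n, can be sketched numerically; the link speed and RTT below are assumed example values:

```python
import math

# Stanford small-buffers rule: with n desynchronized long-lived flows,
# the buffer needed for full utilization shrinks as B = (RTT * C) / sqrt(n).
def stanford_buffer_bytes(rtt_s: float, capacity_bps: float, n_flows: int) -> float:
    return (rtt_s * capacity_bps / 8) / math.sqrt(n_flows)

for n in (1, 100, 10000):             # assumed: 10 Gbit/s link, 100 ms RTT
    b = stanford_buffer_bytes(0.100, 10e9, n)
    print(f"n = {n:5d}: buffer ≈ {b/1e6:6.2f} MB")
# n=1 recovers the classic one-BDP rule (125 MB); n=10000 needs only 1.25 MB
```

This illustrates Luca's caveat: the √n result holds only under its stated conditions (many long-lived, desynchronized flows), not always.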
Re: [Bloat] when does the CoDel part of fq_codel help in the real world?
On Wed, Nov 28, 2018 at 1:56 AM Luca Muscariello wrote: > > Dave, > > The single BDP inflight is a rule of thumb that does not account for > fluctuations of the RTT. > And I am not talking about random fluctuations and noise. I am talking about > fluctuations > from a control theoretic point of view to stabilise the system, e.g. the > trajectory of the system variable that > gets to the optimal point no matter the initial conditions (Lyapunov). I have been trying all day to summon the gumption to make this argument: IF you have a good idea of the actual RTT... it is also nearly certain that there will be *at least* one other flow you will be competing with... therefore the fluctuations from every point of view are dominated by the interaction between these flows and the goal, in general, is not to take up a full BDP for your single flow. And BBR aims for some tiny percentage less than what it thinks it can get, when, well, everybody's seen it battle it out with itself and with cubic. I hand it FQ at the bottleneck link and it works well. single flows exist only in the minds of theorists and labs. There's a relevant passage worth citing in the Kleinrock paper, I thought (did he write two recently?) that talked about this problem... I *swear* when I first read it it had a deeper discussion of the second sentence below and had two paragraphs that went into the issues with multiple flows: "ch earlier and led to the Flow Deviation algorithm [28]. 17 The reason that the early work of 40 years ago took so long to make its current impact is because in [31] it was shown that the mechanism presented in [2] and [3] could not be implemented in a decentralized algorithm. This delayed the application of Power until the recent work by the Google team in [1] demonstrated that the key elements of response time and bandwidth could indeed be estimated using a distributed control loop sliding window spanning approximately 10 round-trip times." but I can't find it today. 
> The ACM queue paper talking about Codel gives a fairly intuitive and > accessible explanation of that. I haven't re-read the lola paper. I just wanted to make the assertion above. And then duck. :) Also, when I last looked at BBR, it made a false assumption that 200ms was "long enough" to probe the actual RTT, when my comcast links and others are measured at 680ms+ of buffering. And I always liked the stanford work, here, which tried to assert that a link with n flows requires no more than B = (RTT × C) / √n. http://yuba.stanford.edu/techreports/TR04-HPNG-060800.pdf night! > There is a less accessible literature talking about that, which dates back to > some time ago > that may be useful to re-read again > > Damon Wischik and Nick McKeown. 2005. > Part I: buffer sizes for core routers. > SIGCOMM Comput. Commun. Rev. 35, 3 (July 2005), 75-78. > DOI=http://dx.doi.org/10.1145/1070873.1070884 > http://klamath.stanford.edu/~nickm/papers/BufferSizing.pdf.pdf > > and > > Gaurav Raina, Don Towsley, and Damon Wischik. 2005. > Part II: control theory for buffer sizing. > SIGCOMM Comput. Commun. Rev. 35, 3 (July 2005), 79-82. > DOI=http://dx.doi.org/10.1145/1070873.1070885 > http://www.statslab.cam.ac.uk/~gr224/PAPERS/Control_Theory_Buffers.pdf > > One of the things that Frank Kelly has brought to the literature is about > optimal control. > From a pure optimization point of view we know since Robert Gallager (and > Bertsekas, 1981) that > the optimal sending rate is a function of the shadow price at the bottleneck. > This shadow price is nothing more than the Lagrange multiplier of the > capacity constraint > at the bottleneck. Some protocols such as XCP or RCP propose to carry > something > very close to a shadow price in the ECN but that's not that simple. > And currently we have a 0/1 "shadow price", which is way insufficient. > > Optimal control as developed by Frank Kelly since 1998 tells you that you have > a stability region that is needed to get to the optimum. 
> > Wischik's work, IMO, helps quite a lot to understand tradeoffs while designing > AQM > and CC. I feel like the people who wrote the codel ACM Queue paper are very > much aware of this literature, > because Codel design principles seem to take that into account. > And the BBR paper too. > > > On Tue, Nov 27, 2018 at 9:58 PM Dave Taht wrote: >> >> OK, wow, this conversation got long. and I'm still 20 messages behind. >> >> Two points, and I'm going to go back to work, and maybe I'll try to >> summarize a table >> of the competing viewpoints, as there's far more than BDP of >> discussion here, and what >> we need is sqrt(bdp) to deal with all the different conversational flows. :) >> >> On Tue, Nov 27, 2018 at 1:24 AM Luca Muscariello >> wrote: >> > >> > I think that this is a very good comment to the discussion at the defense >> > about the comparison between >> > SFQ with longest queue drop and FQ_Codel. >> > >> > A congestion controlled protocol such as TCP or others,
Re: [Bloat] when does the CoDel part of fq_codel help in the real world?
Dave, The single BDP inflight is a rule of thumb that does not account for fluctuations of the RTT. And I am not talking about random fluctuations and noise. I am talking about fluctuations from a control theoretic point of view to stabilise the system, e.g. the trajectory of the system variable that gets to the optimal point no matter the initial conditions (Lyapunov). The ACM queue paper talking about Codel gives a fairly intuitive and accessible explanation of that. There is a less accessible literature talking about that, which dates back to some time ago that may be useful to re-read again Damon Wischik and Nick McKeown. 2005. Part I: buffer sizes for core routers. SIGCOMM Comput. Commun. Rev. 35, 3 (July 2005), 75-78. DOI=http://dx.doi.org/10.1145/1070873.1070884 http://klamath.stanford.edu/~nickm/papers/BufferSizing.pdf.pdf and Gaurav Raina, Don Towsley, and Damon Wischik. 2005. Part II: control theory for buffer sizing. SIGCOMM Comput. Commun. Rev. 35, 3 (July 2005), 79-82. DOI=http://dx.doi.org/10.1145/1070873.1070885 http://www.statslab.cam.ac.uk/~gr224/PAPERS/Control_Theory_Buffers.pdf One of the things that Frank Kelly has brought to the literature is about optimal control. From a pure optimization point of view we know since Robert Gallager (and Bertsekas, 1981) that the optimal sending rate is a function of the shadow price at the bottleneck. This shadow price is nothing more than the Lagrange multiplier of the capacity constraint at the bottleneck. Some protocols such as XCP or RCP propose to carry something very close to a shadow price in the ECN but that's not that simple. And currently we have a 0/1 "shadow price", which is way insufficient. Optimal control as developed by Frank Kelly since 1998 tells you that you have a stability region that is needed to get to the optimum. Wischik's work, IMO, helps quite a lot to understand tradeoffs while designing AQM and CC. 
I feel like the people who wrote the codel ACM Queue paper are very much aware of this literature, because Codel design principles seem to take that into account. And the BBR paper too. On Tue, Nov 27, 2018 at 9:58 PM Dave Taht wrote: > OK, wow, this conversation got long. and I'm still 20 messages behind. > > Two points, and I'm going to go back to work, and maybe I'll try to > summarize a table > of the competing viewpoints, as there's far more than BDP of > discussion here, and what > we need is sqrt(bdp) to deal with all the different conversational flows. > :) > > On Tue, Nov 27, 2018 at 1:24 AM Luca Muscariello > wrote: > > > > I think that this is a very good comment to the discussion at the > defense about the comparison between > > SFQ with longest queue drop and FQ_Codel. > > > > A congestion controlled protocol such as TCP or others, including QUIC, > LEDBAT and so on > > need at least the BDP in the transmission queue to get full link > efficiency, i.e. the queue never empties out. > > no, I think it needs a BDP in flight. > > I think some of the confusion here is that your TCP stack needs to > keep around a BDP in order to deal with > retransmits, but that lives in another set of buffers entirely. > > > This gives rules of thumb to size buffers, which is also very practical > and thanks to flow isolation becomes very accurate. > > > > Which is: > > > > 1) find a way to keep the number of backlogged flows at a reasonable > value. > > This largely depends on the minimum fair rate an application may need in > the long term. > > We discussed a little bit of available mechanisms to achieve that in the > literature. > > > > 2) fix the largest RTT you want to serve at full utilization and size > the buffer using BDP * N_backlogged. > > Or the other way round: check how much memory you can use > > in the router/line card/device and for a fixed N, compute the largest > RTT you can serve at full utilization. 
> > My own take on the whole BDP argument is that *so long as the flows in > that BDP are thoroughly mixed* you win. > > > > > 3) there is still some memory to dimension for sparse flows in addition > to that, but this is not based on BDP. > > It is just enough to compute the total utilization of sparse flows and > use the same simple model Toke has used > > to compute the (de)prioritization probability. > > > > This procedure would allow to size FQ_codel but also SFQ. > > It would be interesting to compare the two under this buffer sizing. > > It would also be interesting to compare another mechanism that we have > mentioned during the defense > > which is AFD + a sparse flow queue. Which is, BTW, already available in > Cisco nexus switches for data centres. > > > > I think that the codel part would still provide the ECN feature, > which all the others cannot have. > > However the others, the last one especially, can be implemented in > silicon with reasonable cost. > > > > > > > > > > > > On Mon 26 Nov 2018 at 22:30, Jonathan Morton > wrote: > >> > >> > On 26 Nov, 2018, at 9:08 pm, Pete Heist wrote: > >> > >
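Taking step 2 of the sizing procedure quoted above literally (buffer = BDP × N_backlogged, or inverted: given a memory budget and a fixed N, find the largest RTT served at full utilization), here is a hypothetical sketch; the memory and link-rate numbers are assumed, not from the thread:

```python
# Step 2 of the procedure above, inverted: for a memory budget and a fixed
# number N of backlogged flows, the largest RTT served at full utilization
# follows from buffer = BDP * N = (C * RTT) * N, read literally.
def largest_rtt_s(memory_bytes: float, capacity_bps: float, n_backlogged: int) -> float:
    return memory_bytes / n_backlogged / (capacity_bps / 8)

mem = 32 * 1024 * 1024    # assumed: 32 MB of buffer memory
cap = 1e9                 # assumed: 1 Gbit/s line rate
print(f"largest RTT ≈ {largest_rtt_s(mem, cap, 10) * 1e3:.1f} ms at N=10")
# → largest RTT ≈ 26.8 ms at N=10
```

The memory for sparse flows in step 3 would come on top of this and, per the message, is not BDP-based.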