Re: [Bloat] when does the CoDel part of fq_codel help in the real world?

2018-11-28 Thread Dave Taht
On Wed, Nov 28, 2018 at 11:45 PM Jonathan Morton  wrote:
>
> > On 29 Nov, 2018, at 9:39 am, Dave Taht  wrote:
> >
> > …when it is nearly certain that more than one flow exists, means aiming
> > for the BDP in a single flow is generally foolish.
>
> It might be more accurate to say that the BDP of the fair-share of the path 
> is the cwnd to aim for.  Plus epsilon for probing.

OK, much better, thanks.
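A back-of-envelope sketch of that fair-share target, purely illustrative: the numbers are made up and `cwnd_target` is a hypothetical helper, not code from any real TCP stack.

```python
# Back-of-envelope sketch only: made-up numbers, hypothetical helper,
# not code from any real TCP stack.

def cwnd_target(bottleneck_bps, rtt_s, n_flows, mss=1448, epsilon=0.05):
    """cwnd (in segments) aiming for the BDP of the fair share, plus epsilon."""
    fair_share_bps = bottleneck_bps / n_flows
    bdp_bytes = fair_share_bps * rtt_s / 8          # bits -> bytes
    segments = bdp_bytes / mss
    return max(4, int(segments * (1 + epsilon)))    # floor of 4 segments

# 100 Mbit/s bottleneck, 20 ms RTT, 4 competing flows -> 45 segments
print(cwnd_target(100e6, 0.020, 4))
```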

>  - Jonathan Morton



-- 

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740
___
Bloat mailing list
Bloat@lists.bufferbloat.net
https://lists.bufferbloat.net/listinfo/bloat


Re: [Bloat] incremental deployment, transport and L4S (Re: when does the CoDel part of fq_codel help in the real world?)

2018-11-28 Thread Mikael Abrahamsson

On Thu, 29 Nov 2018, Jonathan Morton wrote:


> You are essentially proposing using ECT(1) to take over an intended function of 
> Diffserv.


Well, I am not proposing anything. I am giving people a heads-up that the 
L4S authors are proposing this.


But yes, you're right. Diffserv has shown itself to be really hard to 
incrementally deploy across the Internet, so it's generally bleached 
mid-path.


> In my view, that is the wrong approach.  Better to improve Diffserv to 
> the point where it becomes useful in practice.


I agree, but unfortunately nobody has made me king of the Internet yet, so 
I can't just decree it into existence.


> Cake has taken steps in that direction, by implementing some reasonable 
> interpretation of some Diffserv codepoints.


Great. I don't know if I've asked this before, but is CAKE easily implementable 
in hardware? From what I can tell it's still only Marvell that is trying to 
put CPUs fast enough into HGWs to do forwarding in CPU (which 
can do CAKE); all the others still rely on packet accelerators to achieve the 
desired speeds.


> My alternative use of ECT(1) is more in keeping with the other 
> codepoints represented by those two bits, to allow ECN to provide more 
> fine-grained information about congestion than it presently does.  The 
> main challenge is communicating the relevant information back to the 
> sender upon receipt, ideally without increasing overhead in the TCP/IP 
> headers.


You need to go into the IETF process and voice this opinion then, because 
if nobody objects soon, ECT(1) might get the L4S 
interpretation. They do have ECN feedback mechanisms 
in their proposal; have you read it? It's a whole suite of documents: 
architecture, AQM proposal, transport proposal, the entire thing.


On the other hand, what you want to do and what L4S tries to do might be 
closely related. It doesn't sound too far off.


Also, Bob Briscoe works for CableLabs now, so he will have silicon 
behind him. This silicon might go into other things, not just DOCSIS 
equipment, so if you have use-cases that L4S doesn't cover but might with 
minor modification, it might be better to join him than to fight him.


--
Mikael Abrahamsson          email: swm...@swm.pp.se


Re: [Bloat] when does the CoDel part of fq_codel help in the real world?

2018-11-28 Thread Jonathan Morton
> On 29 Nov, 2018, at 9:39 am, Dave Taht  wrote:
> 
> …when it is nearly certain that more than one flow exists, means aiming
> for the BDP in a single flow is generally foolish.

It might be more accurate to say that the BDP of the fair-share of the path is 
the cwnd to aim for.  Plus epsilon for probing.

 - Jonathan Morton



Re: [Bloat] when does the CoDel part of fq_codel help in the real world?

2018-11-28 Thread Dave Taht
Mikael Abrahamsson  writes:

> On Tue, 27 Nov 2018, Luca Muscariello wrote:
>
>> link fully utilized is defined as Q>0 unless you don't include the
>> packet currently being transmitted. I do, so the transmitter is never
>> idle. But that's a detail.
>
> As someone who works with moving packets, it's perplexing to me to
> interact with transport peeps who seem enormously focused on
> "goodput". My personal opinion is that most people would be better off
> with 80% of their available bandwidth being in use without any
> noticeable buffer-induced delay, as opposed to the transport protocol
> doing its damnedest to fill up the link to 100% and sometimes failing
> and inducing delay instead.

+1

I came up with a new analogy today.

Some people really like to build dragsters - cars that go fast but might
explode at the end of the strip - or even during the race!

I like to build churches - buildings that will stand for a thousand years.

You can reason about stable, deterministic systems, and build other
beautiful structures on top of them. I have faith in churches, not
dragsters.

>
> Could someone perhaps comment on the thinking in the transport
> protocol design "crowd" when it comes to this?


Re: [Bloat] incremental deployment, transport and L4S (Re: when does the CoDel part of fq_codel help in the real world?)

2018-11-28 Thread Jonathan Morton
> On 29 Nov, 2018, at 9:28 am, Mikael Abrahamsson  wrote:
> 
> This is one thing about L4S: ECT(1) is the last "codepoint" in the header not 
> used that can statelessly identify something. If anyone sees a better way to 
> use it compared to "let's put it in a separate queue and CE-mark it 
> aggressively at very low queue depths and also do not care about re-ordering 
> so an ARQ L2 can re-order all it wants", then they need to speak up, soon.

You are essentially proposing using ECT(1) to take over an intended function of 
Diffserv.  In my view, that is the wrong approach.  Better to improve Diffserv 
to the point where it becomes useful in practice.  Cake has taken steps in that 
direction, by implementing some reasonable interpretation of some Diffserv 
codepoints.

My alternative use of ECT(1) is more in keeping with the other codepoints 
represented by those two bits, to allow ECN to provide more fine-grained 
information about congestion than it presently does.  The main challenge is 
communicating the relevant information back to the sender upon receipt, ideally 
without increasing overhead in the TCP/IP headers.
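For reference, the two ECN bits admit four codepoints; a minimal sketch (the constant names are mine, the bit values are from RFC 3168):

```python
# The four ECN codepoints occupy the low two bits of the IP TOS byte
# (RFC 3168). ECT(1) is the one without a firmly established distinct
# meaning, which is why both L4S and finer-grained congestion
# signalling want to claim it. Constant names are illustrative.
NOT_ECT = 0b00  # not ECN-capable transport
ECT_1   = 0b01  # ECN-capable transport (the contested codepoint)
ECT_0   = 0b10  # ECN-capable transport (the commonly used one)
CE      = 0b11  # congestion experienced

def ecn_field(tos_byte):
    """Extract the ECN field from a TOS/traffic-class byte."""
    return tos_byte & 0b11

assert ecn_field(0x02) == ECT_0
assert ecn_field(0x03) == CE
```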

 - Jonathan Morton



Re: [Bloat] when does the CoDel part of fq_codel help in the real world?

2018-11-28 Thread Dave Taht
"Bless, Roland (TM)"  writes:

> Hi Luca,
>
> On 27.11.18 at 10:24, Luca Muscariello wrote:
>> A congestion controlled protocol such as TCP or others, including QUIC,
>> LEDBAT and so on
>> need at least the BDP in the transmission queue to get full link
>> efficiency, i.e. the queue never empties out.
>
> This is not true. There are congestion control algorithms
> (e.g., TCP LoLa [1] or BBRv2) that can fully utilize the bottleneck link
> capacity without filling the buffer to its maximum capacity. The BDP

Just to stay cynical, I would rather like the BBR and LoLa folk to look
closely at asymmetric networks, ack-path delay, and rates lower than
1 Gbit. And what the heck... wifi. :)

BBRv1, for example, is hard-coded to reduce cwnd to 4 and no lower - because
that works in the data center. LoLa, so far as I know, achieves its
tested results at 1-10 Gbit/s. My world, and much of the rest of the world,
barely gets to a gigabit, on a good day, with a tail-wind.

If either of these TCPs could be tuned to work well and not saturate
5Mbit links I would be a happier person. RRUL benchmarks anyone?

I did, honestly, want to run LoLa (the codebase was broken), and I am
patiently waiting for BBRv2 to escape (while hoping that the googlers
actually run some flent tests at edge bandwidths before I tear into it).

Personally, I'd settle for SFQ on the CMTSes, fq_codel on the home
routers, and then let the tcp-ers decide how much delay and loss they
can tolerate.

Another thought... I mean... can't we all just agree to make cubic
more gentle and go fix that, and not have a flag day? "From Linux 5.0
forward, cubic shall:

Stop increasing its window at 250 ms of delay greater than
the initial RTT?

Occasionally RTT-probe a bit, more like BBR?"


> rule of thumb basically stems from the older loss-based congestion
> control variants that profit from the standing queue that they built
> over time when they detect a loss:
> while they back-off and stop sending, the queue keeps the bottleneck
> output busy and you'll not see underutilization of the link. Moreover,
> once you get good loss de-synchronization, the buffer size requirement
> for multiple long-lived flows decreases.
>
>> This gives rule of thumbs to size buffers which is also very practical
>> and thanks to flow isolation becomes very accurate.
>
> The positive effect of buffers is merely their role to absorb
> short-term bursts (i.e., mismatch in arrival and departure rates)
> instead of dropping packets. One does not need a big buffer to
> fully utilize a link (with perfect knowledge you can keep the link
> saturated even without a single packet waiting in the buffer).
> Furthermore, large buffers (e.g., using the BDP rule of thumb)
> are not useful/practical anymore at very high speed such as 100 Gbit/s:
> memory is also quite costly at such high speeds...
>
> Regards,
>  Roland
>
> [1] M. Hock, F. Neumeister, M. Zitterbart, R. Bless.
> TCP LoLa: Congestion Control for Low Latencies and High Throughput.
> Local Computer Networks (LCN), 2017 IEEE 42nd Conference on, pp.
> 215-218, Singapore, Singapore, October 2017
> http://doc.tm.kit.edu/2017-LCN-lola-paper-authors-copy.pdf


This whole thread, although divisive... well, I'd really like everybody
to get together and try to write a joint paper on the best things to do,
worldwide, to make bufferbloat go away.

>> Which is: 
>> 
>> 1) find a way to keep the number of backlogged flows at a reasonable value. 
>> This largely depends on the minimum fair rate an application may need in
>> the long term.
>> We discussed a little bit of available mechanisms to achieve that in the
>> literature.
>> 
>> 2) fix the largest RTT you want to serve at full utilization and size
>> the buffer using BDP * N_backlogged.  
>> Or the other way round: check how much memory you can use 
>> in the router/line card/device and for a fixed N, compute the largest
>> RTT you can serve at full utilization. 
>> 
>> 3) there is still some memory to dimension for sparse flows in addition
>> to that, but this is not based on BDP. 
>> It is just enough to compute the total utilization of sparse flows and
>> use the same simple model Toke has used 
>> to compute the (de)prioritization probability.
>> 
>> This procedure would allow sizing not only fq_codel but also SFQ.
>> It would be interesting to compare the two under this buffer sizing. 
>> It would also be interesting to compare another mechanism that we have
>> mentioned during the defense
>> which is AFD + a sparse flow queue. Which is, BTW, already available in
>> Cisco nexus switches for data centres.
>> 
>> I think that the codel part would still provide the ECN feature,
>> that all the others cannot have.
>> However the others, the last one especially can be implemented in
>> silicon with reasonable cost.
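Taking the quoted sizing rule at face value, a sketch; the function names and numbers are illustrative only, not recommendations.

```python
# Illustrative sketch of the sizing procedure quoted above, reading
# "BDP * N_backlogged" literally: fix the largest RTT to serve at full
# utilization and a target number of backlogged flows, then size the
# buffer. All names and numbers are examples, not recommendations.

def buffer_bytes(link_bps, max_rtt_s, n_backlogged):
    bdp = link_bps * max_rtt_s / 8      # link BDP in bytes
    return bdp * n_backlogged

# The other way round: given a memory budget and N, the largest RTT
# served at full utilization.
def max_rtt(link_bps, mem_bytes, n_backlogged):
    return mem_bytes * 8 / (link_bps * n_backlogged)

# 100 Mbit/s link, 200 ms worst-case RTT, 4 backlogged flows -> 10 MB
print(buffer_bytes(100e6, 0.200, 4) / 1e6, "MB")
```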

[Bloat] incremental deployment, transport and L4S (Re: when does the CoDel part of fq_codel help in the real world?)

2018-11-28 Thread Mikael Abrahamsson

On Wed, 28 Nov 2018, Dave Taht wrote:


> see ecn-sane. Please try to write a position paper as to where and why
> ecn is good and bad.
>
> if one day we could merely establish a talmud of commentary
> around this religion it would help.


From my viewpoint it seems to be all about incremental deployment. We have 
30 years of "crud" that things need to work with, and the worst case must 
not be a disaster for anything that wants to deploy.


This is one thing about L4S: ECT(1) is the last "codepoint" in the header 
not used that can statelessly identify something. If anyone sees a better 
way to use it compared to "let's put it in a separate queue and CE-mark it 
aggressively at very low queue depths and also do not care about 
re-ordering so an ARQ L2 can re-order all it wants", then they need to 
speak up, soon.


I actually think the "let's not care about re-ordering" part would be a 
brilliant thing; it'd help quite a lot of packet-network types become less 
costly and more efficient, while at the same time not blocking 
subsequent packets just because some earlier packet needed to be 
retransmitted. Brilliant for QUIC, for instance, which already handles this 
(at least per-stream).


--
Mikael Abrahamsson          email: swm...@swm.pp.se


Re: [Bloat] when does the CoDel part of fq_codel help in the real world?

2018-11-28 Thread Dave Taht
"Bless, Roland (TM)"  writes:

> Hi Luca,
>
> On 28.11.18 at 11:48, Luca Muscariello wrote:
>
>> And for BBR, I would say that one thing is the design principles another
>> is the implementations
>> and we better distinguish between them. The key design principles are
>> all valid.
>
> While the goal is certainly right to operate around the optimal point
> where the buffer is nearly empty, BBR's model is only valid from either
> the viewpoint of the bottleneck or that of a single sender.

I think I agree with this, from my own experimental data.

>
> In BBR, one of the key design principle is to observe the
> achieved delivery rate. One assumption in BBRv1 is that if the delivery
> rate can still be increased, then the bottleneck isn't saturated. This
> doesn't necessarily hold if you have multiple BBR flows present at the
> bottleneck.
> Every BBR flow can (nearly always) increase its delivery rate while
> probing: it will simply decrease other flows' shares. This is not
> an _implementation_ issue of BBRv1 and has been explained in section III
> of our BBR evaluation paper.

Haven't re-read it yet.

>
> This section shows also that BBRv1 will (by concept) increase its amount
> of inflight data to the maximum of 2 * estimated_BDP if multiple flows
> are present. A BBR sender could also use packet loss or RTT increase as

Carnage!

> indicators that it is probably operating to the right of the optimal
> point, but this is not done in BBRv1.
> BBRv2 will be thus an improvement over BBRv1 in several ways.

I really really really want a sane response to ecn in bbr.

>
> Regards,
>  Roland


Re: [Bloat] when does the CoDel part of fq_codel help in the real world?

2018-11-28 Thread Dave Taht
Luca Muscariello  writes:

> On Wed, Nov 28, 2018 at 11:40 AM Dave Taht 
> wrote:
>
> On Wed, Nov 28, 2018 at 1:56 AM Luca Muscariello
>  wrote:
> >
> > Dave,
> >
> > The single BDP inflight is a rule of thumb that does not account
> for fluctuations of the RTT.
> > And I am not talking about random fluctuations and noise. I am
> talking about fluctuations
> > from a control theoretic point of view to stabilise the system,
> e.g. the trajectory of the system variable that
> > gets to the optimal point no matter the initial conditions
> (Lyapunov).
> 
> I have been trying all day to summon the gumption to make this
> argument:
> 
> IF you have a good idea of the actual RTT...
> 
> it is also nearly certain that there will be *at least* one other
> flow
> you will be competing with...
> therefore the fluctuations from every point of view are dominated
> by
> the interaction between these flows and
> the goal, in general, is not to take up a full BDP for your
> single flow.
> 
> And BBR aims for some tiny percentage less than what it thinks it
> can
> get, when, well, everybody's seen it battle it out with itself and
> with cubic. I hand it FQ at the bottleneck link and it works well.
> 
> single flows exist only in the minds of theorists and labs.
> 
> There's a relevant passage worth citing in the kleinrock paper, I
> thought (did he write two recently?) that talked about this
> problem...
> I *swear* when I first read it it had a deeper discussion of the
> second sentence below and had two paragraphs that went into the
> issues
> with multiple flows:
> 
> "ch earlier and led to the Flow Deviation algorithm [28]. 17 The
> reason that the early work of 40 years ago took so long to make
> its
> current impact is because in [31] it was shown that the mechanism
> presented in [2] and [3] could not be implemented in a
> decentralized
> algorithm. This delayed the application of Power until the recent
> work
> by the Google team in [1] demonstrated that the key elements of
> response time and bandwidth could indeed be estimated using a
> distributed control loop sliding window spanning approximately 10
> round-trip times."
> 
> but I can't find it today.
> 
> 
>
> Here it is
>
> https://www.lk.cs.ucla.edu/data/files/Kleinrock/Internet%20Congestion%20Control%20Using%20the%20Power%20Metric-Keep%20the%20Pipe%20Just%20Full%2C%20But%20No%20Fuller%20July%202018.pdf

Thank you, that is more like what I remember reading. That said, I still
remember a two-paragraph discussion that went into footnote 17, on the
40+ years of history behind all this, that clicked with me about why
we're still going wrong... and I can't remember what it was. I'll go
deeper into the past and read more of the references off of this.



Re: [Bloat] when does the CoDel part of fq_codel help in the real world?

2018-11-28 Thread Dave Taht
Michael Welzl  writes:

> Just a small clarification:
>
> 
> 
> 
> To me the switch to head dropping essentially killed the tail
> loss RTO
> problem, eliminated most of the need for ecn.
>
> 
> 
> I doubt that: TCP will need to retransmit that packet at the head,
> and that takes an RTT - all the packets after it will need to wait
> in the receiver buffer before the application gets them.
> But I don’t have measurements to prove my point, so I’m just
> hand-waving…
>
> I don’t doubt that this kills the tail loss RTO problem.

Yea! I wish we had more data on it, though. We haven't really ever looked
at RTOs in our (enormous) data sets; it's just an assumption that we
don't see them. There are terabytes of captures.

> I doubt that it eliminates the need for ECN.

A specific example that burned me was Stuart's demo showing screen
sharing "just working", with ecn, on what was about a 20ms path.

GREAT demo! A very real result from codel. Ship it! The audience applauded madly.
fq_codel went into OSX earlier this year.

Thing was, there was a 16ms frame rate (at best, probably closer to
64ms), at least a 32ms jitter buffer (probably in the 100s of ms
actually), an encoder that took at least a frame's worth of time...

and having the flow retransmit a lost packet vs ecn - within a 15ms rtt,
with a jitter buffer already there - was utterly invisible to the
application and user alike.

Sooo

see ecn-sane. Please try to write a position paper as to where and why
ecn is good and bad.

if one day we could merely establish a talmud of commentary
around this religion it would help.



Re: [Bloat] [Codel] found another good use for a queue today, possibly

2018-11-28 Thread Dave Taht
Jonathan Morton  writes:

>>> "polylog(n)-wise Independent Hash Function". OK, my google-foo fails
>>> me: The authors use sha1, would something lighter weight suit?
>
>> The current favorite in DPDK land seems to be Cuckoo hashing.
>> It has better cache behavior than typical chaining.
>
> That paper describes an improved variant of cuckoo hashing, using a
> queue to help resolve collisions with better time complexity.  The
> proof relies on (among other things) a particular grade of hash
> function being used.  SHA1 is described as being suitable since it
> offers cryptographic-level performance… We actually need two hashes
> with independent behaviour on the same input, one for each table.
>
> If we were to assume table sizes up to 64K, using both halves of a
> good 32-bit hash might be suitable.  It may be that plain old Jenkins
> hash would work in that context.  Supplement that with a 64-entry
> queue with linear search (in software) or constant-time CAM search (in
> hardware).

I was aiming for 2 million routes.

I gave up trying to wade through it and pinged the authors.

Fiddling with BLAKE at the moment.
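The "two independent hashes on the same input, one per table" scheme described above could be sketched like this; `zlib.crc32` stands in for "a good 32-bit hash" purely for illustration (its halves are not truly independent, as the quoted text notes a real implementation would require).

```python
# Sketch of deriving two table indices from one 32-bit hash, for
# cuckoo tables of up to 64K entries each. zlib.crc32 is only a
# stand-in for "a good 32-bit hash"; its halves are NOT independent
# in the sense the paper's proof requires.
import zlib

TABLE_BITS = 16  # 64K-entry tables

def two_hashes(key: bytes):
    h = zlib.crc32(key) & 0xFFFFFFFF
    h1 = h & ((1 << TABLE_BITS) - 1)           # low half  -> table 1 index
    h2 = (h >> TABLE_BITS) & ((1 << TABLE_BITS) - 1)  # high half -> table 2 index
    return h1, h2

print(two_hashes(b"192.0.2.1/32"))  # two 16-bit indices
```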

>
>  - Jonathan Morton


Re: [Bloat] known buffer sizes on switches

2018-11-28 Thread Dave Taht
On Wed, Nov 28, 2018 at 8:55 AM Dave Taht  wrote:
>
> Bruno George Moraes  writes:
>
> > Nice resource, thanks.
> >
> > If someone wonders why things look the way they do, it's all about
> > on-die and off-die memory. Either you use off-die or on-die memory, often
> > SRAM which requires 6 gates per bit. So spending half a billion gates
> > gives you ~10MB buffer on-die. If you're doing off-die memory (DRAM or
> > similar) then you'll get the gigabytes of memory seen in some equipment.
> > There basically is nothing in between. As soon as you go off-die you might
> > as well put at least 2-6 GB in there.
> >
> > There is some research on new memory devices with unexpected
> > results...
> > https://ieeexplore.ieee.org/document/8533260
> >
> > The HMC memory allows improvements in execution time and consumed
> > energy. In some situations, this memory type permits removing the
> > L2 cache from the memory hierarchy.
> >
> > HMC parts start at 2GB

That effort actually looks pretty promising. I liked the support for
offloaded atomic ops too. There are also so many useful operations
that I'd like to see offloaded to ram - like zeroing memory regions as
one example.

http://www.hybridmemorycube.org/

Will probably run hot. But: grump: I still don't "get" why the
traditional division between memory and cpu makers hasn't collapsed
yet. A package like that
with a cpu *in it*, and we're done. 4GB "ought to be enough for everybody".

Some 27(?) years ago, back when I was attempting to write an SF novel, I
had an idea for a more efficient way to pack cores and memory together.
Basically: shrink the cray-1 design down to about the size of a nickel
(or dime!).

The cray had that rough shape for optimum routing and cooling, but...
the overall shape of the package becomes a hexagonal cylinder
(https://en.wikipedia.org/wiki/Hexagon). That gives you 6 or
12 vertical flat surfaces to mount chips on (or just let them stand in
slots on the package). There's one natural crossbar bus at the center,
connecting the 6 "core" chips more rapidly than the edges. Top, bottom
and sides of the package can be used for I/O, power and so on, and
each hexagonal component can be wedged tightly against its neighbors
(instead of today's north-south/east-west architectures you get two
more dimensions horizontally).

fill the package with some sort of coolant. Seal it up tight. Test the
module as a whole and ship 'em in palletloads. I'm pretty sure the
heat circulates from the center out naturally, in every orientation,
but what the heck, stick some MEMS fans in there to keep things
pumping along.

that design naturally led to 2 cpu chips and 4 memories. Or 4 cpu
chips and 2 memories. Or 2 cpus, 2 mems and 2 IOs. Before you even start
coming up with things to do with the outer 6 sides.

I never thought separating ram from cpu by more than a millimeter
was a good idea.

It's quite a jump to envision going from the cray-1 (115 kW!!!) down
to the size of a nickel!

But everybody has a cray-1 now. They just run too hot. And are often
not suited to task, just like the cray was.

https://en.wikipedia.org/wiki/Cray-1

Don't know if anyone's ever tried to pattern any circuits on a cylinder though!

We are certainly seeing a lot of multi-package modules now (like in
epyc) but I'd like 'em to be taller and not need so many darn pins. A
full-blown wifi router on a chip wouldn't need more than... oh... this
many pins:

https://www.amazon.com/Makerfocus-ESP8266-Wireless-Transceiver-Compatible/dp/B01EA3UJJ4

> Thank you for that. I do have a long standing dream of a single chip
> wifi router, with the lowest SNR possible, and the minimum number of
> pins coming off of it. I'd settle for 32MB of (static?) ram on chip as
> that has proven sufficient to date to drive 802.11n
>
> which would let you get rid of both the L2 and L1 cache. That said, I
> think the cost of 32MB of on-chip static ram remains a bit high, and
> plugging it into a mips cpu, kind of silly. Someday there will be a case
> to just doing everything on a single chip, but...
>





Re: [Bloat] known buffer sizes on switches

2018-11-28 Thread David Collier-Brown

That would be really cool: I loved the Mips we had at YorkU.ca

--dave

On 2018-11-28 2:02 p.m., Dave Taht wrote:

I really don't know a whole heck of a lot about where mips is going.
Certainly they remain strong in the embedded market (I do like the
edgerouter X a lot), but as for their current direction or future
product lines, not a clue.

I used to know someone over there, maybe he's restored new directions.
Last I recall he was busy obsoleting a whole lot of instruction space
in order to make room for "new stuff". He'd even asked me if adding an
invsqrt to the instruction set would help, and I sadly replied that
that bit of codel was totally invisible on a trace.

I really like(d) mips: a ton of registers, a better instruction set than
arm (IMHO), no foolish processor extensions.

On Wed, Nov 28, 2018 at 10:26 AM David Collier-Brown  wrote:

On 2018-11-28 11:55 a.m., Dave Taht wrote:


Thank you for that. I do have a long standing dream of a single chip
wifi router, with the lowest SNR possible, and the minimum number of
pins coming off of it. I'd settle for 32MB of (static?) ram on chip as
that has proven sufficient to date to drive 802.11n

which would let you get rid of both the L2 and L1 cache. That said, I
think the cost of 32MB of on-chip static ram remains a bit high, and
plugging it into a mips cpu, kind of silly. Someday there will be a case
to just doing everything on a single chip, but...

I could see 32MB or more of fast memory on-chip as being attractive when
one is fighting with diminishing returns in CPU speed and program
parallelizability.

In the past that might have excited MIPS, but these days less so. Maybe
ARM? IBM?

--dave

--
David Collier-Brown, | Always do right. This will gratify
System Programmer and Author | some people and astonish the rest
dav...@spamcop.net   |  -- Mark Twain





--
David Collier-Brown, | Always do right. This will gratify
System Programmer and Author | some people and astonish the rest
dav...@spamcop.net   |  -- Mark Twain



Re: [Bloat] known buffer sizes on switches

2018-11-28 Thread Dave Taht
I really don't know a whole heck of a lot about where mips is going.
Certainly they remain strong in the embedded market (I do like the
edgerouter X a lot), but as for their current direction or future
product lines, not a clue.

I used to know someone over there, maybe he's restored new directions.
Last I recall he was busy obsoleting a whole lot of instruction space
in order to make room for "new stuff". He'd even asked me if adding an
invsqrt to the instruction set would help, and I sadly replied that
that bit of codel was totally invisible on a trace.

I really like(d) mips: a ton of registers, a better instruction set than
arm (IMHO), no foolish processor extensions.
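For context, the invsqrt in question is CoDel's control law: successive drops are spaced interval/sqrt(count) apart, so the drop rate ramps up gently while the queue stays above target. A minimal sketch with the default 100 ms interval:

```python
# CoDel's control law: the next drop is scheduled interval/sqrt(count)
# after the previous one, so drops get closer together the longer the
# queue stays above target. Sketch only; interval is the CoDel default.
from math import sqrt

INTERVAL = 0.100  # seconds (CoDel's default interval)

def next_drop(now, count):
    """Time of the next drop, given the current drop count."""
    return now + INTERVAL / sqrt(count)

# Drop spacing shrinks: 100 ms, ~70.7 ms, ~57.7 ms, 50 ms, ...
spacings = [INTERVAL / sqrt(c) for c in (1, 2, 3, 4)]
print(spacings)
```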

On Wed, Nov 28, 2018 at 10:26 AM David Collier-Brown  wrote:
>
> On 2018-11-28 11:55 a.m., Dave Taht wrote:
>
> > Thank you for that. I do have a long standing dream of a single chip
> > wifi router, with the lowest SNR possible, and the minimum number of
> > pins coming off of it. I'd settle for 32MB of (static?) ram on chip as
> > that has proven sufficient to date to drive 802.11n
> >
> > which would let you get rid of both the L2 and L1 cache. That said, I
> > think the cost of 32MB of on-chip static ram remains a bit high, and
> > plugging it into a mips cpu, kind of silly. Someday there will be a case
> > to just doing everything on a single chip, but...
>
> I could see 32MB or more of fast memory on-chip as being attractive when
> one is fighting with diminishing returns in CPU speed and program
> parallelizability.
>
> In the past that might have excited MIPS, but these days less so. Maybe
> ARM? IBM?
>
> --dave
>
> --
> David Collier-Brown, | Always do right. This will gratify
> System Programmer and Author | some people and astonish the rest
> dav...@spamcop.net   |  -- Mark Twain
>





Re: [Bloat] known buffer sizes on switches

2018-11-28 Thread David Collier-Brown

On 2018-11-28 11:55 a.m., Dave Taht wrote:


> Thank you for that. I do have a long standing dream of a single chip
> wifi router, with the lowest SNR possible, and the minimum number of
> pins coming off of it. I'd settle for 32MB of (static?) ram on chip as
> that has proven sufficient to date to drive 802.11n
>
> which would let you get rid of both the L2 and L1 cache. That said, I
> think the cost of 32MB of on-chip static ram remains a bit high, and
> plugging it into a mips cpu, kind of silly. Someday there will be a case
> to just doing everything on a single chip, but...


I could see 32MB or more of fast memory on-chip as being attractive when 
one is fighting with diminishing returns in CPU speed and program 
parallelizability.


In the past that might have excited MIPS, but these days less so. Maybe 
ARM? IBM?


--dave

--
David Collier-Brown, | Always do right. This will gratify
System Programmer and Author | some people and astonish the rest
dav...@spamcop.net   |  -- Mark Twain



Re: [Bloat] known buffer sizes on switches

2018-11-28 Thread Dave Taht
Bruno George Moraes  writes:

> Nice resource, thanks.
>
> If someone wonders why things look the way they do, it's all about 
> on-die and off-die memory. Either you use off-die or on-die memory, often 
> SRAM which requires 6 gates per bit. So spending half a billion gates 
> gives you ~10MB buffer on-die. If you're doing off-die memory (DRAM or 
> similar) then you'll get the gigabytes of memory seen in some equipment. 
> There basically is nothing in between. As soon as you go off-die you might 
> as well put at least 2-6 GB in there.
>
> There is some research on new memory devices with unexpected
> results...
> https://ieeexplore.ieee.org/document/8533260
>
> The HMC memory allows improvements in execution time and consumed
> energy. In some situations, this memory type permits removing the
> L2 cache from the memory hierarchy. 
>
> HMC parts start at 2GB 

Thank you for that. I do have a long-standing dream of a single-chip
wifi router, with the lowest SNR possible, and the minimum number of
pins coming off of it. I'd settle for 32MB of (static?) RAM on-chip, as
that has proven sufficient to date to drive 802.11n,

which would let you get rid of both the L2 and L1 cache. That said, I
think the cost of 32MB of on-chip static RAM remains a bit high, and
plugging it into a MIPS CPU, kind of silly. Someday there will be a case
for just doing everything on a single chip, but...

>
>


[Bloat] known buffer sizes on switches

2018-11-28 Thread Bruno George Moraes
>
> Nice resource, thanks.
>
> If someone wonders why things look the way they do: it's all about
> on-die and off-die memory. Either you use off-die or on-die memory, often
> SRAM which requires 6 gates per bit. So spending half a billion gates
> gives you ~10MB buffer on-die. If you're doing off-die memory (DRAM or
> similar) then you'll get the gigabytes of memory seen in some equipment.
> There basically is nothing in between. As soon as you go off-die you might
> as well put at least 2-6 GB in there.
>
>
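A quick back-of-the-envelope check of the gate-count arithmetic quoted above (a sketch only; the 6-transistors-per-bit SRAM cell is the standard figure, and the half-billion-gate budget is the one quoted):

```python
# Check: half a billion gates at 6 gates per SRAM bit ~= 10 MB on-die buffer.
GATES_PER_BIT = 6  # standard 6T SRAM cell

def sram_megabytes(gate_budget: float) -> float:
    """On-die SRAM capacity in MB for a given gate budget."""
    bits = gate_budget / GATES_PER_BIT
    return bits / 8 / 1e6  # bits -> bytes -> megabytes

print(round(sram_megabytes(500e6), 1))  # 10.4 -- matching the ~10MB figure above
```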
There is some research on new memory devices with unexpected results...
https://ieeexplore.ieee.org/document/8533260

The HMC memory allows improvements in execution time and consumed energy.
> In some situations, this memory type permits removing the L2 cache from the
> memory hierarchy.
>

HMC parts start at 2GB


Re: [Bloat] when does the CoDel part of fq_codel help in the real world?

2018-11-28 Thread Bless, Roland (TM)
Hi Luca,

Am 28.11.18 um 11:48 schrieb Luca Muscariello:

> And for BBR, I would say that the design principles are one thing and
> the implementations another,
> and we had better distinguish between them. The key design principles are
> all valid.

While the goal of operating around the optimal point, where the buffer is
nearly empty, is certainly the right one, BBR's model is only valid from
either the viewpoint of the bottleneck or that of a single sender.

In BBR, one of the key design principles is to observe the
achieved delivery rate. One assumption in BBRv1 is that if the delivery
rate can still be increased, then the bottleneck isn't saturated. This
doesn't necessarily hold if you have multiple BBR flows present at the
bottleneck.
Every BBR flow can (nearly always) increase its delivery rate while
probing: it will simply decrease other flows' shares. This is not
an _implementation_ issue of BBRv1 and has been explained in section III
of our BBR evaluation paper.

This section also shows that BBRv1 will (by design) increase its amount
of inflight data to the maximum of 2 * estimated_BDP if multiple flows
are present. A BBR sender could also use packet loss or RTT increase as
indicators that it is probably operating to the right of the optimal
point, but this is not done in BBRv1.
BBRv2 will thus be an improvement over BBRv1 in several ways.
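A minimal sketch of the inflight bound discussed above (the function and variable names are mine for illustration, not taken from any BBR implementation):

```python
# BBRv1 bounds inflight data by cwnd_gain * estimated_BDP, where the BDP
# estimate is the max-filtered delivery rate times the min-filtered RTT.
# With several BBR flows sharing a bottleneck, each flow can keep raising
# its delivery-rate estimate while probing, so aggregate inflight tends
# toward 2 * the true BDP, as section III of the evaluation paper argues.
CWND_GAIN = 2.0  # BBRv1's default cwnd gain

def inflight_cap_bytes(est_rate_bps: float, min_rtt_s: float) -> float:
    """Upper bound on one BBRv1 flow's data in flight, in bytes."""
    est_bdp_bytes = est_rate_bps * min_rtt_s / 8
    return CWND_GAIN * est_bdp_bytes

# e.g. 100 Mbit/s estimated rate, 50 ms min RTT -> BDP 625 kB, cap 1.25 MB
print(inflight_cap_bytes(100e6, 0.05))  # 1250000.0
```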

Regards,
 Roland


Re: [Bloat] when does the CoDel part of fq_codel help in the real world?

2018-11-28 Thread Luca Muscariello
On Wed, Nov 28, 2018 at 11:40 AM Dave Taht  wrote:

> On Wed, Nov 28, 2018 at 1:56 AM Luca Muscariello
>  wrote:
> >
> > Dave,
> >
> > The single BDP inflight is a rule of thumb that does not account for
> fluctuations of the RTT.
> > And I am not talking about random fluctuations and noise. I am talking
> about fluctuations
> > from a control theoretic point of view to stabilise the system, e.g. the
> trajectory of the system variable that
> > gets to the optimal point no matter the initial conditions (Lyapunov).
>
> I have been trying all day to summon the gumption to make this argument:
>
> IF you have a good idea of the actual RTT...
>
> it is also nearly certain that there will be *at least* one other flow
> you will be competing with...
> therefore the fluctuations from every point of view are dominated by
> the interaction between these flows and
> the goal, in general, is not to take up a full BDP for your single flow.
>
> And BBR aims for some tiny percentage less than what it thinks it can
> get, when, well, everybody's seen it battle it out with itself and
> with cubic. I hand it FQ at the bottleneck link and it works well.
>
> single flows exist only in the minds of theorists and labs.
>
> There's a relevant passage worth citing in the kleinrock paper, I
> thought (did he write two recently?) that talked about this problem...
> I *swear* when I first read it it had a deeper discussion of the
> second sentence below and had two paragraphs that went into the issues
> with multiple flows:
>
> "ch earlier and led to the Flow Deviation algorithm [28]. 17 The
> reason that the early work of 40 years ago took so long to make its
> current impact is because in [31] it was shown that the mechanism
> presented in [2] and [3] could not be implemented in a decentralized
> algorithm. This delayed the application of Power until the recent work
> by the Google team in [1] demonstrated that the key elements of
> response time and bandwidth could indeed be estimated using a
> distributed control loop sliding window spanning approximately 10
> round-trip times."
>
> but I can't find it today.
>
>
Here it is

https://www.lk.cs.ucla.edu/data/files/Kleinrock/Internet%20Congestion%20Control%20Using%20the%20Power%20Metric-Keep%20the%20Pipe%20Just%20Full%2C%20But%20No%20Fuller%20July%202018.pdf


> > The ACM queue paper talking about Codel makes a fairly intuitive and
> accessible explanation of that.
>
> I haven't re-read the lola paper. I just wanted to make the assertion
> above. And then duck. :)
>
> Also, when I last looked at BBR, it made a false assumption that 200ms
> was "long enough" to probe the actual RTT, when my comcast links and
> others are measured at 680ms+ of buffering.
>

This is essentially the same paper I cited which is Part I.



>
> And I always liked the stanford work, here, which tried to assert that
> a link with n flows requires no more than B = (RTT × C) / √n.
>
> http://yuba.stanford.edu/techreports/TR04-HPNG-060800.pdf


That paper does not say that the rule ALWAYS applies. It does under
certain conditions.
But my point is about optimality.

It does NOT mean that the system HAS to work ALWAYS at that point, because
things change.

And for BBR, I would say that the design principles are one thing and the
implementations another,
and we had better distinguish between them. The key design principles are all
valid.



>
>
> night!
>
>
night ;)


>
>
> > There is a less accessible literature talking about that, which dates
> back to some time ago
> > that may be useful to re-read again
> >
> > Damon Wischik and Nick McKeown. 2005.
> > Part I: buffer sizes for core routers.
> > SIGCOMM Comput. Commun. Rev. 35, 3 (July 2005), 75-78. DOI=
> http://dx.doi.org/10.1145/1070873.1070884
> > http://klamath.stanford.edu/~nickm/papers/BufferSizing.pdf.pdf
> >
> > and
> >
> > Gaurav Raina, Don Towsley, and Damon Wischik. 2005.
> > Part II: control theory for buffer sizing.
> > SIGCOMM Comput. Commun. Rev. 35, 3 (July 2005), 79-82.
> > DOI=http://dx.doi.org/10.1145/1070873.1070885
> > http://www.statslab.cam.ac.uk/~gr224/PAPERS/Control_Theory_Buffers.pdf
> >
> > One of the things that Frank Kelly has brought to the literature is about
> optimal control.
> > From a pure optimization point of view we know since Robert Gallager
> (and Bertsekas 1981) that
> > the optimal sending rate is a function of the shadow price at the
> bottleneck.
> > This shadow price is nothing more than the Lagrange multiplier of the
> capacity constraint
> > at the bottleneck. Some protocols such as XCP or RCP propose to carry
> something
> > very close to a shadow price in the ECN but that's not that simple.
> > And currently we have a 0/1 "shadow price" which is way insufficient.
> >
> > Optimal control as developed by Frank Kelly since 1998 tells you that
> you have
> > a stability region that is needed to get to the optimum.
> >
> > Wischik's work, IMO, helps quite a lot to understand tradeoffs while
> designing AQM
> > and CC. I 

Re: [Bloat] when does the CoDel part of fq_codel help in the real world?

2018-11-28 Thread Dave Taht
On Wed, Nov 28, 2018 at 1:56 AM Luca Muscariello
 wrote:
>
> Dave,
>
> The single BDP inflight is a rule of thumb that does not account for 
> fluctuations of the RTT.
> And I am not talking about random fluctuations and noise. I am talking about 
> fluctuations
> from a control theoretic point of view to stabilise the system, e.g. the 
> trajectory of the system variable that
> gets to the optimal point no matter the initial conditions (Lyapunov).

I have been trying all day to summon the gumption to make this argument:

IF you have a good idea of the actual RTT...

it is also nearly certain that there will be *at least* one other flow
you will be competing with...
therefore the fluctuations from every point of view are dominated by
the interaction between these flows and
the goal, in general, is not to take up a full BDP for your single flow.

And BBR aims for some tiny percentage less than what it thinks it can
get, when, well, everybody's seen it battle it out with itself and
with cubic. I hand it FQ at the bottleneck link and it works well.

single flows exist only in the minds of theorists and labs.

There's a relevant passage worth citing in the kleinrock paper, I
thought (did he write two recently?) that talked about this problem...
I *swear* when I first read it it had a deeper discussion of the
second sentence below and had two paragraphs that went into the issues
with multiple flows:

"ch earlier and led to the Flow Deviation algorithm [28]. 17 The
reason that the early work of 40 years ago took so long to make its
current impact is because in [31] it was shown that the mechanism
presented in [2] and [3] could not be implemented in a decentralized
algorithm. This delayed the application of Power until the recent work
by the Google team in [1] demonstrated that the key elements of
response time and bandwidth could indeed be estimated using a
distributed control loop sliding window spanning approximately 10
round-trip times."

but I can't find it today.

> The ACM queue paper talking about Codel makes a fairly intuitive and 
> accessible explanation of that.

I haven't re-read the lola paper. I just wanted to make the assertion
above. And then duck. :)

Also, when I last looked at BBR, it made a false assumption that 200ms
was "long enough" to probe the actual RTT, when my comcast links and
others are measured at 680ms+ of buffering.

And I always liked the stanford work, here, which tried to assert that
a link with n flows requires no more than B = (RTT × C) / √n.

http://yuba.stanford.edu/techreports/TR04-HPNG-060800.pdf
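As a rough illustration of how sharply that rule shrinks the buffer (the link speed and RTT here are mine, not from the paper):

```python
import math

def stanford_buffer_bytes(rtt_s: float, capacity_bps: float, n_flows: int) -> float:
    """B = (RTT * C) / sqrt(n), per the Stanford buffer-sizing work, in bytes."""
    return (rtt_s * capacity_bps / 8) / math.sqrt(n_flows)

# 100 ms RTT on a 10 Gbit/s link:
print(stanford_buffer_bytes(0.1, 10e9, 1))      # 125000000.0 -- a full BDP, 125 MB
print(stanford_buffer_bytes(0.1, 10e9, 10000))  # 1250000.0   -- 1.25 MB with 10k flows
```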

night!



> There is a less accessible literature talking about that, which dates back to 
> some time ago
> that may be useful to re-read again
>
> Damon Wischik and Nick McKeown. 2005.
> Part I: buffer sizes for core routers.
> SIGCOMM Comput. Commun. Rev. 35, 3 (July 2005), 75-78. 
> DOI=http://dx.doi.org/10.1145/1070873.1070884
> http://klamath.stanford.edu/~nickm/papers/BufferSizing.pdf.pdf
>
> and
>
> Gaurav Raina, Don Towsley, and Damon Wischik. 2005.
> Part II: control theory for buffer sizing.
> SIGCOMM Comput. Commun. Rev. 35, 3 (July 2005), 79-82.
> DOI=http://dx.doi.org/10.1145/1070873.1070885
> http://www.statslab.cam.ac.uk/~gr224/PAPERS/Control_Theory_Buffers.pdf
>
> One of the things that Frank Kelly has brought to the literature is about
> optimal control.
> From a pure optimization point of view we know since Robert Gallager (and
> Bertsekas 1981) that
> the optimal sending rate is a function of the shadow price at the bottleneck.
> This shadow price is nothing more than the Lagrange multiplier of the 
> capacity constraint
> at the bottleneck. Some protocols such as XCP or RCP propose to carry 
> something
> very close to a shadow price in the ECN but that's not that simple.
> And currently we have a 0/1 "shadow price" which is way insufficient.
>
> Optimal control as developed by Frank Kelly since 1998 tells you that you have
> a stability region that is needed to get to the optimum.
>
> Wischik's work, IMO, helps quite a lot to understand tradeoffs while designing
> AQM
> and CC. I feel like the people who wrote the codel ACM Queue paper are very 
> much aware of this literature,
> because Codel design principles seem to take into account that.
> And the BBR paper too.
>
>
> On Tue, Nov 27, 2018 at 9:58 PM Dave Taht  wrote:
>>
>> OK, wow, this conversation got long. and I'm still 20 messages behind.
>>
>> Two points, and I'm going to go back to work, and maybe I'll try to
>> summarize a table
>> of the competing viewpoints, as there's far more than BDP of
>> discussion here, and what
>> we need is sqrt(bdp) to deal with all the different conversational flows. :)
>>
>> On Tue, Nov 27, 2018 at 1:24 AM Luca Muscariello
>>  wrote:
>> >
>> > I think that this is a very good comment to the discussion at the defense 
>> > about the comparison between
>> > SFQ with longest queue drop and FQ_Codel.
>> >
>> > A congestion controlled protocol such as TCP or others, 

Re: [Bloat] when does the CoDel part of fq_codel help in the real world?

2018-11-28 Thread Luca Muscariello
Dave,

The single BDP inflight is a rule of thumb that does not account for
fluctuations of the RTT.
And I am not talking about random fluctuations and noise. I am talking
about fluctuations
from a control theoretic point of view to stabilise the system, e.g. the
trajectory of the system variable that
gets to the optimal point no matter the initial conditions (Lyapunov).
The ACM queue paper talking about Codel makes a fairly intuitive and
accessible explanation of that.
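For reference, the control law that paper describes can be sketched as follows (a simplification of mine; real CoDel also tracks per-packet sojourn times and re-enters the dropping state with memory of the previous drop count):

```python
import math

TARGET_S = 0.005    # 5 ms acceptable standing-queue delay
INTERVAL_S = 0.100  # 100 ms, on the order of a worst-case RTT

def next_drop_delay(drop_count: int) -> float:
    """While sojourn time stays above TARGET_S, CoDel spaces successive
    drops by interval / sqrt(count), ramping the drop rate up gently
    until the standing queue drains."""
    return INTERVAL_S / math.sqrt(drop_count)

print([round(next_drop_delay(n), 4) for n in (1, 2, 4, 16)])
# [0.1, 0.0707, 0.05, 0.025]
```

The inverse-square-root schedule is what gives the control loop its stability: the drop rate grows slowly enough for the sender's control loop to react between drops.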

There is a less accessible literature talking about that, which dates back
to some time ago
that may be useful to re-read again

Damon Wischik and Nick McKeown. 2005.
Part I: buffer sizes for core routers.
SIGCOMM Comput. Commun. Rev. 35, 3 (July 2005), 75-78. DOI=
http://dx.doi.org/10.1145/1070873.1070884
http://klamath.stanford.edu/~nickm/papers/BufferSizing.pdf.pdf

and

Gaurav Raina, Don Towsley, and Damon Wischik. 2005.
Part II: control theory for buffer sizing.
SIGCOMM Comput. Commun. Rev. 35, 3 (July 2005), 79-82.
DOI=http://dx.doi.org/10.1145/1070873.1070885
http://www.statslab.cam.ac.uk/~gr224/PAPERS/Control_Theory_Buffers.pdf

One of the things that Frank Kelly has brought to the literature is about
optimal control.
From a pure optimization point of view we know since Robert Gallager (and
Bertsekas 1981) that
the optimal sending rate is a function of the shadow price at the
bottleneck.
This shadow price is nothing more than the Lagrange multiplier of the
capacity constraint
at the bottleneck. Some protocols such as XCP or RCP propose to carry
something
very close to a shadow price in the ECN but that's not that simple.
And currently we have a 0/1 "shadow price" which is way insufficient.
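For readers who want the formal statement behind "shadow price", the standard Kelly-style network utility maximization setup (sketched from the textbook formulation, not from any one paper cited here) is:

```latex
% Flows choose rates x_i to maximize total utility subject to link capacities:
\max_{x \ge 0}\; \sum_i U_i(x_i)
\quad \text{s.t.} \quad \sum_{i:\, l \in \mathrm{path}(i)} x_i \le C_l \quad \forall l
% At the optimum, with multiplier \lambda_l on each capacity constraint,
% each sender equates marginal utility to the sum of shadow prices on its path:
U_i'(x_i^\ast) = \sum_{l \in \mathrm{path}(i)} \lambda_l^\ast
```

A single ECN mark conveys only a one-bit quantization of that λ, which is why XCP and RCP proposed carrying something much closer to the price itself.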

Optimal control as developed by Frank Kelly since 1998 tells you that you
have
a stability region that is needed to get to the optimum.

Wischik's work, IMO, helps quite a lot to understand tradeoffs while
designing AQM
and CC. I feel like the people who wrote the codel ACM Queue paper are very
much aware of this literature,
because Codel design principles seem to take into account that.
And the BBR paper too.


On Tue, Nov 27, 2018 at 9:58 PM Dave Taht  wrote:

> OK, wow, this conversation got long. and I'm still 20 messages behind.
>
> Two points, and I'm going to go back to work, and maybe I'll try to
> summarize a table
> of the competing viewpoints, as there's far more than BDP of
> discussion here, and what
> we need is sqrt(bdp) to deal with all the different conversational flows.
> :)
>
> On Tue, Nov 27, 2018 at 1:24 AM Luca Muscariello
>  wrote:
> >
> > I think that this is a very good comment to the discussion at the
> defense about the comparison between
> > SFQ with longest queue drop and FQ_Codel.
> >
> > A congestion controlled protocol such as TCP or others, including QUIC,
> LEDBAT and so on
> > need at least the BDP in the transmission queue to get full link
> efficiency, i.e. the queue never empties out.
>
> no, I think it needs a BDP in flight.
>
> I think some of the confusion here is that your TCP stack needs to
> keep around a BDP in order to deal with
> retransmits, but that lives in another set of buffers entirely.
>
> > This gives rule of thumbs to size buffers which is also very practical
> and thanks to flow isolation becomes very accurate.
> >
> > Which is:
> >
> > 1) find a way to keep the number of backlogged flows at a reasonable
> value.
> > This largely depends on the minimum fair rate an application may need in
> the long term.
> > We discussed a little bit of available mechanisms to achieve that in the
> literature.
> >
> > 2) fix the largest RTT you want to serve at full utilization and size
> the buffer using BDP * N_backlogged.
> > Or the other way round: check how much memory you can use
> > in the router/line card/device and for a fixed N, compute the largest
> RTT you can serve at full utilization.
>
> My own take on the whole BDP argument is that *so long as the flows in
> that BDP are thoroughly mixed* you win.
>
> >
> > 3) there is still some memory to dimension for sparse flows in addition
> to that, but this is not based on BDP.
> > It is just enough to compute the total utilization of sparse flows and
> use the same simple model Toke has used
> > to compute the (de)prioritization probability.
> >
> > This procedure would allow to size FQ_codel but also SFQ.
> > It would be interesting to compare the two under this buffer sizing.
> > It would also be interesting to compare another mechanism that we have
> mentioned during the defense
> > which is AFD + a sparse flow queue. Which is, BTW, already available in
> Cisco nexus switches for data centres.
> >
> > I think that the the codel part would still provide the ECN feature,
> that all the others cannot have.
> > However the others, the last one especially can be implemented in
> silicon with reasonable cost.
> >
> >
> >
> >
> >
> > On Mon 26 Nov 2018 at 22:30, Jonathan Morton 
> wrote:
> >>
> >> > On 26 Nov, 2018, at 9:08 pm, Pete Heist  wrote:
> >> >
>