On Tuesday 30 Aug 2011 00:08:16 Arne Babenhauserheide wrote:
> After discussing and going deeper into this, it became apparent that the 
> problem is not overload of queues. I’ll repeat the result I came to here (and 
> not in chatlog form :) ): 
> 
> The problem is in the load limiter: 
> 
> 
> 1) we need to use all bandwidth, because latency depends on the number of  
>       finishing transfers.
> 
> 2) SSKs fail 85% of the time.
> 
> 3) the load limiter assumes that all SSKs succeed and reserves bandwidth for 
>       them.
> 
> 4) this now bites us, because NLM has shorter transfer times and longer wait 
>       times. So the over-reservation of bandwidth for SSKs lasts longer. 
> 
> 5) solution: count each SSK as only 
>       average_SSK_success_rate * data_to_transfer_on_success.
> 
> 
> To (1): When we have less bandwidth, then fewer CHKs succeed per second, so 
> fewer slots get freed and the queueing time gets bigger.
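
A minimal sketch of what (5) would mean for the output bandwidth liability
limiter (pseudo-Python rather than the actual fred code; the block sizes and
the 15% figure are illustrative, taken from the ~85% SSK failure rate above):

    SSK_DATA_SIZE = 1024        # an SSK payload is roughly 1KB
    CHK_DATA_SIZE = 32 * 1024   # a CHK block is 32KB

    def expected_bandwidth_cost(is_ssk, ssk_success_rate=0.15):
        """Bandwidth to reserve for one request when deciding to accept it."""
        if is_ssk:
            # Reserve only the expected transfer, not the full block: with an
            # 85% failure rate that is ~150 bytes instead of 1KB per SSK.
            return ssk_success_rate * SSK_DATA_SIZE
        return CHK_DATA_SIZE    # CHKs mostly succeed, keep the full reservation

    def can_accept(is_ssk, reserved_so_far, liability_limit):
        return reserved_so_far + expected_bandwidth_cost(is_ssk) <= liability_limit

That way the reservation matches what SSKs actually use on average, so the
over-reservation described in (3) and (4) goes away.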

Another way to look at this:

Both Ian and I have been assuming that the resource is some number 
representing a property of the node, for example, how many requests are 
running. But the real resource is the external bandwidth usage, which is rather 
different.

An alternative solution - if we are worried about bursts of requests succeeding 
and therefore timing out - is to allocate a fixed chunk of bandwidth for SSKs, 
and not let them interfere with CHKs. Then CHKs can have long transfer times, 
and therefore long queueing times when they fail (but this is only ~ 40% of the 
time with NLM's accurate routing), and SSKs can have short transfer times, and 
therefore short queueing times. This may be less disruptive and should also be 
relatively easy to implement.
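
A sketch of what that could look like (the class name and the 20% split are
made up for illustration; nothing like this exists in fred yet):

    class SplitLiabilityLimiter:
        """Keep separate bandwidth-liability pools for SSKs and CHKs."""

        def __init__(self, total_liability, ssk_fraction=0.2):
            self.ssk_limit = total_liability * ssk_fraction
            self.chk_limit = total_liability - self.ssk_limit
            self.ssk_reserved = 0.0
            self.chk_reserved = 0.0

        def try_accept(self, is_ssk, size):
            """Reserve bandwidth for a request, or report that it must wait."""
            if is_ssk:
                if self.ssk_reserved + size <= self.ssk_limit:
                    self.ssk_reserved += size
                    return True
            elif self.chk_reserved + size <= self.chk_limit:
                self.chk_reserved += size
                return True
            return False    # queue it (NLM) or reject it (OLM) instead

        def finish(self, is_ssk, size):
            """Release the reservation when the request succeeds, fails or times out."""
            if is_ssk:
                self.ssk_reserved -= size
            else:
                self.chk_reserved -= size

Because SSK transfers are tiny, the SSK pool drains quickly, so SSK queueing
times stay short no matter how long the CHK transfers take.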

But the other question is, can queueing ever be helpful? It can if it allows us 
to route more accurately (which NLM clearly does), and/or to run enough 
requests in parallel that the longer time taken for the request to reach its 
destination is offset. Is this condition met?
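
One way to write that condition down (just a formalisation of the sentence
above; the numbers in the example are rough values from the log below, not
measurements):

    def nlm_pays_off(success_ratio, time_ratio, parallelism_ratio):
        """Each argument is NLM's value divided by OLM's value.

        Useful throughput ~ requests_in_flight * success_rate / request_time,
        so queueing is a win iff the combined ratio exceeds 1.
        """
        return success_ratio * parallelism_ratio / time_ratio > 1.0

    # e.g. ~25% better success rate, ~1.3x the request time, same parallelism:
    print(nlm_pays_off(1.25, 1.3, 1.0))   # False: not quite enough on its own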
> 
> Best wishes, 
> Arne
> 
> 
> 
> On Monday, 29 August 2011 at 22:27:47, Matthew Toseland wrote:
> > Okay, I don't understand most of that, I might be able to check the math if
> > it was written properly, but it looks difficult. However, as far as I can see:
> > - The most obvious way to increase bandwidth usage would be to increase the
> >   timeout time for output bandwidth liability (and at the same time increase
> >   the relevant block transfer timeouts).
> > - This would increase the number of slots but it would also increase the
> >   number of requests seeking them; I don't see why it would help matters.
> > - Running an excessive number of requests without adjusting the block
> >   transfer timeouts would result in some of the transfers timing out.
> > - It would also, as you mention, make SSK requests (especially failed SSK
> >   requests) even slower.
> > - I am quite confident that Moore's Law DOES NOT hold for upload bandwidth.
> > - As far as I can see, all the benefits of NLM re attackers are achieved by
> >   fair sharing.
> > - A system which is more resistant to attacks but slower probably isn't all
> >   that interesting if the attacks in question are relatively expensive anyway.
> > - Smaller block sizes would have a significant efficiency cost, and would
> >   probably make load management more difficult.
> > 
> > I apologise if you see this as rather a broadside after I encouraged you to
> > analyse the problem, but it is not yet a convincing demonstration that
> > queueing is a viable strategy and not the spawn of satan! :)
> > On Monday 29 Aug 2011 21:27:01 Arne Babenhauserheide wrote:
> > > On Monday, 29 August 2011 at 14:32:13, Ian Clarke wrote:
> > > > Yes, small tweaks have worked so well for us for the last decade,
> > > > leaving us pretty-much where we were in 2003.  No, we don't
> > > > understand how the current system works, there is no point in
> > > > trying to fix something when we don't even know what is broken.
> > > 
> > > I’d like to present a clue about what is broken in NLM. Before I kill you
> > > with the log, here’s the result:
> > >   With NLM the latency of a request is a function of the raw bandwidth
> > >   (not so with OLM), and NLM used only half my bandwidth after it had
> > >   been deployed for 2 days (at the start much more).
> > > 
> > > τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ! ; with τ: transfer time, q: queue
> > > time (time to find the node), nlm: new load management, olm: old load
> > > management.
> > > 
> > > So first step: make sure all bandwidth gets used - maybe by allocating
> > > more slots till we use all allowed bandwidth. Better having to throttle
> > > a transfer than not using bandwidth.
> > > 
> > > *With the current network, NLM should be slower than OLM by 23%. But in
> > > 18 months it should actually be faster by ~8% — given Moore's Law holds
> > > for upload bandwidth — because the routes are shorter.*
> > > 
> > > The main advantage of NLM is that it should be much more resilient
> > > against attackers (DoS).
> > > 
> > > Now to the log - it’s math and not cleaned up; you have been warned :)
> > > 
> > > <ArneBab> SSK-time: σ, CHK-time: ψ, success: Xs, fail: Xf.
> > > <ArneBab> queue-time: q, transfer-time: τ, hops remaining: h, total
> > > hops: h₀, w: success rate
> > > <ArneBab> ψs = τ(h) + q(h)
> > > <ArneBab> ψf = q(h)
> > > <ArneBab> ψ ~ w₁·ψs + (1-w₁)·ψf
> > > <ArneBab> σs = τ(h) + q(h)
> > > <ArneBab> σf = q(h)
> > > <ArneBab> σ ~ w₂·σs + (1-w₂)·σf; w₂ ~ 15%
> > > <ArneBab> num(ψ) / num(σ) ~ 1
> > > <ArneBab> → time ~ σ + ψ
> > > <ArneBab> q(h) depends on timeouts, as do w₁ and w₂
> > > <ArneBab> time =  w₁·ψs + (1-w₁)·ψf +  w₂·ψs + (1-w₂)·ψf
> > > <ArneBab> = w₁ · (τ(h) + q(h)) + (1-w₁)·q(h) + w₂ · (τ(h) + q(h)) + (1-
> > > w₂)·q(h)
> > > <ArneBab> = t(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
> > > <ArneBab> = τ(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
> > > <ArneBab> in the congestion case q(h) ~ timeout
> > > <ArneBab> timeout = o
> > > <ArneBab> timeout: o
> > > <ArneBab> w depends on the timeout *somehow*, but inversely
> > > <ArneBab> o=0 → w=0
> > > <ArneBab> assumption: o = ∞ → w₂ ~ 20%, w₁ ~ 100%
> > > <ArneBab> assumption: o = ∞ → w₂ ~ 0.2, w₁ ~ 1
> > > <ArneBab> correction: in the congestion case: q(h) ~ min(timeout, τ(h))
> > > <ArneBab> timeout matters for q(h) only when timeout < τ(h)
> > > <ArneBab> I try to: I still need a dependency of w on timeout
> > > <ArneBab> … lets call it t(w)
> > > <ArneBab> better: w(o) :)
> > > <toad_> well, if there is a timeout, we have a fixed time, but we reduce
> > > the hops ...
> > > <toad_> i thought w was success rate
> > > <ArneBab> ah!
> > > <ArneBab> and the success rates where in the NLM stats
> > > <ArneBab> going mostly smoothly from 60% to 0%
> > > <ArneBab> for the HTL
> > > <toad_> right, success rate peaks at 18 or sometimes 16
> > > <toad_> what are w1 vs w2?
> > > <toad_> chk vs ssk i guess
> > > <ArneBab> yes
> > > -*- toad_ thinks considering both is probably overambitious at this stage?
> > > <ArneBab> should not be too bad: SSKs drop much more rapidly at
> > > decreasing hops
> > > <ArneBab> hops→HTL
> > > <toad_> ψs is time for a successful chk; ψf is time for a failed chk ...
> > > in which case h in the first instance is low, and in the second
> > > instance is h0
> > > <ArneBab> yes
> > > <toad_> okay, i don't follow this line: time =  w₁·ψs + (1-w₁)·ψf + 
> > > w₂·ψs + (1-w₂)·ψf
> > > <toad_> i thought w2 related to SSKs?
> > > <ArneBab> uh, yes…
> > > <ArneBab> time =  w₁·ψs + (1-w₁)·ψf +  w₂·σs + (1-w₂)·σf
> > > <toad_> you have to appreciate i'm only just getting back into maths and
> > > physics after 12 years ...
> > > <toad_> (retaking a-levels to get a degree)
> > > <ArneBab> no probs, I’m also no expert in this. I try to get a relation
> > > between the time and the timeout, so we can try to find a minimum
> > > <toad_> in any case, there are two different h's for the two uses of
> > > q(h) - h0 and h_avg
> > > <toad_> h_avg for success and h0 for failure
> > > <ArneBab> hm, yes
> > > <ArneBab> which makes this harder…
> > > <ArneBab> it’s wrong anyway… the q(h_avg) was missing
> > > <toad_> h_avg is somewhere between 5 and 10 imho
> > > <toad_> at least it is if everything is working well and the input load
> > > isn't all really popular stuff (in which case it's answered quickly and
> > > can be ignored)
> > > <ArneBab> = τ(h) · (w₁+w₂) + q(h) · (2-w₁-w₂) + q(h_avg) · (w₁+w₂)
> > > <ArneBab> would have been too easy :)
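
Restating the corrected line in one place (this is only the formula from the
log, with h0 the maximum hop count and h_avg the average hops of a successful
request):

    def expected_request_time(tau, q, w_chk, w_ssk, h_avg, h0):
        """tau(h), q(h): transfer and queue time over h hops; w_*: success rates."""
        w = w_chk + w_ssk
        return tau(h_avg) * w + q(h_avg) * w + q(h0) * (2 - w)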
> > > <toad_> okay so here q(h) means q(h0) i.e. h = h0, max hops?
> > > <ArneBab> jepp, and max hops sinks with falling timeout
> > > <toad_> hmm?
> > > <ArneBab> the max actual hops
> > > <toad_> on the upside, q() is linear
> > > <toad_> hopefully
> > > <ArneBab> yes: q(h) = h·o
> > > <ArneBab> (in the congestion case)
> > > <toad_> the problem i have is it looks like q(1) ~= time [ for a full
> > > request ], unless load is very low
> > > <ArneBab> so τ « q?
> > > <ArneBab> τ much smaller than q?
> > > <toad_> of course it's bounded by timeouts, but i'd expect a runaway
> > > feedback loop until it reaches heavy timeouts and effectively cuts the htl
> > > <toad_> well, with OLM, success time for a CHK is 1m25s, unsuccessful is
> > > 19sec, so transfer time is at least 1 minute
> > > <toad_> and less than 1m25; but with NLM, unsuccessful is 3 min+
> > > <ArneBab> well, for SSKs in OLM the first 6 hops are the most
> > > successful, later ones only contribute 1% success, which piles up to ~
> > > 12%
> > > <toad_> okay...
> > > <ArneBab> (only possible because 85% are unsuccessful)
> > > <ArneBab> (otherwise this would be wrong: the contribution of later ones
> > > would be smaller)
> > > <ArneBab> from the numbers q(h₀) ~ τ(h_avg)
> > > <toad_> well, queueing time on any hop is the time it takes to get a
> > > slot to route to, which is roughly equal to the time it takes for a
> > > request to complete divided by the number of waiters, right?
> > > <toad_> errr multiplied by the number of waiters
> > > <toad_> if the network is homogenous, that's exactly the time it takes
> > > for a request to complete
> > > <toad_> so we expect ridiculous queue times
> > > <toad_> however if there is spare capacity this may be avoidable
> > > -*- toad_ hopes you can establish that NLM isn't totally pointless anyway :)
> > > <ArneBab> actually that fits it quite well, but it leaves out that routes
> > > with NLM should be shorter
> > > <ArneBab> and that for me the point of NLM is not speed but attack-resilience
> > > <ArneBab> that network can’t be spammed efficiently
> > > <ArneBab> simplified: time = (τ + q) · hops ; τ and q as times per hop
> > > <ArneBab> hops for CHK are less with NLM
> > > <ArneBab> hops for SSK are equal
> > > <ArneBab> (most are unsuccessful)
> > > <ArneBab> → time = q(SSK) + τ(CHK) + q(CHK)
> > > <ArneBab> in OLM: “q”(SSK) ~ 16s, “q”(CHK) ~ 18s, τ(CHK) ~ 45s
> > > <ArneBab> (my stats)
> > > <ArneBab> in NLM q(SSK) = τ(CHK)
> > > <ArneBab> or so
> > > <ArneBab> → there we might have the general problem
> > > <ArneBab> toad_: the queue times of SSKs depend on the transfer times of
> > > CHKs, so they have to be higher
> > > <toad_> well, ian thinks there is a fundamental problem with queueing;
> > > the alternative is to allow a larger window between when we start
> > > complaining and when stuff breaks i.e. use less % of the total capacity
> > > <ArneBab> in NLM: q(SSK) ~ q(CHK) ~ τ(CHK), τ(CHK) lower due to better routes?
> > > <toad_> which might be faster in practice
> > > <ArneBab> τ(CHK) depends on the length of the route. with 25% better
> > > success rates per hop, it should be much lower
> > > <ArneBab> …need NLM stats… do you have some handy?
> > > <ArneBab> let’s estimate 60%/50%/50%/50%  for HTL 18/17/16/15
> > > <ArneBab> and I currently have 45%/50%/25%/25% with OLM
> > > <ArneBab> starting with 1000 requests, in NLM 600 have 1 hop, 200 have 2
> > > hops, 100 3 and 50 4, 50 have more → irrelevant.
> > > -*- toad_ not following
> > > <ArneBab> in OLM 450 have 1 hop, 275 have 2 hops, 69 3 and 51 4, 150 have more
> > > <ArneBab> I’m trying to estimate the hops a transfer has to take
> > > <ArneBab> we can’t ignore the 150 with more than 4 hops in OLM
> > > <ArneBab> I’ll just go down to 50, too
> > > <toad_> what are you trying to compute?
> > > <toad_> ian is convinced that queueing always makes the underlying
> > > problem worse
> > > <toad_> i'm inclined to agree with him unless you come up with a
> > > persuasive theoretical argument
> > > <ArneBab> 120 have 5, 96 have 6, 77 have 7, 61 have 8, 50 have more
> > > <ArneBab> so a 95% of the transfers in OLM take on average …
> > > <ArneBab> gah… need to divide the numbers, too
> > > <ArneBab> (I need to generate data to make an argument - that’s what I’m
> > > doing right now)
> > > <ArneBab> average hops for OLM: 450*1 + 275*2 + 69*3 + 51*4 + [now with
> > > correction] 150*0.22*5+120*0.2*6+96*0.2*7+77*0.2*8+61*0.2*9
> > > <ArneBab> → 2087.4
> > > <ArneBab> for NLM 95% of 1000 transfers need 600*1+200*2+100*3+50*4
> > > <ArneBab> = 1500 hops together
> > > <ArneBab> that’s 2.09 hops per transfer for OLM and 1.5 hops for NLM →
> > > τ_nlm / τ_olm ~ 0.71
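
Redoing that arithmetic (the hop distributions are the estimates from the
lines above, nothing measured):

    olm_hops = (450*1 + 275*2 + 69*3 + 51*4
                + 150*0.22*5 + 120*0.2*6 + 96*0.2*7 + 77*0.2*8 + 61*0.2*9)
    nlm_hops = 600*1 + 200*2 + 100*3 + 50*4
    print(olm_hops / 1000.0)            # ~2.09 hops per OLM transfer
    print(nlm_hops / 1000.0)            # 1.5 hops per NLM transfer
    print(nlm_hops / olm_hops)          # ~0.72 = estimated τ_nlm / τ_olm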
> > > <toad_> ArneBab: okay, that's plausible
> > > <toad_> ArneBab: however, it should be possible with smart load limiting
> > > on the originator to achieve NLM-level success rates
> > > <ArneBab> but not the resilience
> > > <ArneBab> it still keeps freenet open to a DoS, NLM should help there.
> > > <ArneBab> now back to the queueing: OLM had: “q”(SSK) ~ 16s, “q”(CHK) ~
> > > 18s, τ(CHK) ~ 45s (my stats)
> > > <toad_> possibly - fair sharing limits our vulnerability to a DoS,
> > > possibly enough as long as we don't have to worry about incentives issues
> > > <ArneBab> that’s about: q = ⅓ · τ (OLM)
> > > <ArneBab> NLM: q ~ τ
> > > <ArneBab> NLM: q ~ τ (NLM)
> > > <ArneBab> time: 2·q + τ
> > > <ArneBab> OLM: time ~ 5/3 τ_olm
> > > <ArneBab> NLM: time = 3 · 0.72 τ_olm = 2.15 τ_olm
> > > <operhiem1> toad_: Alright, it's alive. https://github.com/freenet/fred-
> > > staging/pull/55
> > > <ArneBab> → time_nlm / time_olm ~ 2.15 / (5/3) ~ 1.3
> > > <ArneBab> so the time to transfer should be a bit longer
> > > <ArneBab> (not yet finished: this is the current state)
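
The same numbers as a quick script (normalising today's OLM transfer time to 1):

    tau_olm = 1.0
    q_olm = tau_olm / 3            # "q = 1/3 · τ" for OLM
    time_olm = 2*q_olm + tau_olm   # = 5/3 ~ 1.67 τ_olm

    tau_nlm = 0.72 * tau_olm       # shorter routes, from the hop estimate above
    q_nlm = tau_nlm                # "q ~ τ" for NLM
    time_nlm = 2*q_nlm + tau_nlm   # = 3 · 0.72 ~ 2.16 τ_olm

    print(time_nlm / time_olm)     # ~1.3, i.e. NLM requests take ~30% longer today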
> > > <ArneBab> now, if we decrease the timeout time, the chance that a given
> > > timeout happens in the first 4 hops should be about 4/20 = 0.2
> > > <ArneBab> …cut that…
> > > <ArneBab> if we decrease the timeout time below the transfer time per
> > > hop, there should be more misrouting → τ goes up, q might go down or up
> > > → cut that.
> > > <ArneBab> transfer time per hop in OLM ~ 45s / hops_olm = 45s/2.09 = 21.5s
> > > <ArneBab> …actually, the time in NLM is so dependent on transfer time,
> > > that the most efficient strategy would likely be to decrease the block size…
> > > <ArneBab> or to get a faster network
> > > <ArneBab> toad_: got it, damnit: NLM is so much slower than OLM, because
> > > it used less bandwidth!
> > > <ArneBab> the time is a function of the raw bandwidth (not so with OLM),
> > > and NLM used only half my bandwidth after it had been deployed for 2
> > > days (at the start much more)
> > > <ArneBab> when we double the bandwidth (1.8 years?), NLM should be
> > > faster than OLM
> > > <ArneBab> operhiem1: cool!
> > > <ArneBab> toad_: actually I think the slot number calculation is flawed
> > > → less bandwidth used than possible
> > > <ArneBab> that’s why it did not break down, but slowed down to 1/5 OLM.
> > > From the math here I’d have guessed 1/2.6
> > > <ArneBab> but adding SSKs, with many more hops and a time that is almost
> > > pure queue time, it fits
> > > <ArneBab> q_nlm ~ 3·“q”_olm; in the full bandwidth case
> > > <ArneBab> but with half bandwidth we actually are at 6·q_olm
> > > <ArneBab> → more slots should actually make it much better
> > > <ArneBab> toad_: summary: τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ! → using
> > > only 50% of bandwidth (too few slots) massively slows down NLM.
> > > <ArneBab> the transfer times should actually be dominant
> > > <ArneBab> though they are lower than the queue time.
> > > <ArneBab> and freenet should get faster with faster network or lower
> > > chunk sizes.
> > > <ArneBab> toad_: so first step: make sure all bandwidth gets used -
> > > maybe by allocating more slots till about 2× the current number are
> > > transferring
> > > -*- ArneBab is happy
> > > <digger3> cool, lots of stuff to read tomorrow morning. :)
> > > <ArneBab> With the current network, NLM should be slower than OLM by
> > > 23%. But in 18 months it should actually be faster by ~8%, given Moore's
> > > Law holds for upload bandwidth.
> > > <ArneBab> :)
> > > <ArneBab> with faster I mean time to complete a request.
> > > <ArneBab> reaction time — latency
> > > <ArneBab> digger3: maybe you can doublecheck the reasoning
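
To doublecheck the 23% and ~8% figures: they come out of the same numbers if
q_olm is treated as a fixed absolute constant (~16s) while τ and q_nlm scale
with bandwidth, which is my reading of "τ ~ bandwidth, q_olm ~ 16s, q_nlm ~ τ":

    tau = 1.0                        # today's CHK transfer time, normalised
    # today
    time_olm = 2*(tau/3) + tau       # q_olm ~ 1/3 of today's τ      -> 5/3
    time_nlm = 3 * 0.72 * tau        # q_nlm ~ τ_nlm = 0.72·τ        -> 2.16
    print(time_olm / time_nlm)       # ~0.77 -> NLM ~23% slower now

    # after upload bandwidth doubles (~18 months, if Moore's Law held for it):
    # τ halves, q_olm keeps its absolute value, q_nlm still tracks τ_nlm
    time_olm2 = 2*(tau/3) + tau/2    # ~1.17
    time_nlm2 = 3 * 0.72 * (tau/2)   # ~1.08
    print(time_olm2 / time_nlm2)     # ~1.08 -> NLM ~8% faster then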
> --
> A man in the streets faces a knife.
> Two policemen are there at once. They raise a sign:
> 
>     “Illegal Scene! No one may watch this!”
> 
> The man gets robbed and stabbed and bleeds to death.
> The police had to hold the sign.
> 
> …Welcome to Europe, citizen. Censorship is beautiful.
> 
>    ( http://draketo.de/stichwort/censorship )
> 
> 
> 
