Okay, I don't understand most of that. I might be able to check the math if it 
were written out properly, but it looks difficult. However, as far as I can see:
- The most obvious way to increase bandwidth usage would be to increase the 
timeout time for output bandwidth liability (and at the same time increase the 
relevant block transfer timeouts).
- This would increase the number of slots, but it would also increase the 
number of requests seeking them, so I don't see why it would help matters 
(see the sketch after this list).
- Running an excessive number of requests without adjusting the block transfer 
timeouts would result in some of the transfers timing out.
- It would also, as you mention, make SSK requests (especially failed SSK 
requests) even slower.
- I am quite confident that Moore's Law DOES NOT hold for upload bandwidth.
- As far as I can see, all the benefits of NLM re attackers are achieved by 
fair sharing.
- A system which is more resistant to attacks but slower probably isn't all 
that interesting if the attacks in question are relatively expensive anyway.
- Smaller block sizes would have a significant efficiency cost, and would 
probably make load management more difficult.
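
To illustrate the first two bullets with a toy model (the formula and numbers
below are made up for illustration, they are not fred's actual output bandwidth
liability accounting): widening the liability window adds slots, but requests
also hang around waiting for a slot correspondingly longer, so the backlog per
slot does not improve.

    # Hypothetical model, illustrative numbers only.
    BYTES_PER_REQUEST = 32 * 1024   # roughly one CHK block
    OUTPUT_LIMIT = 20 * 1024        # bytes/sec a node will commit (made up)
    ARRIVALS = 5.0                  # requests/sec offered to the node (made up)

    for window in (90, 180):        # output bandwidth liability window, seconds
        slots = OUTPUT_LIMIT * window / BYTES_PER_REQUEST
        # Requests keep queueing for a slot for up to ~window seconds before
        # they time out, so the pool of requests seeking a slot grows with
        # the window as well:
        waiters = ARRIVALS * window
        print(f"window={window:3d}s  slots={slots:5.0f}  waiters={waiters:5.0f}"
              f"  waiters per slot={waiters / slots:.2f}")
    # Both columns double with the window; the waiters-per-slot ratio does not
    # move, so raising the timeout does not by itself relieve the congestion.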

I apologise if you see this as rather a broadside after I encouraged you to 
analyse the problem, but it is not yet a convincing demonstration that queueing 
is a viable strategy and not the spawn of satan! :)

On Monday 29 Aug 2011 21:27:01 Arne Babenhauserheide wrote:
> Am Montag, 29. August 2011, 14:32:13 schrieb Ian Clarke:
> > Yes, small tweaks have worked so well for us for the last decade, leaving us
> > pretty-much where we were in 2003.  No, we don't understand how the current
> > system works, there is no point in trying to fix something when we don't
> > even know what is broken.
> 
> I’d like to present a clue about what is broken in NLM. Before I kill you 
> with the log, here’s the result: 
> 
>       With NLM the latency of a request is a function of the raw bandwidth 
>       (not so with OLM), and NLM used only half my bandwidth after it had 
>       been deployed for 2 days (at the start much more).
> 
> τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ!  (τ: transfer time, q: queue time, 
> i.e. the time to find the node; NLM: new load management, OLM: old load 
> management.)
> 
> So first step: make sure all bandwidth gets used - maybe by allocating more 
> slots until we use all allowed bandwidth. Better to have to throttle a 
> transfer than to leave bandwidth unused.
> 
> *With the current network, NLM should be slower than OLM by 23%. But in 18 
> months it should actually be faster by ~8% (given that Moore's Law holds for 
> upload bandwidth), because the routes are shorter.*
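
For what it's worth, the two headline figures can be reproduced from the
simplified model in the log (per-request time = 2·q + τ, q ≈ ⅓·τ for OLM,
q ≈ τ for NLM, and τ_nlm ≈ 0.72·τ_olm from the shorter routes), together with
the stated assumption that OLM queue times stay fixed while transfer times
halve when upload bandwidth doubles. A quick check, in Python:

    # Numbers from the log; the bandwidth doubling is the Moore's-law assumption.
    TAU_OLM = 1.0                    # normalise today's OLM transfer time to 1
    TAU_NLM = 0.72 * TAU_OLM         # shorter routes under NLM (hop estimate below)

    def olm_time(tau, q):
        return 2 * q + tau           # OLM queue time is set by the ~16-18s window

    def nlm_time(tau):
        return 2 * tau + tau         # under NLM q ~ tau, so time ~ 3*tau

    q_olm = TAU_OLM / 3              # today: q is about a third of the transfer time
    now = nlm_time(TAU_NLM) / olm_time(TAU_OLM, q_olm)
    print(f"today: NLM/OLM = {now:.2f}")
    # ~1.30, i.e. NLM takes ~30% longer (equivalently, OLM takes ~23% less time)

    # After bandwidth doubles, both transfer times halve but the OLM queue does not.
    later = nlm_time(TAU_NLM / 2) / olm_time(TAU_OLM / 2, q_olm)
    print(f"after doubling: NLM/OLM = {later:.2f}")
    # ~0.93, i.e. NLM would be ~7-8% faster, matching the claim above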
> 
> The main advantage of NLM is that it should be much more resilient against 
> attackers (DoS).
> 
> Now to the log - it’s math and not cleaned up; you have been warned :)
> 
> <ArneBab> SSK-time: σ, CHK-time: ψ, success: Xs, fail: Xf. 
> <ArneBab> queue-time: q, transfer-time: τ, hops remaining: h, total hops: h₀, 
> w: success rate
> <ArneBab> ψs = τ(h) + q(h)
> <ArneBab> ψf = q(h)
> <ArneBab> ψ ~ w₁·ψs + (1-w₁)·ψf
> <ArneBab> σs = τ(h) + q(h)
> <ArneBab> σf = q(h)
> <ArneBab> σ ~ w₂·ψs + (1-w₂)·ψf; w₂ ~ 15%
> <ArneBab> num(ψ) / num(σ) ~ 1
> <ArneBab> → time ~ σ + ψ
> <ArneBab> q(h) depends on timeouts, as do w₁ and w₂
> <ArneBab> time =  w₁·ψs + (1-w₁)·ψf +  w₂·ψs + (1-w₂)·ψf
> <ArneBab> = w₁ · (τ(h) + q(h)) + (1-w₁)·q(h) + w₂ · (τ(h) + q(h)) + (1-w₂)·q(h)
> <ArneBab> = t(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
> <ArneBab> = τ(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
> <ArneBab> in the congestion case q(h) ~ timeout
> <ArneBab> timeout = o
> <ArneBab> timeout: o
> <ArneBab> w depends on the timeout *somehow*, but inversely
> <ArneBab> o=0 → w=0
> <ArneBab> assumption: o = ∞ → w₂ ~ 20%, w₁ ~ 100%
> <ArneBab> assumption: o = ∞ → w₂ ~ 0.2, w₁ ~ 1
> <ArneBab> correction: in the congestion case: q(h) ~ min(timeout, τ(h))
> <ArneBab> timeout matters for q(h) only when timeout < τ(h)
> <ArneBab> I try to: I still need a dependency of w on timeout
> <ArneBab> … lets call it t(w)
> <ArneBab> better: w(o) :)
> <toad_> well, if there is a timeout, we have a fixed time, but we reduce the 
> hops ...
> <toad_> i thought w was success rate
> <ArneBab> ah! 
> <ArneBab> and the success rates were in the NLM stats
> <ArneBab> going mostly smoothly from 60% to 0%
> <ArneBab> for the HTL
> <toad_> right, success rate peaks at 18 or sometimes 16
> <toad_> what are w1 vs w2?
> <toad_> chk vs ssk i guess
> <ArneBab> yes
> -*- toad_ thinks considering both is probably overambitious at this stage?
> <ArneBab> should not be too bad: SSKs drop much more rapidly at decreasing 
> hops
> <ArneBab> hops→HTL
> <toad_> ψs is time for a successful chk; ψf is time for a failed chk ... in 
> which case h in the first instance is low, and in the second instance is h0
> <ArneBab> yes
> <toad_> okay, i don't follow this line: time =  w₁·ψs + (1-w₁)·ψf +  w₂·ψs + 
> (1-w₂)·ψf
> <toad_> i thought w2 related to SSKs?
> <ArneBab> uh, yes…
> <ArneBab> time =  w₁·ψs + (1-w₁)·ψf +  w₂·σs + (1-w₂)·σf
> <toad_> you have to appreciate i'm only just getting back into maths and 
> physics after 12 years ...
> <toad_> (retaking a-levels to get a degree)
> <ArneBab> no probs, I’m also no expert in this. I try to get a relation 
> between the time and the timeout, so we can try to find a minimum
> <toad_> in any case, there are two different h's for the two uses of q(h) - 
> h0 and h_avg
> <toad_> h_avg for success and h0 for failure
> <ArneBab> hm, yes
> <ArneBab> which makes this harder… 
> <ArneBab> it’s wrong anyway… the q(h_avg) was missing
> <toad_> h_avg is somewhere between 5 and 10 imho
> <toad_> at least it is if everything is working well and the input load isn't 
> all really popular stuff (in which case it's answered quickly and can be 
> ignored)
> <ArneBab> = τ(h) · (w₁+w₂) + q(h) · (2-w₁-w₂) + q(h_avg) · (w₁+w₂)
> <ArneBab> would have been too easy :)
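
The corrected expression, written out as a function (a direct restatement of
the line above, with h_avg for successful requests and h0 for failed ones as
discussed; the example numbers are made up, apart from w2 ~ 15% from earlier
in the log):

    def expected_request_time(w1, w2, tau_avg, q_avg, q_full):
        # w1/w2: CHK/SSK success rates; tau_avg/q_avg: transfer and queue time
        # over the average successful path (h_avg hops); q_full: queue time
        # over the full h0 hops walked by a failed request.
        chk = w1 * (tau_avg + q_avg) + (1 - w1) * q_full
        ssk = w2 * (tau_avg + q_avg) + (1 - w2) * q_full
        # equals tau_avg*(w1+w2) + q_avg*(w1+w2) + q_full*(2-w1-w2)
        return chk + ssk

    # Example call, made-up numbers just to show the shape:
    print(expected_request_time(w1=0.5, w2=0.15, tau_avg=45, q_avg=18, q_full=60))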
> <toad_> okay so here q(h) means q(h0) i.e. h = h0, max hops?
> <ArneBab> yep, and the max hops drops as the timeout falls
> <toad_> hmm?
> <ArneBab> the max actual hops
> <toad_> on the upside, q() is linear
> <toad_> hopefully
> <ArneBab> yes: q(h) = h·o
> <ArneBab> (in the congestion case)
> <toad_> the problem i have is it looks like q(1) ~= time [ for a full request 
> ], unless load is very low
> <ArneBab> so τ « q?
> <ArneBab> τ much smaller than q?
> <toad_> of course it's bounded by timeouts, but i'd expect a runaway feedback 
> loop until it reaches heavy timeouts and effectively cuts the htl
> <toad_> well, with OLM, success time for a CHK is 1m25s, unsuccessful is 
> 19sec, so transfer time is at least 1 minute
> <toad_> and less than 1m25; but with NLM, unsuccessful is 3 min+
> <ArneBab> well, for SSKs in OLM the first 6 hops are the most successful, 
> later ones only contribute 1% success, which piles up to ~ 12%
> <toad_> okay...
> <ArneBab> (only possible because 85% are unsuccessful)
> <ArneBab> (otherwise this would be wrong: the contribution of later ones 
> would be smaller)
> <ArneBab> from the numbers q(h₀) ~ τ(h_avg)
> <toad_> well, queueing time on any hop is the time it takes to get a slot to 
> route to, which is roughly equal to the time it takes for a request to 
> complete divided by the number of waiters, right?
> <toad_> errr multiplied by the number of waiters
> <toad_> if the network is homogenous, that's exactly the time it takes for a 
> request to complete
> <toad_> so we expect ridiculous queue times
> <toad_> however if there is spare capacity this may be avoidable
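
One way to write that estimate out, treating waiters and slots as per-node
totals, so that "homogeneous" means roughly as many waiters as slots (the
numbers in the example call are made up):

    def queue_time_per_hop(request_completion_time, waiters, slots):
        # A slot frees up every (completion_time / slots) seconds on average,
        # and a newly queued request has `waiters` requests ahead of it.
        return waiters * request_completion_time / slots

    # Saturated and homogeneous: waiters ~ slots, so the wait at each hop is
    # roughly the time of a whole request, hence the ridiculous queue times
    # unless there is spare capacity.
    print(queue_time_per_hop(request_completion_time=120, waiters=50, slots=50))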
> -*- toad_ hopes you can establish that NLM isn't totally pointless anyway :)
> <ArneBab> actually that fits it quite well, but it leaves out that routes 
> with NLM should be shorter
> <ArneBab> and that for me the point of NLM is not speed but attack-resilience
> <ArneBab> so that the network can’t be spammed efficiently
> <ArneBab> simplified: time = (τ + q) · hops ; τ and q as times per hop
> <ArneBab> hops for CHK are less with NLM
> <ArneBab> hops for SSK are equal
> <ArneBab> (most are unsuccessful)
> <ArneBab> → time = q(SSK) + τ(CHK) + q(CHK)
> <ArneBab> in OLM: “q”(SSK) ~ 16s, “q”(CHK) ~ 18s, τ(CHK) ~ 45s 
> <ArneBab> (my stats)
> <ArneBab> in NLM q(SSK) = τ(CHK)
> <ArneBab> or so
> <ArneBab> → there we might have the general problem
> <ArneBab> toad_: the queue times of SSKs depend on the transfer times of 
> CHKs, so they have to be higher
> <toad_> well, ian thinks there is a fundamental problem with queueing; the 
> alternative is to allow a larger window between when we start complaining and 
> when stuff breaks i.e. use less % of the total capacity
> <ArneBab> in NLM: q(SSK) ~ q(CHK) ~ τ(CHK), τ(CHK) lower due to better routes?
> <toad_> which might be faster in practice
> <ArneBab> τ(CHK) depends on the length of the route. with 25% better success 
> rates per hop, it should be much lower
> <ArneBab> …need NLM stats… do you have some handy?
> <ArneBab> let’s estimate 60%/50%/50%/50%  for HTL 18/17/16/15
> <ArneBab> and I currently have 45%/50%/25%/25% with OLM
> <ArneBab> starting with 1000 requests, in NLM 600 have 1 hop, 200 have 2 
> hops, 100 have 3 and 50 have 4; 50 have more → irrelevant.
> -*- toad_ not following
> <ArneBab> in OLM 450 have 1 hop, 275 have 2 hops, 69 have 3 and 51 have 4, 
> 150 have more
> <ArneBab> I’m trying to estimate the hops a transfer has to take
> <ArneBab> we can’t ignore the 150 with more than 4 hops in OLM
> <ArneBab> I’ll just go down to 50, too
> <toad_> what are you trying to compute?
> <toad_> ian is convinced that queueing always makes the underlying problem 
> worse
> <toad_> i'm inclined to agree with him unless you come up with a persuasive 
> theoretical argument
> <ArneBab> 120 have 5, 96 have 6, 77 have 7, 61 have 8, 50 have more
> <ArneBab> so 95% of the transfers in OLM take on average …
> <ArneBab> gah… need to divide the numbers, too
> <ArneBab> (I need to generate data to make an argument - that’s what I’m 
> doing right now)
> <ArneBab> average hops for OLM: 450*1 + 275*2 + 69*3 + 51*4 + [now with 
> correction] 150*0.22*5+120*0.2*6+96*0.2*7+77*0.2*8+61*0.2*9
> <ArneBab> → 2087.4
> <ArneBab> for NLM 95% of 1000 transfers need 600*1+200*2+100*3+50*4
> <ArneBab> = 1500 hops together
> <ArneBab> that’s 2.09 hops per transfer for OLM and 1.5 hops for NLM 
> → τ_nlm / τ_olm ~ 0.71
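
The hop estimate above, transcribed into a quick script (the per-HTL success
rates and the 0.2/0.22 tail weights are taken from the log as-is):

    # NLM: success rates 60/50/50/50 % for the first four hops, 1000 requests.
    nlm_hops = 600 * 1 + 200 * 2 + 100 * 3 + 50 * 4           # = 1500
    # OLM: 45/50/25/25 % for the first four hops, then the long tail weighted
    # as in the log above.
    olm_hops = (450 * 1 + 275 * 2 + 69 * 3 + 51 * 4
                + 150 * 0.22 * 5 + 120 * 0.2 * 6 + 96 * 0.2 * 7
                + 77 * 0.2 * 8 + 61 * 0.2 * 9)                # = 2087.4
    print(nlm_hops / 1000, olm_hops / 1000)   # 1.5 vs ~2.09 hops per transfer
    print(round(nlm_hops / olm_hops, 2))      # ~0.72 (the log rounds to 0.71)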
> <toad_> ArneBab: okay, that's plausible
> <toad_> ArneBab: however, it should be possible with smart load limiting on 
> the originator to achieve NLM-level success rates
> <ArneBab> but not the resilience
> <ArneBab> it still leaves freenet open to a DoS; NLM should help there.
> <ArneBab> now back to the queueing: OLM had: “q”(SSK) ~ 16s, “q”(CHK) ~ 18s, 
> τ(CHK) ~ 45s (my stats)
> <toad_> possibly - fair sharing limits our vulnerability to a DoS, possibly 
> enough as long as we don't have to worry about incentives issues
> <ArneBab> that’s about: q = ⅓ · τ (OLM)
> <ArneBab> NLM: q ~ τ
> <ArneBab> NLM: q ~ τ (NLM)
> <ArneBab> time: 2·q + τ
> <ArneBab> OLM: time ~ 5/3 τ_olm
> <ArneBab> NLM: time = 3 · 0.72 τ_olm = 2.15 τ_olm
> <operhiem1> toad_: Alright, it's alive. 
> https://github.com/freenet/fred-staging/pull/55
> <ArneBab> → time_nlm / time_olm ~ 2.15 / (5/3) ~ 1.3
> <ArneBab> so the time to transfer should be a bit longer
> <ArneBab> (not yet finished: this is the current state)
> <ArneBab> now, if we decrease the timeout time, the chance that a given 
> timeout happens in the first 4 hops should be about 4/20 = 0.2
> <ArneBab> …cut that…
> <ArneBab> if we decrease the timeout time below the transfer time per hop, 
> there should be more misrouting → τ goes up, q might go down or up → cut that.
> <ArneBab> transfer time per hop in OLM ~ 45s / hops_olm = 45s/2.09 = 21.5s
> <ArneBab> …actually, the time in NLM is so dependent on transfer time that 
> the most efficient strategy would likely be to decrease the block size… 
> <ArneBab> or to get a faster network
> <ArneBab> toad_: got it, damnit: NLM is so much slower than OLM, because it 
> used less bandwidth!
> <ArneBab> the time is a function of the raw bandwidth (not so with OLM), and 
> NLM used only half my bandwidth after it had been deployed for 2 days (at the 
> start much more)
> <ArneBab> when we double the bandwidth (1.8 years?), NLM should be faster 
> than OLM
> <ArneBab> operhiem1: cool!
> <ArneBab> toad_: actually I think the slot number calculation is flawed 
> → less bandwidth used than possible
> <ArneBab> that’s why it did not break down, but slowed down to 1/5 OLM. From 
> the math here I’d have guessed 1/2.6
> <ArneBab> but if we add SSKs, which take many more hops and whose time is 
> almost pure queue time, it fits
> <ArneBab> q_nlm ~ 3·“q”_olm; in the full bandwidth case
> <ArneBab> but with half bandwidth we actually are at 6·q_olm
> <ArneBab> → more slots should actually make it much better
> <ArneBab> toad_: summary: τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ! → using 
> only 50% of the bandwidth (too few slots) massively slows down NLM.
> <ArneBab> the transfer times should actually be dominant
> <ArneBab> though they are lower than the queue time. 
> <ArneBab> and freenet should get faster with a faster network or smaller 
> chunk sizes.
> <ArneBab> toad_: so first step: make sure all bandwidth gets used - maybe by 
> allocating more slots till about 2× the current number are transferring
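
A sketch of that first step as a feedback rule (the function name, thresholds
and structure here are hypothetical, not fred's actual slot accounting; it is
only meant to illustrate "open more slots until the allowed bandwidth is
actually used, and back off if transfers start timing out"):

    def adjust_slot_limit(slot_limit, used_bandwidth, allowed_bandwidth,
                          transfers_timing_out):
        # Called periodically. Grow the number of concurrently accepted
        # requests while upload bandwidth sits idle; shrink it again if block
        # transfers start timing out, i.e. we overshot.
        if transfers_timing_out:
            return max(1, slot_limit - 1)
        if used_bandwidth < 0.9 * allowed_bandwidth:
            return slot_limit + 1
        return slot_limit

    # The suggestion in the log is roughly: keep going until about twice the
    # current number of transfers are running at once.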
> -*- ArneBab is happy
> <digger3> cool, lots of stuff to read tomorrow morning. :)
> <ArneBab> with the current network, NLM should be slower than OLM by 23%. 
> But in 18 months it should actually be faster by ~8%, given that Moore's Law 
> holds for upload bandwidth.
> <ArneBab> :)
> <ArneBab> by “faster” I mean the time to complete a request.
> <ArneBab> reaction time — latency
> <ArneBab> digger3: maybe you can doublecheck the reasoning

_______________________________________________
Devl mailing list
Devl@freenetproject.org
http://freenetproject.org/cgi-bin/mailman/listinfo/devl
