Am Montag, 29. August 2011, 14:32:13 schrieb Ian Clarke: > Yes, small tweaks have worked so well for us for the last decade, leaving us > pretty-much where we were in 2003. No, we don't understand how the current > system works, there is no point in trying to fix something when we don't > even know what is broken.
I?d like to present a clue what is broken in NLM. Before I kill you with the log, here?s the result: With NLM the latency of a request is a function of the raw bandwidth (not so with OLM), and NLM used only half my bandwidth after it had been deployed for 2 days (at the start much more). ? ~ bandwidth. q_olm ~ 16s, q_nlm ~ ?! ; with ?: transfer time, q: queue time (time to find the node), nlm: new load management, olm: old load management. So first step: make sure all bandwidth gets used - maybe by allocating more slots till we use all allowed bandwidth. Better having to throttle a transfer than not using bandwidth. *NLM should with the current network be slower than OLM by 23%. But in 18 months it should actually be faster by ~8% ? given Moores Law holds for upload bandwidth ? because the routes are shorter.* The main advantage of NLM is, that it should be much more resilient against attackers (DoS). Now to the log - it?s math and not cleaned up; you have been warned :) <ArneBab> SSK-time: ?, CHK-time: ?, success: Xs, fail: Xf. <ArneBab> queue-time: q, transfer-time: ?, hops remaining: h, total hops: h?, w: success rate <ArneBab> ?s = ?(h) + q(h) <ArneBab> ?f = q(h) <ArneBab> ? ~ w???s + (1-w?)??f <ArneBab> ?s = ?(h) + q(h) <ArneBab> ?f = q(h) <ArneBab> ? ~ w???s + (1-w?)??f; w? ~ 15% <ArneBab> num(?) / num(?) ~ 1 <ArneBab> ? time ~ ? + ? <ArneBab> q(h) depends on timeouts, as do w? and w? <ArneBab> time = w???s + (1-w?)??f + w???s + (1-w?)??f <ArneBab> = w? ? (?(h) + q(h)) + (1-w?)?q(h) + w? ? (?(h) + q(h)) + (1- w?)?q(h) <ArneBab> = t(h) ? (w?+w?) + 2?q(h) ? (2-w?-w?) <ArneBab> = ?(h) ? (w?+w?) + 2?q(h) ? (2-w?-w?) <ArneBab> in the congestion case q(h) ~ timeout <ArneBab> timeout = o <ArneBab> timeout: o <ArneBab> w depends on the timeout *somehow*, but inversely <ArneBab> o=0 ? w=0 <ArneBab> assumption: o = ? ? w? ~ 20%, w? ~ 100% <ArneBab> assumption: o = ? ? w? ~ 0.2, w? ~ 1 <ArneBab> correction: in the congestion case: q(h) ~ min(timeout, ?(h)) <ArneBab> timeout matters for q(h) only when timeout < ?(h) <ArneBab> I try to: I still need a dependency of w on timeout <ArneBab> ? lets call it t(w) <ArneBab> better: w(o) :) <toad_> well, if there is a timeout, we have a fixed time, but we reduce the hops ... <toad_> i thought w was success rate <ArneBab> ah! <ArneBab> and the success rates where in the NLM stats <ArneBab> going mostly smoothly from 60% to 0% <ArneBab> for the HTL <toad_> right, success rate peaks at 18 or sometimes 16 <toad_> what are w1 vs w2? <toad_> chk vs ssk i guess <ArneBab> yes -*- toad_ thinks considering both is probably overambitious at this stage? <ArneBab> should not be too bad: SSKs drop much more rapidly at decreasing hops <ArneBab> hops?HTL <toad_> ?s is time for a successful chk; ?f is time for a failed chk ... in which case h in the first instance is low, and in the second instance is h0 <ArneBab> yes <toad_> okay, i don't follow this line: time = w???s + (1-w?)??f + w???s + (1-w?)??f <toad_> i thought w2 related to SSKs? <ArneBab> uh, yes? <ArneBab> time = w???s + (1-w?)??f + w???s + (1-w?)??f <toad_> you have to appreciate i'm only just getting back into maths and physics after 12 years ... <toad_> (retaking a-levels to get a degree) <ArneBab> no probs, I?m also no expert in this. I try to get a relation between the time and the timeout, so we can try to find a minimum <toad_> in any case, there are two different h's for the two uses of q(h) - h0 and h_avg <toad_> h_avg for success and h0 for failure <ArneBab> hm, yes <ArneBab> which makes this harder? <ArneBab> it?s wrong anyway? the q(h_avg) was missing <toad_> h_avg is somewhere between 5 and 10 imho <toad_> at least it is if everything is working well and the input load isn't all really popular stuff (in which case it's answered quickly and can be ignored) <ArneBab> = ?(h) ? (w?+w?) + q(h) ? (2-w?-w?) + q(h_avg) ? (w?+w?) <ArneBab> would have been too easy :) <toad_> okay so here q(h) means q(h0) i.e. h = h0, max hops? <ArneBab> jepp, and max hops sinks with falling timeout <toad_> hmm? <ArneBab> the max actual hops <toad_> on the upside, q() is linear <-- Torgal (~Torgal at 78.251.49.8) hat das Netzwerk verlassen (Ping timeout: 276 seconds) <toad_> hopefully <ArneBab> yes: q(h) = h?o <ArneBab> (in the congestion case) <toad_> the problem i have is it looks like q(1) ~= time [ for a full request ], unless load is very low <ArneBab> so ? ? q? <ArneBab> ? much smaller than q? <toad_> of course it's bounded by timeouts, but i'd expect a runaway feedback loop until it reaches heavy timeouts and effectively cuts the htl <toad_> well, with OLM, success time for a CHK is 1m25s, unsuccessful is 19sec, so transfer time is at least 1 minute <toad_> and less than 1m25; but with NLM, unsuccessful is 3 min+ <ArneBab> well, for SSKs in OLM the first 6 hops are the most successful, later ones only contribute 1% success, which piles up to ~ 12% <toad_> okay... <ArneBab> (only possible because 85% are unsuccessful) <ArneBab> (otherwise this would be wrong: the contribution of later ones would be smaller) <ArneBab> from the numbers q(h?) ~ ?(h_avg) <toad_> well, queueing time on any hop is the time it takes to get a slot to route to, which is roughly equal to the time it takes for a request to complete divided by the number of waiters, right? <toad_> errr multiplied by the number of waiters <toad_> if the network is homogenous, that's exactly the time it takes for a request to complete <toad_> so we expect ridiculous queue times <toad_> however if there is spare capacity this may be avoidable -*- toad_ hopes you can establish that NLM isn't totally pointless anyway :) <ArneBab> actually that fits it quite well, but it leaves out that routes with NLM should be shorter <ArneBab> and that for me the point of NLM is not speed but attack-resilience <ArneBab> that network can?t be spammed efficiently <ArneBab> simplified: time = (? + q) ? hops ; ? and q as times per hop <ArneBab> hops for CHK are less with NLM <ArneBab> hops for SSK are equal <ArneBab> (most are unsuccessful) <ArneBab> ? time = q(SSK) + ?(CHK) + q(CHK) <ArneBab> in OLM: ?q?(SSK) ~ 16s, ?q?(CHK) ~ 18s, ?(CHK) ~ 45s <ArneBab> (my stats) <ArneBab> in NLM q(SSK) = ?(CHK) <ArneBab> or so <ArneBab> ? there we might have the general problem <ArneBab> toad_: the queue times of SSKs depend on the transfer times of CHKs, so they have to be higher <toad_> well, ian thinks there is a fundamental problem with queueing; the alternative is to allow a larger window between when we start complaining and when stuff breaks i.e. use less % of the total capacity <ArneBab> in NLM: q(SSK) ~ q(CHK) ~ ?(CHK), ?(CHK) lower due to better routes? <toad_> which might be faster in practice <ArneBab> ?(CHK) depends on the length of the route. with 25% better success rates per hop, it should be much lower <ArneBab> ?need NLM stats? do you have some handy? <ArneBab> let?s estimate 60%/50%/50%/50% for HTL 18/17/16/15 <ArneBab> and I currentlo have 45%/50/25%/25% with OLM <ArneBab> starting with 1000 requests, in NLM 600 have 1 hop, 200 have 2 hops, 100 3 and 50 4, 50 have more ? irrelevant. -*- toad_ not following <ArneBab> in OLM 450 have 1 hop, 275 have 2 hops, 69 3 and 51 4, 150 have more <ArneBab> I?m trying to estimate the hops a transfer has to take <ArneBab> we can?t ignore the 150 with more than 4 hops in OLM <ArneBab> I?ll just go down to 50, too <toad_> what are you trying to compute? <toad_> ian is convinced that queueing always makes the underlying problem worse <toad_> i'm inclined to agree with him unless you come up with a persuasive theoretical argument <ArneBab> 120 have 5, 96 have 6, 77 have 7, 61 have 8, 50 have more <ArneBab> so a 95% of the transfers in OLM take on average ? <ArneBab> gah? need to divide the numbers, too <ArneBab> (I need to generate data to make an argument - that?s what I?m doing right now) <ArneBab> average hops for OLM: 450*1 + 275*2 + 69*3 + 51*4 + [now with correction] 150*0.22*5+120*0.2*6+96*0.2*7+77*0.2*8+61*0.2*9 <ArneBab> ? 2087.4 <ArneBab> for NLM 95% of 1000 transfers need 600*1+200*2+100*3+50*4 <ArneBab> = 1500 hops together <ArneBab> that?s 2.09 hops per transfer for OLM and 1.5 hops for NLM ? ?_nlm / ?_olm ~ 0.71 <toad_> ArneBab: okay, that's plausible <toad_> ArneBab: however, it should be possible with smart load limiting on the originator to achieve NLM-level success rates <ArneBab> but not the resilience <ArneBab> it still keeps freenet open to a DoS, NLM should help there. <ArneBab> now back to the queueing: OLM had: ?q?(SSK) ~ 16s, ?q?(CHK) ~ 18s, ?(CHK) ~ 45s (my stats) <toad_> possibly - fair sharing limits our vulnerability to a DoS, possibly enough as long as we don't have to worry about incentives issues <ArneBab> that?s about: q = ? ? ? (OLM) <ArneBab> NLM: q ~ ? <ArneBab> NLM: q ~ ? (NLM) <ArneBab> time: 2?q + ? <ArneBab> OLM: time ~ 5/3 ?_olm <ArneBab> NLM: time = 3 ? 0.72 ?_olm = 2.15 ?_olm <operhiem1> toad_: Alright, it's alive. https://github.com/freenet/fred- staging/pull/55 <ArneBab> ? time_nlm / time_olm ~ 2.15 / (5/3) ~ 1.3 <ArneBab> so the time to transfer should be a bit longer <ArneBab> (not yet finished: this is the current state) <ArneBab> now, if we decrease the timeout time, the chance that a given timeout happens in the first 4 hops should be about 4/20 = 0.2 <ArneBab> ?cut that? <ArneBab> if we decrease the timeout time below the transfer time per hop, there should be more misrouting ? ? goes up, q might go down or up ? cut that. <ArneBab> transfer time per hop in OLM ~ 45s / hops_olm = 45s/2.09 = 21.5s <ArneBab> ?actually, the time in NLM is so dependant on transfer time, that the most efficient stratigy would likely be to decrease the block size? <ArneBab> or to get a faster network <ArneBab> toad_: got it, damnit: NLM is so much slower than OLM, because it used less bandwidth! <ArneBab> the time is a function of the raw bandwidth (not so with OLM), and NLM used only half my bandwidth after it had been deployed for 2 days (at the start much more) <ArneBab> when we double the bandwidth (1.8 years?), NLM should be faster than OLM <ArneBab> operhiem1: cool! <ArneBab> toad_: actually I think the slot number calculation is flawed ? less bandwith used than possible <ArneBab> that?s why it did not break down, but slowed down to 1/5 OLM. From the math here I?d have guessed 1/2.6 <ArneBab> but adding SSKs with many more hops and time almost pure queue time it fits <ArneBab> q_nlm ~ 3??q?_olm; in the full bandwidth case <ArneBab> but with half bandwidth we actually are at 6?q_olm <ArneBab> ? more slots should actually make it much better <ArneBab> toad_: summary: ? ~ bandwidth. q_olm ~ 16s, q_nlm ~ ?! ? using only 50% of bandwidth (too little slots) massively slows down NLM. <ArneBab> the transfer times should actually be dominant <ArneBab> though they are lower than the queue time. <ArneBab> and freenet should get faster with faster network or lower chunk sizes. <ArneBab> toad_: so first step: make sure all bandwidth gets used - maybe by allocating more slots till about 2? the current number are transferring -*- ArneBab is happy <digger3> cool, lot's of stuff to read tomorrow morning. :) <ArneBab> NLM should with the current network be slower than OLM by 23%. But in 18 month it should actually be faster by ~8%, given Moores Law holds for upload bandwidth. <ArneBab> :) <ArneBab> with faster I mean time to complete a request. <ArneBab> reaction time ? latency <ArneBab> digger3: maybe you can doublecheck the reasoning -- Konstruktive Kritik: - http://draketo.de/licht/krude-ideen/konstruktive-kritik -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 316 bytes Desc: This is a digitally signed message part. URL: <https://emu.freenetproject.org/pipermail/devl/attachments/20110829/37cfcada/attachment.pgp>