Re: [freenet-dev] Beyond New Load Management: A proposal
On Wednesday, 31 August 2011, 13:25:35, Matthew Toseland wrote:
> On Tuesday 30 Aug 2011 21:02:54 Arne Babenhauserheide wrote:
> > what this means: if a SSK has a mean success rate of 0.16, then using
> > 0.25 as value makes sure that 95% of the possible cases don’t exhaust
> > the bandwidth. We then use only 64% of the bandwidth on average,
> > though. With 0.2, we’d get 68% of the possible distributions safe and
> > use 80% of bandwidth on average.
> Only if there are no clusters - a single requestor fetches a bunch of stuff
> that is all already there, rather than polling keys that usually aren't.
> IMHO there will be. For instance, when a new chat client is started up, a
> lot of what it does will be fetching existing messages rather than polling
> for new ones.

But these are in cache, so the routes will be very short. And SSKs with very high HTL (18, 17, 16) have good success rates, so this change will affect them much less than low-HTL SSKs (the unsuccessful ones). After HTL 15 or so, their success rate drops massively - which is quite easy to explain: The content mostly isn’t there. Simply take the success rate per HTL - as in the statistics window.

Best wishes,
Arne

signature.asc
Description: This is a digitally signed message part.
___
Devl mailing list
Devl@freenetproject.org
http://freenetproject.org/cgi-bin/mailman/listinfo/devl
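The weighting Arne suggests (take the success rate per HTL, as in the statistics window) could be sketched like this. The rate table, function name, and sizes below are illustrative assumptions, not measured values or Freenet code:

```python
# Sketch: weight each SSK's reserved bandwidth by its observed per-HTL
# success rate instead of assuming every SSK succeeds.
# The success-rate table is illustrative, not measured data.

SSK_SUCCESS_RATE_BY_HTL = {18: 0.6, 17: 0.6, 16: 0.6, 15: 0.3, 14: 0.1}
DEFAULT_SSK_SUCCESS_RATE = 0.16  # network-wide mean quoted in the thread

def reserved_bandwidth(htl: int, transfer_size_on_success: int) -> float:
    """Expected bandwidth cost of one SSK request at the given HTL."""
    rate = SSK_SUCCESS_RATE_BY_HTL.get(htl, DEFAULT_SSK_SUCCESS_RATE)
    return rate * transfer_size_on_success

# A high-HTL SSK (good success rate) reserves much more than a low-HTL one:
print(reserved_bandwidth(18, 1024))  # 614.4
print(reserved_bandwidth(12, 1024))  # 163.84
```

This makes concrete why the change barely touches high-HTL SSKs: their reservation already approximates their real cost, while low-HTL SSKs stop over-reserving.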
Re: [freenet-dev] Beyond New Load Management: A proposal
On Monday 29 Aug 2011 20:34:01 Ian Clarke wrote:
> On Mon, Aug 29, 2011 at 2:21 PM, Matthew Toseland wrote:
> > > > - What is the average load reported by responses this node forwarded, per remote node
> > >
> > > Ahhh, this one could be interesting - you could use it to penalise nodes which spam excessively.
> >
> > Actually, thinking about it, I don't think this will work. Requests are usually distributed evenly across the keyspace, and downstream nodes only know they are from this node - not which previous node they are from. So the pain will be shared across all the peers sending us requests, including our own local requests, and won't be identifiable as due to any single node. You have to do it by volume, not by reported load.
>
> But if a node is being abusive won't the nodes its requesting from tend to have a higher than average load?

If its requests differ from the node's overall specialisation, and so are routed to a particular subset of peers, then those peers' loads might differ from the average. But this is not necessarily true if it is a deliberate attack.

> Perhaps you are right though, but this is the kind of thought-process we need.

To prevent a node from using an excessive proportion of our capacity, we need to *measure the proportion of our capacity that the node uses*. This much is straightforward IMHO. And that's the basis of fair sharing between peers.
Re: [freenet-dev] Beyond New Load Management: A proposal
On Tuesday 30 Aug 2011 21:02:54 Arne Babenhauserheide wrote:
> On Tuesday, 30 August 2011, 01:08:16, Arne Babenhauserheide wrote:
> > 5) solution: count each SSK as only
> > average_SSK_success_rate * data_to_transfer_on_success.
>
> Some more data:
>
> chances of having at least this many successful transfers for 40 SSKs with a
> mean success rate of 16%:
>
> for i in {0..16}; do echo $i $(./spielfaehig.py 0.16 40 $i); done
>
> 0 1.0
> 1 0.999064224991
> 2 0.99193451064
> 3 0.965452714478
> 4 0.901560126912
> 5 0.788987472629
> 6 0.634602118184
> 7 0.463062835467
> 8 0.304359825607
> 9 0.179664603573
> 10 0.0952149293922
> 11 0.0453494074947
> 12 0.0194452402752
> 13 0.00752109980912
> 14 0.0026291447461
> 15 0.000832100029072
> 16 0.00023879002726
>
> what this means: if a SSK has a mean success rate of 0.16, then using 0.25 as
> value makes sure that 95% of the possible cases don’t exhaust the bandwidth.
> We then use only 64% of the bandwidth on average, though. With 0.2, we’d get
> 68% of the possible distributions safe and use 80% of bandwidth on average.

Only if there are no clusters - a single requestor fetches a bunch of stuff that is all already there, rather than polling keys that usually aren't. IMHO there will be. For instance, when a new chat client is started up, a lot of what it does will be fetching existing messages rather than polling for new ones.
Re: [freenet-dev] Beyond New Load Management: A proposal
On Tuesday, 30 August 2011, 01:08:16, Arne Babenhauserheide wrote:
> 5) solution: count each SSK as only
> average_SSK_success_rate * data_to_transfer_on_success.

Some more data:

chances of having at least this many successful transfers for 40 SSKs with a mean success rate of 16%:

for i in {0..16}; do echo $i $(./spielfaehig.py 0.16 40 $i); done

0 1.0
1 0.999064224991
2 0.99193451064
3 0.965452714478
4 0.901560126912
5 0.788987472629
6 0.634602118184
7 0.463062835467
8 0.304359825607
9 0.179664603573
10 0.0952149293922
11 0.0453494074947
12 0.0194452402752
13 0.00752109980912
14 0.0026291447461
15 0.000832100029072
16 0.00023879002726

what this means: if a SSK has a mean success rate of 0.16, then using 0.25 as value makes sure that 95% of the possible cases don’t exhaust the bandwidth. We then use only 64% of the bandwidth on average, though. With 0.2, we’d get 68% of the possible distributions safe and use 80% of bandwidth on average.

Note: this is just a binomial spread:

    from math import factorial

    fac = factorial

    def nük(n, k):
        if k > n:
            return 0
        return fac(n) / (fac(k) * fac(n - k))

    def binom(p, n, k):
        return nük(n, k) * p**k * (1 - p)**(n - k)

    def spielfähig(p, n, min_spieler):
        return sum([binom(p, n, k) for k in range(min_spieler, n + 1)])

→ USK@6~ZDYdvAgMoUfG6M5Kwi7SQqyS-gTcyFeaNN1Pf3FvY,OSOT4OEeg4xyYnwcGECZUX6~lnmYrZsz05Km7G7bvOQ,AQACAAE/bab/9/Content-D426DC7.html

Best wishes,
Arne
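As a cross-check, the same upper-tail computation can be written with `math.comb`. The mail does not show `spielfaehig.py` itself, so this is an assumed reconstruction of what it computes:

```python
from math import comb

def tail(p: float, n: int, k_min: int) -> float:
    """P(at least k_min successes out of n trials with success probability p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

# Counting each SSK as 0.25 reserves bandwidth for 0.25 * 40 = 10 transfers;
# the reservation is only exceeded when 11 or more of the 40 succeed:
print(tail(0.16, 40, 11))   # ~0.045, i.e. ~95% of cases stay within the reservation

# Average fraction of the reserved bandwidth actually used:
print(0.16 / 0.25)          # 0.64
```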
Re: [freenet-dev] Beyond New Load Management: A proposal
From: "Matthew Toseland"
> But the other question is, can queueing ever be helpful? It can if it allows
> us to route more accurately (which NLM clearly does), and/or to run enough
> requests in parallel that the longer time taken for the request to reach its
> destination is offset. Is this condition met?

Experience with the deployed NLM showed that even in the fully congested case it had success rates of 60% for HTL 18, 17 and 16, compared to less than 40% for OLM. This means that the requests are sent over fewer hops on average, because they find the content fewer hops away from the requester.

A download of 1 MiB which is sent over 2 hops needs 2 MiB in total network bandwidth. If it is sent over only 1.5 hops on average, then it needs only 1.5 MiB total network bandwidth. So essentially NLM can distribute 30% more content with the same network resources¹. And these numbers are actual observations.

The only reason why this did not result in increased performance is that the nodes used less than 50% of their allocated bandwidth² - which is a problem with the bandwidth scheduler and not with queueing.

Best wishes,
Arne

¹: The relevant network resource is upload bandwidth.
²: Source: observations from me and two other freenet users.

PS: How exactly the bandwidth limiter is fixed is an implementation detail. I think you are actually the only person who can judge how to do this most efficiently.
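The hop arithmetic above is easy to verify in a couple of lines (a sketch; the 1 MiB size and the 2 vs 1.5 hop counts are the figures quoted in the mail):

```python
def total_network_bandwidth(content_mib: float, mean_hops: float) -> float:
    """Each hop a transfer crosses consumes the full content size again."""
    return content_mib * mean_hops

olm = total_network_bandwidth(1.0, 2.0)   # 2.0 MiB moved across the network
nlm = total_network_bandwidth(1.0, 1.5)   # 1.5 MiB moved across the network

# With the same network capacity, shorter routes deliver olm/nlm times as
# much content - a ~33% gain (quoted as "30% more" in the mail):
print(olm / nlm)
```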
Re: [freenet-dev] Beyond New Load Management: A proposal
On Tuesday 30 Aug 2011 00:18:35 Matthew Toseland wrote: > On Tuesday 30 Aug 2011 00:08:16 Arne Babenhauserheide wrote: > > After discussing and going deeper into this, it became apparent that the > > problem is not overload of queues. I’ll repeat the result I came to here > > (and > > not in chatlog form :) ): > > > > The problem is in the load limiter: > > > > > > 1) we need to use all bandwidth, because latency depends on the number of > > finishing transfers. > > > > 2) SSKs fail 85% of the time. > > > > 3) the load limiter assumes that all SSKs succeed and reserves bandwidth > > for > > them. > > > > 4) this now bites us, because NLM has shorter transfer times and longer > > wait > > times. So the over-reservation of bandwidth for SSKs lasts longer. > > > > 5) solution: count each SSK as only > > average_SSK_success_rate * data_to_transfer_on_success. > > > > > > To (1): When we have less bandwidth, than less CHKs succeed per second, so > > less slots get freed and the queueing time it bigger. > > Another way to look at this: > > Both me and Ian have been assuming that the resource is some number > representing a property of the node, for example, how many requests are > running. But the real resource is the external bandwidth usage, which is > rather different. > > An alternative solution - if we are worried about bursts of requests > succeeding and therefore timing out - is to allocate a fixed chunk of > bandwidth for SSKs, and not let them interfere with CHKs. Then CHKs can have > long transfer times, and therefore long queueing times when they fail (but > this is only ~ 40% of the time with NLM's accurate routing), and SSKs can > have short transfer times, and therefore short queueing times. This may be > less disruptive and should also be relatively easy to implement. > > But the other question is, can queueing ever be helpful? 
It can if it allows > us to route more accurately (which NLM clearly does), and/or to run enough > requests in parallel that the longer time taken for the request to reach its > destination is offset. Is this condition met? Yes it is! If a request takes P seconds without queueing, and P+Q seconds with queueing, then if we have the same number of requests running at once, we have a slowdown of (P+Q)/P. However, if we start (P+Q)/P times more requests, and the queueing time remains constant, we can compensate for this. Does it? There are two main components: 1) The queueing time. This is roughly proportional to (number of requests running) / (number of slots available). If we increase the capacity of the node, we multiply both the top and bottom by the same factor - so the queueing time is unchanged, unless the data transfer time is changed. 2) The data transfer time. In the above case with a slowdown, the data transfer time is the same, or a little shorter, because most of our bandwidth isn't being used. If we increase it by running more requests at once, it should reach where it was, and if we increase it further, transfers will start to take longer again. ArneBab has analysed this in more detail and shown that the largest component of the queueing time is proportional to the data transfer time, hence, separating SSKs (which have a very short transfer time) from CHKs (which have a much longer transfer time) would help significantly. The key advantage of NLM, of course, is that it routes accurately, resulting in 10%-30% improvements in success rates, and transferring data over significantly fewer hops (= more effective bandwidth). > > > > Best wishes, > > Arne > > > > > > > > Am Montag, 29. August 2011, 22:27:47 schrieb Matthew Toseland: > > > Okay, I don't understand most of that, I might be able to check the math > > > if > > > it was written properly, but it looks difficult. 
> > > However, as far as I can see:
> > > - The most obvious way to increase bandwidth usage would be to increase the timeout time for output bandwidth liability (and at the same time increase the relevant block transfer timeouts).
> > > - This would increase the number of slots but it would also increase the number of requests seeking them; I don't see why it would help matters.
> > > - Running an excessive number of requests without adjusting the block transfer timeouts would result in some of the transfers timing out.
> > > - It would also, as you mention, make SSK requests (especially failed SSK requests) even slower.
> > > - I am quite confident that Moore's Law DOES NOT hold for upload bandwidth.
> > > - As far as I can see, all the benefits of NLM re attackers are achieved by fair sharing.
> > > - A system which is more resistant to attacks but slower probably isn't all that interesting if the attacks in question are relatively expensive anyway.
> > > - Smaller block sizes would have a significant efficiency cost, and would probably make load management more difficult.
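The compensation argument in the reply above (a slowdown of (P+Q)/P that can be offset by running (P+Q)/P times more requests in parallel, provided the queueing time stays constant) can be sketched numerically. P, Q and the request count here are illustrative assumptions:

```python
# Throughput = requests in flight / latency (Little's law). If queueing adds
# Q seconds to a P-second request, scaling the number of in-flight requests
# by the slowdown factor restores the original throughput - as long as Q
# itself does not grow with the extra load.

P = 30.0   # transfer time in seconds (assumed)
Q = 15.0   # queueing time in seconds (assumed)
requests_in_flight = 100

slowdown = (P + Q) / P
throughput_queued = requests_in_flight / (P + Q)          # requests/second
compensated = (requests_in_flight * slowdown) / (P + Q)   # scaled-up in-flight requests
original = requests_in_flight / P

print(slowdown)                 # 1.5
print(compensated == original)  # True: throughput is fully recovered
```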
Re: [freenet-dev] Beyond New Load Management: A proposal
On Tuesday 30 Aug 2011 00:08:16 Arne Babenhauserheide wrote:
> After discussing and going deeper into this, it became apparent that the
> problem is not overload of queues. I’ll repeat the result I came to here
> (and not in chatlog form :) ):
>
> The problem is in the load limiter:
>
> 1) we need to use all bandwidth, because latency depends on the number of
> finishing transfers.
> 2) SSKs fail 85% of the time.
> 3) the load limiter assumes that all SSKs succeed and reserves bandwidth
> for them.
> 4) this now bites us, because NLM has shorter transfer times and longer
> wait times. So the over-reservation of bandwidth for SSKs lasts longer.
> 5) solution: count each SSK as only
> average_SSK_success_rate * data_to_transfer_on_success.
>
> To (1): When we have less bandwidth, then fewer CHKs succeed per second,
> so fewer slots get freed and the queueing time is bigger.

Another way to look at this:

Both me and Ian have been assuming that the resource is some number representing a property of the node, for example, how many requests are running. But the real resource is the external bandwidth usage, which is rather different.

An alternative solution - if we are worried about bursts of requests succeeding and therefore timing out - is to allocate a fixed chunk of bandwidth for SSKs, and not let them interfere with CHKs. Then CHKs can have long transfer times, and therefore long queueing times when they fail (but this is only ~40% of the time with NLM's accurate routing), and SSKs can have short transfer times, and therefore short queueing times. This may be less disruptive and should also be relatively easy to implement.

But the other question is, can queueing ever be helpful? It can if it allows us to route more accurately (which NLM clearly does), and/or to run enough requests in parallel that the longer time taken for the request to reach its destination is offset. Is this condition met?

> Best wishes,
> Arne
>
> Am Montag, 29.
August 2011, 22:27:47 schrieb Matthew Toseland: > > Okay, I don't understand most of that, I might be able to check the math if > > it was written properly, but it looks difficult. However, as far as I can > > see: - The most obvious way to increase bandwidth usage would be to > > increase the timeout time for output bandwidth liability (and at the same > > time increase the relevant block transfer timeouts). - This would increase > > the number of slots but it would also increase the number of requests > > seeking them; I don't see why it would help matters. - Running an excessive > > number of requests without adjusting the block transfer timeouts would > > result in some of the transfers timing out. - It would also, as you > > mention, make SSK requests (especially failed SSK requests) even slower. - > > I am quite confident that Moore's Law DOES NOT hold for upload bandwidth. - > > As far as I can see, all the benefits of NLM re attackers are achieved by > > fair sharing. - A system which is more resistant to attacks but slower > > probably isn't all that interesting if the attacks in question are > > relatively expensive anyway. - Smaller block sizes would have a significant > > efficiency cost, and would probably make load management more difficult. > > > > I apologise if you see this as rather a broadside after I encouraged you to > > analyse the problem, but it is not yet a convincing demonstration that > > queueing is a viable strategy and not the spawn of satan! :) > > On Monday 29 Aug 2011 21:27:01 Arne Babenhauserheide wrote: > > > Am Montag, 29. August 2011, 14:32:13 schrieb Ian Clarke: > > > > Yes, small tweaks have worked so well for us for the last decade, > > > > leaving us pretty-much where we were in 2003. No, we don't > > > > understand how the current system works, there is no point in > > > > trying to fix something when we don't even know what is broken. > > > > > > I’d like to present a clue what is broken in NLM. 
Before I kill you with > > > the> > > > log, here’s the result: > > > With NLM the latency of a request is a function of the raw bandwidth > > > (not so with OLM), and NLM used only half my bandwidth after it had > > > been deployed for 2 days (at the start much more). > > > > > > τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ! ; with τ: transfer time, q: queue > > > time (time to find the node), nlm: new load management, olm: old load > > > management. > > > > > > So first step: make sure all bandwidth gets used - maybe by allocating > > > more slots till we use all allowed bandwidth. Better having to throttle > > > a transfer than not using bandwidth. > > > > > > *NLM should with the current network be slower than OLM by 23%. But in > > > 18 > > > months it should actually be faster by ~8% — given Moores Law holds for > > > upload bandwidth — because the routes are shorter.* > > > > > > The main advantage of NLM is, that it should be much more resilient > > >
Re: [freenet-dev] Beyond New Load Management: A proposal
After discussing and going deeper into this, it became apparent that the problem is not overload of queues. I’ll repeat the result I came to here (and not in chatlog form :) ):

The problem is in the load limiter:

1) we need to use all bandwidth, because latency depends on the number of finishing transfers.
2) SSKs fail 85% of the time.
3) the load limiter assumes that all SSKs succeed and reserves bandwidth for them.
4) this now bites us, because NLM has shorter transfer times and longer wait times. So the over-reservation of bandwidth for SSKs lasts longer.
5) solution: count each SSK as only average_SSK_success_rate * data_to_transfer_on_success.

To (1): When we have less bandwidth, then fewer CHKs succeed per second, so fewer slots get freed and the queueing time is bigger.

Best wishes,
Arne

On Monday, 29 August 2011, 22:27:47, Matthew Toseland wrote:
> Okay, I don't understand most of that, I might be able to check the math if it was written properly, but it looks difficult. However, as far as I can see:
> - The most obvious way to increase bandwidth usage would be to increase the timeout time for output bandwidth liability (and at the same time increase the relevant block transfer timeouts).
> - This would increase the number of slots but it would also increase the number of requests seeking them; I don't see why it would help matters.
> - Running an excessive number of requests without adjusting the block transfer timeouts would result in some of the transfers timing out.
> - It would also, as you mention, make SSK requests (especially failed SSK requests) even slower.
> - I am quite confident that Moore's Law DOES NOT hold for upload bandwidth.
> - As far as I can see, all the benefits of NLM re attackers are achieved by fair sharing.
> - A system which is more resistant to attacks but slower probably isn't all that interesting if the attacks in question are relatively expensive anyway.
> - Smaller block sizes would have a significant efficiency cost, and would probably make load management more difficult.
>
> I apologise if you see this as rather a broadside after I encouraged you to analyse the problem, but it is not yet a convincing demonstration that queueing is a viable strategy and not the spawn of satan! :)
>
> On Monday 29 Aug 2011 21:27:01 Arne Babenhauserheide wrote:
> > On Monday, 29 August 2011, 14:32:13, Ian Clarke wrote:
> > > Yes, small tweaks have worked so well for us for the last decade, leaving us pretty-much where we were in 2003. No, we don't understand how the current system works, there is no point in trying to fix something when we don't even know what is broken.
> >
> > I’d like to present a clue what is broken in NLM. Before I kill you with the log, here’s the result:
> >
> > With NLM the latency of a request is a function of the raw bandwidth (not so with OLM), and NLM used only half my bandwidth after it had been deployed for 2 days (at the start much more).
> >
> > τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ! ; with τ: transfer time, q: queue time (time to find the node), nlm: new load management, olm: old load management.
> >
> > So first step: make sure all bandwidth gets used - maybe by allocating more slots till we use all allowed bandwidth. Better having to throttle a transfer than not using bandwidth.
> >
> > *With the current network, NLM should be slower than OLM by 23%. But in 18 months it should actually be faster by ~8% — given Moore’s Law holds for upload bandwidth — because the routes are shorter.*
> >
> > The main advantage of NLM is that it should be much more resilient against attackers (DoS).
> >
> > Now to the log - it’s math and not cleaned up; you have been warned :)
> >
> > SSK-time: σ, CHK-time: ψ, success: Xs, fail: Xf.
> > queue-time: q, transfer-time: τ, hops remaining: h, total > > hops: h₀, w: success rate > > ψs = τ(h) + q(h) > > ψf = q(h) > > ψ ~ w₁·ψs + (1-w₁)·ψf > > σs = τ(h) + q(h) > > σf = q(h) > > σ ~ w₂·ψs + (1-w₂)·ψf; w₂ ~ 15% > > num(ψ) / num(σ) ~ 1 > > → time ~ σ + ψ > > q(h) depends on timeouts, as do w₁ and w₂ > > time = w₁·ψs + (1-w₁)·ψf + w₂·ψs + (1-w₂)·ψf > > = w₁ · (τ(h) + q(h)) + (1-w₁)·q(h) + w₂ · (τ(h) + q(h)) + (1- > > w₂)·q(h) > > = t(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂) > > = τ(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂) > > in the congestion case q(h) ~ timeout > > timeout = o > > timeout: o > > w depends on the timeout *somehow*, but inversely > > o=0 → w=0 > > assumption: o = ∞ → w₂ ~ 20%, w₁ ~ 100% > > assumption: o = ∞ → w₂ ~ 0.2, w₁ ~ 1 > > correction: in the congestion case: q(h) ~ min(timeout, τ(h)) > > timeout matters for q(h) only when timeout < τ(h) > > I try to: I still need a dependency of w on timeout > > … lets call it t(w) > > better: w(o) :) > > well, if there is a timeout, we have a fixed time, but we reduce > > the hops ... > > i thought w was success rate
Re: [freenet-dev] Beyond New Load Management: A proposal
On Monday 29 Aug 2011 20:32:13 Ian Clarke wrote: > On Mon, Aug 29, 2011 at 2:09 PM, Matthew Toseland > wrote: > > > On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote: > > > Ok, thinking about it, here is a proposal, or rather, the beginning of a > > > proposal. I'm assuming that we get rid of NLM, fair sharing, and > > anything > > > else intended to control load, and replace it with this. We will > > absolutely > > > need to simulate this before we write a single line of code to deploy it. > > > > I have justified fair sharing in my other reply. However alternate > > mechanisms might replace it. > > > > I agree we should simulate, although this might be significant work. > > > > I doubt it would be more work than trying it without simulation and failing > for a decade, which was our previous approach. > > > > > The core idea is that a node will include a floating point number in > > > response to any kind of request showing how close that node is to being > > > overloaded. 0.0 would mean its doing nothing, 1.0 would mean that it is > > > completely saturated and must reject requests. Clearly the goal is to > > avoid > > > getting anywhere near to 1.0. > > > > > > A node tracks several things: > > > > All of these apply to both locally originated requests and remotely > > originated requests? And are they for only our direct peers? > > > > Every peer along the reply path will take note of the load value, so it will > be a mixture of the loads of close-by peers, and more distant peers. > Obviously, since we are more likely to receive messages from peers that are > "closer", our average will automatically be biased towards closer peers. So it is from the data source? Or every node along the way? > > >- What is the average load reported by responses this node forwarded, > > per > > >remote node > > > > Ahhh, this one could be interesting - you could use it to penalise nodes > > which spam excessively. > > > Exactly. 
> > > > I think, given these metrics, we should be able to do the following: > > > > > >- Limit our overall rate of initiating local requests based on the > > global > > >average reported load > > > > We can already do this on the basis of AIMD's, based purely on behaviour of > > local requests. However, taking more data into account might result in more > > accurate results - although it might also dilute the feedback. > > Right, AIMD tends to control the rate at which a specific event occurs, like > a dropped IP packet with TCP, or a dropped request as we currently use it. > This is a bit like signaling that you have a headache by banging your head > against the wall. It is also a rather coarse way to deliver feedback. Fair point. > > A floating point load number conveys more information in a non-destructive > way. > > In any case, this part is "safe": it's not going to cause feedback problems, > > on its own. Conversely, it doesn't deal with problems such as network > > bottlenecks. > > Well, it depends. If we are tracking load on a per-outbound-connection > basis then we can potentially take action to manage load per-connection in > addition to overall. Not sure I follow here. Routing by load is bad, unless it is severely limited. > > > However, while AIMD's are based on the *global* average load (or rather the > > load of all the nodes along the chain), it sounds like you are proposing to > > only report the *local* load of our direct peers? How are you going to > > compute the global average load? > > No, its not the local load of our direct peers. Its the load reported in > all reply messages that we forward, some of those will be from our direct > peers, but most will be from remote peers. > > > >- Limit our rate of local requests based on the average load of the > > >connection to the peer it would need to be forwarded to > > > > Not sure I follow here. Are you proposing that we initiate more requests in > > keyspace areas where there is lower load? 
> > Well, kinda, except the opposite. We might reduce requests in keyspaces > where there is higher load. > > > > >- Detect when remote peers are abusing the system by disregarding load > > - > > >as evidenced by a significantly higher average load in replies > > forwarded to > > >them > > > > Okay, that is useful - although I doubt it would be an unambiguous > > detection, and what would we do about it? We would have to reject requests, > > and if we're not sure, or if it could be a downstream troublemaker, couldn't > > that result in misrouting and make things worse? On the other hand, any > > malicious node would be most easily detected by the nodes directly connected > > to it, so this could work much the same way as fair sharing - provided that > > it can reject or terminate the requests. > > No, we probably wouldn't start rejecting requests - more likely we would > disconnect from that node. Obviously we'd need to be fairly sure of abusive > behavior before doing that -
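Ian's proposal above could be sketched in code: track the load value (0.0 = idle, 1.0 = saturated) carried by each reply a node forwards, averaged per downstream peer, and compare against the global average to spot abusive peers. All names, the smoothing factor, and the detection margin are illustrative assumptions, not Freenet's actual implementation:

```python
class PeerLoadTracker:
    """Tracks the load values seen in replies forwarded to each peer.

    Load is a float in [0.0, 1.0]: 0.0 = idle, 1.0 = saturated.
    An exponential moving average smooths out bursts (alpha is illustrative).
    """

    def __init__(self, alpha: float = 0.1) -> None:
        self.alpha = alpha
        self.global_avg = 0.0
        self.per_peer: dict[str, float] = {}

    def record_reply(self, forwarded_to_peer: str, reported_load: float) -> None:
        if not 0.0 <= reported_load <= 1.0:
            raise ValueError("load must be within [0.0, 1.0]")
        self.global_avg += self.alpha * (reported_load - self.global_avg)
        old = self.per_peer.get(forwarded_to_peer, reported_load)
        self.per_peer[forwarded_to_peer] = old + self.alpha * (reported_load - old)

    def looks_abusive(self, peer: str, margin: float = 0.2) -> bool:
        """A peer whose replies carry much higher load than average is suspect."""
        return self.per_peer.get(peer, 0.0) > self.global_avg + margin

tracker = PeerLoadTracker()
for _ in range(50):
    tracker.record_reply("well-behaved", 0.3)
    tracker.record_reply("heavy", 0.9)
print(tracker.looks_abusive("heavy"))         # True
print(tracker.looks_abusive("well-behaved"))  # False
```

Note this reflects toad's caveat in the thread: the per-peer average only identifies a misbehaving peer if its requests actually route to a hotter subset of the network, so detection would have to be conservative before acting (e.g. disconnecting).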
Re: [freenet-dev] Beyond New Load Management: A proposal
Okay, I don't understand most of that, I might be able to check the math if it was written properly, but it looks difficult. However, as far as I can see: - The most obvious way to increase bandwidth usage would be to increase the timeout time for output bandwidth liability (and at the same time increase the relevant block transfer timeouts). - This would increase the number of slots but it would also increase the number of requests seeking them; I don't see why it would help matters. - Running an excessive number of requests without adjusting the block transfer timeouts would result in some of the transfers timing out. - It would also, as you mention, make SSK requests (especially failed SSK requests) even slower. - I am quite confident that Moore's Law DOES NOT hold for upload bandwidth. - As far as I can see, all the benefits of NLM re attackers are achieved by fair sharing. - A system which is more resistant to attacks but slower probably isn't all that interesting if the attacks in question are relatively expensive anyway. - Smaller block sizes would have a significant efficiency cost, and would probably make load management more difficult. I apologise if you see this as rather a broadside after I encouraged you to analyse the problem, but it is not yet a convincing demonstration that queueing is a viable strategy and not the spawn of satan! :) On Monday 29 Aug 2011 21:27:01 Arne Babenhauserheide wrote: > Am Montag, 29. August 2011, 14:32:13 schrieb Ian Clarke: > > Yes, small tweaks have worked so well for us for the last decade, leaving us > > pretty-much where we were in 2003. No, we don't understand how the current > > system works, there is no point in trying to fix something when we don't > > even know what is broken. > > I’d like to present a clue what is broken in NLM. 
Re: [freenet-dev] Beyond New Load Management: A proposal
On Monday, 29 August 2011, 14:32:13, Ian Clarke wrote:
> Yes, small tweaks have worked so well for us for the last decade, leaving us pretty-much where we were in 2003. No, we don't understand how the current system works, there is no point in trying to fix something when we don't even know what is broken.

I'd like to present a clue about what is broken in NLM. Before I kill you with the log, here's the result:

With NLM the latency of a request is a function of the raw bandwidth (not so with OLM), and NLM used only half my bandwidth after it had been deployed for 2 days (at the start much more).

τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ! ; with τ: transfer time, q: queue time (time to find the node), nlm: new load management, olm: old load management.

So the first step: make sure all bandwidth gets used - maybe by allocating more slots until we use all allowed bandwidth. Better to have to throttle a transfer than to leave bandwidth unused.

*NLM should, with the current network, be slower than OLM by 23%. But in 18 months it should actually be faster by ~8% - given Moore's Law holds for upload bandwidth - because the routes are shorter.*

The main advantage of NLM is that it should be much more resilient against attackers (DoS).

Now to the log - it's math and not cleaned up; you have been warned :)

SSK-time: σ, CHK-time: ψ, success: Xs, fail: Xf.
queue-time: q, transfer-time: τ, hops remaining: h, total hops: h₀, w: success rate
ψs = τ(h) + q(h)
ψf = q(h)
ψ ~ w₁·ψs + (1-w₁)·ψf
σs = τ(h) + q(h)
σf = q(h)
σ ~ w₂·σs + (1-w₂)·σf; w₂ ~ 15%
num(ψ) / num(σ) ~ 1
→ time ~ σ + ψ
q(h) depends on timeouts, as do w₁ and w₂
time = w₁·ψs + (1-w₁)·ψf + w₂·ψs + (1-w₂)·ψf
     = w₁·(τ(h) + q(h)) + (1-w₁)·q(h) + w₂·(τ(h) + q(h)) + (1-w₂)·q(h)
     = τ(h)·(w₁+w₂) + 2·q(h)
in the congestion case q(h) ~ timeout
timeout: o
w depends on the timeout *somehow*, but inversely
o=0 → w=0
assumption: o = ∞ → w₂ ~ 0.2, w₁ ~ 1
correction: in the congestion case: q(h) ~ min(timeout, τ(h))
timeout matters for q(h) only when timeout < τ(h)
I still need a dependency of w on the timeout … let's call it w(o) :)
well, if there is a timeout, we have a fixed time, but we reduce the hops ...
i thought w was success rate
ah!
and the success rates were in the NLM stats, going mostly smoothly from 60% to 0% over the HTL
right, success rate peaks at 18 or sometimes 16
what are w1 vs w2?
chk vs ssk i guess
yes
-*- toad_ thinks considering both is probably overambitious at this stage?
should not be too bad: SSKs drop much more rapidly at decreasing hops
hops→HTL
ψs is time for a successful chk; ψf is time for a failed chk ... in which case h in the first instance is low, and in the second instance is h₀
yes
okay, i don't follow this line: time = w₁·ψs + (1-w₁)·ψf + w₂·ψs + (1-w₂)·ψf
i thought w2 related to SSKs?
uh, yes…
time = w₁·ψs + (1-w₁)·ψf + w₂·σs + (1-w₂)·σf
you have to appreciate i'm only just getting back into maths and physics after 12 years ...
(retaking a-levels to get a degree)
no probs, I'm also no expert in this.
I try to get a relation between the time and the timeout, so we can try to find a minimum
in any case, there are two different h's for the two uses of q(h) - h₀ and h_avg
h_avg for success and h₀ for failure
hm, yes
which makes this harder… it's wrong anyway… the q(h_avg) was missing
h_avg is somewhere between 5 and 10 imho
at least it is if everything is working well and the input load isn't all really popular stuff (in which case it's answered quickly and can be ignored)
= τ(h)·(w₁+w₂) + q(h)·(2-w₁-w₂) + q(h_avg)·(w₁+w₂)
would have been too easy :)
okay so here q(h) means q(h₀), i.e. h = h₀, max hops?
yep, and max hops decreases with a falling timeout
hmm?
the max actual hops
on the upside, q() is linear
<-- Torgal (~Torgal@78.251.49.8) has left the network (Ping timeout: 276 seconds)
hopefully
yes: q(h) = h·o (in the congestion case)
the problem i have is it looks like q(1) ~= time [ for a full request ], unless load is very low
so τ « q? τ much smaller than q?
of course it's bounded by timeouts, but i'd expect a runaway feedback loop until it reaches heavy timeouts and effectively cuts the htl
well, with OLM, success time for a CHK is 1m25s, unsuccessful is 19s, so transfer time is at least 1 minute and less than 1m25s; but with NLM, unsuccessful is 3 min+
well, for SSKs in OLM the first 6 hops are the most successful; later ones only contribute 1% success each, which piles up to ~12%
okay...
(only possible because 85% are unsuccessful)
(otherwise this would be wrong: the contribution of later ones would be smaller)
from the numbers, q(h₀) ~ τ(h_avg)
well, queueing time on any hop is the time it takes to get a slot to route to, which is roughly equal to the time it takes for a request to complete divided b
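The request-time model derived in the log above can be sketched in a few lines. This is a toy evaluation, not Freenet code: the symbols follow the log (τ: transfer time, q: queue time, w₁/w₂: CHK/SSK success rates, h₀: max hops, h_avg: average hops to success), and all concrete numbers below are illustrative assumptions, not measurements.

```python
# Hypothetical sketch of the corrected formula from the log:
#   time = tau(h_avg)*(w1+w2) + q(h0)*(2-w1-w2) + q(h_avg)*(w1+w2)
# Successful requests walk h_avg hops; failures walk the full h0 chain.

def expected_request_time(tau, q, w1, w2, h0, h_avg):
    """Expected combined CHK+SSK request time under the log's model."""
    return (tau(h_avg) * (w1 + w2)          # transfer time, successes only
            + q(h0) * (2 - w1 - w2)         # queueing along the full failed chain
            + q(h_avg) * (w1 + w2))         # queueing along the successful path

# Illustrative run with the congestion-case assumption q(h) = h * timeout.
timeout = 2.0                   # seconds per hop spent queueing (assumed)
tau = lambda h: 10.0            # flat transfer time in seconds (assumed)
q = lambda h: h * timeout
t = expected_request_time(tau, q, w1=0.6, w2=0.15, h0=18, h_avg=7)
# t is 63.0 seconds under these toy numbers
```

The point of the sketch is only that, in the congestion case, queue time scales with hops times timeout, so the q(h₀) term dominates once timeouts grow.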
Re: [freenet-dev] Beyond New Load Management: A proposal
On Mon, Aug 29, 2011 at 2:21 PM, Matthew Toseland wrote:

> > > - What is the average load reported by responses this node forwarded, per remote node
> >
> > Ahhh, this one could be interesting - you could use it to penalise nodes which spam excessively.
>
> Actually, thinking about it, I don't think this will work. Requests are usually distributed evenly across the keyspace, and downstream nodes only know they are from this node - not which previous node they are from. So the pain will be shared across all the peers sending us requests, including our own local requests, and won't be identifiable as due to any single node. You have to do it by volume, not by reported load.

But if a node is being abusive, won't the nodes it's requesting from tend to have a higher than average load? Perhaps you are right though, but this is the kind of thought-process we need.

Ian.

-- 
Ian Clarke
Founder, The Freenet Project
Email: i...@freenetproject.org
Re: [freenet-dev] Beyond New Load Management: A proposal
On Mon, Aug 29, 2011 at 2:09 PM, Matthew Toseland wrote:

> On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:
> > Ok, thinking about it, here is a proposal, or rather, the beginning of a proposal. I'm assuming that we get rid of NLM, fair sharing, and anything else intended to control load, and replace it with this. We will absolutely need to simulate this before we write a single line of code to deploy it.
>
> I have justified fair sharing in my other reply. However alternate mechanisms might replace it.
>
> I agree we should simulate, although this might be significant work.

I doubt it would be more work than trying it without simulation and failing for a decade, which was our previous approach.

> > The core idea is that a node will include a floating point number in response to any kind of request showing how close that node is to being overloaded. 0.0 would mean it's doing nothing, 1.0 would mean that it is completely saturated and must reject requests. Clearly the goal is to avoid getting anywhere near to 1.0.
> >
> > A node tracks several things:
>
> All of these apply to both locally originated requests and remotely originated requests? And are they for only our direct peers?

Every peer along the reply path will take note of the load value, so it will be a mixture of the loads of close-by peers, and more distant peers. Obviously, since we are more likely to receive messages from peers that are "closer", our average will automatically be biased towards closer peers.

> > - What is the average load reported by responses this node forwarded, per remote node
>
> Ahhh, this one could be interesting - you could use it to penalise nodes which spam excessively.

Exactly.
> > I think, given these metrics, we should be able to do the following:
> >
> > - Limit our overall rate of initiating local requests based on the global average reported load
>
> We can already do this on the basis of AIMDs, based purely on behaviour of local requests. However, taking more data into account might result in more accurate results - although it might also dilute the feedback.

Right, AIMD tends to control the rate at which a specific event occurs, like a dropped IP packet with TCP, or a dropped request as we currently use it. This is a bit like signaling that you have a headache by banging your head against the wall. It is also a rather coarse way to deliver feedback. A floating point load number conveys more information in a non-destructive way.

> In any case, this part is "safe": it's not going to cause feedback problems on its own. Conversely, it doesn't deal with problems such as network bottlenecks.

Well, it depends. If we are tracking load on a per-outbound-connection basis, then we can potentially take action to manage load per-connection in addition to overall.

> However, while AIMDs are based on the *global* average load (or rather the load of all the nodes along the chain), it sounds like you are proposing to only report the *local* load of our direct peers? How are you going to compute the global average load?

No, it's not the local load of our direct peers. It's the load reported in all reply messages that we forward; some of those will be from our direct peers, but most will be from remote peers.

> > - Limit our rate of local requests based on the average load of the connection to the peer it would need to be forwarded to
>
> Not sure I follow here. Are you proposing that we initiate more requests in keyspace areas where there is lower load?

Well, kinda, except the opposite. We might reduce requests in keyspaces where there is higher load.
> > - Detect when remote peers are abusing the system by disregarding load - as evidenced by a significantly higher average load in replies forwarded to them
>
> Okay, that is useful - although I doubt it would be an unambiguous detection, and what would we do about it? We would have to reject requests, and if we're not sure, or if it could be a downstream troublemaker, couldn't that result in misrouting and make things worse? On the other hand, any malicious node would be most easily detected by the nodes directly connected to it, so this could work much the same way as fair sharing - provided that it can reject or terminate the requests.

No, we probably wouldn't start rejecting requests - more likely we would disconnect from that node. Obviously we'd need to be fairly sure of abusive behavior before doing that - but abuse should be rare.

> Arguably fair sharing achieves the same objective. It could be extended in various ways, e.g. if we are relaying a lot of slow-down notices to a peer we could start to think about squeezing the peer's allocation.

Yes, except it doesn't work, and it's too complicated to reason about.

> > Of course, lots of questions:
> >
> > - How do we compute the averages? A decayi
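The "decaying running average" and abuse-detection ideas in this message can be sketched together. This is a hypothetical illustration, not Freenet code: the class name, the smoothing factor `alpha`, and the `margin` threshold are all assumptions chosen for clarity, not tuned values.

```python
# Sketch: exponentially-decaying averages of the 0.0..1.0 load values
# reported in replies, kept globally and per remote peer. A peer whose
# forwarded replies carry a much higher average load than the global
# average *might* be disregarding slow-down hints.

class LoadTracker:
    def __init__(self, alpha=0.05):
        self.alpha = alpha          # smoothing factor: higher = faster decay
        self.global_avg = 0.0
        self.per_peer = {}          # peer id -> avg load in replies forwarded to it

    def record(self, peer, load):
        """Fold one reported load value (0.0..1.0) into the averages."""
        self.global_avg += self.alpha * (load - self.global_avg)
        avg = self.per_peer.get(peer, load)   # seed with the first observation
        self.per_peer[peer] = avg + self.alpha * (load - avg)

    def looks_abusive(self, peer, margin=0.2):
        """Heuristic only: flags a peer well above the global average."""
        return self.per_peer.get(peer, 0.0) > self.global_avg + margin
```

As the thread itself notes, such a detection would not be unambiguous; the sketch just shows that the bookkeeping is cheap and the only real questions are the tunables (`alpha`, `margin`, and what to do once a peer is flagged).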
Re: [freenet-dev] Beyond New Load Management: A proposal
On Monday 29 Aug 2011 20:09:58 Matthew Toseland wrote:
> On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:
> > Ok, thinking about it, here is a proposal, or rather, the beginning of a proposal. I'm assuming that we get rid of NLM, fair sharing, and anything else intended to control load, and replace it with this. We will absolutely need to simulate this before we write a single line of code to deploy it.
>
> I have justified fair sharing in my other reply. However alternate mechanisms might replace it.
>
> I agree we should simulate, although this might be significant work.
>
> > The core idea is that a node will include a floating point number in response to any kind of request showing how close that node is to being overloaded. 0.0 would mean it's doing nothing, 1.0 would mean that it is completely saturated and must reject requests. Clearly the goal is to avoid getting anywhere near to 1.0.
> >
> > A node tracks several things:
>
> All of these apply to both locally originated requests and remotely originated requests? And are they for only our direct peers?
>
> > - What is the overall average load reported by responses this node has received
> > - What is the average load reported by responses this node has received, per remote node
> > - What is the average load reported by responses this node forwarded, per remote node
>
> Ahhh, this one could be interesting - you could use it to penalise nodes which spam excessively.

Actually, thinking about it, I don't think this will work. Requests are usually distributed evenly across the keyspace, and downstream nodes only know they are from this node - not which previous node they are from. So the pain will be shared across all the peers sending us requests, including our own local requests, and won't be identifiable as due to any single node. You have to do it by volume, not by reported load.
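"Do it by volume" can be made concrete with a small accounting sketch: track how many bytes of our capacity each directly connected peer's requests consume, and compare each peer's share against an equal split. This is a hypothetical illustration; the class name and the `factor` threshold are assumptions, not anything from the Freenet codebase.

```python
# Sketch: per-peer volume accounting. Unlike reported load, volume is
# directly attributable to the peer that sent us the requests, so it can
# feed a fair-sharing decision.

from collections import defaultdict

class VolumeShare:
    def __init__(self):
        self.bytes_by_peer = defaultdict(int)

    def account(self, peer, nbytes):
        """Charge nbytes of our capacity to the peer that requested them."""
        self.bytes_by_peer[peer] += nbytes

    def share(self, peer):
        """Fraction of our total accounted capacity this peer used."""
        total = sum(self.bytes_by_peer.values())
        return self.bytes_by_peer[peer] / total if total else 0.0

    def over_fair_share(self, peer, factor=2.0):
        """True if the peer uses more than `factor` times the equal split."""
        npeers = len(self.bytes_by_peer)
        return npeers > 0 and self.share(peer) > factor / npeers
```

In practice the counters would need to decay over time, but the core point of the message stands: volume is identifiable per peer, reported load is not.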
Re: [freenet-dev] Beyond New Load Management: A proposal
On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:
> Ok, thinking about it, here is a proposal, or rather, the beginning of a proposal. I'm assuming that we get rid of NLM, fair sharing, and anything else intended to control load, and replace it with this. We will absolutely need to simulate this before we write a single line of code to deploy it.

I have justified fair sharing in my other reply. However alternate mechanisms might replace it.

I agree we should simulate, although this might be significant work.

> The core idea is that a node will include a floating point number in response to any kind of request showing how close that node is to being overloaded. 0.0 would mean it's doing nothing, 1.0 would mean that it is completely saturated and must reject requests. Clearly the goal is to avoid getting anywhere near to 1.0.
>
> A node tracks several things:

All of these apply to both locally originated requests and remotely originated requests? And are they for only our direct peers?

> - What is the overall average load reported by responses this node has received
> - What is the average load reported by responses this node has received, per remote node
> - What is the average load reported by responses this node forwarded, per remote node

Ahhh, this one could be interesting - you could use it to penalise nodes which spam excessively.

> I think, given these metrics, we should be able to do the following:
>
> - Limit our overall rate of initiating local requests based on the global average reported load

We can already do this on the basis of AIMDs, based purely on behaviour of local requests. However, taking more data into account might result in more accurate results - although it might also dilute the feedback.

In any case, this part is "safe": it's not going to cause feedback problems on its own. Conversely, it doesn't deal with problems such as network bottlenecks.
However, while AIMDs are based on the *global* average load (or rather the load of all the nodes along the chain), it sounds like you are proposing to only report the *local* load of our direct peers? How are you going to compute the global average load?

> - Limit our rate of local requests based on the average load of the connection to the peer it would need to be forwarded to

Not sure I follow here. Are you proposing that we initiate more requests in keyspace areas where there is lower load?

> - Detect when remote peers are abusing the system by disregarding load - as evidenced by a significantly higher average load in replies forwarded to them

Okay, that is useful - although I doubt it would be an unambiguous detection, and what would we do about it? We would have to reject requests, and if we're not sure, or if it could be a downstream troublemaker, couldn't that result in misrouting and make things worse? On the other hand, any malicious node would be most easily detected by the nodes directly connected to it, so this could work much the same way as fair sharing - provided that it can reject or terminate the requests.

Arguably fair sharing achieves the same objective. It could be extended in various ways, e.g. if we are relaying a lot of slow-down notices to a peer we could start to think about squeezing the peer's allocation.

> Of course, lots of questions:
>
> - How do we compute the averages? A decaying running average of some kind? What time period?

Yay, more tunables. :(

> - How do we translate load into a desired rate of requests?

And even more tunables. :(

> - What criteria indicate that a peer is abusing the system? What is the remedy?

In the current system, don't accept their requests. In ANY system, there will always be a borderline below which they can get away with it. We can either try to make this as small as possible or try not to hit legitimate traffic too hard.
> This is basically control theory stuff, and we'll need robust answers to these questions before we proceed (ideally with a theoretical rather than experimental foundation).

IMHO small tweaks to AIMDs, combined with fair sharing, will achieve much the same without having to wait years for our theoreticians to wake up.
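For readers unfamiliar with the AIMD behaviour the thread keeps referring to, here is a minimal sketch: additive increase of the local request window while requests succeed, multiplicative decrease on a rejection. The constants mirror TCP-style AIMD and are illustrative assumptions, not Freenet's actual tuning.

```python
# Sketch of an AIMD-controlled request window. The only feedback it uses
# is the success/rejection event itself - the "banging your head against
# the wall" signal criticised earlier in the thread.

class AIMDWindow:
    def __init__(self, window=1.0, increase=1.0, decrease=0.5, floor=1.0):
        self.window = window        # allowed concurrent local requests
        self.increase = increase    # added per successful request
        self.decrease = decrease    # multiplier applied on a rejection
        self.floor = floor          # never drop below one outstanding request

    def on_success(self):
        self.window += self.increase

    def on_rejection(self):
        self.window = max(self.floor, self.window * self.decrease)
```

The contrast with the floating-point-load proposal is visible here: AIMD only learns from rejections after they happen, whereas a continuous load value could slow the sender down before any request is rejected.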