[freenet-dev] Beyond New Load Management: A proposal
> > we do about it? We would have to reject requests, and if we're not sure, or
> > if it could be a downstream troublemaker, couldn't that result in
> > misrouting and make things worse? On the other hand, any malicious node
> > would be most easily detected by the nodes directly connected to it, so
> > this could work much the same way as fair sharing - provided that it can
> > reject or terminate the requests.
>
> No, we probably wouldn't start rejecting requests - more likely we would
> disconnect from that node. Obviously we'd need to be fairly sure of abusive
> behavior before doing that - but abuse should be rare.

Well then we'd need to be very sure, and this means you could get away with a
lot before triggering it. Weird topologies for instance could cause this sort
of problem on a temporary basis.

> > Arguably fair sharing achieves the same objective. It could be extended in
> > various ways e.g. if we are relaying a lot of slow-down notices to a peer
> > we could start to think about squeezing the peer's allocation.
>
> Yes, except it doesn't work, and it's too complicated to reason about.

What makes you think it doesn't work? What makes you think it's complicated
either?

> > > Of course, lots of questions:
> > >
> > > - How do we compute the averages? A decaying running average of some
> > > kind? What time period?
> >
> > Yay more tunables. :(
>
> Ideally we'd do this in a way that avoids tunables. This is control theory,
> we shouldn't have to reinvent the wheel.

Control theory doesn't apply because we are not dealing with feedback going to
a single system; it's thousands of nodes, which all have separate counters.

> > > - How do we translate load into a desired rate of requests?
> >
> > And even more tunables. :(
>
> Don't be pessimistic, hopefully we can avoid tunables if we are smart about
> it.
>
> > > - What criteria indicate that a peer is abusing the system? What is the
> > > remedy?
> >
> > In the current system, don't accept their requests.
> Yes, although I'm leaning towards disconnecting from them, which is
> more-or-less the same thing, just more permanent and they don't consume our
> resources any more.

Only if we can be really sure. And I'm not sure how realistic that is.

> > In ANY system, there will always be a borderline below which they can get
> > away with it. We can either try to make this as small as possible or try
> > not to hit legitimate traffic too hard.
>
> Abuse should be rare, we should bias against false positives when we look
> for it.

Which means that an abuser can stay below the threshold for a long time on a
lot of nodes ... IMHO telling the difference could be difficult.

> > > This is basically control theory stuff, and we'll need robust answers to
> > > these questions before we proceed (ideally with a theoretical rather
> > > than experimental foundation).
> >
> > IMHO small tweaks to AIMDs, combined with fair sharing, will achieve much
> > the same without having to wait years for our theoreticians to wake up.
>
> Yes, small tweaks have worked so well for us for the last decade, leaving us
> pretty-much where we were in 2003. No, we don't understand how the current
> system works, there is no point in trying to fix something when we don't
> even know what is broken.

I think we have a much better understanding of it than we have had until
recently: all forms of load management tried so far transmute incoming load
into something bad - misrouting, queueing delays, reduced effective HTL. We
need to reduce the incoming load WITHOUT these things.

The simplest implementation of this concept is to send non-local
RejectedOverload's when we are *close* to the limits, not only local ones when
we are *at* the limits (which are then relayed as non-local); in other words,
to send a slow-down message *before* it's too late.

> I am touched by your faith in "theoreticians", but I have a feeling we're in
> uncharted territory here.
> We need to agree on an approach that is simple enough to reason about, and
> then build a simulation for it. I think the simulation should just be a few
> days of work, we're not talking years here, hopefully not even months.

Weeks probably - and simulations themselves tend to have tunables. Anyway if I
start writing simulations people get mad because I'm not fixing bugs.

-- next part --
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 198 bytes
Desc: This is a digitally signed message part.
URL: <https://emu.freenetproject.org/pipermail/devl/attachments/20110829/975bb061/attachment.pgp>
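[Editorial note: the "send a slow-down message *before* it's too late" idea
above can be sketched as a soft/hard threshold check. This is a hedged
illustration, not fred's actual code; all names and the 0.8 threshold are
invented for the sketch.]

```python
# Sketch of the "early slow-down" idea: send a non-local RejectedOverload
# (slow-down hint) while we are merely *close* to capacity, and only reject
# outright when we are actually *at* capacity. Constants are illustrative.

SOFT_LIMIT = 0.8   # fraction of capacity at which we start signalling
HARD_LIMIT = 1.0   # fraction at which we must reject outright

def classify_request(running_requests, capacity):
    """Return 'accept', 'accept_with_slowdown', or 'reject'."""
    load = running_requests / capacity
    if load >= HARD_LIMIT:
        return "reject"                 # local RejectedOverload
    if load >= SOFT_LIMIT:
        return "accept_with_slowdown"   # accept, but relay a slow-down hint
    return "accept"
```

The point of the soft band is that originators hear about congestion while
there is still headroom, so backing off is cheap rather than an emergency.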
[freenet-dev] Beyond New Load Management: A proposal
> and 1.5 hops for NLM → τ_nlm / τ_olm ~ 0.71
> ArneBab: okay, that's plausible
> ArneBab: however, it should be possible with smart load limiting on the
> originator to achieve NLM-level success rates
> but not the resilience
> it still keeps freenet open to a DoS, NLM should help there.
> now back to the queueing: OLM had: ⟨q⟩(SSK) ~ 16s, ⟨q⟩(CHK) ~ 18s,
> τ(CHK) ~ 45s (my stats)
> possibly - fair sharing limits our vulnerability to a DoS, possibly enough
> as long as we don't have to worry about incentives issues
> that's about: q = ⅓ · τ (OLM)
> NLM: q ~ τ
> NLM: q ~ τ (NLM)
> time: 2·q + τ
> OLM: time ~ 5/3 τ_olm
> NLM: time = 3 × 0.72 τ_olm = 2.15 τ_olm
> toad_: Alright, it's alive. https://github.com/freenet/fred-staging/pull/55
> → time_nlm / time_olm ~ 2.15 / (5/3) ~ 1.3
> so the time to transfer should be a bit longer
> (not yet finished: this is the current state)
> now, if we decrease the timeout time, the chance that a given timeout
> happens in the first 4 hops should be about 4/20 = 0.2
> (cut that)
> if we decrease the timeout time below the transfer time per hop, there
> should be more misrouting → τ goes up, q might go down or up → cut that.
> transfer time per hop in OLM ~ 45s / hops_olm = 45s/2.09 = 21.5s
> actually, the time in NLM is so dependent on transfer time, that the most
> efficient strategy would likely be to decrease the block size
> or to get a faster network
> toad_: got it, damnit: NLM is so much slower than OLM, because it used less
> bandwidth!
> the time is a function of the raw bandwidth (not so with OLM), and NLM used
> only half my bandwidth after it had been deployed for 2 days (at the start
> much more)
> when we double the bandwidth (1.8 years?), NLM should be faster than OLM
> operhiem1: cool!
> toad_: actually I think the slot number calculation is flawed → less
> bandwidth used than possible
> that's why it did not break down, but slowed down to 1/5 OLM.
> From the math here I'd have guessed 1/2.6
> but adding SSKs with many more hops and time almost pure queue time it fits
> q_nlm ~ 3·⟨q⟩_olm; in the full bandwidth case
> but with half bandwidth we actually are at 6·q_olm
> → more slots should actually make it much better
> toad_: summary: τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ! → using only 50% of
> bandwidth (too few slots) massively slows down NLM.
> the transfer times should actually be dominant
> though they are lower than the queue time.
> and freenet should get faster with faster network or lower chunk sizes.
> toad_: so first step: make sure all bandwidth gets used - maybe by
> allocating more slots till about 2× the current number are transferring
-*- ArneBab is happy
> cool, lots of stuff to read tomorrow morning. :)
> NLM should with the current network be slower than OLM by 23%. But in 18
> months it should actually be faster by ~8%, given Moore's Law holds for
> upload bandwidth.
> :)
> with faster I mean time to complete a request.
> reaction time = latency
> digger3: maybe you can double-check the reasoning
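[Editorial note: the timing model in the log (end-to-end time counted as
2·q + τ, queue time q plus transfer time τ) can be checked numerically; the
~1.3 slowdown figure quoted above follows directly. Variable names here are
ours, not from fred.]

```python
# Numeric check of the OLM vs NLM timing estimate from the log above.
# Model: total time ~ 2*q + tau, with q = tau/3 for OLM and q ~ tau for NLM,
# and tau_nlm ~ 0.72 * tau_olm (NLM routes better, so fewer hops).

tau_olm = 1.0                            # normalise OLM transfer time to 1
time_olm = 2 * (tau_olm / 3) + tau_olm   # = 5/3 tau_olm
tau_nlm = 0.72 * tau_olm
time_nlm = 2 * tau_nlm + tau_nlm         # = 3 * 0.72 = 2.16 tau_olm

ratio = time_nlm / time_olm
print(round(ratio, 2))                   # ~1.3: NLM ~30% slower at equal bandwidth
```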
[freenet-dev] Beyond New Load Management: A proposal
freenet should get faster with faster network or lower chunk sizes.
toad_: so first step: make sure all bandwidth gets used - maybe by allocating
more slots till about 2× the current number are transferring
-*- ArneBab is happy
cool, lots of stuff to read tomorrow morning. :)
NLM should with the current network be slower than OLM by 23%. But in 18
months it should actually be faster by ~8%, given Moore's Law holds for upload
bandwidth.
:)
with faster I mean time to complete a request.
reaction time = latency
digger3: maybe you can double-check the reasoning

--
Konstruktive Kritik: - http://draketo.de/licht/krude-ideen/konstruktive-kritik
[freenet-dev] Beyond New Load Management: A proposal
On Monday 29 Aug 2011 20:09:58 Matthew Toseland wrote:
> On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:
> > Ok, thinking about it, here is a proposal, or rather, the beginning of a
> > proposal. I'm assuming that we get rid of NLM, fair sharing, and anything
> > else intended to control load, and replace it with this. We will
> > absolutely need to simulate this before we write a single line of code to
> > deploy it.
>
> I have justified fair sharing in my other reply. However alternate
> mechanisms might replace it.
>
> I agree we should simulate, although this might be significant work.
>
> > The core idea is that a node will include a floating point number in
> > response to any kind of request showing how close that node is to being
> > overloaded. 0.0 would mean it's doing nothing, 1.0 would mean that it is
> > completely saturated and must reject requests. Clearly the goal is to
> > avoid getting anywhere near to 1.0.
> >
> > A node tracks several things:
>
> All of these apply to both locally originated requests and remotely
> originated requests? And are they for only our direct peers?
>
> > - What is the overall average load reported by responses this node has
> > received
> > - What is the average load reported by responses this node has received,
> > per remote node
> > - What is the average load reported by responses this node forwarded, per
> > remote node
>
> Ahhh, this one could be interesting - you could use it to penalise nodes
> which spam excessively.

Actually, thinking about it, I don't think this will work. Requests are
usually distributed evenly across the keyspace, and downstream nodes only know
they are from this node - not which previous node they are from. So the pain
will be shared across all the peers sending us requests, including our own
local requests, and won't be identifiable as due to any single node. You have
to do it by volume, not by reported load.
[freenet-dev] Beyond New Load Management: A proposal
On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:
> Ok, thinking about it, here is a proposal, or rather, the beginning of a
> proposal. I'm assuming that we get rid of NLM, fair sharing, and anything
> else intended to control load, and replace it with this. We will absolutely
> need to simulate this before we write a single line of code to deploy it.

I have justified fair sharing in my other reply. However alternate mechanisms
might replace it.

I agree we should simulate, although this might be significant work.

> The core idea is that a node will include a floating point number in
> response to any kind of request showing how close that node is to being
> overloaded. 0.0 would mean it's doing nothing, 1.0 would mean that it is
> completely saturated and must reject requests. Clearly the goal is to avoid
> getting anywhere near to 1.0.
>
> A node tracks several things:

All of these apply to both locally originated requests and remotely originated
requests? And are they for only our direct peers?

> - What is the overall average load reported by responses this node has
> received
> - What is the average load reported by responses this node has received,
> per remote node
> - What is the average load reported by responses this node forwarded, per
> remote node

Ahhh, this one could be interesting - you could use it to penalise nodes which
spam excessively.

> I think, given these metrics, we should be able to do the following:
>
> - Limit our overall rate of initiating local requests based on the global
> average reported load

We can already do this on the basis of AIMD's, based purely on behaviour of
local requests. However, taking more data into account might result in more
accurate results - although it might also dilute the feedback. In any case,
this part is "safe": it's not going to cause feedback problems on its own.
Conversely, it doesn't deal with problems such as network bottlenecks.
However, while AIMD's are based on the *global* average load (or rather the
load of all the nodes along the chain), it sounds like you are proposing to
only report the *local* load of our direct peers? How are you going to compute
the global average load?

> - Limit our rate of local requests based on the average load of the
> connection to the peer it would need to be forwarded to

Not sure I follow here. Are you proposing that we initiate more requests in
keyspace areas where there is lower load?

> - Detect when remote peers are abusing the system by disregarding load -
> as evidenced by a significantly higher average load in replies forwarded to
> them

Okay, that is useful - although I doubt it would be an unambiguous detection,
and what would we do about it? We would have to reject requests, and if we're
not sure, or if it could be a downstream troublemaker, couldn't that result in
misrouting and make things worse? On the other hand, any malicious node would
be most easily detected by the nodes directly connected to it, so this could
work much the same way as fair sharing - provided that it can reject or
terminate the requests.

Arguably fair sharing achieves the same objective. It could be extended in
various ways e.g. if we are relaying a lot of slow-down notices to a peer we
could start to think about squeezing the peer's allocation.

> Of course, lots of questions:
>
> - How do we compute the averages? A decaying running average of some
> kind? What time period?

Yay more tunables. :(

> - How do we translate load into a desired rate of requests?

And even more tunables. :(

> - What criteria indicate that a peer is abusing the system? What is the
> remedy?

In the current system, don't accept their requests. In ANY system, there will
always be a borderline below which they can get away with it. We can either
try to make this as small as possible or try not to hit legitimate traffic too
hard.
> This is basically control theory stuff, and we'll need robust answers to
> these questions before we proceed (ideally with a theoretical rather than
> experimental foundation).

IMHO small tweaks to AIMDs, combined with fair sharing, will achieve much the
same without having to wait years for our theoreticians to wake up.
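[Editorial note: the "decaying running average" under discussion is usually an
exponentially weighted moving average, which has exactly one tunable, the
decay factor. A minimal sketch follows; fred has its own RunningAverage
implementations, so names here are illustrative only.]

```python
class DecayingAverage:
    """Exponentially weighted moving average of reported load values.

    alpha is the single tunable: higher alpha forgets old samples faster.
    """
    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.value = None

    def report(self, sample):
        if self.value is None:
            self.value = sample          # seed with the first sample
        else:
            # Move a fraction alpha of the way toward the new sample.
            self.value += self.alpha * (sample - self.value)
        return self.value
```

For example, with alpha=0.5, samples 0.0 then 1.0 give an average of 0.5, and
a further 1.0 moves it to 0.75; the "time period" question becomes a question
of choosing alpha.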
[freenet-dev] Beyond New Load Management
ook in, rather than be rejected at the same rate as everyone else.

> > - The easiest way to implement #1 is with AIMD's. We can keep them, but
> > send slow-down messages earlier.
> > - We could keep NLM, i.e. queue for limited times to smooth out delays.
>
> Doesn't that violate the principle that the load balancing stuff shouldn't
> do anything that makes the problem worse? Queueing does exactly that.

So what do you do if the incoming requests are not at the same time as the
completed outgoing requests? I guess you keep a large number of request slots
free at any given time?

> > Currently we RNF after a (longish) period waiting for a node to route to.
> > We would send a slow-down message when we have queued for more than a
> > certain period. This means we can have a fixed, limited amount of
> > misrouting, we can keep NLM's clear benefits for routing accuracy (as
> > visible in success rates), and we can ensure that the input load is low
> > enough to avoid severe slowdown due to queueing for a long time.
>
> No, I think we need to avoid queueing except in case of emergency. Queueing
> only makes things worse by tying up more resources for longer.

I have the same attitude to misrouting.
[freenet-dev] Beyond New Load Management
> *local* requests are initiated, based on how many slow-down signals they
> get.
>
> (This ensures that such gridlock or feedback loop problems can't happen)
>
> The main problems are security:
>
> - Deliberate flooding: Malicious nodes not following AIMD's, in order to
> clog up the network.
>
> Fair sharing can limit this. Limiting backoff so that we only misroute when
> nodes are severely overloaded can also help. We could queue (provided this
> happens only rarely and briefly), allow bounded misrouting (e.g. route to
> one of the top two routes), or terminate requests (thus freeing up
> resources).
>
> - Opportunistic flooding: Users hack their nodes to not follow AIMD's
> because it speeds up their downloads.
>
> This is harder than solving deliberate flooding but also less urgent. Fair
> sharing helps, but the slow-down signals, or rejections/terminations, may be
> generated a long way away from the originator. In which case, they are only
> proportionately affected, meaning it's still beneficial for them to flood.
> It may be that this isn't resolvable and we'll have to continue to rely on a
> certain amount of altruism; ask people to not use the Fr33n3t L33t L34ch
> Cl13nt cos it slows things down for everyone else, should this arise.
>
> - Local privacy: We can maybe identify which requests are local based on
> timings.
>
> This is somewhat speculative; while it is probably exploitable, there are
> much easier attacks if you are directly connected to the target.
>
> - Remote privacy: We can maybe correlate different bundles of requests
> based on timings.
>
> This is speculative, but it could be solved by using separate AIMD's for
> separate unconnected bundles of requests (we will need some sort of
> tagging/grouping mechanism like this when we have tunnels anyway).
>
> 2. Nodes control the rate at which remote requests are initiated too,
> propagating information on the capacity of the network. For instance, they
> could estimate the capacity of the network and generate slow-down signals
> when incoming requests are over that capacity (and eventually terminate
> requests); this is self-feeding because their estimate of capacity depends
> on *incoming* slow-down messages!
>
> The main problem with this is that feedback between nodes could well result
> in the estimated capacity collapsing to zero or whatever the minimum is. It
> is probably possible to surmount that but would certainly require a great
> deal of theoretical (simulation) work.
>
> We can reuse existing code to a significant degree:
>
> - RejectedOverload should be split up. It already represents a slow-down
> message (provided local=false). It should also indicate whether this is due
> to an actual rejection or to using more than expected resources.
> - The easiest way to implement #1 is with AIMD's. We can keep them, but
> send slow-down messages earlier.
> - We could keep NLM, i.e. queue for limited times to smooth out delays.
> Currently we RNF after a (longish) period waiting for a node to route to.
> We would send a slow-down message when we have queued for more than a
> certain period. This means we can have a fixed, limited amount of
> misrouting, we can keep NLM's clear benefits for routing accuracy (as
> visible in success rates), and we can ensure that the input load is low
> enough to avoid severe slowdown due to queueing for a long time.

To be clear, the benefits of keeping NLM/queueing are that incoming request
timings are often not precisely matched with the availability of outgoing
slots. However it is essential that the load be propagated back to the
originators AS WELL, so that queueing times are kept very low on average.
(Culmination of a long conversation where ArneBab was arguing I shouldn't give
up on NLM, see logs)

[18:37:55] ArneBab: i think we might be able to keep NLM
[18:38:32] mainly using queueing as a way to smooth out the fact that the
inter-request interval is quite long, so there's a good chance there isn't a
free slot immediately when a request arrives
[18:38:50] ArneBab: we will need to combine it with some sort of
early-slow-down-signal mechanism though
[18:38:55] ArneBab: see my latest reply to ian on devl
[18:39:17] yes
[18:39:33] okay, so we are on roughly the same page
[18:40:34] jupp
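[Editorial note: the AIMD mechanism referred to throughout is the same
additive-increase/multiplicative-decrease scheme TCP uses for congestion
control; sending slow-down signals earlier only changes *when* the decrease
fires, not the controller itself. A sketch with invented constants, not
fred's actual request-window code:]

```python
# AIMD request-rate controller: additive increase while requests succeed,
# multiplicative decrease on a slow-down signal (RejectedOverload).
# All constants are illustrative assumptions.

class AIMDRateLimiter:
    def __init__(self, rate=1.0, increase=0.1, decrease=0.5,
                 min_rate=0.1, max_rate=100.0):
        self.rate = rate            # allowed requests per second
        self.increase = increase    # additive step per successful request
        self.decrease = decrease    # multiplicative factor on slow-down
        self.min_rate = min_rate
        self.max_rate = max_rate

    def on_success(self):
        self.rate = min(self.rate + self.increase, self.max_rate)

    def on_slow_down(self):
        # An early (non-local) RejectedOverload lands here too, so the
        # originator backs off *before* nodes are forced to queue or reject.
        self.rate = max(self.rate * self.decrease, self.min_rate)
```

The sawtooth this produces is what keeps total input load hovering just below
capacity instead of oscillating between gridlock and idleness.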
[freenet-dev] Beyond New Load Management
riefly), allow bounded misrouting (e.g. route to one of the top two routes),
or terminate requests (thus freeing up resources).

- Opportunistic flooding: Users hack their nodes to not follow AIMD's because
it speeds up their downloads.

This is harder than solving deliberate flooding but also less urgent. Fair
sharing helps, but the slow-down signals, or rejections/terminations, may be
generated a long way away from the originator. In which case, they are only
proportionately affected, meaning it's still beneficial for them to flood. It
may be that this isn't resolvable and we'll have to continue to rely on a
certain amount of altruism; ask people to not use the Fr33n3t L33t L34ch
Cl13nt cos it slows things down for everyone else, should this arise.

- Local privacy: We can maybe identify which requests are local based on
timings.

This is somewhat speculative; while it is probably exploitable, there are much
easier attacks if you are directly connected to the target.

- Remote privacy: We can maybe correlate different bundles of requests based
on timings.

This is speculative, but it could be solved by using separate AIMD's for
separate unconnected bundles of requests (we will need some sort of
tagging/grouping mechanism like this when we have tunnels anyway).

2. Nodes control the rate at which remote requests are initiated too,
propagating information on the capacity of the network. For instance, they
could estimate the capacity of the network and generate slow-down signals when
incoming requests are over that capacity (and eventually terminate requests);
this is self-feeding because their estimate of capacity depends on *incoming*
slow-down messages!

The main problem with this is that feedback between nodes could well result in
the estimated capacity collapsing to zero or whatever the minimum is. It is
probably possible to surmount that but would certainly require a great deal of
theoretical (simulation) work.

We can reuse existing code to a significant degree:

- RejectedOverload should be split up. It already represents a slow-down
message (provided local=false). It should also indicate whether this is due to
an actual rejection or to using more than expected resources.
- The easiest way to implement #1 is with AIMD's. We can keep them, but send
slow-down messages earlier.
- We could keep NLM, i.e. queue for limited times to smooth out delays.
Currently we RNF after a (longish) period waiting for a node to route to. We
would send a slow-down message when we have queued for more than a certain
period. This means we can have a fixed, limited amount of misrouting, we can
keep NLM's clear benefits for routing accuracy (as visible in success rates),
and we can ensure that the input load is low enough to avoid severe slowdown
due to queueing for a long time.

> Ian.
[freenet-dev] Fundraising
The project's current state is reflected in a dollar amount of remaining
project funds, along with an estimate of how much additional development time
it corresponds to. This amount will decrease suddenly and significantly when a
payment is made. It's not - at least supposed to be - a secret that the
Freenet project is set to run out of funds mid-September. I've mentioned to
people - on Sone, in #freenet - that the project is running out of funds and
they had no idea. This is a problem. I propose (in decreasing importance)
that:

1) The donation page mention _when_ the last ~2-month payment was made.
   - Or ideally, that the time remaining estimate include remaining funded
     development time. This need not be exact: it would allow the estimate to
     decrease gradually instead of in sudden chunks.

2) Fundraising be presented as drives for a month's worth of funding.
   - This would allow donating to have goals, and be framed as increasing
     toward something tangible (a month of development) rather than building
     up funds against sudden, quiet decrease.
   - The estimate ceases to be meaningful if it sits at "8 days" for two
     months.

3) Release announcements include statements on the project's current financial
   state and how to donate.
   - Fred could have integrated release announcements for greater visibility.
     These could of course be disabled by the user.

4) Release announcements also be displayed on the freenetproject.org homepage
   in something akin to the "Latest Releases"/"News" sidebar on winehq.org.
   - Having only news posts that are several months old makes the project
     appear inactive, which it is not.

It should go without saying that a 0.8 release would probably make good press,
assuming it can be made stable before funding runs out.
[freenet-dev] Fundraising
These are good ideas Steve, I must confess that I've been a little distracted
lately - and we do definitely need to get our act together on funding.

Matthew, what is our current "drop-dead" date as best as you can estimate it?

Ian.

On Mon, Aug 29, 2011 at 4:18 PM, Steve Dougherty wrote:
> The project's current state is reflected in a dollar amount of remaining
> project funds, along with an estimate of how much additional development
> time it corresponds to. This amount will decrease suddenly and significantly
> when a payment is made. It's not - at least supposed to be - a secret that
> the Freenet project is set to run out of funds mid-September. I've mentioned
> to people - on Sone, in #freenet - that the project is running out of funds
> and they had no idea. This is a problem. I propose (in decreasing
> importance) that:
>
> 1) The donation page mention _when_ the last ~2-month payment was made.
>    - Or ideally, that the time remaining estimate include remaining funded
>      development time. This need not be exact: it would allow the estimate
>      to decrease gradually instead of in sudden chunks.
>
> 2) Fundraising be presented as drives for a month's worth of funding.
>    - This would allow donating to have goals, and be framed as increasing
>      toward something tangible (a month of development) rather than
>      building up funds against sudden, quiet decrease.
>    - The estimate ceases to be meaningful if it sits at "8 days" for two
>      months.
>
> 3) Release announcements include statements on the project's current
>    financial state and how to donate.
>    - Fred could have integrated release announcements for greater
>      visibility. These could of course be disabled by the user.
>
> 4) Release announcements also be displayed on the freenetproject.org
>    homepage in something akin to the "Latest Releases"/"News" sidebar on
>    winehq.org.
>    - Having only news posts that are several months old makes the project
>      appear inactive, which it is not.
>
> It should go without saying that a 0.8 release would probably make good
> press, assuming it can be made stable before funding runs out.

--
Ian Clarke
Personal blog: http://blog.locut.us/
[freenet-dev] Beyond New Load Management: A proposal
On Mon, Aug 29, 2011 at 2:21 PM, Matthew Toseland wrote:
> > > - What is the average load reported by responses this node forwarded,
> > > per remote node
> >
> > Ahhh, this one could be interesting - you could use it to penalise nodes
> > which spam excessively.
>
> Actually, thinking about it, I don't think this will work. Requests are
> usually distributed evenly across the keyspace, and downstream nodes only
> know they are from this node - not which previous node they are from. So
> the pain will be shared across all the peers sending us requests, including
> our own local requests, and won't be identifiable as due to any single
> node. You have to do it by volume, not by reported load.

But if a node is being abusive won't the nodes it's requesting from tend to
have a higher than average load? Perhaps you are right though, but this is the
kind of thought-process we need.

Ian.

--
Ian Clarke
Founder, The Freenet Project
Email: ian at freenetproject.org
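[Editorial note: Toseland's point that detection has to go "by volume, not by
reported load" amounts to comparing each peer's share of accepted requests
against an even split, which is essentially what fair sharing measures. A
hypothetical sketch; the threshold factor is an assumption, not a measured
value:]

```python
from collections import Counter

# Sketch of volume-based abuse detection: flag any peer whose share of our
# accepted requests greatly exceeds a fair (even) split across peers.

def flag_heavy_peers(request_counts: Counter, factor=3.0):
    """Return peers sending more than `factor` times the fair share."""
    total = sum(request_counts.values())
    if total == 0:
        return []
    fair_share = total / len(request_counts)
    return [peer for peer, n in request_counts.items()
            if n > factor * fair_share]
```

For example, with peers sending 10, 12, 11, and 200 requests, only the last is
flagged; a remedy (rejection, or disconnection as discussed above) is a
separate policy decision.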
[freenet-dev] Beyond New Load Management: A proposal
in
> various ways e.g. if we are relaying a lot of slow-down notices to a peer
> we could start to think about squeezing the peer's allocation.

Yes, except it doesn't work, and it's too complicated to reason about.

> > Of course, lots of questions:
> >
> > - How do we compute the averages? A decaying running average of some
> > kind? What time period?
>
> Yay more tunables. :(

Ideally we'd do this in a way that avoids tunables. This is control theory, we
shouldn't have to reinvent the wheel.

> > - How do we translate load into a desired rate of requests?
>
> And even more tunables. :(

Don't be pessimistic, hopefully we can avoid tunables if we are smart about
it.

> > - What criteria indicate that a peer is abusing the system? What is the
> > remedy?
>
> In the current system, don't accept their requests.

Yes, although I'm leaning towards disconnecting from them, which is
more-or-less the same thing, just more permanent and they don't consume our
resources any more.

> In ANY system, there will always be a borderline below which they can get
> away with it. We can either try to make this as small as possible or try
> not to hit legitimate traffic too hard.

Abuse should be rare, we should bias against false positives when we look for
it.

> > This is basically control theory stuff, and we'll need robust answers to
> > these questions before we proceed (ideally with a theoretical rather than
> > experimental foundation).
>
> IMHO small tweaks to AIMDs, combined with fair sharing, will achieve much
> the same without having to wait years for our theoreticians to wake up.

Yes, small tweaks have worked so well for us for the last decade, leaving us
pretty-much where we were in 2003. No, we don't understand how the current
system works, there is no point in trying to fix something when we don't even
know what is broken.

I am touched by your faith in "theoreticians", but I have a feeling we're in
uncharted territory here.
We need to agree on an approach that is simple enough to reason about, and then build a simulation for it. I think the simulation should only be a few days of work; we're not talking years here, hopefully not even months.

Ian.

--
Ian Clarke
Founder, The Freenet Project
Email: ian at freenetproject.org
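[Editor's note] The simulation Ian asks for could start as small as the toy model below: nodes report a load in [0, 1] as in his proposal, and originators throttle their request rate by the network-wide average reported load. Every class name, constant, and the throttling rule itself are illustrative assumptions, not Freenet code.

```python
import random

class Node:
    def __init__(self, capacity):
        self.capacity = capacity   # requests this node can serve per tick
        self.inflight = 0          # requests currently queued here

    def load(self):
        # 0.0 = idle, 1.0 = saturated, as in Ian's proposal.
        return min(1.0, self.inflight / self.capacity)

def tick(nodes, base_rate):
    # Originators scale their request rate by (1 - average reported load).
    avg_load = sum(n.load() for n in nodes) / len(nodes)
    for _ in range(int(base_rate * (1.0 - avg_load))):
        random.choice(nodes).inflight += 1   # inject a request somewhere
    for n in nodes:
        n.inflight = max(0, n.inflight - n.capacity)  # serve up to capacity
    return avg_load

random.seed(1)
nodes = [Node(capacity=5) for _ in range(100)]
loads = [tick(nodes, base_rate=600) for _ in range(50)]
# If the feedback works, reported load should stay bounded rather than
# ratcheting up to permanent saturation.
print(max(loads))
```

Even a model this crude lets one ask the questions raised in the thread: does the average-load feedback converge, oscillate, or gridlock when every node throttles on the same signal?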
[freenet-dev] Beyond New Load Management: A proposal
Ok, thinking about it, here is a proposal, or rather, the beginning of a proposal. I'm assuming that we get rid of NLM, fair sharing, and anything else intended to control load, and replace it with this. We will absolutely need to simulate this before we write a single line of code to deploy it.

The core idea is that a node will include a floating point number in response to any kind of request showing how close that node is to being overloaded. 0.0 would mean it's doing nothing, 1.0 would mean that it is completely saturated and must reject requests. Clearly the goal is to avoid getting anywhere near to 1.0.

A node tracks several things:

- What is the overall average load reported by responses this node has received
- What is the average load reported by responses this node has received, per remote node
- What is the average load reported by responses this node forwarded, per remote node

I think, given these metrics, we should be able to do the following:

- Limit our overall rate of initiating local requests based on the global average reported load
- Limit our rate of local requests based on the average load of the connection to the peer it would need to be forwarded to
- Detect when remote peers are abusing the system by disregarding load, as evidenced by a significantly higher average load in replies forwarded to them

Of course, lots of questions:

- How do we compute the averages? A decaying running average of some kind? What time period?
- How do we translate load into a desired rate of requests?
- What criteria indicate that a peer is abusing the system? What is the remedy?

This is basically control theory stuff, and we'll need robust answers to these questions before we proceed (ideally with a theoretical rather than experimental foundation).

Ian.

--
Ian Clarke
Founder, The Freenet Project
Email: ian at freenetproject.org
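[Editor's note] Ian's first open question, "a decaying running average of some kind?", is conventionally answered with an exponentially weighted moving average, where a single smoothing factor stands in for the time window. A minimal sketch under that assumption (the class and the alpha value are hypothetical, not Freenet's implementation):

```python
class DecayingAverage:
    """Exponentially weighted moving average of reported loads.

    alpha sets the effective time window: larger alpha forgets old
    samples faster. This is one possible answer to "a decaying running
    average of some kind?", not Freenet's actual code.
    """
    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.value = None

    def report(self, load):
        if self.value is None:
            self.value = load                      # seed with first sample
        else:
            self.value += self.alpha * (load - self.value)
        return self.value

avg = DecayingAverage(alpha=0.5)
for sample in [0.0, 1.0, 1.0, 1.0]:
    avg.report(sample)
print(avg.value)  # 0.875: converging toward 1.0 after a burst of overload reports
```

The "what time period?" question then reduces to choosing alpha, or to decaying by elapsed time rather than by message count if responses arrive at uneven rates.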
[freenet-dev] Beyond New Load Management
On Mon, Aug 29, 2011 at 12:37 PM, Matthew Toseland <toad at amphibian.dyndns.org> wrote:

> That is because we do not have the time or funding to empirically test hypotheses. We don't gather enough data, we don't have a huge testnet to try stuff on over extended periods, and so on. Most software development has deadlines and resource limitations, but most software development doesn't NEED to be empirical, at least not very often! See my other mail about this.

I don't think the problem is a lack of time or funding. The problem is that we come up with solutions that are too complicated to analyze or fix when they don't work. Throwing more time and money at that might help, but that is addressing the symptom, not the cause. The cause is complexity, which just grows and grows as we try to fix problems we don't understand by layering on more complexity.

> Misrouting is unacceptable, in general. Extremely overloaded or extremely low capacity nodes may be routed around. We might even allow some bounded amount of misrouting in the more general case (e.g. go to either of the top two peers for the key). But in general, transforming load into misrouting (or into reduced HTL, or any other bogus escape valve) is a bad idea. We need to reduce the incoming load.

Right, the same is true of queueing. If nodes are forced to do things to deal with overloading that make the problem worse then the load balancing algorithm has failed. Its job is to prevent that from happening.

> > What if all nodes told other nodes that they were overloaded because all their peers are overloaded. Such a situation would basically cause the entire network to think it was overloaded, even though nobody actually was! It becomes a bit like this cartoon: http://flic.kr/p/5npfm2 How can this be avoided?
>
> Exactly. There are two possible approaches:
> 1. Nodes control the rate at which *local* requests are initiated, based on how many slow-down signals they get. (This ensures that such gridlock or feedback loop problems can't happen)

Ah, so slow-downs only ever affect local request initiation? The problem there is that if a peer is overloaded, only those peers that are directly connected to it will slow down their rate of requests. Peers elsewhere on the network will continue to fire requests at it, and there will be no way for them to know they need to slow down :-/

Unless... what if the DataReply message (or whatever it's called these days) contains a slow-down request, and it's respected by every peer along that path according to AIMD? The problem is that there will be a very weak link between the rate of requests and the rate of slow-down messages. It would be like having 10,000 individual heaters with thermostats all trying to regulate the temperature of a gigantic space. Each individual thermostat/heater will have an almost imperceptible impact. Although perhaps that would work...?

> Fair sharing can limit this. Limiting backoff so that we only misroute when nodes are severely overloaded can also help. We could queue (provided this happens only rarely and briefly), allow bounded misrouting (e.g. route to one of the top two routes), or terminate requests (thus freeing up resources).

I think "fair sharing" is one of those complexities that we need to go back to the drawing board on. It's clear that even now we have hypotheses about how fair sharing could be causing problems. It's too complicated to really reason about, or to anticipate all of its implications.

> - The easiest way to implement #1 is with AIMD's. We can keep them, but send slow-down messages earlier.
> - We could keep NLM, i.e. queue for limited times to smooth out delays.

Doesn't that violate the principle that the load balancing stuff shouldn't do anything that makes the problem worse? Queueing does exactly that.

> Currently we RNF after a (longish) period waiting for a node to route to. We would send a slow-down message when we have queued for more than a certain period. This means we can have a fixed, limited amount of misrouting, we can keep NLM's clear benefits for routing accuracy (as visible in success rates), and we can ensure that the input load is low enough to avoid severe slowdown due to queueing for a long time.

No, I think we need to avoid queueing except in case of emergency. Queueing only makes things worse by tying up more resources for longer.

Ian.

--
Ian Clarke
Founder, The Freenet Project
Email: ian at freenetproject.org
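[Editor's note] The AIMD rule discussed above (additive-increase/multiplicative-decrease, the same family of rules TCP uses for congestion control) can be sketched as follows. The class, constants, and method names are illustrative assumptions, not Freenet's actual implementation:

```python
class AIMDWindow:
    """Additive-increase/multiplicative-decrease control of the number of
    local requests a node allows in flight. Grows slowly while requests
    succeed; cut sharply when a slow-down signal arrives."""

    def __init__(self, increase=1.0, decrease=0.5, floor=1.0):
        self.window = floor        # max local requests in flight
        self.increase = increase   # additive step on success
        self.decrease = decrease   # multiplicative factor on slow-down
        self.floor = floor         # never throttle below this

    def on_success(self):
        self.window += self.increase                  # additive increase

    def on_slow_down(self):
        # Multiplicative decrease: halve (by default) but keep a floor
        # so the node can still probe for recovered capacity.
        self.window = max(self.floor, self.window * self.decrease)

w = AIMDWindow()
for _ in range(8):
    w.on_success()
w.on_slow_down()
print(w.window)  # (1 + 8) * 0.5 = 4.5
```

The "weak link" Ian worries about shows up here as the question of *who* calls `on_slow_down`: only the request originator (approach #1 in the thread), or every relaying peer along the reply path.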
Re: [freenet-dev] Beyond New Load Management
On Saturday 27 Aug 2011 21:35:58 Ian Clarke wrote:

> Matthew,
>
> This makes sense from the perspective of making incremental changes, but I think we need to be more drastic than that. I think we need to go back to the drawing board with load management. We need to find a solution that is simple enough to reason about, and to debug if we have problems with it. The entire approach of coming up with hypotheses about what is wrong, building a solution based on these hypotheses (without actually confirming that the hypotheses are accurate) and deploying it is deja vu; we've been doing it for a decade, and we still haven't got load management right. We're just layering more complexity onto a system that we already don't understand, based on guesses as to what the problems were with the previous iteration that we can't test because the system is too complicated, with too many interactions, for anyone to get their heads around it.

That is because we do not have the time or funding to empirically test hypotheses. We don't gather enough data, we don't have a huge testnet to try stuff on over extended periods, and so on. Most software development has deadlines and resource limitations, but most software development doesn't NEED to be empirical, at least not very often! See my other mail about this.

> If something isn't working, and we don't understand for definite what is wrong with it, then we shouldn't build more on top of it in the hope that we'll accidentally solve the problem, we should replace it with something simple enough that we do understand it, right? The purpose of load management is relatively simple: *Don't allow clients to pump more requests into the network than the network can handle, while ensuring that this workload is distributed across the network in a reasonably efficient manner. This must be done in a decentralized way that is not vulnerable to abuse.*
>
> The current load management system includes many different interacting components that make it nearly impossible to understand or debug. I think we need to go back to the drawing board with load management, starting from the goal I cite above. I would invite people to suggest the simplest possible load management schemes that might work; let's then discuss them, and figure out which is most likely to work, and if it doesn't work, which will be easiest to debug. We can bear in mind a few lessons we've learned though. What characteristics should our load balancing system have?
>
> - Dropping requests should be extremely rare because this just exacerbates overloading
> - Delaying requests should also be extremely rare for the same reason
> - Misrouting requests should be limited, but perhaps acceptable occasionally, for the same reason

Misrouting is unacceptable, in general. Extremely overloaded or extremely low capacity nodes may be routed around. We might even allow some bounded amount of misrouting in the more general case (e.g. go to either of the top two peers for the key). But in general, transforming load into misrouting (or into reduced HTL, or any other bogus escape valve) is a bad idea. We need to reduce the incoming load.

> It therefore really comes down to nodes anticipating whether the network is in danger of having to drop, delay, or misroute requests, and reduce the rate at which they are pumping requests into the network if this danger point is reached. So how do nodes inform each other that they are at risk of having to drop, delay, or misroute requests? There are two reasons this might happen. Either a node itself is close to its own capacity for relaying requests, or the other nodes it relays to are themselves at or close to capacity.
>
> One problem I'm concerned about when nodes share information about how overloaded their peers are is a gridlock: What if all nodes told other nodes that they were overloaded because all their peers are overloaded? Such a situation would basically cause the entire network to think it was overloaded, even though nobody actually was! It becomes a bit like this cartoon: http://flic.kr/p/5npfm2 How can this be avoided?

Exactly. There are two possible approaches:

1. Nodes control the rate at which *local* requests are initiated, based on how many slow-down signals they get. (This ensures that such gridlock or feedback loop problems can't happen.)

The main problems are security:

- Deliberate flooding: Malicious nodes not following AIMD's, in order to clog up the network. Fair sharing can limit this. Limiting backoff so that we only misroute when nodes are severely overloaded can also help. We could queue (provided this happens only rarely and briefly), allow bounded misrouting (e.g. route to one of the top two routes), or terminate requests (thus freeing up resources).
- Opportunistic flooding: Users hack their nodes to not follow AIMD's because it speeds up their downloads. This is harder than
Re: [freenet-dev] Beyond New Load Management
On Monday 29 Aug 2011 18:58:26 Ian Clarke wrote:

> On Mon, Aug 29, 2011 at 12:37 PM, Matthew Toseland <t...@amphibian.dyndns.org> wrote:
> > That is because we do not have the time or funding to empirically test hypotheses. We don't gather enough data, we don't have a huge testnet to try stuff on over extended periods, and so on. Most software development has deadlines and resource limitations, but most software development doesn't NEED to be empirical, at least not very often! See my other mail about this.
>
> I don't think the problem is a lack of time or funding. The problem is that we come up with solutions that are too complicated to analyze or fix when they don't work. Throwing more time and money at that might help, but that is addressing the symptom, not the cause. The cause is complexity, which just grows and grows as we try to fix problems we don't understand by layering on more complexity.

Perhaps. Ripping everything out and starting again is always tempting, especially when subordinates have been doing things without asking you. :)

> > Misrouting is unacceptable, in general. Extremely overloaded or extremely low capacity nodes may be routed around. We might even allow some bounded amount of misrouting in the more general case (e.g. go to either of the top two peers for the key). But in general, transforming load into misrouting (or into reduced HTL, or any other bogus escape valve) is a bad idea. We need to reduce the incoming load.
>
> Right, the same is true of queueing. If nodes are forced to do things to deal with overloading that make the problem worse then the load balancing algorithm has failed. Its job is to prevent that from happening.

Okay.

> > > What if all nodes told other nodes that they were overloaded because all their peers are overloaded. Such a situation would basically cause the entire network to think it was overloaded, even though nobody actually was! It becomes a bit like this cartoon: http://flic.kr/p/5npfm2 How can this be avoided?
> >
> > Exactly. There are two possible approaches:
> > 1. Nodes control the rate at which *local* requests are initiated, based on how many slow-down signals they get. (This ensures that such gridlock or feedback loop problems can't happen)
>
> Ah, so slow-downs only ever affect local request initiation? The problem there is that if a peer is overloaded, only those peers that are directly connected to it will slow down their rate of requests. Peers elsewhere on the network will continue to fire requests at it, and there will be no way for them to know they need to slow down :-/

No, because the slow-downs are propagated back to the original request originator, and only the request originator takes notice of them. That's how it works now.

> Unless... what if the DataReply message (or whatever it's called these days) contains a slow-down request, and it's respected by every peer along that path according to AIMD? The problem is that there will be a very weak link between the rate of requests and the rate of slow-down messages. It would be like having 10,000 individual heaters with thermostats all trying to regulate the temperature of a gigantic space. Each individual thermostat/heater will have an almost imperceptible impact. Although perhaps that would work...?

The rest is plausible - feedback might be too slow, but this can be helped in part by an increased window size between the point at which we start sending slow-down messages and the point at which things start breaking more badly.

> > Fair sharing can limit this. Limiting backoff so that we only misroute when nodes are severely overloaded can also help. We could queue (provided this happens only rarely and briefly), allow bounded misrouting (e.g. route to one of the top two routes), or terminate requests (thus freeing up resources).
>
> I think fair sharing is one of those complexities that we need to go back to the drawing board on. It's clear that even now we have hypotheses about how fair sharing could be causing problems. It's too complicated to really reason about, or to anticipate all of its implications.

It is vital for:

1. Security: It quashes DoS attacks/floods quickly. Without it, any attacker can easily swamp any node it is connected to, using all of its capacity for sending its flood onward. With it, they are limited to a fraction of the node's capacity. This is a concern that load management must deal with!
2. Nodes with low numbers of peers, and slower nodes in general, to get a look in, rather than be rejected at the same rate as everyone else.

> > - The easiest way to implement #1 is with AIMD's. We can keep them, but send slow-down messages earlier.
> > - We could keep NLM, i.e. queue for limited times to smooth out delays.
>
> Doesn't that violate the principle that the load balancing stuff shouldn't do anything that makes the problem worse? Queueing does exactly that.

So what do you
Re: [freenet-dev] Beyond New Load Management: A proposal
On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:

> Ok, thinking about it, here is a proposal, or rather, the beginning of a proposal. I'm assuming that we get rid of NLM, fair sharing, and anything else intended to control load, and replace it with this. We will absolutely need to simulate this before we write a single line of code to deploy it.

I have justified fair sharing in my other reply. However alternate mechanisms might replace it. I agree we should simulate, although this might be significant work.

> The core idea is that a node will include a floating point number in response to any kind of request showing how close that node is to being overloaded. 0.0 would mean it's doing nothing, 1.0 would mean that it is completely saturated and must reject requests. Clearly the goal is to avoid getting anywhere near to 1.0. A node tracks several things:

All of these apply to both locally originated requests and remotely originated requests? And are they for only our direct peers?

> - What is the overall average load reported by responses this node has received
> - What is the average load reported by responses this node has received, per remote node
> - What is the average load reported by responses this node forwarded, per remote node

Ahhh, this one could be interesting - you could use it to penalise nodes which spam excessively.

> I think, given these metrics, we should be able to do the following:
>
> - Limit our overall rate of initiating local requests based on the global average reported load

We can already do this on the basis of AIMD's, based purely on behaviour of local requests. However, taking more data into account might result in more accurate results - although it might also dilute the feedback. In any case, this part is safe: it's not going to cause feedback problems on its own. Conversely, it doesn't deal with problems such as network bottlenecks. However, while AIMD's are based on the *global* average load (or rather the load of all the nodes along the chain), it sounds like you are proposing to only report the *local* load of our direct peers? How are you going to compute the global average load?

> - Limit our rate of local requests based on the average load of the connection to the peer it would need to be forwarded to

Not sure I follow here. Are you proposing that we initiate more requests in keyspace areas where there is lower load?

> - Detect when remote peers are abusing the system by disregarding load - as evidenced by a significantly higher average load in replies forwarded to them

Okay, that is useful - although I doubt it would be an unambiguous detection, and what would we do about it? We would have to reject requests, and if we're not sure, or if it could be a downstream troublemaker, couldn't that result in misrouting and make things worse? On the other hand, any malicious node would be most easily detected by the nodes directly connected to it, so this could work much the same way as fair sharing - provided that it can reject or terminate the requests. Arguably fair sharing achieves the same objective. It could be extended in various ways e.g. if we are relaying a lot of slow-down notices to a peer we could start to think about squeezing the peer's allocation.

> Of course, lots of questions:
>
> - How do we compute the averages? A decaying running average of some kind? What time period?

Yay more tunables. :(

> - How do we translate load into a desired rate of requests?

And even more tunables. :(

> - What criteria indicate that a peer is abusing the system? What is the remedy?

In the current system, don't accept their requests. In ANY system, there will always be a borderline below which they can get away with it. We can either try to make this as small as possible or try not to hit legitimate traffic too hard.

> This is basically control theory stuff, and we'll need robust answers to these questions before we proceed (ideally with a theoretical rather than experimental foundation).

IMHO small tweaks to AIMDs, combined with fair sharing, will achieve much the same without having to wait years for our theoreticians to wake up.
Re: [freenet-dev] Beyond New Load Management: A proposal
On Monday 29 Aug 2011 20:09:58 Matthew Toseland wrote:
> On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:
> > Ok, thinking about it, here is a proposal, or rather, the beginning of a proposal. I'm assuming that we get rid of NLM, fair sharing, and anything else intended to control load, and replace it with this. We will absolutely need to simulate this before we write a single line of code to deploy it.
>
> I have justified fair sharing in my other reply. However, alternate mechanisms might replace it. I agree we should simulate, although this might be significant work.
>
> > The core idea is that a node will include a floating point number in response to any kind of request showing how close that node is to being overloaded. 0.0 would mean it's doing nothing, 1.0 would mean that it is completely saturated and must reject requests. Clearly the goal is to avoid getting anywhere near to 1.0. A node tracks several things:
>
> All of these apply to both locally originated requests and remotely originated requests? And are they for only our direct peers?
>
> > - What is the overall average load reported by responses this node has received
> > - What is the average load reported by responses this node has received, per remote node
> > - What is the average load reported by responses this node forwarded, per remote node
>
> Ahhh, this one could be interesting - you could use it to penalise nodes which spam excessively.

Actually, thinking about it, I don't think this will work. Requests are usually distributed evenly across the keyspace, and downstream nodes only know they are from this node - not which previous node they are from. So the pain will be shared across all the peers sending us requests, including our own local requests, and won't be identifiable as due to any single node. You have to do it by volume, not by reported load.
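The "do it by volume" point above amounts to per-peer request accounting rather than load-report attribution. A rough sketch of what that could look like (the class, the ratio threshold, and the detection rule are all hypothetical, not anything Freenet implements):

```python
from collections import defaultdict

class VolumeTracker:
    """Count requests forwarded per peer; flag peers whose volume is
    far above the mean across all peers.

    ratio is an illustrative tunable: how many times the mean volume
    a peer must exceed before it counts as suspicious.
    """
    def __init__(self, ratio=2.0):
        self.counts = defaultdict(int)
        self.ratio = ratio

    def record(self, peer):
        self.counts[peer] += 1

    def suspicious_peers(self):
        if not self.counts:
            return []
        mean = sum(self.counts.values()) / len(self.counts)
        return [p for p, n in self.counts.items() if n > self.ratio * mean]
```

This sidesteps the attribution problem described above: a peer's own request volume is directly observable at the node it connects to, unlike downstream load reports, which mix traffic from all peers.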
Re: [freenet-dev] Beyond New Load Management: A proposal
On Mon, Aug 29, 2011 at 2:09 PM, Matthew Toseland t...@amphibian.dyndns.org wrote:
> On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:
> > Ok, thinking about it, here is a proposal, or rather, the beginning of a proposal. I'm assuming that we get rid of NLM, fair sharing, and anything else intended to control load, and replace it with this. We will absolutely need to simulate this before we write a single line of code to deploy it.
>
> I have justified fair sharing in my other reply. However, alternate mechanisms might replace it. I agree we should simulate, although this might be significant work.

I doubt it would be more work than trying it without simulation and failing for a decade, which was our previous approach.

> > The core idea is that a node will include a floating point number in response to any kind of request showing how close that node is to being overloaded. 0.0 would mean it's doing nothing, 1.0 would mean that it is completely saturated and must reject requests. Clearly the goal is to avoid getting anywhere near to 1.0. A node tracks several things:
>
> All of these apply to both locally originated requests and remotely originated requests? And are they for only our direct peers?

Every peer along the reply path will take note of the load value, so it will be a mixture of the loads of close-by peers and more distant peers. Obviously, since we are more likely to receive messages from peers that are closer, our average will automatically be biased towards closer peers.

> > - What is the average load reported by responses this node forwarded, per remote node
>
> Ahhh, this one could be interesting - you could use it to penalise nodes which spam excessively.

Exactly.

> > I think, given these metrics, we should be able to do the following:
> > - Limit our overall rate of initiating local requests based on the global average reported load
>
> We can already do this on the basis of AIMDs, based purely on the behaviour of local requests. However, taking more data into account might result in more accurate results - although it might also dilute the feedback.

Right, AIMD tends to control the rate at which a specific event occurs, like a dropped IP packet with TCP, or a dropped request as we currently use it. This is a bit like signaling that you have a headache by banging your head against the wall. It is also a rather coarse way to deliver feedback. A floating point load number conveys more information in a non-destructive way.

> In any case, this part is safe: it's not going to cause feedback problems on its own. Conversely, it doesn't deal with problems such as network bottlenecks.

Well, it depends. If we are tracking load on a per-outbound-connection basis then we can potentially take action to manage load per-connection in addition to overall.

> However, while AIMDs are based on the *global* average load (or rather the load of all the nodes along the chain), it sounds like you are proposing to only report the *local* load of our direct peers? How are you going to compute the global average load?

No, it's not the local load of our direct peers. It's the load reported in all reply messages that we forward; some of those will be from our direct peers, but most will be from remote peers.

> > - Limit our rate of local requests based on the average load of the connection to the peer it would need to be forwarded to
>
> Not sure I follow here. Are you proposing that we initiate more requests in keyspace areas where there is lower load?

Well, kinda, except the opposite. We might reduce requests in keyspaces where there is higher load.

> > - Detect when remote peers are abusing the system by disregarding load - as evidenced by a significantly higher average load in replies forwarded to them
>
> Okay, that is useful - although I doubt it would be an unambiguous detection, and what would we do about it? We would have to reject requests, and if we're not sure, or if it could be a downstream troublemaker, couldn't that result in misrouting and make things worse? On the other hand, any malicious node would be most easily detected by the nodes directly connected to it, so this could work much the same way as fair sharing - provided that it can reject or terminate the requests.

No, we probably wouldn't start rejecting requests - more likely we would disconnect from that node. Obviously we'd need to be fairly sure of abusive behavior before doing that - but abuse should be rare.

> Arguably fair sharing achieves the same objective. It could be extended in various ways, e.g. if we are relaying a lot of slow-down notices to a peer we could start to think about squeezing the peer's allocation.

Yes, except it doesn't work, and it's too complicated to reason about.

> > Of course, lots of questions:
> > - How do we compute the averages? A decaying running average of some kind? What time period?
>
> Yay more tunables.
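For reference, the AIMD scheme contrasted with the floating-point load number above fits in a few lines. This is a generic sketch of additive-increase/multiplicative-decrease (the constants and names are illustrative, not Freenet's actual congestion code); the point is that the only feedback it consumes is the binary reject/success event:

```python
class AIMD:
    """Additive-increase / multiplicative-decrease window control,
    as used by TCP and by Freenet's old request-rate limiting.
    """
    def __init__(self, increase=1.0, decrease=0.5, floor=1.0):
        self.window = floor        # allowed requests in flight
        self.increase = increase   # added on each success
        self.decrease = decrease   # multiplier on each reject/drop
        self.floor = floor

    def on_success(self):
        self.window += self.increase                  # additive increase

    def on_reject(self):
        # multiplicative decrease, never below the floor
        self.window = max(self.floor, self.window * self.decrease)
```

The "coarse feedback" criticism above is visible here: a sender learns nothing until a request is actually rejected, whereas a 0.0-1.0 load number could steer the window continuously before any rejection happens.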
Re: [freenet-dev] Beyond New Load Management: A proposal
On Mon, Aug 29, 2011 at 2:21 PM, Matthew Toseland t...@amphibian.dyndns.org wrote:
> > > - What is the average load reported by responses this node forwarded, per remote node
> >
> > Ahhh, this one could be interesting - you could use it to penalise nodes which spam excessively.
>
> Actually, thinking about it, I don't think this will work. Requests are usually distributed evenly across the keyspace, and downstream nodes only know they are from this node - not which previous node they are from. So the pain will be shared across all the peers sending us requests, including our own local requests, and won't be identifiable as due to any single node. You have to do it by volume, not by reported load.

But if a node is being abusive, won't the nodes it's requesting from tend to have a higher than average load? Perhaps you are right though, but this is the kind of thought-process we need.

Ian.

--
Ian Clarke
Founder, The Freenet Project
Email: i...@freenetproject.org
Re: [freenet-dev] Beyond New Load Management: A proposal
On Monday, 29 August 2011, 14:32:13, Ian Clarke wrote:
> Yes, small tweaks have worked so well for us for the last decade, leaving us pretty-much where we were in 2003. No, we don't understand how the current system works, there is no point in trying to fix something when we don't even know what is broken.

I'd like to present a clue about what is broken in NLM. Before I kill you with the log, here's the result: With NLM the latency of a request is a function of the raw bandwidth (not so with OLM), and NLM used only half my bandwidth after it had been deployed for 2 days (at the start much more). τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ! (with τ: transfer time, q: queue time (time to find the node), NLM: new load management, OLM: old load management.)

So first step: make sure all bandwidth gets used - maybe by allocating more slots till we use all allowed bandwidth. Better having to throttle a transfer than not using bandwidth.

*NLM should with the current network be slower than OLM by 23%. But in 18 months it should actually be faster by ~8% — given Moore's Law holds for upload bandwidth — because the routes are shorter.* The main advantage of NLM is that it should be much more resilient against attackers (DoS).

Now to the log - it's math and not cleaned up; you have been warned :)

<ArneBab> SSK-time: σ, CHK-time: ψ, success: Xs, fail: Xf.
<ArneBab> queue-time: q, transfer-time: τ, hops remaining: h, total hops: h₀, w: success rate
<ArneBab> ψs = τ(h) + q(h)
<ArneBab> ψf = q(h)
<ArneBab> ψ ~ w₁·ψs + (1-w₁)·ψf
<ArneBab> σs = τ(h) + q(h)
<ArneBab> σf = q(h)
<ArneBab> σ ~ w₂·ψs + (1-w₂)·ψf; w₂ ~ 15%
<ArneBab> num(ψ) / num(σ) ~ 1
<ArneBab> → time ~ σ + ψ
<ArneBab> q(h) depends on timeouts, as do w₁ and w₂
<ArneBab> time = w₁·ψs + (1-w₁)·ψf + w₂·ψs + (1-w₂)·ψf
<ArneBab> = w₁ · (τ(h) + q(h)) + (1-w₁)·q(h) + w₂ · (τ(h) + q(h)) + (1-w₂)·q(h)
<ArneBab> = t(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
<ArneBab> = τ(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
<ArneBab> in the congestion case q(h) ~ timeout
<ArneBab> timeout = o
<ArneBab> timeout: o
<ArneBab> w depends on the timeout *somehow*, but inversely
<ArneBab> o=0 → w=0
<ArneBab> assumption: o = ∞ → w₂ ~ 20%, w₁ ~ 100%
<ArneBab> assumption: o = ∞ → w₂ ~ 0.2, w₁ ~ 1
<ArneBab> correction: in the congestion case: q(h) ~ min(timeout, τ(h))
<ArneBab> timeout matters for q(h) only when timeout < τ(h)
<ArneBab> I try to: I still need a dependency of w on timeout
<ArneBab> … lets call it t(w)
<ArneBab> better: w(o) :)
<toad_> well, if there is a timeout, we have a fixed time, but we reduce the hops ...
<toad_> i thought w was success rate
<ArneBab> ah!
<ArneBab> and the success rates were in the NLM stats
<ArneBab> going mostly smoothly from 60% to 0%
<ArneBab> for the HTL
<toad_> right, success rate peaks at 18 or sometimes 16
<toad_> what are w1 vs w2?
<toad_> chk vs ssk i guess
<ArneBab> yes
-*- toad_ thinks considering both is probably overambitious at this stage?
<ArneBab> should not be too bad: SSKs drop much more rapidly at decreasing hops
<ArneBab> hops→HTL
<toad_> ψs is time for a successful chk; ψf is time for a failed chk ... in which case h in the first instance is low, and in the second instance is h0
<ArneBab> yes
<toad_> okay, i don't follow this line: time = w₁·ψs + (1-w₁)·ψf + w₂·ψs + (1-w₂)·ψf
<toad_> i thought w2 related to SSKs?
<ArneBab> uh, yes…
<ArneBab> time = w₁·ψs + (1-w₁)·ψf + w₂·σs + (1-w₂)·σf
<toad_> you have to appreciate i'm only just getting back into maths and physics after 12 years ...
<toad_> (retaking a-levels to get a degree)
<ArneBab> no probs, I'm also no expert in this. I try to get a relation between the time and the timeout, so we can try to find a minimum
<toad_> in any case, there are two different h's for the two uses of q(h) - h0 and h_avg
<toad_> h_avg for success and h0 for failure
<ArneBab> hm, yes
<ArneBab> which makes this harder…
<ArneBab> it's wrong anyway… the q(h_avg) was missing
<toad_> h_avg is somewhere between 5 and 10 imho
<toad_> at least it is if everything is working well and the input load isn't all really popular stuff (in which case it's answered quickly and can be ignored)
<ArneBab> = τ(h) · (w₁+w₂) + q(h) · (2-w₁-w₂) + q(h_avg) · (w₁+w₂)
<ArneBab> would have been too easy :)
<toad_> okay so here q(h) means q(h0) i.e. h = h0, max hops?
<ArneBab> jepp, and max hops sinks with falling timeout
<toad_> hmm?
<ArneBab> the max actual hops
<toad_> on the upside, q() is linear
<toad_> hopefully
<ArneBab> yes: q(h) = h·o
<ArneBab> (in the congestion case)
<toad_> the problem i have is it looks like q(1) ~= time [ for a full request ], unless load is very low
<ArneBab> so τ « q?
<ArneBab> τ much smaller than q?
<toad_> of course it's bounded by timeouts, but i'd expect a runaway feedback loop until it reaches heavy timeouts and effectively cuts the htl
<toad_> well, with OLM, success time for a CHK is 1m25s, unsuccessful is 19sec, so transfer time is at least 1 minute
<toad_> and less than 1m25; but with NLM, unsuccessful is 3 min+
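For readers skimming past the log: cleaned up, and taking into account toad_'s point that successful requests see h_avg hops while failures traverse all h₀, the derivation in the chat above comes out as (this is a restatement of the log, not a new result):

```latex
% psi: CHK request time, sigma: SSK request time
% w_1, w_2: CHK and SSK success rates; tau: transfer time; q: queue time
% Successes transfer and queue over ~h_avg hops; failures queue over h_0.
\begin{align*}
\psi_s &= \tau(h_{\mathrm{avg}}) + q(h_{\mathrm{avg}}), & \psi_f &= q(h_0),\\
\sigma_s &= \tau(h_{\mathrm{avg}}) + q(h_{\mathrm{avg}}), & \sigma_f &= q(h_0),\\
\mathrm{time} &= w_1\psi_s + (1-w_1)\psi_f + w_2\sigma_s + (1-w_2)\sigma_f\\
 &= (w_1+w_2)\,\bigl(\tau(h_{\mathrm{avg}}) + q(h_{\mathrm{avg}})\bigr)
   + (2-w_1-w_2)\,q(h_0).
\end{align*}
```

In the congested case the log takes q(h) ≈ h·o (o the timeout), which is how the total time becomes a function of the timeout and, through τ, of the raw bandwidth.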
[freenet-dev] Fundraising
The project's current state is reflected in a dollar amount of remaining project funds, along with an estimate of how much additional development time it corresponds to. This amount will decrease suddenly and significantly when a payment is made. It's not - at least supposed to be - a secret that the Freenet project is set to run out of funds mid-September. I've mentioned to people - on Sone, in #freenet - that the project is running out of funds and they had no idea. This is a problem.

I propose (in decreasing importance) that:

1) The donation page mention _when_ the last ~2-month payment was made.
   - Or ideally, that the time remaining estimate include remaining funded development time. This need not be exact: it would allow the estimate to decrease gradually instead of in sudden chunks.
2) Fundraising be presented as drives for a month's worth of funding.
   - This would allow donating to have goals, and be framed as increasing toward something tangible (a month of development) rather than building up funds against sudden, quiet decrease.
   - The estimate ceases to be meaningful if it sits at 8 days for two months.
3) Release announcements include statements on the project's current financial state and how to donate.
   - Fred could have integrated release announcements for greater visibility. These could of course be disabled by the user.
4) Release announcements also be displayed on the freenetproject.org homepage in something akin to the Latest Releases/News sidebar on winehq.org.
   - Having only news posts that are several months old makes the project appear inactive, which it is not.

It should go without saying that a 0.8 release would probably make good press, assuming it can be made stable before funding runs out.
Re: [freenet-dev] Beyond New Load Management: A proposal
Okay, I don't understand most of that; I might be able to check the math if it was written properly, but it looks difficult. However, as far as I can see:

- The most obvious way to increase bandwidth usage would be to increase the timeout time for output bandwidth liability (and at the same time increase the relevant block transfer timeouts).
- This would increase the number of slots but it would also increase the number of requests seeking them; I don't see why it would help matters.
- Running an excessive number of requests without adjusting the block transfer timeouts would result in some of the transfers timing out.
- It would also, as you mention, make SSK requests (especially failed SSK requests) even slower.
- I am quite confident that Moore's Law DOES NOT hold for upload bandwidth.
- As far as I can see, all the benefits of NLM re attackers are achieved by fair sharing.
- A system which is more resistant to attacks but slower probably isn't all that interesting if the attacks in question are relatively expensive anyway.
- Smaller block sizes would have a significant efficiency cost, and would probably make load management more difficult.

I apologise if you see this as rather a broadside after I encouraged you to analyse the problem, but it is not yet a convincing demonstration that queueing is a viable strategy and not the spawn of satan! :)

On Monday 29 Aug 2011 21:27:01 Arne Babenhauserheide wrote:
> On Monday, 29 August 2011, 14:32:13, Ian Clarke wrote:
> > Yes, small tweaks have worked so well for us for the last decade, leaving us pretty-much where we were in 2003. No, we don't understand how the current system works, there is no point in trying to fix something when we don't even know what is broken.
>
> I'd like to present a clue about what is broken in NLM. Before I kill you with the log, here's the result: With NLM the latency of a request is a function of the raw bandwidth (not so with OLM), and NLM used only half my bandwidth after it had been deployed for 2 days (at the start much more). τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ! (with τ: transfer time, q: queue time (time to find the node), NLM: new load management, OLM: old load management.)
>
> So first step: make sure all bandwidth gets used - maybe by allocating more slots till we use all allowed bandwidth. Better having to throttle a transfer than not using bandwidth.
>
> *NLM should with the current network be slower than OLM by 23%. But in 18 months it should actually be faster by ~8% — given Moore's Law holds for upload bandwidth — because the routes are shorter.* The main advantage of NLM is that it should be much more resilient against attackers (DoS).
>
> Now to the log - it's math and not cleaned up; you have been warned :)
>
> <ArneBab> SSK-time: σ, CHK-time: ψ, success: Xs, fail: Xf.
> <ArneBab> queue-time: q, transfer-time: τ, hops remaining: h, total hops: h₀, w: success rate
> <ArneBab> ψs = τ(h) + q(h)
> <ArneBab> ψf = q(h)
> <ArneBab> ψ ~ w₁·ψs + (1-w₁)·ψf
> <ArneBab> σs = τ(h) + q(h)
> <ArneBab> σf = q(h)
> <ArneBab> σ ~ w₂·ψs + (1-w₂)·ψf; w₂ ~ 15%
> <ArneBab> num(ψ) / num(σ) ~ 1
> <ArneBab> → time ~ σ + ψ
> <ArneBab> q(h) depends on timeouts, as do w₁ and w₂
> <ArneBab> time = w₁·ψs + (1-w₁)·ψf + w₂·ψs + (1-w₂)·ψf
> <ArneBab> = w₁ · (τ(h) + q(h)) + (1-w₁)·q(h) + w₂ · (τ(h) + q(h)) + (1-w₂)·q(h)
> <ArneBab> = t(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
> <ArneBab> = τ(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
> <ArneBab> in the congestion case q(h) ~ timeout
> <ArneBab> timeout = o
> <ArneBab> timeout: o
> <ArneBab> w depends on the timeout *somehow*, but inversely
> <ArneBab> o=0 → w=0
> <ArneBab> assumption: o = ∞ → w₂ ~ 20%, w₁ ~ 100%
> <ArneBab> assumption: o = ∞ → w₂ ~ 0.2, w₁ ~ 1
> <ArneBab> correction: in the congestion case: q(h) ~ min(timeout, τ(h))
> <ArneBab> timeout matters for q(h) only when timeout < τ(h)
> <ArneBab> I try to: I still need a dependency of w on timeout
> <ArneBab> … lets call it t(w)
> <ArneBab> better: w(o) :)
> <toad_> well, if there is a timeout, we have a fixed time, but we reduce the hops ...
> <toad_> i thought w was success rate
> <ArneBab> ah!
> <ArneBab> and the success rates were in the NLM stats
> <ArneBab> going mostly smoothly from 60% to 0%
> <ArneBab> for the HTL
> <toad_> right, success rate peaks at 18 or sometimes 16
> <toad_> what are w1 vs w2?
> <toad_> chk vs ssk i guess
> <ArneBab> yes
> -*- toad_ thinks considering both is probably overambitious at this stage?
> <ArneBab> should not be too bad: SSKs drop much more rapidly at decreasing hops
> <ArneBab> hops→HTL
> <toad_> ψs is time for a successful chk; ψf is time for a failed chk ... in which case h in the first instance is low, and in the second instance is h0
> <ArneBab> yes
> <toad_> okay, i don't follow this line: time = w₁·ψs + (1-w₁)·ψf + w₂·ψs + (1-w₂)·ψf
> <toad_> i thought w2 related to SSKs?
> <ArneBab> uh, yes…
> <ArneBab> time = w₁·ψs + (1-w₁)·ψf + w₂·σs + (1-w₂)·σf
> <toad_> you have to appreciate i'm only just getting back into maths and physics after 12 years ...
Re: [freenet-dev] Beyond New Load Management: A proposal
On Monday 29 Aug 2011 20:32:13 Ian Clarke wrote:
> On Mon, Aug 29, 2011 at 2:09 PM, Matthew Toseland t...@amphibian.dyndns.org wrote:
> > On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:
> > > Ok, thinking about it, here is a proposal, or rather, the beginning of a proposal. I'm assuming that we get rid of NLM, fair sharing, and anything else intended to control load, and replace it with this. We will absolutely need to simulate this before we write a single line of code to deploy it.
> >
> > I have justified fair sharing in my other reply. However, alternate mechanisms might replace it. I agree we should simulate, although this might be significant work.
>
> I doubt it would be more work than trying it without simulation and failing for a decade, which was our previous approach.
>
> > > The core idea is that a node will include a floating point number in response to any kind of request showing how close that node is to being overloaded. 0.0 would mean it's doing nothing, 1.0 would mean that it is completely saturated and must reject requests. Clearly the goal is to avoid getting anywhere near to 1.0. A node tracks several things:
> >
> > All of these apply to both locally originated requests and remotely originated requests? And are they for only our direct peers?
>
> Every peer along the reply path will take note of the load value, so it will be a mixture of the loads of close-by peers and more distant peers. Obviously, since we are more likely to receive messages from peers that are closer, our average will automatically be biased towards closer peers.

So it is from the data source? Or every node along the way?

> > > - What is the average load reported by responses this node forwarded, per remote node
> >
> > Ahhh, this one could be interesting - you could use it to penalise nodes which spam excessively.
>
> Exactly.
>
> > > I think, given these metrics, we should be able to do the following:
> > > - Limit our overall rate of initiating local requests based on the global average reported load
> >
> > We can already do this on the basis of AIMDs, based purely on the behaviour of local requests. However, taking more data into account might result in more accurate results - although it might also dilute the feedback.
>
> Right, AIMD tends to control the rate at which a specific event occurs, like a dropped IP packet with TCP, or a dropped request as we currently use it. This is a bit like signaling that you have a headache by banging your head against the wall. It is also a rather coarse way to deliver feedback.

Fair point.

> A floating point load number conveys more information in a non-destructive way.
>
> > In any case, this part is safe: it's not going to cause feedback problems on its own. Conversely, it doesn't deal with problems such as network bottlenecks.
>
> Well, it depends. If we are tracking load on a per-outbound-connection basis then we can potentially take action to manage load per-connection in addition to overall.

Not sure I follow here. Routing by load is bad, unless it is severely limited.

> > However, while AIMDs are based on the *global* average load (or rather the load of all the nodes along the chain), it sounds like you are proposing to only report the *local* load of our direct peers? How are you going to compute the global average load?
>
> No, it's not the local load of our direct peers. It's the load reported in all reply messages that we forward; some of those will be from our direct peers, but most will be from remote peers.
>
> > > - Limit our rate of local requests based on the average load of the connection to the peer it would need to be forwarded to
> >
> > Not sure I follow here. Are you proposing that we initiate more requests in keyspace areas where there is lower load?
>
> Well, kinda, except the opposite. We might reduce requests in keyspaces where there is higher load.
>
> > > - Detect when remote peers are abusing the system by disregarding load - as evidenced by a significantly higher average load in replies forwarded to them
> >
> > Okay, that is useful - although I doubt it would be an unambiguous detection, and what would we do about it? We would have to reject requests, and if we're not sure, or if it could be a downstream troublemaker, couldn't that result in misrouting and make things worse? On the other hand, any malicious node would be most easily detected by the nodes directly connected to it, so this could work much the same way as fair sharing - provided that it can reject or terminate the requests.
>
> No, we probably wouldn't start rejecting requests - more likely we would disconnect from that node. Obviously we'd need to be fairly sure of abusive behavior before doing that - but abuse should be rare.

Well then we'd need to be very sure, and this means you could get away with a lot before triggering it. Weird topologies for instance could cause
Re: [freenet-dev] Fundraising
These are good ideas, Steve. I must confess that I've been a little distracted lately - and we do definitely need to get our act together on funding. Matthew, what is our current drop-dead date as best as you can estimate it?

Ian.

On Mon, Aug 29, 2011 at 4:18 PM, Steve Dougherty st...@asksteved.com wrote:
> The project's current state is reflected in a dollar amount of remaining project funds, along with an estimate of how much additional development time it corresponds to. This amount will decrease suddenly and significantly when a payment is made. It's not - at least supposed to be - a secret that the Freenet project is set to run out of funds mid-September. I've mentioned to people - on Sone, in #freenet - that the project is running out of funds and they had no idea. This is a problem.
>
> I propose (in decreasing importance) that:
>
> 1) The donation page mention _when_ the last ~2-month payment was made.
>    - Or ideally, that the time remaining estimate include remaining funded development time. This need not be exact: it would allow the estimate to decrease gradually instead of in sudden chunks.
> 2) Fundraising be presented as drives for a month's worth of funding.
>    - This would allow donating to have goals, and be framed as increasing toward something tangible (a month of development) rather than building up funds against sudden, quiet decrease.
>    - The estimate ceases to be meaningful if it sits at 8 days for two months.
> 3) Release announcements include statements on the project's current financial state and how to donate.
>    - Fred could have integrated release announcements for greater visibility. These could of course be disabled by the user.
> 4) Release announcements also be displayed on the freenetproject.org homepage in something akin to the Latest Releases/News sidebar on winehq.org.
>    - Having only news posts that are several months old makes the project appear inactive, which it is not.
>
> It should go without saying that a 0.8 release would probably make good press, assuming it can be made stable before funding runs out.

--
Ian Clarke
Personal blog: http://blog.locut.us/
Re: [freenet-dev] Beyond New Load Management: A proposal
After discussing and going deeper into this, it became apparent that the problem is not overload of queues. I'll repeat the result I came to here (and not in chatlog form :) ):

The problem is in the load limiter:

1) we need to use all bandwidth, because latency depends on the number of finishing transfers.
2) SSKs fail 85% of the time.
3) the load limiter assumes that all SSKs succeed and reserves bandwidth for them.
4) this now bites us, because NLM has shorter transfer times and longer wait times. So the over-reservation of bandwidth for SSKs lasts longer.
5) solution: count each SSK as only average_SSK_success_rate * data_to_transfer_on_success.

To (1): When we have less bandwidth, then fewer CHKs succeed per second, so fewer slots get freed and the queueing time is bigger.

Best wishes,
Arne

On Monday, 29 August 2011, 22:27:47, Matthew Toseland wrote:
> Okay, I don't understand most of that; I might be able to check the math if it was written properly, but it looks difficult. However, as far as I can see:
> - The most obvious way to increase bandwidth usage would be to increase the timeout time for output bandwidth liability (and at the same time increase the relevant block transfer timeouts).
> - This would increase the number of slots but it would also increase the number of requests seeking them; I don't see why it would help matters.
> - Running an excessive number of requests without adjusting the block transfer timeouts would result in some of the transfers timing out.
> - It would also, as you mention, make SSK requests (especially failed SSK requests) even slower.
> - I am quite confident that Moore's Law DOES NOT hold for upload bandwidth.
> - As far as I can see, all the benefits of NLM re attackers are achieved by fair sharing.
> - A system which is more resistant to attacks but slower probably isn't all that interesting if the attacks in question are relatively expensive anyway.
> - Smaller block sizes would have a significant efficiency cost, and would probably make load management more difficult.
>
> I apologise if you see this as rather a broadside after I encouraged you to analyse the problem, but it is not yet a convincing demonstration that queueing is a viable strategy and not the spawn of satan! :)
>
> On Monday 29 Aug 2011 21:27:01 Arne Babenhauserheide wrote:
> > On Monday, 29 August 2011, 14:32:13, Ian Clarke wrote:
> > > Yes, small tweaks have worked so well for us for the last decade, leaving us pretty-much where we were in 2003. No, we don't understand how the current system works, there is no point in trying to fix something when we don't even know what is broken.
> >
> > I'd like to present a clue about what is broken in NLM. Before I kill you with the log, here's the result: With NLM the latency of a request is a function of the raw bandwidth (not so with OLM), and NLM used only half my bandwidth after it had been deployed for 2 days (at the start much more). τ ~ bandwidth. q_olm ~ 16s, q_nlm ~ τ! (with τ: transfer time, q: queue time (time to find the node), NLM: new load management, OLM: old load management.)
> >
> > So first step: make sure all bandwidth gets used - maybe by allocating more slots till we use all allowed bandwidth. Better having to throttle a transfer than not using bandwidth.
> >
> > *NLM should with the current network be slower than OLM by 23%. But in 18 months it should actually be faster by ~8% — given Moore's Law holds for upload bandwidth — because the routes are shorter.* The main advantage of NLM is, that it should be much more resilient against attackers (DoS).
> >
> > Now to the log - it's math and not cleaned up; you have been warned :)
> >
> > <ArneBab> SSK-time: σ, CHK-time: ψ, success: Xs, fail: Xf.
> > <ArneBab> queue-time: q, transfer-time: τ, hops remaining: h, total hops: h₀, w: success rate
> > <ArneBab> ψs = τ(h) + q(h)
> > <ArneBab> ψf = q(h)
> > <ArneBab> ψ ~ w₁·ψs + (1-w₁)·ψf
> > <ArneBab> σs = τ(h) + q(h)
> > <ArneBab> σf = q(h)
> > <ArneBab> σ ~ w₂·ψs + (1-w₂)·ψf; w₂ ~ 15%
> > <ArneBab> num(ψ) / num(σ) ~ 1
> > <ArneBab> → time ~ σ + ψ
> > <ArneBab> q(h) depends on timeouts, as do w₁ and w₂
> > <ArneBab> time = w₁·ψs + (1-w₁)·ψf + w₂·ψs + (1-w₂)·ψf
> > <ArneBab> = w₁ · (τ(h) + q(h)) + (1-w₁)·q(h) + w₂ · (τ(h) + q(h)) + (1-w₂)·q(h)
> > <ArneBab> = t(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
> > <ArneBab> = τ(h) · (w₁+w₂) + 2·q(h) · (2-w₁-w₂)
> > <ArneBab> in the congestion case q(h) ~ timeout
> > <ArneBab> timeout = o
> > <ArneBab> timeout: o
> > <ArneBab> w depends on the timeout *somehow*, but inversely
> > <ArneBab> o=0 → w=0
> > <ArneBab> assumption: o = ∞ → w₂ ~ 20%, w₁ ~ 100%
> > <ArneBab> assumption: o = ∞ → w₂ ~ 0.2, w₁ ~ 1
> > <ArneBab> correction: in the congestion case: q(h) ~ min(timeout, τ(h))
> > <ArneBab> timeout matters for q(h) only when timeout < τ(h)
> > <ArneBab> I try to: I still need a dependency of w on timeout
> > <ArneBab> … lets call it t(w)
> > <ArneBab> better: w(o) :)
> > <toad_> well, if there is a timeout, we have a fixed time, but we reduce
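Arne's proposed fix (point 5 above) is a one-line change to how the load limiter reserves bandwidth: weight each in-flight SSK by its observed success rate instead of assuming every SSK returns data. A hypothetical sketch, not Freenet's actual liability code; the block sizes and the 15% default are illustrative, taken from the failure rate quoted in the thread:

```python
SSK_DATA_BYTES = 1024       # SSK payloads are small
CHK_DATA_BYTES = 32 * 1024  # CHK blocks are much larger

def reserved_bandwidth(num_chks, num_ssks, ssk_success_rate=0.15):
    """Output bandwidth liability for the requests currently in flight.

    Old behaviour: every SSK reserves its full payload (equivalent to
    ssk_success_rate=1.0). Proposed: reserve only the *expected*
    payload, since ~85% of SSKs fail and transfer nothing.
    """
    return (num_chks * CHK_DATA_BYTES
            + num_ssks * ssk_success_rate * SSK_DATA_BYTES)
```

With 100 SSKs in flight, the old rule reserves 100 full payloads while the expected-value rule reserves 15, which is the over-reservation the summary above blames for the unused bandwidth.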
Re: [freenet-dev] Beyond New Load Management: A proposal
On Tuesday 30 Aug 2011 00:08:16 Arne Babenhauserheide wrote:

After discussing and going deeper into this, it became apparent that the problem is not overload of queues. I’ll repeat the result I came to here (and not in chatlog form :) ): The problem is in the load limiter:

1) we need to use all bandwidth, because latency depends on the number of finishing transfers.
2) SSKs fail 85% of the time.
3) the load limiter assumes that all SSKs succeed and reserves bandwidth for them.
4) this now bites us, because NLM has shorter transfer times and longer wait times, so the over-reservation of bandwidth for SSKs lasts longer.
5) solution: count each SSK as only average_SSK_success_rate * data_to_transfer_on_success.

To (1): when we have less bandwidth, then fewer CHKs succeed per second, so fewer slots get freed and the queueing time is bigger.

Best wishes, Arne

Another way to look at this: both Ian and I have been assuming that the resource is some number representing a property of the node, for example how many requests are running. But the real resource is the external bandwidth usage, which is rather different.

An alternative solution - if we are worried about bursts of requests succeeding and therefore timing out - is to allocate a fixed chunk of bandwidth for SSKs, and not let them interfere with CHKs. Then CHKs can have long transfer times, and therefore long queueing times when they fail (but this is only ~40% of the time with NLM's accurate routing), and SSKs can have short transfer times, and therefore short queueing times. This may be less disruptive and should also be relatively easy to implement.

But the other question is, can queueing ever be helpful? It can if it allows us to route more accurately (which NLM clearly does), and/or to run enough requests in parallel that the longer time taken for the request to reach its destination is offset. Is this condition met?
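The accounting change proposed in point 5 can be sketched in a few lines. This is an illustration only, not Freenet's actual Java code: the function and constant names below are hypothetical, and the byte counts are placeholder figures.

```python
# Sketch of the proposed change to the output-bandwidth liability
# limiter (point 5 above). All identifiers here are hypothetical;
# the real limiter lives in fred and looks nothing like this.

SSK_DATA_ON_SUCCESS = 1024       # bytes an SSK transfers on success (illustrative)
CHK_DATA_ON_SUCCESS = 32 * 1024  # bytes a CHK block transfer needs (illustrative)

def reserved_bytes(request_type, ssk_success_rate):
    """Bandwidth to reserve when accepting one request.

    Old behaviour: reserve the full transfer size for every request,
    i.e. assume every SSK succeeds.
    Proposed: weight the SSK reservation by the observed success
    rate, since ~85% of SSKs fail and transfer almost nothing.
    """
    if request_type == "CHK":
        return CHK_DATA_ON_SUCCESS
    # proposed: average_SSK_success_rate * data_to_transfer_on_success
    return ssk_success_rate * SSK_DATA_ON_SUCCESS

# With a 15% SSK success rate, an SSK now reserves only 15% of the
# bandwidth it used to, so far more SSKs can be accepted per slot.
old_reservation = reserved_bytes("SSK", 1.0)   # old assumption: always succeeds
new_reservation = reserved_bytes("SSK", 0.15)  # observed success rate
```

Under these assumptions the per-SSK reservation drops from 1024 to about 154 bytes, which is the mechanism by which the over-reservation described in point 4 would be removed.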
Re: [freenet-dev] Beyond New Load Management: A proposal
On Tuesday 30 Aug 2011 00:18:35 Matthew Toseland wrote:

But the other question is, can queueing ever be helpful? It can if it allows us to route more accurately (which NLM clearly does), and/or to run enough requests in parallel that the longer time taken for the request to reach its destination is offset. Is this condition met?

Yes it is!
If a request takes P seconds without queueing, and P+Q seconds with queueing, then if we have the same number of requests running at once, we have a slowdown of (P+Q)/P. However, if we start (P+Q)/P times more requests, and the queueing time remains constant, we can compensate for this. Does it? There are two main components:

1) The queueing time. This is roughly proportional to (number of requests running) / (number of slots available). If we increase the capacity of the node, we multiply both the top and the bottom by the same factor, so the queueing time is unchanged, unless the data transfer time changes.

2) The data transfer time. In the above case with a slowdown, the data transfer time is the same, or a little shorter, because most of our bandwidth isn't being used. If we increase it by running more requests at once, it should get back to where it was, and if we increase it further, transfers will start to take longer again.

ArneBab has analysed this in more detail and shown that the largest component of the queueing time is proportional to the data transfer time; hence separating SSKs (which have a very short transfer time) from CHKs (which have a much longer transfer time) would help significantly.

The key advantage of NLM, of course, is that it routes accurately, resulting in 10%-30% improvements in success rates, and transferring data over significantly fewer hops (= more effective bandwidth).
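The compensation argument above can be checked with a toy throughput model. The numbers are illustrative, not measurements; the point is only that scaling the number of in-flight requests by (P+Q)/P exactly cancels the slowdown, provided the queueing time really does stay constant.

```python
# Toy model of the (P+Q)/P compensation argument. If queueing adds
# Q seconds to a P-second request, running (P+Q)/P times as many
# requests in parallel restores the original throughput.

def throughput(requests_in_flight, seconds_per_request):
    """Completed requests per second, assuming steady state
    (Little's law rearranged: L = lambda * W)."""
    return requests_in_flight / seconds_per_request

P = 10.0   # transfer time without queueing, seconds (illustrative)
Q = 5.0    # added queueing time, seconds (illustrative)
n = 100    # requests in flight without queueing

base = throughput(n, P)                    # original throughput
slowed = throughput(n, P + Q)              # same n, longer per-request time
scaled = throughput(n * (P + Q) / P, P + Q)  # (P+Q)/P times more requests

# slowed < base, but scaled == base: the extra parallelism exactly
# compensates, so long as Q did not grow when n did.
assert scaled == base
```

The caveat in the assertion's comment is exactly the document's point 1): queueing time goes as (requests running)/(slots available), so it stays constant under this scaling only if slots scale too.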
Re: [freenet-dev] Beyond New Load Management: A proposal
From: Matthew Toseland t...@amphibian.dyndns.org

But the other question is, can queueing ever be helpful? It can if it allows us to route more accurately (which NLM clearly does), and/or to run enough requests in parallel that the longer time taken for the request to reach its destination is offset. Is this condition met?

Experience with the deployed NLM showed that even in the fully congested case it had success rates of 60% for HTL 18, 17 and 16, compared to less than 40% for OLM. This means that the requests are sent over fewer hops on average, because they find the content fewer hops away from the requester. A download of 1 MiB which is sent over 2 hops needs 2 MiB in total network bandwidth. If it is sent over only 1.5 hops on average, then it needs only 1.5 MiB total network bandwidth. So essentially NLM can distribute 30% more content with the same network resources¹. And these numbers are actual observations.

The only reason why this did not result in increased performance is that the nodes used less than 50% of their allocated bandwidth² - which is a problem with the bandwidth scheduler and not with queueing.

Best wishes, Arne

¹: The relevant network resource is upload bandwidth.
²: Source: observations from me and two other Freenet users.

PS: How exactly the bandwidth limiter is fixed is an implementation detail. I think you are actually the only person who can judge how to do this most efficiently.

___
Devl mailing list
Devl@freenetproject.org
http://freenetproject.org/cgi-bin/mailman/listinfo/devl
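The bandwidth arithmetic in the mail above is simple enough to write down directly. The hop counts (2 vs 1.5) are the mail's illustrative figures, and the helper name is made up for this sketch:

```python
# Total network upload consumed by one download is
# (file size) x (average hops the data is transferred over).
# Hop counts below are the illustrative figures from the mail.

def total_network_bytes(file_mib, avg_hops):
    """Total upload across the whole network for one download, in MiB."""
    return file_mib * avg_hops

olm_cost = total_network_bytes(1.0, 2.0)  # 1 MiB over 2 hops   -> 2.0 MiB
nlm_cost = total_network_bytes(1.0, 1.5)  # 1 MiB over 1.5 hops -> 1.5 MiB

# Same network capacity serves olm_cost / nlm_cost = 4/3 as many
# downloads, i.e. roughly the "30% more content" claimed above.
gain = olm_cost / nlm_cost - 1.0
```

With these figures the gain is exactly 1/3, which the mail rounds down to "30% more content with the same network resources".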