On Monday 29 Aug 2011 20:32:13 Ian Clarke wrote:
> On Mon, Aug 29, 2011 at 2:09 PM, Matthew Toseland <t...@amphibian.dyndns.org> wrote:
> > On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:
> > > Ok, thinking about it, here is a proposal, or rather, the beginning of a
> > > proposal. I'm assuming that we get rid of NLM, fair sharing, and anything
> > > else intended to control load, and replace it with this. We will absolutely
> > > need to simulate this before we write a single line of code to deploy it.
> >
> > I have justified fair sharing in my other reply. However, alternate
> > mechanisms might replace it.
> >
> > I agree we should simulate, although this might be significant work.
>
> I doubt it would be more work than trying it without simulation and failing
> for a decade, which was our previous approach.
>
> > > The core idea is that a node will include a floating point number in its
> > > response to any kind of request, showing how close that node is to being
> > > overloaded. 0.0 would mean it is doing nothing, 1.0 would mean that it is
> > > completely saturated and must reject requests. Clearly the goal is to avoid
> > > getting anywhere near to 1.0.
> > >
> > > A node tracks several things:
> >
> > Do all of these apply to both locally originated requests and remotely
> > originated requests? And are they for only our direct peers?
>
> Every peer along the reply path will take note of the load value, so it will
> be a mixture of the loads of close-by peers and more distant peers.
> Obviously, since we are more likely to receive messages from peers that are
> "closer", our average will automatically be biased towards closer peers.
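To make the bookkeeping concrete, this is my reading of the tracking being proposed: a decaying average of the load values carried in replies, kept both globally and per peer. It's only a sketch - the class name, the peer key, and the smoothing constant ALPHA (itself one of the tunables argued about below) are purely illustrative, not existing Freenet code.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: exponentially decaying averages of the load values (0.0 - 1.0)
// carried in replies, kept overall and per peer. Names are illustrative.
public class ReportedLoadTracker {

    // Weight of each new sample; smaller means a slower-moving average.
    // This is exactly the kind of tunable discussed later in the thread.
    private static final double ALPHA = 0.05;

    // Decaying average over all replies we have seen or forwarded.
    private double globalAverage = 0.0;

    // Decaying average of the load in replies forwarded to each peer.
    private final Map<String, Double> perPeerAverage = new ConcurrentHashMap<>();

    // Called once per reply carrying a load value; peerId is the peer the
    // reply is being forwarded to.
    public synchronized void onReplyLoad(String peerId, double reportedLoad) {
        double load = Math.max(0.0, Math.min(1.0, reportedLoad)); // clamp to [0, 1]
        globalAverage = (1 - ALPHA) * globalAverage + ALPHA * load;
        perPeerAverage.merge(peerId, load,
                (old, sample) -> (1 - ALPHA) * old + ALPHA * sample);
    }

    public synchronized double getGlobalAverage() {
        return globalAverage;
    }

    public double getAverageForPeer(String peerId) {
        return perPeerAverage.getOrDefault(peerId, 0.0);
    }
}

Whether the first sample should seed the average directly (as here), what time period ALPHA should correspond to, and whether samples should be weighted by HTL are exactly the details a simulation would need to pin down.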
So is it from the data source, or from every node along the way?

> > >    - What is the average load reported by responses this node forwarded,
> > >    per remote node
> >
> > Ahhh, this one could be interesting - you could use it to penalise nodes
> > which spam excessively.
>
> Exactly.
>
> > > I think, given these metrics, we should be able to do the following:
> > >
> > >    - Limit our overall rate of initiating local requests based on the
> > >    global average reported load
> >
> > We can already do this on the basis of AIMDs, based purely on the behaviour
> > of local requests. However, taking more data into account might result in
> > more accurate results - although it might also dilute the feedback.
>
> Right, AIMD tends to control the rate at which a specific event occurs, like
> a dropped IP packet with TCP, or a dropped request as we currently use it.
> This is a bit like signaling that you have a headache by banging your head
> against the wall. It is also a rather coarse way to deliver feedback.

Fair point.

> A floating point load number conveys more information in a non-destructive
> way.
>
> > In any case, this part is "safe": it's not going to cause feedback problems
> > on its own. Conversely, it doesn't deal with problems such as network
> > bottlenecks.
>
> Well, it depends. If we are tracking load on a per-outbound-connection
> basis then we can potentially take action to manage load per-connection in
> addition to overall.

Not sure I follow here. Routing by load is bad, unless it is severely limited.

> > However, while AIMDs are based on the *global* average load (or rather the
> > load of all the nodes along the chain), it sounds like you are proposing to
> > only report the *local* load of our direct peers? How are you going to
> > compute the global average load?
>
> No, it's not the local load of our direct peers. It's the load reported in
> all reply messages that we forward; some of those will be from our direct
> peers, but most will be from remote peers.
>
> > >    - Limit our rate of local requests based on the average load of the
> > >    connection to the peer it would need to be forwarded to
> >
> > Not sure I follow here. Are you proposing that we initiate more requests in
> > keyspace areas where there is lower load?
>
> Well, kinda, except the opposite. We might reduce requests in keyspaces
> where there is higher load.
>
> > >    - Detect when remote peers are abusing the system by disregarding load -
> > >    as evidenced by a significantly higher average load in replies forwarded
> > >    to them
> >
> > Okay, that is useful - although I doubt it would be an unambiguous
> > detection, and what would we do about it? We would have to reject requests,
> > and if we're not sure, or if it could be a downstream troublemaker, couldn't
> > that result in misrouting and make things worse? On the other hand, any
> > malicious node would be most easily detected by the nodes directly connected
> > to it, so this could work much the same way as fair sharing - provided that
> > it can reject or terminate the requests.
>
> No, we probably wouldn't start rejecting requests - more likely we would
> disconnect from that node. Obviously we'd need to be fairly sure of abusive
> behavior before doing that - but abuse should be rare.

Well then we'd need to be very sure, and that means you could get away with a
lot before triggering it. Weird topologies, for instance, could cause this
sort of problem on a temporary basis.
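For what it's worth, the detection itself could be as simple as the sketch below, reusing the ReportedLoadTracker sketch above: only flag a directly connected peer when its forwarded-reply load average sits well above the global average and we have plenty of samples, so false positives stay rare. The margin and sample count are made-up numbers, not agreed values.

// Sketch only: flags a peer that appears to disregard load feedback, as
// evidenced by a much higher average load in the replies we forward to it.
// Thresholds are illustrative; the bias is deliberately towards false negatives.
public class LoadAbuseDetector {

    private static final double MARGIN = 0.25;      // how far above the global average is "suspicious"
    private static final long MIN_SAMPLES = 10_000; // never judge a peer on a handful of requests

    private final ReportedLoadTracker tracker;

    public LoadAbuseDetector(ReportedLoadTracker tracker) {
        this.tracker = tracker;
    }

    // peerId: a directly connected peer; samplesForPeer: replies forwarded to it so far.
    public boolean looksAbusive(String peerId, long samplesForPeer) {
        if (samplesForPeer < MIN_SAMPLES) {
            return false; // not enough evidence yet
        }
        double peerAverage = tracker.getAverageForPeer(peerId);
        double globalAverage = tracker.getGlobalAverage();
        return peerAverage > globalAverage + MARGIN;
    }
}

Note that even with conservative numbers this only ever fires against direct peers - the same locality property fair sharing relies on - and it says nothing about whether the remedy should be rejection or disconnection.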
> > Arguably fair sharing achieves the same objective. It could be extended in
> > various ways, e.g. if we are relaying a lot of slow-down notices to a peer
> > we could start to think about squeezing the peer's allocation.
>
> Yes, except it doesn't work, and it's too complicated to reason about.

What makes you think it doesn't work? And what makes you think it's too
complicated?

> > > Of course, lots of questions:
> > >
> > >    - How do we compute the averages? A decaying running average of some
> > >    kind? What time period?
> >
> > Yay, more tunables. :(
>
> Ideally we'd do this in a way that avoids tunables. This is control theory;
> we shouldn't have to reinvent the wheel.

Control theory doesn't apply, because we are not dealing with feedback going
to a single system: it's thousands of nodes, which all have separate counters.

> > >    - How do we translate load into a desired rate of requests?
> >
> > And even more tunables. :(
>
> Don't be pessimistic, hopefully we can avoid tunables if we are smart about
> it.
>
> > >    - What criteria indicate that a peer is abusing the system? What is
> > >    the remedy?
> >
> > In the current system, don't accept their requests.
>
> Yes, although I'm leaning towards disconnecting from them, which is
> more-or-less the same thing, just more permanent, and they don't consume our
> resources any more.

Only if we can be really sure. And I'm not sure how realistic that is.

> > In ANY system, there will always be a borderline below which they can get
> > away with it. We can either try to make this as small as possible or try
> > not to hit legitimate traffic too hard.
>
> Abuse should be rare; we should bias against false positives when we look
> for it.

Which means that an abuser can stay below the threshold for a long time on a
lot of nodes ... IMHO telling the difference could be difficult.

> > > This is basically control theory stuff, and we'll need robust answers to
> > > these questions before we proceed (ideally with a theoretical rather than
> > > experimental foundation).
> >
> > IMHO small tweaks to AIMDs, combined with fair sharing, will achieve much
> > the same without having to wait years for our theoreticians to wake up.
>
> Yes, small tweaks have worked so well for us for the last decade, leaving us
> pretty much where we were in 2003. No, we don't understand how the current
> system works; there is no point in trying to fix something when we don't
> even know what is broken.

I think we have a much better understanding of it than we have had until
recently: all forms of load management tried so far transmute incoming load
into something bad - misrouting, queueing delays, reduced effective HTL. We
need to reduce the incoming load WITHOUT those side effects. The simplest
implementation of this concept is to send non-local RejectedOverloads when we
are *close* to the limits, not only local ones when we are *at* the limits
(which are then relayed as non-local); in other words, to send a slow-down
message *before* it's too late.

> I am touched by your faith in "theoreticians", but I have a feeling we're in
> uncharted territory here. We need to agree on an approach that is simple
> enough to reason about, and then build a simulation for it. I think the
> simulation should just be a few days of work; we're not talking years here,
> hopefully not even months.

Weeks, probably - and simulations themselves tend to have tunables. Anyway,
if I start writing simulations, people get mad because I'm not fixing bugs.
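To pin down what I mean by sending the slow-down *before* it's too late: something along these lines, where a soft threshold below the hard acceptance limit triggers a non-local slow-down signal on traffic we still accept. The thresholds and the enum are illustrative only; the real RejectedOverload plumbing is obviously more involved.

// Sketch only: decide what to attach to (or do with) incoming traffic based
// on this node's current load estimate. SOFT_LIMIT and HARD_LIMIT are
// illustrative values, not proposed constants.
public class PreemptiveBackoff {

    public enum LoadSignal {
        NONE,       // accept quietly
        SLOW_DOWN,  // accept, but attach a non-local slow-down notice now
        REJECT      // local RejectedOverload, relayed downstream as non-local
    }

    private static final double SOFT_LIMIT = 0.8;
    private static final double HARD_LIMIT = 1.0;

    // currentLoad is this node's own load estimate in [0.0, 1.0].
    public LoadSignal classify(double currentLoad) {
        if (currentLoad >= HARD_LIMIT) {
            return LoadSignal.REJECT;
        }
        if (currentLoad >= SOFT_LIMIT) {
            return LoadSignal.SLOW_DOWN;
        }
        return LoadSignal.NONE;
    }
}

The point is that upstream nodes get the feedback while we still have headroom, instead of only learning about it from rejections once we are already saturated.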