Re: [freenet-dev] Beyond New Load Management: A proposal

Matthew Toseland Mon, 29 Aug 2011 14:37:58 -0700

On Monday 29 Aug 2011 20:32:13 Ian Clarke wrote:
> On Mon, Aug 29, 2011 at 2:09 PM, Matthew Toseland <t...@amphibian.dyndns.org
> > wrote:
> 
> > On Monday 29 Aug 2011 19:28:50 Ian Clarke wrote:
> > > Ok, thinking about it, here is a proposal, or  rather, the beginning of a
> > > proposal.  I'm assuming that we get rid of NLM, fair sharing, and
> > anything
> > > else intended to control load, and replace it with this.  We will
> > absolutely
> > > need to simulate this before we write a single line of code to deploy it.
> >
> > I have justified fair sharing in my other reply. However alternate
> > mechanisms might replace it.
> >
> > I agree we should simulate, although this might be significant work.
> >
> 
> I doubt it would be more work than trying it without simulation and failing
> for a decade, which was our previous approach.
> 
> 
> > > The core idea is that a node will include a floating point number in
> > > response to any kind of request showing how close that node is to being
> > > overloaded.  0.0 would mean its doing nothing, 1.0 would mean that it is
> > > completely saturated and must reject requests.  Clearly the goal is to
> > avoid
> > > getting anywhere near to 1.0.
> > >
> > > A node tracks several things:
> >
> > All of these apply to both locally originated requests and remotely
> > originated requests? And are they for only our direct peers?
> >
> 
> Every peer along the reply path will take note of the load value, so it will
> be a mixture of the loads of close-by peers, and more distant peers.
>  Obviously, since we are more likely to receive messages from peers that are
> "closer", our average will automatically be biased towards closer peers.


So it is from the data source? Or every node along the way?
> 
> >    - What is the average load reported by responses this node forwarded,
> > per
> > >    remote node
> >
> > Ahhh, this one could be interesting - you could use it to penalise nodes
> > which spam excessively.
> >
> Exactly.
> 
> > > I think, given these metrics, we should be able to do the following:
> > >
> > >    - Limit our overall rate of initiating local requests based on the
> > global
> > >    average reported load
> >
> > We can already do this on the basis of AIMD's, based purely on behaviour of
> > local requests. However, taking more data into account might result in more
> > accurate results - although it might also dilute the feedback.
> 
> Right, AIMD tends to control the rate at which a specific event occurs, like
> a dropped IP packet with TCP, or a dropped request as we currently use it.
>  This is a bit like signaling that you have a headache by banging your head
> against the wall.  It is also a rather coarse way to deliver feedback.

Fair point.
> 
> A floating point load number conveys more information in a non-destructive
> way.
> 
> In any case, this part is "safe": it's not going to cause feedback problems,
> > on its own. Conversely, it doesn't deal with problems such as network
> > bottlenecks.
> 
> Well, it depends.  If we are tracking load on a per-outbound-connection
> basis then we can potentially take action to manage load per-connection in
> addition to overall.

Not sure I follow here. Routing by load is bad, unless it is severely limited.
> 
> > However, while AIMD's are based on the *global* average load (or rather the
> > load of all the nodes along the chain), it sounds like you are proposing to
> > only report the *local* load of our direct peers? How are you going to
> > compute the global average load?
> 
> No, its not the local load of our direct peers.  Its the load reported in
> all reply messages that we forward, some of those will be from our direct
> peers, but most will be from remote peers.
> 
> > >    - Limit our rate of local requests based on the average load of the
> > >    connection to the peer it would need to be forwarded to
> >
> > Not sure I follow here. Are you proposing that we initiate more requests in
> > keyspace areas where there is lower load?
> 
> Well, kinda, except the opposite.  We might reduce requests in keyspaces
> where there is higher load.
> 
> 
> > >    - Detect when remote peers are abusing the system by disregarding load
> > -
> > >    as evidenced by a significantly higher average load in replies
> > forwarded to
> > >    them
> >
> > Okay, that is useful - although I doubt it would be an unambiguous
> > detection, and what would we do about it? We would have to reject requests,
> > and if we're not sure, or if it could be a downstream troublemaker, couldn't
> > that result in misrouting and make things worse? On the other hand, any
> > malicious node would be most easily detected by the nodes directly connected
> > to it, so this could work much the same way as fair sharing - provided that
> > it can reject or terminate the requests.
> 
> No, we probably wouldn't start rejecting requests - more likely we would
> disconnect from that node.  Obviously we'd need to be fairly sure of abusive
> behavior before doing that - but abuse should be rare.

Well then we'd need to be very sure, and this means you could get away with a 
lot before triggering it. Wierd topologies for instance could cause this sort 
of problem on a temporary basis.
> 
> > Arguably fair sharing achieves the same objective. It could be extended in
> > various ways e.g. if we are relaying a lot of slow-down notices to a peer we
> > could start to think about squeezing the peer's allocation.
> 
> Yes, except it doesn't work, and its too complicated to reason about.

What makes you think it doesn't work? What makes you think it's complicated 
either?
> 
> > > Of course, lots of questions:
> > >
> > >    - How do we compute the averages?  A decaying running average of some
> > >    kind?  What time period?
> >
> > Yay more tunables. :(
> 
> Ideally we'd do this in a way that avoids tunables.  This is control theory,
> we shouldn't have to reinvent the wheel.

Control theory doesn't apply because we are not dealing with feedback going to 
a single system, it's thousands of nodes, which all have separate counters.
> 
> > >    - How do we translate load into a desired rate of requests?
> >
> > And even more tunables. :(
> 
> Don't be pessimistic, hopefully we can avoid tunables if we are smart about
> it.
> 
> > >    - What criteria indicate that a peer is abusing the system?  What is
> > the
> > >    remedy?
> >
> > In the current system, don't accept their requests.
> 
> Yes, although I'm leaning towards disconnecting from them, which is
> more-or-less the same thing, just more permanent and they don't consume our
> resources any more.

Only if we can be really sure. And I'm not sure how realistic that is.
> 
> > In ANY system, there will always be a borderline below which they can get
> > away with it. We can either try to make this as small as possible or try not
> > to hit legitimate traffic too hard.
> 
> Abuse should be rare, we should bias against false positives when we look
> for it.

Which means that an abuser can stay below the threshold for a long time on a 
lot of nodes ...

IMHO telling the difference could be difficult.
> 
> > > This is basically control theory stuff, and we'll need robust answers to
> > > these questions before we proceed (ideally with a theoretical rather than
> > > experimental foundation).
> >
> > IMHO small tweaks to AIMDs, combined with fair sharing, will achieve much
> > the same without having to wait years for our theoreticians to wake up.
> 
> Yes, small tweaks have worked so well for us for the last decade, leaving us
> pretty-much where we were in 2003.  No, we don't understand how the current
> system works, there is no point in trying to fix something when we don't
> even know what is broken.

I think we have a much better understanding of it than we have had until 
recently: All forms of load management tried so far transmute incoming load 
into something bad - misrouting, queueing delays, reduced effective HTL. We 
need to reduce the incoming load, WITHOUT these things. The simplest 
implementation of this concept is to send non-local RejectedOverload's when we 
are *close* to the limits, not only local ones when we are *at* the limits 
(which are then relayed as non-local); in other words, to send a slow down 
message *before* it's too late.
> 
> I am touched by your faith in "theoreticians", but I have a feeling we're in
> uncharted territory here.  We need to agree on an approach that is simple
> enough to reason about, and then build a simulation for it.  I think the
> simulation should just be a few days of work, we're not talking years here,
> hopefully not even months.

Weeks probably - and simulations themselves tend to have tunables. Anyway if I 
start writing simulations people get mad because I'm not fixing bugs.

signature.asc
Description: This is a digitally signed message part.

_______________________________________________
Devl mailing list
Devl@freenetproject.org
http://freenetproject.org/cgi-bin/mailman/listinfo/devl

Re: [freenet-dev] Beyond New Load Management: A proposal

Reply via email to