On Saturday 09 Jul 2011 21:22:04 Evan Daniel wrote:
> I've only looked at the new load management emails somewhat, not in
> detail. However, it seems to me that there hasn't been a deep enough
> theoretical analysis of the issue. I'll try to lay out what I mean by
> that here, and start the process. I don't have time to finish it
> anytime soon, but hopefully this will be useful anyway.

That seems entirely reasonable. In fact it's one of my main worries about NLM!
> 
> I see four major factors in load management, plus a couple others I'll
> mention after this that make things more complicated. Those are
> capacity of neighbors, location distribution of neighbors, incoming
> request distribution, and request misrouting tolerance.

This is reasonable.
> 
> Obviously not all neighbors have the same request-handling capacity.
> At present, Freenet acknowledges this to a limited degree via the
> bandwidth limit / peer limit relationship. It is possible we do not
> wish to allow an infinite range of capacities, because of concern
> about the effects of (non-malicious) supernodes on the network, as
> regards things like security and fault tolerance.

Right. Currently we limit the proportion of our requests that any one peer can 
be handling. This results in misrouting, even with NLM. I wonder if we need to 
specify this limit more explicitly in NLM.
> 
> Neighbor location distribution and incoming request distribution are
> closely connected. If the two are matched, then all neighbors will be
> sent a similar number of requests. If they're mismatched, some
> neighbors will get a larger share of requests than others. Variable
> peer capacity means that a mismatch here may be desirable, in order to
> produce an outbound request pattern that matches available network
> capacity.

Possibly ... on the other hand, opennet should ensure a reasonable match. In 
fact we have people worrying about it being too reasonable: nodes with lots of 
local traffic have lots of distant peers, nodes with little local traffic have 
too few distant peers, and there have been proposals for how to deal with this.

On darknet, swapping ensures peers are distributed reasonably for remote 
requests; incoming requests should generally be highly specialised as long as 
we don't have too much local traffic. HOWEVER, swapping does not care about 
variable capacities, so what you say above does apply in this case...
> 
> Lastly, we have misrouting tolerance. At one extreme, we route every
> request correctly, and the capacity of the local node is limited by
> the slowest neighbor node. (To be more precise: the local capacity C_l
> becomes min({r_0, r_1, ..., r_n}) where r_i is the ratio C_i/S_i, and
> C_i is the capacity of neighbor i, and S_i is the share of incoming
> requests sent to neighbor i. In other words, the local capacity may be
> limited not by the slowest node, but by some faster node that gets
> sent a disproportionately larger share of the request traffic.)
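
To make sure we're talking about the same thing, here is that bound with 
made-up numbers (a minimal sketch, nothing to do with the actual code):

public final class CapacityBound {
    // C_l = min_i(C_i / S_i): local capacity is limited by whichever
    // neighbour's capacity/share ratio is smallest.
    static double localCapacity(double[] capacities, double[] shares) {
        double bound = Double.POSITIVE_INFINITY;
        for (int i = 0; i < capacities.length; i++)
            bound = Math.min(bound, capacities[i] / shares[i]);
        return bound;
    }

    public static void main(String[] args) {
        double[] c = { 100, 50, 20 };   // neighbour capacities
        double[] s = { 0.7, 0.2, 0.1 }; // share of our requests each gets
        // 100/0.7 ~= 142.9, 50/0.2 = 250, 20/0.1 = 200: the limit is the
        // *fastest* node here, because it gets a disproportionate share.
        System.out.println(localCapacity(c, s)); // ~142.86
    }
}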

Vive did some work suggesting that misrouting isn't always fatal, or at least 
that it is possible to usefully take performance into account. However, 
experience with 0.5 shows that misrouting is generally a very serious problem. 
0.5 is the extreme case: we essentially routed according to load, 
specialisation was very weak, and as a consequence data persistence and 
performance were poor. In particular, if misrouting results in not finding 
data, we go more hops and are likely to retry, even if a failed request 
doesn't use much bandwidth; and if it results in storing data in the wrong 
place on insert, it is even worse, as a lot more effort is then spent finding 
it.

NLM takes a fairly strict view of misrouting (the selection logic is sketched 
below):
1. We pick up to 4 nodes; these are the top candidates for the request. The 
first is always allowed; the second is allowed if the ideal next node is 
low-capacity (less than min[median/2, first quartile] of our peers' 
capacities); the third is allowed if the request is realtime (which is 
irrelevant, as NLM will be turned off for realtime in the near future); the 
fourth is allowed if we can't find a route after 1 minute.
2. We can route to any of these.
3. After we are rejected we reroute, and can add another node as above.
4. However, if we get too many RejectedLoops, we stop adding more. We *want* 
to time out (that is, to fail to get a slot, to fail to route the request 
onward within 2 minutes), so that we can terminate the request and tell the 
originator not to retry for a while because this part of the keyspace is 
difficult; this is the only mechanism we have at present for dealing with 
high-capacity / low-capacity boundaries.
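
A rough sketch of those candidate rules (hypothetical names; the real logic 
in fred keeps per-request state and is more involved):

import java.util.Arrays;

final class NlmCandidates {
    static final int MAX_CANDIDATES = 4;

    // min(median/2, first quartile) of our peers' capacities; the
    // quartile is a crude index approximation, good enough for a sketch.
    static double lowCapacityThreshold(double[] peerCapacities) {
        double[] sorted = peerCapacities.clone();
        Arrays.sort(sorted);
        double median = sorted[sorted.length / 2];
        double firstQuartile = sorted[sorted.length / 4];
        return Math.min(median / 2, firstQuartile);
    }

    static int allowedCandidates(double idealNodeCapacity,
                                 double[] peerCapacities,
                                 boolean realtime,
                                 long millisRouting) {
        int n = 1;                            // the ideal next node, always
        if (idealNodeCapacity < lowCapacityThreshold(peerCapacities))
            n++;                              // ideal node is low-capacity
        if (realtime) n++;                    // moot once NLM is off for realtime
        if (millisRouting > 60 * 1000) n++;   // couldn't find a route in 1 minute
        return Math.min(n, MAX_CANDIDATES);
    }
}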

This is of course fairly arbitrary... Ideally we'd like to have some 
theoretical basis for how much misrouting is a good thing, or even calculate it 
in real time based on some statistic. But *any* amount of misrouting 
potentially increases the total network load, so we need to be very careful 
here. IMHO such arbitrary limits are probably a reasonable first approximation, 
especially as the stats seem reasonable at the moment...
> 
> These four interrelate, at a fundamental theoretical level. Namely,
> you can't arbitrarily control all four at the same time. To use the
> term loosely, there are four parameters, but only three degrees of
> freedom. So, the question becomes, where do we want to exercise
> control, and how can we do so most optimally?
> 
> This analysis seems to me to lead to a conclusion that we should be
> trying to do load management by controlling who our peers are. We
> can't do much about incoming request distribution (not entirely true
> when you include local requests, more on this later) or peer capacity.
> That leaves one degree of freedom between misrouting and peer location
> distribution, so we find we're trading between those two.

As far as I can see, misrouting is unacceptable, period. There are times when 
we need it, but generally if we are routing to a node simply because the others 
are overloaded, we are losing the battle: we need to either explicitly tell 
downstream not to send requests so fast, or obstruct them by queueing.
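
By "obstruct them by queueing" I mean something like the following (a minimal 
sketch with hypothetical names, not the actual NLM queue):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

final class PerPeerQueue<R> {
    // Bounded: when it fills, back-pressure propagates upstream instead
    // of the request being diverted to a worse route.
    private final BlockingQueue<R> queue = new ArrayBlockingQueue<R>(32);

    // False means we couldn't queue in time; the caller should then fail
    // back to the originator ("slow down"), not misroute.
    boolean enqueue(R request, long timeoutMillis) throws InterruptedException {
        return queue.offer(request, timeoutMillis, TimeUnit.MILLISECONDS);
    }

    // Drained as the peer grants slots.
    R nextWhenSlotFree() throws InterruptedException {
        return queue.take();
    }
}
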
> 
> This is, imho, not a surprising conclusion, but also not one I've seen
> explicitly stated before. It's somewhat explicit in our current LRU
> peer selection scheme, but not to the degree I think we should be
> talking about.

Hmmm...

LRU is pretty close to optimal according to Oskar's work, isn't it?
> 
> There are a couple factors that complicate all of this. These include
> locally originated requests, locally answered requests, and FOAF
> routing. Locally answered requests aren't often discussed in detail in
> simulations. For example, the simulators I've written all try to route
> to the best node on the network (shortest node location to request
> location distance). However, node store size variability makes things
> more complicated than that. I'm also not sure how to handle it in a
> nice clean fashion. We could try simulating it, but that's just going
> to be an exercise in GIGO unless we have actual data to use. (Probe
> requests as per issue 3568 would be good here.) However, I think we
> can get rather far while continuing to ignore that issue.
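
One nitpick for anyone simulating this: "shortest node location to request 
location distance" means circular distance on the [0,1) keyspace, not plain 
subtraction:

final class Keyspace {
    // Locations live on a circle, so the distance wraps around.
    static double distance(double a, double b) {
        double d = Math.abs(a - b);
        return Math.min(d, 1.0 - d);
    }
}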
> 
> Locally originated requests are trickier. Some nodes have lots, some
> have very few, some vary when people browse freesites, etc. I'm not
> really sure what to do about them. They need more careful analysis.
> Specifically, I see two conflicting requirements: local requests need
> to be taken into account when choosing peers, so that our local
> requests don't swamp the long links entirely; and local requests need
> to not be taken into account, so that they don't distort the overall
> network topology.

Yeah. Lots of convenient compromises are possible (e.g. don't path fold while 
HTL is high, using the same threshold at which we don't cache; sketched 
below), but they are not necessarily ideal for the network...
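
Something like this; the constants are my assumptions from memory (max HTL 
18, no caching on the top couple of hops), not a statement of what fred 
actually does:

final class PathFoldPolicy {
    static final int MAX_HTL = 18;      // assumed default
    static final int NO_CACHE_BAND = 2; // assumed: top hops where we don't cache

    static boolean shouldCache(int htl) {
        return htl <= MAX_HTL - NO_CACHE_BAND;
    }

    // The compromise: same threshold, so we never path fold where we
    // wouldn't cache.
    static boolean shouldPathFold(int htl) {
        return shouldCache(htl);
    }
}
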
> 
> FOAF routing, combined with our bandwidth-peer count relationship, is
> a start at handling load by managing who our peers are. I think this
> is the direction further load management work should take.

I'm not sure I follow here.
> 
> I'll end with a few ideas I haven't explored in much detail yet. They
> might be good ones or bad ones, but I hope they're at least
> interesting.
> 
> What if each peer advertises its available outbound bandwidth to its
> peers? The peer selection code could then attempt to maintain a
> constant ratio of bandwidth per fraction of requests served among
> peers.

I'm not sure I follow. Constant ratio of what? Please explain...

NLM does advertise peer capacities, and uses them to try to avoid rejections.
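
Roughly, in the spirit of (not copied from) the actual code: prefer the 
closest peer that has advertised spare capacity, rather than routing blind 
and eating a rejection. Hypothetical interface:

import java.util.List;

interface Peer {
    double distanceTo(double keyLocation); // circular keyspace distance
    double advertisedSpareCapacity();      // from the peer's load stats
}

final class CapacityAwareRouter {
    static Peer route(List<Peer> peers, double keyLocation) {
        Peer best = null;
        for (Peer p : peers) {
            if (p.advertisedSpareCapacity() <= 0) continue; // would reject
            if (best == null
                    || p.distanceTo(keyLocation) < best.distanceTo(keyLocation))
                best = p;
        }
        return best; // null: nothing routable, so queue or fail upstream
    }
}
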
> 
> What about modifying the LRU code to correct the time since last usage
> by a factor of number of FOAF locations? That is, if we have peer A
> with 8 FOAF locations advertised and peer B with 20, and B is
> responding to 1.5 requests for every 1 that A responds to, then A
> should be treated as the better node. That is, we favor nodes which
> minimize ((time since last request answered) * (FOAF locations
> advertised)), so that a node advertising more FOAF locations is expected
> to answer proportionally more often. (Count of FOAF locations is then
> being treated as a standin for capacity, as per the suggestion above.)

This is too easy to game. And a node with lots of FOAFs should perform well, 
shouldn't it?
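
For the record, working your example through (a hypothetical sketch of the 
proposed score):

final class FoafWeightedLru {
    // Per-FOAF idle time: lower is better. FOAF count stands in for capacity.
    static double score(long millisSinceLastAnswered, int foafLocations) {
        return (double) millisSinceLastAnswered * foafLocations;
    }

    public static void main(String[] args) {
        // B answers 1.5 requests for each of A's, so A's idle time is
        // ~1.5x B's. With 8 vs 20 FOAFs: A = 1.5t*8 = 12t, B = t*20 = 20t,
        // so A scores lower and is the one we keep.
        System.out.println(score(1500, 8));  // A: 12000.0
        System.out.println(score(1000, 20)); // B: 20000.0
    }
}
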
> 
> Of course, all of these have the interesting problem that they distort
> topology away from stuff we have careful theoretical models for.
> That's bad. OTOH, I'm not entirely sure that those same models are
> actually directly applicable to the current network, given the
> combination of LRU peers selection, FOAF routing, and variable peer
> counts with capacity. The models also don't tell us what to do about
> the fact that some nodes generate a lot more requests than others.

Well yeah, freenet in practice is messy ... However, FOAF at least is based on 
theoretical work. So is LRU. Everything else is difficult, yeah...