The fundamental routing model for Freenet (opennet and darknet) is
based on an assumption of proper distribution of link lengths (the
distance between the two endpoint nodes).  In order to achieve good
routing efficiency, the cumulative distribution of link lengths should
follow a logarithmic curve; that is, each node should have
approximately equal numbers of links of length 0.25-0.5 (0.5 being
the longest possible link on the circular keyspace), 0.125-0.25,
0.0625-0.125, and so on.  For
details, see the papers by Kleinberg:
http://www.cs.cornell.edu/home/kleinber/nat00.pdf
http://www.cs.cornell.edu/home/kleinber/swn.pdf
Roughly speaking, this distribution means that requests start by
taking long hops, and then narrow in on the target location via
smaller and smaller hops, permitting O(log(n)^2) route lengths.
Opennet and darknet both attempt to maintain this distribution;
opennet through connection churn, darknet through location swaps.
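To make the target distribution concrete, here's a quick Python sketch
(the names are mine, not actual Freenet code) that computes the circular
link length on the [0, 1) keyspace and bins links into the octave bands
above; under the ideal distribution, every band holds roughly the same
number of links.

```python
import math

def link_length(a, b):
    """Circular distance between two locations on the [0, 1) keyspace."""
    d = abs(a - b)
    return min(d, 1.0 - d)

def octave_counts(lengths, bands=8):
    """Count links per octave band: band 0 is (0.25, 0.5], band 1 is
    (0.125, 0.25], etc.  Ideally all counts come out roughly equal."""
    counts = [0] * bands
    for d in lengths:
        if d <= 0:
            continue
        band = min(bands - 1, int(math.floor(math.log2(0.5 / d))))
        counts[band] += 1
    return counts
```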

TheSeeker raised an interesting point in IRC yesterday: link lengths
on opennet nodes don't actually appear to follow this distribution.
I've made a very cursory start on the analysis, but as I'm busy with
other parts of Freenet I'm going to hope someone else can continue it.
I took the FOAF location data from my opennet node (normally I have a
couple darknet peers, but at the time I took the data they were
offline, so this is purely opennet data).  From that, I can see the
link lengths from my node to its peers, and from those peers to their
peers.  I then plotted the distribution of link lengths, both on a
linear scale and a log scale:
freenet:CHK@CQnUwrJpEOOl7CGHE7jHE47WClMS1unpcgNRC~yBCIw,Tzdx8plj--XJmReCOAxmV3TdCVz66pKe~snXgSpK~aI,AAIC--8/locations.png
freenet:CHK@fgKX0f57ivs7qvLg1QaqxMudLdMrHXVephO-gsMgEPY,VRcVHdnLnq9UDi-WMyk8IzW12wq-YTj2M5z2H2fawmE,AAIC--8/log_locations.png
Spreadsheet those came from:
freenet:CHK@MXlEbI5oppUIOU3n3MJ7xjvH9I1DXGcPSBqjHebtWaQ,j9pG8L-hWoUVYRbKtwQMeM~8siE1kfGlrBucI9xLe~U,AAIC--8/locations.gnumeric
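For anyone who wants to repeat the analysis, the computation behind
those plots is simple.  A sketch in Python, assuming you've already
parsed the locations out of the FOAF data (the data layout here is my
guess, not the actual noderef format):

```python
def all_link_lengths(my_loc, peers):
    """peers maps each peer's location to the list of FOAF locations it
    advertises.  Returns my direct link lengths plus each peer's link
    lengths to its own peers, sorted for plotting."""
    def dist(a, b):
        d = abs(a - b)
        return min(d, 1.0 - d)
    lengths = [dist(my_loc, p) for p in peers]
    for p, foafs in peers.items():
        lengths.extend(dist(p, f) for f in foafs)
    return sorted(lengths)

def empirical_cdf(lengths):
    """(length, fraction of links <= length) points; plot the length
    axis on a log scale to eyeball the straight-line prediction."""
    n = len(lengths)
    return [(d, (i + 1) / n) for i, d in enumerate(lengths)]
```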

The latter graph is probably more informative: ideally, those points
should approximate a straight line.  Clearly, they don't, though I
haven't run any statistical measures.  (The curious might want to
start here: http://en.wikipedia.org/wiki/Kolmogorov-Smirnov_test .)
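If someone does want to run a statistical test: the ideal CDF is
logarithmic, F(d) = ln(d/d_min)/ln(d_max/d_min), so the KS statistic is
easy to compute directly.  A sketch (d_min is an arbitrary cutoff I
made up, since a logarithmic CDF needs a lower bound):

```python
import math

def ks_statistic(lengths, d_min=1e-4, d_max=0.5):
    """KS distance between the empirical link-length distribution and
    the ideal logarithmic CDF F(d) = ln(d/d_min) / ln(d_max/d_min)."""
    data = sorted(max(d_min, min(d, d_max)) for d in lengths)
    n = len(data)
    log_range = math.log(d_max / d_min)
    stat = 0.0
    for i, d in enumerate(data):
        f = math.log(d / d_min) / log_range
        # compare F against the empirical CDF just below and at d
        stat = max(stat, abs(f - i / n), abs(f - (i + 1) / n))
    return stat
```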

There are several confounding factors.  First, the data aren't
independent; there should be local clustering, and I seem to have
double-counted the links to my node (18 of 353 data points); links
between my peers would also be double-counted, but there don't appear
to be any.  Second, there's the bandwidth issue: some of my peers are
faster than others, as can be seen from their varied number of FOAF
locations.  Peers with more FOAF locations will receive more traffic,
in (rough) proportion to the number of FOAF locations they have.  I'm
uncertain how the link length distribution should respond; perhaps a
link to a peer should be counted n times, where n is the number of
FOAF locations it advertises?  Or perhaps not; figuring that out would
take some theoretical work I haven't done.  On average, across a large
number of nodes / links, that effect should go away.  Third, we must
be wary of observer bias: nodes with more connections are more likely
to show up in a random sample of peers-of-peers.  This will bias the
FOAF link length counts, but not the counts of my node's own links.
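If weighting by FOAF count turns out to be the right correction for the
bandwidth effect, it's a small change to the binning.  A sketch (again,
this is speculation on my part, not a worked-out methodology):

```python
import math

def weighted_octave_counts(links, bands=8):
    """links: list of (link_length, foaf_count) pairs.  Weight each
    link by the peer's advertised FOAF location count, as a rough proxy
    for the traffic it receives.  Returns per-band fractions."""
    counts = [0.0] * bands
    total = 0.0
    for d, foafs in links:
        if d <= 0:
            continue
        band = min(bands - 1, int(math.floor(math.log2(0.5 / d))))
        counts[band] += foafs
        total += foafs
    # normalise so the bands sum to 1 for easy comparison
    return [c / total for c in counts] if total else counts
```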

I'm not sure what to make of this data: it may represent a deep-rooted
problem with routing, or it might go away if I properly account for
the above effects and gather more data.

I can think of several possible sources of problems.  For example, the
LRU link retention policy will tend to favor badly distributed links
to nodes that have higher bandwidth (and therefore more FOAF
locations) simply because we route more requests to them.  Similarly,
it will tend to favor nodes with large caches that are serving recent
/ popular data effectively over those that have smaller caches but
route unpopular requests better.  One simple problem is that strict
LRU is noisy; combined with nodes that aren't always online, that
might cause too much random connection churn and too little structured
connection churn.  We could smooth the data in some fashion without
impacting the theoretical basis behind the LRU criteria, I believe
(though I'd have to look at the relevant paper in more detail).  We
could also consider attempting to correct for FOAF location count
somehow.
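By smoothing I mean something like replacing the raw last-used
timestamp with an exponentially decayed usage score.  A sketch (the
half-life is a made-up tuning parameter, and I haven't checked this
against the relevant paper):

```python
class SmoothedLRU:
    """Rank peers by an exponentially decayed usage score instead of a
    raw last-used timestamp, to damp the noise in strict LRU."""

    def __init__(self, half_life=3600.0):
        self.decay = 0.5 ** (1.0 / half_life)  # per-second decay factor
        self.scores = {}   # peer -> score at last update
        self.updated = {}  # peer -> time of last update

    def record_use(self, peer, now):
        score = self.scores.get(peer, 0.0)
        last = self.updated.get(peer, now)
        score *= self.decay ** (now - last)  # age the old score
        self.scores[peer] = score + 1.0
        self.updated[peer] = now

    def eviction_candidate(self, now):
        """Peer with the lowest decayed score (what LRU would drop)."""
        def current(peer):
            return self.scores[peer] * self.decay ** (now - self.updated[peer])
        return min(self.scores, key=current)
```

A recently busy peer keeps a high score for a while even if its very
last use is older than someone else's, which is the smoothing effect.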

I think this merits further investigation, involving both some theory
and more data.  If there is in fact a problem, we should give
consideration to modifying the link selection policy.  It would also
be useful to have some data about hybrid darknet/opennet nodes: are
those fixed links causing link length distribution problems?  (Pure
darknet nodes would be an entirely different case, I believe.)

Evan Daniel
