Ray Heasman wrote:
> 3. Insertion of data
>
> My understanding is that FreeNet performs a depth-first backtracking graph
> traversal of nodes on the network.

Technically, yes, but keep in mind that backtracking happens only in
two circumstances: if you hit a dead node (hopefully rare), or if you
hit a node which is already working on this search request (common).
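
To make the two cases concrete, here is a rough Python sketch (invented
names, not the actual Freenet code) of a depth-first search with exactly
these two backtrack triggers:

    def route(node, key, request_id, htl):
        # Returns the data if found, or None if this branch fails.
        if key in node.store:
            return node.store[key]
        if htl == 0:
            return None                      # hops exhausted: search fails here
        for neighbor in node.neighbors_by_closeness(key):
            if request_id in neighbor.seen:  # already working on this request:
                continue                     # the common backtrack
            if not neighbor.alive():         # dead node: the rare backtrack
                continue
            neighbor.seen.add(request_id)
            result = route(neighbor, key, request_id, htl - 1)
            if result is not None:
                return result
        return None                          # every neighbor failed: backtrack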

> Insertion has two stages: first, a search
> to check that the data to be inserted does not already exist, second, an
> insertion phase that tunnels data to the last point at which the search
> failed.

...and caches it along the way.

> During the request phase, each node forwards a request to nodes it thinks
> most likely of having the data being inserted. This is done by making an
> estimate of how close the key describing the search is to the keys usually
> handled by the nodes it knows about.

Isn't it still just a matter of choosing the node which holds the
key closest to the target one?  This description makes it sound like
a more elaborate algorithm is used.
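
That is, something like this sketch, where distance() is just a stand-in
for whatever closeness metric Freenet actually uses:

    def distance(a, b):
        # keys treated as 160-bit integers; a stand-in for the real metric
        return abs(a - b)

    def best_neighbor(node, target_key):
        # forward toward the neighbor which has handled the key
        # closest to the target
        return min(node.neighbors,
                   key=lambda n: min(distance(k, target_key)
                                     for k in n.known_keys))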

> Unique IDs are used so that requests
> are never processed twice by the same node (i.e. They are used by the search
> algorithm to change the graph of nodes into a dynamically generated tree
> that contains no loops). This eliminates redundant searching.

Right, this is the main source of the "backtracking" above.

> Search packets also contain a "hops to live" parameter (HTL), which is
> decremented every time the search packet is forwarded to a node.
>
> Assuming the data to be inserted is not already on the network, eventually
> HTL drops to 0, and the lucky node thus found signals the search failed. At
> this point, the originating node starts streaming the data back along the
> path searched (sans backtracks) to the lucky node.

Keeping in mind that the main cause of backtracks is hitting a node
already on the path, "sans backtracks" is not really meaningful here:
the same set of nodes is involved whether you count the backtracks or not.

> What I find interesting about this is that we are storing data in the last
> place the network looked (rather than the first).

This is not true.  The data is stored in the first place the network
looked, namely the best neighbor of the initiating node.  It is stored in
the second place the network looked, namely the best neighbor of this next
node.  And so on, all the way up to the last place the network looked.
This is entirely because of caching during insertion.
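
In sketch form (again with invented names), the insert phase amounts to:

    def insert_along_path(path, key, data):
        # path[0] is the best neighbor of the initiating node;
        # path[-1] is the node where HTL ran out.
        for node in path:
            node.store[key] = data   # every node on the path keeps a copy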

Reducing caching, as proposed later, will make it less true that we
store data in the first place we looked.  That may or may not be a good
idea, but it is important to understand the role of caching.

> 5. Search keys
>
> Keys are 160-bit hashes of the data to be stored. This means that data with
> similar keys may not be associated in terms of content at all. This is a
> good thing as data are not clustered on servers by subject, making the loss
> of a node less damaging. However for searching to work, there has to be a
> concept of "closeness" of keys. Adam Langley provides a rough description of
> how this is currently done in [AL1]. This combines the test of whether one
> key is "less" than another with a binary tree to find the closest key in a
> group. Closeness is defined as the distance between leaves on this tree.

I thought the tree based search was just a speed optimization.  The
basic heuristic is still to define relative closeness, right?  We can
tell if A is closer to B or to C.  This should be irrespective of what
other keys D, E, F, ... are present on the node.
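
In other words, all the heuristic needs is a pairwise comparison,
roughly this (again with numeric distance as a stand-in):

    def closer(target, b, c):
        # Which of b, c is closer to target?  The answer depends only on
        # the three keys compared, not on what else the node holds.
        return b if abs(target - b) <= abs(target - c) else c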

> 5.1 Reservations
>
> One thing that worries me about this definition of closeness is that it is
> entirely dependent on what keys a node currently holds. An outside observer
> with knowledge of only a subset of keys on the node would not be able to
> predict the closeness of two keys calculated by that node. It also means
> that as each node prunes and updates its node routing list it will change
> the behaviour of future routing decisions in unpredictable ways.

I believe that the only real issue is which key on the node is the closest
to the target one.  Any self-organizing network is going to change its
routing decisions over time based on the information present at each node,
so this is not really a cause for worry.

> 6. Potential problems
>
> The model mentioned in section 2 suggests a potential problem. In order to
> be a good self-organising network that routes requests to nodes containing
> their associated data, it is necessary for nodes in the vicinity of the
> insertion point to have many requests routed through them.

I don't understand this, and it may be based on the false assumption that
most requests "should" be satisfied by a node near the original insertion
point of the data.  But Freenet is not designed to follow this principle.
It is entirely possible that most requests will be satisfied by a node
far from the original insertion point.  In that case there is no reason
for nodes near the insertion point to see many requests for that data.

> In the current
> implementation, caching is very aggressive: a node caches data the first
> time it sees it and declines to request that data again if a request arrives
> for it.
>
> This has the profound effect of moving the data away from the point of
> insertion.  I.e., it starves the routing mechanism of data.

Yes, to the first, but no to the second.  The routing mechanism is not
starved, it is still informed about where the data is.

> Cached files are so
> preferred by the network that after one or two hits, the original data is
> never fetched again. After a while the network forgets how to route to that
> file, simply because it doesn't have to.

There is no distinction between cache and original data in Freenet.
Your copy of a file is as valid as mine.  This analysis seems to assume
that there is an original, which gets lost because it is never referenced,
and a bunch of caches, which get swamped.  But that is not how Freenet
works; all copies of a file are equal.

> Later, because of the incredibly
> high level of thrashing in the caches, all the caches drop the file, and the
> file is essentially lost.

It's this high level of thrashing that is the problem.  If the data is
dropped by "all the caches" then it is because some other data is more
popular.

> In short, caching may be the cancer killing freenet.

It's not caching per se, since that is crucial to the way Freenet works
(see discussion of insertion above).

However there is a germ of truth, in that the real problem is that the
network is being asked to store too many copies of data.  The storage
capacity of the network is equal to total space divided by number of
copies per data item.  As the number of copies increases, the total
capacity decreases and we see thrashing.
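
To put invented numbers on it: 1,000 nodes with 1 GB each gives 1 TB of
raw space, but if the average data item ends up cached on 50 nodes, the
network can hold only 20 GB of distinct data.  Every doubling of the
copy count halves the effective capacity.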

But again, it's not the existence of caching that is the problem; it is
overuse of network resources by end users.  There is a tragedy of the
commons here, and it manifests itself as "creeping HTL".  Each person
inserting data uses a higher HTL than the other guy in order to make
his data more widespread.  They also use tricks like connecting to a
bunch of random nodes and sucking the data, in order to spread it around
even more.

Furthermore, this HTL escalation makes the problem worse because it
makes the search space flatter.  Nodes can't intelligently choose their
neighbors when every neighbor has the same data.  Routing paths become
random and data which is not widely spread through the network is not
found.

This further drives people to insert with maximal HTL so that their
data is everywhere, but that then just makes the problem worse for
everyone else, first by causing thrashing and second by flattening the
search space.

I don't know if reducing caching would really help, given the
determination by users to get their data spread everywhere.  They might
just increase their remote fetching to pull the data even harder.
However, it does seem that something needs to be done to reduce the
number of copies being kept for data items and to allow the network to
do its job of finding data via intelligent routing.

Hal
