Hi Hal,

On Sat, May 26, 2001 at 11:48:07AM -0700, hal at finney.org wrote:
> Ray Heasman wrote:
> 
> 
> > What I find interesting about this is that we are storing data in the last
> > place the network looked (rather than the first).
> 
> This is not true.  The data is stored in the first place the network
> looked, namely the best neighbor of the initiating node.  It is stored in
> the second place the network looked, namely the best neighbor of this next
> node.  And so on, all the way up to the last place the network looked.
> This is entirely because of caching during insertion.

Um. This may be a matter of interpretation, but I think I disagree with you.
If you count a collision with a node that forces you to backtrack as a
"look", then the file is stored in the last place you look.

> > 5. Search keys
> >
> > Keys are 160-bit hashes of the data to be stored. This means that data with
> > similar keys may not be associated in terms of content at all. This is a
> > good thing as data are not clustered on servers by subject, making the loss
> > of a node less damaging. However for searching to work, there has to be a
> > concept of "closeness" of keys. Adam Langley provides a rough description of
> > how this is currently done in [AL1]. This combines the test of whether one
> > key is "less" than another with a binary tree to find the closest key in a
> > group. Closeness is defined as the distance between leaves on this tree.
> 
> I thought the tree based search was just a speed optimization.  The
> basic heuristic is still to define relative closeness, right?  We can
> tell if A is closer to B or to C.  This should be irrespective of what
> other keys D, E, F, ... are present on the node.

Ok. Thanks for the clarification.
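
For what it's worth, here is how I now picture the relative-closeness test
you describe. I'm not claiming this is the metric the current code uses
([AL1] is the reference for that); it just illustrates an "is A closer to B
or to C" comparison, treating the 160-bit keys as unsigned integers:

    import java.math.BigInteger;

    class KeyCloseness {
        // True if key a is closer to b than to c, using plain numeric
        // distance on the keys read as unsigned 160-bit integers.
        static boolean closerTo(byte[] a, byte[] b, byte[] c) {
            BigInteger ka = new BigInteger(1, a);   // signum 1 = unsigned
            BigInteger distB = new BigInteger(1, b).subtract(ka).abs();
            BigInteger distC = new BigInteger(1, c).subtract(ka).abs();
            return distB.compareTo(distC) < 0;
        }
    }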

> > 6. Potential problems
> >
> > The model mentioned in section 2 suggests a potential problem. In order to
> > be a good self-organising network that routes requests to nodes containing
> > their associated data, it is necessary for nodes in the vicinity of the
> > insertion point to have many requests routed through them.
> 
> I don't understand this, and it may be based on the false assumption that
> most requests "should" be satisfied by a node near the original insertion
> point of the data.  But Freenet is not designed to follow this principle.
> It is entirely possible that most requests will be satisfied by a node
> far from the original insertion point.  In that case there is no reason
> for nodes near the insertion point to see many requests for that data.

Um. By "insertion point" I mean the final resting place of the data once it
has been inserted, not the node that requests the insert. Are we on the same
wavelength here?

> 
> > In the current
> > implementation, caching is very aggressive: a node caches data the first
> > time it sees it and declines to request that data again if a request arrives
> > for it.
> >
> > This has the profound effect of moving the data away from the point of
> > insertion. ie. It starves the routing mechanism of data.
> 
> Yes, to the first, but no to the second.  The routing mechanism is not
> starved, it is still informed about where the data is.

*mumble*. Convince me. How can the network keep track of where a file lives
if random instances of it keep popping up and disappearing in a matter of
minutes?
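
Just so we're arguing about the same behaviour, this is roughly what I
understand a node to do on a request today. Again a sketch, not the actual
code; dataStore and routeToNextNode() are stand-ins:

    import java.util.HashMap;
    import java.util.Map;

    class RequestSketch {
        private final Map<String, byte[]> dataStore = new HashMap<>();

        byte[] handleRequest(String key) {
            byte[] local = dataStore.get(key);
            if (local != null) {
                return local;                // cached copy: the request is
            }                                // never forwarded again
            byte[] fetched = routeToNextNode(key);
            if (fetched != null) {
                dataStore.put(key, fetched); // cache on first sight; in the
            }                                // real store this can evict
            return fetched;                  // something else
        }

        private byte[] routeToNextNode(String key) {
            return null;                     // stand-in for the real routing
        }
    }

Once a popular file has passed through a node like this, the node answers
locally and never exercises or refreshes its routing knowledge for that key
again.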

> > Cached files are so
> > preferred by the network that after one or two hits, the original data is
> > never fetched again. After a while the network forgets how to route to that
> > file, simply because it doesn't have to.
> 
> There is no distinction between cache and original data in Freenet.
> Your copy of a file is as valid as mine.  This analysis seems to assume
> that there is an original, which gets lost because it is never referenced,
> and a bunch of caches, which get swamped.  But that is not how Freenet
> works; all copies of a file are equal.

Yes. And the second half of my document is basically an attempt to explain
why I think this is a terminally stupid idea. I am arguing that this
distinction _must_ exist in order for Freenet to be successful.

> > Later, because of the incredibly
> > high level of thrashing in the caches, all the caches drop the file, and the
> > file is essentially lost.
> 
> It's this high level of thrashing that is the problem.  If the data is
> dropped by "all the caches" then it is because some other data is more
> popular.

Yes. This is another fundamental point. Data should not be destroyed by
requesting a file; data should only be destroyed by the insertion of a file.
Needless to say, some destruction is required if you are going to do
caching. Thus I think it is vitally important that caches be distinct from
inserted-data repositories, and that some mechanism be installed to canonise
cached versions of popular files.
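
To make the split concrete, here is a napkin sketch of what I mean by
keeping the cache distinct from the inserted-data repository. The eviction
policy and the popularity threshold are placeholders, not the mechanisms I
actually suggest in Section 8:

    import java.util.HashMap;
    import java.util.LinkedHashMap;
    import java.util.Map;

    class SplitStoreSketch {
        private static final int MAX_CACHE_ENTRIES = 1000;  // placeholder
        private static final int CANONISE_THRESHOLD = 50;   // placeholder

        // Inserted data: requests never evict anything from here.
        private final Map<String, byte[]> inserted = new HashMap<>();

        // Request cache: bounded, drops the least-recently-used entry
        // whenever it overflows.
        private final Map<String, byte[]> cache =
            new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, byte[]> e) {
                    return size() > MAX_CACHE_ENTRIES;
                }
            };

        private final Map<String, Integer> requestCount = new HashMap<>();

        // Only an insert may displace inserted data (this sketch just
        // overwrites by key; a real store would need its own policy).
        void insert(String key, byte[] data) {
            inserted.put(key, data);
        }

        // A request can only trash other cache entries, never inserted
        // data, and a sufficiently popular cached file gets "canonised".
        byte[] request(String key) {
            byte[] data = inserted.get(key);
            if (data != null) {
                return data;
            }
            data = cache.get(key);
            if (data != null
                    && requestCount.merge(key, 1, Integer::sum) >= CANONISE_THRESHOLD) {
                inserted.put(key, data);
                cache.remove(key);
            }
            return data;
        }

        void cacheFromNetwork(String key, byte[] data) {
            cache.put(key, data);
        }
    }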

> 
> > In short, caching may be the cancer killing freenet.
> 
> It's not caching per se, since that is crucial to the way Freenet works
> (see discussion of insertion above).
> 
> However there is a germ of truth, in that the real problem is that the
> network is being asked to store too many copies of data.  The storage
> capacity of the network is equal to total space divided by number of
> copies per data item.  As the number of copies increases, the total
> capacity decreases and we see thrashing.
> 
> But again, it's not the existence of caching which is the problem, it is
> overuse of network resources by end users.  There is a tragedy of the
> commons here and it manifests itself by "creeping HTL".  Each person
> inserting data uses a higher HTL than the other guy in order to make
> his data more widespread.  They also use tricks like connecting to a
> bunch of random nodes and sucking the data, in order to spread it around
> even more.

*nod* I feel you are arguing my points for me.
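
To put some purely illustrative numbers on the capacity point: if 1000
nodes each contribute 1 GB and the average item ends up with 50 copies
scattered around, the network holds roughly 1000 GB / 50 = 20 GB of
distinct data; push the average to 200 copies through HTL escalation and
you are down to about 5 GB, even though no disk got smaller.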

> 
> Furthermore this HTL escalation makes the problem worse because it
> makes the search space flatter.  Nodes can't intelligently choose their
> neighbors when every neighbor has the same data.  Routing paths become
> random and data which is not widely spread through the network is not
> found.

*nod* This is what I meant by dilution of routing information. Fifteen
minutes later everyone decides the latest pigdog entry is awfully
interesting, all your caches turf out their data, and you are left with a
bunch of random routing entries and no data.

> 
> This further drives people to insert with maximal HTL so that their
> data is everywhere, but that then just makes the problem worse for
> everyone else, first by causing thrashing and second by flattening the
> search space.
> 
> I don't know if reducing caching would really help, given the
> determination by users to get their data spread everywhere.  They might
> just increase their remote fetching to pull the data even harder.
> However it does seem that there needs to be something done to reduce the
> number of copies being kept for data items and to allow the network to
> do its job of finding data via intelligent routing.

If you remove the ability for requests to trash data, most of the problems
you describe fall away. Doing that does require some extra mechanisms,
though. Please see the second half of my document for my reasoning, along
with some suggested mechanisms for rationalising split repositories in
Section 8.

Thanks,
Ray

_______________________________________________
Devl mailing list
Devl at freenetproject.org
http://lists.freenetproject.org/mailman/listinfo/devl
