[Freenet-dev] Datastore formats and scalability

Oskar Sandberg Fri, 4 Aug 2000 14:08:53 +0700

On Fri, 04 Aug 2000, you wrote:
<>
> No, the reason why I want to use a binary tree is to do high speed
> file index lookups.  The reason why is that you don't have to compare
> the key you are looking for with each other key for other files until
> you come upon the file the key you have points to.  And also, this
> would greatly reduce the time spent in the worst case (which would be
> searching for a nonexistant item with a plain sequential search).


Ok, Travis, what do you think we are? A bunch of gradeschool kids who found the
"implement new technology wizard" in a warezed copy of MS Programmer 2005?

Of course nobody is doing any linear lookups for references or data from the
key value. Even if Hashtables (which are better for this job then binary trees)
had not been available in the java standard libraries, they are not exactly
rocket science to implement.

The problem is that you are completely misundertsanding the difficult part of
designing a datastore. This is not an old "bigger, faster, better" (zzz)
problem, it is a design problem that nobody has solved yet.

The easy part of the DataStore is the lookups. Each key (20 bytes) entry needs
to have an address (4 bytes) of another node and a reference to where the
actual data is stored (not sure, but to be safe say 16 bytes). That is 40 bytes
per entry. In other words, an index for a currently standard node holding 500
entries is 20 kB. An index for an extreme node holding a million entries is
only 40 MB, still small enough to keep in the memory, and certainly not an
issue on disk. Your half gigabyte index file that "will happen" contains 125
million entries. I would like to contend that that will _not_ happen, and that
is not because I think "Freenet never get that big", but for that same reason
you don't see that many ADT tree structures with 125 million branches per level.

So if you want to look up entries in that list fast, all you have to do is
write a hashtable (if you want it in memory - which I would recommend) or a
b-tree (if you want it on disk). Then you can store the actual data in a
database or on disk or wherever. 

However, that is NOT the problem with DataStore design. Looking up data in a
list is probably the oldest problem in CS and we could probably find a million
examples of people having done it before. The hard part in the DataStore is the
routing, being able to find the "closest" key and it's reference, as well as
the nth-closest should that fail. This is not a case where I am aware of much
prior art (I've wondered a bit about spell checkers, if anybody knows anything
else I would be very interested). 

A linear solution for this is very simple, go through the list and find the
entry that is closest, second closest, etc. And linear should be fine as long
as we only have 500 entries - but if you want a million entries, then I can
tell you it will NOT be fine. And nobody has put forward a good solution for an
O(<n) lookup of the closest key based only on the existance of a closeness
relation.

I did write an O(log n) datastore based on a binary tree for Serapis, but to
make it work I made the assumption that there was an order on the keys
equivalent to the closeness. For the current case of lexicographic comparisons
that works fine : I ordered the tree lexicographically, and since walking
through a tree "In order" is easy enough, all you have to do to find the
closest key is start from where the key would be in the tree, and the closest
is either the previous or next entry. However, not all keys can be guaranteed
to have the property that they can be ordered so the closest keys end up next
to them - examples of keys that can't, mentioned here only in the last few days,
are circular lexicographic trees (F is closer to 0 than D - one could
probably hack this into a tree, but it would be specific to this type) and the
logicial search keys which have infinite dimensions.

The only (possible) solution I have come up with to the problem is the sexy
but probably impossibly impractical idea that the answer to the question:
"What can find the closest key fast having only a closeness relation between
them?" is, of course, Freenet - so the DataStore ADT should be modeled as a
Freenet microcosmos within each node (with each node unit in the microcosmos
having a datastore small enough for linear searches). If somebody has a better
idea I would be VERY interested.

-- 
\oskar

_______________________________________________
Freenet-dev mailing list
Freenet-dev at lists.sourceforge.net
http://lists.sourceforge.net/mailman/listinfo/freenet-dev

[Freenet-dev] Datastore formats and scalability

Reply via email to