On Sun, Nov 11, 2001 at 06:20:20PM -0800, Scott G. Miller wrote:
> > I'm sure the "datastore bug" (probably a family of poorly understood flaws
> > working in concert) is tied into one of these systemic holes -- the
> > insufficient treatment of the problem of keeping the on-disk files and
> > accounting data in a continuously consistent state so that the file-system
> > can start up properly no matter what was happening the last time it was
> > shut down (^C and killall java being the more popular ways :).
> >
> From my trace of the code, the problem was that the routing table was
> being stored in the Datastore, without any real safeguards in place to
> keep the routing table from being freed in favor of ordinary data.

But .. that's impossible, since the routing table isn't stored under
a key, and only keys can be freed to reclaim space.

> I'm not entirely certain a complete redesign was necessary.  In the original
> DFS code, the routing table and datastore index were stored in a separate
> file for a *reason*.

Whether they are actually in separate files is pretty inconsequential in
my new design .. I'm basically just cutting off a region at the beginning
and using it like a separate file.
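
To make the "region at the beginning" idea concrete, here is a rough
sketch, not the actual Fred code. The class name RegionedStore and the
reservedBytes parameter are made up for illustration, and an in-memory
ByteBuffer stands in for the store file (a memory-mapped file would
behave the same way). The reserved prefix and the rest of the store are
exposed as two views that each start at offset 0.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration: one backing store, split into a reserved
// accounting region at the front and a data region after it.
final class RegionedStore {
    private final ByteBuffer accounting; // bytes [0, reservedBytes)
    private final ByteBuffer data;       // bytes [reservedBytes, capacity)

    RegionedStore(ByteBuffer backing, int reservedBytes) {
        if (reservedBytes < 0 || reservedBytes > backing.capacity()) {
            throw new IllegalArgumentException("bad reserved size");
        }
        ByteBuffer view = backing.duplicate();
        view.clear();                    // position 0, limit = capacity
        view.limit(reservedBytes);
        accounting = view.slice();       // view of the reserved prefix
        view.limit(view.capacity());
        view.position(reservedBytes);
        data = view.slice();             // view of everything after it
    }

    // Each call returns an independent cursor over the same bytes.
    ByteBuffer accountingRegion() { return accounting.duplicate(); }
    ByteBuffer dataRegion()       { return data.duplicate(); }
}
```

Code working on either region never sees the other one, which is all
that "separate file" really buys you.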

> > The current design also fails to address the realities of dealing with
> > very large stores that might contain many thousands of files and possibly
> > on the order of a million fragments.  I don't think it's reasonable to
> > keep the accounting information for all that in memory but that's what
> > we do.
> It's completely reasonable, given that datastores really should not be
> these multi-gigabyte monsters.

Fred already consumes large amounts of memory, and even with moderately
sized stores we can reduce that memory usage with a system like the one
I described.

And if someone does choose to run, say, a 30 GB store, it's less than ideal
if Fred turns around and eats all of their physical memory and swap.

Every indication is that stores are going to grow insanely huge over the
next few years anyway, despite us sitting here saying "don't do that."

> > So my major target in the rewrite is a system that, 1) maintains the
> > files and accounting data in a continuously consistent state, and
> > 2) can handle, say, 10^6 files in 10^7 distinct fragments with
> > celerity and yet a modesty of memory use.
> > 
> > It turns out these two goals are deeply intertwined, since what you
> > need are on-disk accounting structures that you can make atomic updates
> > to, and that you can also organize into binary search trees.  I have
> > worked out most of the details of such a system.  It reserves space at
> > the beginning of the store for a collection of independently updateable
> > accounting blocks.  Each of these records data about a certain range
> > of keys or storage fragments.  They are indexed into in-memory trees so
> > that any table look-up (i.e., checking if a key is in the store) requires
> > at most one disk access.  Each block has a checksum so it is possible to
> > tell if one was incompletely written.  Updates are made by writing a
> > special block in an empty slot to declare that an update is in progress,
> > then writing new versions of the updated blocks into empty slots, and
> > finally destroying the update-block and then queueing the old slots
> > of the updated blocks for re-use (i.e., deleting them).
> Checksums?  Wow this is overkill.  Check out some journaling filesystem
> design, the idea is not to get consistency through atomic writes, and to
> throw out any blocks that didn't make it to the journal (didn't complete
> atomically).  I'm a bit worried that this is getting out of control.  But
> good luck anyway.

Our encryption requirements pretty much force the division of the directory
into atomic blocks since otherwise you have to rewrite the entire thing
each time you update it.  The other option is to just keep logging changes
but you still have to rewrite it occasionally and that's the dangerous
moment.  Plus this latter approach forces you to keep everything in
memory which I wanted to avoid.
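
For what it's worth, the per-block checksum is cheap. A minimal sketch,
assuming a fixed block size and CRC32 (the real design might use
something else); the name AccountingBlock and the layout of a 4-byte
length, the payload, zero padding, and an 8-byte checksum trailer are
invented for this example. Encryption of the block is left out.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.zip.CRC32;

// Hypothetical block format: [int length][payload][padding][long crc32]
final class AccountingBlock {
    static final int HEADER = 4;   // payload length
    static final int TRAILER = 8;  // CRC32 value stored as a long

    // Pack a payload into a fixed-size block with a trailing checksum.
    static byte[] encode(byte[] payload, int blockSize) {
        if (payload.length > blockSize - HEADER - TRAILER) {
            throw new IllegalArgumentException("payload too large for block");
        }
        ByteBuffer buf = ByteBuffer.allocate(blockSize);
        buf.putInt(payload.length);
        buf.put(payload);
        CRC32 crc = new CRC32();
        crc.update(buf.array(), 0, blockSize - TRAILER);
        buf.putLong(blockSize - TRAILER, crc.getValue());
        return buf.array();
    }

    // Return the payload, or null if the block fails its checksum,
    // e.g. because it was only partly written before a shutdown.
    static byte[] decode(byte[] block) {
        if (block.length < HEADER + TRAILER) return null;
        int bodyEnd = block.length - TRAILER;
        CRC32 crc = new CRC32();
        crc.update(block, 0, bodyEnd);
        ByteBuffer buf = ByteBuffer.wrap(block);
        if (crc.getValue() != buf.getLong(bodyEnd)) return null;
        int len = buf.getInt(0);
        if (len < 0 || len > bodyEnd - HEADER) return null;
        return Arrays.copyOfRange(block, HEADER, HEADER + len);
    }
}
```

On start-up, any slot whose decode() comes back null is simply treated
as empty, so a half-written block never replaces the old version.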

Thanks for the good wish ;)

-- 

:: tavin cole (tcole at espnow.com) ::

if there's been a way to build it
there'll be a way to destroy it
things are not all that out of control

                        - stereolab

_______________________________________________
Devl mailing list
Devl at freenetproject.org
http://lists.freenetproject.org/mailman/listinfo/devl
