On Monday 28 July 2008 09:58, Matthew Dillon wrote:
> :1) The Update a Clone Method
> :
> :2) The Update in Place Method
> :
> :Daniel
>
> Hmm.  Those seem fairly complex.
Then I need to rewrite the post so it seems as simple as it is :-)

> How would you deal with incidental operations on the B-Tree such as
> doing a split?  A single insert could wind up requiring a split of
> every internal node on the way down to the leaf.  I guess you could
> clone the blocks, but there are a ton of parent and child linkages
> that have to be changed when you split a node.  It sounds pretty
> expensive.

In this case a log transaction is created containing all of the split
nodes as physical updates, and possibly some merged nodes from the free
tree.  B-tree splitting can always be committed atomically and
independently of any other activity on the filesystem, and it is nicely
bounded by the B-tree depth.

Allocations that do not require splitting the free tree are logged
logically, leaving the affected free tree blocks dirty in cache.  So the
allocation updates come "for free", being tucked into the same commit
block that governs the split B-tree leaves.

> Maybe implementing UNDOs for incidental B-Tree operations is the
> ticket.  Incidental meaning those operations (such as splits) which
> lead up to an insertion or deletion but do not actually perform the
> insertion or deletion.  On crash recovery you would undo partially
> completed B-Tree operations and then REDO the related logical
> operations.

I now see how UNDO works, thank you for your patient explanations.  The
point I overlooked is that fsync (and friends) is indeed the only
barrier you have to worry about.

Still, I think it is good to try to get as much as possible of what was
going on in a big burst of activity with no fsync durably onto disk.
Redo will clearly leave the filesystem in a consistent state less far
back in time than undo.  So my general strategy is to log big changes
like splits as "physical" full-block updates, and small ones like
allocating an extent as logical edit records in the commit blocks of
the physical updates.
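To make the physical/logical split concrete, here is a minimal sketch in Python of the commit layout described above; all of the names (PhysicalUpdate, LogicalEdit, Commit, replay) are invented for illustration, not part of any actual Tux3 code.  Split B-tree nodes are logged as full physical block images, while small allocation edits ride along as logical records in the same commit, and recovery replays the physical images first so each logical edit sees exactly the on-disk state it expects:

```python
from dataclasses import dataclass

@dataclass
class PhysicalUpdate:
    """Full image of a split or merged node, written to the log."""
    block_no: int
    data: bytes

@dataclass
class LogicalEdit:
    """Small edit record, e.g. 'allocate extent', tucked into the
    commit block instead of logging the whole affected block."""
    kind: str
    args: tuple

@dataclass
class Commit:
    physical: list  # PhysicalUpdate block images
    logical: list   # LogicalEdit records in the commit block

def replay(disk: dict, commit: Commit) -> None:
    """Crash recovery: apply physical images first, then logical
    edits, so the on-disk image of each block is what the logical
    edit record expects when it is replayed."""
    for upd in commit.physical:
        disk[upd.block_no] = upd.data
    for edit in commit.logical:
        if edit.kind == "alloc_extent":
            start, count = edit.args
            disk.setdefault("allocated", set()).update(
                range(start, start + count))

# A split touching two nodes, plus one extent allocation, commits
# atomically as a single log transaction:
commit = Commit(
    physical=[PhysicalUpdate(7, b"left"), PhysicalUpdate(8, b"right")],
    logical=[LogicalEdit("alloc_extent", (100, 4))],
)
disk = {}
replay(disk, commit)
```

The point of the shape is that the allocation edit costs nothing extra: it shares the commit block that already had to be written for the split.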
The complexity you noted above is due to making sure that the on-disk
image of a block is always what the logical edit record expects it to
be at the time the logical edit is replayed.

> Having to allocate new blocks for B-Tree deletions could create a big
> issue when the filesystem becomes full or near-full.

Yes, allocate-in-free is usually nasty.  I need to ensure that there is
always a reserve of blocks available on disk that is larger than the
maximum possible transaction that can be accepted.  This simple idea
can get really complex, I know.  One nice trick to simplify things a
little is to have a really generous limit on the maximum number of
dirty blocks allowed when the filesystem has lots of free space, and
shrink that limit progressively as free space approaches zero.

I now see what you were driving at with your point about interlocking
namespace transactions, etc.  While each VFS transaction can indeed be
committed on its own, atomically and independently, to do so would be
death for throughput.  So it is necessary to batch up a lot of these
transactions, and be mindful of the interdependencies between the VFS
transactions and the underlying data structures.  The rule is:
underlying data changes required by any given VFS transaction must
never lie in more than one commit.  This can be accomplished without
actually tracking the physical representation interdependencies between
VFS transactions.  Instead, just count the dirty cache blocks and send
out an atomic commit for the entire set of transactions when some
threshold is passed.  Now the challenge is to figure out how to avoid
stalling during the commit.

It is thanks to you, your searching questions and the example of Hammer
that I was forced to understand these issues clearly. :-)

Note!  In Linux, VFS means "Virtual Filesystem Switch", that is, the
security, synchronization and abstraction layer that exists above
filesystems in Linux.
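The two throttles described above can be sketched in a few lines; the constants and function names here are invented for illustration, and real code would tune them against the actual maximum transaction size.  The dirty-block limit is generous when space is plentiful and shrinks progressively toward zero as free space runs out, and a commit of the whole batched set fires when the dirty count crosses that limit:

```python
RESERVE = 64           # blocks held back so a maximum-size
                       # transaction can always complete on disk
GENEROUS_LIMIT = 1024  # dirty-block ceiling when space is plentiful

def dirty_limit(free_blocks: int) -> int:
    """Allowed dirty blocks: generous when the filesystem has lots of
    free space, shrinking progressively as free space approaches
    zero (never dipping into the reserve)."""
    usable = max(free_blocks - RESERVE, 0)
    return min(GENEROUS_LIMIT, usable // 2)

def maybe_commit(dirty_count: int, free_blocks: int) -> bool:
    """Batch many VFS transactions and commit the entire dirty set
    atomically once the threshold is passed, instead of paying a
    commit per transaction."""
    return dirty_count >= dirty_limit(free_blocks)
```

With plenty of free space the limit sits at the generous ceiling, so commits are large and infrequent; as the disk fills, the shrinking limit forces smaller, earlier commits, which keeps the reserve honest without tracking per-transaction interdependencies.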
I think it means "Virtual File System" to you, which is just
"filesystem" to Linux hackers.

Regards,

Daniel
