Here I describe two slightly different methods that Tux3 will use to implement atomic commit of data and metadata. Both methods combine logical and physical forward logging to perform atomic commits efficiently. The discussion below assumes we are updating a btree leaf, for example to make room for more inode data or to add pointers to a file index. The same techniques apply equally well to all structures in the filesystem.
1) The Update a Clone Method Instead of directly modifying a disk block corresponding to a btree leaf, we allocate a new block for the leaf and copy the contents of the original block to the new block, only in the buffer cache (no copy is performed on disk). We can now alter the new block at will and flush it out to disk without affecting the on-disk btree at all. But have not yet linked the new block into the btree. We could accomplish that by performing a similar clone update recursively up to the root of the tree, which creates a new tree root. The whole chain of new blocks would then be flushed out to disk and a pointer to the new root stored at some predictable location so it can be found later. This is the "phase tree" method that I invented for Tux2, and is commonly called "copy on write" these days. It could also be called the "sue me" method because Netapp likes to sue those such as Sun who implement it. Fortunately, there is a better way to do it that I invented recently and which I have never heard of anyone using before. We modify only the leaf node of the btree by cloning and record the pointer to the clone only in the cached btree index block, not on disk. To be able to reconstruct the cached version of the index node after a crash, we log a logical change record to the disk that says "write this leaf into that btree index node". We make whatever changes we need to the clone of the leaf node, then construct a two block transaction consisting of the clone and a commit block. The commit block points at the new leaf node and also carries the logical update record that describes how to modify the parent index node recorded on disk to point at the new leaf. If the commit block can be allocated immediately adjacent to the clone then we can construct a single bio to transfer the clone and the commit block to disk in one operation. Otherwise we construct two bios. If there is some previous commit transaction that has been staged but not yet committed, then we modify its commit block to point at the location where our new commit block will be stored. Otherwise we have to seek to some known location on the disk to store the location of the new commit. To be able to tell after a crash whether the commit block and its associated data made it to disk, we store a sequence number to distinguish this commit block from a possible similar commit that might have landed accidentally at the same location earlier, and store a hash of the two blocks in the commit block. Now the transaction is ready to go out to disk. But if the disk is busy with previous traffic (we hope it is) then we just stage the transaction for commit, and commit the predecessor transaction instead. This gives a chance for any following transaction to extend the forward log chain instead of having to start a new chain, which would need an extra seek and transfer. The original version of the leaf node on disk is not yet freeable, because it may be needed after a crash (together with the logical update record) to reconstruct the cached btree node image. Only after we "roll up" the logical update log entry by atomically updating the parent node may we free the original leaf block and the commit block that recorded the logical update. Such a rollup happen periodically, otherwise we would end up with an unreasonably long chain of logical updates needing to be replayed on remount to reconstruct the btree image in cache. 2) The Update in Place Method An alternate method of atomically updating our btree leaf node is to write it to disk twice: the first time along with a commit block that says we intend to overwrite the leaf node with the data part of the transaction, and the second write does the actual overwrite. If we crash before completing the second write, replay will load the image of the btree block from the commit transaction into cache and we are ready to continue where we left off. Like the clone method, the update in place method constructs a transaction, tries to link it into an existing forward log chain, or falls back to writing the commit block location to a known location if it has to, then stages the transaction for commit, committing the predecessor transaction in the forward chain if there is one. The overwrite transfer can be submitted as soon as the commit transaction has completed, which might consist of one or two bio transfers depending on whether it was possible to allocate the transaction contiguously or not. It is possible that the updated leaf block may be needed as the base for applying logical updates to reconstruct the cached version of the leaf as a consequence of some future transaction, but the future transaction must be able to rely on a known state of the leaf if it wants to log logical changes against it. Therefore the future transaction may have to wait on the update in place transaction to complete as well, to guarantee that the desired leaf state can be reconstructed. Any future update in place of the same block has to wait for the overwrite transfer to complete before submitting its own overwrite transfer, otherwise the second overwrite may race ahead of the first creating a stale disk block. The whole update in place transaction can be freed as soon as the overwrite transfer completes. Comparison of the methods Both methods are very efficient ways to achieve atomic commit. The update in place method has the advantage of not perturbing the original disk layout, possibly reducing fragmentation. The update clone method has the advantage of one less write operation. Tux3 will provide a mount option to specify which method to use. A related technique is atomic append. A whole range of data blocks can be written out along with a commit block in a single bio transfer, or a small number of bio transfers in one transaction, along with a logical record in the commit block to say which file the blocks are to be appended, in the form of a logical update in the commit block which will eventually result in the addition of one or more extents to the file index, update of the size and mtime, and removal of the new data blocks from the free map. Thus, a file write of considerable size can be atomically committed with a single bio transfer. Regards, Daniel
