On Tue, 2013-12-17 at 16:53 +0000, David Howells wrote:
> It has occurred to me and others that something like BTRFS could be a
> good fit to build an AFS fileserver directly on top of.  The question
> is what facilities would be needed from BTRFS to make this work?
>
> So I thought I'd kick off a shopping list;-)
>
>  (1) 64-bit data version numbers that increase monotonically with
>      each write.
>
>      Yes, this is likely to cause some performance degradation as it
>      introduces an ordering over data writes and metadata writes to a
>      file.  Maybe writes can be batched to improve performance?

Yes.  You need a distinct version number for each version of the file
that is visible to any client, but intermediate versions never seen by
any client do not need separate versions.  However, note that whenever
a client does a successful write, it modifies the file locally and
assigns it the next version, so each RPC must result in a new version
of the file.  There are some other complications here, but it's
probably not impossible to design a filesystem-provided version which
can be used as the AFS data version.

>  (2) Storage for ACLs and AFS UIDs.  Having shareable ACLs might also
>      be useful.
>
>      Xattrs would likely do for this.

I'm a bit confused about whether you're talking about btrfs as a
storage backend for, say, OpenAFS, or btrfs as the complete on-disk
volume representation.  In particular, OpenAFS storage backends needn't
provide storage for AFS-level vnode metadata, because that is stored in
the vnode indices.  They really only need to provide storage for the
few pieces of "key" data (volume/vnode/uniq) that bind a storage-layer
inode to the corresponding AFS-layer vnode, plus the DV.

On the other hand, if you're looking to provide the complete on-disk
representation, then you need to be able to "name" inodes by
(volume/vnode/uniq) instead of by a filename.  The fileserver needs to
be able to specify those properties, instead of a name, when creating
an inode (unless you're doing directory management), and it needs to be
able to look up inodes by that same tuple, efficiently, even if you
_are_ doing directory management.

Also bear in mind that the OpenAFS fileserver's current on-disk
directory representation is also the on-the-wire representation, so
even if you store directories in some other way, it must be possible to
produce the AFS protocol version efficiently.
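
To make the (volume/vnode/uniq) naming and the one-new-version-per-RPC
rule a bit more concrete, here is a rough sketch in Python.  Everything
in it (InodeTable, store_data, the field names) is made up for
illustration; it is not OpenAFS or btrfs code, and an in-memory dict
merely stands in for whatever index the filesystem would provide.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical key type: (volume ID, vnode number, uniquifier).
Key = Tuple[int, int, int]


@dataclass
class Inode:
    # Identity is supplied by the fileserver at creation time.
    volume: int
    vnode: int
    uniq: int
    data_version: int = 0   # the DV, bumped once per successful write RPC
    data: bytes = b""


class InodeTable:
    """Toy stand-in for a filesystem that names inodes by tuple."""

    def __init__(self) -> None:
        self._by_key: Dict[Key, Inode] = {}

    def create(self, volume: int, vnode: int, uniq: int) -> Inode:
        # The caller specifies the tuple instead of a filename.
        key = (volume, vnode, uniq)
        if key in self._by_key:
            raise FileExistsError(key)
        inode = Inode(volume, vnode, uniq)
        self._by_key[key] = inode
        return inode

    def lookup(self, volume: int, vnode: int, uniq: int) -> Inode:
        # Direct lookup by tuple; no path walk is involved.
        return self._by_key[(volume, vnode, uniq)]

    def store_data(self, volume: int, vnode: int, uniq: int,
                   offset: int, payload: bytes) -> int:
        """Model one write RPC: apply the payload, then bump the DV once.

        However many low-level writes the payload turns into, clients
        only ever see the version that exists after the RPC completes.
        """
        inode = self.lookup(volume, vnode, uniq)
        buf = bytearray(inode.data)
        if len(buf) < offset:
            # Zero-fill any gap between the old end of file and offset.
            buf.extend(b"\0" * (offset - len(buf)))
        buf[offset:offset + len(payload)] = payload
        inode.data = bytes(buf)
        inode.data_version += 1
        return inode.data_version


table = InodeTable()
table.create(536870912, 2, 7)
dv1 = table.store_data(536870912, 2, 7, 0, b"hello")
dv2 = table.store_data(536870912, 2, 7, 3, b"XYZ")
```

The point of the sketch is only that the DV moves once per RPC, and
that the tuple, not a filename, is what finds the inode.
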

Also, if the fileserver is managing directories and/or volume cloning,
it needs to be able to manipulate the link count on inodes, including
having inodes be automatically deleted when the fileserver decrements
the link count to zero.

>  (4) A 32-bit vnode number and 32-bit vnode uniquifier/generation
>      number.
>
>      These don't necessarily have to be stored by BTRFS directly but
>      could instead be in a separate database file that gets
>      snapshotted also.

It turns out to make integrity checking and recovery a lot easier if
the volume/parent ID, vnode number, uniquifier, and data version are
part of the filesystem metadata, rather than being stored in a separate
file.  This ensures that they don't get separated in some bad way
during filesystem repair, which would make it difficult or impossible
to match storage-filesystem-layer inodes with the corresponding
AFS-layer vnodes.  We live with it today, because we have little choice
with modern filesystems that don't give us any place to store metadata,
but if a filesystem is going to be specifically designed to serve as a
more efficient/reliable AFS storage backend, it should store this sort
of thing in the filesystem metadata.

>  (5) The ability to set the vnode number, vnode uniquifier and data
>      version number to specific values.  Necessary to clone volumes
>      and restore volume dumps.

Well, if the filesystem is going to handle volume snapshotting, then
you don't need to do clones, and vice versa.  However, yes, to handle
restores and some other operations, you need to be able to set the
volume ID, vnode number, and uniquifier of an inode -- but only when
_creating_ the inode; these properties are immutable once an inode is
created.

-- Jeff

_______________________________________________
OpenAFS-devel mailing list
OpenAFS-devel@openafs.org
https://lists.openafs.org/mailman/listinfo/openafs-devel