Hi All,

Please find the updated bit-rot design for glusterfs volumes.
Thanks to Vijay Bellur for his valuable inputs on the design.

Phase 1: File level bit rot detection

The initial approach is to achieve bit rot detection at file level, where a checksum is computed for the complete file and checked during access.

A single daemon (say BitD) per node will be responsible for all the bricks of that node. This daemon will be registered with the gluster management daemon, and any graph changes (add-brick/remove-brick/replace-brick/stop bit-rot) will be handled accordingly. BitD will register with the changelog xlator of every brick on the node and process changes from them.

The changelog xlator would give the list of files (in terms of gfid) which have changed during a defined interval. Checksums would have to be computed for these, triggered either by the fd close() call for non-NFS access, or by every write for anonymous fd access (NFS). The computed checksum, along with the timestamp of the computation, would be saved as an extended attribute (xattr) of the file. By using the changelog xlator, we avoid periodic scans of the bricks to identify the files whose checksums need to be updated.

Upon access (open for non-anonymous-fd calls, every read for anonymous-fd calls) from any client, the bit rot detection xlator loaded on top of the bricks would recompute the checksum of the file, and allow the call to proceed if the checksums match, or fail it if they mismatch. This introduces extra work for NFS workloads, and for large files, which require a read of the complete file to recompute the checksum (we try to solve this in phase 2).

Since a data write happens first, followed by a delayed checksum compute, there is a time frame where we might have data updated but checksums yet to be computed. We should allow access to such files if the file timestamp (mtime) has changed and is within a defined range from the current time.
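The access check described above could be sketched roughly as follows. This is only an illustration, not the proposed xlator code: SHA-256 is an assumed checksum (the design does not name one), an in-memory dict stands in for the per-file xattr, and GRACE_SECONDS, sign() and check_access() are hypothetical names. An unsigned file is treated like a file with a pending checksum, which is also an assumption.

```python
import hashlib
import os
import time

# Hypothetical value; the design only says "a defined range".
GRACE_SECONDS = 120

# Stand-in for the xattr: path -> (hex digest, time of computation).
signatures = {}


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the whole file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sign(path, now=None):
    """Roughly what BitD would do for a changed file: record checksum and timestamp."""
    now = time.time() if now is None else now
    signatures[path] = (file_checksum(path), now)


def check_access(path, now=None):
    """Return True if the access may proceed, False if it should be failed."""
    now = time.time() if now is None else now
    mtime = os.stat(path).st_mtime
    recorded = signatures.get(path)
    if recorded is None:
        # Never signed (assumption): allow only while the write is recent.
        return now - mtime <= GRACE_SECONDS
    checksum, signed_at = recorded
    if file_checksum(path) == checksum:
        return True
    # Mismatch: tolerate it only if the file was modified after the last
    # signing and that modification is still inside the grace window.
    return mtime > signed_at and now - mtime <= GRACE_SECONDS
```

A mismatch with an unchanged mtime is reported as bit rot, while a mismatch shortly after a legitimate write is let through until the checksum has been recomputed.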
Additionally, we could/should have the ability to switch off checksum computation from the glusterfs perspective if the underlying FS exposes/implements bit-rot detection (btrfs).

Phase 2: Block-level (user space/defined) bit rot detection and correction

The eventual aim is to be able to heal/correct bit rot in files. To achieve this, we would compute checksums at a finer-grained level such as a block (size limited by the bit rot algorithm), so that we not only detect bit rot, but also have the ability to repair it. Additionally, for large files, checking the checksum at block level is more efficient than recomputing the checksum of the whole file for an access.

In this phase, we could move the checksum computation to the xlator loaded on top of the posix translator at each brick. With every write, we could compute and store the checksum and then continue with the write, or vice versa. Every access would also read the requested block, compute its checksum, check it against the saved checksum of the block, and act accordingly. This would take away the dependency on the external BitD and the changelog xlator.

Additionally, using an error-correcting code (ECC) or forward error correction (FEC) algorithm would enable us to correct a few bits in a block which have gone corrupt. Computing the checksum of the complete file is also eliminated, as we are dealing with blocks of a defined size.

We require the ability to store these fine-grained checksums efficiently, and extended attributes would not scale for this implementation. Either a custom backend store or a DB would be preferable in this instance.

Please feel free to comment/critique.

With regards,
Shishir

_______________________________________________
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel