On Tue, May 8, 2012 at 2:34 AM, Xavier Hernandez <[email protected]> wrote:
> Hello developers,
>
> I would like to present some ideas we are working on to create a new kind
> of translator that should be able to unify and simplify, to some extent,
> the healing procedures of complex translators.
>
> Currently, the only translator with complex healing capabilities that we
> are aware of is AFR. We are developing another translator that will also
> need healing capabilities, so we thought that it would be interesting to
> create a new translator able to handle the common part of the healing
> process, and hence to simplify other translators and avoid duplicated
> code in them.
>
> The basic idea of the new translator is to handle healing tasks nearer the
> storage translator on the server nodes instead of controlling everything
> from a translator on the client nodes. Of course, the heal translator is
> not able to handle healing entirely by itself; it needs a client
> translator which will coordinate all tasks. The heal translator is
> intended to be used by translators that work with multiple subvolumes.
>
> I will try to explain how it works without going into too much detail.
>
> There is an important requirement for all client translators that use
> healing: they must have exactly the same list of subvolumes, in the same
> order. Currently, I think this is not a problem.
>
> The heal translator treats each file as an independent entity, and each
> one can be in one of 3 modes:
>
> 1. Normal mode
>
> This is the normal mode for a copy or fragment of a file when it is
> synchronized and consistent with the same file on other nodes (for
> example, with other replicas; it is the client translator that decides
> whether it is synchronized or not).
>
> 2. Healing mode
>
> This is the mode used when a client detects an inconsistency in the copy
> or fragment of the file stored on this node and initiates the healing
> procedures.
>
> 3. Provider mode (I don't like this name very much, though)
>
> This is the mode used by client translators when an inconsistency is
> detected in this file, but the copy or fragment stored on this node is
> considered good and will be used as a source to repair the contents of
> this file on other nodes.
>
> Initially, when a file is created, it is set to normal mode. Client
> translators that make changes must guarantee that they send the
> modification requests in the same order to all the servers. This should be
> done using inodelk/entrylk.
>
> When a change is sent to a server, the client must include a bitmap mask
> of the clients to which the request is being sent. Normally this is a
> bitmap containing all the clients; however, when a server fails for some
> reason, some bits will be cleared. The heal translator uses this bitmap to
> detect failures on other nodes early, from the point of view of each
> client. When this condition is detected, the request is aborted with an
> error and the client is notified with the remaining list of valid nodes.
> If the client considers that the request can be successfully served with
> the remaining list of nodes, it can resend the request with the updated
> bitmap.
>
> The heal translator also updates two file attributes for each change
> request to maintain the "version" of the data and metadata contents of the
> file. A similar task is currently done by AFR using xattrop. This would
> not be needed anymore, speeding up write requests.
>
> The version of data and metadata is returned to the client for each read
> request, allowing it to detect inconsistent data.
>
> When a client detects an inconsistency, it initiates healing. First of
> all, it must lock the entry and inode (when necessary). Then, from the data
> collected from each node, it must decide which nodes have good data and
> which ones have bad data and hence need to be healed. There are two
> possible cases:
>
> 1. File is not a regular file
>
> In this case the reconstruction is very fast and requires few requests, so
> it is done while the file is locked. In this case, the heal translator does
> nothing relevant.
>
> 2. File is a regular file
>
> For regular files, the first step is to synchronize the metadata to the
> bad nodes, including the version information. Once this is done, the file
> is set to healing mode on bad nodes and to provider mode on good nodes.
> Then the entry and inode are unlocked.
>
> When a file is in provider mode, it works as in normal mode, but refuses
> to start another healing. Only one client can be healing a file.
>
> When a file is in healing mode, each normal write request from any client
> is handled as if the file were in normal mode, updating the version
> information and detecting possible inconsistencies with the bitmap.
> Additionally, the healing translator marks the written region of the file
> as "good".
>
> Each write request from the healing client intended to repair the file
> must be marked with a special flag. In this case, the area to be written
> is filtered by the list of "good" ranges (if there is any intersection
> with a good range, it is removed from the request). The resulting set of
> ranges is propagated to the lower translator and added to the list of
> "good" ranges, but the version information is not updated.
>
> Read requests are only served if the requested range is entirely contained
> in the "good" regions list.
>
> There are some additional details, but I think this is enough to give a
> general idea of its purpose and how it works.
>
> The main advantages of this translator are:
>
> 1. Avoid duplicated code in client translators
> 2. Simplify and unify healing methods in client translators
> 3. xattrop is not needed anymore in client translators to keep track of
> changes
> 4. Full file contents are repaired without locking the file
> 5. Better detection and prevention of some split brain situations as soon
> as possible
>
> I think it would be very useful. It seems to me that it works correctly in
> all situations; however, I don't have all the experience that other
> developers have with the healing functions of AFR, so I will be happy to
> answer any questions and to hear suggestions for solving problems it may
> have or for improving it.
>
> What do you think about it?
>

The goals you state above are all valid. What would really help (adoption)
is if you can implement this as a modification of AFR by utilizing all the
work already done, and you get brownie points if it is backward compatible
with existing AFR. If you already have any code in a publishable state,
please share it with us (GitHub link?).

Avati
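
To make the "good" ranges bookkeeping from the quoted proposal concrete, here is a minimal sketch in Python. It is not GlusterFS code and not taken from the proposal: the names GoodRanges and handle_write, the half-open [start, end) byte ranges, and the returned bump_version flag are all assumptions made for illustration. It only models how a file in healing mode might treat normal writes, flagged heal writes, and reads.

```python
class GoodRanges:
    """Sorted, non-overlapping half-open byte ranges [start, end) marked "good".

    Hypothetical helper; the proposal does not specify a data structure.
    """

    def __init__(self):
        self.ranges = []  # list of (start, end) tuples, sorted by start

    def add(self, start, end):
        """Mark [start, end) as good, merging overlapping or adjacent ranges."""
        if start >= end:
            return
        merged = []
        for s, e in self.ranges:
            if e < start or s > end:
                merged.append((s, e))  # disjoint and not adjacent: keep as-is
            else:
                start, end = min(start, s), max(end, e)  # absorb into new range
        merged.append((start, end))
        merged.sort()
        self.ranges = merged

    def missing(self, start, end):
        """Return the parts of [start, end) that are not yet marked good."""
        gaps = []
        cursor = start
        for s, e in self.ranges:
            if e <= cursor:
                continue  # range lies entirely before the cursor
            if s >= end:
                break  # range lies entirely after the request
            if s > cursor:
                gaps.append((cursor, s))
            cursor = max(cursor, e)
        if cursor < end:
            gaps.append((cursor, end))
        return gaps

    def contains(self, start, end):
        """True if [start, end) is entirely covered by good ranges."""
        return not self.missing(start, end)


def handle_write(good, offset, length, is_heal_write):
    """Model a write to a file in healing mode.

    Returns (ranges_to_write, bump_version):
    - a normal write is passed down whole and updates the version;
    - a heal write (the "special flag") is first filtered against the good
      ranges and does not update the version.
    Either way, whatever is written becomes good.
    """
    start, end = offset, offset + length
    if is_heal_write:
        to_write = good.missing(start, end)
    else:
        to_write = [(start, end)]
    for s, e in to_write:
        good.add(s, e)
    return to_write, not is_heal_write
```

For example, if a normal write covers bytes 100 to 150 and the healing client then sends a heal write for bytes 0 to 200, only the ranges 0 to 100 and 150 to 200 would be passed down, so the newer data written by the normal write is not overwritten by the repair. A read would then be served only if contains() is true for the requested range.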
_______________________________________________ Gluster-devel mailing list [email protected] https://lists.nongnu.org/mailman/listinfo/gluster-devel
