Re: [Gluster-devel] A healing translator

Xavier Hernandez Tue, 22 May 2012 00:44:41 -0700

On 05/22/2012 02:11 AM, Anand Avati wrote:

On Tue, May 8, 2012 at 2:34 AM, Xavier Hernandez wrote:


    Hello developers,

    I would like to expose some ideas we are working on to create a
    new kind of translator that should be able to unify and simplify
    to some extent the healing procedures of complex translators.

    Currently, the only translator with complex healing capabilities
    that we are aware of is AFR. We are developing another translator
    that will also need healing capabilities, so we thought that it
    would be interesting to create a new translator able to handle the
    common part of the healing process and hence to simplify and avoid
    duplicated code in other translators.

    The basic idea of the new translator is to handle healing tasks
    nearer the storage translator on the server nodes instead of
    controlling everything from a translator on the client nodes. Of
    course, the heal translator is not able to handle healing entirely
    by itself; it needs a client translator which will coordinate all
    tasks. The heal translator is intended to be used by translators
    that work with multiple subvolumes.

    I will try to explain how it works without going into too much
    detail.

    There is an important requirement for all client translators that
    use healing: they must have exactly the same list of subvolumes
    and in the same order. Currently, I think this is not a problem.

    The heal translator treats each file as an independent entity, and
    each one can be in 3 modes:

    1. Normal mode

        This is the normal mode for a copy or fragment of a file when
        it is synchronized and consistent with the same file on other
        nodes (for example, with other replicas). It is the client
        translator that decides whether it is synchronized or not.

    2. Healing mode

        This is the mode used when a client detects an inconsistency
        in the copy or fragment of the file stored on this node and
        initiates the healing procedures.

    3. Provider mode (I don't much like this name, though)

        This is the mode used by client translators when an
        inconsistency is detected in this file, but the copy or
        fragment stored in this node is considered good and it will be
        used as a source to repair the contents of this file on other
        nodes.

    Initially, when a file is created, it is set in normal mode.
    Client translators that make changes must guarantee that they send
    the modification requests in the same order to all the servers.
    This should be done using inodelk/entrylk.

    When a change is sent to a server, the client must include a
    bitmap mask of the clients to which the request is being sent.
    Normally this is a bitmap containing all the clients; however,
    when a server fails for some reason, some bits will be cleared. The
    heal translator uses this bitmap to detect failures on other
    nodes early, from the point of view of each client. When this
    condition is detected, the request is aborted with an error and
    the client is notified with the remaining list of valid nodes. If
    the client considers that the request can be successfully served
    with the remaining list of nodes, it can resend the request with
    the updated bitmap.
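
As a rough illustration of the bitmap check described above (not code
from the actual translator), the server side could be sketched as
follows. The class name, the method name, and the rule that the server
keeps one "valid" mask and rejects any request naming a node already
known to have failed are all assumptions made for this example.

```python
class HealServerState:
    """Hypothetical per-file state kept by a server-side heal translator."""

    def __init__(self, num_subvolumes):
        # Assumption: one bit per subvolume, all considered valid at start.
        self.valid_mask = (1 << num_subvolumes) - 1

    def check_request(self, request_mask):
        """Return (accepted, valid_mask) for a change request."""
        # Bits the client still believes valid, but which another client
        # has already reported as failed.
        stale = request_mask & ~self.valid_mask
        if stale:
            # Abort and tell the client which nodes remain valid.
            return False, self.valid_mask
        # The client may have cleared bits: remember the failures it saw.
        self.valid_mask &= request_mask
        return True, self.valid_mask


state = HealServerState(3)
print(state.check_request(0b111))  # every node valid: accepted
print(state.check_request(0b011))  # client A lost node 2: accepted
print(state.check_request(0b111))  # client B is out of date: rejected
print(state.check_request(0b011))  # client B resends with updated bitmap
```

In this sketch the rejected client receives the current mask and decides
by itself whether the remaining nodes are enough to retry.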

    The heal translator also updates two file attributes for each
    change request to maintain the "version" of the data and metadata
    contents of the file. A similar task is currently done by AFR
    using xattrop. This would not be needed anymore, speeding up write
    requests.

    The version of data and metadata is returned to the client for
    each read request, allowing it to detect inconsistent data.
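
A minimal sketch of how a client might use the returned versions is
shown below. The message leaves the decision about good and bad nodes to
the client translator, so the policy used here (a node is good only if
it holds the highest data version and the highest metadata version) is
purely an assumption for a replication-like case, and the function names
are hypothetical.

```python
def is_consistent(versions):
    """versions maps node index -> (data_version, metadata_version)."""
    return len(set(versions.values())) <= 1


def split_good_bad(versions):
    """Naive policy (assumption): good nodes hold both highest counters."""
    best_data = max(data for data, _ in versions.values())
    best_meta = max(meta for _, meta in versions.values())
    good = sorted(node for node, v in versions.items()
                  if v == (best_data, best_meta))
    bad = sorted(node for node in versions if node not in good)
    return good, bad


answers = {0: (5, 2), 1: (5, 2), 2: (4, 2)}
if not is_consistent(answers):
    good, bad = split_good_bad(answers)
    print("good:", good, "bad:", bad)
```

If no node holds both highest counters, the good list is empty and the
client cannot pick a source under this simple policy.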

    When a client detects an inconsistency, it initiates healing.
    First of all, it must lock the entry and inode (when necessary).
    Then, from the data collected from each node, it must decide which
    nodes have good data and which ones have bad data and hence need
    to be healed. There are two possible cases:

    1. File is not a regular file

        In this case the reconstruction is very fast and requires few
        requests, so it is done while the file is locked. In this
        case, the heal translator does nothing relevant.

    2. File is a regular file

        For regular files, the first step is to synchronize the
        metadata to the bad nodes, including the version information.
        Once this is done, the file is set in healing mode on bad
        nodes, and provider mode on good nodes. Then the entry and
        inode are unlocked.

    When a file is in provider mode, it works as in normal mode, but
    refuses to start another healing. Only one client can be healing a
    file.
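
The three modes and the "only one healer" rule can be pictured as a
small state machine. This is only a sketch: the names are invented for
the example, and the return to normal mode at the end of a heal is an
assumption, since the message does not describe that step.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = 1
    HEALING = 2
    PROVIDER = 3


class HealFile:
    """Hypothetical per-file mode tracking on one server node."""

    def __init__(self):
        self.mode = Mode.NORMAL  # files start in normal mode

    def begin_heal(self, is_source):
        # A file already in healing or provider mode refuses a new heal.
        if self.mode is not Mode.NORMAL:
            return False
        self.mode = Mode.PROVIDER if is_source else Mode.HEALING
        return True

    def end_heal(self):
        # Assumption: the file goes back to normal mode once repaired.
        self.mode = Mode.NORMAL


good_copy, bad_copy = HealFile(), HealFile()
print(good_copy.begin_heal(is_source=True))   # True, now PROVIDER
print(bad_copy.begin_heal(is_source=False))   # True, now HEALING
print(good_copy.begin_heal(is_source=True))   # False, heal in progress
```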

    When a file is in healing mode, each normal write request from any
    client is handled as if the file were in normal mode, updating
    the version information and detecting possible inconsistencies
    with the bitmap. Additionally, the healing translator marks the
    written region of the file as "good".

    Each write request from the healing client intended to repair the
    file must be marked with a special flag. In this case, the area
    to be written is filtered by the list of "good" ranges (any
    intersection with a good range is removed from the request). The
    resulting set of ranges is propagated to the lower translator and
    added to the list of "good" ranges, but the version information is
    not updated.

    Read requests are only served if the requested range is entirely
    contained in the list of "good" regions.
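
The handling of "good" ranges for a file in healing mode could look
roughly like the sketch below. It assumes byte ranges are kept as a
sorted list of non-overlapping half-open intervals; the class and method
names are hypothetical, and the real translator may track regions quite
differently.

```python
class GoodRanges:
    """Sorted, non-overlapping half-open [start, end) byte ranges."""

    def __init__(self):
        self.ranges = []

    def add(self, start, end):
        """Mark [start, end) as good, merging overlapping/adjacent ranges."""
        merged = []
        for s, e in self.ranges:
            if e < start or s > end:
                merged.append((s, e))
            else:
                start, end = min(s, start), max(e, end)
        merged.append((start, end))
        self.ranges = sorted(merged)

    def missing(self, start, end):
        """Return the parts of [start, end) that are not yet good."""
        pieces, pos = [], start
        for s, e in self.ranges:
            if e <= pos:
                continue
            if s >= end:
                break
            if s > pos:
                pieces.append((pos, s))
            pos = max(pos, e)
        if pos < end:
            pieces.append((pos, end))
        return pieces

    def contains(self, start, end):
        """A read is served only if one good range covers it entirely."""
        return any(s <= start and end <= e for s, e in self.ranges)


def heal_write(good, start, end):
    """Flagged write from the healer: only not-yet-good pieces go down."""
    pieces = good.missing(start, end)
    for s, e in pieces:
        good.add(s, e)
    return pieces  # ranges that would be sent to the lower translator


good = GoodRanges()
good.add(10, 20)                 # normal write from some client
print(heal_write(good, 5, 25))   # healer skips the already good part
print(good.contains(6, 24))      # this read can now be served
```

Normal writes call add() directly, so newer client data is never
overwritten by older data coming from the healing client.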

    There are some additional details, but I think this is enough to
    have a general idea of its purpose and how it works.

    The main advantages of this translator are:

    1. Avoid duplicated code in client translators
    2. Simplify and unify healing methods in client translators
    3. xattrop is not needed anymore in client translators to keep
    track of changes
    4. Full file contents are repaired without locking the file
    5. Better detection and prevention of some split brain situations
    as soon as possible

    I think it would be very useful. It seems to me that it works
    correctly in all situations; however, I don't have all the
    experience that other developers have with the healing functions
    of AFR, so I will be happy to answer any questions and to hear any
    suggestions to solve problems it may have or to improve it.

    What do you think about it?

The goals you state above are all valid. What would really help (adoption) is if you can implement this as a modification of AFR by utilizing all the work already done, and you get brownie points if it is backward compatible with existing AFR. If you already have any code in a publishable state, please share it with us (github link?).


Avati

I've tried to understand how AFR works and, in some way, some of the ideas have been taken from it. However, it is very complex and a lot of changes have been made in the master branch over the last few months. It's hard for me to follow them while actively working on my translator. Nevertheless, the main reason to take a separate path was that AFR is strongly bound to replication (at least from what I saw when I analyzed it more deeply. Maybe things have changed now, but I haven't had time to review them).

The requirements for my translator didn't fit very well with AFR, and the effort needed to understand and modify it to adapt it was too high. It also seems that there isn't any detailed developer documentation about the internals of AFR that could have made me more confident about modifying it (at least I haven't found it).

I'm currently working on it, but it's not ready yet. As soon as it is in a minimally stable state we will publish it, probably on github. I'll post the URL to this list.


Thank you

_______________________________________________
Gluster-devel mailing list
[email protected]
https://lists.nongnu.org/mailman/listinfo/gluster-devel
