Rajith Attapattu wrote:
This is a very educational thread; maybe Jules can incorporate some of
the ideas into your document on clustering.
I do have a few questions on Guglielmo's points on session
replication. (Not an expert, so please bear with my questions.)
Rajith, Guglielmo, these are good questions, so I hope you don't mind if
I answer some of them for WADI...
> 1. The user should configure a minimum-degree-of-replication R. This is
> the number of replicas of a specific session which need to be available
> in order for an HTTP request to be serviced.
1.) How do you figure out the most efficient value for R?
I assume that when R increases, network chatter increases by some factor
X, and that X depends on whether it's a multicast protocol or 1->1
messaging (first of all, is this assumption correct?).
Since WADI is using 1->1 messaging, increasing the number of replicants
will increase the number of messages accordingly. I expect that 1 or at
most 2 backup copies will be sufficient for most sites.
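To make the scaling concrete - a minimal sketch, assuming a
point-to-point Channel abstraction (the names are invented for
illustration, not WADI's actual API), where each session update costs
one unicast message per backup:

    // Illustrative only: 'Channel' and 'Node' are invented stand-ins,
    // not WADI's actual API.
    interface Channel {
        void send(Node target, byte[] payload); // one unicast message
    }

    class Node { /* placeholder for a cluster member */ }

    class Replicator {
        private final Channel channel;
        private final java.util.List<Node> backups; // the R backup nodes

        Replicator(Channel channel, java.util.List<Node> backups) {
            this.channel = channel;
            this.backups = backups;
        }

        void replicate(byte[] serialisedSession) {
            // R backups => R unicast messages per session update.
            for (Node backup : backups) {
                channel.send(backup, serialisedSession);
            }
        }
    }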
And when R decreases, the chance of a request hitting a server where the
session is not replicated is higher.
This is not something that WADI really considers a significant saving
(see my last posting's explanation of why you only want one 'active'
copy of a session). WADI will keep session backups serialised, to avoid
constantly expending resources deserialising session backups that may
never be accessed. Actually, you could consider that WADI will do a lazy
deserialisation in the case that you have outlined, as primary and
secondary copies will swap roles, with the attendant
serialisation/passivation and deserialisation/activation coordinated by
messages.
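A minimal sketch of that lazy approach, assuming the backup is held as
raw bytes and only deserialised on promotion (invented names; the real
WADI classes differ):

    import java.io.ByteArrayInputStream;
    import java.io.ObjectInputStream;

    // Invented names; the real WADI classes differ.
    class SessionBackup {
        private final byte[] serialisedState; // cheap to hold, never parsed

        SessionBackup(byte[] serialisedState) {
            this.serialisedState = serialisedState;
        }

        // Called only on a primary/secondary role swap: deserialise
        // ('activate') the state now that this node must service requests.
        Object promoteToPrimary() throws Exception {
            ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(serialisedState));
            try {
                return in.readObject();
            } finally {
                in.close();
            }
        }
    }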
If you are running a reasonably sized cluster (e.g. 30 nodes - it's all
relative) with a small number of backups configured (e.g. 1), then, in
the case of a session affinity breakdown (due to the primary's node
leaving), you have a 1/30 chance that the request will hit the primary,
a 1/30 chance that it will hit the secondary and a 28/30 chance that it
will miss :-) So, you are right :-)
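Generalising the arithmetic: with N nodes and B backups, a request that
has lost affinity lands on some copy with probability (1 + B) / N. In
illustrative Java:

    // Illustrative arithmetic only.
    class AffinityMissOdds {
        static double hitProbability(int nodes, int backups) {
            return (1.0 + backups) / nodes; // 1 primary + B backups out of N
        }

        public static void main(String[] args) {
            // 30 nodes, 1 backup: 2/30, i.e. roughly a 93% miss rate.
            System.out.println(hitProbability(30, 1));
        }
    }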
So the sweet spot is a balance between the above two factors? Or have I
missed any other critical factor(s)?
In WADI, I don't really see this as a sweet spot. If you are running
with session affinity, then requests should rarely miss their primary.
If, in exceptional circumstances, they do, and you have to arrange a
role swap between primary and secondary, or a migration, you are still
looking at a cost. If, however, you did your deserialisation of
replicants up front and thus avoided further messages when a secondary
was hit, by maintaining all copies 'active' (I think you would not be
spec-compliant if you did this), then you would find a sweet spot, but
only because you had paid up front with a lot of deserialisation to
create it. Furthermore, in creating this sweet spot, you are
constraining yourself (although not strictly) in terms of cluster size,
because you don't want many more nodes than session copies.... So I
don't see any advantage in such a 'sweet spot'. I would rather be
spec-compliant, pay the deserialisation cost lazily and not worry about
using as many nodes as I like :-).
I thought long and hard about whether I should allow multiple 'active'
copies of a session and my decision to disallow this has had a large
impact on WADI's design.
I am still happy that it is the right decision for HttpSessions, but
there may be some slack here that SFSBs could exploit. I need to
investigate.
2.) When you say minimum-degree-of-replication, it implies to me a
floor? Is there also a ceiling value, like a
maximum-degree-of-replication? I guess we don't want the number of
replicas to grow beyond a point.
> 2. When an HTTP request arrives, if the cluster which received it does
> not have R copies then it blocks (it waits until there are). This
> should be acceptable in data centers, because partitions are likely to
> be very short-lived (aka virtual partitions, which are due to
> congestion, not to any hardware issue).
1) Can you please elaborate a bit more on this? I didn't really
understand it. When you said "wait until", does it mean
a) wait till there are R replicas in the cluster?
b) or until the session is replicated within the server where the HTTP
request was received?
2) When you said virtual partition, did you mean a subset of nodes
being isolated due to congestion? By isolation I mean that they have
not been able to replicate their sessions, or receive replicated
sessions from nodes outside of the subset, due to congestion. Is this
correct?
3) Assuming an HTTP request arrives and the cluster does not have R
copies, how different is this situation from "an HTTP request arrives
but there is no replica of the session on that server"?
> 3. If at any time an HTTP request reaches a server which does not
> itself have a replica of the session, it sends a client redirect to a
> node which does.
How can this be achieved? Is it by having a central coordinator that
handles a mapping, or is this information replicated on all nodes in
the entire cluster?
information == "which nodes have replicas of each session"
The point below gave me the impression that some sort of inventory has
to be maintained centrally or cluster-wide (ideally the latter, in case
the controller dies).
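For concreteness, a minimal sketch of what point 3 might look like,
assuming some cluster-wide map from session id to the nodes holding
replicas - how that map is maintained (central coordinator vs.
replicated everywhere) is exactly the open question above:

    import java.io.IOException;
    import java.util.List;
    import java.util.Map;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // 'sessionLocations' is the assumed session-id -> replica-holders map;
    // maintaining it is the open question.
    class RedirectingHandler {
        private final String localNode;
        private final Map<String, List<String>> sessionLocations;

        RedirectingHandler(String localNode,
                           Map<String, List<String>> sessionLocations) {
            this.localNode = localNode;
            this.sessionLocations = sessionLocations;
        }

        // Returns true if the client was redirected to a replica holder.
        boolean redirectIfRemote(String sessionId, HttpServletRequest req,
                                 HttpServletResponse res) throws IOException {
            List<String> holders = sessionLocations.get(sessionId);
            if (holders == null || holders.contains(localNode)) {
                return false; // new session, or we hold a replica: serve here
            }
            // Point 3: client redirect to a node which does have a replica.
            res.sendRedirect("http://" + holders.get(0) + req.getRequestURI());
            return true;
        }
    }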
> 4. When a new cluster is formed (with nodes coming or going), it takes
> an inventory of all the sessions and their version numbers. Sessions
> which do not have the necessary degree of replication need to be
> fixed, which will require some state transfer, and possibly migration
> of some sessions for proper load balancing.
Again, how does the replication healing/shedding work? Assuming nodes
die, or come back carrying their state, how does the cluster decide on
adding or removing session copies to maintain the optimal R value?
Where does the brain/logic for this sit? Ideally it is distributed, in
case the controller dies.
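A hedged sketch of what that healing step might look like, assuming a
post-membership-change inventory and a hypothetical state-transfer
service (all names invented for illustration):

    import java.util.List;
    import java.util.Map;

    // All names invented for illustration.
    interface StateTransferService {
        void copySession(String sessionId, String fromNode, String toNode);
    }

    class ReplicaHealer {
        private final int minReplicas; // the configured R

        ReplicaHealer(int minReplicas) {
            this.minReplicas = minReplicas;
        }

        // inventory: sessionId -> nodes still holding a copy after the change
        void heal(Map<String, List<String>> inventory, List<String> liveNodes,
                  StateTransferService transfer) {
            for (Map.Entry<String, List<String>> entry : inventory.entrySet()) {
                List<String> holders = entry.getValue();
                holders.retainAll(liveNodes);   // discard copies on dead nodes
                if (holders.isEmpty()) {
                    continue;                   // every copy lost: cannot heal
                }
                for (String candidate : liveNodes) {
                    if (holders.size() >= minReplicas) {
                        break;                  // R restored for this session
                    }
                    if (!holders.contains(candidate)) {
                        transfer.copySession(entry.getKey(), holders.get(0),
                                             candidate);
                        holders.add(candidate);
                    }
                }
            }
        }
    }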
General comments/questions
-------------------------------------------
1. How much do the current implementations, like WADI, ActiveCluster
and ActiveSpace, address the above concerns?
WADI does not currently implement a solution, but I have the one that I
described in mind.
I think (but am open to correction) that AC is happy to leave this
issue to its consumer.
James can answer for ActiveSpace, and maybe correct me on AC?
Jules
2.) What aspects of the above concerns can be addressed better with
totem than with other protocols?
3. Can a SEDA-like architecture solve the problem of deciding the value
of R dynamically at runtime, from time to time, based on load and
network latency? I guess the network latency can be measured with some
metrics around token passing or something like that.
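Purely speculative, but "deciding R at runtime" might look something
like this, assuming some measured replication latency is available
(thresholds and names invented):

    // Thresholds and names invented; purely a thought experiment.
    class AdaptiveReplicationDegree {
        private volatile int r;
        private final int floor;   // minimum-degree-of-replication
        private final int ceiling; // a maximum, per question 2 above

        AdaptiveReplicationDegree(int floor, int ceiling) {
            this.floor = floor;
            this.ceiling = ceiling;
            this.r = floor;
        }

        // Fed by whatever latency metric the group protocol exposes,
        // e.g. token round-trip time.
        void onLatencySample(double millis) {
            if (millis > 50.0 && r > floor) {
                r--;                           // congested: shed a backup
            } else if (millis < 5.0 && r < ceiling) {
                r++;                           // quiet: afford another backup
            }
        }

        int currentR() { return r; }
    }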
Answers are greatly appreciated.
Regards,
Rajith.
On 1/16/06, lichtner <[EMAIL PROTECTED]> wrote:
On Mon, 16 Jan 2006, Jules Gosnell wrote:
> REMOVE_NODE is when a node leaves cleanly, FAILED_NODE when a node
> dies ...
I figured. I imagine that if I had to add this distinction to totem I
would add a message where the node in question announces that it is
leaving, and then stops forwarding the token. On the other hand, it
does not need to announce anything, and the other nodes will detect
that it left. In fact totem does not judge a node either way: you can
leave because you want to or under duress, and the consequences, as far
as distributed algorithms are concerned, are probably minimal. I think
the only place where this might matter is for logging purposes (but
that could be handled at the application level) or to speed up the
membership protocol, although it's already pretty fast.
So I would not draw a distinction there.
> By also treating nodes joining, leaving and dying as split and merge
> operations I can reduce the number of cases that I have to deal with.
I would even add that the difference is known only to the application.
> and ensure that what might be very uncommonly run code (run on network
> partition/healing) is the same code that is commonly run on e.g. node
> join/leave - so it is likely to be more robust.
Sounds good.
> In the case of a binary split, I envisage two sets of nodes losing
> contact with each other. Each cluster fragment will repair its
> internal structure. I expect that after this repair, neither fragment
> will carry a complete copy of the cluster's original state (unless we
> are replicating 1->all, which WADI will usually not do); rather, the
> two datasets will intersect and their union will be the original
> dataset.
> Replicated state will carry a version number.
I think a version number should work very well.
> If client affinity survives the split (i.e. clients continue to talk
> to the same nodes), then we should find ourselves in a working state,
> with two smaller clusters carrying overlapping and diverging state.
> Each piece of state should be static in one subcluster and divergent
> in the other (it has only one client). The version carried by each
> piece of state may be used to decide which is the most recent version.
>
> (If client affinity is not maintained, then, without a backchannel of
> some sort, we are in trouble.)
>
> When a merge occurs, WADI will be able to merge the internal
> representations of the participants, delegating awkward decisions
> about divergent state to deploy-time pluggable algorithms. Hopefully,
> each piece of state will only have diverged in one cluster fragment,
> so choosing which copy to go forward with will be trivial.
> A node death can just be thought of as a 'split' which never 'merges'.
Definitely :)
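A minimal sketch of that versioned-state/pluggable-merge idea, under
the assumption that each piece of replicated state carries a version
number (the interface and names are invented; WADI's real plug-point
may look quite different):

    // Invented interface; WADI's real plug-point may look quite different.
    class VersionedState {
        final long version;
        final byte[] payload;

        VersionedState(long version, byte[] payload) {
            this.version = version;
            this.payload = payload;
        }
    }

    interface MergeStrategy {
        VersionedState choose(VersionedState left, VersionedState right);
    }

    // Trivial default: if state only diverged in one fragment, the higher
    // version is trivially the one to go forward with.
    class NewestWinsStrategy implements MergeStrategy {
        public VersionedState choose(VersionedState left, VersionedState right) {
            return left.version >= right.version ? left : right;
        }
    }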
> Of course, multiple splits could occur concurrently and merging them
> is a little more complicated than I may have implied, but I am getting
> there....
Although I consider the problem of session replication less than
glamorous, since it is at hand, I would approach it this way:
1. The user should configure a minimum-degree-of-replication R. This is
the number of replicas of a specific session which need to be available
in order for an HTTP request to be serviced.
2. When an HTTP request arrives, if the cluster which received it does
not have R copies then it blocks (it waits until there are). This
should be acceptable in data centers, because partitions are likely to
be very short-lived (aka virtual partitions, which are due to
congestion, not to any hardware issue). (See the sketch after this
list.)
3. If at any time an HTTP request reaches a server which does not
itself have a replica of the session, it sends a client redirect to a
node which does.
4. When a new cluster is formed (with nodes coming or going), it takes
an inventory of all the sessions and their version numbers. Sessions
which do not have the necessary degree of replication need to be fixed,
which will require some state transfer, and possibly migration of some
sessions for proper load balancing.
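As promised above, a minimal sketch of the blocking in point 2,
assuming the membership layer can report how many replicas of a
session are currently reachable and signal when that count changes
(names invented for illustration):

    // Names invented; assumes the membership layer can count reachable
    // replicas of a session and signal when that count changes.
    interface ReplicaCounter {
        int count(String sessionId);
    }

    class ReplicationGate {
        private final Object lock = new Object();
        private final int requiredReplicas; // the configured R

        ReplicationGate(int requiredReplicas) {
            this.requiredReplicas = requiredReplicas;
        }

        // Called by the membership layer whenever replica counts change.
        void onMembershipChanged() {
            synchronized (lock) {
                lock.notifyAll();
            }
        }

        // Called before servicing an HTTP request for the given session;
        // blocks until R copies are visible (partitions should be short).
        void awaitReplicas(ReplicaCounter counter, String sessionId)
                throws InterruptedException {
            synchronized (lock) {
                while (counter.count(sessionId) < requiredReplicas) {
                    lock.wait(1000); // wake periodically and re-check
                }
            }
        }
    }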
Guglielmo
--
"Open Source is a self-assembling organism. You dangle a piece of
string into a super-saturated solution and a whole operating-system
crystallises out around it."
/**********************************
* Jules Gosnell
* Partner
* Core Developers Network (Europe)
*
* www.coredevelopers.net
*
* Open Source Training & Support.
**********************************/