Vishal/Alex,
 Would it be good to have these comments and the design on the jira? It's
probably better to keep design discussion/comments on the jira.

thanks
mahadev

On Sun, Mar 13, 2011 at 10:25 AM, Vishal Kher <[email protected]> wrote:
> Hi Alex,
>
> Great! Thanks for the design. I have a few suggestions/questions below.
>
> A. Bootstrapping the cluster
>    The algorithm assumes that a leader exists prior to doing a reconfig.
> So it looks like the reconfig() API is not intended to be used for
> bootstrapping the cluster. How about we define an API for initializing
> the cluster?
>
> Instead of relying on the current method of setting the initial
> configuration in the config file, we should probably also define a
> join(M) (or init(M)) API. When a peer receives this request, it will try
> to connect to the specified peers. During bootstrap, peers will connect
> to each other (as they do now) and elect a leader.
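>
> To make this concrete, here is a minimal sketch of what I have in mind.
> The interface and method names below are just placeholders, not anything
> in the current codebase:
>
>     import java.io.IOException;
>     import java.net.InetSocketAddress;
>     import java.util.Set;
>
>     // Hypothetical bootstrap API; names are placeholders.
>     public interface ClusterBootstrap {
>         // Connect to the listed peers; once a quorum of them is
>         // reachable, run leader election as peers do today.
>         void init(Set<InetSocketAddress> initialMembers) throws IOException;
>     }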
>
> B. Storing membership in a znode and updating clients (a topic tangential
> to this design).
>    Earlier, ZOOKEEPER-107 proposed using a URI-based approach for the
> clients to fetch the server list. I am not opposed to the URI-based
> approach; however, it shouldn't be our only approach. The URI-based
> approach requires extra resources (e.g., a fault-tolerant web service or
> shared storage for the file), and in certain environments it may not be
> feasible to have such a resource. Instead, can we use ZooKeeper watches
> for this? Store the membership information in a znode and let the
> ZooKeeper server inform the clients of the changed configuration.
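>
> Roughly, a client could keep itself up to date along these lines. This is
> only a sketch: the znode path "/cluster/config" and the config format are
> assumptions, not anything that exists today:
>
>     import org.apache.zookeeper.WatchedEvent;
>     import org.apache.zookeeper.Watcher;
>     import org.apache.zookeeper.ZooKeeper;
>
>     // Sketch: "/cluster/config" is a hypothetical znode holding the
>     // serialized member list, updated by the server on reconfig.
>     public class ConfigWatcher implements Watcher {
>         private final ZooKeeper zk;
>
>         public ConfigWatcher(ZooKeeper zk) { this.zk = zk; }
>
>         public void process(WatchedEvent event) {
>             try {
>                 // Re-read the membership and re-register the watch.
>                 byte[] members = zk.getData("/cluster/config", this, null);
>                 // ... update the client's server list from 'members' ...
>             } catch (Exception e) {
>                 // handle connection loss / retry
>             }
>         }
>     }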
>
> C. Locating most recent config on servers
>    The URI-based approach will be more helpful for servers. For example,
> if A was down when M={A, B, C} was changed to M'={A, D, E}, then when A
> comes back online it won't be able to locate the most recent config. In
> this case, A can query the URI. A second approach is to have the leader
> periodically send the membership (at least to nodes that are down).
>
> D. "Send a message to all members(M’) instructing them to connect to
> leader(M)."
>    leader(M) can potentially change after sending this message. Should
> this be "Send a message to all members(M') instructing them to connect to
> members(M)"? See point G below.
>
> Also, in step 7, why do you send leader(M') along with the
> <activate-config> message?
>
> E. "3. Submit a reconfig(M’) operation to leader(M)"
>    What if leader(M) fails after receiving the request, but before
> informing any other peer? How will the administrative client know whether
> to retry or not? Should s retry if the leader fails, and should the
> client API retry if s fails?
>
> F. Regarding 3.a and 3.b.
>
> The algorithm mentions:
>
> "3. Otherwise:
>   a. Designate the most up-to-date server in connected quorum of M' to be
> leader(M)
>   b. Stop processing new operations, return failure for any further ops
> received"
>
> Why do we need to change the leader if leader(M) is not in M'? How about
> we let the leader perform the reconfig, and at the end of phase 3 (while
> processing retire) the leader gives up leadership. This will cause a new
> leader election, and one of the peers in M' will become the leader.
> Similarly, after the second phase, members(M') will stop following
> leader(M) if leader(M) is not in M'. I think this will be simpler.
>
> G. Online vs offline
>
>    In your "Online vs offline" section, you mentioned that the offline
> strategy is preferred. I think we can do the reconfiguration online,
> picturing M' as modified OBSERVERS until the reconfiguration is complete.
>
> a. Let new members(M') join the cluster as OBSERVERS. Based on the
> current implementation, M' will essentially sync with the leader after
> becoming OBSERVERS, and it will be easier to leverage the
> RequestProcessor pipeline for reconfig.
>
> b. Once a majority of M' join the leader, the leader executes phase 1 as you
> described.
>
> c. After phase 1, the leader changes the transaction logic a bit. Every
> transaction after this point (including reconfig-start) will be sent to
> both M and M'. The leader then waits for quorum(M) and quorum(M') to ack
> the transaction (see the sketch after step d below). So M' are not pure
> OBSERVERS as we think of them today. However, only a member of M can
> become a leader until the reconfig succeeds. Also, M' - (M ∩ M') cannot
> serve client requests until the reconfiguration is complete. By running
> each transaction on both M and M', and waiting for a quorum of each set
> to ack, we keep transferring the state to both configurations.
>
> d. After receiving acks for phase 2, the leader sends a
> <switch-to-new-config> transaction to M and M' (instead of sending retire
> just to M). After receiving this message, M' will close connections with
> (and reject connections from) members not in M'. Members that are
> supposed to leave the cluster will shut down their QuorumPeer. If
> leader(M) is not in M', then a new leader(M') will be elected.
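>
> Here is a minimal sketch of the ack rule from step c: a transaction
> commits only when both a quorum of M and a quorum of M' have acked it.
> The class and method names are illustrative, not tied to the actual
> RequestProcessor code:
>
>     import java.util.HashSet;
>     import java.util.Set;
>
>     // Illustrative dual-quorum ack tracking during reconfig.
>     public class DualQuorumTracker {
>         private final Set<Long> oldMembers;  // server ids in M
>         private final Set<Long> newMembers;  // server ids in M'
>         private final Set<Long> acked = new HashSet<Long>();
>
>         public DualQuorumTracker(Set<Long> m, Set<Long> mPrime) {
>             this.oldMembers = m;
>             this.newMembers = mPrime;
>         }
>
>         public void addAck(long serverId) { acked.add(serverId); }
>
>         // Committable only when a majority of M AND a majority of M'
>         // have acked the transaction.
>         public boolean committable() {
>             return hasQuorum(oldMembers) && hasQuorum(newMembers);
>         }
>
>         private boolean hasQuorum(Set<Long> members) {
>             int count = 0;
>             for (long id : members) {
>                 if (acked.contains(id)) count++;
>             }
>             return count > members.size() / 2;
>         }
>     }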
>
> Let me know what you think about this.
>
> H. What happens if a majority of M' keep failing during reconfig?
>
> M={A, B}, M'={A, C, D, E}. What if D, E fail?
>
> Failure of a majority of M' will permanently stall the reconfig. While
> this is less likely to happen, I think ZooKeeper should handle it
> automatically: after a few retries, we should abort the reconfig.
> Otherwise, we could disrupt a running cluster and never be able to
> recover without manual intervention. If the majority fails after phase 1,
> this would mean sending an <abort-reconfig, version(M), M'> to M.
>
> Of course, one can argue: what if a majority of M' fails after phase 3?
> So I am not sure whether this is overkill, but I feel we should handle
> this case.
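>
> Something along these lines is what I am picturing; MAX_RETRIES and the
> message/class names are placeholders for whatever policy we agree on:
>
>     // Sketch of retry-then-abort during a reconfig phase.
>     public class ReconfigDriver {
>         private static final int MAX_RETRIES = 3;  // assumed policy
>
>         interface Phase {
>             boolean tryOnce();  // true iff quorum(M') acked the phase
>         }
>
>         // Retry the phase a few times; on repeated failure, send
>         // <abort-reconfig, version(M), M'> to M and give up.
>         public boolean runOrAbort(Phase phase) {
>             for (int i = 0; i < MAX_RETRIES; i++) {
>                 if (phase.tryOnce()) {
>                     return true;
>                 }
>             }
>             broadcastAbortToM();
>             return false;
>         }
>
>         private void broadcastAbortToM() {
>             // placeholder: deliver <abort-reconfig, version(M), M'>
>             // to a quorum of M
>         }
>     }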
>
> I. "Old reconfiguration request"
>
> a. We should use ZAB
> b. A side note: I think ZOOKEEPER-335 needs to be resolved for reconfig
> to work. This bug causes logs to diverge if the ZK leader fails before
> sending PROPOSALs to followers (see
> http://www.mail-archive.com/[email protected]/msg00403.html).
>
> Because of this bug we could run into the following scenario:
> - A peer B that was leader when reconfig(M') was issued will have
>   reconfig(M') in its transaction log.
> - A peer C that became leader after B's failure can have reconfig(M'') in
>   its log.
> - Now, if B and C fail (say both reboot), then the outcome of the
>   reconfig will depend on which node takes leadership. If B becomes the
>   leader, the outcome is M'. If C becomes the leader, the outcome is M''.
>
> J. Policies / sanity checks
> - Should we allow removal of a peer in a 3-node configuration? How about
>   in a 2-node configuration? (A sketch of such a check is below.)
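>
> For example, a check along these lines; the minimum ensemble size and the
> exceptions are placeholders for whatever policy we pick:
>
>     import java.util.Set;
>
>     // Illustrative sanity check run before accepting a removal.
>     public class ReconfigPolicy {
>         private static final int MIN_ENSEMBLE_SIZE = 3;  // assumed policy
>
>         public static void validateRemoval(Set<Long> current, long id) {
>             if (!current.contains(id)) {
>                 throw new IllegalArgumentException("not a member: " + id);
>             }
>             if (current.size() - 1 < MIN_ENSEMBLE_SIZE) {
>                 throw new IllegalStateException(
>                     "removal would shrink ensemble below "
>                     + MIN_ENSEMBLE_SIZE);
>             }
>         }
>     }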
>
> Can you please add section numbers to the design paper? It will be easier to
> refer to the text by numbers.
>
> Thanks again for working on this. We are about to release the first
> version of our product using ZooKeeper, which uses a static
> configuration. Our next version is heavily dependent on dynamic
> membership. We have resources allocated at work that can dedicate time to
> implementing this feature for our next release, and we are interested in
> contributing to it. We will be happy to chip in from our side to help
> with the implementation.
>
> Regards,
> -Vishal
>
> On Thu, Mar 10, 2011 at 2:19 AM, Alexander Shraer
> <[email protected]> wrote:
>
>> Hi,
>>
>> I'm working on adding support for dynamic membership changes in
>> ZooKeeper clusters (ZOOKEEPER-107). I've added a proposed algorithm and
>> some discussion here:
>>
>> https://cwiki.apache.org/confluence/display/ZOOKEEPER/ClusterMembership
>>
>> Any comments and suggestions are very welcome.
>>
>> Best Regards,
>> Alex
>>
>
