[ https://issues.apache.org/jira/browse/ZOOKEEPER-3188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16685929#comment-16685929 ]

Michael Han commented on ZOOKEEPER-3188:
----------------------------------------

A couple of comments on the high-level design:

* Did we consider the compatibility requirements here? Will the new 
configuration format be backward compatible? One concrete use case: a 
customer upgrades to a new version with this multiple-addresses-per-server 
capability but wants to roll back without rewriting the config files for the 
older version.

* Did we evaluate the impact of this feature on the existing server-to-server 
mutual authentication and authorization features (e.g. ZOOKEEPER-1045 for 
Kerberos, ZOOKEEPER-236 for SSL), and also the impact on operations? For 
example, how would Kerberos principals and/or SSL certs be configured per host, 
given multiple potential IP addresses and/or FQDNs per server?

* Could we provide more details on the expected level of support with regard 
to the dynamic reconfiguration feature? Examples would be great - for 
instance: we would support adding, removing, or updating server addresses 
associated with a given server via dynamic reconfiguration, along with the 
expected behavior in each case. For example, adding a new address to an 
existing ensemble member should not cause any disconnect / reconnect, but 
removing an in-use address of a server should cause a disconnect. The dynamic 
reconfig API / CLI / docs will likely need updating because of this.

> Improve resilience to network
> -----------------------------
>
>                 Key: ZOOKEEPER-3188
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3188
>             Project: ZooKeeper
>          Issue Type: Bug
>            Reporter: Ted Dunning
>            Priority: Major
>
> We propose to add network-level resiliency to Zookeeper. The ideas that we 
> have on the topic have been discussed on the mailing list and via a 
> specification document that is located at 
> [https://docs.google.com/document/d/1iGVwxeHp57qogwfdodCh9b32P2_kOQaJZ2GDo7j36fI/edit?usp=sharing]
> That document is copied into this issue, which is being created to report 
> the results of experimental implementations.
> h1. Zookeeper Network Resilience
> h2. Background
> Zookeeper is designed to help in building distributed systems. It provides a 
> variety of operations for doing this and all of these operations have rather 
> strict guarantees on semantics. Zookeeper itself is a distributed system made 
> up of a cluster containing a leader and a number of followers. The leader is 
> designated in a process known as leader election in which a majority of all 
> nodes in the cluster must agree on a leader. All subsequent operations are 
> initiated by the leader and completed when a majority of nodes have confirmed 
> the operation. Whenever an operation cannot be confirmed by a majority or 
> whenever the leader goes missing for a time, a new leader election is 
> conducted and normal operations proceed once a new leader is confirmed.
>  
> The details of this are not important to this discussion. What is important 
> is that the semantics of the operations conducted by a Zookeeper cluster, 
> and the semantics of how client processes communicate with the cluster, 
> depend only on the basic fact that messages sent over TCP connections will 
> never arrive out of order or go missing. Central to the design of ZK is that 
> a server-to-server network connection is used for as long as it works, and a 
> new connection is made when it appears that the old connection isn't 
> working.
>  
> As currently implemented, however, each member of a Zookeeper cluster can 
> have only a single address as viewed from some other process. This means, 
> absent network link bonding, that the loss of a single switch or a few 
> network connections could completely stop the operations of the Zookeeper 
> cluster. It is the goal of this work to address this issue by allowing each 
> server to listen on multiple network interfaces and to connect to other 
> servers using any of several addresses. The effect will be to allow servers 
> to communicate over redundant network paths, improving resiliency to network 
> failures without changing any core algorithms.
> h2. Proposed Change
> Interestingly, the correct operation of a Zookeeper cluster does not depend 
> on _how_ a TCP connection was made. There is no reason at all not to 
> advertise multiple addresses for members of a Zookeeper cluster. 
>  
> Connections between members of a Zookeeper cluster, and between a client and 
> a cluster member, are established by referencing a configuration file (for 
> cluster members) that specifies the addresses of all of the nodes in a 
> cluster, or by using a connection string containing possible addresses of 
> Zookeeper cluster members. As soon as a connection is made, any desired 
> authentication or encryption layers are added and the connection is handed 
> off to the client communications layer or the server-to-server logic. 
> This means that the only things that actually need to change to allow 
> Zookeeper servers to be accessible on multiple networks are the server 
> configuration file format, which must allow multiple addresses to be 
> specified, and the code that establishes TCP connections, which must make 
> use of these multiple addresses. No code changes are actually needed on the 
> client since we can simply supply all possible server addresses, as 
> illustrated below. The client already has logic for selecting a server 
> address at random and it doesn't really matter if these addresses represent 
> synonyms for the same server. All that matters is that _some_ connection to 
> a server is established.
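> 
> A minimal sketch of such a client, using the standard ZooKeeper Java client 
> (the hostnames follow the example later in this document; the session 
> timeout is arbitrary):
> 
> import java.io.IOException;
> import org.apache.zookeeper.ZooKeeper;
> 
> public class MultiAddressClient {
>     public static void main(String[] args)
>             throws IOException, InterruptedException {
>         // Every address of every server, synonyms included. The client
>         // picks an address at random, so it does not matter that
>         // zoo1-net1 and zoo1-net2 are the same server.
>         String connectString =
>             "zoo1-net1:2181,zoo1-net2:2181,"
>           + "zoo2-net1:2181,zoo2-net2:2181,"
>           + "zoo3-net1:2181";
>         ZooKeeper zk = new ZooKeeper(connectString, 30000, event -> {});
>         // ... use zk ...
>         zk.close();
>     }
> }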
> h2. Configuration File Syntax Change
> The current Zookeeper syntax looks like this:
>  
> tickTime=2000
> dataDir=/var/zookeeper
> clientPort=2181
> initLimit=5
> syncLimit=2
> server.1=zoo1:2888:3888
> server.2=zoo2:2888:3888
> server.3=zoo3:2888:3888
>  
> The only lines that matter for this discussion are the last three. These 
> specify the addresses for each of the servers that are part of the Zookeeper 
> cluster as well as the port numbers used for the servers to talk to each 
> other.
>  
> I propose that the current syntax of these lines be augmented to allow a 
> comma-delimited list of addresses. For the current example, we might have 
> this:
>  
> server.1=zoo1-net1:2888:3888,zoo1-net2:2888:3888
> server.2=zoo2-net1:2888:3888,zoo2-net2:2888:3888
> server.3=zoo3-net1:2888:3888
>  
> The first two servers are available via two different addresses, presumably 
> on separate networks, while the third server has only a single address. In 
> practice, we would probably specify multiple addresses for all the servers, 
> but that isn't necessary for this proposal. There is work ongoing to improve 
> and generalize the syntax for configuring Zookeeper clusters. As that work 
> progresses, it will be necessary to figure out appropriate extensions to 
> allow multiple addresses in the new and improved syntax. Nothing blocks the 
> current proposal from being implemented in its current form and adapted 
> later for the new syntax.
>  
> When a server tries to connect to another server, it would simply shuffle 
> the available addresses and try them in succession until a connection 
> succeeds or all addresses have been tried. 
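> 
> A minimal sketch of that connection loop (the helper below is hypothetical, 
> not the actual ZooKeeper internals):
> 
> import java.io.IOException;
> import java.net.InetSocketAddress;
> import java.net.Socket;
> import java.util.ArrayList;
> import java.util.Collections;
> import java.util.List;
> 
> public class ServerConnector {
>     // Try the peer's advertised addresses in random order and return
>     // the first socket that connects.
>     static Socket connectToPeer(List<InetSocketAddress> addresses,
>                                 int timeoutMs) throws IOException {
>         List<InetSocketAddress> shuffled = new ArrayList<>(addresses);
>         Collections.shuffle(shuffled);
>         IOException lastFailure = null;
>         for (InetSocketAddress addr : shuffled) {
>             Socket socket = new Socket();
>             try {
>                 socket.connect(addr, timeoutMs);
>                 return socket; // first address that works wins
>             } catch (IOException e) {
>                 lastFailure = e;
>                 try { socket.close(); } catch (IOException ignored) { }
>             }
>         }
>         throw lastFailure != null
>             ? lastFailure : new IOException("no addresses to try");
>     }
> }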
>  
> The complete syntax for server lines in a Zookeeper configuration file in BNF 
> is
>  
> <server-line> ::= "server." <integer> "=" <address-spec>
> <address-spec> ::= <server-address> [<client-address>]
> <server-address> ::= <address> ":" <port1> ":" <port2> [":" <role>]
> <client-address> ::= ";" [<address> ":"] <port>
>  
> After this change, the syntax would look like this:
>  
> <server-line> ::= "server." <integer> "=" <address-list>
> <address-list> ::= <address-spec> ["," <address-list>]
> <address-spec> ::= <server-address> [<client-address>]
> <server-address> ::= <address> ":" <port1> ":" <port2> [":" <role>]
> <client-address> ::= ";" [<address> ":"] <port>
>  
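> Read against the new grammar, a single line can advertise several 
> address-specs, each optionally carrying a role and a dedicated client port 
> (the role and client-port parts are the existing 3.5 syntax; only the 
> comma-separated list is new). For example:
> 
> server.1=zoo1-net1:2888:3888:participant;2181,zoo1-net2:2888:3888:participant;2181
> 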
> h2. Dynamic Reconfiguration
> Since version 3.5, Zookeeper has had the ability to change the configuration 
> of the cluster dynamically. This can involve the atomic change of any of the 
> configuration parameters that are dynamically configurable. These include, 
> notably for our purposes here, the addresses of the servers in the cluster. 
> In order to simplify this, the post-3.5 configuration file is split into a 
> static configuration that cannot be changed on the fly and a dynamic 
> configuration that can be. When a new configuration is committed by the 
> cluster, the dynamic configuration file is simply overwritten and used.
>  
> This means that extending the configuration file syntax to support multiple 
> addresses is sufficient to support dynamic reconfiguration.
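> 
> As a sketch of what that could look like from the CLI (assuming the existing 
> reconfig command accepts the extended address-list syntax unchanged, and 
> that modifying an existing server's spec is done with -add as today):
> 
> # in zkCli.sh: give server.3 a second address
> reconfig -add server.3=zoo3-net1:2888:3888;2181,zoo3-net2:2888:3888;2181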
> h2. Client Connections
> When client connections are initially made, the client library is given a 
> list of servers to contact. Servers are selected at random until a 
> connection is made or the patience of the library implementers is exhausted. 
> This requires no changes to support multiple network links per server, 
> except insofar as servers with more network connections would wind up with 
> more client connections unless some action is taken. What will be done is to 
> find the server with the most addresses and add duplicates of some address 
> for every other server until every server is mentioned the same number of 
> times (see the sketch below). Where all servers have identical numbers of 
> network connections, this causes no change. Unequal address counts are 
> expected to arise in normal situations only as a transient condition while a 
> cluster is being reconfigured or if some servers are added to a cluster 
> temporarily during maintenance operations. 
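> 
> A minimal sketch of that balancing step (a hypothetical helper, not the 
> actual client code):
> 
> import java.util.ArrayList;
> import java.util.List;
> import java.util.Map;
> 
> public class AddressBalancer {
>     // Pad each server's address list, by repeating its own addresses,
>     // until every server contributes the same number of entries. A
>     // uniformly random pick from the result then gives every server
>     // the same probability of being chosen.
>     static List<String> balanced(Map<String, List<String>> byServer) {
>         int max = 0;
>         for (List<String> addrs : byServer.values()) {
>             max = Math.max(max, addrs.size());
>         }
>         List<String> result = new ArrayList<>();
>         for (List<String> addrs : byServer.values()) {
>             for (int i = 0; i < max; i++) {
>                 // cycling through short lists duplicates some addresses
>                 result.add(addrs.get(i % addrs.size()));
>             }
>         }
>         return result;
>     }
> }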
>  
> More interesting is the fact that when a connection is made to a Zookeeper 
> cluster, the server responds with a list of the servers in the cluster. We 
> will need to arrange that the list contains all available addresses in the 
> Zookeeper cluster, but will not need to make any other changes. As mentioned 
> before, some addresses might be duplicated to make sure that all servers 
> have equal probability of being selected by a client.


