[FLINK-2288] [docs] Add docs for HA/ZooKeeper setup This closes #886
Project: http://git-wip-us.apache.org/repos/asf/flink/repo Commit: http://git-wip-us.apache.org/repos/asf/flink/commit/9c0dd974 Tree: http://git-wip-us.apache.org/repos/asf/flink/tree/9c0dd974 Diff: http://git-wip-us.apache.org/repos/asf/flink/diff/9c0dd974 Branch: refs/heads/master Commit: 9c0dd9742966011322be36343611146ed7b862f0 Parents: 8c72b50 Author: Ufuk Celebi <[email protected]> Authored: Fri Jul 3 16:45:15 2015 +0200 Committer: Stephan Ewen <[email protected]> Committed: Wed Jul 8 20:28:40 2015 +0200 ---------------------------------------------------------------------- docs/_includes/navbar.html | 1 + docs/page/css/flink.css | 14 ++- docs/setup/fig/jobmanager_ha_overview.png | Bin 0 -> 57875 bytes docs/setup/jobmanager_high_availability.md | 121 ++++++++++++++++++++++++ 4 files changed, 133 insertions(+), 3 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/flink/blob/9c0dd974/docs/_includes/navbar.html ---------------------------------------------------------------------- diff --git a/docs/_includes/navbar.html b/docs/_includes/navbar.html index 740ec9f..dc7ef30 100644 --- a/docs/_includes/navbar.html +++ b/docs/_includes/navbar.html @@ -53,6 +53,7 @@ under the License. <li><a href="{{ setup }}/yarn_setup.html">YARN</a></li> <li><a href="{{ setup }}/gce_setup.html">GCloud</a></li> <li><a href="{{ setup }}/flink_on_tez.html">Flink on Tez <span class="badge">Beta</span></a></li> + <li><a href="{{ setup }}/jobmanager_high_availability.html">JobManager High Availability<a></li> <li class="divider"></li> <li><a href="{{ setup }}/config.html">Configuration</a></li> http://git-wip-us.apache.org/repos/asf/flink/blob/9c0dd974/docs/page/css/flink.css ---------------------------------------------------------------------- diff --git a/docs/page/css/flink.css b/docs/page/css/flink.css index 9074e23..3b09e54 100644 --- a/docs/page/css/flink.css +++ b/docs/page/css/flink.css @@ -113,11 +113,19 @@ h2, h3 { code { background: #f5f5f5; - padding: 0; - color: #333333; - font-family: "Menlo", "Lucida Console", monospace; + padding: 0; + color: #333333; + font-family: "Menlo", "Lucida Console", monospace; } pre { font-size: 85%; } + +img.center { + display: block; + margin-left: auto; + margin-right: auto; +} + + http://git-wip-us.apache.org/repos/asf/flink/blob/9c0dd974/docs/setup/fig/jobmanager_ha_overview.png ---------------------------------------------------------------------- diff --git a/docs/setup/fig/jobmanager_ha_overview.png b/docs/setup/fig/jobmanager_ha_overview.png new file mode 100644 index 0000000..ff82cae Binary files /dev/null and b/docs/setup/fig/jobmanager_ha_overview.png differ http://git-wip-us.apache.org/repos/asf/flink/blob/9c0dd974/docs/setup/jobmanager_high_availability.md ---------------------------------------------------------------------- diff --git a/docs/setup/jobmanager_high_availability.md b/docs/setup/jobmanager_high_availability.md new file mode 100644 index 0000000..dec0cdc --- /dev/null +++ b/docs/setup/jobmanager_high_availability.md @@ -0,0 +1,121 @@ +--- +title: "JobManager High Availability (HA)" +--- +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> + +The JobManager is the coordinator of each Flink deployment. It is responsible for both *scheduling* and *resource management*. + +By default, there is a single JobManager instance per Flink cluster. This creates a *single point of failure* (SPOF): if the JobManager crashes, no new programs can be submitted and running programs fail. + +With JobManager High Availability, you can run multiple JobManager instances per Flink cluster and thereby circumvent the *SPOF*. + +The general idea of JobManager high availability is that there is a **single leading JobManager** at any time and **multiple standby JobManagers** to take over leadership in case the leader fails. This guarantees that there is **no single point of failure** and programs can make progress as soon as a standby JobManager has taken leadership. There is no explicit distinction between standby and master JobManager instances. Each JobManager can take the role of master or standby. + +As an example, consider the following setup with three JobManager instances: + +<img src="fig/jobmanager_ha_overview.png" class="center" /> + +## Configuration + +To enable JobManager High Availability you have to configure a **ZooKeeper quorum** and set up a **masters file** with all JobManagers hosts. + +Flink leverages **[ZooKeeper](http://zookeeper.apache.org)** for *distributed coordination* between all running JobManager instances. ZooKeeper is a separate service from Flink, which provides highly reliable distirbuted coordination via leader election and light-weight consistent state storage. Check out [ZooKeeper's Getting Started Guide](http://zookeeper.apache.org/doc/trunk/zookeeperStarted.html) for more information about ZooKeeper. + +Configuring a ZooKeeper quorum in `conf/flink-conf.yaml` *enables* high availability mode and all Flink components try to connect to a JobManager via coordination through ZooKeeper. + +- **ZooKeeper quorum** (required): A *ZooKeeper quorum* is a replicated group of ZooKeeper servers, which provide the distributed coordination service. + + <pre>ha.zookeeper.quorum: address1:2181[,...],addressX:2181</pre> + + Each *addressX:port* refers to a ZooKeeper server, which is reachable by Flink at the given address and port. + +- The following configuration keys are optional: + + - `ha.zookeeper.dir: /flink [default]`: ZooKeeper directory to use for coordination + - TODO Add client configuration keys + +## Starting an HA-cluster + +In order to start an HA-cluster configure the *masters* file in `conf/masters`: + +- **masters file**: The *masters file* contains all hosts, on which JobManagers are started. + + <pre> +jobManagerAddress1 +[...] +jobManagerAddressX + </pre> + +After configuring the masters and the ZooKeeper quorum, you can use the provided cluster startup scripts as usual. They will start a HA-cluster. **Keep in mind that the ZooKeeper quorum has to be running when you call the scripts**. + +## Running ZooKeeper + +If you don't have a running ZooKeeper installation, you can use the helper scripts, which ship with Flink. + +There is a ZooKeeper configuration template in `conf/zoo.cfg`. You can configure the hosts to run ZooKeeper on with the `server.X` entries, where X is a unique ID of each server: + +<pre> +server.X=addressX:peerPort:leaderPort +[...] +server.Y=addressY:peerPort:leaderPort +</pre> + +The script `bin/start-zookeeper-quorum.sh` will start a ZooKeeper server on each of the configured hosts. The started processes start ZooKeeper servers via a Flink wrapper, which reads the configuration from `conf/zoo.cfg` and makes sure to set some rqeuired configuration values for convenience. In production setups, it is recommended to manage your own ZooKeeper installation. + +## Example: Start and stop a local HA-cluster with 2 JobManagers + +1. **Configure ZooKeeper quorum** in `conf/flink.yaml`: + + <pre>ha.zookeeper.quorum: localhost</pre> + +2. **Configure masters** in `conf/masters`: + + <pre> +localhost +localhost</pre> + +3. **Configure ZooKeeper server** in `conf/zoo.cfg` (currently it's only possible to run a single ZooKeeper server per machine): + + <pre>server.0=localhost:2888:3888</pre> + +4. **Start ZooKeeper quorum**: + + <pre> +$ bin/start-zookeeper-quorum.sh +Starting zookeeper daemon on host localhost.</pre> + +5. **Start an HA-cluster**: + + <pre> +$ bin/start-cluster-streaming.sh +Starting HA cluster (streaming mode) with 2 masters and 1 peers in ZooKeeper quorum. +Starting jobmanager daemon on host localhost. +Starting jobmanager daemon on host localhost. +Starting taskmanager daemon on host localhost.</pre> + +6. **Stop ZooKeeper quorum and cluster**: + + <pre> +$ bin/stop-cluster.sh +Stopping taskmanager daemon (pid: 7647) on localhost. +Stopping jobmanager daemon (pid: 7495) on host localhost. +Stopping jobmanager daemon (pid: 7349) on host localhost. +$ bin/stop-zookeeper-quorum.sh +Stopping zookeeper daemon (pid: 7101) on host localhost.</pre>
