Hi All,

A few notes on running a ZooKeeper ensemble on Docker... (Apologies in advance for any mistakes/confusion.)
By default, Docker uses "bridge" networking, where it assigns each container its own virtual IP address (e.g. 10.x.x.x). You can force a container to use the host IP with --net=host, but this opens up some security vulnerabilities (like a container shutting down your host by sending commands to D-Bus).

By default, ZooKeeper wants each quorum member to have the full list of ensemble-member addresses. Unfortunately, this list performs double duty. For the server itself, its entry represents the *interface* to bind to. For its peers in the ensemble, it represents the *IP* address to connect to.

To see why this is a problem, consider the following scenario: We want to run two servers, "1" and "2", in separate containers on the same host (address 1.2.3.4). Suppose we set the quorum list to:

  - 1.2.3.4:3181:4181:participant;2181
  - 1.2.3.4:3182:4182:participant;2182

Server "1" tries to bind to address 1.2.3.4:3181 and fails, because inside its container it actually has an address like 10.3.4.5.

Similarly, if we use hostnames:

  - foo1.bar.com:3181:4181:participant;2181
  - foo2.bar.com:3182:4182:participant;2182

each name resolves to 1.2.3.4 and the bind fails in the same way. Using the --hostname option to Docker will also fail, either because we use IPs in the quorum list, or because we will have the wrong (10.x.x.x) address for our peer.

Now suppose we get clever and pass a different quorum list to each server. So, server "1" gets:

  - 0.0.0.0:3181:4181:participant;2181
  - 1.2.3.4:3182:4182:participant;2182

And server "2" gets:

  - 1.2.3.4:3181:4181:participant;2181
  - 0.0.0.0:3182:4182:participant;2182

It turns out that both servers will be able to bind to their local server address, *and* connect to their peer. So, now we're totally good, right?...

Nope. It turns out that once the servers connect, they synchronize their quorum lists, and *restart* their quorum and leader election ports and services. Then one of three things happens:

a) The server has 0.0.0.0 for its peer and can't connect.
b) The server has 1.2.3.4 for its own address and can't bind.
c) Unexpected exceptions (restart race condition?) and repeated failures.

So, unless you're willing to use --net=host, there's no correct way to specify the quorum with a default ZK setup. So, we're doomed, right?...

It turns out that there's a (slightly buried) feature that lets you bind to 0.0.0.0 for the leader and quorum interfaces:

https://issues.apache.org/jira/browse/ZOOKEEPER-1096

But really, it seems unfortunate that:

a) The quorum list from peers can re-configure your local bindings (Denial of service on port 80, anyone?)
b) ZK conflates binding *interfaces* and peer *IP* addresses
c) This doesn't work by default, and takes some digging to find the fix.

Thoughts?

Thanks!
.timrc
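
P.S. For anyone who wants to reproduce the two setups discussed above, here is a minimal Python sketch that renders the quorum lists. It is only an illustration under some assumptions: the helper names (server_line, per_server_list, shared_list) and the ENSEMBLE table are hypothetical, the "server.N=" prefix is the usual config-file form of the entries shown above, and I am assuming that the feature behind ZOOKEEPER-1096 is the quorumListenOnAllIPs setting from the admin guide (the JIRA link above is the only thing this post actually names, so please double-check against your ZK version).

```python
# Hypothetical ensemble table: server id -> (host, quorum port, election port, client port).
# The values mirror the two-server example in this post.
ENSEMBLE = {
    1: ("1.2.3.4", 3181, 4181, 2181),
    2: ("1.2.3.4", 3182, 4182, 2182),
}


def server_line(sid, host, quorum_port, election_port, client_port, role="participant"):
    """Render one quorum entry in the host:port:port:role;clientport form."""
    return f"server.{sid}={host}:{quorum_port}:{election_port}:{role};{client_port}"


def per_server_list(my_id, ensemble=ENSEMBLE):
    """The 'clever' variant: 0.0.0.0 for our own entry, real addresses for peers.

    This is the setup that breaks once the servers synchronize their lists.
    """
    lines = []
    for sid, (host, qp, ep, cp) in sorted(ensemble.items()):
        bind_host = "0.0.0.0" if sid == my_id else host
        lines.append(server_line(sid, bind_host, qp, ep, cp))
    return lines


def shared_list(ensemble=ENSEMBLE):
    """One identical list for every server, plus the wildcard-bind setting.

    Assumption: quorumListenOnAllIPs=true is the option referred to in this post.
    """
    lines = [server_line(sid, *spec) for sid, spec in sorted(ensemble.items())]
    lines.append("quorumListenOnAllIPs=true")
    return lines


if __name__ == "__main__":
    for sid in sorted(ENSEMBLE):
        print(f"# per-server list for server {sid}")
        print("\n".join(per_server_list(sid)))
    print("# shared list for all servers")
    print("\n".join(shared_list()))
```

The point of the second function is that every server gets the same list with the real peer addresses, so the list synchronization has nothing to overwrite, while the bind side is handled by the separate setting instead of by the list entry.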
