Hi all, I'm concerned that there's an unsafe race when a broker loses and reestablishes its zk connection, and I'd like others to weigh in.
On ZookeeperConsumerConnector:204, registerConsumerInZK calls ZkUtils.createEphemeralPathExpectConflict, which on ZkUtils:89 has a case where it observes that the node and data it wants to create already exist, treats this as a success, and returns normally. But isn't it possible for that existing node to be a stale ephemeral node that is about to disappear, in which case the broker will lose its ephemeral /brokers/ids node and consumers won't be able to find it? In particular, wouldn't this occur when the broker gets disconnected from zk, reconnects with a new session, and tries to recreate its ephemeral node before zk has timed out the ephemeral node from its previous session?

I'm seeing behavior where one of our brokers was running but had no /brokers/ids node, and the logs indicated that it had recently reconnected to zk. I suspect this race is the explanation. (To fix it, I just restarted the broker.)

I'm running an old RC for kafka-0.6, but looking at the latest code (from the git mirror), the code path described above appears to be the same as what we're running.

Dan
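
P.S. To make the sequence I'm worried about concrete, here is a toy in-memory model of it in Python. This is only a sketch of my understanding, not the actual Kafka or ZooKeeper code: ToyZk, create_expect_conflict, create_checking_owner, the broker id 0 and the "host:9092" data are all names and values I made up for illustration. The second helper shows one possible mitigation, which is to treat an existing node as ours only if it is owned by the current session. I believe real ZooKeeper exposes the owning session via the node's Stat (getEphemeralOwner), but I haven't checked how that would fit into ZkUtils.

```python
class NodeExists(Exception):
    pass


class ToyZk:
    """Toy stand-in for zk: tracks only sessions and ephemeral nodes."""

    def __init__(self):
        self.nodes = {}  # path -> (data, owning session id)
        self._next_session = 1

    def new_session(self):
        sid = self._next_session
        self._next_session += 1
        return sid

    def create_ephemeral(self, sid, path, data):
        if path in self.nodes:
            raise NodeExists(path)
        self.nodes[path] = (data, sid)

    def read(self, path):
        return self.nodes[path][0]

    def owner(self, path):
        return self.nodes[path][1]

    def expire_session(self, sid):
        # Expiring a session deletes every ephemeral node it owns.
        self.nodes = {p: v for p, v in self.nodes.items() if v[1] != sid}


PATH = "/brokers/ids/0"  # hypothetical broker id
DATA = "host:9092"       # hypothetical registration data


def create_expect_conflict(zk, sid, path, data):
    """Models the behavior described above: same data already there => success."""
    try:
        zk.create_ephemeral(sid, path, data)
    except NodeExists:
        if zk.read(path) != data:
            raise
        # Same data already present: treated as success, even though the
        # node may belong to a previous session that is about to expire.


def create_checking_owner(zk, sid, path, data):
    """Possible mitigation: succeed only if this session owns the node."""
    try:
        zk.create_ephemeral(sid, path, data)
        return True
    except NodeExists:
        return zk.owner(path) == sid  # False => caller should retry later


def demo_race():
    zk = ToyZk()
    old = zk.new_session()
    create_expect_conflict(zk, old, PATH, DATA)  # initial registration
    new = zk.new_session()                       # reconnect, new session
    create_expect_conflict(zk, new, PATH, DATA)  # "succeeds" on stale node
    zk.expire_session(old)                       # old session times out
    return PATH in zk.nodes                      # is the broker registered?


def demo_owner_check():
    zk = ToyZk()
    old = zk.new_session()
    create_expect_conflict(zk, old, PATH, DATA)
    new = zk.new_session()
    first = create_checking_owner(zk, new, PATH, DATA)   # stale node seen
    zk.expire_session(old)
    second = create_checking_owner(zk, new, PATH, DATA)  # retry after expiry
    registered = PATH in zk.nodes and zk.owner(PATH) == new
    return first, second, registered
```

In this model, demo_race() ends with no node at the path even though both registration calls returned normally, which matches what I saw on our broker. demo_owner_check() reports the first attempt as not yet registered and only succeeds on the retry after the old session has expired.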