[ https://issues.apache.org/jira/browse/KAFKA-764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048552#comment-16048552 ]
Robert P. Thille commented on KAFKA-764:
----------------------------------------
I believe we saw this issue, or something very similar.
During a load test, we had a 3-node Kafka cluster get into a confused state:
Brokers 0 and 1 were healthy and were listed in /brokers/ids/X in ZK. Broker 2
was connected to ZK but not listed in /brokers/ids/2, and brokers 0 and 1 had
no connections to broker 2.
Broker 2 kept accepting new messages produced to it for hours.
Eventually it did rejoin the cluster, but the messages published to it were
lost, seemingly because brokers 0 and 1 outvoted broker 2 about the partitions.
> Race Condition in Broker Registration after ZooKeeper disconnect
> ----------------------------------------------------------------
>
> Key: KAFKA-764
> URL: https://issues.apache.org/jira/browse/KAFKA-764
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.7.1
> Reporter: Bob Cotton
>
> When running our ZooKeepers in VMware, occasionally all the keepers
> simultaneously pause long enough for the Kafka clients to time out and then
> the keepers simultaneously un-pause.
> When this happens, the zk clients disconnect from ZooKeeper. When ZooKeeper
> comes back, ZkUtils.createEphemeralPathExpectConflict finds a node with the
> broker's own id, so it does not re-register the broker id node, and the
> function call succeeds. Then ZooKeeper figures out the broker disconnected from the keeper
> and deletes the ephemeral node *after* allowing the consumer to read the data
> in the /brokers/ids/x node. The broker then goes on to register all the
> topics, etc. When consumers connect, they see topic nodes associated with
> the broker but they can't find the broker node to get connection information
> for the broker, sending them into a rebalance loop until they reach
> rebalance.retries.max and fail.
> This might also be a ZooKeeper issue, but the desired behavior for a
> disconnect case might be, if the broker node is found, to explicitly delete
> and recreate it.
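
The sequence described in the report, and the suggested "delete and recreate"
behavior, can be sketched with a small in-memory model. This is only an
illustration under stated assumptions, not Kafka's or ZooKeeper's actual code:
FakeZk, both register_* functions, simulate, and the "host2:9092" value are
hypothetical names made up for the example. The owner field stands in for the
session that created the ephemeral node (real ZooKeeper exposes this as
ephemeralOwner in the node's Stat).

```python
from dataclasses import dataclass


@dataclass
class Node:
    data: str
    owner: int  # id of the session that created this ephemeral node


class FakeZk:
    """Tiny in-memory stand-in for ZooKeeper ephemeral nodes (illustrative only)."""

    def __init__(self):
        self.nodes = {}

    def create_ephemeral(self, path, data, session):
        if path in self.nodes:
            raise FileExistsError(path)
        self.nodes[path] = Node(data, session)

    def delete(self, path):
        self.nodes.pop(path, None)

    def expire_session(self, session):
        # The server removes every ephemeral node owned by the expired session.
        self.nodes = {p: n for p, n in self.nodes.items() if n.owner != session}


def register_expect_conflict(zk, path, data, session):
    """Assumed model of the reported behavior: a node holding our own data counts as success."""
    try:
        zk.create_ephemeral(path, data, session)
    except FileExistsError:
        if zk.nodes[path].data != data:
            raise
        # Same data: assume we are already registered and do nothing.


def register_delete_and_recreate(zk, path, data, session):
    """Assumed model of the suggested fix: replace a node left behind by an older session."""
    existing = zk.nodes.get(path)
    if existing is not None and existing.owner != session:
        if existing.data != data:
            raise FileExistsError(path)  # a different broker holds this id
        zk.delete(path)
        existing = None
    if existing is None:
        zk.create_ephemeral(path, data, session)


def simulate(register):
    """Return True if the broker node still exists after the old session expires."""
    zk = FakeZk()
    path, data = "/brokers/ids/2", "host2:9092"
    zk.create_ephemeral(path, data, session=1)  # initial registration
    register(zk, path, data, session=2)         # broker reconnects with a new session
    zk.expire_session(1)                        # old session is expired afterwards
    return path in zk.nodes
```

In this model, simulate(register_expect_conflict) leaves the broker without a
node under /brokers/ids once the old session expires, while
simulate(register_delete_and_recreate) leaves a node owned by the new session.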