Bob Cotton created KAFKA-764:
--------------------------------

             Summary: Race Condition in Broker Registration after ZooKeeper disconnect
                 Key: KAFKA-764
                 URL: https://issues.apache.org/jira/browse/KAFKA-764
             Project: Kafka
          Issue Type: Bug
    Affects Versions: 0.7.1
            Reporter: Bob Cotton


When we run our ZooKeeper servers in VMware, occasionally all of them pause 
simultaneously, long enough for the Kafka clients to time out, and then they 
all un-pause simultaneously.

When this happens, the zk clients disconnect from ZooKeeper. When ZooKeeper 
comes back, ZkUtils.createEphemeralPathExpectConflict finds a broker id node 
that matches itself, does not re-register the broker id node, and the function 
call succeeds. ZooKeeper then figures out that the broker disconnected from the 
keeper and deletes the ephemeral node *after* allowing the consumer to read the 
data in the /brokers/ids/x node.  The broker then goes on to register all the 
topics, etc.  When consumers connect, they see topic nodes associated with the 
broker but they can't find the broker node to get connection information for 
the broker. This sends them into a rebalance loop until they reach 
rebalance.retries.max and fail.
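
The sequence above can be replayed with a small in-memory model. This is only a 
sketch: FakeZk and register_expect_conflict are hypothetical stand-ins written 
for illustration, not the real ZooKeeper client or the actual ZkUtils code, and 
the paths, host and session names are made up.

```python
class FakeZk:
    """Tiny in-memory stand-in for ZooKeeper ephemeral nodes (not a real client)."""

    def __init__(self):
        self.nodes = {}  # path -> (data, owning session id)

    def create_ephemeral(self, path, data, session):
        if path in self.nodes:
            raise KeyError("node exists: " + path)
        self.nodes[path] = (data, session)

    def read(self, path):
        return self.nodes[path][0]

    def expire_session(self, session):
        # The server removes every ephemeral node owned by the expired session.
        self.nodes = {p: v for p, v in self.nodes.items() if v[1] != session}


def register_expect_conflict(zk, path, data, session):
    """Mimics the reported behaviour: an existing node with equal data counts as success."""
    try:
        zk.create_ephemeral(path, data, session)
    except KeyError:
        if zk.read(path) != data:
            raise
        # Same data: assume the node is ours and carry on, even though it
        # may still belong to the previous session that is about to expire.


zk = FakeZk()
broker_path, broker_data = "/brokers/ids/1", "host1:9092"  # illustrative values
zk.create_ephemeral(broker_path, broker_data, session="old")  # before the pause

# After the reconnect the broker registers again under a new session.
register_expect_conflict(zk, broker_path, broker_data, session="new")
zk.create_ephemeral("/brokers/topics/t1/1", "1", session="new")

# ZooKeeper finally expires the old session and drops its ephemeral nodes.
zk.expire_session("old")

broker_visible = broker_path in zk.nodes
topic_visible = "/brokers/topics/t1/1" in zk.nodes
print(broker_visible, topic_visible)  # prints: False True
```

The topic node survives while the broker id node is gone, which is the state 
the consumers trip over.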

This might also be a ZooKeeper issue, but the desired behavior in the disconnect 
case might be: if the broker node is found, explicitly delete it and recreate it.
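
A minimal sketch of that delete-and-recreate idea, again against a hypothetical 
in-memory model (FakeZk and register_broker are illustrative names, not Kafka or 
ZooKeeper APIs). A real implementation would also have to handle the delete and 
create not being atomic.

```python
class FakeZk:
    """Tiny in-memory stand-in for ZooKeeper ephemeral nodes (not a real client)."""

    def __init__(self):
        self.nodes = {}  # path -> (data, owning session id)

    def create_ephemeral(self, path, data, session):
        if path in self.nodes:
            raise KeyError("node exists: " + path)
        self.nodes[path] = (data, session)

    def delete(self, path):
        self.nodes.pop(path, None)

    def expire_session(self, session):
        # The server removes every ephemeral node owned by the expired session.
        self.nodes = {p: v for p, v in self.nodes.items() if v[1] != session}


def register_broker(zk, path, data, session):
    """If the broker node is found, delete it, then create it under this session."""
    if path in zk.nodes:
        zk.delete(path)
    zk.create_ephemeral(path, data, session)


zk = FakeZk()
broker_path, broker_data = "/brokers/ids/1", "host1:9092"  # illustrative values
zk.create_ephemeral(broker_path, broker_data, session="old")  # before the pause

register_broker(zk, broker_path, broker_data, session="new")  # after reconnect
zk.expire_session("old")  # late expiry of the old session

# The node is now owned by the new session, so the late expiry leaves it alone.
print(zk.nodes[broker_path])  # prints: ('host1:9092', 'new')
```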

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
