[ https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152453#comment-14152453 ]
James Lent commented on KAFKA-1387: ----------------------------------- As for your question (which I agree is one of the key questions) I have the following comments: * The ZooKeeper documentation states there is one case where a watch may be missed which I do not think applies to the situation I am trying to address: "Watches are maintained locally at the ZooKeeper server to which the client is connected. This allows watches to be lightweight to set, maintain, and dispatch. When a client connects to a new server, the watch will be triggered for any session events. Watches will not be received while disconnected from a server. When a client reconnects, any previously registered watches will be reregistered and triggered if needed. In general this all occurs transparently. There is one case where a watch may be missed: a watch for the existence of a znode not yet created will be missed if the znode is created and deleted while disconnected." * In my testing the node is normally gone by the time the New Session event is handled which recreates the node. In that case I do not see a Delete message (I log that arrival of a delete event even if the node is already gone): {noformat} [2014-09-29 18:23:43,071] INFO zookeeper state changed (Expired) (org.I0Itec.zkclient.ZkClient) [2014-09-29 18:23:43,071] INFO Unable to reconnect to ZooKeeper service, session 0x148c36a0a94000f has expired, closing socket connection (org.apache.zookeeper.ClientCnxn) [2014-09-29 18:23:43,071] INFO Initiating client connection, connectString=localhost:2181/kafka/0.8 sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient@56404645 (org.apache.zookeeper.ZooKeeper) [2014-09-29 18:23:43,072] INFO Opening socket connection to server localhost/127.0.0.1:2181 (org.apache.zookeeper.ClientCnxn) [2014-09-29 18:23:43,073] INFO Socket connection established to localhost/127.0.0.1:2181, initiating session (org.apache.zookeeper.ClientCnxn) [2014-09-29 18:23:43,074] INFO EventThread shut down (org.apache.zookeeper.ClientCnxn) [2014-09-29 18:23:43,082] INFO Closing socket connection to /10.210.10.165. (kafka.network.Processor) [2014-09-29 18:23:43,087] INFO Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x148c36a0a940010, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn) [2014-09-29 18:23:43,087] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient) [2014-09-29 18:23:43,099] INFO 0 successfully elected as leader (kafka.server.ZookeeperLeaderElector) [2014-09-29 18:23:43,143] INFO New session started, recreate ephemeral node /brokers/ids/0 (kafka.utils.EphemeralNodeMonitor) [2014-09-29 18:23:43,144] INFO Start registering broker 0 in ZooKeeper (kafka.server.KafkaHealthcheck) [2014-09-29 18:23:43,161] INFO Registered broker 0 at path /brokers/ids/0 with address jlent.digitalsmiths.com:9092. (kafka.utils.ZkUtils$) [2014-09-29 18:23:43,218] INFO Ephemeral node /brokers/ids/0 has new data [{"jmx_port":10001,"timestamp":"1412029423148","host":"jlent.digitalsmiths.com","version":1,"port":9092}] (kafka.utils.EphemeralNodeMonitor) [2014-09-29 18:23:43,237] INFO New leader is 0 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener) {noformat} * I have seen cases where the node is still present when the New Session is handled and in that case I do see a Delete event a short while later. I don't have the logs that document that (don't ask me why I don't have logs to document the most important scenario). I will try to recreate that situation. * As an alternative I modified the New Session handling code to do nothing (except log the arrival of the new session event). In that case I do see the Delete event. This could perhaps be viewed a more severe test. In this case we get notified of a Delete that actually occured before we even handled the New Seesion event. That was actually how I did some of my original testing. {noformat} [2014-09-29 18:14:31,414] INFO zookeeper state changed (Expired) (org.I0Itec.zkclient.ZkClient) [2014-09-29 18:14:31,414] INFO Unable to reconnect to ZooKeeper service, session 0x148c36a0a94000c has expired, closing socket connection (org.apache.zookeeper.ClientCnxn) [2014-09-29 18:14:31,414] INFO Initiating client connection, connectString=localhost:2181/kafka/0.8 sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient@15c58840 (org.apache.zookeeper.ZooKeeper) [2014-09-29 18:14:31,414] INFO Opening socket connection to server localhost/127.0.0.1:2181 (org.apache.zookeeper.ClientCnxn) [2014-09-29 18:14:31,415] INFO EventThread shut down (org.apache.zookeeper.ClientCnxn) [2014-09-29 18:14:31,415] INFO Socket connection established to localhost/127.0.0.1:2181, initiating session (org.apache.zookeeper.ClientCnxn) [2014-09-29 18:14:31,420] INFO Closing socket connection to /10.210.10.165. (kafka.network.Processor) [2014-09-29 18:14:31,459] INFO Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x148c36a0a94000d, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn) [2014-09-29 18:14:31,459] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient) [2014-09-29 18:14:31,587] INFO 0 successfully elected as leader (kafka.server.ZookeeperLeaderElector) [2014-09-29 18:14:31,683] INFO New session started, DO NOT recreate ephemeral node /brokers/ids/0 (kafka.utils.EphemeralNodeMonitor) [2014-09-29 18:14:31,701] INFO Ephemeral node /brokers/ids/0 was deleted, recreate it (kafka.utils.EphemeralNodeMonitor) [2014-09-29 18:14:31,702] INFO Start registering broker 0 in ZooKeeper (kafka.server.KafkaHealthcheck) [2014-09-29 18:14:31,722] INFO Registered broker 0 at path /brokers/ids/0 with address jlent.digitalsmiths.com:9092. (kafka.utils.ZkUtils$) [2014-09-29 18:14:31,744] INFO New leader is 0 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener) [2014-09-29 18:14:31,746] INFO Ephemeral node /brokers/ids/0 has new data [{"jmx_port":10001,"timestamp":"1412028871704","host":"jlent.digitalsmiths.com","version":1,"port":9092}] (kafka.utils.EphemeralNodeMonitor) {noformat} > Kafka getting stuck creating ephemeral node it has already created when two > zookeeper sessions are established in a very short period of time > --------------------------------------------------------------------------------------------------------------------------------------------- > > Key: KAFKA-1387 > URL: https://issues.apache.org/jira/browse/KAFKA-1387 > Project: Kafka > Issue Type: Bug > Reporter: Fedor Korotkiy > > Kafka broker re-registers itself in zookeeper every time handleNewSession() > callback is invoked. > https://github.com/apache/kafka/blob/0.8.1/core/src/main/scala/kafka/server/KafkaHealthcheck.scala > > Now imagine the following sequence of events. > 1) Zookeeper session reestablishes. handleNewSession() callback is queued by > the zkClient, but not invoked yet. > 2) Zookeeper session reestablishes again, queueing callback second time. > 3) First callback is invoked, creating /broker/[id] ephemeral path. > 4) Second callback is invoked and it tries to create /broker/[id] path using > createEphemeralPathExpectConflictHandleZKBug() function. But the path is > already exists, so createEphemeralPathExpectConflictHandleZKBug() is getting > stuck in the infinite loop. > Seems like controller election code have the same issue. > I'am able to reproduce this issue on the 0.8.1 branch from github using the > following configs. > # zookeeper > tickTime=10 > dataDir=/tmp/zk/ > clientPort=2101 > maxClientCnxns=0 > # kafka > broker.id=1 > log.dir=/tmp/kafka > zookeeper.connect=localhost:2101 > zookeeper.connection.timeout.ms=100 > zookeeper.sessiontimeout.ms=100 > Just start kafka and zookeeper and then pause zookeeper several times using > Ctrl-Z. -- This message was sent by Atlassian JIRA (v6.3.4#6332)