[ 
https://issues.apache.org/jira/browse/KAFKA-1387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152453#comment-14152453
 ] 

James Lent commented on KAFKA-1387:
-----------------------------------

As for your question (which I agree is one of the key questions) I have the 
following comments:

* The ZooKeeper documentation states there is one case where a watch may be 
missed which I do not think applies to the situation I am trying to address:

"Watches are maintained locally at the ZooKeeper server to which the client is 
connected. This allows watches to be lightweight to set, maintain, and 
dispatch. When a client connects to a new server, the watch will be triggered 
for any session events. Watches will not be received while disconnected from a 
server. When a client reconnects, any previously registered watches will be 
reregistered and triggered if needed. In general this all occurs transparently. 
There is one case where a watch may be missed: a watch for the existence of a 
znode not yet created will be missed if the znode is created and deleted while 
disconnected."

* In my testing the node is normally gone by the time the New Session event is 
handled which recreates the node. In that case I do not see a Delete message (I 
log that arrival of a delete event even if the node is already gone):

{noformat}
[2014-09-29 18:23:43,071] INFO zookeeper state changed (Expired) 
(org.I0Itec.zkclient.ZkClient)
[2014-09-29 18:23:43,071] INFO Unable to reconnect to ZooKeeper service, 
session 0x148c36a0a94000f has expired, closing socket connection 
(org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:23:43,071] INFO Initiating client connection, 
connectString=localhost:2181/kafka/0.8 sessionTimeout=6000 
watcher=org.I0Itec.zkclient.ZkClient@56404645 (org.apache.zookeeper.ZooKeeper)
[2014-09-29 18:23:43,072] INFO Opening socket connection to server 
localhost/127.0.0.1:2181 (org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:23:43,073] INFO Socket connection established to 
localhost/127.0.0.1:2181, initiating session (org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:23:43,074] INFO EventThread shut down 
(org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:23:43,082] INFO Closing socket connection to /10.210.10.165. 
(kafka.network.Processor)
[2014-09-29 18:23:43,087] INFO Session establishment complete on server 
localhost/127.0.0.1:2181, sessionid = 0x148c36a0a940010, negotiated timeout = 
6000 (org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:23:43,087] INFO zookeeper state changed (SyncConnected) 
(org.I0Itec.zkclient.ZkClient)
[2014-09-29 18:23:43,099] INFO 0 successfully elected as leader 
(kafka.server.ZookeeperLeaderElector)
[2014-09-29 18:23:43,143] INFO New session started, recreate ephemeral node 
/brokers/ids/0 (kafka.utils.EphemeralNodeMonitor)
[2014-09-29 18:23:43,144] INFO Start registering broker 0 in ZooKeeper 
(kafka.server.KafkaHealthcheck)
[2014-09-29 18:23:43,161] INFO Registered broker 0 at path /brokers/ids/0 with 
address jlent.digitalsmiths.com:9092. (kafka.utils.ZkUtils$)
[2014-09-29 18:23:43,218] INFO Ephemeral node /brokers/ids/0 has new data 
[{"jmx_port":10001,"timestamp":"1412029423148","host":"jlent.digitalsmiths.com","version":1,"port":9092}]
 (kafka.utils.EphemeralNodeMonitor)
[2014-09-29 18:23:43,237] INFO New leader is 0 
(kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
{noformat}

* I have seen cases where the node is still present when the New Session is 
handled and in that case I do see a Delete event a short while later.  I don't 
have the logs that document that (don't ask me why I don't have logs to 
document the most important scenario).  I will try to recreate that situation.
* As an alternative I modified the New Session handling code to do nothing 
(except log the arrival of the new session event).  In that case I do see the 
Delete event.  This could perhaps be viewed a more severe test.  In this case 
we get notified of a Delete that actually occured before we even handled the 
New Seesion event.  That was actually how I did some of my original testing.

{noformat}
[2014-09-29 18:14:31,414] INFO zookeeper state changed (Expired) 
(org.I0Itec.zkclient.ZkClient)
[2014-09-29 18:14:31,414] INFO Unable to reconnect to ZooKeeper service, 
session 0x148c36a0a94000c has expired, closing socket connection 
(org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:14:31,414] INFO Initiating client connection, 
connectString=localhost:2181/kafka/0.8 sessionTimeout=6000 
watcher=org.I0Itec.zkclient.ZkClient@15c58840 (org.apache.zookeeper.ZooKeeper)
[2014-09-29 18:14:31,414] INFO Opening socket connection to server 
localhost/127.0.0.1:2181 (org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:14:31,415] INFO EventThread shut down 
(org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:14:31,415] INFO Socket connection established to 
localhost/127.0.0.1:2181, initiating session (org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:14:31,420] INFO Closing socket connection to /10.210.10.165. 
(kafka.network.Processor)
[2014-09-29 18:14:31,459] INFO Session establishment complete on server 
localhost/127.0.0.1:2181, sessionid = 0x148c36a0a94000d, negotiated timeout = 
6000 (org.apache.zookeeper.ClientCnxn)
[2014-09-29 18:14:31,459] INFO zookeeper state changed (SyncConnected) 
(org.I0Itec.zkclient.ZkClient)
[2014-09-29 18:14:31,587] INFO 0 successfully elected as leader 
(kafka.server.ZookeeperLeaderElector)
[2014-09-29 18:14:31,683] INFO New session started, DO NOT recreate ephemeral 
node /brokers/ids/0 (kafka.utils.EphemeralNodeMonitor)
[2014-09-29 18:14:31,701] INFO Ephemeral node /brokers/ids/0 was deleted, 
recreate it (kafka.utils.EphemeralNodeMonitor)
[2014-09-29 18:14:31,702] INFO Start registering broker 0 in ZooKeeper 
(kafka.server.KafkaHealthcheck)
[2014-09-29 18:14:31,722] INFO Registered broker 0 at path /brokers/ids/0 with 
address jlent.digitalsmiths.com:9092. (kafka.utils.ZkUtils$)
[2014-09-29 18:14:31,744] INFO New leader is 0 
(kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
[2014-09-29 18:14:31,746] INFO Ephemeral node /brokers/ids/0 has new data 
[{"jmx_port":10001,"timestamp":"1412028871704","host":"jlent.digitalsmiths.com","version":1,"port":9092}]
 (kafka.utils.EphemeralNodeMonitor)
{noformat}


> Kafka getting stuck creating ephemeral node it has already created when two 
> zookeeper sessions are established in a very short period of time
> ---------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-1387
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1387
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Fedor Korotkiy
>
> Kafka broker re-registers itself in zookeeper every time handleNewSession() 
> callback is invoked.
> https://github.com/apache/kafka/blob/0.8.1/core/src/main/scala/kafka/server/KafkaHealthcheck.scala
>  
> Now imagine the following sequence of events.
> 1) Zookeeper session reestablishes. handleNewSession() callback is queued by 
> the zkClient, but not invoked yet.
> 2) Zookeeper session reestablishes again, queueing callback second time.
> 3) First callback is invoked, creating /broker/[id] ephemeral path.
> 4) Second callback is invoked and it tries to create /broker/[id] path using 
> createEphemeralPathExpectConflictHandleZKBug() function. But the path is 
> already exists, so createEphemeralPathExpectConflictHandleZKBug() is getting 
> stuck in the infinite loop.
> Seems like controller election code have the same issue.
> I'am able to reproduce this issue on the 0.8.1 branch from github using the 
> following configs.
> # zookeeper
> tickTime=10
> dataDir=/tmp/zk/
> clientPort=2101
> maxClientCnxns=0
> # kafka
> broker.id=1
> log.dir=/tmp/kafka
> zookeeper.connect=localhost:2101
> zookeeper.connection.timeout.ms=100
> zookeeper.sessiontimeout.ms=100
> Just start kafka and zookeeper and then pause zookeeper several times using 
> Ctrl-Z.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to