[ 
https://issues.apache.org/jira/browse/KAFKA-8188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Candice Wan updated KAFKA-8188:
-------------------------------
    Description: 
We recently upgraded to 2.1.1 and saw the ZooKeeper connection issues below, 
which took down the whole cluster. We have 3 nodes in the cluster, 2 of which 
threw the exceptions below within the same second.

2019-04-03 08:25:19.603 
[main-SendThread(iaase00003184.svr.emea.jpmchase.net:36100)] WARN 
org.apache.zookeeper.ClientCnxn - Unable to reconnect to ZooKeeper service, 
session 0x10071ff9baf0001 has expired
 2019-04-03 08:25:19.603 
[main-SendThread(iaase00003184.svr.emea.jpmchase.net:36100)] INFO 
org.apache.zookeeper.ClientCnxn - Unable to reconnect to ZooKeeper service, 
session 0x10071ff9baf0001 has expired, closing socket connection
 2019-04-03 08:25:19.605 [main-EventThread] INFO 
org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 
0x10071ff9baf0001
 2019-04-03 08:25:19.605 [zk-session-expiry-handler0] INFO 
kafka.zookeeper.ZooKeeperClient - [ZooKeeperClient] Session expired.
 2019-04-03 08:25:19.609 [zk-session-expiry-handler0] INFO 
kafka.zookeeper.ZooKeeperClient - [ZooKeeperClient] Initializing a new session 
to 
vsie5p0551.svr.emea.jpmchase.net:36100,iaase00003184.svr.emea.jpmchase.net:36100,iaase00003360.svr.emea.jpmchase.net:36100.
 2019-04-03 08:25:19.610 [zk-session-expiry-handler0] INFO 
org.apache.zookeeper.ZooKeeper - Initiating client connection, 
connectString=vsie5p0551.svr.emea.jpmchase.net:36100,iaase00003184.svr.emea.jpmchase.net:36100,iaase00003360.svr.emea.jpmchase.net:36100
 sessionTimeout=6000 
watcher=kafka.zookeeper.ZooKeeperClient$ZooKeeperClientWatcher$@12f8b1d8
 2019-04-03 08:25:19.610 [zk-session-expiry-handler0] INFO 
o.apache.zookeeper.ClientCnxnSocket - jute.maxbuffer value is 4194304 Bytes
 2019-04-03 08:25:19.611 
[zk-session-expiry-handler0-SendThread(vsie5p0551.svr.emea.jpmchase.net:36100)] 
WARN org.apache.zookeeper.ClientCnxn - SASL configuration failed: 
javax.security.auth.login.LoginException: No JAAS configuration section named 
'Client' was found in specified JAAS configuration file: 
'file:/app0/common/config/ldap-auth.config'. Will continue connection to 
Zookeeper server without SASL authentication, if Zookeeper server allows it.
 2019-04-03 08:25:19.611 
[zk-session-expiry-handler0-SendThread(vsie5p0551.svr.emea.jpmchase.net:36100)] 
INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server 
vsie5p0551.svr.emea.jpmchase.net/169.30.47.206:36100
 2019-04-03 08:25:19.611 [zk-session-expiry-handler0-EventThread] ERROR 
kafka.zookeeper.ZooKeeperClient - [ZooKeeperClient] Auth failed.
 2019-04-03 08:25:19.611 
[zk-session-expiry-handler0-SendThread(vsie5p0551.svr.emea.jpmchase.net:36100)] 
INFO org.apache.zookeeper.ClientCnxn - Socket connection established, 
initiating session, client: /169.20.222.18:56876, server: 
vsie5p0551.svr.emea.jpmchase.net/169.30.47.206:36100
 2019-04-03 08:25:19.612 [controller-event-thread] INFO 
k.controller.PartitionStateMachine - [PartitionStateMachine controllerId=3] 
Stopped partition state machine
 2019-04-03 08:25:19.613 [controller-event-thread] INFO 
kafka.controller.ReplicaStateMachine - [ReplicaStateMachine controllerId=3] 
Stopped replica state machine
 2019-04-03 08:25:19.614 [controller-event-thread] INFO 
kafka.controller.KafkaController - [Controller id=3] Resigned
 2019-04-03 08:25:19.615 [controller-event-thread] INFO kafka.zk.KafkaZkClient 
- Creating /brokers/ids/3 (is it secure? false)
 2019-04-03 08:25:19.628 
[zk-session-expiry-handler0-SendThread(vsie5p0551.svr.emea.jpmchase.net:36100)] 
INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server 
vsie5p0551.svr.emea.jpmchase.net/169.30.47.206:36100, sessionid = 
0x1007f4d2b810000, negotiated timeout = 6000
 2019-04-03 08:25:19.631 [/config/changes-event-process-thread] INFO 
k.c.ZkNodeChangeNotificationListener - Processing notification(s) to 
/config/changes
 2019-04-03 08:25:19.637 [controller-event-thread] ERROR 
k.zk.KafkaZkClient$CheckedEphemeral - Error while creating ephemeral at 
/brokers/ids/3, node already exists and owner '72182936680464385' does not 
match current session '72197563457011712'
 2019-04-03 08:25:19.637 [controller-event-thread] INFO kafka.zk.KafkaZkClient 
- Result of znode creation at /brokers/ids/3 is: NODEEXISTS
 2019-04-03 08:25:19.644 [controller-event-thread] ERROR 
k.c.ControllerEventManager$ControllerEventThread - [ControllerEventThread 
controllerId=3] Error processing event RegisterBrokerAndReelect
 org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
NodeExists
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:126)
 at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1631)
 at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:87)
 at 
kafka.controller.KafkaController$RegisterBrokerAndReelect$.process(KafkaController.scala:1516)
 at 
kafka.controller.ControllerEventManager$ControllerEventThread.$anonfun$doWork$1(ControllerEventManager.scala:89)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
 at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
 at 
kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:89)
 at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
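
Note on the session IDs in the log above: ZooKeeperClient logs session IDs in 
hex, while the CheckedEphemeral error prints the znode owner and the current 
session in decimal. The short Python sketch below only converts the values 
copied from the log so the two notations can be compared. The variable names 
are illustrative and not taken from Kafka. That the stale /brokers/ids/3 node 
was still owned by the expired session is one reading of the log, not a 
confirmed root cause.

```python
# Session IDs copied from the ClientCnxn log lines (hex notation).
expired_session_hex = "0x10071ff9baf0001"  # session reported as expired
new_session_hex = "0x1007f4d2b810000"      # session established after expiry

# Values copied from the CheckedEphemeral error (decimal notation).
znode_owner = 72182936680464385
current_session = 72197563457011712

# Convert the decimal values to hex so they can be compared with the log.
owner_hex = hex(znode_owner)
current_hex = hex(current_session)

# True if the existing znode is owned by the session that just expired.
owner_is_expired_session = owner_hex == expired_session_hex
# True if the broker is registering under the newly established session.
current_is_new_session = current_hex == new_session_hex

print("znode owner:    ", owner_hex, "expired session?", owner_is_expired_session)
print("current session:", current_hex, "new session?", current_is_new_session)
```

If both comparisons hold, the ephemeral node from the expired session was 
apparently still present when the broker tried to register again under the 
new session.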

 

Thread dump attached

 

  was:
We recently upgraded to 2.1.1 and we saw below zookeeper connection issues 
which took down the whole cluster. We've got 3 nodes in the cluster, 2 of which 
had issues.

2019-04-03 08:25:19.603 
[main-SendThread(iaase00003184.svr.emea.jpmchase.net:36100)] WARN 
org.apache.zookeeper.ClientCnxn - Unable to reconnect to ZooKeeper service, 
session 0x10071ff9baf0001 has expired
2019-04-03 08:25:19.603 
[main-SendThread(iaase00003184.svr.emea.jpmchase.net:36100)] INFO 
org.apache.zookeeper.ClientCnxn - Unable to reconnect to ZooKeeper service, 
session 0x10071ff9baf0001 has expired, closing socket connection
2019-04-03 08:25:19.605 [main-EventThread] INFO org.apache.zookeeper.ClientCnxn 
- EventThread shut down for session: 0x10071ff9baf0001
2019-04-03 08:25:19.605 [zk-session-expiry-handler0] INFO 
kafka.zookeeper.ZooKeeperClient - [ZooKeeperClient] Session expired.
2019-04-03 08:25:19.609 [zk-session-expiry-handler0] INFO 
kafka.zookeeper.ZooKeeperClient - [ZooKeeperClient] Initializing a new session 
to 
vsie5p0551.svr.emea.jpmchase.net:36100,iaase00003184.svr.emea.jpmchase.net:36100,iaase00003360.svr.emea.jpmchase.net:36100.
2019-04-03 08:25:19.610 [zk-session-expiry-handler0] INFO 
org.apache.zookeeper.ZooKeeper - Initiating client connection, 
connectString=vsie5p0551.svr.emea.jpmchase.net:36100,iaase00003184.svr.emea.jpmchase.net:36100,iaase00003360.svr.emea.jpmchase.net:36100
 sessionTimeout=6000 
watcher=kafka.zookeeper.ZooKeeperClient$ZooKeeperClientWatcher$@12f8b1d8
2019-04-03 08:25:19.610 [zk-session-expiry-handler0] INFO 
o.apache.zookeeper.ClientCnxnSocket - jute.maxbuffer value is 4194304 Bytes
2019-04-03 08:25:19.611 
[zk-session-expiry-handler0-SendThread(vsie5p0551.svr.emea.jpmchase.net:36100)] 
WARN org.apache.zookeeper.ClientCnxn - SASL configuration failed: 
javax.security.auth.login.LoginException: No JAAS configuration section named 
'Client' was found in specified JAAS configuration file: 
'file:/app0/common/config/ldap-auth.config'. Will continue connection to 
Zookeeper server without SASL authentication, if Zookeeper server allows it.
2019-04-03 08:25:19.611 
[zk-session-expiry-handler0-SendThread(vsie5p0551.svr.emea.jpmchase.net:36100)] 
INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server 
vsie5p0551.svr.emea.jpmchase.net/169.30.47.206:36100
2019-04-03 08:25:19.611 [zk-session-expiry-handler0-EventThread] ERROR 
kafka.zookeeper.ZooKeeperClient - [ZooKeeperClient] Auth failed.
2019-04-03 08:25:19.611 
[zk-session-expiry-handler0-SendThread(vsie5p0551.svr.emea.jpmchase.net:36100)] 
INFO org.apache.zookeeper.ClientCnxn - Socket connection established, 
initiating session, client: /169.20.222.18:56876, server: 
vsie5p0551.svr.emea.jpmchase.net/169.30.47.206:36100
2019-04-03 08:25:19.612 [controller-event-thread] INFO 
k.controller.PartitionStateMachine - [PartitionStateMachine controllerId=3] 
Stopped partition state machine
2019-04-03 08:25:19.613 [controller-event-thread] INFO 
kafka.controller.ReplicaStateMachine - [ReplicaStateMachine controllerId=3] 
Stopped replica state machine
2019-04-03 08:25:19.614 [controller-event-thread] INFO 
kafka.controller.KafkaController - [Controller id=3] Resigned
2019-04-03 08:25:19.615 [controller-event-thread] INFO kafka.zk.KafkaZkClient - 
Creating /brokers/ids/3 (is it secure? false)
2019-04-03 08:25:19.628 
[zk-session-expiry-handler0-SendThread(vsie5p0551.svr.emea.jpmchase.net:36100)] 
INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server 
vsie5p0551.svr.emea.jpmchase.net/169.30.47.206:36100, sessionid = 
0x1007f4d2b810000, negotiated timeout = 6000
2019-04-03 08:25:19.631 [/config/changes-event-process-thread] INFO 
k.c.ZkNodeChangeNotificationListener - Processing notification(s) to 
/config/changes
2019-04-03 08:25:19.637 [controller-event-thread] ERROR 
k.zk.KafkaZkClient$CheckedEphemeral - Error while creating ephemeral at 
/brokers/ids/3, node already exists and owner '72182936680464385' does not 
match current session '72197563457011712'
2019-04-03 08:25:19.637 [controller-event-thread] INFO kafka.zk.KafkaZkClient - 
Result of znode creation at /brokers/ids/3 is: NODEEXISTS
2019-04-03 08:25:19.644 [controller-event-thread] ERROR 
k.c.ControllerEventManager$ControllerEventThread - [ControllerEventThread 
controllerId=3] Error processing event RegisterBrokerAndReelect
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
NodeExists
 at org.apache.zookeeper.KeeperException.create(KeeperException.java:126)
 at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1631)
 at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:87)
 at 
kafka.controller.KafkaController$RegisterBrokerAndReelect$.process(KafkaController.scala:1516)
 at 
kafka.controller.ControllerEventManager$ControllerEventThread.$anonfun$doWork$1(ControllerEventManager.scala:89)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
 at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
 at 
kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:89)
 at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)

 

Thread dump attached

 


> Zookeeper Connection Issue Take Down the Whole kafka cluster
> ------------------------------------------------------------
>
>                 Key: KAFKA-8188
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8188
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.1.1
>            Reporter: Candice Wan
>            Priority: Critical
>         Attachments: thread_dump.log
>
>
> We recently upgraded to 2.1.1 and saw the ZooKeeper connection issues below, 
> which took down the whole cluster. We have 3 nodes in the cluster, 2 of which 
> threw the exceptions below within the same second.
> 2019-04-03 08:25:19.603 
> [main-SendThread(iaase00003184.svr.emea.jpmchase.net:36100)] WARN 
> org.apache.zookeeper.ClientCnxn - Unable to reconnect to ZooKeeper service, 
> session 0x10071ff9baf0001 has expired
>  2019-04-03 08:25:19.603 
> [main-SendThread(iaase00003184.svr.emea.jpmchase.net:36100)] INFO 
> org.apache.zookeeper.ClientCnxn - Unable to reconnect to ZooKeeper service, 
> session 0x10071ff9baf0001 has expired, closing socket connection
>  2019-04-03 08:25:19.605 [main-EventThread] INFO 
> org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 
> 0x10071ff9baf0001
>  2019-04-03 08:25:19.605 [zk-session-expiry-handler0] INFO 
> kafka.zookeeper.ZooKeeperClient - [ZooKeeperClient] Session expired.
>  2019-04-03 08:25:19.609 [zk-session-expiry-handler0] INFO 
> kafka.zookeeper.ZooKeeperClient - [ZooKeeperClient] Initializing a new 
> session to 
> vsie5p0551.svr.emea.jpmchase.net:36100,iaase00003184.svr.emea.jpmchase.net:36100,iaase00003360.svr.emea.jpmchase.net:36100.
>  2019-04-03 08:25:19.610 [zk-session-expiry-handler0] INFO 
> org.apache.zookeeper.ZooKeeper - Initiating client connection, 
> connectString=vsie5p0551.svr.emea.jpmchase.net:36100,iaase00003184.svr.emea.jpmchase.net:36100,iaase00003360.svr.emea.jpmchase.net:36100
>  sessionTimeout=6000 
> watcher=kafka.zookeeper.ZooKeeperClient$ZooKeeperClientWatcher$@12f8b1d8
>  2019-04-03 08:25:19.610 [zk-session-expiry-handler0] INFO 
> o.apache.zookeeper.ClientCnxnSocket - jute.maxbuffer value is 4194304 Bytes
>  2019-04-03 08:25:19.611 
> [zk-session-expiry-handler0-SendThread(vsie5p0551.svr.emea.jpmchase.net:36100)]
>  WARN org.apache.zookeeper.ClientCnxn - SASL configuration failed: 
> javax.security.auth.login.LoginException: No JAAS configuration section named 
> 'Client' was found in specified JAAS configuration file: 
> 'file:/app0/common/config/ldap-auth.config'. Will continue connection to 
> Zookeeper server without SASL authentication, if Zookeeper server allows it.
>  2019-04-03 08:25:19.611 
> [zk-session-expiry-handler0-SendThread(vsie5p0551.svr.emea.jpmchase.net:36100)]
>  INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server 
> vsie5p0551.svr.emea.jpmchase.net/169.30.47.206:36100
>  2019-04-03 08:25:19.611 [zk-session-expiry-handler0-EventThread] ERROR 
> kafka.zookeeper.ZooKeeperClient - [ZooKeeperClient] Auth failed.
>  2019-04-03 08:25:19.611 
> [zk-session-expiry-handler0-SendThread(vsie5p0551.svr.emea.jpmchase.net:36100)]
>  INFO org.apache.zookeeper.ClientCnxn - Socket connection established, 
> initiating session, client: /169.20.222.18:56876, server: 
> vsie5p0551.svr.emea.jpmchase.net/169.30.47.206:36100
>  2019-04-03 08:25:19.612 [controller-event-thread] INFO 
> k.controller.PartitionStateMachine - [PartitionStateMachine controllerId=3] 
> Stopped partition state machine
>  2019-04-03 08:25:19.613 [controller-event-thread] INFO 
> kafka.controller.ReplicaStateMachine - [ReplicaStateMachine controllerId=3] 
> Stopped replica state machine
>  2019-04-03 08:25:19.614 [controller-event-thread] INFO 
> kafka.controller.KafkaController - [Controller id=3] Resigned
>  2019-04-03 08:25:19.615 [controller-event-thread] INFO 
> kafka.zk.KafkaZkClient - Creating /brokers/ids/3 (is it secure? false)
>  2019-04-03 08:25:19.628 
> [zk-session-expiry-handler0-SendThread(vsie5p0551.svr.emea.jpmchase.net:36100)]
>  INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on 
> server vsie5p0551.svr.emea.jpmchase.net/169.30.47.206:36100, sessionid = 
> 0x1007f4d2b810000, negotiated timeout = 6000
>  2019-04-03 08:25:19.631 [/config/changes-event-process-thread] INFO 
> k.c.ZkNodeChangeNotificationListener - Processing notification(s) to 
> /config/changes
>  2019-04-03 08:25:19.637 [controller-event-thread] ERROR 
> k.zk.KafkaZkClient$CheckedEphemeral - Error while creating ephemeral at 
> /brokers/ids/3, node already exists and owner '72182936680464385' does not 
> match current session '72197563457011712'
>  2019-04-03 08:25:19.637 [controller-event-thread] INFO 
> kafka.zk.KafkaZkClient - Result of znode creation at /brokers/ids/3 is: 
> NODEEXISTS
>  2019-04-03 08:25:19.644 [controller-event-thread] ERROR 
> k.c.ControllerEventManager$ControllerEventThread - [ControllerEventThread 
> controllerId=3] Error processing event RegisterBrokerAndReelect
>  org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
> NodeExists
>  at org.apache.zookeeper.KeeperException.create(KeeperException.java:126)
>  at kafka.zk.KafkaZkClient.checkedEphemeralCreate(KafkaZkClient.scala:1631)
>  at kafka.zk.KafkaZkClient.registerBroker(KafkaZkClient.scala:87)
>  at 
> kafka.controller.KafkaController$RegisterBrokerAndReelect$.process(KafkaController.scala:1516)
>  at 
> kafka.controller.ControllerEventManager$ControllerEventThread.$anonfun$doWork$1(ControllerEventManager.scala:89)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
>  at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
>  at 
> kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:89)
>  at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
>  
> Thread dump attached
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
