[ 
https://issues.apache.org/jira/browse/KAFKA-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865266#comment-16865266
 ] 

leibo edited comment on KAFKA-8532 at 6/17/19 2:53 AM:
-------------------------------------------------------

[~ijuma] I have tried to reproduce this but failed; it reappears a few days 
after I restart the kafka cluster.

The following info may be helpful:
 # We are running the kafka cluster in docker containers managed by kubernetes.
 # Every time this issue has occurred so far, the zookeeper container's CPU 
limit was set to 1 core while the host had more than 50 cores.
 # With the zookeeper container limited to 1 core on such a host, the JVM 
inside the container sizes its GC thread pool from the host's core count 
rather than the container limit, so zookeeper ends up with more than 38 
parallel GC worker threads (according to the JVM documentation, 1 core should 
correspond to about 2 GC worker threads). These GC threads can slow zookeeper 
down and may cause the kafka-zookeeper session to time out.
 # After we set *-XX:ParallelGCThreads=4* on the zookeeper JVM, the issue has 
not recurred (see the sketch after this list), but I think this tuning only 
reduces the probability of the problem rather than fixing it.
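
For reference, here is a minimal check (plain Java, assuming a HotSpot JVM; 
the class name GcThreadCheck is just for illustration) that prints how many 
CPUs the JVM sees inside the container and how many parallel GC worker 
threads its ergonomics selected:

{code:java}
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class GcThreadCheck {
    public static void main(String[] args) {
        // On a JVM that is not container-aware, this reports the host's
        // core count (50+ here), not the cgroup CPU limit (1 core).
        System.out.println("availableProcessors = "
                + Runtime.getRuntime().availableProcessors());

        // The GC worker count chosen by JVM ergonomics, which is derived
        // from the CPU count above unless pinned on the command line.
        HotSpotDiagnosticMXBean hotspot = ManagementFactory
                .getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        System.out.println("ParallelGCThreads = "
                + hotspot.getVMOption("ParallelGCThreads").getValue());
    }
}
{code}

Pinning the count explicitly, as we did with *-XX:ParallelGCThreads=4*, caps 
the GC workers regardless of how many CPUs the JVM detects; on 
container-aware JDKs, *-XX:ActiveProcessorCount=1* should address the 
mismatch at its source.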

Based on the above, I think that when this problem occurs, zookeeper is 
running poorly and the kafka controller deadlocks; after some time zookeeper 
recovers, but kafka does not.



> controller-event-thread deadlock with zk-session-expiry-handler0
> ----------------------------------------------------------------
>
>                 Key: KAFKA-8532
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8532
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.1.1
>            Reporter: leibo
>            Priority: Blocker
>         Attachments: js.log
>
>
> We have observed a serious deadlock between the controller-event-thread and 
> the zk-session-expiry-handler thread. When this issue occurs, the only way 
> to recover the kafka cluster is to restart the kafka server. The following 
> is the jstack output for the controller-event-thread and the 
> zk-session-expiry-handler thread.
> "zk-session-expiry-handler0" #163089 daemon prio=5 os_prio=0 
> tid=0x00007fcc9c010000 nid=0xfb22 waiting on condition [0x00007fcbb01f8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x00000005ee3f7000> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231) // 
> waiting for the controller-event-thread to process the Expire event
>  at 
> kafka.controller.KafkaController$Expire.waitUntilProcessingStarted(KafkaController.scala:1533)
>  at 
> kafka.controller.KafkaController$$anon$7.beforeInitializingSession(KafkaController.scala:173)
>  at 
> kafka.zookeeper.ZooKeeperClient.callBeforeInitializingSession(ZooKeeperClient.scala:408)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$reinitialize$1$adapted(ZooKeeperClient.scala:374)
>  at kafka.zookeeper.ZooKeeperClient$$Lambda$1473/1823438251.apply(Unknown 
> Source)
>  at scala.collection.Iterator.foreach(Iterator.scala:937)
>  at scala.collection.Iterator.foreach$(Iterator.scala:937)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1425)
>  at scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:209)
>  at kafka.zookeeper.ZooKeeperClient.reinitialize(ZooKeeperClient.scala:374)
>  at 
> kafka.zookeeper.ZooKeeperClient.$anonfun$scheduleSessionExpiryHandler$1(ZooKeeperClient.scala:428)
>  at 
> kafka.zookeeper.ZooKeeperClient$$Lambda$1471/701792920.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.KafkaScheduler.$anonfun$schedule$2(KafkaScheduler.scala:114)
>  at kafka.utils.KafkaScheduler$$Lambda$198/1048098469.apply$mcV$sp(Unknown 
> Source)
>  at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:63)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
>  - <0x0000000661e8d2e0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> "controller-event-thread" #51 prio=5 os_prio=0 tid=0x00007fceaeec4000 
> nid=0x310 waiting on condition [0x00007fccb55c8000]
>  java.lang.Thread.State: WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x00000005d1be5a00> (a 
> java.util.concurrent.CountDownLatch$Sync)
>  at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
>  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
>  at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
>  at kafka.zookeeper.ZooKeeperClient.handleRequests(ZooKeeperClient.scala:157)
>  at 
> kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1596)
>  at 
> kafka.zk.KafkaZkClient.kafka$zk$KafkaZkClient$$retryRequestUntilConnected(KafkaZkClient.scala:1589)
>  at 
> kafka.zk.KafkaZkClient.deletePreferredReplicaElection(KafkaZkClient.scala:989)
>  at 
> kafka.controller.KafkaController.removePartitionsFromPreferredReplicaElection(KafkaController.scala:873)
>  at 
> kafka.controller.KafkaController.kafka$controller$KafkaController$$onPreferredReplicaElection(KafkaController.scala:631)
>  at 
> kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:266)
>  at 
> kafka.controller.KafkaController.kafka$controller$KafkaController$$elect(KafkaController.scala:1221)
>  at 
> kafka.controller.KafkaController$Reelect$.process(KafkaController.scala:1508)
>  at 
> kafka.controller.KafkaController$RegisterBrokerAndReelect$.process(KafkaController.scala:1517)
>  at 
> kafka.controller.ControllerEventManager$ControllerEventThread.$anonfun$doWork$1(ControllerEventManager.scala:89)
>  at 
> kafka.controller.ControllerEventManager$ControllerEventThread$$Lambda$362/918424570.apply$mcV$sp(Unknown
>  Source)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
>  at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
>  at 
> kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:89)
>  at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> Locked ownable synchronizers:
>  - None
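>
> The two stacks above show a circular wait on two CountDownLatch instances: 
> the zk-session-expiry-handler0 thread will not reinitialize the session 
> until the controller-event-thread starts processing the Expire event, while 
> the controller-event-thread is parked waiting for a ZooKeeper response that 
> can only arrive after the session is reinitialized. A minimal standalone 
> sketch of that pattern (plain Java, not Kafka code; names are illustrative):
> {code:java}
> import java.util.concurrent.CountDownLatch;
>
> public class LatchDeadlockSketch {
>     public static void main(String[] args) throws InterruptedException {
>         // Counted down when the controller thread dequeues the Expire event.
>         CountDownLatch processingStarted = new CountDownLatch(1);
>         // Counted down when the pending ZooKeeper request gets a response.
>         CountDownLatch zkResponseReceived = new CountDownLatch(1);
>
>         // Plays zk-session-expiry-handler0: it will not reinitialize the
>         // session (which would complete the ZK request) until the Expire
>         // event is being processed.
>         Thread expiryHandler = new Thread(() -> {
>             try {
>                 processingStarted.await();      // parks forever
>                 zkResponseReceived.countDown(); // never reached
>             } catch (InterruptedException ignored) { }
>         }, "zk-session-expiry-handler0");
>
>         // Plays controller-event-thread: it is stuck retrying a ZK request,
>         // so it never dequeues the Expire event.
>         Thread controller = new Thread(() -> {
>             try {
>                 zkResponseReceived.await();     // parks forever
>                 processingStarted.countDown();  // never reached
>             } catch (InterruptedException ignored) { }
>         }, "controller-event-thread");
>
>         expiryHandler.start();
>         controller.start();
>         expiryHandler.join(); // never returns: a two-latch circular wait
>     }
> }
> {code}
> Run as-is, both threads park forever on CountDownLatch.await, which is 
> exactly the parked state captured in the jstack output above.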



