[ https://issues.apache.org/jira/browse/KAFKA-3924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419873#comment-15419873 ]
Alexey Ozeritskiy commented on KAFKA-3924:
------------------------------------------

I've hit a deadlock with that patch. Stack traces:

{code}
"ReplicaFetcherThread-3-2" #112 prio=5 os_prio=0 tid=0x00007f0acc100000 nid=0xfd54f in Object.wait() [0x00007f0b141d7000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1245)
        - locked <0x00000003d8269bc8> (a kafka.Kafka$$anon$1)
        at java.lang.Thread.join(Thread.java:1319)
        at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:106)
        at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:46)
        at java.lang.Shutdown.runHooks(Shutdown.java:123)
        at java.lang.Shutdown.sequence(Shutdown.java:167)
        at java.lang.Shutdown.exit(Shutdown.java:212)
        - locked <0x00000003d8106e88> (a java.lang.Class for java.lang.Shutdown)
        at java.lang.Runtime.exit(Runtime.java:109)
        at java.lang.System.exit(System.java:971)
        at kafka.server.ReplicaFetcherThread.handleOffsetOutOfRange(ReplicaFetcherThread.scala:179)
{code}

{code}
"Thread-2" #29 prio=5 os_prio=0 tid=0x00007f0a70008000 nid=0xfecbf in Object.wait() [0x00007f0b166e5000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1245)
        - locked <0x00000003e5c46960> (a java.lang.Thread)
        at java.lang.Thread.join(Thread.java:1319)
        at kafka.server.KafkaRequestHandlerPool$$anonfun$shutdown$3.apply(KafkaRequestHandler.scala:92)
        at kafka.server.KafkaRequestHandlerPool$$anonfun$shutdown$3.apply(KafkaRequestHandler.scala:91)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
        at kafka.server.KafkaRequestHandlerPool.shutdown(KafkaRequestHandler.scala:91)
        at kafka.server.KafkaServer$$anonfun$shutdown$3.apply$mcV$sp(KafkaServer.scala:559)
        at kafka.utils.CoreUtils$.swallow(CoreUtils.scala:79)
        at kafka.utils.Logging$class.swallowWarn(Logging.scala:92)
        at kafka.utils.CoreUtils$.swallowWarn(CoreUtils.scala:51)
        at kafka.utils.Logging$class.swallow(Logging.scala:94)
        at kafka.utils.CoreUtils$.swallow(CoreUtils.scala:51)
        at kafka.server.KafkaServer.shutdown(KafkaServer.scala:559)
        at kafka.server.KafkaServerStartable.shutdown(KafkaServerStartable.scala:49)
        at kafka.Kafka$$anon$1.run(Kafka.scala:63)
{code}

System.exit runs the shutdown hook in Thread-2 and joins it (first trace). Thread-2 in turn joins ReplicaFetcherThread-3-2 (second trace). So the two threads wait for each other forever.
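For reference, a minimal standalone sketch of that join cycle (plain Scala, not Kafka code; the object and thread names are invented): a worker thread calls System.exit, the JVM then runs and joins the shutdown hook, and the hook joins the worker, so neither join can ever return.

{code}
// Minimal sketch of the reported cycle; nothing Kafka-specific is assumed.
object ShutdownHookJoinCycle {
  def main(args: Array[String]): Unit = {
    // Plays the role of ReplicaFetcherThread.handleOffsetOutOfRange calling System.exit:
    // exit runs the registered shutdown hooks and joins them, so this call never returns.
    val worker = new Thread("fetcher-like-worker") {
      override def run(): Unit = System.exit(1)
    }

    // Plays the role of Kafka's shutdown hook (kafka.Kafka$$anon$1), which ends up
    // joining the fetcher thread: it blocks forever because the worker is itself
    // stuck inside System.exit waiting for this very hook to finish.
    Runtime.getRuntime.addShutdownHook(new Thread("shutdown-hook") {
      override def run(): Unit = worker.join()
    })

    worker.start()
    worker.join() // the JVM hangs here: a circular wait between exit and the hook
  }
}
{code}

Running jstack against this process shows the same pair of WAITING-in-Thread.join traces as above.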
> Data loss due to halting when LEO is larger than leader's LEO
> -------------------------------------------------------------
>
>                 Key: KAFKA-3924
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3924
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.10.0.0
>            Reporter: Maysam Yabandeh
>             Fix For: 0.10.0.1
>
>
> Currently the follower broker panics when its LEO is larger than its leader's LEO and, assuming that this is an impossible state to reach, halts the process to prevent any further damage.
> {code}
>     if (leaderEndOffset < replica.logEndOffset.messageOffset) {
>       // Prior to truncating the follower's log, ensure that doing so is not disallowed by the configuration for unclean leader election.
>       // This situation could only happen if the unclean election configuration for a topic changes while a replica is down. Otherwise,
>       // we should never encounter this situation since a non-ISR leader cannot be elected if disallowed by the broker configuration.
>       if (!LogConfig.fromProps(brokerConfig.originals, AdminUtils.fetchEntityConfig(replicaMgr.zkUtils,
>         ConfigType.Topic, topicAndPartition.topic)).uncleanLeaderElectionEnable) {
>         // Log a fatal error and shutdown the broker to ensure that data loss does not unexpectedly occur.
>         fatal("...")
>         Runtime.getRuntime.halt(1)
>       }
> {code}
> Firstly, this assumption is invalid: there are legitimate cases (examples below) in which this state can actually occur. Secondly, halting causes the broker to lose its un-flushed data, and if multiple brokers halt simultaneously there is a chance that both the leader and the followers of a partition are among the halted brokers, which would result in permanent data loss.
> Given that this is a legitimate case, we suggest replacing the halt with a graceful shutdown to avoid propagating data loss to the entire cluster.
> Details:
> One legitimate case in which this can occur is when a troubled broker shrinks its partitions right before crashing (KAFKA-3410 and KAFKA-3861). In this case the broker has lost some data, but the controller still cannot elect the others as the leader. If the crashed broker comes back up, the controller elects it as the leader, and as a result all the other brokers now following it halt, since their LEOs are larger than those of the shrunk topics in the restarted broker. We actually had a case in which bringing up a crashed broker simultaneously took down the entire cluster, and as explained above this could result in data loss.
> The other legitimate case is when multiple brokers shut down ungracefully at the same time. In this case both the leader and the followers lose their un-flushed data, but one of them has an HW larger than the other. The controller elects the one that comes back up sooner as the leader, and if its LEO is less than its future follower's, the follower will halt (and probably lose more data). Simultaneous ungraceful shutdowns can happen due to hardware issues (e.g., a rack power failure), operator errors, or software issues (e.g., the case above that is further explained in KAFKA-3410 and KAFKA-3861 and causes simultaneous halts in multiple brokers).
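For illustration only, a minimal sketch of the direction the report suggests (this is not the actual Kafka patch; GracefulExitSketch and requestGracefulExit are invented names). It replaces the in-thread Runtime.getRuntime.halt(1) with a shutdown request that runs on its own thread, so the shutdown hooks get a chance to flush state while the calling fetcher-like thread returns and stays joinable, avoiding the join cycle reported in the comment above.

{code}
// Hedged sketch, not the real fix: request a JVM exit without blocking the caller.
object GracefulExitSketch {
  def requestGracefulExit(statusCode: Int): Unit = {
    val exiter = new Thread("graceful-exit") {
      // Unlike Runtime.getRuntime.halt, System.exit runs the registered shutdown
      // hooks, so the broker can close its components and flush logs before exiting.
      override def run(): Unit = System.exit(statusCode)
    }
    exiter.setDaemon(true) // must never be a thread the shutdown hooks try to join
    exiter.start()
  }
}
{code}

The key design point is that whichever thread the shutdown hook eventually joins must not itself be blocked inside System.exit.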