[ https://issues.apache.org/jira/browse/KAFKA-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16147070#comment-16147070 ]
Ivan Babrou commented on KAFKA-3039:
------------------------------------

We also experienced this: out of 28 upgraded nodes in one rack, 4 nodes each decided to nuke 1 partition (a different partition on each node):

{noformat}
2017-08-30T10:17:29.509 node-93 WARN [ReplicaFetcherThread-0-10042]: Based on follower's leader epoch, leader replied with an unknown offset in requests-48. High watermark 0 will be used for truncation. (kafka.server.ReplicaFetcherThread)
2017-08-30T10:17:29.510 node-93 INFO Truncating log requests-48 to offset 0. (kafka.log.Log)
--
2017-08-30T10:17:29.536 node-93 WARN [ReplicaFetcherThread-0-10082]: Based on follower's leader epoch, leader replied with an unknown offset in requests-80. High watermark 0 will be used for truncation. (kafka.server.ReplicaFetcherThread)
2017-08-30T10:17:29.536 node-93 INFO Truncating log requests-80 to offset 0. (kafka.log.Log)
--
2017-08-30T10:26:32.203 node-87 WARN [ReplicaFetcherThread-2-10056]: Based on follower's leader epoch, leader replied with an unknown offset in requests-82. High watermark 0 will be used for truncation. (kafka.server.ReplicaFetcherThread)
2017-08-30T10:26:32.204 node-87 INFO Truncating log requests-82 to offset 0. (kafka.log.Log)
--
2017-08-30T10:27:31.755 node-89 WARN [ReplicaFetcherThread-3-10055]: Based on follower's leader epoch, leader replied with an unknown offset in requests-79. High watermark 0 will be used for truncation. (kafka.server.ReplicaFetcherThread)
2017-08-30T10:27:31.756 node-89 INFO Truncating log requests-79 to offset 0. (kafka.log.Log)
{noformat}

This was a rolling upgrade from 0.10.2.0 to 0.11.0.0. The nodes that truncated their logs were not leaders before the upgrade (not even preferred leaders).
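The WARN lines above describe a fallback in the follower's truncation logic: the follower asks the leader for the end offset of its last leader epoch, and when the leader replies with an unknown/undefined offset (presumably because it could not yet answer the epoch request mid-upgrade), the follower falls back to truncating to its own high watermark, which here was 0. A simplified, self-contained sketch of that decision, not Kafka's actual code; names and values are illustrative only:

{code:scala}
// Simplified sketch of the follower-side truncation decision behind the
// WARN lines above. Not Kafka's real implementation; names are illustrative.
object EpochTruncationSketch {
  // Sentinel meaning "the leader could not resolve an end offset for the epoch".
  val UndefinedEpochOffset: Long = -1L

  def truncationOffset(leaderEpochEndOffset: Long, highWatermark: Long): Long =
    if (leaderEpochEndOffset == UndefinedEpochOffset) {
      // Fallback path: leader replied with an unknown offset, so the follower
      // truncates to its high watermark -- 0 if it has never checkpointed one.
      highWatermark
    } else {
      // Normal path (simplified): truncate to the leader's epoch end offset.
      leaderEpochEndOffset
    }

  def main(args: Array[String]): Unit = {
    // Mid-upgrade scenario from the logs: undefined epoch offset, HW = 0.
    println(truncationOffset(leaderEpochEndOffset = UndefinedEpochOffset,
                             highWatermark = 0L)) // prints 0
  }
}
{code}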
> Temporary loss of leader resulted in log being completely truncated
> -------------------------------------------------------------------
>
>                 Key: KAFKA-3039
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3039
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.9.0.0
>         Environment: Debian 3.2.54-2 x86_64 GNU/Linux
>            Reporter: Imran Patel
>            Priority: Critical
>              Labels: reliability
>
> We had an event recently where the temporary loss of a leader for a partition (during a manual restart) resulted in the leader coming back with no high watermark state and truncating its log to zero. Logs (attached below) indicate that it did have the data but not the commit state. How is this possible?
> Leader (broker 3)
> [2015-12-18 21:19:44,666] INFO Completed load of log messages-14 with log end offset 14175963374 (kafka.log.Log)
> [2015-12-18 21:19:45,170] INFO Partition [messages,14] on broker 3: No checkpointed highwatermark is found for partition [messages,14] (kafka.cluster.Partition)
> [2015-12-18 21:19:45,238] INFO Truncating log messages-14 to offset 0. (kafka.log.Log)
> [2015-12-18 21:20:34,066] INFO Partition [messages,14] on broker 3: Expanding ISR for partition [messages,14] from 3 to 3,10 (kafka.cluster.Partition)
> Replica (broker 10)
> [2015-12-18 21:19:19,525] INFO Partition [messages,14] on broker 10: Shrinking ISR for partition [messages,14] from 3,10,4 to 10,4 (kafka.cluster.Partition)
> [2015-12-18 21:20:34,049] ERROR [ReplicaFetcherThread-0-3], Current offset 14175984203 for partition [messages,14] out of range; reset offset to 35977 (kafka.server.ReplicaFetcherThread)
> [2015-12-18 21:20:34,033] WARN [ReplicaFetcherThread-0-3], Replica 10 for partition [messages,14] reset its fetch offset from 14175984203 to current leader 3's latest offset 35977 (kafka.server.ReplicaFetcherThread)
> Some relevant config parameters:
> offsets.topic.replication.factor = 3
> offsets.commit.required.acks = -1
> replica.high.watermark.checkpoint.interval.ms = 5000
> unclean.leader.election.enable = false
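On the "No checkpointed highwatermark is found" INFO line above: the broker persists per-partition high watermarks in a replication-offset-checkpoint file, and a partition with no entry there comes back with a high watermark of 0, which is what the restarted leader then truncated to. A minimal reading sketch, assuming the plain-text version-0 checkpoint layout (a version line, an entry-count line, then one "topic partition offset" line per partition); not Kafka's actual code:

{code:scala}
import java.io.File
import scala.io.Source

// Minimal sketch of reading a replication-offset-checkpoint file, assuming the
// plain-text version-0 layout: a version line, an entry-count line, then one
// "topic partition offset" line per partition. Not Kafka's actual code.
object HighWatermarkCheckpointSketch {
  def readCheckpoint(file: File): Map[(String, Int), Long] = {
    val source = Source.fromFile(file)
    try {
      source.getLines()
        .drop(2)                       // skip the version and entry-count lines
        .filter(_.trim.nonEmpty)
        .map { line =>
          val parts = line.trim.split("\\s+")
          (parts(0), parts(1).toInt) -> parts(2).toLong
        }
        .toMap
    } finally source.close()
  }

  // A partition missing from the checkpoint falls back to a high watermark of 0,
  // which matches broker 3 logging "No checkpointed highwatermark is found" and
  // then truncating messages-14 all the way to offset 0.
  def highWatermark(checkpoints: Map[(String, Int), Long],
                    topic: String, partition: Int): Long =
    checkpoints.getOrElse((topic, partition), 0L)
}
{code}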