[ https://issues.apache.org/jira/browse/KAFKA-5310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Onur Karaman updated KAFKA-5310:
--------------------------------

    Description: 

This ticket is all about ControllerContext initialization and teardown. The key points are:

1. We should tear down ControllerContext during resignation instead of waiting for the next election to fix it up. A heap dump shows that the former controller keeps pretty much all of its ControllerContext state lying around.

2. We don't properly tear down/reset {{ControllerContext.partitionsBeingReassigned}}. This caused problems for us in a production cluster at LinkedIn, as shown in the scenario below:

{code}
> rm -rf /tmp/zookeeper/ /tmp/kafka-logs* logs*
> ./gradlew clean jar
> ./bin/zookeeper-server-start.sh config/zookeeper.properties
> export LOG_DIR=logs0 && ./bin/kafka-server-start.sh config/server0.properties
> export LOG_DIR=logs1 && ./bin/kafka-server-start.sh config/server1.properties
> ./bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic t --replica-assignment 1
> ./bin/zookeeper-shell.sh localhost:2181
get /brokers/topics/t
{"version":1,"partitions":{"0":[1]}}
create /admin/reassign_partitions {"partitions":[{"topic":"t","partition":0,"replicas":[1,2]}],"version":1}
Created /admin/reassign_partitions
get /brokers/topics/t
{"version":1,"partitions":{"0":[1,2]}}
get /admin/reassign_partitions
{"version":1,"partitions":[{"topic":"t","partition":0,"replicas":[1,2]}]}
delete /admin/reassign_partitions
delete /controller
get /brokers/topics/t
{"version":1,"partitions":{"0":[1,2]}}
get /admin/reassign_partitions
Node does not exist: /admin/reassign_partitions
> echo '{"partitions":[{"topic":"t","partition":0,"replicas":[1]}],"version":1}' > reassignment.txt
> ./bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file reassignment.txt --execute
get /brokers/topics/t
{"version":1,"partitions":{"0":[1]}}
get /admin/reassign_partitions
Node does not exist: /admin/reassign_partitions
delete /controller
get /brokers/topics/t
{"version":1,"partitions":{"0":[1,2]}}
get /admin/reassign_partitions
Node does not exist: /admin/reassign_partitions
{code}

Notice that the replica set goes from \[1\] to \[1,2\] (as expected, given the explicit {{/admin/reassign_partitions}} znode creation under the initial controller), back to \[1\] (as expected, given the partition reassignment under the second controller), and then back to \[1,2\] after the original controller gets re-elected.

That last transition from \[1\] to \[1,2\] is unexpected. It's due to the original controller not resetting its {{ControllerContext.partitionsBeingReassigned}} correctly: {{initializePartitionReassignment}} simply adds to whatever is already in {{ControllerContext.partitionsBeingReassigned}} (a distilled sketch of this failure mode follows the description).

The explicit {{/admin/reassign_partitions}} znode creation is there to circumvent KAFKA-5161 (95b48b157aca44beec4335e62a59f37097fe7499). Doing so is valid since:
1. our code in production doesn't have that change
2. KAFKA-5161 doesn't address the underlying race condition between a broker failure and the ReassignPartitionsCommand tool creating the znode.
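A minimal, runnable Scala sketch of that failure mode. The names ({{ControllerContext}}, {{initializePartitionReassignment}}, the election/resignation steps) are simplified stand-ins that mirror the controller code conceptually; they are not the actual Kafka implementation:

{code}
import scala.collection.mutable

object StaleReassignmentDemo {
  case class TopicAndPartition(topic: String, partition: Int)

  // Simplified stand-in for the controller's per-election state.
  class ControllerContext {
    // partition -> target replica list of an in-flight reassignment
    val partitionsBeingReassigned = mutable.Map.empty[TopicAndPartition, Seq[Int]]
  }

  // Mirrors the buggy behavior: election-time initialization only *adds* the
  // reassignments read from /admin/reassign_partitions, never clearing leftovers.
  def initializePartitionReassignment(ctx: ControllerContext,
                                      readFromZk: Map[TopicAndPartition, Seq[Int]]): Unit = {
    ctx.partitionsBeingReassigned ++= readFromZk
  }

  def main(args: Array[String]): Unit = {
    val tp = TopicAndPartition("t", 0)
    val ctx = new ControllerContext

    // Election 1: /admin/reassign_partitions says "move t-0 to [1,2]".
    initializePartitionReassignment(ctx, Map(tp -> Seq(1, 2)))

    // The controller resigns, but ctx is never torn down (the bug). By the
    // time it is re-elected, the znode has been deleted, so ZK yields nothing.
    initializePartitionReassignment(ctx, Map.empty)

    // The stale entry survives re-election, so the controller re-applies the
    // old [1,2] reassignment: the unexpected [1] -> [1,2] transition above.
    println(ctx.partitionsBeingReassigned(tp)) // List(1, 2)
  }
}
{code}

With proper teardown on resignation (clearing {{partitionsBeingReassigned}}), the second initialization would start from an empty map and the stale \[1,2\] reassignment would never be re-applied.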
was:

This ticket is all about ControllerContext initialization and teardown. The key points are:

1. We should tear down ControllerContext during resignation instead of waiting for the next election to fix it up. A heap dump shows that the former controller keeps pretty much all of its ControllerContext state lying around.

2. We don't properly tear down/reset `ControllerContext.partitionsBeingReassigned`. This caused problems for us in a production cluster at LinkedIn, as shown in the scenario below:

{code}
> rm -rf /tmp/zookeeper/ /tmp/kafka-logs* logs*
> ./gradlew clean jar
> ./bin/zookeeper-server-start.sh config/zookeeper.properties
> export LOG_DIR=logs0 && ./bin/kafka-server-start.sh config/server0.properties
> export LOG_DIR=logs1 && ./bin/kafka-server-start.sh config/server1.properties
> ./bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic t --replica-assignment 1
> ./bin/zookeeper-shell.sh localhost:2181
get /brokers/topics/t
{"version":1,"partitions":{"0":[1]}}
create /admin/reassign_partitions {"partitions":[{"topic":"t","partition":0,"replicas":[1,2]}],"version":1}
Created /admin/reassign_partitions
get /brokers/topics/t
{"version":1,"partitions":{"0":[1,2]}}
get /admin/reassign_partitions
{"version":1,"partitions":[{"topic":"t","partition":0,"replicas":[1,2]}]}
delete /admin/reassign_partitions
delete /controller
get /brokers/topics/t
{"version":1,"partitions":{"0":[1,2]}}
get /admin/reassign_partitions
Node does not exist: /admin/reassign_partitions
> echo '{"partitions":[{"topic":"t","partition":0,"replicas":[1]}],"version":1}' > reassignment.txt
> ./bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file reassignment.txt --execute
get /brokers/topics/t
{"version":1,"partitions":{"0":[1]}}
get /admin/reassign_partitions
Node does not exist: /admin/reassign_partitions
delete /controller
get /brokers/topics/t
{"version":1,"partitions":{"0":[1,2]}}
get /admin/reassign_partitions
Node does not exist: /admin/reassign_partitions
{code}

Notice that the replica set goes from \[1\] to \[1,2\] (as expected, given the explicit `/admin/reassign_partitions` znode creation under the initial controller), back to \[1\] (as expected, given the partition reassignment under the second controller), and then back to \[1,2\] after the original controller gets re-elected.

That last transition from \[1\] to \[1,2\] is unexpected. It's due to the original controller not resetting its `ControllerContext.partitionsBeingReassigned` correctly: `initializePartitionReassignment` simply adds to whatever is already in `ControllerContext.partitionsBeingReassigned`.

The explicit `/admin/reassign_partitions` znode creation is there to circumvent KAFKA-5161 (95b48b157aca44beec4335e62a59f37097fe7499). Doing so is valid since:
1. our code in production doesn't have that change
2. KAFKA-5161 doesn't address the underlying race condition between a broker failure and the ReassignPartitionsCommand tool creating the znode.

It looks like this bug has been around for quite some time (definitely before 0.10.2).
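As a companion to the failure sketch above, a minimal Scala sketch of the teardown direction the ticket title proposes. Again, the names are hypothetical, simplified stand-ins rather than the actual controller API:

{code}
import scala.collection.mutable

object ResignationTeardownSketch {
  // Simplified stand-in for the controller's per-election state.
  class ControllerContext {
    val partitionsBeingReassigned = mutable.Map.empty[(String, Int), Seq[Int]]
    // ... the real context holds much more per-election state ...
  }

  // Hypothetical resignation hook: reset reassignment bookkeeping during
  // resignation so a later re-election rebuilds it purely from what
  // /admin/reassign_partitions says, with no stale leftovers.
  def onControllerResignation(ctx: ControllerContext): Unit = {
    ctx.partitionsBeingReassigned.clear()
  }

  def main(args: Array[String]): Unit = {
    val ctx = new ControllerContext
    ctx.partitionsBeingReassigned(("t", 0)) = Seq(1, 2) // in-flight reassignment
    onControllerResignation(ctx)
    println(ctx.partitionsBeingReassigned.isEmpty) // true: no stale state survives
  }
}
{code}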
> reset ControllerContext during resignation
> ------------------------------------------
>
>                 Key: KAFKA-5310
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5310
>             Project: Kafka
>          Issue Type: Sub-task
>            Reporter: Onur Karaman
>            Assignee: Onur Karaman
>

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)