[ https://issues.apache.org/jira/browse/KAFKA-5310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Onur Karaman updated KAFKA-5310:
--------------------------------
    Description: 
This ticket is all about ControllerContext initialization and teardown. The key 
points are:
1. we should tear down ControllerContext during resignation instead of waiting 
on election to fix it up. A heap dump shows that the former controller keeps 
pretty much all of its ControllerContext state lying around.
2. we don't properly tear down/reset 
{{ControllerContext.partitionsBeingReassigned}}. This caused problems for us in 
a production cluster at LinkedIn, as shown in the scenario below:
{code}
> rm -rf /tmp/zookeeper/ /tmp/kafka-logs* logs*
> ./gradlew clean jar
> ./bin/zookeeper-server-start.sh config/zookeeper.properties
> export LOG_DIR=logs0 && ./bin/kafka-server-start.sh config/server0.properties
> export LOG_DIR=logs1 && ./bin/kafka-server-start.sh config/server1.properties
> ./bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic t --replica-assignment 1
> ./bin/zookeeper-shell.sh localhost:2181

get /brokers/topics/t
{"version":1,"partitions":{"0":[1]}}

create /admin/reassign_partitions {"partitions":[{"topic":"t","partition":0,"replicas":[1,2]}],"version":1}
Created /admin/reassign_partitions

get /brokers/topics/t
{"version":1,"partitions":{"0":[1,2]}}

get /admin/reassign_partitions
{"version":1,"partitions":[{"topic":"t","partition":0,"replicas":[1,2]}]}

delete /admin/reassign_partitions
delete /controller

get /brokers/topics/t
{"version":1,"partitions":{"0":[1,2]}}

get /admin/reassign_partitions
Node does not exist: /admin/reassign_partitions

> echo '{"partitions":[{"topic":"t","partition":0,"replicas":[1]}],"version":1}' > reassignment.txt
> ./bin/kafka-reassign-partitions.sh --zookeeper localhost:2181 --reassignment-json-file reassignment.txt --execute

get /brokers/topics/t
{"version":1,"partitions":{"0":[1]}}

get /admin/reassign_partitions
Node does not exist: /admin/reassign_partitions

delete /controller

get /brokers/topics/t
{"version":1,"partitions":{"0":[1,2]}}

get /admin/reassign_partitions
Node does not exist: /admin/reassign_partitions
{code}

Notice that the replica set goes from \[1\] to \[1,2\] (as expected, given the 
explicit {{/admin/reassign_partitions}} znode creation under the initial 
controller), back to \[1\] (as expected, given the partition reassignment under 
the second controller), and then back to \[1,2\] again after the original 
controller gets re-elected.

That last transition from \[1\] to \[1,2\] is unexpected. It's due to the 
original controller not resetting its 
{{ControllerContext.partitionsBeingReassigned}} correctly. 
{{initializePartitionReassignment}} simply adds to what's already in 
{{ControllerContext.partitionsBeingReassigned}}.
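
For illustration, here's a minimal Scala sketch of that add-only behavior and of the reset-on-resignation direction this ticket proposes. The names ({{ControllerContextSketch}}, {{SketchTopicPartition}}, {{onResignation}}) are made up for the example and deliberately simplify the real controller types; this is not the actual controller code.
{code}
import scala.collection.mutable

// Hypothetical, simplified stand-in for illustration only; the real
// ControllerContext keys this map by topic-partition and stores a richer
// reassignment context per entry.
case class SketchTopicPartition(topic: String, partition: Int)

class ControllerContextSketch {
  val partitionsBeingReassigned = mutable.Map.empty[SketchTopicPartition, Seq[Int]]

  // Mirrors the add-only pattern described above: initialization merges in
  // whatever it read from /admin/reassign_partitions, so entries left over
  // from a previous controller session survive even after the znode is gone.
  def initializePartitionReassignment(readFromZnode: Map[SketchTopicPartition, Seq[Int]]): Unit =
    partitionsBeingReassigned ++= readFromZnode

  // The direction this ticket proposes, sketched: clear the bookkeeping on
  // resignation so a re-elected controller rebuilds it purely from ZooKeeper.
  def onResignation(): Unit =
    partitionsBeingReassigned.clear()
}

object ReassignmentStalenessDemo extends App {
  val ctx = new ControllerContextSketch
  val tp  = SketchTopicPartition("t", 0)

  ctx.initializePartitionReassignment(Map(tp -> Seq(1, 2))) // first controller session
  // controller resigns without clearing its context, znode is later deleted ...
  ctx.initializePartitionReassignment(Map.empty)            // re-election: nothing in ZK
  println(ctx.partitionsBeingReassigned)                    // still contains t-0 -> List(1, 2): stale

  ctx.onResignation()                                       // with the fix, this runs on resignation
  ctx.initializePartitionReassignment(Map.empty)
  println(ctx.partitionsBeingReassigned)                    // empty, as expected
}
{code}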

The explicit {{/admin/reassign_partitions}} znode creation is to circumvent 
KAFKA-5161 (95b48b157aca44beec4335e62a59f37097fe7499). Doing so is valid since:
1. our code in production doesn't have that change
2. KAFKA-5161 doesn't address the underlying race condition between a broker 
failure and the ReassignPartitionsCommand tool creating the znode.

> reset ControllerContext during resignation
> ------------------------------------------
>
>                 Key: KAFKA-5310
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5310
>             Project: Kafka
>          Issue Type: Sub-task
>            Reporter: Onur Karaman
>            Assignee: Onur Karaman
>


