[ https://issues.apache.org/jira/browse/KAFKA-5857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16158933#comment-16158933 ]
Onur Karaman commented on KAFKA-5857:
-------------------------------------

I wouldn't be surprised if there have been no attempts so far at making the controller memory-efficient. There's a slight chance I may have coincidentally run into the same issue yesterday while preparing for an upcoming talk. I tried timing how long it takes to complete a reassignment with many empty partitions and noticed that progress eventually halted and the controller hit an OOM. Here's my setup on my laptop:
{code}
> rm -rf /tmp/zookeeper/ /tmp/kafka-logs* logs*
> ./gradlew clean jar
> ./bin/zookeeper-server-start.sh config/zookeeper.properties
> export LOG_DIR=logs0 && ./bin/kafka-server-start.sh config/server0.properties
> ./bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic t --partitions 5000 --replication-factor 1
> export LOG_DIR=logs1 && ./bin/kafka-server-start.sh config/server1.properties
> python
import json
with open("reassignment.txt", "w") as f:
    reassignment = {"version":1, "partitions": [{"topic": "t", "partition": partition, "replicas": [0, 1]} for partition in range(5000)]}
    json.dump(reassignment, f, separators=(',',':'))
> ./zkCli.sh -server localhost:2181
> create /admin/reassign_partitions <json here>
{code}
Note that I had to use the zkCli.sh that comes with ZooKeeper just to write the reassignment into zk: Kafka's kafka-reassign-partitions.sh gets stuck before writing to ZooKeeper, and zookeeper-shell.sh seems to hang while pasting the reassignment into the command. Below are my broker configs:
{code}
> cat config/server0.properties
broker.id=0
listeners=PLAINTEXT://localhost:9090
log.dirs=/tmp/kafka-logs0
zookeeper.connect=127.0.0.1:2181
auto.leader.rebalance.enable=false
unclean.leader.election.enable=false
delete.topic.enable=true
log.index.size.max.bytes=1024
zookeeper.session.timeout.ms=60000
replica.lag.time.max.ms=100000

[09:57:16] okaraman@okaraman-mn3:~/code/kafka
> cat config/server1.properties
broker.id=1
listeners=PLAINTEXT://localhost:9091
log.dirs=/tmp/kafka-logs1
zookeeper.connect=localhost:2181
auto.leader.rebalance.enable=false
unclean.leader.election.enable=false
delete.topic.enable=true
log.index.size.max.bytes=1024
zookeeper.session.timeout.ms=60000
replica.lag.time.max.ms=100000
{code}
I haven't looked into the cause of the OOM. I ran the scenario again just now and found that the controller spent a significant amount of time in G1 Old Gen GC.
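As an aside, if zookeeper-shell.sh hangs on a payload this large, one workaround is to write the generated JSON into the reassignment znode programmatically instead of pasting it. Below is a minimal sketch using the kazoo Python client; kazoo itself and the chroot-less /admin/reassign_partitions path are assumptions here, not part of the setup above:
{code}
# Hypothetical sketch, not part of the original reproduction: write the
# reassignment.txt produced by the snippet above straight into ZooKeeper.
# Assumes `pip install kazoo`, ZooKeeper on 127.0.0.1:2181, and no chroot.
from kazoo.client import KazooClient

with open("reassignment.txt", "rb") as f:
    payload = f.read()

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()
try:
    # The controller deletes this znode once the reassignment completes,
    # so it should not already exist when we create it.
    zk.create("/admin/reassign_partitions", payload)
finally:
    zk.stop()
{code}
This only replaces the manual zkCli.sh step; everything else about the reproduction stays the same.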
> Excessive heap usage on controller node during reassignment
> ------------------------------------------------------------
>
>                 Key: KAFKA-5857
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5857
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.11.0.0
>         Environment: CentOS 7, Java 1.8
>            Reporter: Raoufeh Hashemian
>              Labels: reliability
>             Fix For: 1.1.0
>
>         Attachments: CPU.png, disk_write_x.png, memory.png, reassignment_plan.txt
>
>
> I was trying to expand our Kafka cluster from 6 broker nodes to 12 broker nodes.
> Before the expansion, we had a single topic with 960 partitions and a replication factor of 3, so each node held 480 partitions. The size of the data on each node was 3 TB.
> To do the expansion, I submitted a partition reassignment plan (see the attached file for the current/new assignments). The plan was optimized to minimize data movement and to be rack aware.
> When I submitted the plan, it took approximately 3 hours for the data movement from old to new nodes to complete. After that, it started deleting source partitions (I say this based on the number of file descriptors) and rebalancing leaders, which has not been successful. Meanwhile, heap usage on the controller node started climbing steeply (along with long GC times); it took 5 hours for the controller to run out of memory, and another controller then showed the same behaviour for another 4 hours. At that point ZooKeeper ran out of disk and the service stopped.
> To recover from this condition:
> 1) Removed zk logs to free up disk and restarted all 3 zk nodes
> 2) Deleted the /kafka/admin/reassign_partitions node from zk
> 3) Did unclean restarts of the Kafka service on the OOM'd controller nodes, which took 3 hours to complete. After this stage there were still 676 under-replicated partitions.
> 4) Did a clean restart on all 12 broker nodes.
> After step 4, the number of under-replicated partitions went to 0.
> So I was wondering: is this memory footprint on the controller expected for ~1k partitions? Did we do something wrong, or is it a bug?
> Attached are some resource usage graphs covering this 30-hour event, plus the reassignment plan. I'll try to add log files as well.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)