[ 
https://issues.apache.org/jira/browse/KAFKA-5857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16158933#comment-16158933
 ] 

Onur Karaman commented on KAFKA-5857:
-------------------------------------

I wouldn't be surprised if there were no attempts so far at making the 
controller memory-efficient.

There's a chance I may have coincidentally run into the same issue yesterday 
while preparing for an upcoming talk. I was timing how long it takes to 
complete a reassignment with many empty partitions and noticed that progress 
eventually halted and the controller hit an OOM.

Here's my setup on my laptop:
{code}
> rm -rf /tmp/zookeeper/ /tmp/kafka-logs* logs*
> ./gradlew clean jar
> ./bin/zookeeper-server-start.sh config/zookeeper.properties
> export LOG_DIR=logs0 && ./bin/kafka-server-start.sh config/server0.properties
> ./bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic t --partitions 5000 --replication-factor 1
> export LOG_DIR=logs1 && ./bin/kafka-server-start.sh config/server1.properties
> python
import json
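# reassign every partition of topic t from replicas [0] to [0, 1], i.e. add broker 1 to all 5000 partitions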
with open("reassignment.txt", "w") as f:
  reassignment = {"version":1, "partitions": [{"topic": "t", "partition": 
partition, "replicas": [0, 1]} for partition in range(5000)]}
  json.dump(reassignment, f, separators=(',',':'))
> ./zkCli.sh -server localhost:2181
> create /admin/reassign_partitions <json here>
{code}
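As a rough sanity check (my own aside, not part of the original run): with the 
compact separators above, each entry in the partitions list is roughly 45 
bytes, so the 5000-partition file comes out to a bit over 200 KB, which is 
comfortably under ZooKeeper's default znode size limit (jute.maxbuffer, 
roughly 1 MB). Something like this confirms it:
{code}
import os

# the reassignment JSON must fit in a single znode; ZooKeeper's default
# jute.maxbuffer is roughly 1 MB, so anything approaching that will fail to write
print(os.path.getsize("reassignment.txt"))
{code}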

Note that I had to use the zkCli.sh that comes with zookeeper just to write the 
reassignment into zk. Kafka's kafka-reassign-partitions.sh gets stuck before 
writing to zookeeper and zookeeper-shell.sh seems to hang while copying the 
reassignment into the command.
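If zkCli.sh isn't handy, a programmatic write works too. Here's a minimal 
sketch, assuming the kazoo Python client is installed (it's not part of the 
setup above):
{code}
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

with open("reassignment.txt", "rb") as f:
    data = f.read()

# the controller picks up the reassignment from this znode and deletes it when
# it finishes, so the create fails if a previous reassignment is still in progress
zk.create("/admin/reassign_partitions", data)

zk.stop()
{code}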

Below are my broker configs:
{code}
> cat config/server0.properties
broker.id=0
listeners=PLAINTEXT://localhost:9090
log.dirs=/tmp/kafka-logs0
zookeeper.connect=127.0.0.1:2181
auto.leader.rebalance.enable=false
unclean.leader.election.enable=false
delete.topic.enable=true
log.index.size.max.bytes=1024
zookeeper.session.timeout.ms=60000
replica.lag.time.max.ms=100000
> cat config/server1.properties
broker.id=1
listeners=PLAINTEXT://localhost:9091
log.dirs=/tmp/kafka-logs1
zookeeper.connect=localhost:2181
auto.leader.rebalance.enable=false
unclean.leader.election.enable=false
delete.topic.enable=true
log.index.size.max.bytes=1024
zookeeper.session.timeout.ms=60000
replica.lag.time.max.ms=100000
{code}

I haven't looked into the cause of the OOM. I ran the scenario again just now 
and found that the controller spent a significant amount of time in G1 Old Gen 
GC.
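For anyone reproducing this: kafka-server-start.sh defaults to a 1 GB heap 
unless KAFKA_HEAP_OPTS is set, so something like the following (my own 
settings, not what I used above) gives the controller more headroom and turns 
on GC logging to make the G1 Old Gen time visible. If this version of 
kafka-run-class.sh doesn't honor KAFKA_GC_LOG_OPTS, the same flags can go into 
KAFKA_OPTS instead.
{code}
> export KAFKA_HEAP_OPTS="-Xms4g -Xmx4g"
> export KAFKA_GC_LOG_OPTS="-Xloggc:logs0/gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
> export LOG_DIR=logs0 && ./bin/kafka-server-start.sh config/server0.properties
{code}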

> Excessive heap usage on controller node during reassignment
> -----------------------------------------------------------
>
>                 Key: KAFKA-5857
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5857
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.11.0.0
>         Environment: CentOs 7, Java 1.8
>            Reporter: Raoufeh Hashemian
>              Labels: reliability
>             Fix For: 1.1.0
>
>         Attachments: CPU.png, disk_write_x.png, memory.png, 
> reassignment_plan.txt
>
>
> I was trying to expand our kafka cluster from 6 broker nodes to 12 broker 
> nodes. 
> Before expansion, we had a single topic with 960 partitions and a replication 
> factor of 3, so each node had 480 partitions. The size of data on each node 
> was 3 TB. 
> To do the expansion, I submitted a partition reassignment plan (see the 
> attached file for the current/new assignments). The plan was optimized to 
> minimize data movement and be rack aware. 
> When I submitted the plan, moving data from the old to the new nodes took 
> approximately 3 hours to complete. After that, it started deleting source 
> partitions (I say this based on the number of file descriptors) and 
> rebalancing leaders, which was not successful. Meanwhile, heap usage on the 
> controller node started to climb steeply (along with long GC times); after 5 
> hours the controller ran out of memory, and another controller started to 
> show the same behaviour for another 4 hours. At that point ZooKeeper ran out 
> of disk and the service stopped.
> To recover from this condition:
> 1) Removed zk logs to free up disk and restarted all 3 zk nodes.
> 2) Deleted the /kafka/admin/reassign_partitions node from zk.
> 3) Did unclean restarts of the kafka service on the OOM'd controller nodes, 
> which took 3 hours to complete. After this stage there were still 676 
> under-replicated partitions.
> 4) Did a clean restart on all 12 broker nodes.
> After step 4, the number of under-replicated partitions went to 0.
> So I was wondering whether this memory footprint from the controller is 
> expected for ~1K partitions? Did we do something wrong, or is it a bug?
> Attached are some resource usage graphs during this 30-hour event and the 
> reassignment plan. I'll try to add log files as well.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
