Raoufeh Hashemian created KAFKA-5857:
----------------------------------------

             Summary: Excessive heap usage on controller node during 
reassignment
                 Key: KAFKA-5857
                 URL: https://issues.apache.org/jira/browse/KAFKA-5857
             Project: Kafka
          Issue Type: Bug
          Components: controller
    Affects Versions: 0.11.0.0
         Environment: CentOs 7, Java 1.8
            Reporter: Raoufeh Hashemian
         Attachments: CPU.png, disk_write_x.png, memory.png, 
reassignment_plan.txt

I was trying to expand our Kafka cluster from 6 broker nodes to 12 broker nodes. 
Before the expansion, we had a single topic with 960 partitions and a replication 
factor of 3, so each node held 480 partition replicas. The size of the data on 
each node was 3 TB.
To do the expansion, I submitted a partition reassignment plan (see the attached 
file for the current/new assignments). The plan was optimized to minimize data 
movement and to be rack aware.
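For reference, this is roughly how the plan was submitted and monitored; the ZooKeeper connect string and chroot are assumptions, not taken from the report. Kafka 0.11 ships this tool:

```shell
# Submit the reassignment plan (zk1:2181/kafka is a hypothetical
# connect string; reassignment_plan.txt is the attached plan file):
kafka-reassign-partitions.sh \
  --zookeeper zk1:2181/kafka \
  --reassignment-json-file reassignment_plan.txt \
  --execute

# Poll progress with the same JSON file:
kafka-reassign-partitions.sh \
  --zookeeper zk1:2181/kafka \
  --reassignment-json-file reassignment_plan.txt \
  --verify
```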

When I submitted the plan, it took approximately 3 hours to complete the data 
movement. After that, the cluster started deleting source partitions (I say this 
based on the number of file descriptors) and rebalancing leaders, which never 
completed successfully. Meanwhile, heap usage on the controller node started to 
climb steeply (along with long GC times); it took 5 hours for the controller to 
run out of memory, and then another controller showed the same behaviour for 
another 4 hours. At that point, ZooKeeper ran out of disk space and the service 
stopped.

To recover from this condition:
1) Removed ZooKeeper logs to free up disk space and restarted all 3 ZooKeeper 
nodes.
2) Deleted the /kafka/admin/reassign_partitions node from ZooKeeper.
3) Had to do unclean restarts of the Kafka service on the OOM controller nodes, 
which took 3 hours to complete. After this stage there were still 676 
under-replicated partitions.
4) Did a clean restart on all 12 broker nodes.

After step 4, the number of under-replicated partitions went to 0.
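Steps 1 and 2 above can be sketched roughly as follows; host names, the `-n 3` retention count, and the /kafka chroot are assumptions, not taken from the report:

```shell
# Step 1: purge old ZooKeeper transaction logs/snapshots to free disk,
# then restart each ZooKeeper node. zkCleanup.sh keeps the newest N
# snapshots plus their logs (N must be at least 3):
./zkCleanup.sh -n 3
systemctl restart zookeeper

# Step 2: remove the pending reassignment znode (this cluster appears
# to use a /kafka chroot, based on the path in the report):
zkCli.sh -server zk1:2181 delete /kafka/admin/reassign_partitions
```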


So I was wondering: is this memory footprint on the controller expected for ~1K 
partitions? Did we do something wrong, or is it a bug?

Attached are some resource usage graphs covering this 30-hour event, along with 
the reassignment plan.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
