[
https://issues.apache.org/jira/browse/KAFKA-5857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Raoufeh Hashemian updated KAFKA-5857:
-------------------------------------
Description:
I was trying to expand our Kafka cluster from 6 broker nodes to 12 broker nodes.
Before the expansion, we had a single topic with 960 partitions and a replication
factor of 3, so each node held 480 partitions. The size of the data on each node
was 3 TB.
To do the expansion, I submitted a partition reassignment plan (see attached
file for the current/new assignments). The plan was optimized to minimize data
movement and be rack aware.
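For context, a reassignment like this is normally driven with the stock Kafka tooling; a rough sketch is below (the ZooKeeper address, chroot, broker IDs, and file names are placeholders, not values from the attached plan):

```shell
# Generate a candidate assignment for the topic across the expanded broker set
# (broker IDs 0-11, zk1:2181/kafka, and topics.json are placeholder values).
kafka-reassign-partitions.sh --zookeeper zk1:2181/kafka \
  --topics-to-move-json-file topics.json \
  --broker-list "0,1,2,3,4,5,6,7,8,9,10,11" --generate

# Execute a (possibly hand-optimized) plan, then poll progress with --verify.
kafka-reassign-partitions.sh --zookeeper zk1:2181/kafka \
  --reassignment-json-file reassignment_plan.txt --execute
kafka-reassign-partitions.sh --zookeeper zk1:2181/kafka \
  --reassignment-json-file reassignment_plan.txt --verify
```

The --execute step writes the plan to the /admin/reassign_partitions znode, which the controller then processes; this is the node referenced in the recovery steps below.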
When I submitted the plan, it took approximately 3 hours for the data movement
from the old to the new nodes to complete. After that, it started deleting the
source partitions (I say this based on the number of open file descriptors) and
rebalancing leaders, which was not successful. Meanwhile, heap usage on the
controller node started to climb steeply (along with long GC times); after 5
hours the controller ran out of memory, and another controller showed the same
behaviour for another 4 hours. At that point ZooKeeper ran out of disk space
and the service stopped.
To recover from this condition:
1) Removed the ZK logs to free up disk space and restarted all 3 ZK nodes
2) Deleted the /kafka/admin/reassign_partitions node from ZK
3) Did unclean restarts of the Kafka service on the OOMed controller nodes,
which took 3 hours to complete. After this stage there were still 676
under-replicated partitions.
4) Did a clean restart on all 12 broker nodes.
After step 4, the number of under-replicated partitions went to 0.
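For anyone who ends up in the same state, steps 1-2 correspond roughly to the following ZooKeeper CLI invocation (the zk1:2181 address is a placeholder; the /kafka chroot matches the node path above):

```shell
# Remove the in-flight reassignment marker so the controller stops
# retrying the stalled reassignment on its next election.
zookeeper-shell.sh zk1:2181 delete /kafka/admin/reassign_partitions
```

Note that deleting this znode only clears the pending reassignment request; replicas already moved stay where they are, which is why the broker restarts in steps 3-4 were still needed to clear the under-replicated partitions.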
So I was wondering: is this memory footprint on the controller expected for
~1K partitions? Did we do something wrong, or is this a bug?
Attached are some resource usage graphs from this 30-hour event along with the
reassignment plan. I'll try to add log files as well.
> Excessive heap usage on controller node during reassignment
> -----------------------------------------------------------
>
> Key: KAFKA-5857
> URL: https://issues.apache.org/jira/browse/KAFKA-5857
> Project: Kafka
> Issue Type: Bug
> Components: controller
> Affects Versions: 0.11.0.0
> Environment: CentOs 7, Java 1.8
> Reporter: Raoufeh Hashemian
> Attachments: CPU.png, disk_write_x.png, memory.png,
> reassignment_plan.txt
>
>
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)