[ https://issues.apache.org/jira/browse/KAFKA-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jun Rao resolved KAFKA-10002.
-----------------------------
    Fix Version/s: 2.7.0
       Resolution: Fixed

Merged the PR to trunk.

> Improve performances of StopReplicaRequest with large number of partitions to be deleted
> ----------------------------------------------------------------------------------------
>
>                 Key: KAFKA-10002
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10002
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: David Jacot
>            Assignee: David Jacot
>            Priority: Major
>             Fix For: 2.7.0
>
>
> I have noticed that StopReplicaRequests with partitions to be deleted are
> extremely slow when there are more than 2000 partitions, which leads to
> hitting the request timeout in the controller. A request with 2000 partitions
> to be deleted still works, but performance degrades significantly as the
> number increases. For example, a request with 3000 partitions to be deleted
> takes approx. 60 seconds to be processed.
>
> A CPU profile shows that most of the time is spent in checkpointing log start
> offsets and recovery offsets. Almost 90% of the time is spent there. See
> attached.
>
> When a partition is deleted, the replica manager calls
> `ReplicaManager#asyncDelete`, which checkpoints recovery offsets and log start
> offsets. As the checkpoints are per data directory, the checkpointing is done
> for all the partitions in the directory of the partition to be deleted. In
> our case, where we have only one data directory, if we delete 1000
> partitions, we end up checkpointing the same things 1000 times, which is not
> efficient.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
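
The redundant checkpointing described in the issue can be illustrated with a small, self-contained sketch. It is not Kafka code and does not claim to reproduce the merged fix: the class `CheckpointBatching`, its methods, and the partition and directory names are all hypothetical, and a "checkpoint" here is only a counter rather than a file write. It compares checkpointing a data directory after every single deletion with checkpointing each affected directory once after the whole batch has been removed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model, not Kafka's ReplicaManager or LogManager.
public class CheckpointBatching {

    // Pattern described in the issue: every deleted partition triggers a
    // checkpoint of its whole data directory. Returns the number of
    // directory checkpoints performed.
    static int deleteOneByOne(Map<String, String> dirByPartition, List<String> toDelete) {
        int checkpoints = 0;
        for (String partition : toDelete) {
            String dir = dirByPartition.remove(partition);
            if (dir != null) {
                checkpoints++; // stands in for rewriting the directory's checkpoint files
            }
        }
        return checkpoints;
    }

    // Batched alternative: remove all partitions first, remember which
    // directories were touched, then checkpoint each of them exactly once.
    static int deleteBatched(Map<String, String> dirByPartition, List<String> toDelete) {
        Set<String> affectedDirs = new HashSet<>();
        for (String partition : toDelete) {
            String dir = dirByPartition.remove(partition);
            if (dir != null) {
                affectedDirs.add(dir);
            }
        }
        return affectedDirs.size(); // one checkpoint per affected directory
    }

    // Builds a layout in which all partitions live in a single data directory,
    // as in the setup reported in the issue. The names are made up.
    static Map<String, String> singleDirLayout(int partitionCount) {
        Map<String, String> layout = new HashMap<>();
        for (int i = 0; i < partitionCount; i++) {
            layout.put("topic-" + i, "/data/kafka-logs");
        }
        return layout;
    }

    public static void main(String[] args) {
        Map<String, String> layout = singleDirLayout(1000);
        List<String> toDelete = new ArrayList<>(layout.keySet());

        int oneByOne = deleteOneByOne(new HashMap<>(layout), toDelete);
        int batched = deleteBatched(new HashMap<>(layout), toDelete);

        System.out.println("checkpoints, one by one: " + oneByOne); // 1000
        System.out.println("checkpoints, batched:    " + batched);  // 1
    }
}
```

In this model the number of checkpoints drops from one per deleted partition to one per affected data directory. Because each real checkpoint covers every partition remaining in the directory, the per-partition pattern grows roughly quadratically with the number of partitions, which is consistent with the slowdown reported above.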