[ https://issues.apache.org/jira/browse/KAFKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241774#comment-15241774 ]
Jason Ruckman commented on KAFKA-1464:
--------------------------------------
Hello Neha,
One problem we've run into: we run a system in which we sometimes replace
brokers completely, in an automated fashion, and then rebalance leadership and
replicas across the brokers. When we bring a new broker online, we move some
partitions to it. What we see is something like this:
Consider topics A, B, and C, each with a replication factor of 3.
Consider brokers 1, 2, and 3 as serving topics A, B, and C.
A new broker 4 is replacing broker 1 (maybe the machine died, or whatever).
A and B are relatively small, but C is large.
1. Move some leaders and replicas for A and B from 2 and 3 to 4. Everything is
fine up to this point.
2. Move some leaders and replicas for C from 2 and 3 to 4.
At this point, broker 4 is pegged: it is pulling data from 2 and 3 (the other
two replicas) to catch up, which causes timeouts for the partitions it leads.
Brokers 2 and 3 are OK because 4 can use only about half of each one's
bandwidth for replication, so they still have bandwidth available to serve
requests.
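
A rough model of the situation above, as a Python sketch. The numbers are
made up for illustration (they are not measurements), and the function name
replication_load and its parameters are hypothetical. It assumes every broker
has the same link capacity and that the new broker pulls equally from each
source replica.

```python
def replication_load(link_mb_s, num_sources, throttle_mb_s=None):
    """Estimate bandwidth headroom while one new broker catches up.

    Assumption: all brokers have the same link capacity (MB/s) and the new
    broker fetches evenly from each source broker.
    """
    # Without a throttle, the new broker fetches as fast as its link allows.
    if throttle_mb_s is None:
        inbound = link_mb_s
    else:
        inbound = min(link_mb_s, throttle_mb_s)

    # Each source serves an equal share of the new broker's fetch traffic.
    per_source = inbound / num_sources

    return {
        # Bandwidth the new broker has left to serve its own clients.
        "destination_headroom": link_mb_s - inbound,
        # Bandwidth each source broker has left to serve its clients.
        "source_headroom": link_mb_s - per_source,
    }


# Broker 4 pulling from brokers 2 and 3 over assumed 100 MB/s links.
unthrottled = replication_load(100.0, 2)
throttled = replication_load(100.0, 2, throttle_mb_s=40.0)
print(unthrottled)
print(throttled)
```

With these assumed numbers, the unthrottled case leaves broker 4 with no
headroom while 2 and 3 each keep half of their link; capping replication at
40 MB/s leaves broker 4 with 60 MB/s for the partitions it leads.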
> Add a throttling option to the Kafka replication tool
> -----------------------------------------------------
>
> Key: KAFKA-1464
> URL: https://issues.apache.org/jira/browse/KAFKA-1464
> Project: Kafka
> Issue Type: New Feature
> Components: replication
> Affects Versions: 0.8.0
> Reporter: mjuarez
> Assignee: Ismael Juma
> Priority: Minor
> Labels: replication, replication-tools
> Fix For: 0.10.1.0
>
>
> When performing replication on new nodes of a Kafka cluster, the replication
> process will use all available resources to replicate as fast as possible.
> This causes performance issues (mostly disk IO and sometimes network
> bandwidth) when doing this in a production environment, where you are
> trying to serve downstream applications while also performing
> maintenance on the Kafka cluster.
> An option to throttle the replication to a specific rate (in either MB/s or
> activities/second) would help production systems better handle maintenance
> tasks while still serving downstream applications.
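
For illustration, the kind of rate limit requested above could be sketched as
a token bucket. This is a generic Python sketch, not Kafka code and not a
description of how Kafka implements throttling; the class name
ByteRateThrottle, its parameters, and the injectable clock are all
hypothetical.

```python
import time


class ByteRateThrottle:
    """Token bucket limiting average throughput to rate_bytes_per_s.

    Bursts of up to burst_bytes are allowed. The clock is injectable so the
    behavior can be checked without real sleeping.
    """

    def __init__(self, rate_bytes_per_s, burst_bytes, clock=time.monotonic):
        if rate_bytes_per_s <= 0 or burst_bytes <= 0:
            raise ValueError("rate and burst must be positive")
        self.rate = float(rate_bytes_per_s)
        self.burst = float(burst_bytes)
        self.clock = clock
        self.tokens = self.burst  # start with a full bucket
        self.last = clock()

    def _refill(self):
        # Add tokens for the time elapsed, never exceeding the burst size.
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def delay_for(self, nbytes):
        """Charge nbytes and return how many seconds the caller should wait.

        The balance may go negative; the returned delay is the time needed
        to pay that debt back at the configured rate.
        """
        self._refill()
        self.tokens -= nbytes
        if self.tokens >= 0:
            return 0.0
        return -self.tokens / self.rate


# Usage with a fake clock: 1000 bytes/s with a 1000-byte burst.
fake_now = [0.0]
throttle = ByteRateThrottle(1000, 1000, clock=lambda: fake_now[0])
first = throttle.delay_for(1000)   # fits in the initial burst
second = throttle.delay_for(500)   # bucket is empty, so the caller must wait
fake_now[0] = 0.5                  # half a second passes
third = throttle.delay_for(1000)   # debt is repaid, but a new one is created
```

A replication fetcher would call delay_for with the size of each fetched
batch and sleep for the returned time before issuing the next fetch, which
keeps the long-run rate at or below the configured limit.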