[ https://issues.apache.org/jira/browse/KAFKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241774#comment-15241774 ]
Jason Ruckman commented on KAFKA-1464:
--------------------------------------

Hello Neha,

One problem we've run into: we run a system where we sometimes replace brokers completely, in an automated fashion, and rebalance leadership and replicas across them. When we bring a new broker online, we move some partitions to it. What we see is something like this:

Consider topics A, B, C with replication factors of 3
Consider brokers 1, 2, 3 as serving topics A, B, C
A new broker 4 is replacing 1 (maybe the machine died, or whatever)
A and B are relatively small, but C is large

1. Move some leaders and replicas to 4 for A and B from 2 and 3. Everything is good up until now.
2. Move some leaders and replicas to 4 for C from 2 and 3.

At this point, broker 4 is pegged, since it's pulling in data from 2 and 3 (the other two replicas) to catch up, so it causes timeouts for the partitions it is the leader for. Brokers 2 and 3 are OK because 4 can only use 1/2 of their bandwidth to replicate, so they still have some bandwidth available to serve requests.

> Add a throttling option to the Kafka replication tool
> -----------------------------------------------------
>
>                 Key: KAFKA-1464
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1464
>             Project: Kafka
>          Issue Type: New Feature
>          Components: replication
>    Affects Versions: 0.8.0
>            Reporter: mjuarez
>            Assignee: Ismael Juma
>            Priority: Minor
>              Labels: replication, replication-tools
>             Fix For: 0.10.1.0
>
>
> When performing replication on new nodes of a Kafka cluster, the replication
> process will use all available resources to replicate as fast as possible.
> This causes performance issues (mostly disk IO and sometimes network
> bandwidth) when doing this in a production environment, in which you're
> trying to serve downstream applications, at the same time you're performing
> maintenance on the Kafka cluster.
> An option to throttle the replication to a specific rate (in either MB/s or
> activities/second) would help production systems to better handle maintenance
> tasks while still serving downstream applications.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
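
For illustration only, here is a minimal sketch of what "throttle the replication to a specific rate" could mean in practice, using a token bucket over bytes. This is not Kafka's implementation or API; the class name TokenBucket, the method delay_for, and the injectable clock parameter are all hypothetical names chosen for this example.

```python
import time


class TokenBucket:
    """Hypothetical byte-rate limiter: bursts up to `capacity_bytes`,
    refilled at `rate_bytes_per_sec`. Not part of Kafka."""

    def __init__(self, rate_bytes_per_sec, capacity_bytes, clock=time.monotonic):
        self.rate = float(rate_bytes_per_sec)
        self.capacity = float(capacity_bytes)
        self.tokens = float(capacity_bytes)  # start with a full bucket
        self.clock = clock                   # injectable so tests need not sleep
        self.last = clock()

    def _refill(self):
        # Credit tokens for the time elapsed since the last call, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def delay_for(self, nbytes):
        """Charge `nbytes` against the bucket and return how many seconds
        the caller should wait before issuing the next transfer."""
        self._refill()
        self.tokens -= nbytes  # may go negative; the debt is repaid by waiting
        if self.tokens >= 0:
            return 0.0
        return -self.tokens / self.rate
```

In the scenario from the comment, a catching-up replica (broker 4) would call delay_for(len(chunk)) after each fetched chunk and sleep for the returned time, which caps its replication traffic at the configured rate and leaves the rest of its bandwidth for the partitions it leads. The bucket is charged in bytes here, matching the MB/s form of the request; a limit in operations per second would charge 1 per call instead.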