[ https://issues.apache.org/jira/browse/KAFKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15121691#comment-15121691 ]
Jay Kreps commented on KAFKA-1464:
----------------------------------

I agree that the key difference is in-sync vs out-of-sync replicas. In-sync replicas add to the commit time, so they are really the highest priority and generally shouldn't add much load anyway. Out-of-sync replicas are the catch-up case that adds load.

Blindly reducing the fetch size for out-of-sync partitions would probably make things worse, though. A large fetch size is actually good for efficiency, and shrinking it will add overhead (more physical I/O, more FS reads, more requests overall, etc).

However, it should be possible to throttle dynamically at the partition level for out-of-sync partitions. This could be done by dynamically omitting partitions that have exceeded their throttle rate from either the fetch request that the follower sends or from the fetch response the leader constructs. For example, when handling follower fetch requests the leader could check the observed fetch rate for that follower and whether it is in sync or not; if the rate exceeds the configured maximum for catch-up traffic, the leader would ignore that partition and only answer for other partitions (if there are no other partitions, the purgatory time would need to be calculated to be no greater than the time in which the fetch rate might come down below the throttle). This would allow for dynamically throttling down the catch-up traffic without reducing efficiency.

> Add a throttling option to the Kafka replication tool
> -----------------------------------------------------
>
>                 Key: KAFKA-1464
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1464
>             Project: Kafka
>          Issue Type: New Feature
>          Components: replication
>    Affects Versions: 0.8.0
>            Reporter: mjuarez
>            Assignee: Ismael Juma
>            Priority: Minor
>              Labels: replication, replication-tools
>             Fix For: 0.9.1.0
>
>
> When performing replication on new nodes of a Kafka cluster, the replication
> process will use all available resources to replicate as fast as possible.
> This causes performance issues (mostly disk IO and sometimes network
> bandwidth) when doing this in a production environment, in which you're
> trying to serve downstream applications, at the same time you're performing
> maintenance on the Kafka cluster.
> An option to throttle the replication to a specific rate (in either MB/s or
> activities/second) would help production systems to better handle maintenance
> tasks while still serving downstream applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
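
To make the leader-side variant in the comment above concrete, here is a minimal sketch in Python. It is not Kafka code: the names CatchUpThrottle and handle_follower_fetch, the single rate window per follower, and the max_wait parameter standing in for the purgatory timeout are all assumptions made for illustration. The fetch size is left untouched; partitions are only omitted while the follower's observed catch-up rate is above the configured maximum.

```python
from dataclasses import dataclass


@dataclass
class CatchUpThrottle:
    """Hypothetical tracker of catch-up bytes sent to one follower."""
    max_bytes_per_sec: float   # configured maximum for catch-up traffic
    window_start: float        # time (seconds) at which counting started
    bytes_sent: int = 0

    def observed_rate(self, now: float) -> float:
        elapsed = max(now - self.window_start, 1e-9)  # avoid division by zero
        return self.bytes_sent / elapsed

    def exceeded(self, now: float) -> bool:
        return self.observed_rate(now) > self.max_bytes_per_sec

    def seconds_until_under_limit(self, now: float) -> float:
        # The rate falls to the limit once elapsed >= bytes_sent / limit.
        needed = self.bytes_sent / self.max_bytes_per_sec
        return max(0.0, needed - (now - self.window_start))

    def record(self, nbytes: int) -> None:
        self.bytes_sent += nbytes


def handle_follower_fetch(requested, in_sync, available, throttle, now, max_wait):
    """Build a fetch response for one follower (illustrative only).

    requested: partitions named in the fetch request
    in_sync:   partitions for which this follower is in sync
    available: partition -> bytes a full-size fetch would return
    Returns (response, wait_seconds), where response maps partition -> bytes.
    """
    response = {}
    throttled = False
    for partition in requested:
        catching_up = partition not in in_sync
        if catching_up and throttle.exceeded(now):
            throttled = True      # omit the partition, keep the fetch size
            continue
        nbytes = available.get(partition, 0)
        response[partition] = nbytes
        if catching_up:
            throttle.record(nbytes)

    if response:
        return response, 0.0      # something to send, answer immediately
    if throttled:
        # Nothing else to answer: wait no longer than it takes the observed
        # rate to come back down to the throttle.
        return response, min(max_wait, throttle.seconds_until_under_limit(now))
    return response, max_wait
```

With a limit of 1000 bytes/s and 4000 catch-up bytes already sent by t=2s, a request naming only the out-of-sync partition gets an empty response and a 2 second wait, after which the observed rate is back at the limit and the partition is served again at full fetch size.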