[ https://issues.apache.org/jira/browse/KAFKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15121691#comment-15121691 ]

Jay Kreps commented on KAFKA-1464:
----------------------------------

I agree that the key difference is in-sync vs. out-of-sync replicas. In-sync
replicas add to the commit time, so they are really the highest priority, and
they generally shouldn't add much load anyway. Out-of-sync replicas are the
catch-up case, and they are what adds the load.

Blindly reducing the fetch size for out-of-sync partitions would probably make
things worse, though. A large fetch size is actually good for efficiency, and
shrinking it adds overhead (more physical I/O, more filesystem reads, more
requests overall, etc.).
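To put rough numbers on the "more requests overall" part of that argument, the
sketch below counts how many fetches it takes to drain a catch-up backlog at two
fetch sizes. The 10 GiB backlog and the 1 MiB / 64 KiB fetch sizes are
hypothetical figures chosen for illustration, and the count assumes every fetch
comes back full, ignoring all other costs.

```python
import math


def fetch_requests_needed(backlog_bytes: int, fetch_size_bytes: int) -> int:
    """Number of fetches needed to drain a backlog when each fetch returns at
    most fetch_size_bytes (assumes every fetch comes back full)."""
    return math.ceil(backlog_bytes / fetch_size_bytes)


backlog = 10 * 1024**3  # hypothetical 10 GiB catch-up backlog
for fetch_size in (1024**2, 64 * 1024):  # hypothetical 1 MiB vs 64 KiB fetch sizes
    print(f"fetch size {fetch_size} B -> {fetch_requests_needed(backlog, fetch_size)} requests")
```

Shrinking the fetch size 16x multiplies the request count 16x (10,240 vs.
163,840 requests), and each extra request carries its own fixed per-request cost.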

However, it should be possible to throttle out-of-sync partitions dynamically
at the partition level. This could be done by dynamically omitting partitions
that have exceeded their throttle rate, either from the fetch request the
follower sends or from the fetch response the leader constructs. For example,
when handling a follower fetch request, the leader could check the observed
fetch rate for that follower and whether the follower is in sync. If the rate
exceeds the configured maximum for catch-up traffic, the leader would skip that
partition and answer only for the other partitions. If there are no other
partitions, the time the request waits in purgatory would need to be no greater
than the time it might take for the fetch rate to come back down below the
throttle. This would allow catch-up traffic to be throttled down dynamically
without reducing efficiency.
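A minimal Python sketch of the leader-side variant described above, under loud
assumptions: catch-up traffic is measured in bytes over a fixed sliding window
per follower, the caller supplies which partitions the follower is in sync for,
and `CatchUpThrottle`, `build_fetch_response`, and their parameters are
hypothetical names, not Kafka APIs. In-sync partitions are always answered.
Out-of-sync partitions are omitted once the observed catch-up rate exceeds the
limit, and when nothing is left to return, the suggested purgatory wait is the
time until enough old samples age out of the window for the rate to fall back
to the limit.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class CatchUpThrottle:
    """Hypothetical per-follower throttle for catch-up (out-of-sync) fetch traffic.

    Rate = bytes served in the last `window_sec` seconds / `window_sec`.
    Names are illustrative, not Kafka APIs.
    """
    max_bytes_per_sec: float
    window_sec: float = 1.0
    samples: List[Tuple[float, int]] = field(default_factory=list)  # (time, bytes)

    def _prune(self, now: float) -> None:
        # Drop samples that have aged out of the sliding window.
        cutoff = now - self.window_sec
        self.samples = [(t, b) for (t, b) in self.samples if t > cutoff]

    def observed_rate(self, now: float) -> float:
        self._prune(now)
        return sum(b for _, b in self.samples) / self.window_sec

    def record(self, now: float, nbytes: int) -> None:
        self.samples.append((now, nbytes))

    def delay_until_below(self, now: float) -> float:
        """Seconds until expiring samples bring the windowed rate back to the limit."""
        self._prune(now)
        total = sum(b for _, b in self.samples)
        budget = self.max_bytes_per_sec * self.window_sec
        if total <= budget:
            return 0.0
        for t, b in sorted(self.samples):  # oldest samples expire first
            total -= b
            if total <= budget:
                return max(0.0, t + self.window_sec - now)
        return self.window_sec  # guard; not reached for a non-negative limit


def build_fetch_response(
    available: Dict[str, int],
    in_sync: Set[str],
    throttle: CatchUpThrottle,
    now: float,
    max_bytes_per_partition: int,
) -> Tuple[Dict[str, int], float]:
    """Return (bytes to send per partition, suggested purgatory wait in seconds).

    `available` maps partition name -> bytes the leader could send;
    `in_sync` holds the partitions for which this follower is in sync.
    """
    response: Dict[str, int] = {}
    catch_up_skipped = False
    for partition, avail in available.items():
        nbytes = min(avail, max_bytes_per_partition)
        if partition in in_sync:
            response[partition] = nbytes  # in-sync traffic is never throttled
        elif throttle.observed_rate(now) > throttle.max_bytes_per_sec:
            catch_up_skipped = True  # over the catch-up limit: omit this partition
        else:
            throttle.record(now, nbytes)
            response[partition] = nbytes
    wait = 0.0
    if catch_up_skipped and sum(response.values()) == 0:
        # Nothing else to return: wait no longer than it takes the rate to recover.
        wait = throttle.delay_until_below(now)
    return response, wait
```

For example, with a 1000 B/s limit and a 1 s window, a follower that pulls
1600 B of catch-up data at t=0 gets an empty response at t=0.5 with a suggested
wait of 0.5 s, while any in-sync partition in the same request is still
answered. Omitting whole partitions, rather than shrinking their fetch size,
keeps each individual fetch large.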

> Add a throttling option to the Kafka replication tool
> -----------------------------------------------------
>
>                 Key: KAFKA-1464
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1464
>             Project: Kafka
>          Issue Type: New Feature
>          Components: replication
>    Affects Versions: 0.8.0
>            Reporter: mjuarez
>            Assignee: Ismael Juma
>            Priority: Minor
>              Labels: replication, replication-tools
>             Fix For: 0.9.1.0
>
>
> When performing replication on new nodes of a Kafka cluster, the replication
> process will use all available resources to replicate as fast as possible.
> This causes performance issues (mostly disk I/O and sometimes network
> bandwidth) in a production environment, where you're trying to serve
> downstream applications at the same time as you're performing maintenance on
> the Kafka cluster.
> An option to throttle the replication to a specific rate (in either MB/s or 
> activities/second) would help production systems to better handle maintenance 
> tasks while still serving downstream applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
