[ 
https://issues.apache.org/jira/browse/KAFKA-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328472#comment-14328472
 ] 

Neha Narkhede commented on KAFKA-1546:
--------------------------------------

bq. If this isn't feasible, then I do think that the heuristic proposed in 
Neha's comment is a good one, and I will submit a patch for it.

Sounds good. Will help you review it.

bq. a. keepInSyncMessages - This tracks replica lag as a function of the number 
of messages it is trailing behind. I believe we will remove this entirely 
regardless of the approach we choose.

Correct.

bq. b. keepInSyncTimeMs - This tracks the amount of time between fetch 
requests. I think we can remove this as well.

Hmm, it depends. There are two things we need to check: dead replicas and slow 
replicas. The dead replica check removes a replica that hasn't sent a fetch 
request to the leader for some time. Take a replica that is in sync with the 
leader (lagBegin is -1) on a topic with no new messages coming in, and suppose 
it stops fetching entirely. The lagBegin logic would remove that replica only 
once new messages arrive, but it should have been removed long before that, 
because it had stopped fetching and was dead.

The logic we have above works pretty well for slow replicas, but I think we 
still need to handle dead replicas for low-volume topics. 
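
To make the distinction concrete, here is a rough sketch of the two checks side 
by side. It is only an illustration in Python, not the actual broker code: 
ReplicaState, last_fetch_ms, lag_begin_ms, max_lag_ms and should_remove are 
hypothetical names, and it assumes a single time threshold is used for both 
checks. Only the lagBegin = -1 convention comes from the discussion above.

```python
from dataclasses import dataclass

NOT_LAGGING = -1  # mirrors "lagBegin is -1" for a replica that is caught up


@dataclass
class ReplicaState:
    """Hypothetical per-follower bookkeeping kept by the leader."""
    last_fetch_ms: int               # time of the follower's most recent fetch request
    lag_begin_ms: int = NOT_LAGGING  # time the follower first fell behind, or -1


def is_dead(replica: ReplicaState, now_ms: int, max_lag_ms: int) -> bool:
    # Dead: no fetch request for longer than the threshold,
    # regardless of whether new messages have arrived.
    return now_ms - replica.last_fetch_ms > max_lag_ms


def is_slow(replica: ReplicaState, now_ms: int, max_lag_ms: int) -> bool:
    # Slow: still fetching, but continuously behind the leader
    # for longer than the threshold.
    return (replica.lag_begin_ms != NOT_LAGGING
            and now_ms - replica.lag_begin_ms > max_lag_ms)


def should_remove(replica: ReplicaState, now_ms: int, max_lag_ms: int) -> bool:
    # Either condition alone is enough to drop the replica.
    return (is_dead(replica, now_ms, max_lag_ms)
            or is_slow(replica, now_ms, max_lag_ms))
```

With these definitions, a caught-up replica on an idle topic that last fetched 
15 seconds ago is flagged by is_dead under a 10 second threshold, even though 
is_slow never fires because lag_begin_ms is still -1. That is the low-volume 
case the lagBegin logic alone would miss.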

> Automate replica lag tuning
> ---------------------------
>
>                 Key: KAFKA-1546
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1546
>             Project: Kafka
>          Issue Type: Improvement
>          Components: replication
>    Affects Versions: 0.8.0, 0.8.1, 0.8.1.1
>            Reporter: Neha Narkhede
>            Assignee: Aditya Auradkar
>              Labels: newbie++
>
> Currently, there is no good way to tune the replica lag configs to 
> automatically account for high and low volume topics on the same cluster. 
> For the low-volume topic it will take a very long time to detect a lagging
> replica, and for the high-volume topic it will have false-positives.
> One approach to making this easier would be to have the configuration
> be something like replica.lag.max.ms and translate this into a number
> of messages dynamically based on the throughput of the partition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
