[ 
https://issues.apache.org/jira/browse/KAFKA-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358039#comment-14358039
 ] 

Jay Kreps commented on KAFKA-1546:
----------------------------------

Personally, I don't think it really needs a KIP. It subtly changes the meaning 
of one config, but it actually changes it to mean what everyone thinks it 
currently means. What do you think? I think this one is less about user 
expectations or our opinions and more about "does it actually work". Speaking 
of which...

[~auradkar] What is the test plan for this? It is trivially easy to reproduce 
the problems with the old approach: start a server with default settings and 
1-2 replicas, use the perf test to generate a ton of load with itty bitty 
messages, and just watch the replicas drop in and out of sync. We should 
concoct the most brutal case of this and validate that the follower is never 
dropped from the ISR unless it actually falls behind. But we also need to 
check the reverse condition: that both a soft death and a lag are still 
detected. You can cause a soft death by setting the zk session timeout to 
something massive and using unix signals to pause the process. You can cause 
lag by running some commands on one of the followers to eat up all the cpu or 
I/O while a load test is running, until the follower falls behind. Both cases 
should get caught.
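
To make the three cases above concrete, here is a minimal Python sketch. It is 
not Kafka code and not the patch itself: the function name, the tick model, 
and the thresholds are all hypothetical, and it assumes the time-based rule is 
"drop a follower only if it has not been fully caught up for longer than a 
time limit". It counts how often a message-count rule and a time-based rule 
would each flag the follower as out of sync.

```python
def simulate(produce_per_tick, fetch_per_tick, ticks, pause_from=None,
             tick_ms=100, max_lag_messages=4000, max_lag_ms=10_000):
    """Toy model of one leader and one follower (all values illustrative).

    Each tick the leader appends a burst of messages, the in-sync check
    runs, and then the follower fetches (unless it is paused).
    Returns (ticks flagged by message-count rule, ticks flagged by time rule).
    """
    leader_end = 0          # leader's log end offset
    follower_end = 0        # follower's log end offset
    last_caught_up_ms = 0   # last time the follower had read everything
    flagged_by_messages = 0
    flagged_by_time = 0

    for tick in range(ticks):
        now_ms = tick * tick_ms
        leader_end += produce_per_tick

        # The check runs right after the burst, before the next fetch.
        if leader_end - follower_end > max_lag_messages:
            flagged_by_messages += 1
        if now_ms - last_caught_up_ms > max_lag_ms:
            flagged_by_time += 1

        # A paused follower (e.g. a stopped process) fetches nothing.
        paused = pause_from is not None and tick >= pause_from
        if not paused:
            follower_end = min(leader_end, follower_end + fetch_per_tick)
            if follower_end == leader_end:
                last_caught_up_ms = now_ms

    return flagged_by_messages, flagged_by_time


# Healthy follower under bursty load: it catches up on every fetch.
healthy = simulate(produce_per_tick=5000, fetch_per_tick=10000, ticks=200)
# Soft death: the follower stops fetching from tick 10 onward.
soft_death = simulate(5000, 10000, ticks=200, pause_from=10)
# Real lag: the follower fetches less than the leader produces.
lagging = simulate(5000, 3000, ticks=200)
```

In this toy model the message-count rule flags the healthy follower on every 
tick, because each burst momentarily exceeds the threshold, while the 
time-based rule never flags it. The paused and the lagging follower are both 
flagged by the time-based rule once the time limit has passed.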

Anyhow, awesome job getting this done. I think this is one of the biggest 
stability issues in Kafka right now. The patch lgtm, but it would be good for 
[~junrao] and [~nehanarkhede] to take a look.



> Automate replica lag tuning
> ---------------------------
>
>                 Key: KAFKA-1546
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1546
>             Project: Kafka
>          Issue Type: Improvement
>          Components: replication
>    Affects Versions: 0.8.0, 0.8.1, 0.8.1.1
>            Reporter: Neha Narkhede
>            Assignee: Aditya Auradkar
>              Labels: newbie++
>         Attachments: KAFKA-1546.patch, KAFKA-1546_2015-03-11_18:48:09.patch
>
>
> Currently, there is no good way to tune the replica lag configs to 
> automatically account for high and low volume topics on the same cluster. 
> For a low-volume topic it takes a very long time to detect a lagging
> replica, and for a high-volume topic the check produces false positives.
> One approach to making this easier would be to have the configuration
> be something like replica.lag.max.ms and translate this into a number
> of messages dynamically based on the throughput of the partition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
