[ https://issues.apache.org/jira/browse/KAFKA-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14359925#comment-14359925 ]
Aditya Auradkar commented on KAFKA-1546:
----------------------------------------
I ran a bunch of tests on my patch for KAFKA-1546. I started a cluster and used
the PerformanceTest class to throw a ton of load.
1. Verify that the process stays in ISR for a large volume of messages.
Generated lots of load with small messages and very high throughput. I noticed
that the replica did not fall out of ISR. The previous solution would have
fluctuated in and out of ISR.
bin/kafka-run-class.sh org.apache.kafka.clients.tools.ProducerPerformance test
50000000000 100 -1 acks=1 bootstrap.servers=localhost:9092
buffer.memory=67108864 batch.size=8196
2. Stuck follower - Generated some load and paused the follower process using
SIGSTOP. I raised the zk session timeout so the process stayed registered with
ZK but did not send a fetch request for 'n' seconds. This threw it out of ISR
as expected.
3. Lagging follower - I was able to do this by reducing the max fetch size
on the follower instance. This made it impossible for the follower to catch up,
causing it to be removed from ISR.
4. I also simulated the case where the follower was down for a long time and
the leader had accumulated a significant amount of data. On starting the
follower, it stayed out of ISR until it caught up to the log end offset.
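For readers following along, the four scenarios above can be reasoned about
with a rough sketch of a time-based ISR check. This is only an illustration
and not the code from the patch: the class ReplicaLagTracker, its methods, and
the maxLagMs parameter are hypothetical names. The assumption is that a
follower counts as in sync only if it was last seen caught up to the leader's
log end offset within maxLagMs.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a time-based ISR check; not the code from the patch. */
public final class ReplicaLagTracker {
    // Hypothetical stand-in for a replica.lag.max.ms style setting.
    private final long maxLagMs;
    // Last time each follower was seen caught up to the leader's log end offset.
    private final Map<Integer, Long> lastCaughtUpMs = new HashMap<>();

    public ReplicaLagTracker(long maxLagMs) {
        this.maxLagMs = maxLagMs;
    }

    /** Record a fetch; the follower counts as caught up only if it asks for the log end offset. */
    public void onFetch(int replicaId, long fetchOffset, long leaderLogEndOffset, long nowMs) {
        if (fetchOffset >= leaderLogEndOffset) {
            lastCaughtUpMs.put(replicaId, nowMs);
        }
    }

    /** A follower that has never caught up, or not recently enough, is out of sync. */
    public boolean isInSync(int replicaId, long nowMs) {
        Long last = lastCaughtUpMs.get(replicaId);
        return last != null && nowMs - last <= maxLagMs;
    }
}
```

Under these assumptions a stuck follower (case 2) and a lagging follower
(case 3) both stop refreshing their caught-up time and drop out after
maxLagMs, while a follower that was down for a long time (case 4) stays out
until its first fetch at the log end offset.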
> Automate replica lag tuning
> ---------------------------
>
> Key: KAFKA-1546
> URL: https://issues.apache.org/jira/browse/KAFKA-1546
> Project: Kafka
> Issue Type: Improvement
> Components: replication
> Affects Versions: 0.8.0, 0.8.1, 0.8.1.1
> Reporter: Neha Narkhede
> Assignee: Aditya Auradkar
> Labels: newbie++
> Fix For: 0.8.3
>
> Attachments: KAFKA-1546.patch, KAFKA-1546_2015-03-11_18:48:09.patch,
> KAFKA-1546_2015-03-12_13:42:01.patch
>
>
> Currently, there is no good way to tune the replica lag configs to
> automatically account for high and low volume topics on the same cluster.
> For the low-volume topic it will take a very long time to detect a lagging
> replica, and for the high-volume topic it will have false-positives.
> One approach to making this easier would be to have the configuration
> be something like replica.lag.max.ms and translate this into a number
> of messages dynamically based on the throughput of the partition.
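
As an illustration of the approach suggested in the description above,
translating a time bound into a message count is a simple multiplication by
the partition's throughput. The sketch below is hypothetical: LagThreshold and
maxLagMessages are made-up names, and the throughput is assumed to be measured
elsewhere in messages per second.

```java
/** Hypothetical sketch: derive a per-partition message lag limit from a time bound. */
public final class LagThreshold {
    private LagThreshold() {}

    /**
     * messagesPerSecond: assumed measured throughput of the partition.
     * replicaLagMaxMs: time bound in the spirit of the replica.lag.max.ms idea.
     * Returns the number of messages the partition receives in that time, rounded up.
     */
    public static long maxLagMessages(double messagesPerSecond, long replicaLagMaxMs) {
        return (long) Math.ceil(messagesPerSecond * replicaLagMaxMs / 1000.0);
    }
}
```

With a 10 second bound, a partition receiving 1000 messages per second would
get a limit of 10000 messages, while one receiving a message every two seconds
would get a limit of 5, so both high and low volume topics are covered by the
same setting.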