Haruki Okada created KAFKA-10690:
------------------------------------
Summary: Produce-response delay caused by lagging replica fetch
which blocks in-sync one
Key: KAFKA-10690
URL: https://issues.apache.org/jira/browse/KAFKA-10690
Project: Kafka
Issue Type: Improvement
Components: core
Affects Versions: 2.4.1
Reporter: Haruki Okada
Attachments: image-2020-11-06-11-15-21-781.png,
image-2020-11-06-11-15-38-390.png, image-2020-11-06-11-17-09-910.png
h2. Our environment
* Kafka version: 2.4.1
h2. Phenomenon
* 99th-percentile produce response time (remote scope) degraded to 500ms,
which is 20 times worse than usual
** At the time, the cluster was running a replica reassignment to bring a new
machine into service, in order to recover the replicas that had been held by a
broker machine that failed (hardware issue)
!image-2020-11-06-11-15-21-781.png|width=292,height=166!
h2. Analysis
Let's define:
* broker-X: The broker on which we observed the produce latency degradation
* broker-Y: The broker being brought into service
broker-Y was catching up replicas of the following partitions:
* partition-A: has a relatively small log size
* partition-B: has a large log size
(Actually, broker-Y was catching up many other partitions as well. I mention
only two partitions here to keep the explanation simple.)
broker-X was the leader for both partition-A and partition-B.
We found that both partition-A and partition-B were assigned to the same
ReplicaFetcherThread on broker-Y, and that produce latency started to degrade
right after broker-Y finished catching up partition-A.
!image-2020-11-06-11-17-09-910.png|width=476,height=174!
In addition, we observed disk reads on broker-X while broker-Y was being
brought into service. (This is expected, since old segments are unlikely to be
in the page cache.)
!image-2020-11-06-11-15-38-390.png|width=292,height=193!
So we suspected that:
* The in-sync replica fetch (partition-A) was held up by the lagging replica
fetch (partition-B), which is slow because it causes actual disk reads
** Since ReplicaFetcherThread sends fetch requests in a blocking manner, the
next fetch request can't be sent until the current one completes
** => This delays the in-sync replica fetch for every partition assigned to
the same replica fetcher thread
** => This in turn degrades remote-scope produce latency
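To make the suspected mechanism concrete, here is a minimal Python sketch of a
single blocking fetcher thread. It is a toy model, not Kafka code: the function
names are hypothetical, the read costs (1 ms from page cache, 50 ms from disk)
are made-up numbers, and it assumes the leader serves the partitions of a
request one after another before responding.

```python
# Toy model of one blocking fetcher thread; not Kafka code.
# All read costs are hypothetical numbers chosen for illustration.

def fetch_round_ms(read_cost_ms):
    """Time for one fetch round, assuming the leader serves the partitions
    in the request one after another before responding."""
    return sum(read_cost_ms.values())


def fetch_start_times_ms(read_cost_ms, rounds):
    """Start times of successive fetch rounds when the next request is
    sent only after the previous response has arrived."""
    starts, now = [], 0
    for _ in range(rounds):
        starts.append(now)
        now += fetch_round_ms(read_cost_ms)
    return starts


# partition-A alone: served from page cache (assumed 1 ms per fetch).
in_sync_only = {"partition-A": 1}
# partition-A shares the thread with partition-B, whose old segments
# need disk reads (assumed 50 ms per fetch).
with_lagging = {"partition-A": 1, "partition-B": 50}

print(fetch_start_times_ms(in_sync_only, 4))   # [0, 1, 2, 3]
print(fetch_start_times_ms(with_lagging, 4))   # [0, 51, 102, 153]
```

In this model, partition-A is fetched every 1 ms when it is alone on the
thread, but only every 51 ms once it shares the thread with partition-B, even
though partition-A itself is fully caught up.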
h2. Possible fix
We think this issue can be addressed by dedicating some of the
ReplicaFetcherThreads (or creating a separate thread pool) to catching up
lagging replicas, but we are not sure this is the appropriate approach.
Please share your opinions on this issue.
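As a rough illustration of the idea only, the following Python sketch routes
partitions to either the normal fetcher pool or a separate catch-up pool based
on their lag. Everything here is hypothetical: the MAX_IN_SYNC_LAG threshold,
the function names, and the lag and read-cost numbers are assumptions for the
example, not existing Kafka configuration or code.

```python
# Hypothetical sketch of the proposed split; not Kafka code.
# MAX_IN_SYNC_LAG and the per-partition numbers are made up.

MAX_IN_SYNC_LAG = 1000  # messages; assumed threshold for "lagging"


def split_by_lag(lag_by_partition, max_lag=MAX_IN_SYNC_LAG):
    """Return (partitions for the normal pool, partitions for the catch-up pool)."""
    in_sync = sorted(p for p, lag in lag_by_partition.items() if lag <= max_lag)
    lagging = sorted(p for p, lag in lag_by_partition.items() if lag > max_lag)
    return in_sync, lagging


def fetch_round_ms(partitions, read_cost_ms):
    """Time for one fetch round covering only the given partitions."""
    return sum(read_cost_ms[p] for p in partitions)


lag = {"partition-A": 0, "partition-B": 5_000_000}
read_cost_ms = {"partition-A": 1, "partition-B": 50}

in_sync, lagging = split_by_lag(lag)
print(in_sync, fetch_round_ms(in_sync, read_cost_ms))   # ['partition-A'] 1
print(lagging, fetch_round_ms(lagging, read_cost_ms))   # ['partition-B'] 50
```

In this model, partition-A's fetch round no longer includes partition-B's disk
read. The sketch says nothing about contention on the leader's disk or request
handler threads, which a real change would also need to consider.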