Peter Sinoros-Szabo created KAFKA-15688:
-------------------------------------------
Summary: Partition leader election not running when disk IO hangs
Key: KAFKA-15688
URL: https://issues.apache.org/jira/browse/KAFKA-15688
Project: Kafka
Issue Type: Bug
Components: core
Affects Versions: 3.3.2
Reporter: Peter Sinoros-Szabo
We run our Kafka brokers on AWS EC2 nodes, using AWS EBS volumes as the disks
that store the messages.
Recently we had an issue where EBS disk IO stalled completely, so Kafka was not
able to read from or write to the disk, except for data that was still in the
page cache or that still fit into the page cache before being synced to EBS.
We experienced this issue in a few cases: sometimes partition leaders were
moved to other brokers automatically; in other cases that didn't happen,
which caused Producers to fail when producing messages to that broker.
My expectation in such a case is that Kafka notices the stall and moves the
leaders to other brokers where the partition has in-sync replicas, but as I
mentioned, this didn't always happen.
I know Kafka shuts itself down when it can't write to its disk. That might be
a good solution in this case as well, since it would trigger leader election
automatically.
Is it possible to add a feature to Kafka so that it also shuts down when disk
IO hangs?
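As a rough illustration of the requested behavior, a stall check could look
like the sketch below. It is not Kafka code: the class DiskStallWatchdog, its
method names, and the timeout parameter are all hypothetical. The idea is to
write a small probe file and force it to the device on a separate thread, so
that the caller can give up after a timeout even if the IO call itself never
returns.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical disk stall check; an illustration only, not part of Kafka. */
class DiskStallWatchdog {

    /** Writes one byte to probeFile and forces it to the storage device. */
    static void writeAndSync(Path probeFile) throws IOException {
        try (FileChannel ch = FileChannel.open(probeFile,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ch.write(ByteBuffer.wrap(new byte[] {1}));
            // force() asks the OS to flush to the device, so the probe is not
            // satisfied by the page cache alone.
            ch.force(true);
        }
    }

    /**
     * Runs ioProbe on a daemon thread and returns true only if it finishes
     * without error within timeoutMs. A timeout or an IO error returns false.
     */
    static boolean completesWithin(Callable<Void> ioProbe, long timeoutMs) {
        ExecutorService exec = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "disk-stall-probe");
            t.setDaemon(true); // a probe stuck in IO must not block JVM exit
            return t;
        });
        Future<Void> result = exec.submit(ioProbe);
        try {
            result.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false; // probe did not return in time: treat as a stall
        } catch (ExecutionException e) {
            return false; // probe failed with an IO error
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            exec.shutdownNow();
        }
    }
}
```

A process could call completesWithin(() -> { writeAndSync(probe); return null; },
timeoutMs) periodically against a file in each log directory and shut down
when it returns false. Whether a single missed deadline should be enough, and
what timeout is reasonable, are open design questions; shutdown hooks that
themselves touch the stalled disk might also hang, so a hard halt may be needed.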
I guess a similar issue might happen with other disk subsystems too, or even
with a failing, slow disk.
This scenario can be easily reproduced using AWS FIS.