Kevin Fletcher created KAFKA-17666:
--------------------------------------
Summary: Kafka doesn't monitor disk space after it detects it is
full
Key: KAFKA-17666
URL: https://issues.apache.org/jira/browse/KAFKA-17666
Project: Kafka
Issue Type: Bug
Reporter: Kevin Fletcher
Scenario: A Kafka data volume becomes full (100%). After space is freed up,
Kafka continues to ignore this disk until the broker is restarted. It does not
detect that space has been freed and resume using the disk on its own.
{code:java}
Sep 12 17:40:39 kq-2b.was2.sd.com kafka-server-start[1696]: Stopping serving
replicas in dir /opt/clusterone/kafka/disk4 (kafka.server.ReplicaManager)
...
Sep 12 17:40:39 kq-2b.was2.sd.com kafka-server-start[1696]:
java.io.IOException: No space left on device {code}
Example showing disk4 became full:
{code:java}
$ df -h | grep kafka
/dev/nvme2n1 400G 256G 145G 64% /opt/clusterone/kafka/disk3
/dev/nvme4n1 400G 275G 126G 69% /opt/clusterone/kafka/disk1
/dev/nvme3n1 400G 229G 172G 58% /opt/clusterone/kafka/disk2
/dev/nvme1n1 400G 400G 0G 100% /opt/clusterone/kafka/disk4 {code}
This topic has 128 partitions with 2 replicas each, spread across 8 KQ brokers.
Each broker has partitions spread across 4 log dirs (4 disks). Here is disk4:
{code:java}
/usr/bin/kafka-log-dirs --bootstrap-server=localhost:9092 --describe \
--topic-list txn | tail -1 | jq
{
"logDir": "/opt/clusterone/kafka/disk4",
"error": null,
"partitions": [
{
"partition": "txn-49",
"size": 5676453238,
"offsetLag": 0,
"isFuture": false
},
{
"partition": "txn-84",
"size": 5616346237,
"offsetLag": 0,
"isFuture": false
},
{
"partition": "txn-52",
"size": 5587352418,
"offsetLag": 0,
"isFuture": false
},
{
"partition": "txn-36",
"size": 5559175359,
"offsetLag": 0,
"isFuture": false
},
{
"partition": "txn-116",
"size": 5532912024,
"offsetLag": 0,
"isFuture": false
},
{
"partition": "txn-105",
"size": 5525176032,
"offsetLag": 0,
"isFuture": false
},
{
"partition": "txn-76",
"size": 5429519389,
"offsetLag": 0,
"isFuture": false
},
{
"partition": "txn-119",
"size": 5632860112,
"offsetLag": 0,
"isFuture": false
}
]
}, {code}
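To put the listing above in perspective, the per-partition sizes can be totaled.
The following is a minimal Python sketch, assuming the JSON object for one log
dir (as printed by the command above, without the trailing comma) is available
as a string. The function name total_partition_bytes is hypothetical and not
part of any Kafka tooling.

```python
import json


def total_partition_bytes(log_dir_json: str) -> int:
    """Sum the "size" fields of all partitions in one logDir entry."""
    entry = json.loads(log_dir_json)
    return sum(p["size"] for p in entry["partitions"])


# The eight txn partitions reported for disk4 in the listing above.
sample = """
{
  "logDir": "/opt/clusterone/kafka/disk4",
  "error": null,
  "partitions": [
    {"partition": "txn-49",  "size": 5676453238, "offsetLag": 0, "isFuture": false},
    {"partition": "txn-84",  "size": 5616346237, "offsetLag": 0, "isFuture": false},
    {"partition": "txn-52",  "size": 5587352418, "offsetLag": 0, "isFuture": false},
    {"partition": "txn-36",  "size": 5559175359, "offsetLag": 0, "isFuture": false},
    {"partition": "txn-116", "size": 5532912024, "offsetLag": 0, "isFuture": false},
    {"partition": "txn-105", "size": 5525176032, "offsetLag": 0, "isFuture": false},
    {"partition": "txn-76",  "size": 5429519389, "offsetLag": 0, "isFuture": false},
    {"partition": "txn-119", "size": 5632860112, "offsetLag": 0, "isFuture": false}
  ]
}
"""

total = total_partition_bytes(sample)
print(f"{total} bytes ({total / 1e9:.1f} GB)")  # prints "44559794809 bytes (44.6 GB)"
```

For the eight partitions listed, this comes to roughly 44.6 GB of txn data on
disk4.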
Issue 1: After freeing up disk space (or growing the volume online), Kafka
never logs any further messages about disk4; it ignores the disk until the
broker is restarted.
It would be ideal if Kafka could periodically check back on this disk and
resume using it once space has been freed.
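The kind of periodic check requested here could look roughly like the Python
sketch below. It is an illustration only: it runs outside the broker, it does
not bring a log dir back online, and it is not how Kafka handles log dirs
today. The function names, the polling interval, and the free-space threshold
are all assumptions.

```python
import shutil
import time
from typing import Callable, Optional


def has_free_space(path: str, min_free_bytes: int) -> bool:
    """Return True if the filesystem holding `path` has at least min_free_bytes free."""
    return shutil.disk_usage(path).free >= min_free_bytes


def wait_for_free_space(
    path: str,
    min_free_bytes: int,
    interval_s: float = 60.0,
    max_checks: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `path` until enough space is free.

    Returns True as soon as a check succeeds, or False after max_checks
    failed checks. With max_checks=None the loop polls indefinitely.
    """
    checks = 0
    while True:
        if has_free_space(path, min_free_bytes):
            return True
        checks += 1
        if max_checks is not None and checks >= max_checks:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # "." stands in for a log dir such as /opt/clusterone/kafka/disk4;
    # the 1 GiB threshold is an arbitrary example value.
    if wait_for_free_space(".", 1024**3, interval_s=5.0, max_checks=3):
        print("space is available again")
    else:
        print("still below the free-space threshold")
```

In a broker-side implementation, the point where this sketch returns True is
where the offline log dir would be considered for re-use.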
Issue 2: When retention is shortened on a live cluster (via retention.ms, for
example), Kafka does not begin deleting files from the 100% full disk. Instead,
it continues to ignore all activity related to disk4. This forces us to
manually delete files from the disk (less than ideal).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)