[
https://issues.apache.org/jira/browse/KAFKA-17666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885976#comment-17885976
]
Kevin Fletcher commented on KAFKA-17666:
----------------------------------------
Great, thanks! Is there another master ticket, or do you want to use this
ticket to track the enhancement?
> Kafka doesn't monitor disk space after it detects it is full
> ------------------------------------------------------------
>
> Key: KAFKA-17666
> URL: https://issues.apache.org/jira/browse/KAFKA-17666
> Project: Kafka
> Issue Type: Bug
> Reporter: Kevin Fletcher
> Priority: Major
>
> Scenario: A Kafka data volume becomes full (100%). Once space is freed up,
> Kafka keeps ignoring this disk until the broker is restarted. It does not
> detect that space has been freed and resume using the disk on its own.
> {code:java}
> Sep 12 17:40:39 kq-2b.was2.sd.com kafka-server-start[1696]: Stopping serving
> replicas in dir /opt/clusterone/kafka/disk4 (kafka.server.ReplicaManager)
> ...
> Sep 12 17:40:39 kq-2b.was2.sd.com kafka-server-start[1696]:
> java.io.IOException: No space left on device {code}
> Example showing disk4 became full:
> {code:java}
> $ df -h | grep kafka
> /dev/nvme2n1 400G 256G 145G 64% /opt/clusterone/kafka/disk3
> /dev/nvme4n1 400G 275G 126G 69% /opt/clusterone/kafka/disk1
> /dev/nvme3n1 400G 229G 172G 58% /opt/clusterone/kafka/disk2
> /dev/nvme1n1 400G 400G 0G 100% /opt/clusterone/kafka/disk4 {code}
> This topic has 128 partitions with 2 replicas each, spread across 8 KQ brokers.
> Each broker spreads its partitions across 4 dirs (4 disks). Here is disk4:
> {code:java}
> /usr/bin/kafka-log-dirs --bootstrap-server=localhost:9092 --describe
> --topic-list txn | tail -1 | jq
> {
> "logDir": "/opt/clusterone/kafka/disk4",
> "error": null,
> "partitions": [
> {
> "partition": "txn-49",
> "size": 5676453238,
> "offsetLag": 0,
> "isFuture": false
> },
> {
> "partition": "txn-84",
> "size": 5616346237,
> "offsetLag": 0,
> "isFuture": false
> },
> {
> "partition": "txn-52",
> "size": 5587352418,
> "offsetLag": 0,
> "isFuture": false
> },
> {
> "partition": "txn-36",
> "size": 5559175359,
> "offsetLag": 0,
> "isFuture": false
> },
> {
> "partition": "txn-116",
> "size": 5532912024,
> "offsetLag": 0,
> "isFuture": false
> },
> {
> "partition": "txn-105",
> "size": 5525176032,
> "offsetLag": 0,
> "isFuture": false
> },
> {
> "partition": "txn-76",
> "size": 5429519389,
> "offsetLag": 0,
> "isFuture": false
> },
> {
> "partition": "txn-119",
> "size": 5632860112,
> "offsetLag": 0,
> "isFuture": false
> }
> ]
> }, {code}
>
> Issue 1: After freeing up disk space (or growing a volume in real time),
> Kafka never logs any further messages about disk4. It simply ignores the disk
> until it is restarted.
> It would be ideal if Kafka could periodically check back on this disk and
> see whether space has been freed so it can continue.
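To illustrate the periodic check requested in Issue 1, here is a minimal sketch in plain Java. It is not Kafka code: the class name OfflineDirMonitor, the minFreeBytes threshold, and the injected free-space function are all hypothetical names chosen for illustration. It only decides when a directory looks usable again. Actually bringing a log dir back online inside the broker would likely involve more work, which this sketch does not attempt.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToLongFunction;

/** Hypothetical helper (not part of Kafka): tracks dirs taken offline for lack of space. */
class OfflineDirMonitor {
    private final Set<String> offlineDirs = ConcurrentHashMap.newKeySet();
    private final long minFreeBytes;                   // assumed recovery threshold
    private final ToLongFunction<String> usableBytes;  // injectable so it can be tested

    OfflineDirMonitor(long minFreeBytes, ToLongFunction<String> usableBytes) {
        this.minFreeBytes = minFreeBytes;
        this.usableBytes = usableBytes;
    }

    /** Convenience constructor that asks the filesystem for usable space. */
    OfflineDirMonitor(long minFreeBytes) {
        this(minFreeBytes, dir -> new File(dir).getUsableSpace());
    }

    /** Record that a dir stopped serving, e.g. after "No space left on device". */
    void markOffline(String dir) {
        offlineDirs.add(dir);
    }

    boolean isOffline(String dir) {
        return offlineDirs.contains(dir);
    }

    /**
     * One polling pass: returns the dirs whose free space is now at or above
     * the threshold and removes them from the offline set.
     */
    List<String> checkOnce() {
        List<String> recovered = new ArrayList<>();
        for (String dir : offlineDirs) {
            if (usableBytes.applyAsLong(dir) >= minFreeBytes) {
                recovered.add(dir);
            }
        }
        offlineDirs.removeAll(recovered);
        return recovered;
    }
}
```

A caller could run checkOnce() on a fixed interval, for example with a ScheduledExecutorService, and log or act on whatever it returns. Using a threshold above zero is a design choice here, so that a disk with only a few free bytes is not treated as recovered.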
> Issue 2: When shrinking retention live (via retention.ms, for example), Kafka
> does not begin to delete files from the 100% full disk. Instead it continues
> to ignore all activity related to disk4. This forces us to manually delete
> files from the disk (less than ideal).
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)