Kevin Fletcher created KAFKA-17666:
--------------------------------------

             Summary: Kafka doesn't monitor disk space after it detects it is 
full
                 Key: KAFKA-17666
                 URL: https://issues.apache.org/jira/browse/KAFKA-17666
             Project: Kafka
          Issue Type: Bug
            Reporter: Kevin Fletcher


Scenario: A Kafka data volume becomes full (100%). After space is freed, Kafka 
keeps ignoring the disk until the broker is restarted. It does not detect that 
space is available again and resume using the disk on its own.
{code:java}
Sep 12 17:40:39 kq-2b.was2.sd.com kafka-server-start[1696]: Stopping serving 
replicas in dir /opt/clusterone/kafka/disk4 (kafka.server.ReplicaManager)
...
Sep 12 17:40:39 kq-2b.was2.sd.com kafka-server-start[1696]: 
java.io.IOException: No space left on device {code}
Example showing disk4 became full:
{code:java}
[[email protected] ~]$ df -h | grep kafka
/dev/nvme2n1    400G  256G  145G  64% /opt/clusterone/kafka/disk3
/dev/nvme4n1    400G  275G  126G  69% /opt/clusterone/kafka/disk1
/dev/nvme3n1    400G  229G  172G  58% /opt/clusterone/kafka/disk2
/dev/nvme1n1    400G  400G    0G 100% /opt/clusterone/kafka/disk4 {code}
The topic (txn) has 128 partitions with 2 replicas each, spread across 8 KQ 
brokers.

Each broker spreads its partitions across 4 log dirs (one per disk). Here is 
disk4:
{code:java}
/usr/bin/kafka-log-dirs --bootstrap-server=localhost:9092 --describe --topic-list txn | tail -1 | jq
        {
          "logDir": "/opt/clusterone/kafka/disk4",
          "error": null,
          "partitions": [
            {
              "partition": "txn-49",
              "size": 5676453238,
              "offsetLag": 0,
              "isFuture": false
            },
            {
              "partition": "txn-84",
              "size": 5616346237,
              "offsetLag": 0,
              "isFuture": false
            },
            {
              "partition": "txn-52",
              "size": 5587352418,
              "offsetLag": 0,
              "isFuture": false
            },
            {
              "partition": "txn-36",
              "size": 5559175359,
              "offsetLag": 0,
              "isFuture": false
            },
            {
              "partition": "txn-116",
              "size": 5532912024,
              "offsetLag": 0,
              "isFuture": false
            },
            {
              "partition": "txn-105",
              "size": 5525176032,
              "offsetLag": 0,
              "isFuture": false
            },
            {
              "partition": "txn-76",
              "size": 5429519389,
              "offsetLag": 0,
              "isFuture": false
            },
            {
              "partition": "txn-119",
              "size": 5632860112,
              "offsetLag": 0,
              "isFuture": false
            }
          ]
        }, {code}
 

Issue 1: After disk space is freed (or the volume is grown online), Kafka 
never logs any further messages about disk4. It ignores the directory until 
the broker is restarted.

It would be ideal if Kafka periodically rechecked this disk and resumed using 
it once space has been freed.
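
To illustrate the kind of periodic recheck requested here, below is a minimal sketch in plain Java. It is not Kafka code and does not use any Kafka API: the class name LogDirSpaceMonitor, the minFreeBytes threshold, and the onRecovered callback are all hypothetical, and how a broker would actually bring a log dir back online is left out entirely.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical helper (not part of Kafka): periodically re-probes log dirs
// that were taken offline and reports the ones that have free space again.
public class LogDirSpaceMonitor {
    private final long minFreeBytes;           // assumed threshold for "usable again"
    private final Consumer<Path> onRecovered;  // assumed hook, e.g. to re-enable the dir
    private final Set<Path> offlineDirs = ConcurrentHashMap.newKeySet();

    public LogDirSpaceMonitor(long minFreeBytes, Consumer<Path> onRecovered) {
        this.minFreeBytes = minFreeBytes;
        this.onRecovered = onRecovered;
    }

    // Called when a dir is taken out of service (e.g. after "No space left on device").
    public void markOffline(Path dir) {
        offlineDirs.add(dir);
    }

    // True if the file store backing dir reports at least minFreeBytes usable.
    public static boolean hasFreeSpace(Path dir, long minFreeBytes) {
        try {
            return Files.getFileStore(dir).getUsableSpace() >= minFreeBytes;
        } catch (IOException e) {
            return false; // treat an unreadable dir as still unusable
        }
    }

    // One pass over the offline dirs; each recovered dir is reported exactly once.
    public void checkOnce() {
        for (Path dir : offlineDirs) {
            if (hasFreeSpace(dir, minFreeBytes) && offlineDirs.remove(dir)) {
                onRecovered.accept(dir);
            }
        }
    }

    // Runs checkOnce() every periodSeconds on a background thread.
    public ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(this::checkOnce, periodSeconds, periodSeconds,
                TimeUnit.SECONDS);
        return scheduler;
    }
}
```

In this sketch a caller would invoke markOffline(Paths.get("/opt/clusterone/kafka/disk4")) when the dir fails and start(60) to probe once a minute. The free-space threshold is an assumption meant to avoid flapping when only a few bytes are available.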

Issue 2: When retention is reduced on the running cluster (via retention.ms, 
for example), Kafka does not delete any files from the 100% full disk. It 
continues to ignore all activity related to disk4, which forces us to delete 
files from the disk manually (less than ideal).

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
