[ 
https://issues.apache.org/jira/browse/KAFKA-17666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17885963#comment-17885963
 ] 

Kamal Chandraprakash commented on KAFKA-17666:
----------------------------------------------

[KIP-928|https://cwiki.apache.org/confluence/display/KAFKA/KIP-928%3A+Making+Kafka+resilient+to+log+directories+becoming+full]
 proposes making Kafka resilient to log directories becoming full. It is 
currently in the discussion phase. 

> Kafka doesn't monitor disk space after it detects it is full
> ------------------------------------------------------------
>
>                 Key: KAFKA-17666
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17666
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Kevin Fletcher
>            Priority: Major
>
> Scenario: A Kafka data volume becomes full (100%). After space is freed up, 
> Kafka keeps ignoring this disk until Kafka is restarted. It does not detect 
> that space has been freed, so it does not resume using the disk on its own.
> {code:java}
> Sep 12 17:40:39 kq-2b.was2.sd.com kafka-server-start[1696]: Stopping serving 
> replicas in dir /opt/clusterone/kafka/disk4 (kafka.server.ReplicaManager)
> ...
> Sep 12 17:40:39 kq-2b.was2.sd.com kafka-server-start[1696]: 
> java.io.IOException: No space left on device {code}
> Example showing disk4 became full:
> {code:java}
> [[email protected] ~]$ df -h | grep kafka
> /dev/nvme2n1    400G  256G  145G  64% /opt/clusterone/kafka/disk3
> /dev/nvme4n1    400G  275G  126G  69% /opt/clusterone/kafka/disk1
> /dev/nvme3n1    400G  229G  172G  58% /opt/clusterone/kafka/disk2
> /dev/nvme1n1    400G  400G    0G 100% /opt/clusterone/kafka/disk4 {code}
> This topic has 128 partitions with 2 replicas each, spread across 8 KQ brokers.
> Each broker has partitions spread across 4 dirs (4 disks) - here is disk4:
> {code:java}
> /usr/bin/kafka-log-dirs --bootstrap-server=localhost:9092 --describe 
> --topic-list txn | tail -1 | jq
>         {
>           "logDir": "/opt/clusterone/kafka/disk4",
>           "error": null,
>           "partitions": [
>             {
>               "partition": "txn-49",
>               "size": 5676453238,
>               "offsetLag": 0,
>               "isFuture": false
>             },
>             {
>               "partition": "txn-84",
>               "size": 5616346237,
>               "offsetLag": 0,
>               "isFuture": false
>             },
>             {
>               "partition": "txn-52",
>               "size": 5587352418,
>               "offsetLag": 0,
>               "isFuture": false
>             },
>             {
>               "partition": "txn-36",
>               "size": 5559175359,
>               "offsetLag": 0,
>               "isFuture": false
>             },
>             {
>               "partition": "txn-116",
>               "size": 5532912024,
>               "offsetLag": 0,
>               "isFuture": false
>             },
>             {
>               "partition": "txn-105",
>               "size": 5525176032,
>               "offsetLag": 0,
>               "isFuture": false
>             },
>             {
>               "partition": "txn-76",
>               "size": 5429519389,
>               "offsetLag": 0,
>               "isFuture": false
>             },
>             {
>               "partition": "txn-119",
>               "size": 5632860112,
>               "offsetLag": 0,
>               "isFuture": false
>             }
>           ]
>         }, {code}
>  
> Issue 1: After freeing up disk space (or growing a volume in real-time), 
> Kafka never logs any further messages about disk4; it ignores the disk 
> until it is restarted.
> It would be ideal if Kafka could periodically check back on this disk and 
> see whether space has been freed so it can continue.
> Issue 2: When shrinking retention live (via retention.ms for example), Kafka 
> does not begin to delete files from the 100% full disk; instead, it continues 
> to ignore all activity related to disk4. This forces us to manually delete 
> files from the disk (less than ideal).
>  
>  
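
As a rough illustration of the periodic re-check requested in Issue 1, here is a minimal sketch in plain Java. It is not Kafka code and not what KIP-928 specifies. LogDirSpaceMonitor, markOffline, probeOnce and the minFreeBytes threshold are hypothetical names chosen for this example. It only shows that a directory marked offline can be re-probed on a timer using the JDK's FileStore.getUsableSpace().

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of Kafka): re-probes directories marked offline. */
final class LogDirSpaceMonitor {
    // Directories currently treated as offline, e.g. after "No space left on device".
    private final Set<Path> offlineDirs = ConcurrentHashMap.newKeySet();
    // Assumed threshold: usable bytes required before a directory is cleared.
    private final long minFreeBytes;

    LogDirSpaceMonitor(long minFreeBytes) {
        this.minFreeBytes = minFreeBytes;
    }

    void markOffline(Path dir) {
        offlineDirs.add(dir);
    }

    boolean isOffline(Path dir) {
        return offlineDirs.contains(dir);
    }

    /** Usable bytes on the file store that backs the given directory. */
    static long usableBytes(Path dir) throws IOException {
        return Files.getFileStore(dir).getUsableSpace();
    }

    /** One probe pass: clear every offline directory that has enough usable space. */
    void probeOnce() {
        for (Path dir : offlineDirs) {
            try {
                if (usableBytes(dir) >= minFreeBytes) {
                    // Removal during iteration is safe for a ConcurrentHashMap key set.
                    offlineDirs.remove(dir);
                    System.out.println("Space recovered on " + dir);
                }
            } catch (IOException e) {
                // The directory could not be probed; leave it marked offline.
            }
        }
    }

    /** Runs probeOnce() every periodSeconds on a single background thread. */
    ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(this::probeOnce, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

For example, new LogDirSpaceMonitor(1L << 30).start(60) would re-check once a minute and clear a directory once at least 1 GiB is usable. A real broker would also have to reload the logs in that directory and resume serving its replicas, which this sketch does not attempt.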



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
