[ https://issues.apache.org/jira/browse/KAFKA-16779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847001#comment-17847001 ]
Nicholas Feinberg commented on KAFKA-16779:
-------------------------------------------

No problem.

> Kafka retains logs past specified retention
> -------------------------------------------
>
>                 Key: KAFKA-16779
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16779
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 3.7.0
>            Reporter: Nicholas Feinberg
>            Priority: Major
>              Labels: expiration, retention
>         Attachments: OOM.txt, kafka-20240512.log.gz, kafka-20240514.log.gz, kafka-ooms.png, server.log.2024-05-12.gz, server.log.2024-05-14.gz, state-change.log.2024-05-12.gz, state-change.log.2024-05-14.gz
>
>
> In a Kafka cluster with all topics set to four days of retention or longer (345600000ms), most brokers seem to be retaining six days of data.
>
> This is true even for topics which have high throughput (500MB/s, 50k msgs/s) and thus are regularly rolling new log segments. We observe this unexpectedly high retention both via disk usage statistics and by requesting the oldest available messages from Kafka.
>
> Some of these brokers crashed with an 'mmap failed' error (attached). When those brokers started up again, they returned to the expected four days of retention.
>
> Manually restarting brokers also seems to cause them to return to four days of retention. Demoting and promoting brokers only has this effect on a small part of the data hosted on a broker.
>
> These hosts had ~170GiB of free memory available. We saw no signs of pressure on either system or JVM heap memory before or after they reported this error. Committed memory seems to be around 10%, so this doesn't seem to be an overcommit issue.
>
> This Kafka cluster was upgraded to Kafka 3.7 two weeks ago (April 29th). Prior to the upgrade, it was running on Kafka 2.4.
>
> We last reduced retention for ops on May 7th, after which we restored retention to our default of four days. This was the second time we've temporarily reduced and restored retention since the upgrade. This problem did not manifest the previous time we did so, nor did it manifest on our other Kafka 3.7 clusters.
>
> We are running on AWS [d3en.12xlarge|https://instances.vantage.sh/aws/ec2/d3en.12xlarge] hosts. We have 23 brokers, each with 24 disks. We're running in a JBOD configuration (i.e. unraided).
>
> Since this cluster was upgraded from Kafka 2.4 and since we're using JBOD, we're still using Zookeeper.
>
> Sample broker logs are attached. The 05-12 and 05-14 logs are from separate hosts. Please let me know if I can provide any further information.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
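For context, the retention value quoted in the report (345600000 ms, i.e. Kafka's topic-level `retention.ms` setting) works out to exactly four days. A minimal arithmetic check (not part of the original report) is:

```python
# Sanity-check that the reported retention.ms value equals four days.
RETENTION_MS = 345_600_000  # value quoted in the report

ms_per_day = 24 * 60 * 60 * 1000  # 86,400,000 ms per day
print(RETENTION_MS / ms_per_day)  # → 4.0
```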