Hey Colin, I tried the following setup:
* Create 3 EC2 machines.
* EC2 machine named A acts as a KRaft Controller.
* EC2 machine named B acts as a KRaft Broker. (The only configurations different to the default values: log.retention.ms=30000, log.segment.bytes=1048576, log.retention.check.interval.ms=30000, leader.imbalance.check.interval.seconds=30)
* EC2 machine named C acts as a Producer.
* I attached 1 GB EBS volume to the EC2 machine B (Broker) and configured the log.dirs to point to it.
* I filled 995 MB of that EBS volume using fallocate.
* I created a topic with 6 partitions and a replication factor of 1.
* From the Producer machine I used `~/kafka/bin/kafka-producer-perf-test.sh --producer.config ~/kafka/config/client.properties --topic batman --record-size 524288 --throughput 5 --num-records 150`. The disk on EC2 machine B filled up and the broker shut down. I stopped the producer.
* I stopped the controller on EC2 machine A. I started the controller to both be a controller and a broker (I need this because I cannot communicate directly with a controller - https://cwiki.apache.org/confluence/display/KAFKA/KIP-919%3A+Allow+AdminClient+to+Talk+Directly+with+the+KRaft+Controller+Quorum ).
* I deleted the topic to which I had been writing by using kafka-topics.sh .
* I started the broker on EC2 machine B and it failed due to no space left on disk during its recovery process. The topic was not deleted from the disk.

As such, I am not convinced that KRaft addresses the problem of deleting topics on startup if there is no space left on the disk - is there something wrong with my setup that you disagree with? I think this will continue to be the case even when JBOD + KRaft is implemented.

Let me know your thoughts!

Best,
Christo

On Mon, 5 Jun 2023 at 11:03, Christo Lolov <christolo...@gmail.com> wrote:

> Hey Colin,
>
> Thanks for the review!
>
> I am also skeptical that much space can be reclaimed via compaction as
> detailed in the limitations section of the KIP.
>
> In my head there are two ways to get out of the saturated state -
> configure more aggressive retention and delete topics. I wasn't aware that
> KRaft deletes topics marked for deletion on startup if the disks occupied
> by those partitions are full - I will check it out, thank you for the
> information! On the retention side, I believe there is still a benefit in
> keeping the broker up and responsive - in my experience, people first try
> to reduce the data they have and only when that also does not work they are
> okay with sacrificing all of the data.
>
> Let me know your thoughts!
>
> Best,
> Christo
>
> On Fri, 2 Jun 2023 at 20:09, Colin McCabe <cmcc...@apache.org> wrote:
>
>> Hi Christo,
>>
>> We're not adding new stuff to ZK at this point (it's deprecated), so it
>> would be good to drop that from the design.
>>
>> With regard to the "saturated" state: I'm skeptical that compaction could
>> really move the needle much in terms of freeing up space -- in most
>> workloads I've seen, it wouldn't. Compaction also requires free space to
>> function as well.
>>
>> So the main benefit of the "saturated" state seems to be enabling deletion
>> on full disks. But KRaft mode already has most of that benefit. Full disks
>> (or, indeed, downed brokers) don't block deletion on KRaft. If you delete a
>> topic and then bounce the broker that had the disk full, it will delete the
>> topic directory on startup as part of its snapshot load process.
>>
>> So I'm not sure if we really need this. Maybe we should re-evaluate once
>> we have JBOD + KRaft.
>>
>> best,
>> Colin
>>
>>
>> On Mon, May 22, 2023, at 02:23, Christo Lolov wrote:
>> > Hello all!
>> >
>> > I would like to start a discussion on KIP-928: Making Kafka resilient to
>> > log directories becoming full which can be found at
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-928%3A+Making+Kafka+resilient+to+log+directories+becoming+full
>> > .
>> >
>> > In summary, I frequently run into problems where Kafka becomes
>> > unresponsive when the disks backing its log directories become full.
>> > Such unresponsiveness generally requires intervention outside of Kafka.
>> > I have found it to be a significantly nicer experience when Kafka
>> > maintains control plane operations and allows you to free up space.
>> >
>> > I am interested in your thoughts and any suggestions for improving the
>> > proposal!
>> >
>> > Best,
>> > Christo
>> >