Re: [DISCUSS] KIP-928: Making Kafka resilient to log directories becoming full

Christo Lolov Wed, 07 Jun 2023 07:07:58 -0700

Hey Colin,

I tried the following setup:


* Create 3 EC2 machines.
* EC2 machine named A acts as a KRaft Controller.
* EC2 machine named B acts as a KRaft Broker. (The only configurations
that differ from the default values: log.retention.ms=30000,
log.segment.bytes=1048576, log.retention.check.interval.ms=30000,
leader.imbalance.check.interval.seconds=30)
* EC2 machine named C acts as a Producer.
* I attached a 1 GB EBS volume to the EC2 machine B (Broker) and configured
log.dirs to point to it.
* I filled 995 MB of that EBS volume using fallocate.
* I created a topic with 6 partitions and a replication factor of 1.
* From the Producer machine I used `~/kafka/bin/kafka-producer-perf-test.sh
--producer.config ~/kafka/config/client.properties --topic batman
--record-size 524288 --throughput 5 --num-records 150`. The disk on EC2
machine B filled up and the broker shut down. I stopped the producer.
* I stopped the controller on EC2 machine A and restarted it as both a
controller and a broker (I need this because I cannot communicate
directly with a controller; see KIP-919:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-919%3A+Allow+AdminClient+to+Talk+Directly+with+the+KRaft+Controller+Quorum
).
* I deleted the topic to which I had been writing by using kafka-topics.sh.
* I started the broker on EC2 machine B and it failed due to no space left
on disk during its recovery process. The topic was not deleted from the
disk.
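
As a rough back-of-the-envelope check of the steps above, here is a small
Python sketch of how quickly the volume fills. It rests on assumptions that
were not measured: the volume is treated as 1024 MiB and the fallocate'd file
as 995 MiB, filesystem overhead and Kafka's index files are ignored, and the
function name seconds_until_full is made up for illustration.

```python
MIB = 1024 * 1024


def seconds_until_full(volume_bytes, prefilled_bytes, record_bytes, records_per_sec):
    """Estimate how long a steady producer takes to fill the remaining space.

    Assumption: every produced byte lands on this one volume and nothing
    is deleted in the meantime.
    """
    free = volume_bytes - prefilled_bytes
    write_rate = record_bytes * records_per_sec  # bytes per second
    return free / write_rate


# Values taken from the setup; the MiB interpretation is an assumption.
volume_bytes = 1024 * MIB      # 1 GB EBS volume
prefilled_bytes = 995 * MIB    # space taken by fallocate
record_bytes = 524288          # --record-size
records_per_sec = 5            # --throughput
num_records = 150              # --num-records

free_bytes = volume_bytes - prefilled_bytes
total_payload = record_bytes * num_records
fill_seconds = seconds_until_full(
    volume_bytes, prefilled_bytes, record_bytes, records_per_sec
)

print(f"free space:    {free_bytes / MIB:.1f} MiB")
print(f"total payload: {total_payload / MIB:.1f} MiB")
print(f"time to fill:  {fill_seconds:.1f} s")
```

Under those assumptions the producer's total payload is larger than the free
space, and the volume fills in less time than the 30 second
log.retention.check.interval.ms, which is consistent with the broker shutting
down before retention could reclaim anything.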

As such, I am not convinced that KRaft addresses the problem of deleting
topics on startup when there is no space left on the disk. Is there
something wrong with my setup, or a step you disagree with? I think this
will continue to be the case even when JBOD + KRaft is implemented.

Let me know your thoughts!

Best,
Christo

On Mon, 5 Jun 2023 at 11:03, Christo Lolov <christolo...@gmail.com> wrote:

> Hey Colin,
>
> Thanks for the review!
>
> I am also skeptical that much space can be reclaimed via compaction as
> detailed in the limitations section of the KIP.
>
> In my head there are two ways to get out of the saturated state -
> configure more aggressive retention and delete topics. I wasn't aware that
> KRaft deletes topics marked for deletion on startup if the disks occupied
> by those partitions are full - I will check it out, thank you for the
> information! On the retention side, I believe there is still a benefit in
> keeping the broker up and responsive - in my experience, people first try
> to reduce the data they have, and only when that does not work are they
> okay with sacrificing all of the data.
>
> Let me know your thoughts!
>
> Best,
> Christo
>
> On Fri, 2 Jun 2023 at 20:09, Colin McCabe <cmcc...@apache.org> wrote:
>
>> Hi Christo,
>>
>> We're not adding new stuff to ZK at this point (it's deprecated), so it
>> would be good to drop that from the design.
>>
>> With regard to the "saturated" state: I'm skeptical that compaction could
>> really move the needle much in terms of freeing up space -- in most
>> workloads I've seen, it wouldn't. Compaction also requires free space to
>> function.
>>
>> So the main benefit of the "saturated" state seems to be enabling deletion
>> on full disks. But KRaft mode already has most of that benefit. Full disks
>> (or, indeed, downed brokers) don't block deletion on KRaft. If you delete a
>> topic and then bounce the broker that had the disk full, it will delete the
>> topic directory on startup as part of its snapshot load process.
>>
>> So I'm not sure if we really need this. Maybe we should re-evaluate once
>> we have JBOD + KRaft.
>>
>> best,
>> Colin
>>
>>
>> On Mon, May 22, 2023, at 02:23, Christo Lolov wrote:
>> > Hello all!
>> >
>> > I would like to start a discussion on KIP-928: Making Kafka resilient to
>> > log directories becoming full, which can be found at
>> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-928%3A+Making+Kafka+resilient+to+log+directories+becoming+full
>> >
>> > In summary, I frequently run into problems where Kafka becomes
>> > unresponsive when the disks backing its log directories become full.
>> > Such unresponsiveness generally requires intervention outside of Kafka.
>> > I have found it to be a significantly nicer experience when Kafka
>> > maintains control plane operations and allows you to free up space.
>> >
>> > I am interested in your thoughts and any suggestions for improving the
>> > proposal!
>> >
>> > Best,
>> > Christo
>>
>
