I’ll start with the code if you consider this feature could be useful. What do you think?

Regarding the code, I’ll use the same classes dump-logs uses for reading records and add the logic on top of them.
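To make that concrete, here is a rough, hypothetical sketch of the reading part (not actual KIP code). It only relies on FileRecords / RecordBatch from org.apache.kafka.common.record, which the dump-logs tool also relies on; the class name SegmentSampler and the 2 MB sample size are just illustrative.

import java.io.File;
import java.io.IOException;

import org.apache.kafka.common.record.FileRecords;
import org.apache.kafka.common.record.Record;
import org.apache.kafka.common.record.RecordBatch;

// Hypothetical sampler: reads only the first few MB of a segment so the check stays cheap.
public class SegmentSampler {

    static final int SAMPLE_BYTES = 2 * 1024 * 1024; // "a few MB" sample, illustrative value

    public static void main(String[] args) throws IOException {
        try (FileRecords segment = FileRecords.open(new File(args[0]))) {
            int sampledBytes = 0;
            for (RecordBatch batch : segment.batches()) {
                if (sampledBytes >= SAMPLE_BYTES)
                    break;
                sampledBytes += batch.sizeInBytes();
                // Per batch: codec, record count and timestamps, i.e. the same fields dump-logs prints.
                System.out.printf("offset=%d codec=%s records=%s maxTimestamp=%d size=%dB%n",
                        batch.baseOffset(), batch.compressionType(), batch.countOrNull(),
                        batch.maxTimestamp(), batch.sizeInBytes());
                for (Record record : batch) {
                    // record.timestamp() is the per-record CreateTime used later for the batching check.
                    System.out.println("  CreateTime=" + record.timestamp());
                }
            }
        }
    }
}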
Best regards,
Sergio Troiano

Sent from my iPhone

> On 16 May 2022, at 20:48, Sergio Daniel Troiano <sergio.troi...@adevinta.com> wrote:
>
> Hi Divij,
>
> First of all, thanks for your time and dedication.
>
> About point one:
> You are right, the idea is to have "in real time" visibility of the way the clients are using the service, as that translates into a lot of money saved.
> I agree with the further vision, although I think we are still far away from it :)
>
> About the resource usage, my idea is to be completely non-invasive, so taking a few MB of samples once every few hours will be more than enough to understand the produce pattern; in this case the CPU usage is only a cost for the producer and consumer.
> It is worth mentioning that the additional 3% of CPU usage while producing is negligible compared to the gain of batching and compression, but maybe that discussion is not related to this KIP; that is a decision between the cluster admin and the clients.
>
> About the "auto tuning", that is a great idea. Again, I think it is very ambitious for the scope of this KIP, but if the core of this is properly done then it can be used in the future.
>
> About point two:
> Below are the benefits of batching and compression:
> - Reduction of network bandwidth while data is produced.
> - Reduction of disk usage to store the data, and less IO to read and write the segments (assuming the message format does not have to be converted).
> - Reduction of network traffic while data is replicated.
> - Reduction of network traffic while the data is consumed.
>
> The script I propose will output the percentage of network traffic reduction plus the disk space saved.
> - Batching will be recommended based on the parameters $batching-window-time (ms) and $min-records-for-batching; the idea is to check the CreateTime of each batch. Let's suppose we use:
>
> batching-window-time = 300
> min-records-for-batching = 30
>
> * This means we want to check whether we can batch together at least 30 records within 300 ms; these records could currently be in 2 batches or in 30 (one record per batch).
> * If the batching is achievable, then we jump to the next check and simulate the compression, even if compression is already applied, as batching more data will improve the compression ratio.
> * Finally, the payload (a few MB) is brought into memory in order to get its current size; then it is compressed and the difference is calculated.
>
> As a side note, I think that if the classes are properly designed, this can be reused in the future for a more "automagic" way of usage. Again, I really like the idea of allowing the cluster to configure the producers (maybe the producer could have a parameter to allow this).
>
> I did not go into details about the code, as I would like to know first whether the idea is worth it. I use this "solution" in the company I work for, and it has saved us a lot of money. For now we just get the output of the dump-logs.sh script in order to see the CreateTime and the compression type; this is a first step, but we can't yet simulate the compression.
> So for now we reach out to our clients saying "there is a potential benefit of cost reduction if you apply these changes in the producer".
>
> I hope this helps, please feel free to add more feedback.
>
> Best regards,
> Sergio Troiano
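(For reference, the two checks described in the quoted mail above could look roughly like the hypothetical sketch below. BatchingSimulator, the constant names and the use of GZIP as a stand-in codec are illustrative only, not the actual KIP code; the inputs are the per-record CreateTime values and the concatenated record values taken from the few-MB sample.)

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.GZIPOutputStream;

public class BatchingSimulator {

    static final long BATCHING_WINDOW_MS = 300; // $batching-window-time from the example above
    static final int MIN_RECORDS = 30;          // $min-records-for-batching from the example above

    // Check 1: is there any 300 ms window containing at least 30 records,
    // regardless of how many batches they are currently spread across?
    public static boolean batchingAchievable(List<Long> createTimes) {
        List<Long> sorted = new ArrayList<>(createTimes);
        Collections.sort(sorted);
        for (int i = 0; i + MIN_RECORDS - 1 < sorted.size(); i++) {
            if (sorted.get(i + MIN_RECORDS - 1) - sorted.get(i) <= BATCHING_WINDOW_MS)
                return true;
        }
        return false;
    }

    // Check 2: (re)compress the sampled payload and report the size difference as a
    // percentage; GZIP here is only a placeholder for whatever codec would be simulated.
    public static double estimatedSavingPercent(byte[] payload) throws IOException {
        if (payload.length == 0)
            return 0.0;
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
            gzip.write(payload);
        }
        return 100.0 * (payload.length - compressed.size()) / payload.length;
    }
}

If batchingAchievable(...) returns true, the script would then report estimatedSavingPercent(...) as the estimated network and disk reduction.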
>
> On Mon, 16 May 2022 at 10:35, Divij Vaidya <divijvaidy...@gmail.com> wrote:
>> Thank you for the KIP, Sergio.
>>
>> High level thoughts:
>> 1\ I understand that the idea here is to provide better visibility to the admins about potential improvements using compression and modifying batch size. I would take it a step further and say that we should be providing this visibility in a programmatic, push-based manner and make this system generic enough so that adding new "optimization rules" in the future is seamless. Perhaps, have a "diagnostic" mode in the cluster, which can be dynamically enabled. In such a mode, the cluster would run a set of "optimization" rules (at the cost of additional CPU cycles). One such rule would be the compression rule you mentioned in your KIP. At the end of the diagnostic run, the generated report would contain a set of recommendations. To begin with, we can introduce this "diagnostic" as a one-time run by the admin and later enhance it further to be triggered periodically in the cluster automatically (with results being published via existing metric libraries). Even further down the line, this could lead to "auto-tuning" producer libraries based on recommendations from the server.
>>
>> KIP implementation specific comments/questions:
>> 2\ Can you please add the algorithm that would be used to determine whether compression is recommended or not? I am assuming that the algorithm would take into account the factors impacting compression optimization, such as CPU utilization, network bandwidth, decompression cost for the consumers, etc.
>> 3\ Can you please add the algorithm that would be used to determine whether batching is recommended?
>>
>> Divij Vaidya
>>
>> On Mon, May 16, 2022 at 8:42 AM Sergio Daniel Troiano <sergio.troi...@adevinta.com.invalid> wrote:
>>
>> > Hey guys!
>> >
>> > I would like to start an early discussion on this:
>> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-838+Simulate+batching+and+compression
>> >
>> > Thanks!