I’ll start with the code if you consider this feature could be useful. What do you think?

Regarding the code, I’ll use the same classes dump-logs uses for reading records and add the logic on top of them.
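To make that concrete, here is a rough, hypothetical sketch of the reading part (not actual KIP code). It only relies on FileRecords / RecordBatch from org.apache.kafka.common.record, which the dump-logs tool also relies on; the class name SegmentSampler and the 2 MB sample size are just illustrative.

import java.io.File;
import java.io.IOException;

import org.apache.kafka.common.record.FileRecords;
import org.apache.kafka.common.record.Record;
import org.apache.kafka.common.record.RecordBatch;

// Hypothetical sampler: reads only the first few MB of a segment so the check stays cheap.
public class SegmentSampler {

    static final int SAMPLE_BYTES = 2 * 1024 * 1024; // "a few MB" sample, illustrative value

    public static void main(String[] args) throws IOException {
        try (FileRecords segment = FileRecords.open(new File(args[0]))) {
            int sampledBytes = 0;
            for (RecordBatch batch : segment.batches()) {
                if (sampledBytes >= SAMPLE_BYTES)
                    break;
                sampledBytes += batch.sizeInBytes();
                // Per batch: codec, record count and timestamps, i.e. the same fields dump-logs prints.
                System.out.printf("offset=%d codec=%s records=%s maxTimestamp=%d size=%dB%n",
                        batch.baseOffset(), batch.compressionType(), batch.countOrNull(),
                        batch.maxTimestamp(), batch.sizeInBytes());
                for (Record record : batch) {
                    // record.timestamp() is the per-record CreateTime used later for the batching check.
                    System.out.println("  CreateTime=" + record.timestamp());
                }
            }
        }
    }
}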
Best regards,
Sergio Troiano

Sent from my iPhone

> On 16 May 2022, at 20:48, Sergio Daniel Troiano <sergio.troi...@adevinta.com> wrote:
>
> Hi Divij,
>
> First of all, thanks for your time and dedication.
>
> About point one:
> You are right, the idea is to have "in real time" visibility of the way the clients are using the service, as that translates into a lot of money saved.
> I agree with the further vision, although I think we are still far away from it :)
>
> About the resource usage, my idea is to be completely non-invasive, so taking a few MB of samples once every few hours will be more than enough to understand the produce pattern; in this case the CPU usage is only a cost for the producer and consumer.
> It is worth mentioning that the additional 3% of CPU usage while producing is negligible compared to the gain of batching and compression, but maybe that discussion is not related to this KIP; that is a decision between the cluster admin and the clients.
>
> About the "auto tuning", that is a great idea. Again, I think it is very ambitious for the scope of this KIP, but if the core of this is properly done then it can be used in the future.
>
> About point two:
> Below are the benefits of batching and compression:
> - Reduction of network bandwidth while data is produced.
> - Reduction of disk usage to store the data, and less IO to read and write the segments (assuming the message format does not have to be converted).
> - Reduction of network traffic while data is replicated.
> - Reduction of network traffic while the data is consumed.
>
> The script I propose will output the percentage of network traffic reduction plus the disk space saved.
> - Batching will be recommended based on the parameters $batching-window-time (ms) and $min-records-for-batching; the idea is to check the CreateTime of each batch. Let's suppose we use:
>
> batching-window-time = 300
> min-records-for-batching = 30
>
> * This means we want to check whether we can batch together at least 30 records within 300 ms; these records could currently be in 2 batches or in 30 (one record per batch).
> * If the batching is achievable, then we jump to the next check and simulate the compression, even if compression is already applied, as batching more data will improve the compression ratio.
> * Finally, the payload (a few MB) is brought into memory in order to get its current size; then it is compressed and the difference is calculated.
>
> As a side note, I think that if the classes are properly designed, this can be reused in the future for a more "automagic" way of usage. Again, I really like the idea of allowing the cluster to configure the producers (maybe the producer could have a parameter to allow this).
>
> I did not go into details about the code, as I would like to know first whether the idea is worth it. I use this "solution" in the company I work for, and it has saved us a lot of money. For now we just get the output of the dump-logs.sh script in order to see the CreateTime and the compression type; this is a first step, but we can't yet simulate the compression.
> So for now we reach out to our clients saying "there is a potential benefit of cost reduction if you apply these changes in the producer".
>
> I hope this helps, please feel free to add more feedback.
>
> Best regards,
> Sergio Troiano
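(For reference, the two checks described in the quoted mail above could look roughly like the hypothetical sketch below. BatchingSimulator, the constant names and the use of GZIP as a stand-in codec are illustrative only, not the actual KIP code; the inputs are the per-record CreateTime values and the concatenated record values taken from the few-MB sample.)

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.GZIPOutputStream;

public class BatchingSimulator {

    static final long BATCHING_WINDOW_MS = 300; // $batching-window-time from the example above
    static final int MIN_RECORDS = 30;          // $min-records-for-batching from the example above

    // Check 1: is there any 300 ms window containing at least 30 records,
    // regardless of how many batches they are currently spread across?
    public static boolean batchingAchievable(List<Long> createTimes) {
        List<Long> sorted = new ArrayList<>(createTimes);
        Collections.sort(sorted);
        for (int i = 0; i + MIN_RECORDS - 1 < sorted.size(); i++) {
            if (sorted.get(i + MIN_RECORDS - 1) - sorted.get(i) <= BATCHING_WINDOW_MS)
                return true;
        }
        return false;
    }

    // Check 2: (re)compress the sampled payload and report the size difference as a
    // percentage; GZIP here is only a placeholder for whatever codec would be simulated.
    public static double estimatedSavingPercent(byte[] payload) throws IOException {
        if (payload.length == 0)
            return 0.0;
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
            gzip.write(payload);
        }
        return 100.0 * (payload.length - compressed.size()) / payload.length;
    }
}

If batchingAchievable(...) returns true, the script would then report estimatedSavingPercent(...) as the estimated network and disk reduction.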
>
> On Mon, 16 May 2022 at 10:35, Divij Vaidya <divijvaidy...@gmail.com> wrote:
>> Thank you for the KIP, Sergio.
>>
>> High level thoughts:
>> 1\ I understand that the idea here is to provide better visibility to the admins about potential improvements using compression and modifying batch size. I would take it a step further and say that we should be providing this visibility in a programmatic, push-based manner and make this system generic enough so that adding new "optimization rules" in the future is seamless. Perhaps, have a "diagnostic" mode in the cluster, which can be dynamically enabled. In such a mode, the cluster would run a set of "optimization" rules (at the cost of additional CPU cycles). One such rule would be the compression rule you mentioned in your KIP. At the end of the diagnostic run, the generated report would contain a set of recommendations. To begin with, we can introduce this "diagnostic" as a one-time run by the admin and later enhance it further to be triggered periodically in the cluster automatically (with results being published via existing metric libraries). Even further down the line, this could lead to "auto-tuning" producer libraries based on recommendations from the server.
>>
>> KIP implementation specific comments/questions:
>> 2\ Can you please add the algorithm that would be used to determine whether compression is recommended or not? I am assuming that the algorithm would take into account the factors impacting compression optimization, such as CPU utilization, network bandwidth, decompression cost for the consumers, etc.
>> 3\ Can you please add the algorithm that would be used to determine whether batching is recommended?
>>
>> Divij Vaidya
>>
>> On Mon, May 16, 2022 at 8:42 AM Sergio Daniel Troiano <sergio.troi...@adevinta.com.invalid> wrote:
>>
>> > Hey guys!
>> >
>> > I would like to start an early discussion on this:
>> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-838+Simulate+batching+and+compression
>> >
>> > Thanks!