[ https://issues.apache.org/jira/browse/KAFKA-13866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17608609#comment-17608609 ]
Nikolay Izhikov commented on KAFKA-13866:
-----------------------------------------

Hello, [~mjsax]

Could you please share your feedback on the KIP?

https://cwiki.apache.org/confluence/display/KAFKA/KIP-870%3A+Retention+policy+based+on+record+event+time
https://lists.apache.org/thread/9njcnjd231l3l7xv121os99m3f7gggb3

> Support more advanced time retention policies
> ---------------------------------------------
>
>                 Key: KAFKA-13866
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13866
>             Project: Kafka
>          Issue Type: Improvement
>          Components: config, core, log cleaner
>            Reporter: Matthias J. Sax
>            Assignee: Nikolay Izhikov
>            Priority: Major
>              Labels: needs-kip
>
> Time-based retention policy compares the record timestamp to broker
> wall-clock time. Those semantics are questionable and also lead to issues for
> data reprocessing: if one wants to re-process data older than the retention
> time, it's not possible, as the broker expires those records aggressively and
> the user needs to increase the retention time accordingly.
> Especially for Kafka Streams, we have seen many cases where users got bitten
> by the current behavior.
> It would be best if Kafka tracked _two_ timestamps per record: the
> record event-time (as the broker does currently), plus the log append-time
> (which is currently only tracked if the topic is configured with
> "append-time" tracking, but the issue is that it overwrites the
> producer-provided record event-time).
> Tracking both timestamps would allow setting a pure wall-clock time retention
> policy plus a pure event-time retention policy:
> * Wall-clock time: keep (at least) the data X days after writing
> * Event-time: keep (at max) X days' worth of event-time data
> Comparing wall-clock time to wall-clock time and event-time to event-time
> provides much cleaner semantics. The idea is to combine both policies and
> only expire data if both policies trigger.
> For the event-time policy, the broker would need to track "stream time" as
> the max event-timestamp it has seen per partition (similar to how Kafka
> Streams tracks "stream time" client side).
> Note the difference between "at least" and "at max" above: for the
> data-reprocessing case, the max-based event-time policy prevents the
> broker from keeping a huge history.
> The details of how wall-clock/event-time and min/max policies could be
> combined would be part of a KIP discussion. For example, it might also be
> useful to have the following policy: keep at least X days' worth of event-time
> history no matter how long the data has already been stored (i.e., there
> would only be an event-time-based expiration but no wall-clock time
> expiration). It could also be combined with a wall-clock time expiration:
> delete data only after it's at least X days old and has been stored for at
> least Y days.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
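
For illustration only, the combined policy from the issue description quoted above (expire a record only if both the wall-clock policy and the event-time policy trigger, with "stream time" tracked as the max event timestamp seen per partition) might be sketched roughly as follows. This is not Kafka code and not the KIP-870 design; the class and method names (Record, PartitionRetention, observe, should_expire) are hypothetical, and it assumes the broker stores both an event timestamp and an append timestamp per record.

```python
from dataclasses import dataclass
from typing import Optional

DAY_MS = 24 * 60 * 60 * 1000


@dataclass(frozen=True)
class Record:
    event_time_ms: int   # producer-provided event timestamp
    append_time_ms: int  # broker wall-clock time when the record was appended


class PartitionRetention:
    """Hypothetical per-partition retention check (a sketch, not Kafka's implementation)."""

    def __init__(self, wall_clock_retention_ms: int, event_time_retention_ms: int):
        self.wall_clock_retention_ms = wall_clock_retention_ms
        self.event_time_retention_ms = event_time_retention_ms
        # "Stream time": the max event timestamp seen so far in this partition.
        self.stream_time_ms: Optional[int] = None

    def observe(self, record: Record) -> None:
        # Stream time only moves forward; out-of-order records never rewind it.
        if self.stream_time_ms is None or record.event_time_ms > self.stream_time_ms:
            self.stream_time_ms = record.event_time_ms

    def wall_clock_expired(self, record: Record, now_ms: int) -> bool:
        # Wall-clock time is compared to wall-clock time only.
        return now_ms - record.append_time_ms > self.wall_clock_retention_ms

    def event_time_expired(self, record: Record) -> bool:
        # Event time is compared to event time (stream time) only.
        if self.stream_time_ms is None:
            return False
        return self.stream_time_ms - record.event_time_ms > self.event_time_retention_ms

    def should_expire(self, record: Record, now_ms: int) -> bool:
        # Combined policy: expire only if both policies trigger.
        return self.wall_clock_expired(record, now_ms) and self.event_time_expired(record)
```

In the reprocessing case, a record with an old event time that was appended just now is outside the event-time window but still inside the wall-clock window, so should_expire returns False until the wall-clock retention has also elapsed.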