stream2000 commented on code in PR #8062: URL: https://github.com/apache/hudi/pull/8062#discussion_r1211260244
########## rfc/rfc-65/rfc-65.md: ########## @@ -0,0 +1,110 @@ +## Proposers +- @stream2000 +- @hujincalrin +- @huberylee +- @YuweiXiao +## Approvers +## Status +JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823) +## Abstract +In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) management mechanism to prevent the dataset from growing infinitely. +This proposal introduces Partition TTL Management policies to hudi, people can config the policies by table config directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them. +## Background +TTL management mechanism is an important feature for databases. Hudi already provides a delete_partition interface to delete outdated partitions. However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means that users need to define and implement some kind of TTL policies and maintain proper statistics to find expired partitions by themself. As the scale of installations grew, it's more important to implement a user-friendly TTL management mechanism for hudi. +## Implementation +There are 3 components to implement Partition TTL Management + +- TTL policy definition & storage +- Partition statistics for TTL management +- Appling policies +### TTL Policy Definition +We have three main considerations when designing TTL policy: + +1. User hopes to manage partition TTL not only by expired time but also by sub-partitions count and sub-partitions size. So we need to support the following three different TTL policy types. + 1. **KEEP_BY_TIME**. Partitions will expire N days after their last modified time. Review Comment: Maybe we can add the stash/restore mechanism to replace commit/clean process of hudi instead of dealing with it in TTL management? TTL management should only decide which partitions are outdated and call `delete_partition` to delete them. If we want to retain the deleted data we can add extra mechanism in the `delete_parrtition` method. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
