stream2000 commented on code in PR #8062:
URL: https://github.com/apache/hudi/pull/8062#discussion_r1406043167

##########
rfc/rfc-65/rfc-65.md:
##########
@@ -0,0 +1,110 @@
+## Proposers
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+## Approvers
+## Status
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+## Abstract
+In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) management mechanism to prevent the dataset from growing infinitely.
+This proposal introduces Partition TTL Management policies to hudi, people can config the policies by table config directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them.
+## Background
+TTL management mechanism is an important feature for databases. Hudi already provides a delete_partition interface to delete outdated partitions. However, users still need to detect which partitions are outdated and call `delete_partition` manually, which means that users need to define and implement some kind of TTL policies and maintain proper statistics to find expired partitions by themself. As the scale of installations grew, it's more important to implement a user-friendly TTL management mechanism for hudi.
+## Implementation
+There are 3 components to implement Partition TTL Management
+
+- TTL policy definition & storage
+- Partition statistics for TTL management
+- Appling policies
+### TTL Policy Definition
+We have three main considerations when designing TTL policy:
+
+1. User hopes to manage partition TTL not only by expired time but also by sub-partitions count and sub-partitions size. So we need to support the following three different TTL policy types.
+    1. **KEEP_BY_TIME**. Partitions will expire N days after their last modified time.
Review Comment:
In the latest version of the RFC, we use the max instant time of the committed file slices in a partition as that partition's last modified time, for simplicity. Otherwise, we would need an extra mechanism to track the last modified time. In our internal version, we maintain an extra JSON file and update it incrementally as new instants are committed, which gives the real modified time for each partition. Alternatively, we could use the metadata table to track the last modified time. What do you think about this?
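
To make the simpler approach concrete, here is a minimal sketch of deriving a partition's last modified time from its committed file slices and applying a KEEP_BY_TIME check. It does not use Hudi's actual classes: `Slice`, `lastModifiedByPartition`, and `isExpired` are hypothetical names, and it assumes instant times are fixed-width `yyyyMMddHHmmss` strings, so the lexicographic maximum is also the latest instant.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartitionTtlSketch {

    // Hypothetical stand-in for a file slice; this is NOT Hudi's FileSlice class.
    static final class Slice {
        final String partition;
        final String instantTime; // assumed format: yyyyMMddHHmmss
        final boolean committed;

        Slice(String partition, String instantTime, boolean committed) {
            this.partition = partition;
            this.instantTime = instantTime;
            this.committed = committed;
        }
    }

    // Assumed instant format; millisecond-precision instants would need another pattern.
    static final DateTimeFormatter INSTANT_FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // Last modified time per partition = max instant time over committed slices only.
    static Map<String, String> lastModifiedByPartition(List<Slice> slices) {
        Map<String, String> result = new HashMap<>();
        for (Slice s : slices) {
            if (!s.committed) {
                continue; // uncommitted slices do not count as modifications
            }
            // Fixed-width timestamps compare correctly as plain strings.
            result.merge(s.partition, s.instantTime,
                    (a, b) -> a.compareTo(b) >= 0 ? a : b);
        }
        return result;
    }

    // KEEP_BY_TIME: a partition expires ttlDays after its last modified time.
    static boolean isExpired(String lastModifiedInstant, int ttlDays, LocalDateTime now) {
        LocalDateTime lastModified = LocalDateTime.parse(lastModifiedInstant, INSTANT_FORMAT);
        return lastModified.plusDays(ttlDays).isBefore(now);
    }

    public static void main(String[] args) {
        List<Slice> slices = Arrays.asList(
                new Slice("dt=2023-11-01", "20231101120000", true),
                new Slice("dt=2023-11-01", "20231105080000", true),
                new Slice("dt=2023-11-01", "20231120000000", false));
        LocalDateTime now = LocalDateTime.of(2023, 11, 20, 0, 0);
        String lastModified = lastModifiedByPartition(slices).get("dt=2023-11-01");
        System.out.println(lastModified + " expired=" + isExpired(lastModified, 10, now));
    }
}
```

The trade-off is the one mentioned above: this needs no extra bookkeeping, but it only sees instants that still have file slices in the partition, whereas a separately maintained record (the JSON file or the metadata table) would track every modification.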
