stream2000 commented on code in PR #8062: URL: https://github.com/apache/hudi/pull/8062#discussion_r1406076695
########## rfc/rfc-65/rfc-65.md: ########## @@ -0,0 +1,209 @@ +## Proposers + +- @stream2000 +- @hujincalrin +- @huberylee +- @YuweiXiao + +## Approvers + +## Status + +JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823) + +## Abstract + +In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period +of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) management mechanism to prevent the +dataset from growing infinitely. +This proposal introduces Partition TTL Management strategies to hudi, people can config the strategies by table config +directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them. + + +This proposal introduces Partition TTL Management service to hudi. TTL management is like other table services such as Clean/Compaction/Clustering. +The user can config their ttl strategies through write configs and Hudi will help users find expired partitions and delete them automatically. + +## Background + +TTL management mechanism is an important feature for databases. Hudi already provides a `delete_partition` interface to +delete outdated partitions. However, users still need to detect which partitions are outdated and +call `delete_partition` manually, which means that users need to define and implement some kind of TTL strategies, find expired partitions and call call `delete_partition` by themselves. As the scale of installations grew, it is becoming increasingly important to implement a user-friendly TTL management mechanism for hudi. + +## Implementation + +Our main goals are as follows: + +* Providing an extensible framework for partition TTL management. +* Implement a simple KEEP_BY_TIME strategy, which can be executed through independent Spark job, synchronous or asynchronous table services. + +### Strategy Definition + +The TTL strategies is similar to existing table service strategies. We can define TTL strategies like defining a clustering/clean/compaction strategy: + +```properties +hoodie.partition.ttl.management.strategy=KEEP_BY_TIME +hoodie.partition.ttl.management.strategy.class=org.apache.hudi.table.action.ttl.strategy.KeepByTimePartitionTTLManagementStrategy +hoodie.partition.ttl.days.retain=10 +``` + +The config `hoodie.partition.ttl.management.strategy.class` is to provide a strategy class (subclass of `PartitionTTLManagementStrategy`) to get expired partition paths to delete. And `hoodie.partition.ttl.days.retain` is the strategy value used by `KeepByTimePartitionTTLManagementStrategy` which means that we will expire partitions that haven't been modified for this strategy value set. We will cover the `KeepByTimeTTLManagementStrategy` strategy in detail in the next section. + +The core definition of `PartitionTTLManagementStrategy` looks like this: + +```java +/** + * Strategy for partition-level TTL management. + */ +public abstract class PartitionTTLManagementStrategy { + /** + * Get expired partition paths for a specific partition ttl management strategy. + * + * @return Expired partition paths. + */ + public abstract List<String> getExpiredPartitionPaths(); +} +``` + +Users can provide their own implementation of `PartitionTTLManagementStrategy` and hudi will help delete the expired partitions. + +### KeepByTimeTTLManagementStrategy + +We will provide a strategy call `KeepByTimePartitionTTLManagementStrategy` in the first version of partition TTL management implementation. + +The `KeepByTimePartitionTTLManagementStrategy` will calculate the `lastModifiedTime` for each input partitions. If duration between now and 'lastModifiedTime' for the partition is larger than what `hoodie.partition.ttl.days.retain` configured, `KeepByTimePartitionTTLManagementStrategy` will mark this partition as an expired partition. We use day as the unit of expired time since it is very common-used for datalakes. Open to ideas for this. + +we will to use the largest commit time of committed file groups in the partition as the partition's +`lastModifiedTime`. So any write (including normal DMLs, clustering etc.) with larger instant time will change the partition's `lastModifiedTime`. + +For file groups generated by replace commit, it may not reveal the real insert/update time for the file group. However, we can assume that we won't do clustering for a partition without new writes for a long time when using the strategy. And in the future, we may introduce a more accurate mechanism to get `lastModifiedTime` of a partition, for example using metadata table. + +### Apply different strategies for different partitions + +For some specific users, they may want to apply different strategies for different partitions. For example, they may have multi partition fileds(productId, day). For partitions under `product=1` they want to keep for 30 days while for partitions under `product=2` they want to keep for 7 days only. Review Comment: > Kind if think it is valuable if the specification could be a wildcard matching. We need to handle the lex parsing of the spec string though. Yes, it's valuable. However, it will introduce more complexity into the design. In our inner version which have implement a wild card match ttl policy, we need to deal with things like: 1. provide a spark SQL procedure for defining ttl policies, user can add/delete/update policies to hudi table 2. policies will be stored in `hoodie.properties`. We need to consider the concurrent modification issue for `hoodie.properties`. 3. We need a default policy, so that partitions that does not explicitly define a ttl policy uses the default policy. So, when implement it, we need to first find which partitions have defined TTL, and then find which partitions shoud apply the default policy. 4. We need to care about policies defined in different level, for example we have `a,b,c,d` as partition fields, can we define ttl policies in any level for the partition? ( We do not allow this in our inner version) However, for most of our users, they just need a simple ttl policy that just care about last modified time for each partition and do not need the complicated ttl policy ( which need to understand the multi-level ttl policy definition, the procedure to set/unset policies and so on), so we simplify the design. In the simplified version of partition lifecycle management, we can define the policy just by some write configs, and we do not need a default policy or complicated regex rules. Partition lifecycle management now is more like a simple syntactic sugar for `delete_partition` which is easy to understand and use by normal users. For advanced users, we can provide an advanced policy in the future that takes care of things I mentioned above. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
