danny0405 commented on code in PR #8062:
URL: https://github.com/apache/hudi/pull/8062#discussion_r1398624021
##########
rfc/rfc-65/rfc-65.md:
##########
@@ -0,0 +1,248 @@
+## Proposers
+
+- @stream2000
+- @hujincalrin
+- @huberylee
+- @YuweiXiao
+
+## Approvers
+
+## Status
+
+JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823)
+
+## Abstract
+
+In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period
+of time. The outdated data is useless and costly, we need a lifecycle management mechanism to prevent the
+dataset from growing infinitely.
+This proposal introduces partition lifecycle management strategies to hudi, people can config the strategies by write
+configs. With proper configs set, Hudi can find out which partitions are expired and delete them.
+
+This proposal introduces partition lifecycle management service to hudi. Lifecycle management is like other table
+services such as Clean/Compaction/Clustering.
+Users can config their partition lifecycle management strategies through write configs and Hudi will help users find
+expired partitions and delete them automatically.
+
+## Background
+
+Lifecycle management mechanism is an important feature for databases. Hudi already provides a `delete_partition`
+interface to
+delete outdated partitions. However, users still need to detect which partitions are outdated and
+call `delete_partition` manually, which means that users need to define and implement some kind of partition lifecycle
+management strategies, find expired partitions and call `delete_partition` by themselves. As the scale of installations
+grew, it is becoming increasingly important to implement a user-friendly lifecycle management mechanism for hudi.
+
+## Implementation
+
+Our main goals are as follows:
+
+* Providing an extensible framework for partition lifecycle management.
+* Implement a simple KEEP_BY_TIME strategy, which can be executed through independent Spark job, synchronous or
+  asynchronous table services.
+
+### Strategy Definition
+
+The lifecycle strategies is similar to existing table service strategies. We can define lifecycle strategies like
+defining a clustering/clean/compaction strategy:
+
+```properties
+hoodie.partition.lifecycle.management.strategy=KEEP_BY_TIME
+hoodie.partition.lifecycle.management.strategy.class=org.apache.hudi.table.action.lifecycle.strategy.KeepByTimePartitionLifecycleManagementStrategy

Review Comment:
   `hoodie.partition.lifecycle.management.strategy` -> `hoodie.partition.ttl.strategy` ?
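
For reference, a minimal sketch of how the first property from the RFC draft might read with the suggested key. Only the key named in the comment is renamed here; the value is unchanged from the draft. The comment does not say how the companion `.class` key should be named, so it is left out rather than guessed.

```properties
# Sketch only: key name taken from the review suggestion, value taken from the RFC draft.
hoodie.partition.ttl.strategy=KEEP_BY_TIME
```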
