stream2000 commented on code in PR #8062: URL: https://github.com/apache/hudi/pull/8062#discussion_r1351764739
########## rfc/rfc-65/rfc-65.md: ########## @@ -0,0 +1,209 @@ +## Proposers + +- @stream2000 +- @hujincalrin +- @huberylee +- @YuweiXiao + +## Approvers + +## Status + +JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823) + +## Abstract + +In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period +of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) management mechanism to prevent the +dataset from growing infinitely. +This proposal introduces Partition TTL Management strategies to hudi, people can config the strategies by table config +directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them. + + +This proposal introduces Partition TTL Management service to hudi. TTL management is like other table services such as Clean/Compaction/Clustering. +The user can config their ttl strategies through write configs and Hudi will help users find expired partitions and delete them automatically. + +## Background + +TTL management mechanism is an important feature for databases. Hudi already provides a `delete_partition` interface to +delete outdated partitions. However, users still need to detect which partitions are outdated and +call `delete_partition` manually, which means that users need to define and implement some kind of TTL strategies, find expired partitions and call call `delete_partition` by themselves. As the scale of installations grew, it is becoming increasingly important to implement a user-friendly TTL management mechanism for hudi. + +## Implementation + +Our main goals are as follows: + +* Providing an extensible framework for partition TTL management. +* Implement a simple KEEP_BY_TIME strategy, which can be executed through independent Spark job, synchronous or asynchronous table services. + +### Strategy Definition + +The TTL strategies is similar to existing table service strategies. We can define TTL strategies like defining a clustering/clean/compaction strategy: + +```properties +hoodie.partition.ttl.management.strategy=KEEP_BY_TIME Review Comment: We can change the word `TTL` to `lifecycle` if we want to support other types of data expiration policies. In current version this RFC, we do not plan to support other types of policies like `KEEP_BY_SIZE` since they rely on the condition that the partition values are comparable which is not always true. However we do have some users who have this kind of requirement so we need to make the policy extensible. ########## rfc/rfc-65/rfc-65.md: ########## @@ -0,0 +1,209 @@ +## Proposers + +- @stream2000 +- @hujincalrin +- @huberylee +- @YuweiXiao + +## Approvers + +## Status + +JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823) + +## Abstract + +In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period +of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) management mechanism to prevent the +dataset from growing infinitely. +This proposal introduces Partition TTL Management strategies to hudi, people can config the strategies by table config +directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them. + + +This proposal introduces Partition TTL Management service to hudi. TTL management is like other table services such as Clean/Compaction/Clustering. +The user can config their ttl strategies through write configs and Hudi will help users find expired partitions and delete them automatically. + +## Background + +TTL management mechanism is an important feature for databases. Hudi already provides a `delete_partition` interface to +delete outdated partitions. However, users still need to detect which partitions are outdated and +call `delete_partition` manually, which means that users need to define and implement some kind of TTL strategies, find expired partitions and call call `delete_partition` by themselves. As the scale of installations grew, it is becoming increasingly important to implement a user-friendly TTL management mechanism for hudi. + +## Implementation + +Our main goals are as follows: + +* Providing an extensible framework for partition TTL management. +* Implement a simple KEEP_BY_TIME strategy, which can be executed through independent Spark job, synchronous or asynchronous table services. + +### Strategy Definition + +The TTL strategies is similar to existing table service strategies. We can define TTL strategies like defining a clustering/clean/compaction strategy: + +```properties +hoodie.partition.ttl.management.strategy=KEEP_BY_TIME +hoodie.partition.ttl.management.strategy.class=org.apache.hudi.table.action.ttl.strategy.KeepByTimePartitionTTLManagementStrategy +hoodie.partition.ttl.days.retain=10 +``` + +The config `hoodie.partition.ttl.management.strategy.class` is to provide a strategy class (subclass of `PartitionTTLManagementStrategy`) to get expired partition paths to delete. And `hoodie.partition.ttl.days.retain` is the strategy value used by `KeepByTimePartitionTTLManagementStrategy` which means that we will expire partitions that haven't been modified for this strategy value set. We will cover the `KeepByTimeTTLManagementStrategy` strategy in detail in the next section. + +The core definition of `PartitionTTLManagementStrategy` looks like this: + +```java +/** + * Strategy for partition-level TTL management. + */ +public abstract class PartitionTTLManagementStrategy { + /** + * Get expired partition paths for a specific partition ttl management strategy. + * + * @return Expired partition paths. + */ + public abstract List<String> getExpiredPartitionPaths(); +} +``` + +Users can provide their own implementation of `PartitionTTLManagementStrategy` and hudi will help delete the expired partitions. + +### KeepByTimeTTLManagementStrategy + +We will provide a strategy call `KeepByTimePartitionTTLManagementStrategy` in the first version of partition TTL management implementation. + +The `KeepByTimePartitionTTLManagementStrategy` will calculate the `lastModifiedTime` for each input partitions. If duration between now and 'lastModifiedTime' for the partition is larger than what `hoodie.partition.ttl.days.retain` configured, `KeepByTimePartitionTTLManagementStrategy` will mark this partition as an expired partition. We use day as the unit of expired time since it is very common-used for datalakes. Open to ideas for this. + +we will to use the largest commit time of committed file groups in the partition as the partition's +`lastModifiedTime`. So any write (including normal DMLs, clustering etc.) with larger instant time will change the partition's `lastModifiedTime`. Review Comment: We should always think twice before making format changes. We have considered using `HoodiePartitionMetadata` to provide the `lastModifiedTime`, however, the current implementation of partition metadata does not support transactions, and will never be modified since the first creation. Not sure it is a good choice for maintaining the `lastModifiedTime` ########## rfc/rfc-65/rfc-65.md: ########## @@ -0,0 +1,209 @@ +## Proposers + +- @stream2000 +- @hujincalrin +- @huberylee +- @YuweiXiao + +## Approvers + +## Status + +JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823) + +## Abstract + +In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period +of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) management mechanism to prevent the +dataset from growing infinitely. +This proposal introduces Partition TTL Management strategies to hudi, people can config the strategies by table config +directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them. + + +This proposal introduces Partition TTL Management service to hudi. TTL management is like other table services such as Clean/Compaction/Clustering. +The user can config their ttl strategies through write configs and Hudi will help users find expired partitions and delete them automatically. + +## Background + +TTL management mechanism is an important feature for databases. Hudi already provides a `delete_partition` interface to +delete outdated partitions. However, users still need to detect which partitions are outdated and +call `delete_partition` manually, which means that users need to define and implement some kind of TTL strategies, find expired partitions and call call `delete_partition` by themselves. As the scale of installations grew, it is becoming increasingly important to implement a user-friendly TTL management mechanism for hudi. + +## Implementation + +Our main goals are as follows: + +* Providing an extensible framework for partition TTL management. +* Implement a simple KEEP_BY_TIME strategy, which can be executed through independent Spark job, synchronous or asynchronous table services. + +### Strategy Definition + +The TTL strategies is similar to existing table service strategies. We can define TTL strategies like defining a clustering/clean/compaction strategy: + +```properties +hoodie.partition.ttl.management.strategy=KEEP_BY_TIME +hoodie.partition.ttl.management.strategy.class=org.apache.hudi.table.action.ttl.strategy.KeepByTimePartitionTTLManagementStrategy +hoodie.partition.ttl.days.retain=10 +``` + +The config `hoodie.partition.ttl.management.strategy.class` is to provide a strategy class (subclass of `PartitionTTLManagementStrategy`) to get expired partition paths to delete. And `hoodie.partition.ttl.days.retain` is the strategy value used by `KeepByTimePartitionTTLManagementStrategy` which means that we will expire partitions that haven't been modified for this strategy value set. We will cover the `KeepByTimeTTLManagementStrategy` strategy in detail in the next section. + +The core definition of `PartitionTTLManagementStrategy` looks like this: + +```java +/** + * Strategy for partition-level TTL management. + */ +public abstract class PartitionTTLManagementStrategy { + /** + * Get expired partition paths for a specific partition ttl management strategy. + * + * @return Expired partition paths. + */ + public abstract List<String> getExpiredPartitionPaths(); +} +``` + +Users can provide their own implementation of `PartitionTTLManagementStrategy` and hudi will help delete the expired partitions. Review Comment: Thanks for your review~ > it's just providing TTL functionality by custom implementation of PartitionTTLManagementStrategy is not user friendly. For common users that want to do TTL management, they just need to add some write configs to their ingestion job. In the simplest situation they may just add the following config to enable async ttl management: ```properties hoodie.partition.ttl.days.retain=10 hoodie.ttl.management.async.enabled=true ``` The TTL policy will default to `KEEP_BY_TIME` and we can automatically detect expired partitions by their last modified time and delete them. However, this simple policy may not suit for all users, they may need to implement their own partition TTL policy so we provide an abstraction `PartitionTTLManagementStrategy` that returns expired partitions as users need and hudi will help delete them. > This is a main scope for TTL, and we shouldn't allow to have more flexibility. Customized implementation of PartitionTTLManagementStrategy will allow to do anything with partitions. It still could be PartitionManagementStrategy, but then we shouldn't named it with TTL part. Customized implementation of PartitionTTLManagementStrategy will only provide a list of expired partitions, and hudi will help delete them. We can rename the word `PartitionTTLManagementStrategy` to `PartitionLifecycleManagementStrategy` since we do not always delete the partitions by time. ########## rfc/rfc-65/rfc-65.md: ########## @@ -0,0 +1,209 @@ +## Proposers + +- @stream2000 +- @hujincalrin +- @huberylee +- @YuweiXiao + +## Approvers + +## Status + +JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823) + +## Abstract + +In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period +of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) management mechanism to prevent the +dataset from growing infinitely. +This proposal introduces Partition TTL Management strategies to hudi, people can config the strategies by table config +directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them. + + +This proposal introduces Partition TTL Management service to hudi. TTL management is like other table services such as Clean/Compaction/Clustering. +The user can config their ttl strategies through write configs and Hudi will help users find expired partitions and delete them automatically. + +## Background + +TTL management mechanism is an important feature for databases. Hudi already provides a `delete_partition` interface to +delete outdated partitions. However, users still need to detect which partitions are outdated and +call `delete_partition` manually, which means that users need to define and implement some kind of TTL strategies, find expired partitions and call call `delete_partition` by themselves. As the scale of installations grew, it is becoming increasingly important to implement a user-friendly TTL management mechanism for hudi. + +## Implementation + +Our main goals are as follows: + +* Providing an extensible framework for partition TTL management. +* Implement a simple KEEP_BY_TIME strategy, which can be executed through independent Spark job, synchronous or asynchronous table services. + +### Strategy Definition + +The TTL strategies is similar to existing table service strategies. We can define TTL strategies like defining a clustering/clean/compaction strategy: + +```properties +hoodie.partition.ttl.management.strategy=KEEP_BY_TIME +hoodie.partition.ttl.management.strategy.class=org.apache.hudi.table.action.ttl.strategy.KeepByTimePartitionTTLManagementStrategy +hoodie.partition.ttl.days.retain=10 +``` + +The config `hoodie.partition.ttl.management.strategy.class` is to provide a strategy class (subclass of `PartitionTTLManagementStrategy`) to get expired partition paths to delete. And `hoodie.partition.ttl.days.retain` is the strategy value used by `KeepByTimePartitionTTLManagementStrategy` which means that we will expire partitions that haven't been modified for this strategy value set. We will cover the `KeepByTimeTTLManagementStrategy` strategy in detail in the next section. + +The core definition of `PartitionTTLManagementStrategy` looks like this: + +```java +/** + * Strategy for partition-level TTL management. + */ +public abstract class PartitionTTLManagementStrategy { + /** + * Get expired partition paths for a specific partition ttl management strategy. + * + * @return Expired partition paths. + */ + public abstract List<String> getExpiredPartitionPaths(); +} +``` + +Users can provide their own implementation of `PartitionTTLManagementStrategy` and hudi will help delete the expired partitions. + +### KeepByTimeTTLManagementStrategy + +We will provide a strategy call `KeepByTimePartitionTTLManagementStrategy` in the first version of partition TTL management implementation. + +The `KeepByTimePartitionTTLManagementStrategy` will calculate the `lastModifiedTime` for each input partitions. If duration between now and 'lastModifiedTime' for the partition is larger than what `hoodie.partition.ttl.days.retain` configured, `KeepByTimePartitionTTLManagementStrategy` will mark this partition as an expired partition. We use day as the unit of expired time since it is very common-used for datalakes. Open to ideas for this. + +we will to use the largest commit time of committed file groups in the partition as the partition's +`lastModifiedTime`. So any write (including normal DMLs, clustering etc.) with larger instant time will change the partition's `lastModifiedTime`. + +For file groups generated by replace commit, it may not reveal the real insert/update time for the file group. However, we can assume that we won't do clustering for a partition without new writes for a long time when using the strategy. And in the future, we may introduce a more accurate mechanism to get `lastModifiedTime` of a partition, for example using metadata table. + +### Apply different strategies for different partitions + +For some specific users, they may want to apply different strategies for different partitions. For example, they may have multi partition fileds(productId, day). For partitions under `product=1` they want to keep for 30 days while for partitions under `product=2` they want to keep for 7 days only. Review Comment: We can implement two independent ttl policies, one is simple while another is advanced as I commented below. ########## rfc/rfc-65/rfc-65.md: ########## @@ -0,0 +1,209 @@ +## Proposers + +- @stream2000 +- @hujincalrin +- @huberylee +- @YuweiXiao + +## Approvers + +## Status + +JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823) + +## Abstract + +In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period +of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) management mechanism to prevent the +dataset from growing infinitely. +This proposal introduces Partition TTL Management strategies to hudi, people can config the strategies by table config +directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them. + + +This proposal introduces Partition TTL Management service to hudi. TTL management is like other table services such as Clean/Compaction/Clustering. +The user can config their ttl strategies through write configs and Hudi will help users find expired partitions and delete them automatically. + +## Background + +TTL management mechanism is an important feature for databases. Hudi already provides a `delete_partition` interface to +delete outdated partitions. However, users still need to detect which partitions are outdated and +call `delete_partition` manually, which means that users need to define and implement some kind of TTL strategies, find expired partitions and call call `delete_partition` by themselves. As the scale of installations grew, it is becoming increasingly important to implement a user-friendly TTL management mechanism for hudi. + +## Implementation + +Our main goals are as follows: + +* Providing an extensible framework for partition TTL management. +* Implement a simple KEEP_BY_TIME strategy, which can be executed through independent Spark job, synchronous or asynchronous table services. + +### Strategy Definition + +The TTL strategies is similar to existing table service strategies. We can define TTL strategies like defining a clustering/clean/compaction strategy: + +```properties +hoodie.partition.ttl.management.strategy=KEEP_BY_TIME +hoodie.partition.ttl.management.strategy.class=org.apache.hudi.table.action.ttl.strategy.KeepByTimePartitionTTLManagementStrategy +hoodie.partition.ttl.days.retain=10 +``` + +The config `hoodie.partition.ttl.management.strategy.class` is to provide a strategy class (subclass of `PartitionTTLManagementStrategy`) to get expired partition paths to delete. And `hoodie.partition.ttl.days.retain` is the strategy value used by `KeepByTimePartitionTTLManagementStrategy` which means that we will expire partitions that haven't been modified for this strategy value set. We will cover the `KeepByTimeTTLManagementStrategy` strategy in detail in the next section. + +The core definition of `PartitionTTLManagementStrategy` looks like this: + +```java +/** + * Strategy for partition-level TTL management. + */ +public abstract class PartitionTTLManagementStrategy { + /** + * Get expired partition paths for a specific partition ttl management strategy. + * + * @return Expired partition paths. + */ + public abstract List<String> getExpiredPartitionPaths(); +} +``` + +Users can provide their own implementation of `PartitionTTLManagementStrategy` and hudi will help delete the expired partitions. + +### KeepByTimeTTLManagementStrategy + +We will provide a strategy call `KeepByTimePartitionTTLManagementStrategy` in the first version of partition TTL management implementation. + +The `KeepByTimePartitionTTLManagementStrategy` will calculate the `lastModifiedTime` for each input partitions. If duration between now and 'lastModifiedTime' for the partition is larger than what `hoodie.partition.ttl.days.retain` configured, `KeepByTimePartitionTTLManagementStrategy` will mark this partition as an expired partition. We use day as the unit of expired time since it is very common-used for datalakes. Open to ideas for this. + +we will to use the largest commit time of committed file groups in the partition as the partition's +`lastModifiedTime`. So any write (including normal DMLs, clustering etc.) with larger instant time will change the partition's `lastModifiedTime`. + +For file groups generated by replace commit, it may not reveal the real insert/update time for the file group. However, we can assume that we won't do clustering for a partition without new writes for a long time when using the strategy. And in the future, we may introduce a more accurate mechanism to get `lastModifiedTime` of a partition, for example using metadata table. + +### Apply different strategies for different partitions + +For some specific users, they may want to apply different strategies for different partitions. For example, they may have multi partition fileds(productId, day). For partitions under `product=1` they want to keep for 30 days while for partitions under `product=2` they want to keep for 7 days only. + +For the first version of TTL management, we do not plan to implement a complicated strategy (For example, use an array to store strategies, introduce partition regex etc.). Instead, we add a new abstract method `getPartitionPathsForTTLManagement` in `PartitionTTLManagementStrategy` and provides a new config `hoodie.partition.ttl.management.partition.selected`. + +If `hoodie.partition.ttl.management.partition.selected` is set, `getPartitionPathsForTTLManagement` will return partitions provided by this config. If not, `getPartitionPathsForTTLManagement` will return all partitions in the hudi table. + +TTL management strategies will only be applied for partitions return by `getPartitionPathsForTTLManagement`. + +Thus, if users want to apply different strategies for different partitions, they can do the TTL management multiple times, selecting different partitions and apply corresponding strategies. And we may provide a batch interface in the future to simplify this. Review Comment: > I offer to allow user set only spec and TTL values with support of add, show, and delete operations. > So, if I want to implement 6 partition policies, then I should prepare 6 PartitionTTLManagementStrategys and call them 6 times? We can implement the advanced TTL management policy in the future to provide the ability to set different policies for different partitions. To achieve this, we can introduce a new `hoodie.partition.ttl.management.strategy` and implement the logic you mentioned. For most users, providing a simple policy that TTL any partition whose last mod time (last time when data was added/updated), is greater than the TTL time specified will be enough. @geserdugarov @nsivabalan What do you think about this? ########## rfc/rfc-65/rfc-65.md: ########## @@ -0,0 +1,209 @@ +## Proposers + +- @stream2000 +- @hujincalrin +- @huberylee +- @YuweiXiao + +## Approvers + +## Status + +JIRA: [HUDI-5823](https://issues.apache.org/jira/browse/HUDI-5823) + +## Abstract + +In some classic hudi use cases, users partition hudi data by time and are only interested in data from a recent period +of time. The outdated data is useless and costly, we need a TTL(Time-To-Live) management mechanism to prevent the +dataset from growing infinitely. +This proposal introduces Partition TTL Management strategies to hudi, people can config the strategies by table config +directly or by call commands. With proper configs set, Hudi can find out which partitions are outdated and delete them. + + +This proposal introduces Partition TTL Management service to hudi. TTL management is like other table services such as Clean/Compaction/Clustering. +The user can config their ttl strategies through write configs and Hudi will help users find expired partitions and delete them automatically. + +## Background + +TTL management mechanism is an important feature for databases. Hudi already provides a `delete_partition` interface to +delete outdated partitions. However, users still need to detect which partitions are outdated and +call `delete_partition` manually, which means that users need to define and implement some kind of TTL strategies, find expired partitions and call call `delete_partition` by themselves. As the scale of installations grew, it is becoming increasingly important to implement a user-friendly TTL management mechanism for hudi. + +## Implementation + +Our main goals are as follows: + +* Providing an extensible framework for partition TTL management. +* Implement a simple KEEP_BY_TIME strategy, which can be executed through independent Spark job, synchronous or asynchronous table services. + +### Strategy Definition + +The TTL strategies is similar to existing table service strategies. We can define TTL strategies like defining a clustering/clean/compaction strategy: + +```properties +hoodie.partition.ttl.management.strategy=KEEP_BY_TIME +hoodie.partition.ttl.management.strategy.class=org.apache.hudi.table.action.ttl.strategy.KeepByTimePartitionTTLManagementStrategy +hoodie.partition.ttl.days.retain=10 +``` + +The config `hoodie.partition.ttl.management.strategy.class` is to provide a strategy class (subclass of `PartitionTTLManagementStrategy`) to get expired partition paths to delete. And `hoodie.partition.ttl.days.retain` is the strategy value used by `KeepByTimePartitionTTLManagementStrategy` which means that we will expire partitions that haven't been modified for this strategy value set. We will cover the `KeepByTimeTTLManagementStrategy` strategy in detail in the next section. + +The core definition of `PartitionTTLManagementStrategy` looks like this: + +```java +/** + * Strategy for partition-level TTL management. + */ +public abstract class PartitionTTLManagementStrategy { + /** + * Get expired partition paths for a specific partition ttl management strategy. + * + * @return Expired partition paths. + */ + public abstract List<String> getExpiredPartitionPaths(); +} +``` + +Users can provide their own implementation of `PartitionTTLManagementStrategy` and hudi will help delete the expired partitions. + +### KeepByTimeTTLManagementStrategy + +We will provide a strategy call `KeepByTimePartitionTTLManagementStrategy` in the first version of partition TTL management implementation. + +The `KeepByTimePartitionTTLManagementStrategy` will calculate the `lastModifiedTime` for each input partitions. If duration between now and 'lastModifiedTime' for the partition is larger than what `hoodie.partition.ttl.days.retain` configured, `KeepByTimePartitionTTLManagementStrategy` will mark this partition as an expired partition. We use day as the unit of expired time since it is very common-used for datalakes. Open to ideas for this. + +we will to use the largest commit time of committed file groups in the partition as the partition's +`lastModifiedTime`. So any write (including normal DMLs, clustering etc.) with larger instant time will change the partition's `lastModifiedTime`. + +For file groups generated by replace commit, it may not reveal the real insert/update time for the file group. However, we can assume that we won't do clustering for a partition without new writes for a long time when using the strategy. And in the future, we may introduce a more accurate mechanism to get `lastModifiedTime` of a partition, for example using metadata table. + +### Apply different strategies for different partitions + +For some specific users, they may want to apply different strategies for different partitions. For example, they may have multi partition fileds(productId, day). For partitions under `product=1` they want to keep for 30 days while for partitions under `product=2` they want to keep for 7 days only. + +For the first version of TTL management, we do not plan to implement a complicated strategy (For example, use an array to store strategies, introduce partition regex etc.). Instead, we add a new abstract method `getPartitionPathsForTTLManagement` in `PartitionTTLManagementStrategy` and provides a new config `hoodie.partition.ttl.management.partition.selected`. + +If `hoodie.partition.ttl.management.partition.selected` is set, `getPartitionPathsForTTLManagement` will return partitions provided by this config. If not, `getPartitionPathsForTTLManagement` will return all partitions in the hudi table. Review Comment: It will be a write config, and we will only apply TTL policy to the partition specified by this config. If this config is not specified, we will apply ttl policy to all the partitions in hudi table. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
