Yue Zhang created HUDI-2194:
-------------------------------

             Summary: Skip the latest N partitions when creating ClusteringPlan
                 Key: HUDI-2194
                 URL: https://issues.apache.org/jira/browse/HUDI-2194
             Project: Apache Hudi
          Issue Type: Task
            Reporter: Yue Zhang

As we know, SparkRecentDaysClusteringPlanStrategy is the default clustering strategy for creating a ClusteringPlan, and it is useful when a Hudi table is partitioned by time. Currently, users can set `hoodie.clustering.plan.strategy.daybased.lookback.partitions` to control how many partitions, counting back from the latest partition, are listed when creating the ClusteringPlan.

For example, suppose we have 6 date-based partitions and users set `hoodie.clustering.plan.strategy.daybased.lookback.partitions` to 2:

| 20210718 | 20210719 | 20210720 | 20210721 | 20210722 | 20210723(latest) |
                                            |<---- choose to cluster ---->|

Sometimes users also want to skip the latest x partitions when making a clustering plan, because the latest partitions contain a lot of update data or for some other reason.

This patch adds a new config named `hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions` that sets the number of partitions to skip, counting from the latest, when choosing partitions for the ClusteringPlan.

For example, users set `hoodie.clustering.plan.strategy.daybased.lookback.partitions` to 2 and `hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions` to 2:

| 20210718 | 20210719 | 20210720 | 20210721 | 20210722 | 20210723(latest) |
                      |<------ choose ----->|
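
The selection described above can be sketched roughly as follows. This is only an illustration of the intended behavior, not the code in the patch: the function name `select_partitions` and its parameters are hypothetical, and it assumes partition names sort lexicographically by date (as `yyyyMMdd` strings do) and that the lookback value is positive.

```python
from typing import List


def select_partitions(partitions: List[str],
                      lookback: int,
                      skip_from_latest: int = 0) -> List[str]:
    """Hypothetical helper: pick partitions for a clustering plan.

    lookback         -- stands in for ...daybased.lookback.partitions
    skip_from_latest -- stands in for ...daybased.skipfromlatest.partitions
    """
    # Newest partition first; yyyyMMdd strings sort correctly as text.
    newest_first = sorted(partitions, reverse=True)
    # Treat a negative skip as "skip nothing" (an assumption of this sketch).
    start = max(skip_from_latest, 0)
    # Drop the newest `start` partitions, then keep the next `lookback`.
    return newest_first[start:start + lookback]


partitions = ["20210718", "20210719", "20210720",
              "20210721", "20210722", "20210723"]

# lookback = 2, nothing skipped: the two latest partitions.
print(select_partitions(partitions, lookback=2))
# ['20210723', '20210722']

# lookback = 2, skip 2 from latest: the two partitions before those.
print(select_partitions(partitions, lookback=2, skip_from_latest=2))
# ['20210721', '20210720']
```

With the skip left at 0 the sketch reproduces the first diagram, so the existing lookback behavior is unchanged unless the new config is set.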