Yue Zhang created HUDI-2194:
-------------------------------
Summary: Skip the latest N partitions when creating ClusteringPlan
Key: HUDI-2194
URL: https://issues.apache.org/jira/browse/HUDI-2194
Project: Apache Hudi
Issue Type: Task
Reporter: Yue Zhang
As we know, SparkRecentDaysClusteringPlanStrategy is the default clustering
strategy for creating a ClusteringPlan. It is useful when a Hudi table is
partitioned by time.
Currently, users can set
`hoodie.clustering.plan.strategy.daybased.lookback.partitions` to control how
many partitions, counting back from the latest one, are listed when creating a
ClusteringPlan.
For example, suppose we have 6 date-based partitions and users set
`hoodie.clustering.plan.strategy.daybased.lookback.partitions` to 2:
| 20210718 | 20210719 | 20210720 | 20210721 | 20210722 | 20210723(latest) |
                                            |<---- choose to cluster ---->|
Sometimes users also want to skip the latest x partitions when creating a
clustering plan, because the latest partitions contain a lot of update data or
for some other reason.
This patch adds a new config,
`hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions`, which sets
the number of partitions to skip, counting from the latest, when choosing
partitions to create a ClusteringPlan.
For example, suppose users set
`hoodie.clustering.plan.strategy.daybased.lookback.partitions` to 2 and
`hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions` to 2:
| 20210718 | 20210719 | 20210720 | 20210721 | 20210722 | 20210723(latest) |
                      |<----- choose ------>|
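The selection described above could be sketched roughly as follows. This is
only an illustration, not the actual patch: the class
`ClusteringPartitionSketch` and the method `selectPartitions` are hypothetical
names, and the sketch assumes partition paths sort lexicographically in date
order (as with yyyyMMdd names).

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not Hudi's implementation.
final class ClusteringPartitionSketch {

    // Assumption: partition names such as "20210723" sort lexicographically
    // in the same order as their dates.
    static List<String> selectPartitions(List<String> partitions,
                                         int lookback,
                                         int skipFromLatest) {
        return partitions.stream()
                .sorted(Comparator.reverseOrder())   // latest partition first
                .skip(Math.max(skipFromLatest, 0))   // drop the N latest partitions
                .limit(Math.max(lookback, 0))        // keep the next M partitions
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> partitions = Arrays.asList(
                "20210718", "20210719", "20210720",
                "20210721", "20210722", "20210723");

        // lookback = 2, skip = 0: prints [20210723, 20210722]
        System.out.println(selectPartitions(partitions, 2, 0));

        // lookback = 2, skip = 2: prints [20210721, 20210720]
        System.out.println(selectPartitions(partitions, 2, 2));
    }
}
```

With a skip value of 0 the result matches the existing lookback-only behavior,
so in this sketch the new config would be backward compatible if it defaulted
to 0.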
--
This message was sent by Atlassian Jira
(v8.3.4#803005)