[ https://issues.apache.org/jira/browse/HUDI-2194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yue Zhang updated HUDI-2194:
----------------------------
Description:
As we know, SparkRecentDaysClusteringPlanStrategy is the default clustering
strategy for creating a ClusteringPlan, and it is useful when a Hudi table is
partitioned by time.
Currently, users can set
`hoodie.clustering.plan.strategy.daybased.lookback.partitions` to control the
number of partitions, counting back from the latest partition, that are listed
when creating a ClusteringPlan.
For example, suppose we have 6 date-based partitions and users set
`hoodie.clustering.plan.strategy.daybased.lookback.partitions` to 2:
| 20210718 | 20210719 | 20210720 | 20210721 | 20210722 | 20210723(latest) |
|          |          |          |          |<---- choose to cluster ---->|
Sometimes users also want to skip the latest x partitions when creating a
clustering plan, because the latest partitions contain a lot of updated data,
or for other reasons.
This patch adds a new config named
`hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions` that sets
the number of partitions to skip, counting from the latest, when choosing
partitions to create a ClusteringPlan.
For example, if users set
`hoodie.clustering.plan.strategy.daybased.lookback.partitions` to 2 and
`hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions` to 2:
| 20210718 | 20210719 | 20210720 | 20210721 | 20210722 | 20210723(latest) |
|          |          |<----- choose ------>|          |                  |
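The intended selection can be illustrated with a minimal Python sketch. This is
not Hudi's implementation: `select_partitions` and its parameter names are
hypothetical, and the sketch assumes partition names are date strings that sort
chronologically.

```python
from typing import List


def select_partitions(partitions: List[str], lookback: int,
                      skip_from_latest: int = 0) -> List[str]:
    """Pick partitions to cluster (hypothetical helper, not a Hudi API)."""
    if lookback < 0 or skip_from_latest < 0:
        raise ValueError("lookback and skip_from_latest must be non-negative")
    # Assumption: names such as 20210723 sort chronologically as strings,
    # so a reverse sort puts the latest partition first.
    newest_first = sorted(partitions, reverse=True)
    # Drop the newest `skip_from_latest` partitions, then keep the next `lookback`.
    return newest_first[skip_from_latest:skip_from_latest + lookback]


partitions = ["20210718", "20210719", "20210720",
              "20210721", "20210722", "20210723"]
print(select_partitions(partitions, lookback=2))
# ['20210723', '20210722']
print(select_partitions(partitions, lookback=2, skip_from_latest=2))
# ['20210721', '20210720']
```

With the skip left at 0, the sketch reduces to the existing lookback-only
behavior described above.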
was:
As we know, SparkRecentDaysClusteringPlanStrategy is the default clustering
strategy for creating a ClusteringPlan, and it is useful when a Hudi table is
partitioned by time.
Currently, users can set
`hoodie.clustering.plan.strategy.daybased.lookback.partitions` to control the
number of partitions, counting back from the latest partition, that are listed
when creating a ClusteringPlan.
For example, suppose we have 6 date-based partitions and users set
`hoodie.clustering.plan.strategy.daybased.lookback.partitions` to 2:
20210718 | 20210719 | 20210720 | 20210721 | 20210722 | 20210723(latest) |
|<----- choose to cluster ---->|
Sometimes users also want to skip the latest x partitions when creating a
clustering plan, because the latest partitions contain a lot of updated data,
or for other reasons.
This patch adds a new config named
`hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions` that sets
the number of partitions to skip, counting from the latest, when choosing
partitions to create a ClusteringPlan.
For example, if users set
`hoodie.clustering.plan.strategy.daybased.lookback.partitions` to 2 and
`hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions` to 2:
| 20210718 | 20210719 | 20210720 | 20210721 | 20210722 | 20210723(latest) |
|<----- choose ----->|
> Skip the latest N partitions when creating ClusteringPlan
> ---------------------------------------------------------
>
> Key: HUDI-2194
> URL: https://issues.apache.org/jira/browse/HUDI-2194
> Project: Apache Hudi
> Issue Type: Task
> Reporter: Yue Zhang
> Priority: Major
>
> As we know, SparkRecentDaysClusteringPlanStrategy is the default clustering
> strategy for creating a ClusteringPlan, and it is useful when a Hudi table is
> partitioned by time.
>
> Currently, users can set
> `hoodie.clustering.plan.strategy.daybased.lookback.partitions` to control
> the number of partitions, counting back from the latest partition, that are
> listed when creating a ClusteringPlan.
> For example, suppose we have 6 date-based partitions and users set
> `hoodie.clustering.plan.strategy.daybased.lookback.partitions` to 2:
> | 20210718 | 20210719 | 20210720 | 20210721 | 20210722 | 20210723(latest) |
> |          |          |          |          |<---- choose to cluster ---->|
>
> Sometimes users also want to skip the latest x partitions when creating a
> clustering plan, because the latest partitions contain a lot of updated
> data, or for other reasons.
>
> This patch adds a new config named
> `hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions` that
> sets the number of partitions to skip, counting from the latest, when
> choosing partitions to create a ClusteringPlan.
>
> For example, if users set
> `hoodie.clustering.plan.strategy.daybased.lookback.partitions` to 2 and
> `hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions` to 2:
> | 20210718 | 20210719 | 20210720 | 20210721 | 20210722 | 20210723(latest) |
> |          |          |<----- choose ------>|          |                  |
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)