[
https://issues.apache.org/jira/browse/HUDI-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HUDI-2338:
---------------------------------
Labels: pull-request-available (was: )
> Hoodie data update reject clustering using SparkRejectClusteringStrategy
> ------------------------------------------------------------------------
>
> Key: HUDI-2338
> URL: https://issues.apache.org/jira/browse/HUDI-2338
> Project: Apache Hudi
> Issue Type: Task
> Reporter: Yue Zhang
> Priority: Major
> Labels: pull-request-available
>
> Hudi now supports async clustering in HoodieDeltaStreamer and Spark Structured
> Streaming, and offline clustering through HoodieClusteringJob.
> Conflicts between data updates and clustering are a common scenario.
> Currently Hudi can only reject the update using SparkRejectUpdateStrategy,
> which fails the ingestion.
> In some cases, clustering is best treated as an optimization service that runs
> in the background, with data ingestion taking priority over it.
> This ticket therefore adds a new UpdateStrategy named SparkRejectClusteringStrategy.
> SparkRejectClusteringStrategy rejects and fails the clustering job and
> lets the data update succeed.
> When an update happens after the clustering plan is created and before
> clustering is executed:
> 1. There will be a requested replace commit.
> 2. SparkRejectClusteringStrategy will create a clustering reject file
> under the .tmp dir named xxx.replacement.request.reject.
> 3. Before performing the clustering job, Hudi can check for this reject file
> using the SparkRejectClusteringStrategy.validateClustering() function.
> 3.1 If the reject file exists, abort this clustering plan and
> remove the reject file.
> When an update happens after clustering has started executing but before it
> has finished:
> 1. There will be an inflight replace commit.
> 2. SparkRejectClusteringStrategy will create a clustering reject file
> under the .tmp dir named xxx.replacement.inflight.reject.
> 3. Before the clustering job finishes and commits, Hudi can check for this
> reject file using the SparkRejectClusteringStrategy.validateClustering() function.
> 3.1 If the reject file exists, fail this clustering execution
> and remove the reject file.
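
The reject-file handshake described in both scenarios above can be sketched
roughly as follows. This is a minimal illustration, not the code from the
pull request: the class ClusteringRejectSketch, its State enum, and the
helper names rejectFile and rejectClustering are hypothetical, and it uses
plain java.nio on a local directory in place of Hudi's own file system
layer. It also assumes the "xxx" prefix in the file names is the instant
time of the replace commit, which the description does not state. Only the
file-name pattern and the validateClustering() name come from the
description.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative sketch only; not Hudi's actual implementation. */
final class ClusteringRejectSketch {

    /** Replace-commit states named in the description, with their file-name part. */
    enum State {
        REQUESTED("request"),
        INFLIGHT("inflight");

        final String suffix;

        State(String suffix) {
            this.suffix = suffix;
        }
    }

    private ClusteringRejectSketch() {}

    /** Builds <instant>.replacement.<state>.reject under the given tmp dir (assumed naming). */
    static Path rejectFile(Path tmpDir, String instant, State state) {
        return tmpDir.resolve(instant + ".replacement." + state.suffix + ".reject");
    }

    /** Ingestion side: record that an update conflicted with a pending clustering. */
    static void rejectClustering(Path tmpDir, String instant, State state) throws IOException {
        Files.createDirectories(tmpDir);
        try {
            Files.createFile(rejectFile(tmpDir, instant, state));
        } catch (FileAlreadyExistsException alreadyRejected) {
            // An earlier update already rejected this clustering; nothing more to do.
        }
    }

    /**
     * Clustering side: returns true if clustering may proceed. If a reject file
     * exists, it is removed and false is returned so the caller can abort or fail.
     */
    static boolean validateClustering(Path tmpDir, String instant, State state) throws IOException {
        return !Files.deleteIfExists(rejectFile(tmpDir, instant, state));
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("hudi-reject-sketch");
        String instant = "20210820101500"; // made-up instant time
        rejectClustering(tmp, instant, State.REQUESTED);
        System.out.println(validateClustering(tmp, instant, State.REQUESTED)); // false: rejected
        System.out.println(validateClustering(tmp, instant, State.REQUESTED)); // true: marker removed
    }
}
```

In this sketch the check and the removal of the reject file happen in a
single deleteIfExists call, so one rejection is consumed by exactly one
validation. Whether the real strategy needs that property, or handles
concurrent writers differently, is not specified in the description.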
--
This message was sent by Atlassian Jira
(v8.3.4#803005)