[ 
https://issues.apache.org/jira/browse/HUDI-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-2338:
---------------------------------
    Labels: pull-request-available  (was: )

> Hoodie data update reject clustering using SparkRejectClusteringStrategy
> ------------------------------------------------------------------------
>
>                 Key: HUDI-2338
>                 URL: https://issues.apache.org/jira/browse/HUDI-2338
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Yue Zhang
>            Priority: Major
>              Labels: pull-request-available
>
> Hudi now supports async clustering in HoodieDeltaStreamer and Structured 
> Streaming, and supports offline clustering through HoodieClusteringJob.
> Data updates that conflict with clustering are one of the more common 
> scenarios. Currently Hudi can only reject the update using 
> SparkRejectUpdateStrategy, which fails the ingestion.
> Sometimes we consider clustering an optimization service that runs in 
> the background, with data ingestion having a higher priority than it.
> So this ticket adds a new UpdateStrategy named SparkRejectClusteringStrategy.
> SparkRejectClusteringStrategy rejects and fails the clustering job and 
> lets the data update succeed. 
> When an update happens after the clustering plan is created and before 
> clustering is executed:
>      1. There will be a requested replace commit.  
>      2. SparkRejectClusteringStrategy will create a clustering reject file 
> under the .tmp dir named xxx.replacement.request.reject.
>      3. Before performing the clustering job, Hudi can check for this reject 
> file using the SparkRejectClusteringStrategy.validateClustering() function.
>           3.1 If the reject file exists, abort this clustering plan and 
> remove the reject file.
> When an update happens after clustering has started executing but before 
> it has finished:
>      1. There will be an inflight replace commit.
>      2. SparkRejectClusteringStrategy will create a clustering reject file 
> under the .tmp dir named xxx.replacement.inflight.reject.
>      3. Before the clustering job finishes and commits, Hudi can check for 
> this reject file using the SparkRejectClusteringStrategy.validateClustering() 
> function.
>           3.1 If the reject file exists, fail this clustering execution 
> and remove the reject file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
