[jira] [Created] (HUDI-2338) Hoodie data update reject clustering using SparkRejectClusteringStrategy

Yue Zhang (Jira) Thu, 19 Aug 2021 19:02:04 -0700

Yue Zhang created HUDI-2338:
-------------------------------

             Summary: Hoodie data update reject clustering using 
SparkRejectClusteringStrategy
                 Key: HUDI-2338
                 URL: https://issues.apache.org/jira/browse/HUDI-2338
             Project: Apache Hudi
          Issue Type: Task
            Reporter: Yue Zhang



Hudi now supports async clustering in HoodieDeltaStreamer and Spark Structured 
Streaming, and supports offline clustering through HoodieClusteringJob.

Data updates that conflict with clustering are one of the more common 
scenarios. Currently, Hudi can only reject the update using 
SparkRejectUpdateStrategy, which fails the ingestion.
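
To make the current behavior concrete, here is a minimal sketch of the 
reject-update idea. It is not Hudi's implementation: the function name, the 
UpdateRejectedError class, and the file-group granularity of the check are 
assumptions made for illustration only.

```python
class UpdateRejectedError(Exception):
    """Hypothetical stand-in for the error that fails the ingestion."""


def reject_update_if_conflicting(pending_clustering_file_groups, updated_file_groups):
    """Fail the update when it touches anything a pending clustering plan covers.

    Both arguments are iterables of file-group ids (an assumed granularity).
    Returns the set of updated file groups when there is no conflict.
    """
    updated = set(updated_file_groups)
    conflicts = set(pending_clustering_file_groups) & updated
    if conflicts:
        # Clustering wins here: the update is rejected and the ingestion fails.
        raise UpdateRejectedError(
            "update touches file groups pending clustering: %s" % sorted(conflicts)
        )
    return updated
```

With this policy an update to a file group in a pending plan raises an error, 
while an update to unrelated file groups passes through unchanged.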



In some cases, clustering is regarded as an optimization service that runs in 
the background, and data ingestion has a higher priority than it.

So this ticket adds a new UpdateStrategy named SparkRejectClusteringStrategy.

SparkRejectClusteringStrategy will reject and fail the clustering job and let 
the data update succeed.

When an update happens after the clustering plan is created and before 
clustering is executed:

     1. There will be a requested replace commit.

     2. SparkRejectClusteringStrategy will create a clustering reject file 
under the .tmp dir, named xxx.replacement.request.reject.

     3. Before performing the clustering job, Hudi can check for this reject 
file using the SparkRejectClusteringStrategy.validateClustering() function.

          3.1 If the reject file exists, abort this clustering plan and 
remove the reject file.
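
The steps above can be sketched as a small marker-file protocol. This is only 
an illustration under stated assumptions, not the proposed Hudi code: the 
function names and the use of an instant time in place of "xxx" are 
hypothetical, and a local directory stands in for the .tmp dir.

```python
import tempfile
from pathlib import Path

# Suffix taken from the ticket; the prefix ("xxx") is assumed to be an instant time.
REQUEST_REJECT_SUFFIX = ".replacement.request.reject"


def mark_clustering_rejected(tmp_dir: Path, instant_time: str) -> Path:
    """Writer side (step 2): record that an update conflicts with a pending plan."""
    tmp_dir.mkdir(parents=True, exist_ok=True)
    marker = tmp_dir / (instant_time + REQUEST_REJECT_SUFFIX)
    marker.touch()
    return marker


def validate_clustering(tmp_dir: Path, instant_time: str) -> bool:
    """Clustering side (steps 3 and 3.1): return True if the plan may run.

    If a reject file exists, it is removed and False is returned so that the
    caller aborts the plan.
    """
    marker = tmp_dir / (instant_time + REQUEST_REJECT_SUFFIX)
    if marker.exists():
        marker.unlink()
        return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        tmp_dir = Path(base) / ".tmp"
        mark_clustering_rejected(tmp_dir, "20210819190204")
        if not validate_clustering(tmp_dir, "20210819190204"):
            print("clustering plan aborted; the update proceeds")
```

Removing the reject file inside the check keeps a stale marker from blocking a 
later plan, which matches step 3.1.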

When an update happens after clustering has started executing but before it 
has finished:

     1. There will be an inflight replace commit.

     2. SparkRejectClusteringStrategy will create a clustering reject file 
under the .tmp dir, named xxx.replacement.inflight.reject.

     3. Before the clustering job finishes and commits, Hudi can check for 
this reject file using the SparkRejectClusteringStrategy.validateClustering() 
function.

          3.1 If the reject file exists, fail this clustering execution 
and remove the reject file.
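
The inflight case differs mainly in where the check sits: after the rewrite 
work and before the commit. A minimal sketch, assuming hypothetical names 
(run_clustering, ClusteringRejectedError) and callbacks that stand in for the 
real rewrite and commit steps:

```python
from pathlib import Path


class ClusteringRejectedError(Exception):
    """Hypothetical error raised when an update has rejected the clustering."""


def inflight_reject_marker(tmp_dir: Path, instant_time: str) -> Path:
    # File name pattern from the ticket; "xxx" is assumed to be an instant time.
    return tmp_dir / (instant_time + ".replacement.inflight.reject")


def run_clustering(tmp_dir: Path, instant_time: str, rewrite_file_groups, commit) -> None:
    """Run the rewrite, then check for a reject file before committing."""
    # A concurrent update may create the reject file while this is running.
    rewrite_file_groups()

    marker = inflight_reject_marker(tmp_dir, instant_time)
    if marker.exists():
        # Step 3.1: fail this execution and clean up the reject file.
        marker.unlink()
        raise ClusteringRejectedError(
            "clustering %s rejected by a concurrent update" % instant_time
        )

    # No conflict was recorded, so the replace commit can complete.
    commit()
```

Note that this sketch leaves a window between the check and the commit in 
which a new reject file could still appear; how that window is closed is not 
covered by the description above.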



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
