Yue Zhang created HUDI-2338:
-------------------------------
Summary: Hoodie data update reject clustering using
SparkRejectClusteringStrategy
Key: HUDI-2338
URL: https://issues.apache.org/jira/browse/HUDI-2338
Project: Apache Hudi
Issue Type: Task
Reporter: Yue Zhang
Hudi now supports async clustering in HoodieDeltaStreamer and Structured
Streaming, and supports offline clustering through HoodieClusteringJob.
Data updates that conflict with clustering are one of the more common
scenarios. Currently Hudi can only reject the update using
SparkRejectUpdateStrategy, which fails the ingestion.
In some cases, clustering is regarded as an optimization service that runs in
the background, and data ingestion has a higher priority than it.
So this ticket adds a new UpdateStrategy named SparkRejectClusteringStrategy.
SparkRejectClusteringStrategy rejects and fails the clustering job and
lets the data update succeed.
When an update happens after the clustering plan is created and before
clustering is executed:
1. There will be a requested replace commit.
2. SparkRejectClusteringStrategy will create a clustering reject file
under the .tmp dir named xxx.replacement.request.reject.
3. Before performing the clustering job, Hudi can check for this reject file
using the SparkRejectClusteringStrategy.validateClustering() function.
3.1 If the reject file exists, abort this clustering plan and
remove the reject file.
When an update happens after clustering has started executing but before it
has finished:
1. There will be an inflight replace commit.
2. SparkRejectClusteringStrategy will create a clustering reject file
under the .tmp dir named xxx.replacement.inflight.reject.
3. Before the clustering job finishes and commits, Hudi can check for this
reject file using the SparkRejectClusteringStrategy.validateClustering() function.
3.1 If the reject file exists, fail this clustering execution
and remove the reject file.
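The two cases above share one marker-file handshake, which can be sketched
roughly as follows. This is only an illustration and not the proposed Hudi
implementation: the class RejectMarkerSketch, the Phase enum, and the
markReject and validateClustering helpers are hypothetical names, a local
directory accessed through java.nio stands in for the table's .tmp dir, and
the clustering instant time is assumed to take the place of "xxx" in the
reject file name.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch only; this is NOT Hudi's SparkRejectClusteringStrategy.
// All names are hypothetical and a local directory stands in for the .tmp dir.
final class RejectMarkerSketch {

    // The two clustering states described in the ticket.
    enum Phase {
        REQUESTED("request"),  // plan created, clustering not yet executed
        INFLIGHT("inflight");  // clustering executing, not yet committed

        final String suffix;

        Phase(String suffix) {
            this.suffix = suffix;
        }
    }

    // Marker name follows the ticket's pattern: xxx.replacement.<phase>.reject
    // (assumption: "xxx" is the clustering instant time).
    static Path markerPath(Path tmpDir, String instant, Phase phase) {
        return tmpDir.resolve(instant + ".replacement." + phase.suffix + ".reject");
    }

    // Update side: a conflicting update leaves a reject marker behind.
    static void markReject(Path tmpDir, String instant, Phase phase) throws IOException {
        Files.createDirectories(tmpDir);
        try {
            Files.createFile(markerPath(tmpDir, instant, phase));
        } catch (FileAlreadyExistsException ignored) {
            // Another update already rejected this clustering instant.
        }
    }

    // Clustering side: returns true when clustering may proceed. When a marker
    // exists it is removed and false is returned, so the caller aborts the
    // plan (REQUESTED) or fails the execution (INFLIGHT).
    static boolean validateClustering(Path tmpDir, String instant, Phase phase) throws IOException {
        return !Files.deleteIfExists(markerPath(tmpDir, instant, phase));
    }
}
```

In this sketch the clustering job would call validateClustering once with
Phase.REQUESTED before it starts and once with Phase.INFLIGHT before it
commits. A false result means an update won the conflict and the marker has
already been cleaned up.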
--
This message was sent by Atlassian Jira
(v8.3.4#803005)