SteNicholas opened a new pull request, #8613:
URL: https://github.com/apache/hudi/pull/8613

   ### Change Logs
   
   `ClusteringCommitSink` strengthens its commit and rollback strategy in two 
ways:
   
   - Commit: Introduces `clusteringPlanCache`, which caches the clustering plan 
for each instant as a mapping of instant_time -> clusteringPlan.
   - Rollback: Updates `commitBuffer` to store the mapping of instant_time -> 
file_ids -> event. A map is used to collect the events because rolling back 
intermediate clustering tasks generates corrupt events.
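
   The nested `commitBuffer` mapping can be sketched as follows. This is a 
minimal illustration, not the code in this PR: the `CommitBufferSketch` and 
`Event` types, their fields, and the method names are hypothetical stand-ins, 
and it assumes that an event re-emitted for the same file ids should replace 
the earlier one.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of an instant_time -> file_ids -> event buffer. */
class CommitBufferSketch {

  /** Stand-in for a clustering commit event (assumed shape, not Hudi's class). */
  static final class Event {
    final String instant;
    final String fileIds;
    final int attempt;

    Event(String instant, String fileIds, int attempt) {
      this.instant = instant;
      this.fileIds = fileIds;
      this.attempt = attempt;
    }
  }

  // instant_time -> file_ids -> event
  private final Map<String, Map<String, Event>> commitBuffer = new HashMap<>();

  /** Buffers an event; a later event for the same file ids overwrites the earlier one. */
  void onEvent(Event event) {
    commitBuffer
        .computeIfAbsent(event.instant, k -> new HashMap<>())
        .put(event.fileIds, event);
  }

  /** Returns the events currently buffered for one instant (empty if none). */
  Collection<Event> eventsFor(String instant) {
    return commitBuffer
        .getOrDefault(instant, Collections.<String, Event>emptyMap())
        .values();
  }

  /** Drops all buffered events of an instant, e.g. after commit or rollback. */
  void reset(String instant) {
    commitBuffer.remove(instant);
  }
}
```

   With a list, a retried task would leave two events for the same file ids in 
the buffer; keying by file ids keeps at most one event per file-id group.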
   
   ### Impact
   
   The clustering commit and rollback strategy is improved. When the clustering 
plan contains a relatively large number of file groups, reading the plan for 
each received event is very expensive, so the plan is now cached per instant. 
Meanwhile, rolling back intermediate clustering tasks can generate corrupt 
events, so the events are now collected in a map.
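
   To illustrate the cost argument, here is a hedged sketch of a per-instant 
plan cache. The class name, the generic plan type `P`, the `planReader` 
function, and the `reads` counter are all assumptions made for the example; 
the real sink reads the plan from the table timeline, which is not modelled 
here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of an instant_time -> clusteringPlan cache. */
class ClusteringPlanCacheSketch<P> {

  // instant_time -> clusteringPlan
  private final Map<String, P> clusteringPlanCache = new HashMap<>();

  // Assumed stand-in for the expensive read of a clustering plan.
  private final Function<String, P> planReader;

  // Counts how often the expensive read actually ran (for illustration only).
  int reads = 0;

  ClusteringPlanCacheSketch(Function<String, P> planReader) {
    this.planReader = planReader;
  }

  /** Returns the plan for an instant, reading it only on the first request. */
  P planFor(String instant) {
    return clusteringPlanCache.computeIfAbsent(instant, i -> {
      reads++;
      return planReader.apply(i);
    });
  }

  /** Evicts the cached plan once the instant is committed or rolled back. */
  void evict(String instant) {
    clusteringPlanCache.remove(instant);
  }
}
```

   In this sketch, handling N events for one instant triggers a single plan 
read instead of N.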
   
   ### Risk level (write none, low, medium or high below)
   
   none.
   
   ### Documentation Update
   
   none.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable
   - [x] CI passed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
