SteNicholas opened a new pull request, #8613: URL: https://github.com/apache/hudi/pull/8613
### Change Logs

`ClusteringCommitSink` strengthens its commit and rollback strategy in two ways:

- Commit: Introduces `clusteringPlanCache`, which caches the clustering plan for each instant as a mapping of instant_time -> clusteringPlan.
- Rollback: Updates `commitBuffer` to store the mapping of instant_time -> file_ids -> event. A map is used to collect the events because rolling back intermediate clustering tasks can generate corrupt events.

### Impact

The clustering commit and rollback strategy is improved. When the clustering plan contains a relatively large number of file groups, reading the plan for every received event is very expensive; caching the plan per instant avoids that repeated read. Meanwhile, rolling back intermediate clustering tasks can generate corrupt events, so the events are collected in a map keyed by file_ids.

### Risk level (write none, low medium or high below)

none.

### Documentation Update

none.

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Change Logs and Impact were stated clearly
- [x] Adequate tests were added if applicable
- [x] CI passed
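For readers who want a concrete picture of the two mappings named in the Change Logs, here is a minimal Java sketch. It is not the code from this pull request. The class name `ClusteringCommitBufferSketch`, the `planLoader` function, the `planLoads` counter, and the use of plain `String` and `List<String>` in place of Hudi's event and plan types are all hypothetical simplifications. It also assumes that a later event for the same file_ids replaces an earlier one, which is one plausible reading of "collect the events via the map".

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical, simplified model of the two maps; not the actual ClusteringCommitSink code. */
class ClusteringCommitBufferSketch {
    // instant_time -> clusteringPlan (the plan is modelled as the list of file_ids keys it expects)
    private final Map<String, List<String>> clusteringPlanCache = new HashMap<>();
    // instant_time -> file_ids -> event (the event is modelled as a plain String)
    private final Map<String, Map<String, String>> commitBuffer = new HashMap<>();
    // Stand-in for the expensive read of the clustering plan
    private final Function<String, List<String>> planLoader;
    // Counts how often the plan was actually loaded, for illustration only
    int planLoads = 0;

    ClusteringCommitBufferSketch(Function<String, List<String>> planLoader) {
        this.planLoader = planLoader;
    }

    /** Buffers one event; returns true once every entry of the plan has an event. */
    boolean onEvent(String instant, String fileIds, String event) {
        // Assumption: a retried task reports the same file_ids key, so its event
        // overwrites the earlier one instead of being counted twice.
        commitBuffer.computeIfAbsent(instant, k -> new HashMap<>()).put(fileIds, event);
        // The plan is loaded at most once per instant and then served from the cache.
        List<String> plan = clusteringPlanCache.computeIfAbsent(instant, k -> {
            planLoads++;
            return planLoader.apply(k);
        });
        return commitBuffer.get(instant).keySet().containsAll(plan);
    }

    /** Drops all state for an instant after it has been committed or rolled back. */
    void reset(String instant) {
        commitBuffer.remove(instant);
        clusteringPlanCache.remove(instant);
    }
}
```

In this model, a duplicate event for the same file_ids leaves the buffer size unchanged, and the plan is loaded once per instant regardless of how many events arrive.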
