kbuci opened a new pull request, #18288:
URL: https://github.com/apache/hudi/pull/18288

   ### Describe the issue this Pull Request addresses
   
   If `getClusteringPlan` is called after the target instant is rolled back by 
a concurrent writer, a runtime exception is thrown. This causes the following 
important use cases to fail:
   
   - **Ingestion** checking whether other replacecommits are from clustering 
(via `ClusteringUtils.getAllFileGroupsInPendingClusteringPlans`)
   - **Clustering jobs** calling `ClusteringUtils.getAllPendingClusteringPlans` 
to find failed clustering attempts to roll back
   - **File system view initialization** calling 
`ClusteringUtils.getAllFileGroupsInPendingClusteringPlans` to track file groups 
involved in pending clustering
   
   In all of these cases, the instant can be rolled back by a concurrent 
writer after the timeline is loaded but before `getClusteringPlan` is called, 
so the requested metadata file no longer exists when it is read.
   
   ### Summary and Changelog
   
   Update `ClusteringUtils.getClusteringPlan` to gracefully handle the case 
where a clustering/replacecommit instant is rolled back by a concurrent writer 
between timeline load and metadata read.
   
   - The method that directly reads requested replace metadata now catches both 
`IOException` and `HoodieIOException`
   - When a `HoodieTableMetaClient` is available, the active timeline is 
reloaded on error and the instant's presence is re-checked. If the instant is 
no longer in the timeline, the error is suppressed and an empty `Option` is 
returned instead of throwing
   - When `metaClient` is not available (e.g. callers using the timeline-only 
overload), the original exception behavior is preserved
   - A new overload accepting `Option<HoodieTableMetaClient>` is introduced to 
allow callers to opt into error recovery
   - Added unit tests covering: non-existent instant, deleted requested file 
(simulated rollback), and `getAllPendingClusteringPlans` gracefully skipping a 
rolled-back instant
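   
   The recovery flow described above can be sketched roughly as follows. This is a simplified, self-contained illustration and not the actual Hudi code: `PlanReader`, `TimelineReloader`, and `getPlan` are hypothetical stand-ins, `java.util.Optional` is used in place of Hudi's `Option`, and `UncheckedIOException` plays the role of `HoodieIOException`. Rethrowing when the instant is still present after the reload is an assumption drawn from the bullets above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Optional;
import java.util.Set;

final class ClusteringPlanRecoverySketch {

  /** Hypothetical stand-in for reading the requested replace metadata. */
  interface PlanReader {
    String readPlan(String instantTime) throws IOException;
  }

  /** Hypothetical stand-in for reloading the active timeline. */
  interface TimelineReloader {
    Set<String> reloadInstantTimes();
  }

  /**
   * Reads the plan for an instant. On a read failure, and only when a
   * reloader is supplied, the timeline is reloaded and an empty result is
   * returned if the instant is gone (treated as concurrently rolled back).
   */
  static Optional<String> getPlan(String instantTime,
                                  PlanReader reader,
                                  Optional<TimelineReloader> reloader) {
    try {
      return Optional.of(reader.readPlan(instantTime));
    } catch (IOException e) {
      return recoverOrThrow(instantTime, reloader, new UncheckedIOException(e));
    } catch (UncheckedIOException e) {
      return recoverOrThrow(instantTime, reloader, e);
    }
  }

  private static Optional<String> recoverOrThrow(String instantTime,
                                                 Optional<TimelineReloader> reloader,
                                                 UncheckedIOException failure) {
    boolean rolledBack = reloader.isPresent()
        && !reloader.get().reloadInstantTimes().contains(instantTime);
    if (rolledBack) {
      // Instant vanished between timeline load and metadata read: suppress.
      return Optional.empty();
    }
    // No reloader, or the instant is still present: keep the original failure.
    throw failure;
  }
}
```

   Passing an empty `reloader` corresponds to the timeline-only overload, where the original exception behavior is preserved.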
   
   ### Impact
   
   No breaking public API changes. The existing 
`getClusteringPlan(HoodieTableMetaClient, HoodieInstant)` and 
`getClusteringPlan(HoodieTimeline, HoodieInstant, InstantGenerator)` signatures 
are unchanged. A new overload `getClusteringPlan(HoodieTimeline, HoodieInstant, 
InstantGenerator, Option<HoodieTableMetaClient>)` is added.
   
   Behavioral change: `getClusteringPlan` now returns `Option.empty()` instead 
of throwing when the instant was concurrently rolled back and `metaClient` is 
available for verification. This also prevents file system view initialization 
from failing when it calls `getAllFileGroupsInPendingClusteringPlans` during a 
concurrent rollback.
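   
   From a caller's point of view, the behavioral change means an empty result is skipped rather than aborting the whole scan. The following is a minimal sketch of that pattern under stated assumptions: `collectPendingPlans` and the `planLookup` function are hypothetical names, plans are represented as plain strings, and `java.util.Optional` stands in for Hudi's `Option`. It is not the implementation of `getAllPendingClusteringPlans`.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

final class PendingPlanCollectorSketch {

  /**
   * Collects plans for the given pending instants, keyed by instant time.
   * Instants whose lookup returns empty (for example, because they were
   * rolled back concurrently) are skipped instead of failing the scan.
   */
  static Map<String, String> collectPendingPlans(
      List<String> pendingInstantTimes,
      Function<String, Optional<String>> planLookup) {
    Map<String, String> plans = new LinkedHashMap<>();
    for (String instantTime : pendingInstantTimes) {
      planLookup.apply(instantTime)
          .ifPresent(plan -> plans.put(instantTime, plan));
    }
    return plans;
  }
}
```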
   
   ### Risk Level
   
   Low. The fix only changes error handling behavior in a narrow race condition 
(concurrent rollback during metadata read). The happy path is unaffected. The 
error recovery path (reload timeline + check instant presence) is consistent 
with how other parts of the codebase handle concurrent modifications.
   
   ### Documentation Update
   
   None.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable

