suryaprasanna opened a new pull request, #18196:
URL: https://github.com/apache/hudi/pull/18196

   ### Describe the issue this Pull Request addresses
   
   During parallel ingestion, where cell-level and row-level commits happen 
concurrently, a failure of the row's clean commit during post-commit processing 
causes the cell commit to fail as well, wasting the entire run. This PR adds 
graceful handling of post-commit failures so that ingestion can complete 
without failing, along with metrics to track duration and failures.
   
   ### Summary and Changelog
   
   Users can now configure Hudi to ignore post-commit operation failures 
(cleaning, archival, and table services) through a new configuration flag. When 
enabled, post-commit failures are logged but don't kill the application, 
allowing the main write operation to succeed. Metrics have been added to track 
post-commit operation status and duration.
   
   **Changes:**
   - Added `hoodie.post.commit.failures.ignored` config to control post-commit 
failure handling
   - Wrapped post-commit operations (`postCommit`, `mayBeCleanAndArchive`, 
`runTableServicesInline`) in try-catch blocks with configurable failure handling
   - Added `updatePostCommitMetrics()` method in `HoodieMetrics` to track 
post-commit success/failure and duration
   - Added a comprehensive unit test, `testPostCommitFailureHandlingWithMetrics()`, 
in `TestSparkRDDWriteClient` to verify both failure-handling modes and metrics 
tracking
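
   The wrapping pattern described in the list above can be sketched roughly as 
follows. This is a minimal illustration, not the code in this PR: the class 
`PostCommitSketch`, the `PostCommitAction` interface, the `runPostCommit` helper, 
and the two in-memory metric fields are all hypothetical stand-ins, and 
counting a failure even when it is ignored is an assumption.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative only: none of these names come from the Hudi code base. */
public class PostCommitSketch {
    // Hypothetical stand-ins for the failure counter and duration metrics.
    static final AtomicLong failureCounter = new AtomicLong();
    static final AtomicLong lastDurationMs = new AtomicLong(-1);

    /** A post-commit step such as cleaning or archival (hypothetical interface). */
    interface PostCommitAction {
        void run() throws Exception;
    }

    /**
     * Runs the action. On failure it either swallows the error (flag true)
     * or rethrows it (flag false). The duration is recorded in both cases.
     */
    static void runPostCommit(PostCommitAction action, boolean ignoreFailures) {
        long start = System.nanoTime();
        try {
            action.run();
        } catch (Exception e) {
            failureCounter.incrementAndGet();
            if (!ignoreFailures) {
                throw new RuntimeException("Post-commit operation failed", e);
            }
            // Flag enabled: log and continue so the completed write still succeeds.
            System.err.println("Ignoring post-commit failure: " + e.getMessage());
        } finally {
            // The finally block guarantees the duration is updated on every path.
            lastDurationMs.set((System.nanoTime() - start) / 1_000_000);
        }
    }

    public static void main(String[] args) {
        PostCommitAction failingClean = () -> {
            throw new IllegalStateException("clean failed");
        };

        runPostCommit(failingClean, true); // failure is logged and swallowed
        try {
            runPostCommit(failingClean, false); // default mode: failure propagates
        } catch (RuntimeException e) {
            System.out.println("Propagated: " + e.getCause().getMessage());
        }
        System.out.println("failures=" + failureCounter.get());
    }
}
```

   Recording the duration in `finally` keeps the metric meaningful whether the 
step succeeds, fails and is ignored, or fails and propagates.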
   
   ### Impact
   
   **Public API Changes:**
   - New configuration property: `hoodie.post.commit.failures.ignored` 
(default: `false`)
   - New builder method: 
`HoodieWriteConfig.Builder.withIgnorePostCommitFailure(boolean)`
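
   As a rough illustration of how the new property might be supplied and read, 
the sketch below uses plain `java.util.Properties`. The property key and its 
default of `false` come from this description; the class 
`PostCommitConfigSketch` and the helper `ignorePostCommitFailures` are 
hypothetical and are not part of Hudi. In a real writer the builder method 
listed above would presumably be used instead.

```java
import java.util.Properties;

/** Illustrative only: a stand-alone reader for the new flag, not Hudi code. */
public class PostCommitConfigSketch {
    // Key as stated in this PR description.
    static final String IGNORE_POST_COMMIT_FAILURES = "hoodie.post.commit.failures.ignored";

    /** Reads the flag, falling back to the documented default of false. */
    static boolean ignorePostCommitFailures(Properties props) {
        return Boolean.parseBoolean(props.getProperty(IGNORE_POST_COMMIT_FAILURES, "false"));
    }

    public static void main(String[] args) {
        Properties writerProps = new Properties();
        // Flag unset: existing behavior, post-commit failures fail the write.
        System.out.println(ignorePostCommitFailures(writerProps));

        // Opt in explicitly to ignore post-commit failures.
        writerProps.setProperty(IGNORE_POST_COMMIT_FAILURES, "true");
        System.out.println(ignorePostCommitFailures(writerProps));
    }
}
```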
   
   **Metrics:**
   - New metrics: `postCommit.failure.counter` and `postCommit.duration`
   
   **Behavior Change:**
   When `hoodie.post.commit.failures.ignored=true`, post-commit operation 
failures (cleaning, archival, table services) will be logged as errors but will 
not cause the write operation to fail.
   
   ### Risk Level
   
   **Low**
   
   The change is backward compatible (default behavior unchanged) and only 
activates when explicitly configured. The try-catch blocks have finally blocks 
to ensure metrics are always updated. Comprehensive testing has been added to 
verify both success and failure scenarios.
   
   ### Documentation Update
   
   Configuration documentation needs to be updated:
   - Add `hoodie.post.commit.failures.ignored` to the configuration reference with 
the description: "When this config is true, any failures in the post-commit 
operations are ignored and do not kill the application."
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable
   

