suryaprasanna opened a new pull request, #18196: URL: https://github.com/apache/hudi/pull/18196
### Describe the issue this Pull Request addresses

During parallel ingestion where cell and row level commits happen concurrently, if a row's clean commit fails during post-commit processing, it causes the cell commit to fail as well, wasting the entire run. This PR adds graceful handling of post-commit failures to allow ingestion to complete without failure, along with metrics to track duration and failures.

### Summary and Changelog

Users can now configure Hudi to ignore post-commit operation failures (cleaning, archival, and table services) through a new configuration flag. When enabled, post-commit failures are logged but don't kill the application, allowing the main write operation to succeed. Metrics have been added to track post-commit operation status and duration.

**Changes:**

- Added `hoodie.post.commit.failures.ignored` config to control post-commit failure handling
- Wrapped post-commit operations (`postCommit`, `mayBeCleanAndArchive`, `runTableServicesInline`) in try-catch blocks with configurable failure handling
- Added `updatePostCommitMetrics()` method in `HoodieMetrics` to track post-commit success/failure and duration
- Added comprehensive unit test `testPostCommitFailureHandlingWithMetrics()` in `TestSparkRDDWriteClient` to verify both failure handling modes and metrics tracking

### Impact

**Public API Changes:**

- New configuration property: `hoodie.post.commit.failures.ignored` (default: `false`)
- New builder method: `HoodieWriteConfig.Builder.withIgnorePostCommitFailure(boolean)`

**Metrics:**

- New metrics: `postCommit.failure.counter` and `postCommit.duration`

**Behavior Change:** When `hoodie.post.commit.failures.ignored=true`, post-commit operation failures (cleaning, archival, table services) will be logged as errors but will not cause the write operation to fail.

### Risk Level

**Low**

The change is backward compatible (default behavior unchanged) and only activates when explicitly configured.
Each try-catch block has a finally block so that metrics are always updated. Comprehensive testing has been added to verify both success and failure scenarios.

### Documentation Update

Configuration documentation needs to be updated:

- Add `hoodie.post.commit.failures.ignored` to the configuration reference with the description: "When this config is true, any failures in the post commit operations will be ignored and does not kill the application."

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [x] Adequate tests were added if applicable
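
### Illustrative sketch of the failure handling

For reviewers who want a picture of the try-catch-finally handling described under Summary and Changelog, here is a minimal, self-contained sketch of the pattern. It is not the code in this PR: `PostCommitSketch`, `Metrics`, and `runPostCommit` are hypothetical names invented for illustration, the `ignoreFailures` argument stands in for `hoodie.post.commit.failures.ignored`, and the `Runnable` stands in for the wrapped post-commit operations. The actual change may differ in structure and exception types.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the post-commit wrapper; none of these names are Hudi APIs.
class PostCommitSketch {

    /** Minimal metrics holder, loosely mirroring a failure counter and a duration gauge. */
    static final class Metrics {
        final AtomicLong failureCounter = new AtomicLong();
        final AtomicLong lastDurationMs = new AtomicLong(-1);
        final AtomicLong updates = new AtomicLong();

        void update(boolean succeeded, long durationMs) {
            if (!succeeded) {
                failureCounter.incrementAndGet();
            }
            lastDurationMs.set(durationMs);
            updates.incrementAndGet();
        }
    }

    /**
     * Runs the post-commit work. With ignoreFailures=true a failure is logged and
     * swallowed; with ignoreFailures=false it is rethrown. Metrics are updated on
     * every path because the update sits in the finally block.
     *
     * @return true if the post-commit work completed without throwing
     */
    static boolean runPostCommit(boolean ignoreFailures, Runnable postCommitOps, Metrics metrics) {
        long startNs = System.nanoTime();
        boolean succeeded = false;
        try {
            postCommitOps.run();
            succeeded = true;
        } catch (RuntimeException e) {
            if (!ignoreFailures) {
                throw e; // default behavior: the failure propagates to the writer
            }
            System.err.println("Post-commit failed, continuing: " + e.getMessage());
        } finally {
            long durationMs = (System.nanoTime() - startNs) / 1_000_000;
            metrics.update(succeeded, durationMs);
        }
        return succeeded;
    }

    public static void main(String[] args) {
        Metrics metrics = new Metrics();
        // Simulated cleaning failure with the flag enabled: the call returns normally.
        boolean ok = runPostCommit(true, () -> {
            throw new IllegalStateException("simulated clean failure");
        }, metrics);
        System.out.println("succeeded=" + ok + ", failures=" + metrics.failureCounter.get());
    }
}
```

With the flag disabled, the same simulated failure would propagate out of `runPostCommit`, while the `finally` block would still record the failure and duration first.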
