prashantwason commented on PR #9035: URL: https://github.com/apache/hudi/pull/9035#issuecomment-3824996820
Hi @nbalajee @yihua,

We are interested in helping revive this PR. This feature addresses a critical issue that continues to affect Hudi users in production environments.

### Why This Is Important

The Spark task/stage retry problem causes real data quality issues:

- **Duplicate data files** left on the dataset when stray executors complete writes after the driver has finalized the commit
- **Duplicate records** visible to query engines
- **Partial/corrupt Parquet files** when speculative tasks are killed mid-write

Related issues that would benefit from this fix:

- #9615 - Broken parquet files with speculative execution enabled
- #697 - Spark retry problem causing duplicate files
- #1764 - Commits stay INFLIGHT due to duplicate file cleanup failures
- #8674 - Parquet file length too low (0 bytes)

### Current State

I noticed the PR currently has merge conflicts (`mergeable_state: dirty`) and the last CI run had failures. We would be happy to help:

1. **Rebase the PR** onto the latest master to resolve conflicts
2. **Fix CI failures** and ensure tests pass
3. **Address any outstanding review feedback**

### Questions

1. Are there any blocking concerns or design decisions that prevented this from moving forward?
2. Would it be acceptable for us to create a new PR based on this work with the necessary updates?
3. Is there any overlap or conflict with the RFC in #11593 (Robust Spark Writes)?

We have been running a similar implementation internally and can validate that the approach works well in production. Looking forward to helping get this merged.

cc: @nsivabalan @codope
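For anyone trying to reproduce the failure modes above, the stray-executor and mid-write-kill scenarios are typically surfaced by enabling Spark speculative execution on a write-heavy Hudi job. A minimal sketch (the job class and jar name are placeholders, not from this PR):

```shell
# Enable speculative execution so slow tasks get duplicate attempts.
# With Hudi writes, a speculative attempt that finishes (or is killed)
# after the driver finalizes the commit can leave stray or partial
# parquet files behind -- the problem this PR addresses.
spark-submit \
  --conf spark.speculation=true \
  --conf spark.speculation.multiplier=1.5 \
  --conf spark.speculation.quantile=0.75 \
  --class com.example.HudiIngestJob \
  hudi-ingest-job.jar
```

Lowering the multiplier/quantile from their defaults makes speculation fire more aggressively, which raises the odds of hitting the race during testing.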
