prashantwason opened a new pull request, #18045: URL: https://github.com/apache/hudi/pull/18045
### Describe the issue this Pull Request addresses

Closes #18044

When Spark speculative execution is enabled (`spark.speculation=true`), it can lead to data corruption and duplicate records in Hudi tables. Speculative execution launches duplicate task attempts, which can result in multiple write attempts to the same file slices, broken or partial parquet files, and duplicate records. This is also related to issue #9615, which reports broken parquet files after compaction with speculative execution enabled.

### Summary and Changelog

Adds a configuration-controlled guardrail that prevents `SparkRDDWriteClient` creation when Spark speculative execution is detected:

- Added new config `hoodie.block.writes.on.speculative.execution` (default: `true`) in `HoodieWriteConfig`
- Added a `checkSpeculativeExecution()` method in `SparkRDDWriteClient` that is called during client initialization
- Throws `HoodieException` with a clear error message if speculative execution is enabled and the guardrail is active
- Added comprehensive tests for the guardrail behavior

### Impact

- **New Config**: `hoodie.block.writes.on.speculative.execution`. When enabled (the default), an exception is thrown if Spark speculative execution is detected during client creation. Users who understand the risks can set this to `false` to bypass the guardrail.
- **User-facing Change**: Users with `spark.speculation=true` will now get a clear error message instead of potential data corruption.

### Risk Level

Low. This is an additive change that provides a safety guardrail. The default behavior (blocking writes when speculative execution is enabled) is intentionally conservative to prevent data corruption. Users who need speculative execution can explicitly opt out.

### Documentation Update

The config description is self-documenting.
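For reviewers, the guardrail logic might look roughly like the following self-contained sketch. The config key, the method name `checkSpeculativeExecution()`, and the "throw on speculation + active guardrail" behavior come from this PR; everything else here is an illustrative assumption: a plain `Map` stands in for Spark's `SparkConf` and Hudi's `HoodieWriteConfig`, and a `RuntimeException` stands in for `HoodieException`, so the snippet runs without any Spark or Hudi dependency.

```java
import java.util.Map;

// Illustrative stand-in for the guardrail described in this PR.
// NOT the actual Hudi implementation: Maps replace SparkConf/HoodieWriteConfig,
// and RuntimeException replaces HoodieException.
public class SpeculativeExecutionGuard {

    static final String BLOCK_WRITES_KEY = "hoodie.block.writes.on.speculative.execution";

    /**
     * Throws if Spark speculative execution is enabled and the guardrail
     * config is active (it defaults to true).
     */
    public static void checkSpeculativeExecution(Map<String, String> sparkConf,
                                                 Map<String, String> writeConfig) {
        boolean speculationEnabled =
            Boolean.parseBoolean(sparkConf.getOrDefault("spark.speculation", "false"));
        boolean guardrailActive =
            Boolean.parseBoolean(writeConfig.getOrDefault(BLOCK_WRITES_KEY, "true"));
        if (speculationEnabled && guardrailActive) {
            throw new RuntimeException(
                "Spark speculative execution (spark.speculation=true) can cause duplicate "
                + "task attempts to write the same file slices and corrupt the Hudi table. "
                + "Disable speculation, or set " + BLOCK_WRITES_KEY
                + "=false to bypass this guardrail at your own risk.");
        }
    }
}
```

Under this sketch, a user hitting the exception would either turn speculation off (e.g. `--conf spark.speculation=false`) or, accepting the risk, set `hoodie.block.writes.on.speculative.execution=false` in their write options.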
The new configuration property includes detailed documentation explaining:

- What the guardrail does
- Why speculative execution is problematic for Hudi
- How to bypass it if needed

### Contributor's checklist

- [x] Read through the [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [x] Adequate tests were added if applicable
