prashantwason opened a new issue, #18044: URL: https://github.com/apache/hudi/issues/18044
## Description

When Spark speculative execution is enabled (`spark.speculation=true`), it can lead to data corruption and duplicate records in Hudi tables. Speculative execution launches duplicate attempts of the same task, which can result in:

- Multiple write attempts to the same file slices
- Broken/partial parquet files when speculative tasks are killed mid-write
- Duplicate records in the dataset

This is related to issue #9615, which reports broken parquet files after compaction with speculative execution enabled.

## Proposed Solution

Add a configuration-controlled guardrail that prevents `SparkRDDWriteClient` creation when Spark speculative execution is detected:

1. Add a new config `hoodie.block.writes.on.speculative.execution` (default: `true`)
2. Check the `spark.speculation` setting during client initialization
3. Throw an exception with a clear error message if speculative execution is enabled

This provides a proactive defense rather than attempting to handle corruption after the fact. Users who understand the risks can disable the guardrail by setting the config to `false`.

## Impact

- **Data Safety**: Prevents potential data corruption and duplicate records
- **User Experience**: A clear error message guides users to either disable speculation or explicitly opt out of the guardrail
- **Backward Compatible**: The default behavior blocks writes, but existing users who knowingly rely on speculative execution can opt out
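A rough sketch of what the proposed check could look like. This is only an illustration of the guardrail logic from this issue, not the actual Hudi implementation: the class name, method name, and the habit of passing configs as a plain `Map` are assumptions for the sake of a self-contained example (in practice the check would read from `SparkConf` inside `SparkRDDWriteClient` initialization).

```java
import java.util.Map;

// Hypothetical sketch of the guardrail proposed in this issue; the class and
// method names are illustrative, not part of Hudi's actual API.
class SpeculativeExecutionGuard {
  // Proposed config key from this issue (default: true).
  static final String BLOCK_WRITES_KEY = "hoodie.block.writes.on.speculative.execution";
  static final String SPARK_SPECULATION_KEY = "spark.speculation";

  /** Throws if Spark speculative execution is enabled while the guardrail is active. */
  static void validate(Map<String, String> conf) {
    boolean blockWrites =
        Boolean.parseBoolean(conf.getOrDefault(BLOCK_WRITES_KEY, "true"));
    boolean speculation =
        Boolean.parseBoolean(conf.getOrDefault(SPARK_SPECULATION_KEY, "false"));
    if (blockWrites && speculation) {
      throw new IllegalStateException(
          "Hudi writes are blocked because spark.speculation=true can cause data "
              + "corruption and duplicate records. Disable speculation, or set "
              + BLOCK_WRITES_KEY + "=false to opt out of this guardrail.");
    }
  }
}
```

With this shape, the default (`true`) blocks writes whenever speculation is detected, while users who knowingly run with speculation keep a one-line opt-out.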
