prashantwason opened a new issue, #18044:
URL: https://github.com/apache/hudi/issues/18044

   ## Description
   
   Enabling Spark speculative execution (`spark.speculation=true`) can corrupt data and introduce duplicate records in Hudi tables. Speculative execution launches duplicate attempts of the same task, which can result in:
   
   - Multiple write attempts to the same file slices
   - Broken/partial parquet files when speculative tasks are killed mid-write
   - Duplicate records in the dataset
   
   This is related to issue #9615, which reports broken parquet files after 
compaction with speculative execution enabled.
   
   ## Proposed Solution
   
   Add a configuration-controlled guardrail that prevents `SparkRDDWriteClient` 
creation when Spark speculative execution is detected:
   
   1. Add new config `hoodie.block.writes.on.speculative.execution` (default: 
`true`)
   2. Check `spark.speculation` setting during client initialization
   3. Throw an exception with a clear error message if speculative execution is 
enabled
   
   This is a proactive defense: the job fails fast at client creation instead of producing corruption that must be detected and repaired after the fact. Users who understand the risks can disable the guardrail by setting the config to `false`.
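
   The check described in steps 1–3 could look roughly like the sketch below. The config key is the one proposed in this issue; the class name `SpeculationGuardrail`, the method name, and the use of plain maps standing in for `SparkConf` and `HoodieWriteConfig` are illustrative assumptions, not Hudi's actual API:

   ```java
   import java.util.HashMap;
   import java.util.Map;

   // Sketch of the proposed guardrail (hypothetical names; maps stand in for
   // SparkConf / HoodieWriteConfig to keep the example self-contained).
   public class SpeculationGuardrail {
     static final String BLOCK_WRITES_KEY =
         "hoodie.block.writes.on.speculative.execution";

     /** Throws if speculation is on and the guardrail has not been disabled. */
     public static void checkSpeculativeExecution(Map<String, String> sparkConf,
                                                  Map<String, String> writeConfig) {
       boolean speculationOn =
           Boolean.parseBoolean(sparkConf.getOrDefault("spark.speculation", "false"));
       // Default true: writes are blocked unless the user explicitly opts out.
       boolean guardrailOn =
           Boolean.parseBoolean(writeConfig.getOrDefault(BLOCK_WRITES_KEY, "true"));
       if (speculationOn && guardrailOn) {
         throw new IllegalStateException(
             "Spark speculative execution (spark.speculation=true) can corrupt Hudi "
             + "writes. Disable speculation, or set " + BLOCK_WRITES_KEY
             + "=false to opt out of this guardrail.");
       }
     }

     public static void main(String[] args) {
       Map<String, String> sparkConf = new HashMap<>();
       sparkConf.put("spark.speculation", "true");
       Map<String, String> writeConfig = new HashMap<>();
       try {
         checkSpeculativeExecution(sparkConf, writeConfig);
       } catch (IllegalStateException e) {
         System.out.println("Blocked: " + e.getMessage());
       }
       // Explicit opt-out lets the write proceed.
       writeConfig.put(BLOCK_WRITES_KEY, "false");
       checkSpeculativeExecution(sparkConf, writeConfig);
       System.out.println("Opt-out path allowed");
     }
   }
   ```

   In the real implementation this check would run once inside `SparkRDDWriteClient` construction, so every write path is covered without touching individual writers.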
   
   ## Impact
   
   - **Data Safety**: Prevents potential data corruption and duplicate records
   - **User Experience**: Clear error message guides users to either disable speculation or explicitly opt out of the guardrail
   - **Backward Compatible**: Default behavior blocks writes, but existing users who knowingly rely on speculative execution can opt out
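
   For reference, the two user-facing remedies would be set as plain configs (the opt-out key is the one proposed in this issue and does not exist in any released Hudi version):

   ```properties
   # Option 1: turn off speculative execution for the job
   spark.speculation=false

   # Option 2 (proposed here): accept the risk and opt out of the guardrail
   hoodie.block.writes.on.speculative.execution=false
   ```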

