prashantwason opened a new issue, #18049:
URL: https://github.com/apache/hudi/issues/18049
## Describe the problem
When Spark tasks or stages are retried (due to failures, speculation, or
task resubmission), write operations may be re-executed. This can cause:
1. **Duplicate data files** - Multiple copies of the same logical write are
created with different file names
2. **Corrupted/partial Parquet files** - When speculative tasks are killed
mid-write, they leave behind incomplete files (often just headers)
3. **Downstream job failures** - Subsequent read, ingestion, or compaction
jobs fail when encountering corrupt or duplicate files
### Related issues
- #9615 - After enabling speculative execution, broken parquet files are
generated
- #697 - Spark retry problem causing duplicate files
- #1764 - Commits stay INFLIGHT due to duplicate file cleanup failures
## Describe the solution
When DIRECT markers are configured:
1. **Store WriteStatus in completion markers** - After a file is
successfully written, serialize and store the `WriteStatus` in the completed
marker file (with checksum for integrity)
2. **Recover WriteStatus on retry** - When a task retry attempts to create
an in-progress marker, check if a completed marker already exists. If it does:
- Deserialize and recover the `WriteStatus` from the existing marker
- Skip the write operation entirely
- Reuse the original data file
3. **Support file splitting** - Handle scenarios where a single executor
splits records into multiple files by recovering `WriteStatus` for all files
with matching fileId prefix
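
The recover-or-write flow above can be sketched as follows. This is a minimal, self-contained illustration, not Hudi's actual API: `MarkerRecoverySketch`, `recoverOrWrite`, and the simplified `WriteStatus` record are hypothetical stand-ins, and the marker here stores only the written file path rather than a full serialized `WriteStatus` with checksum.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class MarkerRecoverySketch {

    // Stand-in for Hudi's WriteStatus: just the written file path here.
    record WriteStatus(String filePath) {}

    // If a completed marker already exists from a prior attempt, recover
    // the WriteStatus from it and skip the write; otherwise perform the
    // write and publish the WriteStatus into the completed marker.
    static WriteStatus recoverOrWrite(Path markerDir, String fileId) throws IOException {
        Path completed = markerDir.resolve(fileId + ".marker.COMPLETED");
        if (Files.exists(completed)) {
            // Retry path: reuse the original data file, do not rewrite.
            return new WriteStatus(Files.readString(completed));
        }
        // First attempt: do the real write, then record it in the marker.
        String dataFile = "/data/" + fileId + ".parquet"; // placeholder target
        Files.writeString(completed, dataFile);
        return new WriteStatus(dataFile);
    }
}
```

Calling `recoverOrWrite` a second time with the same `fileId` returns the original `WriteStatus` without re-executing the write, which is the behavior a retried task or speculative attempt would rely on.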
### Key changes required
- `HoodieWriteHandle`: Add `hasRecoveredWriteStatus` flag and
`recoveredWriteStatuses` list; modify `createInProgressMarkerFile` to return
boolean and recover write statuses from existing completed markers
- `DirectWriteMarkers`: Add methods to write serialized data with checksum
to completion markers and read it back
- Various write handles (`HoodieAppendHandle`, `HoodieCreateHandle`,
`HoodieMergeHandle`): Store serialized `WriteStatus` in completed markers
- Table classes: Check for recovered write status and skip re-writing if
already completed
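
The checksum framing that `DirectWriteMarkers` would need might look like the sketch below. The layout (4-byte length, 8-byte CRC32, then the serialized payload) is an assumption for illustration, not Hudi's actual marker format; `MarkerChecksumSketch` and its methods are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class MarkerChecksumSketch {

    // Frame the serialized WriteStatus with its length and CRC32 checksum
    // so a reader can detect a partially written or corrupt marker.
    static byte[] encode(String payload) {
        byte[] data = payload.getBytes(StandardCharsets.UTF_8);
        CRC32 crc = new CRC32();
        crc.update(data);
        ByteBuffer buf = ByteBuffer.allocate(4 + 8 + data.length);
        buf.putInt(data.length).putLong(crc.getValue()).put(data);
        return buf.array();
    }

    // Validate the checksum before trusting the recovered payload;
    // a mismatch means the marker itself is incomplete or corrupt.
    static String decode(byte[] framed) {
        ByteBuffer buf = ByteBuffer.wrap(framed);
        byte[] data = new byte[buf.getInt()];
        long expected = buf.getLong();
        buf.get(data);
        CRC32 crc = new CRC32();
        crc.update(data);
        if (crc.getValue() != expected) {
            throw new IllegalStateException("Marker checksum mismatch");
        }
        return new String(data, StandardCharsets.UTF_8);
    }
}
```

On a checksum mismatch the retry would fall back to redoing the write rather than recovering a corrupt `WriteStatus`, which keeps the mechanism safe even if the marker write itself was interrupted.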
### Benefits
- Prevents duplicate data files during task/stage retries
- Avoids corrupted partial files from killed speculative tasks
- Improves job reliability when speculation is enabled
- No data loss since the original successfully written file is preserved
## Additional context
This enhancement works with DIRECT markers only. Timeline-server-based
markers do not support storing content in markers.
--
This is an automated message from the Apache Git Service.