prashantwason opened a new issue, #18049:
URL: https://github.com/apache/hudi/issues/18049
## Describe the problem
When Spark tasks or stages are retried (due to failures, speculation, or
task resubmission), write operations may be re-executed. This can cause:
1. **Duplicate data files** - Multiple copies of the same logical write are
created with different file names
2. **Corrupted/partial Parquet files** - When speculative tasks are killed
mid-write, they leave behind incomplete files (often just headers)
3. **Downstream job failures** - Subsequent read, ingestion, or compaction
jobs fail when encountering corrupt or duplicate files
### Related issues
- #9615 - After enabling speculative execution, broken parquet files are
generated
- #697 - Spark retry problem causing duplicate files
- #1764 - Commits stay INFLIGHT due to duplicate file cleanup failures
## Describe the solution
When DIRECT markers are configured:
1. **Store WriteStatus in completion markers** - After a file is
successfully written, serialize and store the `WriteStatus` in the completed
marker file (with checksum for integrity)
2. **Recover WriteStatus on retry** - When a task retry attempts to create
an in-progress marker, check if a completed marker already exists. If it does:
- Deserialize and recover the `WriteStatus` from the existing marker
- Skip the write operation entirely
- Reuse the original data file
3. **Support file splitting** - Handle scenarios where a single executor
splits records into multiple files by recovering `WriteStatus` for all files
with matching fileId prefix
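
The recover-or-write flow above can be sketched as follows. This is a minimal, self-contained illustration, not Hudi's actual API: `MarkerRecoverySketch`, `recoverOrWrite`, and the simplified `WriteStatus` record are hypothetical stand-ins, and the marker here stores only the written file path rather than a full serialized `WriteStatus` with checksum.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class MarkerRecoverySketch {

    // Stand-in for Hudi's WriteStatus: just the written file path here.
    record WriteStatus(String filePath) {}

    // If a completed marker already exists from a prior attempt, recover
    // the WriteStatus from it and skip the write; otherwise perform the
    // write and publish the WriteStatus into the completed marker.
    static WriteStatus recoverOrWrite(Path markerDir, String fileId) throws IOException {
        Path completed = markerDir.resolve(fileId + ".marker.COMPLETED");
        if (Files.exists(completed)) {
            // Retry path: reuse the original data file, do not rewrite.
            return new WriteStatus(Files.readString(completed));
        }
        // First attempt: do the real write, then record it in the marker.
        String dataFile = "/data/" + fileId + ".parquet"; // placeholder target
        Files.writeString(completed, dataFile);
        return new WriteStatus(dataFile);
    }
}
```

Calling `recoverOrWrite` a second time with the same `fileId` returns the original `WriteStatus` without re-executing the write, which is the behavior a retried task or speculative attempt would rely on.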
### Key changes required
- `HoodieWriteHandle`: Add `hasRecoveredWriteStatus` flag and
`recoveredWriteStatuses` list; modify `createInProgressMarkerFile` to return
boolean and recover write statuses from existing completed markers
- `DirectWriteMarkers`: Add methods to write serialized data with checksum
to completion markers and read it back
- Various write handles (`HoodieAppendHandle`, `HoodieCreateHandle`,
`HoodieMergeHandle`): Store serialized `WriteStatus` in completed markers
- Table classes: Check for recovered write status and skip re-writing if
already completed
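
The checksum framing that `DirectWriteMarkers` would need might look like the sketch below. The layout (4-byte length, 8-byte CRC32, then the serialized payload) is an assumption for illustration, not Hudi's actual marker format; `MarkerChecksumSketch` and its methods are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class MarkerChecksumSketch {

    // Frame the serialized WriteStatus with its length and CRC32 checksum
    // so a reader can detect a partially written or corrupt marker.
    static byte[] encode(String payload) {
        byte[] data = payload.getBytes(StandardCharsets.UTF_8);
        CRC32 crc = new CRC32();
        crc.update(data);
        ByteBuffer buf = ByteBuffer.allocate(4 + 8 + data.length);
        buf.putInt(data.length).putLong(crc.getValue()).put(data);
        return buf.array();
    }

    // Validate the checksum before trusting the recovered payload;
    // a mismatch means the marker itself is incomplete or corrupt.
    static String decode(byte[] framed) {
        ByteBuffer buf = ByteBuffer.wrap(framed);
        byte[] data = new byte[buf.getInt()];
        long expected = buf.getLong();
        buf.get(data);
        CRC32 crc = new CRC32();
        crc.update(data);
        if (crc.getValue() != expected) {
            throw new IllegalStateException("Marker checksum mismatch");
        }
        return new String(data, StandardCharsets.UTF_8);
    }
}
```

On a checksum mismatch the retry would fall back to redoing the write rather than recovering a corrupt `WriteStatus`, which keeps the mechanism safe even if the marker write itself was interrupted.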
### Benefits
- Prevents duplicate data files during task/stage retries
- Avoids corrupted partial files from killed speculative tasks
- Improves job reliability when speculation is enabled
- No data loss since the original successfully written file is preserved
## Additional context
This enhancement works with DIRECT markers only. Timeline-server-based
markers do not support storing content in markers.
--
This is an automated message from the Apache Git Service.