shangxinli opened a new pull request, #18795:
URL: https://github.com/apache/hudi/pull/18795

   ### Change Logs
   
   Introduces `RollbackOrphanDetector` and a new feature-flag config that will 
later gate archival of rollback instants when their orphan files are still on 
storage. **No behavior change yet** — this PR only lands the building block; 
wiring into the archive planner follows in a separate cascade PR.
   
   #### Motivation
   
   See issue #18783 for the full problem statement. In summary: when a rollback 
partially fails (crash mid-rollback, marker loss, or a blocked storage 
`close()` that lands data after rollback completed) and the rollback instant is 
later archived, the system loses the metadata anchor that lets readers filter 
out the orphan files. Readers then return corrupt-parquet errors or duplicate 
records — a hard violation of the reader/writer isolation guarantee.
   
   This PR is the foundation for an archive-time precondition check that 
prevents that loss-of-anchor scenario.
   
   #### What's in this PR
   
   1. New `RollbackOrphanDetector` in `hudi-client/hudi-client-common` under 
`org.apache.hudi.table.action.rollback`. Two detection modes:
      - `LIGHT`: reads `HoodieRollbackMetadata.failedDeleteFiles`. Cost is 
O(metadata size). Catches files the rollback explicitly tried and failed to 
delete, but misses files that land on storage after the rollback completes.
      - `THOROUGH`: additionally lists the partitions named in the rollback 
metadata and matches filenames against the rollback's target instant time(s) 
(both base parquet and MoR log file naming). Cost is bounded by the partition 
count in the rollback metadata, not by whole-table size.
   2. Safety floor: every candidate is cross-checked against completed instants 
in the active timeline. A file whose embedded instant is a `COMPLETED` commit 
is never flagged as an orphan, even if its filename matches the `THOROUGH` 
filename pattern.
   3. New config `hoodie.archive.rollback.orphan.guard.mode` (values `OFF` / 
`LIGHT` / `THOROUGH`, **default `OFF`**) and a getter on `HoodieWriteConfig`.
   4. Two overloads of the detection entry point: `(HoodieTable, HoodieInstant, 
Mode)` for the archive planner context and `(HoodieTableMetaClient, 
HoodieInstant, Mode)` for the `hudi-cli` context that follows in PR3 (the 
companion CLI PR).
   5. `TestRollbackOrphanDetector` with 5 tests covering `OFF`, `LIGHT` with 
empty/non-empty `failedDeleteFiles`, `THOROUGH` with a real partition listing, 
and the safety-floor case.
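
For reviewers who want a quick mental model of items 1 and 2, here is a minimal, dependency-free sketch of the two modes and the safety floor. It is not the code in this PR: `OrphanDetectionSketch`, `detect`, and `embeddedInstant` are hypothetical names, the timeline and partition listing are replaced by plain collections, and the two filename patterns are simplified assumptions about base-file and log-file naming rather than the patterns the detector actually uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: not the RollbackOrphanDetector added by this PR. */
final class OrphanDetectionSketch {

  enum Mode { OFF, LIGHT, THOROUGH }

  // Assumed, simplified base-file naming: <fileId>_<writeToken>_<instantTime>.parquet
  private static final Pattern BASE_FILE =
      Pattern.compile("^[^_]+_[^_]+_(\\d+)\\.parquet$");
  // Assumed, simplified log-file naming: .<fileId>_<instantTime>.log.<version>_<writeToken>
  private static final Pattern LOG_FILE =
      Pattern.compile("^\\.[^_]+_(\\d+)\\.log\\.\\d+_.+$");

  /** Returns the instant time embedded in a file name, or null if neither pattern matches. */
  static String embeddedInstant(String fileName) {
    Matcher base = BASE_FILE.matcher(fileName);
    if (base.matches()) {
      return base.group(1);
    }
    Matcher log = LOG_FILE.matcher(fileName);
    return log.matches() ? log.group(1) : null;
  }

  /**
   * failedDeleteFiles: paths recorded by the rollback as failed deletes (LIGHT input).
   * listedFiles: paths found by listing the partitions named in the rollback (THOROUGH input).
   * rolledBackInstants: instant time(s) the rollback targeted.
   * completedInstants: completed instants on the active timeline (safety floor).
   */
  static Set<String> detect(Mode mode,
                            List<String> failedDeleteFiles,
                            List<String> listedFiles,
                            Set<String> rolledBackInstants,
                            Set<String> completedInstants) {
    Set<String> orphans = new TreeSet<>();
    if (mode == Mode.OFF) {
      return orphans; // short-circuit: no work at all
    }

    // LIGHT: start from what the rollback itself reported as undeleted.
    List<String> candidates = new ArrayList<>(failedDeleteFiles);

    // THOROUGH: also match listed file names against the rolled-back instant(s).
    if (mode == Mode.THOROUGH) {
      for (String path : listedFiles) {
        String instant = embeddedInstant(fileName(path));
        if (instant != null && rolledBackInstants.contains(instant)) {
          candidates.add(path);
        }
      }
    }

    // Safety floor: never flag a file whose embedded instant is a completed commit.
    // Assumption: a candidate with no parseable instant is still flagged.
    for (String path : candidates) {
      String instant = embeddedInstant(fileName(path));
      if (instant == null || !completedInstants.contains(instant)) {
        orphans.add(path);
      }
    }
    return orphans;
  }

  private static String fileName(String path) {
    return path.substring(path.lastIndexOf('/') + 1);
  }
}
```

In this sketch, `THOROUGH` is a strict superset of `LIGHT`, and the safety floor runs last so that it applies to candidates from both sources.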
   
   ### Impact
   
   None until the new config is set to `LIGHT` or `THOROUGH`. The default `OFF` 
short-circuits before any work happens.
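
As a rough illustration of the default-`OFF` behavior, the sketch below resolves the mode from a plain `java.util.Properties` object. Only the config key comes from this PR; `GuardModeConfigSketch`, `resolve`, and `guardEnabled` are hypothetical names, the real getter lives on `HoodieWriteConfig`, and the case-insensitive parsing shown here is an assumption.

```java
import java.util.Locale;
import java.util.Properties;

/** Illustrative only: not the HoodieWriteConfig getter added by this PR. */
final class GuardModeConfigSketch {

  enum Mode { OFF, LIGHT, THOROUGH }

  // Config key introduced by this PR.
  static final String KEY = "hoodie.archive.rollback.orphan.guard.mode";

  /** Resolves the guard mode, falling back to OFF when the key is unset. */
  static Mode resolve(Properties props) {
    String raw = props.getProperty(KEY, Mode.OFF.name());
    // Assumption: values are matched case-insensitively.
    return Mode.valueOf(raw.trim().toUpperCase(Locale.ROOT));
  }

  /** Callers can skip all detection work unless the guard is switched on. */
  static boolean guardEnabled(Properties props) {
    return resolve(props) != Mode.OFF;
  }
}
```

With an empty `Properties`, `guardEnabled` returns `false`, which mirrors the "no impact by default" claim above.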
   
   ### Risk level
   
   low
   
   ### Documentation Update
   
   A user-facing config doc update is appropriate once PR1b (the wiring PR, 
listed as the cascade PR under Related) lands, since that is the change users 
would actually enable. This PR is plumbing only.
   
   ### Related
   
   - Issue: #18783
   - Cascade PR (wires the detector in): 
`feat/rollback-orphan-archive-precondition` on the fork; will be opened after 
this merges
   - Companion CLI PR: `feat/rollback-orphan-repair-cli` on the fork; will be 
opened after this merges
   
   ### Contributor's checklist
   
   - [ ] HUDI JIRA ticket (placeholder `HUDI-XXXX` in title — happy to file 
once approach is sanity-checked)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests added
   - [ ] CI passed
   

