This is an automated email from the ASF dual-hosted git repository.

vhs pushed a commit to branch rfc-blob-cleaner
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 2b0aa201928f04d37e8ab20aa8d464a9b1db10b7
Author: voon <[email protected]>
AuthorDate: Fri Mar 20 01:33:06 2026 +0800

    add log merged state orphaned blob edge case
---
 rfc/rfc-100/rfc-100-blob-cleaner-design.md | 246 +++++++++++++++++++++--------
 rfc/rfc-100/rfc-100-blob-cleaner.md        | 115 ++++++++++----
 2 files changed, 266 insertions(+), 95 deletions(-)

diff --git a/rfc/rfc-100/rfc-100-blob-cleaner-design.md 
b/rfc/rfc-100/rfc-100-blob-cleaner-design.md
index 76b78bcc0af8..ddb895859ba4 100644
--- a/rfc/rfc-100/rfc-100-blob-cleaner-design.md
+++ b/rfc/rfc-100/rfc-100-blob-cleaner-design.md
@@ -154,16 +154,47 @@ MDT secondary index. The dispatch mechanism is a 
zero-cost string prefix check o
 | MOR strategy        | Over-retain (union of base + log refs)                 
 | Safe (C5, R4); cleaned after compaction                            |
 | Container strategy  | Tuple-level tracking; delete only when all ranges dead 
 | Correct (C4, R3); partial containers flagged for blob compaction   |
 
-> **DIAGRAM 1: Architecture Overview**
->
-> *Block diagram showing how blob cleanup fits into the existing cleaner 
pipeline.*
->
-> Shows: `CleanPlanActionExecutor.requestClean()` → `CleanPlanner` 
(per-partition, per-FG iteration)
-> → **Stage 1** (per-FG blob ref collection + set difference + dispatch) → 
**Stage 2** (MDT
-> secondary index lookup for external candidates) → **Stage 3** (container 
lifecycle resolution) →
-> `HoodieCleanerPlan` (with `blobFilesToDelete` + `containersToCompact`) → 
`CleanActionExecutor`
-> (parallel blob file deletion). The existing file slice deletion path runs 
alongside the blob
-> deletion path within the same clean action.
+```mermaid
+flowchart LR
+    subgraph Planning["CleanPlanActionExecutor.requestClean()"]
+        direction TB
+        Gate{"hasBlobColumns()?"}
+        Gate -- No --> Skip["Skip blob cleanup<br/>(zero cost)"]
+        Gate -- Yes --> CP
+
+        subgraph CP["CleanPlanner (per-partition, per-FG)"]
+            direction TB
+            Policy["Policy method<br/>→ FileGroupCleanResult<br/>(expired + 
retained slices)"]
+            S1["<b>Stage 1</b><br/>Per-FG blob ref<br/>set difference + 
dispatch"]
+            Policy --> S1
+        end
+
+        S1 --> S2["<b>Stage 2</b><br/>Cross-FG verification<br/>(MDT secondary 
index)"]
+        S1 -->|hudi_blob_deletes| S3
+        S2 -->|external_deletes| S3["<b>Stage 3</b><br/>Container 
lifecycle<br/>resolution"]
+    end
+
+    subgraph Plan["HoodieCleanerPlan"]
+        FP["filePathsToBeDeleted<br/>(existing)"]
+        BP["blobFilesToDelete<br/>(new)"]
+        CC["containersToCompact<br/>(new)"]
+    end
+
+    S3 --> BP
+    S3 --> CC
+    CP --> FP
+
+    subgraph Execution["CleanActionExecutor.runClean()"]
+        direction TB
+        DF["Delete file slices<br/>(existing, parallel)"]
+        DB["Delete blob files<br/>(new, parallel)"]
+        RC["Record containers<br/>for blob compaction"]
+    end
+
+    FP --> DF
+    BP --> DB
+    CC --> RC
+```
 
 ---
 
@@ -182,20 +213,28 @@ Output: hudi_blob_deletes     -- blobs safe to delete 
immediately
 
 for each file_group being cleaned:
 
-    // Collect expired blob refs
+    // Collect expired blob refs (base files + log files)
+    // Must read log files: blob refs introduced and superseded within the log
+    // chain before compaction would otherwise become permanent orphans.
     expired_refs = Set<(path, offset, length)>()
     for slice in expired_slices:
-        for ref in extractBlobRefs(slice):        // base: columnar 
projection; log: field extraction
+        for ref in extractBlobRefs(slice.baseFile):   // columnar projection
+            if ref.type == OUT_OF_LINE and ref.managed == true:
+                expired_refs.add((ref.path, ref.offset, ref.length))
+        for ref in extractBlobRefs(slice.logFiles):   // full record read
             if ref.type == OUT_OF_LINE and ref.managed == true:
                 expired_refs.add((ref.path, ref.offset, ref.length))
 
     if expired_refs is empty:
-        continue                                   // no blob work for this FG
+        continue                                       // no blob work for 
this FG
 
-    // Collect retained blob refs
+    // Collect retained blob refs (base files only)
+    // Cleaning is fenced on compaction: retained base files contain the merged
+    // state. Log reads are unnecessary -- any shadowed base ref causes safe
+    // over-retention, cleaned after the next compaction cycle.
     retained_refs = Set<(path, offset, length)>()
-    for slice in retained_slices:                  // includes base + log 
files (MOR)
-        for ref in extractBlobRefs(slice):
+    for slice in retained_slices:
+        for ref in extractBlobRefs(slice.baseFile):   // columnar projection 
only
             if ref.type == OUT_OF_LINE and ref.managed == true:
                 retained_refs.add((ref.path, ref.offset, ref.length))
 
@@ -214,10 +253,16 @@ for each file_group being cleaned:
 within the file group that created it. If a blob ref appears in an expired 
slice but not in any
 retained slice of the same FG, it is globally orphaned. No cross-FG check is 
needed.
 
-**Why correct for MOR (C5, R4).** Retained blob refs are collected as the 
union of base file refs
-and log file refs. This over-counts: a log update that changes a record's blob 
ref makes the base
-file's old ref appear live. This is safe -- over-retention prevents premature 
deletion. After
-compaction merges the log into a new base file, the orphan is identified in 
the next clean cycle.
+**Why correct for MOR (C5, R4).** Two asymmetric read strategies:
+
+- **Expired slices: base + log files.** Log files must be read because blob 
refs can be introduced
+  and superseded entirely within the log chain before compaction (e.g., 
`log@t2: row1→blob_B`,
+  `log@t3: row1→blob_C`). After compaction, `blob_B` exists only in the 
expired log. Skipping it
+  would create a permanent orphan (R2 violation).
+- **Retained slices: base files only.** Since cleaning is fenced on 
compaction, retained base files
+  contain the merged state. Any blob ref shadowed by an uncompacted log on top 
of the retained
+  slice appears in the retained set via the base file -- this causes 
over-retention (safe, never
+  premature deletion). The shadowed ref is cleaned after the next compaction 
cycle.
 
 **Why correct for savepoints (C9).** The existing cleaner excludes savepointed 
file slices from the
 expired set. Blob cleanup inherits this: savepointed slices are always in the 
retained set.
@@ -227,15 +272,18 @@ and `expired_slices` is all slices. For Hudi-created 
blobs, all are safe to dele
 creates new blobs in the target FG via F8). For external blobs, all flow to 
Stage 2 for cross-FG
 verification (clustering copies the pointer via F9, so Stage 2 finds the 
reference in the target FG).
 
-> **DIAGRAM 2: Stage 1 Flow**
->
-> *Flowchart showing the per-file-group blob cleanup logic.*
->
-> Shows: File Group → Extract expired blob refs → Extract retained blob refs → 
Set difference
-> (expired - retained) → local_orphans → Path prefix check → Hudi-created? → 
YES: add to
-> `hudi_blob_deletes` (safe to delete) / NO: add to `external_candidates` 
(needs Stage 2).
-> Annotate the Hudi-created branch with "P3: no cross-FG refs" and the 
external branch with
-> "C13: cross-FG sharing possible".
+```mermaid
+flowchart TD
+    FG["File Group being cleaned"]
+    FG --> Exp["Extract blob refs from<br/><b>expired</b> slices<br/>(base 
files + log files)"]
+    Exp --> Empty{"expired_refs<br/>empty?"}
+    Empty -- Yes --> Done["Skip FG<br/>(no blob work)"]
+    Empty -- No --> Ret["Extract blob refs from<br/><b>retained</b> 
slices<br/>(base files only —<br/>fenced on compaction)"]
+    Ret --> Diff["Set difference:<br/><code>local_orphans = expired - 
retained</code>"]
+    Diff --> Check{"Path starts with<br/><code>.hoodie/blobs/</code>?"}
+    Check -- "Yes (Hudi-created)" --> Hudi["Add to 
<b>hudi_blob_deletes</b><br/>✓ Safe to delete immediately<br/><i>P3: no 
cross-FG refs possible</i>"]
+    Check -- "No (External)" --> Ext["Add to <b>external_candidates</b><br/>→ 
Needs Stage 2 verification<br/><i>C13: cross-FG sharing possible</i>"]
+```
 
 ### Stage 2: Cross-FG Verification (External Blobs)
 
@@ -314,15 +362,32 @@ a bottleneck on large tables. The operator is warned to 
enable the MDT secondary
 | No index, few candidates    | Table scan    | O(candidates * table) | Small 
tables, few shared blobs |
 | No index, many candidates   | Circuit break | Zero (deferred)       | Large 
tables -- index required |
 
-> **DIAGRAM 3: Stage 2 Flow (MDT Secondary Index Path)**
->
-> *Sequence diagram showing the two-hop lookup.*
->
-> Shows: Cleaner → MDT Secondary Index: batched prefix scan with candidate 
paths → returns
-> `Map<path, List<recordKey>>` → for each path: Cleaner → MDT Record Index: 
lookup record key →
-> returns `(partition, fileId)` → check: fileId in cleaned_fg_ids? → YES: try 
next record key /
-> NO: **short-circuit** → blob is live, retain. If all record keys resolve to 
cleaned FGs → blob
-> is globally orphaned → add to external_deletes.
+```mermaid
+sequenceDiagram
+    participant C as Cleaner (Stage 2)
+    participant SI as MDT Secondary Index
+    participant RI as MDT Record Index
+
+    C->>SI: Batched prefix scan<br/>candidate_paths (N paths)
+    SI-->>C: Map<path, List<recordKey>>
+
+    loop For each candidate path
+        alt No record keys returned
+            Note right of C: Globally orphaned → DELETE
+        else Has record keys
+            loop For each record key (short-circuit)
+                C->>RI: Lookup record key
+                RI-->>C: (partition, fileId)
+                alt fileId NOT in cleaned_fg_ids
+                    Note right of C: Live reference found<br/>SHORT-CIRCUIT → 
RETAIN
+                end
+            end
+            alt All record keys in cleaned FGs
+                Note right of C: Globally orphaned → DELETE
+            end
+        end
+    end
+```
 
 ### Stage 3: Container File Lifecycle
 
@@ -374,16 +439,48 @@ from Stage 1 are sufficient -- no cross-FG check needed 
for container ranges.
    └── Transition to COMPLETED
 ```
 
-> **DIAGRAM 4: End-to-End Execution Lifecycle**
->
-> *Sequence/timeline diagram showing the plan-execute-complete lifecycle.*
->
-> Shows: `requestClean()` → compute file slice deletes + blob deletes (Stages 
1-3) →
-> persist `HoodieCleanerPlan` → **REQUESTED** state on timeline → `runClean()` 
→ transition to
-> **INFLIGHT** → delete file slices (parallel) + delete blob files (parallel) 
→ build
-> `HoodieCleanMetadata` (including `blobCleanStats`) → transition to 
**COMPLETED**. Annotate
-> crash recovery points: crash before REQUESTED = restart fresh; crash during 
INFLIGHT = re-execute
-> plan (idempotent); crash after delete but before COMPLETED = re-execute 
(no-op deletes).
+```mermaid
+sequenceDiagram
+    participant P as CleanPlanActionExecutor
+    participant TL as Timeline
+    participant E as CleanActionExecutor
+    participant S as Storage
+
+    Note over P: requestClean()
+
+    P->>P: Stage 1: per-FG blob ref set difference
+    P->>P: Stage 2: MDT index lookup (if external candidates)
+    P->>P: Stage 3: container lifecycle resolution
+    P->>TL: Persist HoodieCleanerPlan
+
+    Note over TL: REQUESTED
+
+    rect rgb(255, 245, 230)
+        Note right of TL: Crash here → restart fresh<br/>(no plan persisted 
yet)
+    end
+
+    E->>TL: Transition plan state
+    Note over TL: INFLIGHT
+
+    rect rgb(255, 245, 230)
+        Note right of TL: Crash here → re-execute plan<br/>(idempotent: 
FileNotFound = success)
+    end
+
+    par Parallel deletion
+        E->>S: Delete file slices (existing)
+    and
+        E->>S: Delete blob files (new)
+    end
+
+    E->>E: Build HoodieCleanMetadata<br/>(+ blobCleanStats)
+    E->>TL: Transition plan state
+
+    rect rgb(255, 245, 230)
+        Note right of TL: Crash here → re-execute<br/>(all deletes are no-ops)
+    end
+
+    Note over TL: COMPLETED
+```
 
 ---
 
@@ -482,23 +579,42 @@ check:
 metadata reads -- negligible compared to the commit's own I/O. False positives 
(unnecessary
 rejections) are rare and handled by existing retry logic.
 
-> **DIAGRAM 5: Writer-Cleaner Concurrency Timeline**
->
-> *Timeline diagram showing two parallel timelines (writer and cleaner) with 
four scenarios.*
->
-> **Scenario A:** Writer commits before cleaner plans → cleaner sees the 
reference in a retained
-> slice → blob not deleted. **Safe.**
->
-> **Scenario B:** Writer commits after cleaner plans, before cleaner deletes → 
cleaner is in
-> REQUESTED/INFLIGHT state → writer's `preCommit()` reads the plan, finds 
intersection →
-> `HoodieWriteConflictException` → writer retries. **Safe -- conflict 
detected.**
->
-> **Scenario C:** Writer commits after cleaner deletes, before cleaner 
transitions to COMPLETED →
-> cleaner is still INFLIGHT → writer's `preCommit()` reads the INFLIGHT plan → 
rejection.
-> **Safe -- same as B.**
->
-> **Scenario D:** Cleaner transitions to COMPLETED, then writer acquires lock 
→ COMPLETED clean
-> metadata visible on timeline → writer's check reads `deletedBlobFilePaths` → 
rejection. **Safe.**
+```mermaid
+sequenceDiagram
+    participant W as Writer
+    participant TL as Timeline
+    participant CL as Cleaner
+
+    Note over W,CL: Scenario A: Writer commits BEFORE cleaner plans
+
+    W->>TL: Commit (references blob_X)
+    CL->>TL: Plan cleanup
+    Note right of CL: Sees blob_X in retained slice → not deleted
+    Note over W,CL: ✓ Safe
+
+    Note over W,CL: Scenario B: Writer commits AFTER cleaner plans, BEFORE 
delete
+
+    CL->>TL: Plan cleanup (blob_X in blobFilesToDelete)
+    Note over TL: REQUESTED / INFLIGHT
+    W->>TL: preCommit() — reads clean plan
+    Note left of W: Intersection found!<br/>HoodieWriteConflictException<br/>→ 
Writer retries
+    Note over W,CL: ✓ Safe — conflict detected
+
+    Note over W,CL: Scenario C: Writer commits AFTER cleaner deletes, BEFORE 
COMPLETED
+
+    CL->>CL: Delete blob_X from storage
+    Note over TL: Still INFLIGHT
+    W->>TL: preCommit() — reads INFLIGHT plan
+    Note left of W: blob_X in blobFilesToDelete<br/>→ Rejection
+    Note over W,CL: ✓ Safe — same as B
+
+    Note over W,CL: Scenario D: Cleaner completes, THEN writer acquires lock
+
+    CL->>TL: Transition to COMPLETED
+    W->>TL: preCommit() — reads COMPLETED metadata
+    Note left of W: blob_X in deletedBlobFilePaths<br/>→ Rejection
+    Note over W,CL: ✓ Safe
+```
 
 ### Concurrency Matrix
 
diff --git a/rfc/rfc-100/rfc-100-blob-cleaner.md 
b/rfc/rfc-100/rfc-100-blob-cleaner.md
index 1f81b8c2d242..4e41b596a1aa 100644
--- a/rfc/rfc-100/rfc-100-blob-cleaner.md
+++ b/rfc/rfc-100/rfc-100-blob-cleaner.md
@@ -161,20 +161,31 @@ Output: hudi_blob_deletes     -- blobs safe to delete 
immediately
 
 for each file_group being cleaned:                    // from refactored 
CleanPlanner
 
-    // --- Collect expired blob refs ---
+    // --- Collect expired blob refs (base files + log files) ---
+    // Must read log files from expired slices: blob refs introduced and 
superseded
+    // within the log chain before compaction would otherwise become permanent 
orphans.
+    // Example: log@t2 adds blob_B, log@t3 supersedes with blob_C, 
compaction@t4
+    // produces base with blob_C. blob_B exists only in expired log@t2.
     expired_refs = Set<BlobRef>()                      // BlobRef = (path, 
offset, length)
     for slice in expired_slices:
-        for ref in extractBlobRefs(slice):             // base: columnar 
projection; log: field extraction
+        for ref in extractBlobRefs(slice.baseFile):    // columnar projection
+            if ref.type == OUT_OF_LINE and ref.managed == true:
+                expired_refs.add(BlobRef(ref.path, ref.offset, ref.length))
+        for ref in extractBlobRefs(slice.logFiles):    // full record read
             if ref.type == OUT_OF_LINE and ref.managed == true:
                 expired_refs.add(BlobRef(ref.path, ref.offset, ref.length))
 
     if expired_refs is empty:
         continue                                        // no blob work for 
this FG
 
-    // --- Collect retained blob refs ---
+    // --- Collect retained blob refs (base files only) ---
+    // Cleaning is fenced on compaction: retained base files contain the 
merged state.
+    // Reading only base files may over-retain shadowed refs (a base ref 
superseded by
+    // an uncompacted log on top). This is safe -- over-retention is always 
preferred
+    // over premature deletion. Shadowed refs are cleaned after the next 
compaction.
     retained_refs = Set<BlobRef>()
-    for slice in retained_slices:                       // includes base + log 
files (MOR)
-        for ref in extractBlobRefs(slice):
+    for slice in retained_slices:
+        for ref in extractBlobRefs(slice.baseFile):    // columnar projection 
only
             if ref.type == OUT_OF_LINE and ref.managed == true:
                 retained_refs.add(BlobRef(ref.path, ref.offset, ref.length))
 
@@ -193,13 +204,24 @@ for each file_group being cleaned:                    // 
from refactored CleanPl
 the file group that created it. If a blob ref appears in an expired slice but 
not in any retained
 slice of the same file group, it is globally orphaned. No cross-FG check is 
needed.
 
-**Why correct for MOR (C5, R4).** For retained slices in a MOR file group, we 
extract blob refs
-from both the base file and all log files, taking the union. This over-counts: 
if a log file updates
-a record's blob ref, the base file's old ref appears in the retained set even 
though it is
-semantically dead. This is safe -- over-retention prevents the premature 
deletion that would occur if
-the log update were later rolled back. **MOR over-retention is unbounded in 
duration -- it depends on
-compaction frequency.** For tables with infrequent compaction, blob storage 
waste from MOR
-over-retention could be significant. This is a known trade-off: correctness 
over space efficiency.
+**Why correct for MOR (C5, R4).** The read strategy is asymmetric by design:
+
+- **Expired slices: base + log files.** Log files must be read because blob 
refs can be introduced
+  and superseded entirely within the log chain before compaction. Example: 
`log@t2` writes
+  `row1→blob_B`, `log@t3` writes `row1→blob_C`, compaction produces a base 
with `blob_C`. After
+  compaction, `blob_B` exists only in expired `log@t2`. Skipping log reads 
would make `blob_B` a
+  permanent orphan (R2 violation).
+
+- **Retained slices: base files only.** Cleaning is fenced on compaction, so 
retained base files
+  contain the merged state. Any blob ref shadowed by an uncompacted log on top 
of the retained
+  slice still appears in the retained set via the base file -- this causes 
over-retention (safe).
  The shadowed ref is cleaned after the next compaction cycle. 
**Over-retention from this source is
  bounded by the compaction interval.** For tables with infrequent compaction, 
blob storage waste from
+  over-retention could be significant. This is a known trade-off: correctness 
over space efficiency.
+
+This asymmetry also improves performance: retained slices require only 
columnar base file
+projections (cheap), while the more expensive log file reads are confined to 
expired slices that
+are being cleaned anyway.
 
 **Why correct for savepoints (C9).** The existing cleaner already excludes 
savepointed file slices
 from the expired set (`isFileSliceExistInSavepointedFiles`). Since blob 
cleanup operates on the same
@@ -796,6 +818,8 @@ BlobCleanResult 
collectBlobRefsForFileGroup(FileGroupCleanResult fgResult, Stora
     return BlobCleanResult.EMPTY;
   }
 
+  // Expired slices: read base + log files (log files may contain blob refs
+  // introduced and superseded within the log chain before compaction)
   Set<BlobRef> expiredRefs = new HashSet<>();
   for (FileSlice slice : fgResult.getExpiredSlices()) {
     extractManagedOutOfLineBlobRefs(slice).forEach(expiredRefs::add);
@@ -805,9 +829,11 @@ BlobCleanResult 
collectBlobRefsForFileGroup(FileGroupCleanResult fgResult, Stora
     return BlobCleanResult.EMPTY;
   }
 
+  // Retained slices: read base files only (fenced on compaction -- base files
+  // contain the merged state; skipping logs causes safe over-retention)
   Set<BlobRef> retainedRefs = new HashSet<>();
   for (FileSlice slice : fgResult.getRetainedSlices()) {
-    extractManagedOutOfLineBlobRefs(slice).forEach(retainedRefs::add);
+    
extractManagedOutOfLineBlobRefsFromBaseFile(slice).forEach(retainedRefs::add);
   }
 
   Set<BlobRef> localOrphans = new HashSet<>(expiredRefs);
@@ -1147,12 +1173,19 @@ references.
 
 ### C5: MOR log updates shadow base file blob refs
 
-**Satisfied.** Retained blob refs are collected as the union of base file refs 
and log file refs.
-This over-counts -- a shadowed base ref appears live even though the log 
update superseded it. This
-is safe: over-retention prevents premature deletion. After compaction merges 
the log into a new base
-file, the superseded ref disappears, and the next clean cycle deletes the 
orphaned blob. **Note:**
-MOR over-retention is unbounded in duration and depends on compaction 
frequency. This is a known
-trade-off explicitly accepted for correctness.
+**Satisfied.** The read strategy is asymmetric:
+
+- *Expired slices* read base + log files: log files must be read because blob 
refs introduced and
+  superseded within the log chain before compaction exist only in expired 
logs. Skipping them would
+  create permanent orphans (R2 violation).
+- *Retained slices* read base files only: cleaning is fenced on compaction, so 
retained base files
+  contain the merged state. Shadowed base refs (superseded by uncompacted 
logs) appear in the
+  retained set, causing safe over-retention. After the next compaction, the 
superseded ref
+  disappears and the next clean cycle deletes the orphaned blob.
+
+**Note:** Over-retention from shadowed retained refs is bounded by the 
compaction interval. For tables
+with infrequent compaction, blob storage waste could be significant. This is a 
known trade-off
+explicitly accepted for correctness.
 
 ### C6: Existing cleaner is per-file-group scoped
 
@@ -1709,16 +1742,19 @@ table scan path until the index is fully built.
 
 ### 10.1 Cost model for Stage 1
 
-For each cleaned file group, Stage 1 reads blob ref fields from expired and 
retained slices.
+For each cleaned file group, Stage 1 reads blob ref fields from expired and 
retained slices. The
+read strategy is asymmetric: expired slices read base + log files; retained 
slices read base files
+only (cleaning is fenced on compaction, so retained base files contain the 
merged state).
 
 **Base files (Parquet):** Columnar projection reads only the blob ref struct 
columns. Cost per base
 file: one Parquet column chunk read. Typical size: 100 bytes/record * 500K 
records = ~50MB per
 slice for the blob ref column.
 
-**Log files (MOR):** Log files are not columnar. Reading blob refs requires 
reading full log records
-and extracting the blob ref field. Cost per log file: proportional to log file 
size (full scan),
-not just the blob ref column. For a 100MB log file, the entire 100MB is read 
even though only ~50MB
-is blob ref data. This is 2x the cost of a base file projection.
+**Log files (MOR, expired slices only):** Log files are not columnar. Reading 
blob refs requires
+reading full log records and extracting the blob ref field. Cost per log file: 
proportional to log
+file size (full scan), not just the blob ref column. For a 100MB log file, the 
entire 100MB is read
+even though only ~50MB is blob ref data. This is 2x the cost of a base file 
projection. However,
+log reads are only needed for expired slices -- retained slices skip log reads 
entirely.
 
 | Parameter                | Base file (Parquet) | Log file (MOR)       |
 |--------------------------|---------------------|----------------------|
@@ -1727,13 +1763,14 @@ is blob ref data. This is 2x the cost of a base file 
projection.
 | Parallelizable           | Yes                 | Yes                  |
 | Records per slice        | ~500K               | ~500K (worst case)   |
 | Blob ref size per record | ~100 bytes          | ~100 bytes           |
+| Used for                 | Expired + retained  | Expired only         |
 
 **Total cost per FG:**
 
-| Table type | Retained slices | Expired slices | Reads per FG    | Data per 
FG |
-|------------|-----------------|----------------|-----------------|-------------|
-| COW        | 3-5 base        | 1-3 base       | 4-8             | 200-400MB  
 |
-| MOR        | 3-5 (base+log)  | 1-3 (base+log) | 4-8 base + logs | 200MB-1GB  
 |
+| Table type | Retained slices      | Expired slices | Reads per FG        | 
Data per FG |
+|------------|----------------------|----------------|---------------------|-------------|
+| COW        | 3-5 base             | 1-3 base       | 4-8 base            | 
200-400MB   |
+| MOR        | 3-5 base (logs skipped) | 1-3 (base+log) | 3-5 base + 1-3 base+log | 
150MB-600MB |
 
 **Memory budget analysis (addresses finding 3.9):**
 
@@ -2006,11 +2043,29 @@ function isHudiCreatedBlob(blobPath, blobPrefix):
     return blobPath.startsWith(blobPrefix)
 
 
+function extractManagedOutOfLineBlobRefsFromBaseFile(slice):
+    """
+    Extract managed, out-of-line blob refs from a file slice's base file only.
+    Uses columnar projection on the blob ref struct columns.
+    Used for retained slices (cleaning is fenced on compaction, so base files
+    contain the merged state).
+    Returns Stream<BlobRef>.
+    """
+    refs = Stream.empty()
+    if slice.getBaseFile().isPresent():
+        refs = projectBlobRefColumnsFromParquet(slice.getBaseFile().get())
+            .filter(r -> r.type == OUT_OF_LINE && r.managed == true)
+            .map(r -> BlobRef(r.path, r.offset, r.length))
+    return refs
+
+
 function extractManagedOutOfLineBlobRefs(slice):
     """
-    Extract managed, out-of-line blob refs from a file slice.
+    Extract managed, out-of-line blob refs from a file slice (base + log 
files).
     For base files: columnar projection on the blob ref struct columns.
     For log files: full record read with field extraction.
+    Used for expired slices (log files must be read to find blob refs 
introduced
+    and superseded within the log chain before compaction).
     Returns Stream<BlobRef>.
     """
     refs = Stream.empty()
@@ -2021,7 +2076,7 @@ function extractManagedOutOfLineBlobRefs(slice):
             .filter(r -> r.type == OUT_OF_LINE && r.managed == true)
             .map(r -> BlobRef(r.path, r.offset, r.length)))
 
-    // Log files (MOR only)
+    // Log files (MOR only -- required for expired slices to avoid permanent 
orphans)
     for logFile in slice.getLogFiles():
         refs = concat(refs, extractBlobRefFieldFromLogFile(logFile)
             .filter(r -> r.type == OUT_OF_LINE && r.managed == true)

Reply via email to