voonhous commented on code in PR #18359:
URL: https://github.com/apache/hudi/pull/18359#discussion_r3008797174


##########
rfc/rfc-100/rfc-100-blob-cleaner-design.md:
##########
@@ -0,0 +1,777 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+# RFC-100 Part 2: Blob Cleanup for Unstructured Data
+
+## Proposers
+
+- @voon
+
+## Approvers
+
+- (TBD)
+
+## Status
+
+Issue: <Link to GH feature issue>
+
+> Please keep the status updated in `rfc/README.md`.
+
+---
+
+## Abstract
+
+When Hudi cleans expired file slices, out-of-line blob files they reference may become orphaned --
+still consuming storage but unreachable by any query. This RFC extends the existing file slice
+cleaner to identify and delete these orphaned blob files safely and efficiently. The design uses a
+three-stage pipeline: (1) per-file-group set-difference to find locally-orphaned blobs, (2) an MDT
+secondary index lookup for cross-file-group verification of externally-referenced blobs, and (3)
+container file lifecycle resolution. For Hudi-created blobs, cleanup is essentially free -- structural
+path uniqueness eliminates cross-file-group concerns entirely. For user-provided external blobs,
+targeted index lookups scale with the number of candidates, not the table size. Tables without blob
+columns pay zero cost.
+
+---
+
+## Background
+
+### Why Blob Cleanup Is Needed
+
+RFC-100 introduces out-of-line blob storage for unstructured data (images, video, documents). A
+record's `BlobReference` field points to an external blob file by `(path, offset, length)`. When
+the cleaner expires old file slices, the blob files they reference may no longer be needed -- but the
+existing cleaner has no concept of transitive references. It deletes file slices without considering
+the blob files they point to. Without blob cleanup, orphaned blobs accumulate indefinitely.
+
+### Two Blob Flows
+
+Blob cleanup must support two distinct entry flows with fundamentally different properties:
+
+**Flow 1 -- Hudi-created blobs.** Blobs created by Hudi's write path, stored at
+`{table}/.hoodie/blobs/{partition}/{col}/{instant}/{blob_id}`. The commit instant in the path
+guarantees uniqueness (C11), and blobs are scoped to a single file group (P3). Cross-file-group
+sharing does not occur. This is the expected majority flow for Phase 3 workloads.
+
+**Flow 2 -- User-provided external blobs.** Users have existing blob files in external storage
+(e.g., `s3://media-bucket/videos/`). Records reference these blobs directly by path. Hudi manages
+the *references*, not the *storage layout*. Cross-file-group sharing is common -- multiple records
+across different file groups can point to the same blob. This is the expected primary flow for
+Phase 1 workloads.
+
+| Property                  | Flow 1 (Hudi-created)             | Flow 2 (External)                    |
+|---------------------------|-----------------------------------|--------------------------------------|
+| Path uniqueness           | Guaranteed (instant in path, C11) | Not guaranteed (user controls)       |
+| Cross-FG sharing          | Does not occur (FG-scoped)        | Common (multiple records, same blob) |
+| Writer/cleaner race       | Cannot occur (D2)                 | Can occur (D3)                       |
+| Per-FG cleanup sufficient | Yes                               | No -- cross-FG verification needed   |
+
+### Constraints and Requirements Reference
+
+Full descriptions and failure modes are in [Appendix B](rfc-100-blob-cleaner-problem.md).
+
+| ID  | Constraint                                      | Flow 1 | Flow 2 | Remarks                      |
+|-----|-------------------------------------------------|--------|--------|------------------------------|
+| C1  | Blob immutability (append-once, read-many)      | Y      | Y      |                              |
+| C2  | Delete-and-re-add same path                     | --     | Y      | Eliminated for Flow 1 by C11 |
+| C3  | Cross-file-group blob sharing                   | --     | Y      | Common for external blobs    |
+| C4  | Container files (`(offset, length)` ranges)     | Y      | Y      |                              |
+| C5  | MOR log updates shadow base file blob refs      | Y      | Y      |                              |
+| C6  | Existing cleaner is per-file-group scoped       | Y      | Y      |                              |
+| C7  | OCC is per-file-group                           | Y      | Y      | No global contention allowed |
+| C8  | Clustering moves blob refs between file groups  | Y      | Y      |                              |
+| C9  | Savepoints freeze file slices and blob refs     | Y      | Y      |                              |
+| C10 | Rollback can invalidate or resurrect references | Y      | Y      |                              |
+| C11 | Blob paths include commit instant               | Y      | --     | Eliminates C2, C3, C13       |
+| C12 | Archival removes commit metadata                | Y      | Y      |                              |
+| C13 | Cross-FG verification needed at scale           | --     | Y      |                              |
+
+| ID  | Requirement                                                      |
+|-----|------------------------------------------------------------------|
+| R1  | No premature deletion (hard invariant)                           |
+| R2  | No permanent orphans (bounded cleanup)                           |
+| R3  | Container awareness (range-level liveness)                       |
+| R4  | MOR correctness (over-retention acceptable, under-retention not) |
+| R5  | Concurrency safety (no global serialization)                     |
+| R6  | Scale proportional to work, not table size                       |
+| R7  | No cost for non-blob tables                                      |
+| R8  | All cleaning policies supported                                  |
+| R9  | Crash safety and idempotency                                     |
+| R10 | Observability (metrics for deleted, retained, reclaimed)         |
+
+---
+
+## Design Overview
+
+### Design Philosophy
+
+Blob cleanup extends the existing `CleanPlanner` / `CleanActionExecutor` pipeline -- same timeline
+instant, same plan-execute-complete lifecycle, same crash recovery and OCC integration. A
+`hasBlobColumns()` check gates all blob logic so non-blob tables pay zero cost.
+
+The two flows have different cost structures, and the design keeps them separate. Flow 1
+(Hudi-created blobs) gets per-FG cleanup with no cross-FG overhead. Flow 2 (external blobs) gets
+targeted cross-FG verification via MDT secondary index. Dispatch is a string prefix check on the
+blob path.
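To make the dispatch concrete, here is a minimal sketch (Python for brevity; the actual implementation would live in the Java cleaner, and the constant `HUDI_BLOB_PREFIX` and function name `classify_blob` are illustrative, not part of the proposal):

```python
# Illustrative sketch of the path-prefix dispatch between Flow 1 and Flow 2.
# Hudi-created blobs live under the table's .hoodie/blobs/ directory (C11);
# any other path is treated as a user-provided external blob.

HUDI_BLOB_PREFIX = ".hoodie/blobs/"  # assumed marker, relative to the table base path

def classify_blob(table_base_path: str, blob_path: str) -> str:
    """Return 'hudi' for Flow 1 (Hudi-created) or 'external' for Flow 2."""
    hudi_prefix = table_base_path.rstrip("/") + "/" + HUDI_BLOB_PREFIX
    return "hudi" if blob_path.startswith(hudi_prefix) else "external"

base = "s3://warehouse/tbl"
print(classify_blob(base, "s3://warehouse/tbl/.hoodie/blobs/p1/img/0005/b1"))  # hudi
print(classify_blob(base, "s3://media-bucket/videos/v1.mp4"))                  # external
```

Because the check is a single string comparison per candidate, classification adds no I/O and no index reads.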
+
+### Three-Stage Pipeline
+
+| Stage       | Scope                | Purpose                                                                          | When it runs                           |
+|-------------|----------------------|----------------------------------------------------------------------------------|----------------------------------------|
+| **Stage 1** | Per-file-group       | Collect expired/retained blob refs, compute set difference, dispatch by category | Always (for blob tables)               |
+| **Stage 2** | Cross-file-group     | Verify external blob candidates against MDT secondary index or fallback scan     | Only when external candidates exist    |
+| **Stage 3** | Container resolution | Determine delete vs. flag-for-compaction at the container level                  | Only when container blobs are involved |
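Stage 1's set-difference step can be sketched as follows (an illustrative Python sketch, not the proposed Java code; the function and variable names are hypothetical, but the `(path, offset, length)` tuple shape follows the blob-identity decision below):

```python
# Illustrative Stage 1: blob refs held only by expired slices are candidates;
# anything still referenced by a retained slice survives. Blob identity is the
# full (path, offset, length) tuple so container ranges (C4) and path reuse
# (C2) are handled correctly.

def stage1_set_difference(expired_refs, retained_refs, is_hudi_blob):
    candidates = set(expired_refs) - set(retained_refs)   # locally orphaned
    hudi_blob_deletes = {r for r in candidates if is_hudi_blob(r[0])}
    external_candidates = candidates - hudi_blob_deletes  # need Stage 2 checks
    return hudi_blob_deletes, external_candidates

expired = {("t/.hoodie/blobs/a", 0, 10), ("s3://ext/v1", 0, 99), ("s3://ext/v2", 0, 5)}
retained = {("s3://ext/v2", 0, 5)}  # still referenced by a retained slice
deletes, ext = stage1_set_difference(expired, retained, lambda p: ".hoodie/blobs/" in p)
# deletes holds the Hudi-created orphan; ext holds the external candidate for Stage 2
```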
+
+### Independent Implementability
+
+The three stages have clean input/output interfaces and can be implemented, tested, and shipped
+independently:
+
+| Stage   | Input                                                   | Output                                              |
+|---------|---------------------------------------------------------|-----------------------------------------------------|
+| Stage 1 | `FileGroupCleanResult` (expired + retained slices)      | `hudi_blob_deletes`, `external_candidates`          |
+| Stage 2 | `external_candidates`, `cleaned_fg_ids`                 | `external_deletes`                                  |
+| Stage 3 | `hudi_blob_deletes` + `external_deletes`, retained refs | `blob_files_to_delete`, `containers_for_compaction` |
+
+A shared foundation layer must land first (see [Rollout / Adoption Plan](#rollout--adoption-plan)),
+after which stages can proceed in any order.
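Stage 3's container resolution (delete a blob file only when every range in it is dead; otherwise flag it for blob compaction) could look roughly like this. A Python sketch with hypothetical names, assuming the dead/retained ref sets from the table above:

```python
from collections import defaultdict

def stage3_resolve(dead_refs, retained_refs):
    """dead_refs / retained_refs are sets of (path, offset, length) tuples.
    A container file is deletable only if no retained range points into it;
    otherwise it still holds live data and is flagged for blob compaction (R3)."""
    live_paths = {path for (path, _, _) in retained_refs}
    dead_by_path = defaultdict(set)
    for path, off, length in dead_refs:
        dead_by_path[path].add((off, length))
    blob_files_to_delete = sorted(p for p in dead_by_path if p not in live_paths)
    containers_for_compaction = sorted(p for p in dead_by_path if p in live_paths)
    return blob_files_to_delete, containers_for_compaction

dead = {("c1", 0, 10), ("c1", 10, 10), ("c2", 0, 10)}
alive = {("c2", 10, 5)}  # one live range keeps container c2 alive
print(stage3_resolve(dead, alive))  # c1 deletable, c2 flagged for compaction
```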
+
+### Key Decisions
+
+| Decision            | Choice                                                  | Rationale                                                          |
+|---------------------|---------------------------------------------------------|--------------------------------------------------------------------|
+| Blob identity       | `(path, offset, length)` tuple                          | Handles containers (C4) and path reuse (C2) correctly              |
+| Cleanup scope       | Per-FG (Hudi blobs) + MDT index lookup (external blobs) | Aligns with OCC (C7) and existing cleaner (C6); scales for C13     |
+| Dispatch mechanism  | Path prefix check on blob path                          | Zero-cost classification; Hudi blobs match `.hoodie/blobs/` prefix |
+| Cross-FG mechanism  | MDT secondary index on `reference.external_path`        | Short-circuits on first non-cleaned FG ref; first-class for Flow 2 |
+| Write-path overhead | None (Flow 1); MDT index maintenance (Flow 2)           | Index maintained by existing MDT pipeline, not a new write cost    |
+| MOR strategy        | Over-retain (union of base + log refs)                  | Safe (C5, R4); cleaned after compaction                            |
+| Container strategy  | Tuple-level tracking; delete only when all ranges dead  | Correct (C4, R3); partial containers flagged for blob compaction   |
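The cross-FG short-circuit can be sketched as follows (illustrative Python; `lookup_referencing_fgs` is a hypothetical stand-in for the MDT secondary-index read on `reference.external_path`):

```python
def verify_external_candidates(external_candidates, cleaned_fg_ids, lookup_referencing_fgs):
    """Keep a candidate (path, offset, length) only if every file group that
    references its path was cleaned this cycle. all() over a generator stops
    at the first non-cleaned referencer, so cost tracks candidates, not table
    size; a path with no remaining referencers is trivially deletable."""
    external_deletes = set()
    for ref in external_candidates:
        path = ref[0]
        if all(fg in cleaned_fg_ids for fg in lookup_referencing_fgs(path)):
            external_deletes.add(ref)
    return external_deletes

index = {"p1": ["fg1"], "p2": ["fg1", "fg2"]}  # path -> referencing file groups
out = verify_external_candidates({("p1", 0, 1), ("p2", 0, 1)}, {"fg1"}, lambda p: index[p])
# p1 is safe to delete; p2 is retained because fg2 was not cleaned this cycle
```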
+
+```mermaid
+flowchart LR
+    subgraph Planning["CleanPlanActionExecutor.requestClean()"]
+        direction TB
+        Gate{"hasBlobColumns()?"}
+        Gate -- No --> Skip["Skip blob cleanup<br/>(zero cost)"]
+        Gate -- Yes --> CP
+
+        subgraph CP["CleanPlanner (per-partition, per-FG)"]
+            direction TB
+            Policy["Policy method<br/>→ FileGroupCleanResult<br/>(expired + retained slices)"]
+            S1["<b>Stage 1</b><br/>Per-FG blob ref<br/>set difference + dispatch"]
+            Policy --> S1
+        end
+
+        S1 --> S2["<b>Stage 2</b><br/>Cross-FG verification<br/>(MDT secondary index)"]
+        S1 -->|hudi_blob_deletes| S3
+        S2 -->|external_deletes| S3["<b>Stage 3</b><br/>Container lifecycle<br/>resolution"]
+    end
+
+    subgraph Plan["HoodieCleanerPlan"]
+        FP["filePathsToBeDeleted<br/>(existing)"]
+        BP["blobFilesToDelete<br/>(new)"]

Review Comment:
   Yes... 
   
   In this design, the plan IS the isolation boundary. All blob delete 
decisions must be made at plan time against the same snapshot used for file 
slice decisions. 
   
   Keeping blob deletes out of the plan makes it impossible to add 
writer-cleaner conflict resolution for external blobs (Scenario B from your 
problem doc). That's a correctness hole I don't think I can close later without 
putting the information on the timeline.
   
   `blobFilesToDelete` growing too large is a real concern. If it turns out to be one in practice, maybe we can introduce a config to cap the number of blobs deleted in each iteration?
   
   A config like `hoodie.cleaner.blob.max.deletes.per.cycle` would bound how many blob deletes go into the plan per iteration, deferring the rest to the next cycle. The trade-off is that a single logical clean is now broken up into multiple clean operations, which might fundamentally change timeline assumptions.
   
   Will add an example into `rfc-100-*-problems` under **Example 9** to explain this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
