voonhous commented on code in PR #18359:
URL: https://github.com/apache/hudi/pull/18359#discussion_r3008717565


##########
rfc/rfc-100/rfc-100-blob-cleaner-problem.md:
##########
@@ -0,0 +1,677 @@
+# Blob Cleaner: Problem Statement
+
+## 1. Goal
+
+When old file slices are cleaned, out-of-line blob files they reference may become orphaned -- still
+consuming storage but unreachable by any query. The blob cleaner must identify and delete these
+unreferenced blob files without premature deletion (deleting a blob that is still referenced by a live
+record). This document defines the problem scope, design constraints, requirements, and illustrative
+failure modes. It contains no solution content.
+
+---
+
+## 2. Scope
+
+### In scope
+
+- Cleanup of **out-of-line blob files** when references to them exist only in expired (cleaned) file
+  slices.
+- All table types: **COW** and **MOR**.
+- All cleaning policies: `KEEP_LATEST_COMMITS`, `KEEP_LATEST_FILE_VERSIONS`,
+  `KEEP_LATEST_BY_HOURS`.
+- Interaction with table services: **compaction**, **clustering**, **blob compaction**.
+- Interaction with timeline operations: **savepoints**, **rollback**, **archival**.
+- Single-writer and multi-writer (OCC) concurrency modes.
+- Both **Hudi-created blobs** (stored under `{table}/.hoodie/blobs/...`) and **user-provided
+  external blobs** (arbitrary paths).
+
+### Two entry flows
+
+Blob cleanup must support two distinct entry flows. These are not edge cases of each other --
+they are co-equal paths with different properties, different volumes, and different cleanup costs.
+
+**Flow 1: Path-dispatched (Hudi-created blobs).** Blobs created by Hudi's write path and stored
+under `{table}/.hoodie/blobs/{partition}/{col}/{instant}/{blob_id}`. The path structure guarantees
+uniqueness (C11) and file-group scoping, and eliminates cross-FG sharing for normal writes. This is
+the expected majority flow for Phase 3 workloads.
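
The path layout above can be sketched in code. This is a minimal illustration, not a Hudi API: the class and method names are hypothetical, and only the directory layout is taken from this document.

```java
// Hypothetical sketch: constructing a Hudi-managed blob path per the layout
// described above. Names are illustrative, not actual Hudi classes.
public final class BlobPathSketch {

    // Layout from the doc: {table}/.hoodie/blobs/{partition}/{col}/{instant}/{blob_id}
    public static String blobPath(String tableBasePath, String partition,
                                  String column, String instant, String blobId) {
        return String.join("/", tableBasePath, ".hoodie/blobs",
                partition, column, instant, blobId);
    }

    public static void main(String[] args) {
        // Because the instant is part of the path, every write yields a distinct
        // path -- which is what structurally rules out delete-and-re-add (C2).
        String p1 = blobPath("s3://bkt/tbl", "2024/01/01", "video", "20240101120000", "b1");
        String p2 = blobPath("s3://bkt/tbl", "2024/01/01", "video", "20240101130000", "b1");
        System.out.println(p1.equals(p2)); // prints "false"
    }
}
```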
+
+**Flow 2: Non-path-dispatched (user-provided external blobs).** Users have existing blob files in
+external storage (e.g., `s3://media-bucket/videos/`, a shared NFS mount, or any user-controlled
+path). Records reference these blobs directly by path. The user does **not** want to bootstrap --
+they do not want Hudi to copy, move, or reorganize the blob files into `.hoodie/blobs/`. Hudi
+manages the *references*, not the *storage layout*. This is the expected primary flow for Phase 1
+workloads and remains a supported flow in Phase 3.
+
+The non-path-dispatched flow has fundamentally different properties:
+
+| Property                  | Path-dispatched (Hudi-created)    | Non-path-dispatched (external)       |
+|---------------------------|-----------------------------------|--------------------------------------|
+| Path uniqueness           | Guaranteed (instant in path, C11) | Not guaranteed (user controls)       |
+| Cross-FG sharing          | Does not occur (FG-scoped)        | Common (multiple records, same blob) |
+| Writer/cleaner race       | Cannot occur (D2)                 | Can occur (D3)                       |
+| Delete-and-re-add (C2)    | Eliminated                        | Real concern                         |
+| Volume                    | Scales with writes                | Can be large from day one            |
+| Per-FG cleanup sufficient | Yes                               | No -- cross-FG verification needed   |
+
+Any solution that treats the non-path-dispatched flow as a rare edge case will fail at scale for
+Phase 1 workloads. The cleanup algorithm must be efficient for **both** flows independently, and
+must not impose the cost structure of one flow on the other.
+
+### Out of scope
+
+- **Inline blobs.** Inline blob data lives inside the base/log file and is deleted when the file
+  slice is cleaned. No additional cleanup is needed.
+- **Blob compaction internals.** Blob compaction (repacking partially-live container files) is a
+  separate service. This document defines the interface point (when to hand off to blob compaction)
+  but not its internal design.
+- **Schema evolution.** Adding or removing blob columns does not change the cleanup problem.
+
+### Stance on the `managed` flag
+
+The BlobReference schema includes a `managed` boolean field
+(`HoodieSchema.Blob.EXTERNAL_REFERENCE_IS_MANAGED`). The RFC states that only managed blobs are
+cleaned. This document acknowledges the flag and treats it as a **filter** -- unmanaged blobs are
+excluded from cleanup consideration. However, the cleanup design must be **correct regardless of the
+flag's value**. The flag selects *which* blobs enter the cleanup pipeline; it must not be used as a
+correctness lever within the pipeline itself. The flag may later serve as an optimization (skip
+cleanup work for unmanaged blobs), but the problem statement and any solution must not depend on it
+for safety.
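
The filter role described above can be sketched as a simple predicate. This is a hypothetical illustration: `BlobRef` is a stand-in for Hudi's `BlobReference`, and the method names are invented for this sketch.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: the `managed` flag used purely as an entry filter.
// It decides which blobs enter the pipeline; it carries no correctness
// weight inside the pipeline itself.
public final class ManagedFilterSketch {

    // Illustrative stand-in for Hudi's BlobReference (not the real schema type).
    public record BlobRef(String path, boolean managed) {}

    public static List<BlobRef> cleanupCandidates(List<BlobRef> refs) {
        return refs.stream()
                .filter(BlobRef::managed)
                .collect(Collectors.toList());
    }
}
```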
+
+---
+
+## 3. Background: Existing Cleaner
+
+The existing Hudi cleaner provides the execution framework that blob cleanup must integrate with.
+
+### Plan-execute model
+
+Cleaning is a two-phase operation:
+
+1. **Plan** (`CleanPlanner`): For each partition and file group, determine which file slices are
+   expired based on the cleaning policy. Produce a `HoodieCleanerPlan` listing file paths to delete.
+2. **Execute** (`CleanActionExecutor`): Delete the files listed in the plan. Record results in
+   `HoodieCleanMetadata` on the timeline.
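
The two-phase split above can be sketched as follows. This is a simplified illustration, not Hudi's implementation: the types are stand-ins for `HoodieCleanerPlan` / `HoodieCleanMetadata`, and the `Storage` interface is invented for the sketch.

```java
import java.util.List;

// Hypothetical sketch of the plan-execute model: planning is a pure
// decision step; execution performs deletes and records the outcome.
public final class PlanExecuteSketch {

    public record CleanerPlan(List<String> pathsToDelete) {}   // stand-in for HoodieCleanerPlan
    public record CleanMetadata(int deletedCount) {}           // stand-in for HoodieCleanMetadata

    // Invented storage abstraction; returns true if the delete succeeded.
    interface Storage { boolean delete(String path); }

    // Phase 1: decide *what* to delete (no side effects).
    public static CleanerPlan plan(List<String> expiredFilePaths) {
        return new CleanerPlan(List.copyOf(expiredFilePaths));
    }

    // Phase 2: delete the planned files and record results.
    public static CleanMetadata execute(CleanerPlan plan, Storage storage) {
        int deleted = 0;
        for (String path : plan.pathsToDelete()) {
            if (storage.delete(path)) {
                deleted++;
            }
        }
        return new CleanMetadata(deleted);
    }
}
```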
+
+### Per-partition, per-file-group iteration
+
+`CleanPlanner.getDeletePaths(partitionPath, earliestCommitToRetain)` iterates file groups within a
+partition. For each file group, it compares file slices against the retention policy and produces a
+list of `CleanFileInfo` objects (file paths to delete). The cleaner has no concept of cross-file-group
+dependencies.
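
The shape of that iteration can be sketched as below. This is a deliberately simplified illustration under assumed types (`FileSlice` here is a toy record, and the retention check is reduced to a commit-time cutoff); it is not the real `CleanPlanner` logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of per-partition, per-file-group planning. Each file
// group is evaluated in isolation -- there is deliberately no cross-file-group
// state here, which is exactly the gap blob cleanup must fill.
public final class PerFileGroupSketch {

    public record FileSlice(String path, long commitTime) {}   // toy stand-in

    public static List<String> getDeletePaths(Map<String, List<FileSlice>> fileGroups,
                                              long earliestCommitToRetain) {
        List<String> deletePaths = new ArrayList<>();
        for (List<FileSlice> slices : fileGroups.values()) {
            for (FileSlice slice : slices) {
                // Retention reduced to a cutoff for illustration.
                if (slice.commitTime() < earliestCommitToRetain) {
                    deletePaths.add(slice.path());
                }
            }
        }
        return deletePaths;
    }
}
```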
+
+### Savepoint awareness
+
+The cleaner collects all savepointed timestamps and their associated data files. File slices that
+overlap with savepointed files are excluded from cleaning
+(`isFileSliceExistInSavepointedFiles`). This preserves the savepoint invariant: a savepoint freezes a
+consistent snapshot including all data files it references.
+### OCC conflict resolution
+
+`SimpleConcurrentFileWritesConflictResolutionStrategy` resolves write-write conflicts at the
+`(partition, fileId)` granularity. There is no global serialization point. Concurrent writers to
+different file groups proceed without contention.
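
The `(partition, fileId)` granularity can be illustrated as an intersection test over write units. This is a sketch in the spirit of the strategy named above, not its actual implementation; `WriteUnit` and `hasConflict` are invented names.

```java
import java.util.Set;

// Hypothetical sketch of OCC conflict detection at (partition, fileId)
// granularity: two commits conflict iff they touched the same file group.
public final class OccConflictSketch {

    // Record value-equality gives us (partition, fileId) comparison for free.
    public record WriteUnit(String partition, String fileId) {}

    public static boolean hasConflict(Set<WriteUnit> a, Set<WriteUnit> b) {
        // Writers on disjoint file groups proceed without contention.
        return a.stream().anyMatch(b::contains);
    }
}
```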
+
+### Critical gap
+
+The existing cleaner operates on file paths (base files + log files) within a single file group. It
+has **no concept of transitive references** -- it does not know that a file slice contains pointers
+to external blob files that may need separate cleanup. Blob cleanup requires extending the cleaner
+to follow these references and determine blob-level liveness.
+
+---
+
+## 4. Design Constraints
+
+Each constraint is a fact about the Hudi system that any blob cleanup solution must respect. Violating
+any constraint leads to data corruption, premature deletion, or permanent orphans.
+
+### C1: Blob immutability
+
+Once a blob file is written, its content never changes. Blob files are append-once, read-many. This
+means a blob file's identity is stable for its entire lifetime.
+
+*Source: RFC-100 blob cleaner design, general storage semantics.*
+
+### C2: Delete-and-re-add same path
+
+A blob file can be deleted from storage and a new file created at the same 
path with different
+content. This is a real concern for user-provided external blobs (the user 
controls the path). For
+Hudi-created blobs, it is structurally eliminated by C11 (instant in path 
guarantees uniqueness).
+
+*Source: RFC-100 blob cleaner design; alternatives analysis constraint C2.*
+
+### C3: Cross-file-group blob sharing
+
+An out-of-line blob can be referenced by records in multiple file groups and multiple partitions. This
+is explicitly supported for user-provided external blobs: two records in different file groups can
+point to the same external file. For Hudi-created blobs, cross-FG sharing does not occur because the
+blob is created within a specific file group's storage scope (see C11). However, after clustering
+(C8), references to the same Hudi-created blob could temporarily exist in both the source and target
+file groups until the source is cleaned.
+
+*Source: RFC-100 lines 196-198 (Option 1 scans all active file slices); alternatives analysis F6.*
+
+### C4: Container files
+
+Multiple blobs can be packed into a single container file, distinguished by `(offset, length)` within

Review Comment:
   IIUC, our focus now is external blobs.
   
   This might be out of scope for this RFC since we're only focusing on external blobs. I added this because of this:
   
   https://github.com/apache/hudi/blob/master/rfc/rfc-100/rfc-100.md?plain=1#L207-L208


