vinothchandar commented on code in PR #18359:
URL: https://github.com/apache/hudi/pull/18359#discussion_r3023968189


##########
rfc/rfc-100/rfc-100-blob-cleaner-problem.md:
##########
@@ -0,0 +1,677 @@
+# Blob Cleaner: Problem Statement
+
+## 1. Goal
+
+When old file slices are cleaned, out-of-line blob files they reference may become orphaned -- still
+consuming storage but unreachable by any query. The blob cleaner must identify and delete these
+unreferenced blob files without premature deletion (deleting a blob that is still referenced by a live
+record). This document defines the problem scope, design constraints, requirements, and illustrative
+failure modes. It contains no solution content.
+
+---
+
+## 2. Scope
+
+### In scope
+
+- Cleanup of **out-of-line blob files** when references to them exist only in expired (cleaned) file
+  slices.
+- All table types: **COW** and **MOR**.
+- All cleaning policies: `KEEP_LATEST_COMMITS`, `KEEP_LATEST_FILE_VERSIONS`,
+  `KEEP_LATEST_BY_HOURS`.
+- Interaction with table services: **compaction**, **clustering**, **blob compaction**.
+- Interaction with timeline operations: **savepoints**, **rollback**, **archival**.
+- Single-writer and multi-writer (OCC) concurrency modes.
+- Both **Hudi-created blobs** (stored under `{table}/.hoodie/blobs/...`) and **user-provided
+  external blobs** (arbitrary paths).
+
+### Two entry flows
+
+Blob cleanup must support two distinct entry flows. These are not edge cases of each other --
+they are co-equal paths with different properties, different volumes, and different cleanup costs.
+
+**Flow 1: Path-dispatched (Hudi-created blobs).** Blobs created by Hudi's write path and stored
+under `{table}/.hoodie/blobs/{partition}/{col}/{instant}/{blob_id}`. The path structure guarantees

Review Comment:
   To me, this is neither an out-of-line blob managed externally by the user nor an inline blob stored directly within the base or log files, right?

   Are we assuming that blobs will be stored separately within the Hudi table path like this? I don't think we had alignment on this. To define anything like this, we should also not talk about other table services concretely. I'm going to review this with a narrow scope: external blobs and reference cleaning.


