This is an automated email from the ASF dual-hosted git repository.

vhs pushed a commit to branch rfc-blob-cleaner in repository https://gitbox.apache.org/repos/asf/hudi.git
commit 574df5bc244e94d2a1f0055537836f9486193ff4
Author: voon <[email protected]>
AuthorDate: Fri Mar 20 16:34:29 2026 +0800

    Update rollout plan
---
 rfc/rfc-100/rfc-100-blob-cleaner-design.md | 78 +++++++++++++++++++-----------
 1 file changed, 50 insertions(+), 28 deletions(-)

diff --git a/rfc/rfc-100/rfc-100-blob-cleaner-design.md b/rfc/rfc-100/rfc-100-blob-cleaner-design.md
index a7f80e14bcad..e94d3ad4eafb 100644
--- a/rfc/rfc-100/rfc-100-blob-cleaner-design.md
+++ b/rfc/rfc-100/rfc-100-blob-cleaner-design.md
@@ -142,6 +142,20 @@ MDT secondary index. The dispatch mechanism is a zero-cost string prefix check o
 | **Stage 2** | Cross-file-group | Verify external blob candidates against MDT secondary index or fallback scan | Only when external candidates exist |
 | **Stage 3** | Container resolution | Determine delete vs. flag-for-compaction at the container level | Only when container blobs are involved |

+### Independent Implementability
+
+The three stages have clean input/output interfaces and can be implemented, tested, and shipped
+independently:
+
+| Stage | Input | Output |
+|---------|---------------------------------------------------------|-----------------------------------------------------|
+| Stage 1 | `FileGroupCleanResult` (expired + retained slices) | `hudi_blob_deletes`, `external_candidates` |
+| Stage 2 | `external_candidates`, `cleaned_fg_ids` | `external_deletes` |
+| Stage 3 | `hudi_blob_deletes` + `external_deletes`, retained refs | `blob_files_to_delete`, `containers_for_compaction` |
+
+A shared foundation layer must land first (see [Rollout / Adoption Plan](#rollout--adoption-plan)), after which stages
+can proceed in any order.
+
 ### Key Decisions

 | Decision | Choice | Rationale |
@@ -337,11 +351,11 @@ for path in candidate_paths:
 resolution with per-candidate short-circuit. Steps 1 and 2 are each a single I/O pass; step 3 is pure in-memory hash set lookups (~0ms).
-| Step | I/O | Cost |
-|-----------------------------|-----------------------------------------|--------------------------|
-| 1. Prefix scan (batched) | Single MDT call for N candidate paths | ~2-5s for 2K candidates |
-| 2. Record index (batched) | Single sorted HFile forward-scan | ~1-2s for 6K record keys |
-| 3. In-memory resolution | Hash set checks (cleaned_fg_ids) | ~0ms |
+| Step | I/O | Cost |
+|---------------------------|---------------------------------------|--------------------------|
+| 1. Prefix scan (batched) | Single MDT call for N candidate paths | ~2-5s for 2K candidates |
+| 2. Record index (batched) | Single sorted HFile forward-scan | ~1-2s for 6K record keys |
+| 3. In-memory resolution | Hash set checks (cleaned_fg_ids) | ~0ms |

 **Index definition.** Uses the existing `HoodieIndexDefinition` mechanism with `sourceFields = ["<blob_col>", "reference", "external_path"]`. The nested field path is supported
@@ -667,20 +681,20 @@ cleaning:

 ### Back-of-Envelope: Example 7 (50K FGs, 2K External Candidates)

-| Parameter | Value | Notes |
-|-----------------------------------------|-----------|-------------------------------------------------|
-| FGs cleaned this cycle | 500 | 1% of table |
-| Stage 1: reads per FG | ~6 | 3 retained + 3 expired slices |
-| Stage 1: total reads | 3,000 | Parallelized across executors, ~20s |
-| External blob candidates | 2,000 | Locally orphaned in cleaned FGs |
-| Avg refs per candidate | 3 | Typical: video in a few playlists |
-| Total record keys | 6,000 | 2,000 * 3 |
-| **Stage 2 cost** | | |
-| Step 1: batched prefix scan | 1 call | Returns 6K record keys, ~2-5s |
-| Step 2: batched record index lookup | 1 call | 6K keys sorted, single HFile scan, ~1-2s |
-| Step 3: in-memory resolution | 6K checks | Hash set lookups against cleaned_fg_ids, ~0ms |
-| **Total Stage 2** | **~3-7s** | |
-| Comparison: naive full-table scan | 12.5TB | 50K FGs * 5 slices * 50MB = prohibitive |
+| Parameter | Value | Notes |
+|-------------------------------------|-----------|-----------------------------------------------|
+| FGs cleaned this cycle | 500 | 1% of table |
+| Stage 1: reads per FG | ~6 | 3 retained + 3 expired slices |
+| Stage 1: total reads | 3,000 | Parallelized across executors, ~20s |
+| External blob candidates | 2,000 | Locally orphaned in cleaned FGs |
+| Avg refs per candidate | 3 | Typical: video in a few playlists |
+| Total record keys | 6,000 | 2,000 * 3 |
+| **Stage 2 cost** | | |
+| Step 1: batched prefix scan | 1 call | Returns 6K record keys, ~2-5s |
+| Step 2: batched record index lookup | 1 call | 6K keys sorted, single HFile scan, ~1-2s |
+| Step 3: in-memory resolution | 6K checks | Hash set lookups against cleaned_fg_ids, ~0ms |
+| **Total Stage 2** | **~3-7s** | |
+| Comparison: naive full-table scan | 12.5TB | 50K FGs * 5 slices * 50MB = prohibitive |

 ### Memory Budget

@@ -709,18 +723,26 @@ Sections 10.1-10.3.

 ## Rollout / Adoption Plan

-### Phase 1: Flow 1 Only (Hudi-Created Blobs)
+Each stage can be implemented, tested, and shipped independently once the foundation layer is in
+place (see [Independent Implementability](#independent-implementability)).
+
+**Foundation (shared prerequisite).** `CleanPlanner` refactoring (policy methods return
+`FileGroupCleanResult`), `BlobRef` type, schema changes (nullable `blobFilesToDelete` and
+`containersToCompact` fields), and the `hasBlobColumns` zero-cost gate.
+
+**Stage 1 (per-FG cleanup).** Set-difference logic and dispatch by blob category. Produces
+`hudi_blob_deletes` (immediate) and `external_candidates` (for Stage 2).

-- Requires no new dependencies (no MDT secondary index, no record index).
-- `CleanPlanner` refactoring + Stage 1 + Stage 3.
-- Tables with only Hudi-created blobs get full cleanup.
-- Non-blob tables are completely unaffected (zero-cost gate).
+**Stage 2 (cross-FG verification) -- priority.** Flow 2 (external blobs) is the primary initial
+use case -- cross-FG verification prevents premature deletion of shared blobs. Requires MDT +
+record index + secondary index on `reference.external_path` (P6). Includes fallback table scan
+with circuit breaker.

-### Phase 2: Flow 2 (External Blobs)
+**Stage 3 (container lifecycle).** Delete-entire-file vs. flag-for-compaction at the container
+level. Needed only when container files are used.

-- Requires MDT + record index + secondary index on `reference.external_path` (P6).
-- Stage 2 (MDT secondary index path) + fallback table scan with circuit breaker.
-- Writer-side conflict check in `preCommit()` for external blob concurrency safety.
+**Writer-side conflict check.** `preCommit()` conflict check for Flow 2 concurrency safety.
+Closes the writer-cleaner race window. Independent of the three stages.

 ### Backward Compatibility
