This is an automated email from the ASF dual-hosted git repository.
vhs pushed a commit to branch rfc-blob-cleaner
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/rfc-blob-cleaner by this push:
new e94e7eed7e41 Fix derivation and assumptions
e94e7eed7e41 is described below
commit e94e7eed7e4169459446ef834a7a817a0bf8e850
Author: voon <[email protected]>
AuthorDate: Fri Mar 20 17:21:28 2026 +0800
Fix derivation and assumptions
---
rfc/rfc-100/rfc-100-blob-cleaner-design.md | 60 ++++++++++++-------
rfc/rfc-100/rfc-100-blob-cleaner.md | 95 +++++++++++++++++-------------
2 files changed, 91 insertions(+), 64 deletions(-)
diff --git a/rfc/rfc-100/rfc-100-blob-cleaner-design.md b/rfc/rfc-100/rfc-100-blob-cleaner-design.md
index db65d648d498..18919658c2b2 100644
--- a/rfc/rfc-100/rfc-100-blob-cleaner-design.md
+++ b/rfc/rfc-100/rfc-100-blob-cleaner-design.md
@@ -347,13 +347,29 @@ for path in candidate_paths:
**Cost model.** The lookup is a three-step process: (1) batched prefix scan on the secondary index,
(2) batched record index lookup for all record keys in a single sorted HFile scan, (3) in-memory
resolution with per-candidate short-circuit. Steps 1 and 2 are each a single I/O pass; step 3 is
-pure in-memory hash set lookups (~0ms).
-
-| Step                      | I/O                                   | Cost                     |
-|---------------------------|---------------------------------------|--------------------------|
-| 1. Prefix scan (batched)  | Single MDT call for N candidate paths | ~2-5s for 2K candidates  |
-| 2. Record index (batched) | Single sorted HFile forward-scan      | ~1-2s for 6K record keys |
-| 3. In-memory resolution   | Hash set checks (cleaned_fg_ids)      | ~0ms                     |
+pure in-memory hash set lookups.
+
+**I/O assumptions.** Both MDT calls perform sorted forward scans through HFile blocks. The
+dominant costs are HFile open (block index + metadata) and sequential block reads. Estimates
+assume cloud object storage (S3/GCS/ADLS) with ~10-100ms per-read latency and ~50-200 MB/s
+sequential throughput, and HFile block size of 64-256KB (Hudi defaults). These are estimates
+pending benchmarking; actual latency depends on storage backend, block cache state, and HFile
+size.
+
+| Step                      | I/O                                           | Estimated cost (2K candidates) |
+|---------------------------|-----------------------------------------------|--------------------------------|
+| 1. Prefix scan (batched)  | 1 HFile open + forward scan of N prefix keys  | ~2-5s (see derivation below)   |
+| 2. Record index (batched) | 1 HFile open + forward scan of 6K sorted keys | ~1-2s (see derivation below)   |
+| 3. In-memory resolution   | Hash set checks (cleaned_fg_ids)              | ~0ms                           |
+
+**Derivation.** Step 1: the secondary index HFile for a table with 250K external blob references
+at ~200 bytes/entry is ~50MB. A sorted forward scan for 2K prefixes reads a subset of 64KB
+blocks (~200-500 blocks). At cloud storage latencies, with read-ahead amortization:
+1 HFile open (~100-200ms) + block reads (~50MB at 50-200 MB/s = ~0.25-1s) + per-block latency
+overhead (~0.5-2s for cold cache) = ~1-3s. With margin for larger indexes or multiple HFiles
+(pending MDT compaction): ~2-5s. Step 2: the record index HFile is larger, but the forward scan
+touches only blocks containing the 6K sorted keys. Similar I/O profile, fewer blocks hit relative
+to index size: ~1-2s.
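The three-step pipeline priced out above can be sketched in Python (an illustrative sketch only: `prefix_scan` and `record_index_lookup` are hypothetical stand-ins for the batched MDT calls, and the predicate assumes a candidate is deletable only when every referencing record resolves to a cleaned file group):

```python
def resolve_candidates(candidate_paths, cleaned_fg_ids,
                       prefix_scan, record_index_lookup):
    # Step 1: one batched prefix scan -> {path: [record_key, ...]}
    path_to_keys = prefix_scan(candidate_paths)

    # Step 2: one batched record index lookup for all keys, sorted so the
    # underlying HFile read is a single forward scan.
    all_keys = sorted(k for keys in path_to_keys.values() for k in keys)
    key_to_fg = record_index_lookup(all_keys)  # {record_key: file_group_id}

    # Step 3: in-memory resolution with per-candidate short-circuit: all()
    # stops at the first reference living in a non-cleaned file group.
    deletable = []
    for path, keys in path_to_keys.items():
        if all(key_to_fg.get(k) in cleaned_fg_ids for k in keys):
            deletable.append(path)
    return deletable
```

The short-circuit in step 3 is what keeps the resolution effectively free: a single live reference disqualifies a candidate without checking its remaining keys.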
**Index definition.** Uses the existing `HoodieIndexDefinition` mechanism with
`sourceFields = ["<blob_col>", "reference", "external_path"]`. The nested field path is supported
@@ -388,7 +404,7 @@ sequenceDiagram
Note over C: Step 1: Batch prefix scan
C->>SI: All candidate paths (N paths, single call)
- SI-->>C: Map<path, List<recordKey>>
+ SI-->>C: Map<path, List<recordKey>>
Note over C: Step 2: Batch record index lookup
C->>C: Collect all record keys from all candidates
@@ -679,20 +695,20 @@ cleaning:
### Back-of-Envelope: Example 7 (50K FGs, 2K External Candidates)
-| Parameter                           | Value     | Notes                                          |
-|-------------------------------------|-----------|------------------------------------------------|
-| FGs cleaned this cycle              | 500       | 1% of table                                    |
-| Stage 1: reads per FG               | ~6        | 3 retained + 3 expired slices                  |
-| Stage 1: total reads                | 3,000     | Parallelized across executors, ~20s            |
-| External blob candidates            | 2,000     | Locally orphaned in cleaned FGs                |
-| Avg refs per candidate              | 3         | Typical: video in a few playlists              |
-| Total record keys                   | 6,000     | 2,000 * 3                                      |
-| **Stage 2 cost**                    |           |                                                |
-| Step 1: batched prefix scan         | 1 call    | Returns 6K record keys, ~2-5s                  |
-| Step 2: batched record index lookup | 1 call    | 6K keys sorted, single HFile scan, ~1-2s       |
-| Step 3: in-memory resolution        | 6K checks | Hash set lookups against cleaned_fg_ids, ~0ms  |
-| **Total Stage 2**                   | **~3-7s** |                                                |
-| Comparison: naive full-table scan   | 12.5TB    | 50K FGs * 5 slices * 50MB = prohibitive        |
+| Parameter                           | Value     | Notes                                                 |
+|-------------------------------------|-----------|-------------------------------------------------------|
+| FGs cleaned this cycle              | 500       | 1% of table                                           |
+| Stage 1: reads per FG               | ~6        | 3 retained + 3 expired slices                         |
+| Stage 1: total reads                | 3,000     | Parallelized across executors, ~20s                   |
+| External blob candidates            | 2,000     | Locally orphaned in cleaned FGs                       |
+| Avg refs per candidate              | 3         | Typical: video in a few playlists                     |
+| Total record keys                   | 6,000     | 2,000 * 3                                             |
+| **Stage 2 cost (estimated)**        |           |                                                       |
+| Step 1: batched prefix scan         | 1 call    | Returns 6K record keys, ~2-5s estimated               |
+| Step 2: batched record index lookup | 1 call    | 6K keys sorted, single HFile scan, ~1-2s estimated    |
+| Step 3: in-memory resolution        | 6K checks | Hash set lookups against cleaned_fg_ids, ~0ms         |
+| **Total Stage 2**                   | **~3-7s** | Estimated; see I/O assumptions in Stage 2 cost model  |
+| Comparison: naive full-table scan   | 12.5TB    | 50K FGs * 5 slices * 50MB = prohibitive               |
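The table's totals follow from simple arithmetic; a quick cross-check (decimal units, matching the 12.5TB figure in the comparison row):

```python
# Cross-check of the Example 7 back-of-envelope numbers.
fgs_cleaned = 500                    # 1% of 50K file groups
reads_per_fg = 6                     # 3 retained + 3 expired slices
stage1_reads = fgs_cleaned * reads_per_fg

candidates = 2_000                   # locally orphaned blob paths
avg_refs = 3                         # refs per candidate
record_keys = candidates * avg_refs

# Naive full-table comparison: every slice of every FG read once.
naive_mb = 50_000 * 5 * 50           # 50K FGs * 5 slices * 50MB each
naive_tb = naive_mb / 1_000_000      # decimal TB

print(stage1_reads, record_keys, naive_tb)  # → 3000 6000 12.5
```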
### Memory Budget
diff --git a/rfc/rfc-100/rfc-100-blob-cleaner.md b/rfc/rfc-100/rfc-100-blob-cleaner.md
index 6365f6807d4d..4fe0e3dee124 100644
--- a/rfc/rfc-100/rfc-100-blob-cleaner.md
+++ b/rfc/rfc-100/rfc-100-blob-cleaner.md
@@ -358,26 +358,37 @@ adjacent blocks. No per-key random I/O.
**Back-of-envelope for Example 7 (revised):**
-| Parameter                          | Value      | Notes                                              |
-|------------------------------------|------------|----------------------------------------------------|
-| Candidate paths (N)                | 2,000      | Locally orphaned in 500 cleaned FGs                |
-| Avg refs per candidate (R_avg)     | 3          | Typical: video in a few playlists                  |
-| Total record keys (R_total)        | 6,000      | N * R_avg                                          |
-| Popular blob refs (viral example)  | 10,000     | A viral video in many playlists                    |
-| **Step 1: batched prefix scan**    | 1 MDT call | Returns 6K record keys (or 10K+ with viral), ~2-5s |
-| **Step 2: batched record index**   | 1 MDT call | 6K keys sorted, single HFile scan, ~1-2s           |
-| **Step 3: in-memory resolution**   | 6K checks  | Hash set lookups, ~0ms                             |
-| **Total Stage 2**                  | **~3-7s**  | Two I/O passes + in-memory                         |
-| **Viral blob (10K refs)**          |            |                                                    |
-| Step 1: prefix scan                | 1 MDT call | Returns 10K record keys, ~2-3s                     |
-| Step 2: record index               | 1 MDT call | 10K keys sorted, single HFile scan, ~2-3s          |
-| Step 3: in-memory resolution       | 10K checks | ~0ms                                               |
-| **Total for viral blob**           | **~4-6s**  | Sequential scan, not 10K random seeks              |
+| Parameter                         | Value      | Notes                                                        |
+|-----------------------------------|------------|--------------------------------------------------------------|
+| Candidate paths (N)               | 2,000      | Locally orphaned in 500 cleaned FGs                          |
+| Avg refs per candidate (R_avg)    | 3          | Typical: video in a few playlists                            |
+| Total record keys (R_total)       | 6,000      | N * R_avg                                                    |
+| Popular blob refs (viral example) | 10,000     | A viral video in many playlists                              |
+| **Step 1: batched prefix scan**   | 1 MDT call | Returns 6K record keys (or 10K+ with viral), ~2-5s estimated |
+| **Step 2: batched record index**  | 1 MDT call | 6K keys sorted, single HFile scan, ~1-2s estimated           |
+| **Step 3: in-memory resolution**  | 6K checks  | Hash set lookups, ~0ms                                       |
+| **Total Stage 2**                 | **~3-7s**  | Estimated; two I/O passes + in-memory                        |
+| **Viral blob (10K refs)**         |            |                                                              |
+| Step 1: prefix scan               | 1 MDT call | Returns 10K record keys, ~2-3s estimated                     |
+| Step 2: record index              | 1 MDT call | 10K keys sorted, single HFile scan, ~2-3s estimated          |
+| Step 3: in-memory resolution      | 10K checks | ~0ms                                                         |
+| **Total for viral blob**          | **~4-6s**  | Estimated; sequential scan, not 10K random seeks             |
Both steps use batch APIs (`readSecondaryIndexDataTableRecordKeysWithKeys` and
`readRecordIndexLocations`) that sort keys internally and perform sequential forward-scans
through HFile blocks. No per-key random I/O.
+**I/O assumptions and derivation.** Estimates assume cloud object storage (S3/GCS/ADLS) with
+~10-100ms per-read latency, ~50-200 MB/s sequential throughput, and HFile block size of
+64-256KB (Hudi defaults). Step 1: a secondary index HFile for 250K external blob references at
+~200 bytes/entry is ~50MB. A sorted forward scan for 2K prefixes reads ~200-500 blocks. With
+read-ahead amortization: 1 HFile open (~100-200ms) + block reads (~50MB at 50-200 MB/s) +
+per-block latency overhead (~0.5-2s cold cache) = ~1-3s; with margin for larger indexes or
+multiple HFiles: ~2-5s. Step 2: the record index HFile is larger, but the forward scan touches
+only blocks containing the 6K sorted keys, similar I/O profile: ~1-2s. These are estimates
+pending benchmarking; actual latency depends on storage backend, block cache state, and HFile size.
+
Even in the viral blob case, this is far cheaper than the fallback table scan (12.5TB for Example 7).
**Index definition.** The index is defined as a secondary index on the blob reference's
@@ -1444,22 +1455,22 @@ HFile forward-scan, (3) in-memory resolution with per-candidate short-circuit.
The I/O is sequential (sorted forward-scans through HFiles), not random per-key lookups. See
Section 3.2.1 for the detailed cost model and analysis.
-**Scaling analysis for Example 7 (media company, revised):**
-
-| Parameter                      | Value       | Notes                                        |
-|--------------------------------|-------------|----------------------------------------------|
-| Total videos                   | 10M         | Stored in s3://media-library/                |
-| Total file groups              | 50K         | Across 1K partitions                         |
-| FGs cleaned this cycle         | 500         | 1% of table                                  |
-| External blob candidates       | 2,000       | Locally orphaned in cleaned FGs              |
-| Avg refs per candidate         | 3           | Typical: video in a few playlists            |
-| Total record keys (R_total)    | 6,000       | N * R_avg                                    |
-| **Stage 2 (batched pipeline)** |             |                                              |
-| Step 1: batched prefix scan    | 1 MDT call  | Returns 6K record keys, ~2-5s                |
-| Step 2: batched record index   | 1 MDT call  | 6K keys sorted, single HFile scan, ~1-2s     |
-| Step 3: in-memory resolution   | 6K checks   | Hash set lookups, ~0ms                       |
-| **Total Stage 2 time**         | **~3-7s**   |                                              |
-| Comparison: naive table scan   | 12.5TB read | 50K FGs * 5 slices * 50MB = prohibitive      |
+**Scaling analysis for Example 7:**
+
+| Parameter                      | Value       | Notes                                         |
+|--------------------------------|-------------|-----------------------------------------------|
+| Total videos                   | 10M         | Stored in s3://media-library/                 |
+| Total file groups              | 50K         | Across 1K partitions                          |
+| FGs cleaned this cycle         | 500         | 1% of table                                   |
+| External blob candidates       | 2,000       | Locally orphaned in cleaned FGs               |
+| Avg refs per candidate         | 3           | Typical: video in a few playlists             |
+| Total record keys (R_total)    | 6,000       | N * R_avg                                     |
+| **Stage 2 (batched pipeline)** |             |                                               |
+| Step 1: batched prefix scan    | 1 MDT call  | Returns 6K record keys, ~2-5s estimated       |
+| Step 2: batched record index   | 1 MDT call  | 6K keys sorted, single HFile scan, ~1-2s est. |
+| Step 3: in-memory resolution   | 6K checks   | Hash set lookups, ~0ms                        |
+| **Total Stage 2 time**         | **~3-7s**   | Estimated; see Section 3.2.1 for derivation   |
+| Comparison: naive table scan   | 12.5TB read | 50K FGs * 5 slices * 50MB = prohibitive       |
The index maintenance cost is borne by the existing MDT write pipeline via
`SecondaryIndexRecordGenerationUtils.convertWriteStatsToSecondaryIndexRecords()`. When a record
@@ -1812,20 +1823,20 @@ The lookup is a three-step pipeline with two batched I/O passes and one in-memory
lookup for all record keys in a single sorted HFile forward-scan, (3) in-memory resolution with
per-candidate short-circuit.
-| Parameter                             | Value                | Notes                                                |
-|---------------------------------------|----------------------|------------------------------------------------------|
-| Candidates (N)                        | N                    | Only locally-orphaned external blobs                 |
-| Step 1: batched prefix scan           | 1 MDT call           | readSecondaryIndex...WithKeys for N paths            |
-| Record keys returned (R_total)        | N * R_avg            | R_avg = avg refs per candidate                       |
-| Step 2: batched record index          | 1 MDT call           | readRecordIndexLocations: sorted forward-scan        |
-| Step 3: in-memory resolution          | R_total checks       | Hash set lookups against cleaned_fg_ids              |
-| Memory                                | O(N * R_avg)         | Record keys + lookup results (in-memory after batch) |
+| Parameter                      | Value          | Notes                                                |
+|--------------------------------|----------------|------------------------------------------------------|
+| Candidates (N)                 | N              | Only locally-orphaned external blobs                 |
+| Step 1: batched prefix scan    | 1 MDT call     | readSecondaryIndex...WithKeys for N paths            |
+| Record keys returned (R_total) | N * R_avg      | R_avg = avg refs per candidate                       |
+| Step 2: batched record index   | 1 MDT call     | readRecordIndexLocations: sorted forward-scan        |
+| Step 3: in-memory resolution   | R_total checks | Hash set lookups against cleaned_fg_ids              |
+| Memory                         | O(N * R_avg)   | Record keys + lookup results (in-memory after batch) |
For Example 7 (2,000 candidates, R_avg=3, R_total=6K):
-- Step 1: batched prefix scan returning 6K record keys, ~2-5s.
-- Step 2: batched record index lookup, 6K keys sorted, single HFile scan, ~1-2s.
+- Step 1: batched prefix scan returning 6K record keys, ~2-5s (estimated).
+- Step 2: batched record index lookup, 6K keys sorted, single HFile scan, ~1-2s (estimated).
- Step 3: in-memory resolution, ~0ms.
-- **Total: ~3-7s.**
+- **Total: ~3-7s (estimated; see Section 3.2.1 for I/O assumptions and derivation).**
For a viral blob (10K refs):
- Step 1: prefix scan for that blob returns 10K keys, ~2-3s.