This is an automated email from the ASF dual-hosted git repository.
vhs pushed a commit to branch rfc-blob-cleaner
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/rfc-blob-cleaner by this push:
new e94e7eed7e41 Fix derivation and assumptions
e94e7eed7e41 is described below
commit e94e7eed7e4169459446ef834a7a817a0bf8e850
Author: voon <[email protected]>
AuthorDate: Fri Mar 20 17:21:28 2026 +0800
Fix derivation and assumptions
---
rfc/rfc-100/rfc-100-blob-cleaner-design.md | 60 ++++++++++++-------
rfc/rfc-100/rfc-100-blob-cleaner.md | 95 +++++++++++++++++-------------
2 files changed, 91 insertions(+), 64 deletions(-)
diff --git a/rfc/rfc-100/rfc-100-blob-cleaner-design.md b/rfc/rfc-100/rfc-100-blob-cleaner-design.md
index db65d648d498..18919658c2b2 100644
--- a/rfc/rfc-100/rfc-100-blob-cleaner-design.md
+++ b/rfc/rfc-100/rfc-100-blob-cleaner-design.md
@@ -347,13 +347,29 @@ for path in candidate_paths:
**Cost model.** The lookup is a three-step process: (1) batched prefix scan on the secondary index,
(2) batched record index lookup for all record keys in a single sorted HFile scan, (3) in-memory
resolution with per-candidate short-circuit. Steps 1 and 2 are each a single I/O pass; step 3 is
-pure in-memory hash set lookups (~0ms).
-
-| Step                      | I/O                                   | Cost                     |
-|---------------------------|---------------------------------------|--------------------------|
-| 1. Prefix scan (batched)  | Single MDT call for N candidate paths | ~2-5s for 2K candidates  |
-| 2. Record index (batched) | Single sorted HFile forward-scan      | ~1-2s for 6K record keys |
-| 3. In-memory resolution   | Hash set checks (cleaned_fg_ids)      | ~0ms                     |
+pure in-memory hash set lookups.
+
+**I/O assumptions.** Both MDT calls perform sorted forward scans through HFile blocks. The
+dominant costs are HFile open (block index + metadata) and sequential block reads. Estimates
+assume cloud object storage (S3/GCS/ADLS) with ~10-100ms per-read latency and ~50-200 MB/s
+sequential throughput, and HFile block size of 64-256KB (Hudi defaults). These are estimates
+pending benchmarking; actual latency depends on storage backend, block cache state, and HFile
+size.
+
+| Step                      | I/O                                           | Estimated cost (2K candidates) |
+|---------------------------|-----------------------------------------------|--------------------------------|
+| 1. Prefix scan (batched)  | 1 HFile open + forward scan of N prefix keys  | ~2-5s (see derivation below)   |
+| 2. Record index (batched) | 1 HFile open + forward scan of 6K sorted keys | ~1-2s (see derivation below)   |
+| 3. In-memory resolution   | Hash set checks (cleaned_fg_ids)              | ~0ms                           |
+
+**Derivation.** Step 1: the secondary index HFile for a table with 250K external blob references
+at ~200 bytes/entry is ~50MB. A sorted forward scan for 2K prefixes reads a subset of 64KB
+blocks (~200-500 blocks). At cloud storage latencies, with read-ahead amortization:
+1 HFile open (~100-200ms) + block reads (~50MB at 50-200 MB/s = ~0.25-1s) + per-block latency
+overhead (~0.5-2s for cold cache) = ~1-3s. With margin for larger indexes or multiple HFiles
+(pending MDT compaction): ~2-5s. Step 2: the record index HFile is larger, but the forward scan
+touches only blocks containing the 6K sorted keys. Similar I/O profile, fewer blocks hit relative
+to index size: ~1-2s.
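The three-step pipeline priced out above can be sketched in Python (an illustrative sketch only: `prefix_scan` and `record_index_lookup` are hypothetical stand-ins for the batched MDT calls, and the predicate assumes a candidate is deletable only when every referencing record resolves to a cleaned file group):

```python
def resolve_candidates(candidate_paths, cleaned_fg_ids,
                       prefix_scan, record_index_lookup):
    # Step 1: one batched prefix scan -> {path: [record_key, ...]}
    path_to_keys = prefix_scan(candidate_paths)

    # Step 2: one batched record index lookup for all keys, sorted so the
    # underlying HFile read is a single forward scan.
    all_keys = sorted(k for keys in path_to_keys.values() for k in keys)
    key_to_fg = record_index_lookup(all_keys)  # {record_key: file_group_id}

    # Step 3: in-memory resolution with per-candidate short-circuit: all()
    # stops at the first reference living in a non-cleaned file group.
    deletable = []
    for path, keys in path_to_keys.items():
        if all(key_to_fg.get(k) in cleaned_fg_ids for k in keys):
            deletable.append(path)
    return deletable
```

The short-circuit in step 3 is what keeps the resolution effectively free: a single live reference disqualifies a candidate without checking its remaining keys.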
**Index definition.** Uses the existing `HoodieIndexDefinition` mechanism with
`sourceFields = ["<blob_col>", "reference", "external_path"]`. The nested field path is supported
@@ -388,7 +404,7 @@ sequenceDiagram
Note over C: Step 1: Batch prefix scan
C->>SI: All candidate paths (N paths, single call)
- SI-->>C: Map<path, List<recordKey>>
+ SI-->>C: Map<path, List<recordKey>>
Note over C: Step 2: Batch record index lookup
C->>C: Collect all record keys from all candidates
@@ -679,20 +695,20 @@ cleaning:
### Back-of-Envelope: Example 7 (50K FGs, 2K External Candidates)
-| Parameter                           | Value     | Notes                                          |
-|-------------------------------------|-----------|------------------------------------------------|
-| FGs cleaned this cycle              | 500       | 1% of table                                    |
-| Stage 1: reads per FG               | ~6        | 3 retained + 3 expired slices                  |
-| Stage 1: total reads                | 3,000     | Parallelized across executors, ~20s            |
-| External blob candidates            | 2,000     | Locally orphaned in cleaned FGs                |
-| Avg refs per candidate              | 3         | Typical: video in a few playlists              |
-| Total record keys                   | 6,000     | 2,000 * 3                                      |
-| **Stage 2 cost**                    |           |                                                |
-| Step 1: batched prefix scan         | 1 call    | Returns 6K record keys, ~2-5s                  |
-| Step 2: batched record index lookup | 1 call    | 6K keys sorted, single HFile scan, ~1-2s       |
-| Step 3: in-memory resolution        | 6K checks | Hash set lookups against cleaned_fg_ids, ~0ms  |
-| **Total Stage 2**                   | **~3-7s** |                                                |
-| Comparison: naive full-table scan   | 12.5TB    | 50K FGs * 5 slices * 50MB = prohibitive        |
+| Parameter                           | Value     | Notes                                                 |
+|-------------------------------------|-----------|-------------------------------------------------------|
+| FGs cleaned this cycle              | 500       | 1% of table                                           |
+| Stage 1: reads per FG               | ~6        | 3 retained + 3 expired slices                         |
+| Stage 1: total reads                | 3,000     | Parallelized across executors, ~20s                   |
+| External blob candidates            | 2,000     | Locally orphaned in cleaned FGs                       |
+| Avg refs per candidate              | 3         | Typical: video in a few playlists                     |
+| Total record keys                   | 6,000     | 2,000 * 3                                             |
+| **Stage 2 cost (estimated)**        |           |                                                       |
+| Step 1: batched prefix scan         | 1 call    | Returns 6K record keys, ~2-5s estimated               |
+| Step 2: batched record index lookup | 1 call    | 6K keys sorted, single HFile scan, ~1-2s estimated    |
+| Step 3: in-memory resolution        | 6K checks | Hash set lookups against cleaned_fg_ids, ~0ms         |
+| **Total Stage 2**                   | **~3-7s** | Estimated; see I/O assumptions in Stage 2 cost model  |
+| Comparison: naive full-table scan   | 12.5TB    | 50K FGs * 5 slices * 50MB = prohibitive               |
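The table's totals follow from simple arithmetic; a quick cross-check (decimal units, matching the 12.5TB figure in the comparison row):

```python
# Cross-check of the Example 7 back-of-envelope numbers.
fgs_cleaned = 500                    # 1% of 50K file groups
reads_per_fg = 6                     # 3 retained + 3 expired slices
stage1_reads = fgs_cleaned * reads_per_fg

candidates = 2_000                   # locally orphaned blob paths
avg_refs = 3                         # refs per candidate
record_keys = candidates * avg_refs

# Naive full-table comparison: every slice of every FG read once.
naive_mb = 50_000 * 5 * 50           # 50K FGs * 5 slices * 50MB each
naive_tb = naive_mb / 1_000_000      # decimal TB

print(stage1_reads, record_keys, naive_tb)  # → 3000 6000 12.5
```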
### Memory Budget
diff --git a/rfc/rfc-100/rfc-100-blob-cleaner.md b/rfc/rfc-100/rfc-100-blob-cleaner.md
index 6365f6807d4d..4fe0e3dee124 100644
--- a/rfc/rfc-100/rfc-100-blob-cleaner.md
+++ b/rfc/rfc-100/rfc-100-blob-cleaner.md
@@ -358,26 +358,37 @@ adjacent blocks. No per-key random I/O.
**Back-of-envelope for Example 7 (revised):**
-| Parameter                          | Value      | Notes                                              |
-|------------------------------------|------------|----------------------------------------------------|
-| Candidate paths (N)                | 2,000      | Locally orphaned in 500 cleaned FGs                |
-| Avg refs per candidate (R_avg)     | 3          | Typical: video in a few playlists                  |
-| Total record keys (R_total)        | 6,000      | N * R_avg                                          |
-| Popular blob refs (viral example)  | 10,000     | A viral video in many playlists                    |
-| **Step 1: batched prefix scan**    | 1 MDT call | Returns 6K record keys (or 10K+ with viral), ~2-5s |
-| **Step 2: batched record index**   | 1 MDT call | 6K keys sorted, single HFile scan, ~1-2s           |
-| **Step 3: in-memory resolution**   | 6K checks  | Hash set lookups, ~0ms                             |
-| **Total Stage 2**                  | **~3-7s**  | Two I/O passes + in-memory                         |
-| **Viral blob (10K refs)**          |            |                                                    |
-| Step 1: prefix scan                | 1 MDT call | Returns 10K record keys, ~2-3s                     |
-| Step 2: record index               | 1 MDT call | 10K keys sorted, single HFile scan, ~2-3s          |
-| Step 3: in-memory resolution       | 10K checks | ~0ms                                               |
-| **Total for viral blob**           | **~4-6s**  | Sequential scan, not 10K random seeks              |
+| Parameter                         | Value      | Notes                                                        |
+|-----------------------------------|------------|--------------------------------------------------------------|
+| Candidate paths (N)               | 2,000      | Locally orphaned in 500 cleaned FGs                          |
+| Avg refs per candidate (R_avg)    | 3          | Typical: video in a few playlists                            |
+| Total record keys (R_total)       | 6,000      | N * R_avg                                                    |
+| Popular blob refs (viral example) | 10,000     | A viral video in many playlists                              |
+| **Step 1: batched prefix scan**   | 1 MDT call | Returns 6K record keys (or 10K+ with viral), ~2-5s estimated |
+| **Step 2: batched record index**  | 1 MDT call | 6K keys sorted, single HFile scan, ~1-2s estimated           |
+| **Step 3: in-memory resolution**  | 6K checks  | Hash set lookups, ~0ms                                       |
+| **Total Stage 2**                 | **~3-7s**  | Estimated; two I/O passes + in-memory                        |
+| **Viral blob (10K refs)**         |            |                                                              |
+| Step 1: prefix scan               | 1 MDT call | Returns 10K record keys, ~2-3s estimated                     |
+| Step 2: record index              | 1 MDT call | 10K keys sorted, single HFile scan, ~2-3s estimated          |
+| Step 3: in-memory resolution      | 10K checks | ~0ms                                                         |
+| **Total for viral blob**          | **~4-6s**  | Estimated; sequential scan, not 10K random seeks             |
Both steps use batch APIs (`readSecondaryIndexDataTableRecordKeysWithKeys` and
`readRecordIndexLocations`) that sort keys internally and perform sequential forward-scans
through HFile blocks. No per-key random I/O.
+**I/O assumptions and derivation.** Estimates assume cloud object storage (S3/GCS/ADLS) with
+~10-100ms per-read latency, ~50-200 MB/s sequential throughput, and HFile block size of
+64-256KB (Hudi defaults). Step 1: a secondary index HFile for 250K external blob references at
+~200 bytes/entry is ~50MB. A sorted forward scan for 2K prefixes reads ~200-500 blocks. With
+read-ahead amortization: 1 HFile open (~100-200ms) + block reads (~50MB at 50-200 MB/s) +
+per-block latency overhead (~0.5-2s cold cache) = ~1-3s; with margin for larger indexes or
+multiple HFiles: ~2-5s. Step 2: the record index HFile is larger, but the forward scan touches
+only blocks containing the 6K sorted keys, similar I/O profile: ~1-2s. These are estimates
+pending benchmarking; actual latency depends on storage backend, block cache state, and HFile size.
+
Even in the viral blob case, this is far cheaper than the fallback table scan (12.5TB for Example 7).
**Index definition.** The index is defined as a secondary index on the blob reference's
@@ -1444,22 +1455,22 @@ HFile forward-scan, (3) in-memory resolution with per-candidate short-circuit.
The I/O is sequential (sorted forward-scans through HFiles), not random per-key lookups. See
Section 3.2.1 for the detailed cost model and analysis.
-**Scaling analysis for Example 7 (media company, revised):**
-
-| Parameter                      | Value       | Notes                                        |
-|--------------------------------|-------------|----------------------------------------------|
-| Total videos                   | 10M         | Stored in s3://media-library/                |
-| Total file groups              | 50K         | Across 1K partitions                         |
-| FGs cleaned this cycle         | 500         | 1% of table                                  |
-| External blob candidates       | 2,000       | Locally orphaned in cleaned FGs              |
-| Avg refs per candidate         | 3           | Typical: video in a few playlists            |
-| Total record keys (R_total)    | 6,000       | N * R_avg                                    |
-| **Stage 2 (batched pipeline)** |             |                                              |
-| Step 1: batched prefix scan    | 1 MDT call  | Returns 6K record keys, ~2-5s                |
-| Step 2: batched record index   | 1 MDT call  | 6K keys sorted, single HFile scan, ~1-2s     |
-| Step 3: in-memory resolution   | 6K checks   | Hash set lookups, ~0ms                       |
-| **Total Stage 2 time**         | **~3-7s**   |                                              |
-| Comparison: naive table scan   | 12.5TB read | 50K FGs * 5 slices * 50MB = prohibitive      |
+**Scaling analysis for Example 7:**
+
+| Parameter                      | Value       | Notes                                         |
+|--------------------------------|-------------|-----------------------------------------------|
+| Total videos                   | 10M         | Stored in s3://media-library/                 |
+| Total file groups              | 50K         | Across 1K partitions                          |
+| FGs cleaned this cycle         | 500         | 1% of table                                   |
+| External blob candidates       | 2,000       | Locally orphaned in cleaned FGs               |
+| Avg refs per candidate         | 3           | Typical: video in a few playlists             |
+| Total record keys (R_total)    | 6,000       | N * R_avg                                     |
+| **Stage 2 (batched pipeline)** |             |                                               |
+| Step 1: batched prefix scan    | 1 MDT call  | Returns 6K record keys, ~2-5s estimated       |
+| Step 2: batched record index   | 1 MDT call  | 6K keys sorted, single HFile scan, ~1-2s est. |
+| Step 3: in-memory resolution   | 6K checks   | Hash set lookups, ~0ms                        |
+| **Total Stage 2 time**         | **~3-7s**   | Estimated; see Section 3.2.1 for derivation   |
+| Comparison: naive table scan   | 12.5TB read | 50K FGs * 5 slices * 50MB = prohibitive       |
The index maintenance cost is borne by the existing MDT write pipeline via
`SecondaryIndexRecordGenerationUtils.convertWriteStatsToSecondaryIndexRecords()`. When a record
@@ -1812,20 +1823,20 @@ The lookup is a three-step pipeline with two batched I/O passes and one in-memory
lookup for all record keys in a single sorted HFile forward-scan, (3) in-memory resolution with
per-candidate short-circuit.
-| Parameter                             | Value                | Notes                                                |
-|---------------------------------------|----------------------|------------------------------------------------------|
-| Candidates (N)                        | N                    | Only locally-orphaned external blobs                 |
-| Step 1: batched prefix scan           | 1 MDT call           | readSecondaryIndex...WithKeys for N paths            |
-| Record keys returned (R_total)        | N * R_avg            | R_avg = avg refs per candidate                       |
-| Step 2: batched record index          | 1 MDT call           | readRecordIndexLocations: sorted forward-scan        |
-| Step 3: in-memory resolution          | R_total checks       | Hash set lookups against cleaned_fg_ids              |
-| Memory                                | O(N * R_avg)         | Record keys + lookup results (in-memory after batch) |
+| Parameter                      | Value          | Notes                                                |
+|--------------------------------|----------------|------------------------------------------------------|
+| Candidates (N)                 | N              | Only locally-orphaned external blobs                 |
+| Step 1: batched prefix scan    | 1 MDT call     | readSecondaryIndex...WithKeys for N paths            |
+| Record keys returned (R_total) | N * R_avg      | R_avg = avg refs per candidate                       |
+| Step 2: batched record index   | 1 MDT call     | readRecordIndexLocations: sorted forward-scan        |
+| Step 3: in-memory resolution   | R_total checks | Hash set lookups against cleaned_fg_ids              |
+| Memory                         | O(N * R_avg)   | Record keys + lookup results (in-memory after batch) |
For Example 7 (2,000 candidates, R_avg=3, R_total=6K):
-- Step 1: batched prefix scan returning 6K record keys, ~2-5s.
-- Step 2: batched record index lookup, 6K keys sorted, single HFile scan, ~1-2s.
+- Step 1: batched prefix scan returning 6K record keys, ~2-5s (estimated).
+- Step 2: batched record index lookup, 6K keys sorted, single HFile scan, ~1-2s (estimated).
- Step 3: in-memory resolution, ~0ms.
-- **Total: ~3-7s.**
+- **Total: ~3-7s (estimated; see Section 3.2.1 for I/O assumptions and derivation).**
For a viral blob (10K refs):
- Step 1: prefix scan for that blob returns 10K keys, ~2-3s.