GitHub user danny0405 edited a comment on the discussion: RocksDB as The Replica of MDT/RLI
I'm assuming you already read the [Flink RLI RFC](https://github.com/apache/hudi/pull/17610).

1. We did some local benchmarking with the MDT (plus a hotspot cache in memory) as the index backend, and it turned out to average 50 ms of access latency per record query, which does not scale well for streaming. Thus a fast local lookup replica is necessary. We have two solutions on the table: 1) use a RocksDB replica as a mirror image of the MDT, always replicating the index payloads from the MDT to local RocksDB to ensure performance; 2) still use the MDT, but introduce more caches: a data block cache, local index metadata, a bloom-filter cache, a local file cache on SSD (secondary cache), and a fallback to remote MDT queries. The second would take a lot of effort and we deem it a long-term solution; as of now, we prefer the first.
2. NBCC only works with the simple bucket index, so it is not very relevant here.
3. The MDT does not clean up legacy files immediately on each compaction, so the estimation still makes sense given the configured compaction frequency (both the cleaning and compaction strategies can be adjusted). The RocksDB block cache uses off-heap memory, I think. But yes, there could be resource contention, just like with the Flink-state index, which has already proven good performance in production. And we will definitely do a lot more benchmarking for long-running jobs; that is a collaboration with the Uber team.
4. The bootstrap op parallelism is scalable; the bottleneck might be the full-scan load time for a single file group of the MDT. As long as that time is less than the checkpoint timeout, it works well, and the bootstrap only happens once per job restart. Why not persist RocksDB instances across job recoveries: a) the RocksDB local storage lives in the local container, which is released once the task is killed (we cannot use remote storage, which would kill the performance); b) there is no good way to maintain consistency between the RocksDB replica and the MDT, given the complexities of DT and MDT consistency.
5. The RocksDB is updated per record in the `BucketAssign` op, and RocksDB itself has a memtable to manage the buffering and flushing of the index records. Each time the task fails over, the bootstrap is retriggered to ensure integrity.
6. We want the writes to scale independently, and we don't want a single write task writing to every file group, in order to limit small files.

GitHub link: https://github.com/apache/hudi/discussions/18296#discussioncomment-16154189

----
This is an automatically sent email for [email protected]. To unsubscribe, please send an email to: [email protected]
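The lookup path described in point 1 can be sketched roughly as: try the local replica first, fall back to the remote MDT on a miss, and replicate the payload into the local mirror. This is a minimal illustration only; the class and method names are hypothetical (not Hudi APIs), and `HashMap`s stand in for the real RocksDB instance and the remote MDT so the snippet is self-contained:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative sketch of the replica lookup path: fast local mirror first,
// slow remote MDT as fallback, with replication into the mirror on a miss.
// HashMaps stand in for RocksDB and the MDT; all names are hypothetical.
public class ReplicaIndexLookup {
    private final Map<String, String> localReplica = new HashMap<>(); // stand-in for local RocksDB
    private final Map<String, String> remoteMdt;                      // stand-in for the remote MDT RLI

    public ReplicaIndexLookup(Map<String, String> remoteMdt) {
        this.remoteMdt = remoteMdt;
    }

    /** Resolve the file-group location for a record key, if any. */
    public Optional<String> getLocation(String recordKey) {
        String location = localReplica.get(recordKey);  // fast local path
        if (location == null) {
            location = remoteMdt.get(recordKey);        // slow remote path (~50 ms/record per the benchmark)
            if (location != null) {
                localReplica.put(recordKey, location);  // replicate the index payload locally
            }
        }
        return Optional.ofNullable(location);
    }
}
```

In the actual design the replica is kept in sync by continuous replication from the MDT rather than lazily on misses, and it is rebuilt by the bootstrap scan after each failover (points 4 and 5).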
