GitHub user danny0405 edited a comment on the discussion: RocksDB as The Replica of MDT/RLI
I'm assuming you already read the [Flink RLI RFC](https://github.com/apache/hudi/pull/17610).

1. We did some local benchmarking with the MDT (plus a hotspot cache in memory) as the index backend, and it turned out to average 50 ms of access latency per record query, which does not scale well for streaming. Thus a fast local lookup replica is necessary. We have two solutions on the table: 1) use a RocksDB replica as a mirror image of the MDT, always replicating the index payloads from the MDT to local RocksDB to ensure performance; 2) still use the MDT, but introduce more caches: a data block cache, local index metadata, a bloom-filter cache, a local file cache on SSD (secondary cache), and a fallback to remote MDT queries. The second would take a lot of effort and we deem it a long-term solution; as of now, we prefer the first.
2. NBCC only works with the simple bucket index, so it is not very relevant here.
3. The MDT does not clean up legacy files immediately on each compaction, so the estimation still makes sense given the configured compaction frequency (both the cleaning and compaction strategies can be adjusted). The RocksDB block cache uses off-heap memory, I think. But yes, there could be resource contention, just like with the Flink-state index, which has already proven good performance in production. And we will definitely do a lot more benchmarking for long-running jobs; that is a collaboration with the Uber team.
4. The bootstrap op parallelism is scalable; the bottleneck might be the full-scan load time for a single file group of the MDT. As long as that time is less than the checkpoint timeout, it works well, and the bootstrap only happens once per job restart. Why not persist RocksDB instances across job recoveries: a) the RocksDB local storage lives in the local container, which is released once the task is killed (we cannot use remote storage, which would kill the performance); b) there is no good way to maintain consistency between the RocksDB replica and the MDT, given the complexities of DT and MDT consistency.
5. The RocksDB is updated per record in the `BucketAssign` op, and RocksDB itself has a memtable to manage the buffering and flushing of the index records. Each time the task fails over, the bootstrap is retriggered to ensure integrity.
6. We want the writes to scale independently, and we don't want a single write task writing to every file group, in order to limit small files.

GitHub link: https://github.com/apache/hudi/discussions/18296#discussioncomment-16154189

----
This is an automatically sent email for [email protected]. To unsubscribe, please send an email to: [email protected]
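The lookup path described in point 1 can be sketched roughly as: try the local replica first, fall back to the remote MDT on a miss, and replicate the payload into the local mirror. This is a minimal illustration only; the class and method names are hypothetical (not Hudi APIs), and `HashMap`s stand in for the real RocksDB instance and the remote MDT so the snippet is self-contained:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative sketch of the replica lookup path: fast local mirror first,
// slow remote MDT as fallback, with replication into the mirror on a miss.
// HashMaps stand in for RocksDB and the MDT; all names are hypothetical.
public class ReplicaIndexLookup {
    private final Map<String, String> localReplica = new HashMap<>(); // stand-in for local RocksDB
    private final Map<String, String> remoteMdt;                      // stand-in for the remote MDT RLI

    public ReplicaIndexLookup(Map<String, String> remoteMdt) {
        this.remoteMdt = remoteMdt;
    }

    /** Resolve the file-group location for a record key, if any. */
    public Optional<String> getLocation(String recordKey) {
        String location = localReplica.get(recordKey);  // fast local path
        if (location == null) {
            location = remoteMdt.get(recordKey);        // slow remote path (~50 ms/record per the benchmark)
            if (location != null) {
                localReplica.put(recordKey, location);  // replicate the index payload locally
            }
        }
        return Optional.ofNullable(location);
    }
}
```

In the actual design the replica is kept in sync by continuous replication from the MDT rather than lazily on misses, and it is rebuilt by the bootstrap scan after each failover (points 4 and 5).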
