sjwiesman commented on a change in pull request #10498: [FLINK-14495][docs] Add documentation for memory control of RocksDB state backend
URL: https://github.com/apache/flink/pull/10498#discussion_r355603402
##########
File path: docs/ops/state/large_state_tuning.md
##########
@@ -210,6 +210,28 @@ and not from the JVM. Any memory you assign to RocksDB will have to be accounted
 of the TaskManagers by the same amount. Not doing that may result in YARN/Mesos/etc terminating the JVM processes for allocating more memory than configured.
+#### Bound total memory usage of RocksDB instance(s) per slot
+
+RocksDB allocates native memory without control of JVM, and might lead the process to exceed total memory budget of the container to get killed in container environment (e.g. Kubernetes).
+From Flink-1.10, we provide a solution to limit total memory usage for RocksDb instance(s) per slot by leveraging RocksDB's mechanism to
+share [cache](https://github.com/facebook/rocksdb/wiki/Block-Cache) and [write buffer manager](https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager) among instance(s).
+Generally speaking, we mainly have three parts of memory usage for RocksDB in Flink scenario: block cache, index & bloom filters and memtables
+(refer to [memory-usage-in-rocksdb](https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB)).
+The basic idea is to share a `Cache` object with desired capacity among all RocksDB instances,
+and [cost memory used in memtable to that cache](https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager#cost-memory-used-in-memtable-to-block-cache) via write buffer manager.
+Besides, we also cache index & filters into that cache, then the major use of memory would be well capped.
+There exist two ways to enable this feature:
+ - Turn `state.backend.rocksdb.memory.managed` as true. If so, RocksDB state backend will use the managed memory budget of the task slot to set the capacity of that shared cache object.
+ - Configure the memory size of `state.backend.rocksdb.memory.fixed-per-slot` to set the fixed total amount of memory per slot.
+   This option will override `state.backend.rocksdb.memory.managed` option when configured.
+
+We also provide two parameters to tune the memory fraction of memtable and index & filters:
+ - `state.backend.rocksdb.memory.write-buffer-ratio`, by default `0.5`. If RocksDB memory bounded feature is turned on, 50% of memory size would be used by write buffer manager by default.
+ - `state.backend.rocksdb.memory.high-prio-pool-ratio`, by default `0.1`.
+   If RocksDB memory bounded feature is turned on, 10% 0f memory size would be set as high priority for index and filters in shared block cache by default.
+   By enabling this, index and filters would not need to compete against data blocks for staying in cache to minimize performance problem if those index and filters are evicted by data blocks frequently.
+   Moreover, we also pin L0 level filter and index into cache by default to mitigate performance problem, more details could refer to [RocksDB-doc](https://github.com/facebook/rocksdb/wiki/Block-Cache#caching-index-filter-and-compression-dictionary-blocks).

Review comment:
Another way to remove "we" so that it doesn't get repetitive.

```suggestion
   Moreover, the L0 level filter and index are pinned into the cache by default to mitigate performance problems. For more details, refer to the [RocksDB documentation](https://github.com/facebook/rocksdb/wiki/Block-Cache#caching-index-filter-and-compression-dictionary-blocks).
```
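
For readers following the diff, the options it documents could be set roughly as in the fragment below. This is only a sketch: the option keys are the ones named in the diff, the file is assumed to be the usual `flink-conf.yaml`, and the `512mb` value is a hypothetical budget chosen for illustration, not a recommendation.

```yaml
# Option A: size the shared RocksDB cache from the slot's managed memory budget.
state.backend.rocksdb.memory.managed: true

# Option B (hypothetical value): a fixed total amount per slot.
# Per the diff, this overrides the managed option when configured.
# state.backend.rocksdb.memory.fixed-per-slot: 512mb

# Tuning ratios, shown here at the defaults stated in the diff.
state.backend.rocksdb.memory.write-buffer-ratio: 0.5
state.backend.rocksdb.memory.high-prio-pool-ratio: 0.1
```

With the stated defaults, the text in the diff implies that about half of the bounded memory goes to the write buffer manager and about a tenth is reserved as the high-priority pool for index and filter blocks.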
