carp84 commented on a change in pull request #10498: [FLINK-14495][docs] Add 
documentation for memory control of RocksDB state backend
URL: https://github.com/apache/flink/pull/10498#discussion_r373383683
 
 

 ##########
 File path: docs/ops/state/large_state_tuning.md
 ##########
 @@ -210,6 +211,71 @@ and not from the JVM. Any memory you assign to RocksDB 
will have to be accounted
 of the TaskManagers by the same amount. Not doing that may result in 
YARN/Mesos/etc terminating the JVM processes for
 allocating more memory than configured.
 
+### Bounding RocksDB Memory Usage
+
+RocksDB allocates native memory outside of the JVM, which can cause the process to exceed its total memory budget.
+This is especially problematic in containerized environments such as Kubernetes, which kill processes that exceed their memory budgets.
+By default, Flink limits the total memory usage of the RocksDB instance(s) per slot by having all instances in a single slot share one [cache](https://github.com/facebook/rocksdb/wiki/Block-Cache)
+and one [write buffer manager](https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager).
+The shared cache places an upper limit on the [three components](https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB) that consume the majority of memory
+when RocksDB is used as a state backend: block cache, index and bloom filters, and MemTables.
+This feature is enabled by default and can be controlled in two ways:
+  -  Integrate with the managed memory of the task manager: set `state.backend.rocksdb.memory.managed` to `true`. The RocksDB state backend then uses the managed memory budget of the task slot as the capacity of the shared cache.
+  This is the default, which means Flink always tries to integrate RocksDB memory usage with managed memory first.
+  -  Do not integrate with managed memory: configure `state.backend.rocksdb.memory.fixed-per-slot` to set a fixed total amount of memory per slot.
+  When configured, this option overrides `state.backend.rocksdb.memory.managed` and ignores the managed memory calculated per slot by the task manager.
+  Users should also configure `taskmanager.memory.task.off-heap.size` to reserve an additional off-heap quota, equal to `taskmanager.numberOfTaskSlots` * `state.backend.rocksdb.memory.fixed-per-slot`, so that this usage fits into Flink's memory model.
+
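The two modes above can be expressed in `flink-conf.yaml`. This is a hedged sketch: the option keys come from the text above, while the concrete sizes (4 slots, 512mb per slot) are made-up example values.

```yaml
# Mode 1 (default): bound RocksDB by the slot's managed memory budget.
state.backend.rocksdb.memory.managed: true

# Mode 2: a fixed budget per slot instead (overrides the managed-memory mode).
# state.backend.rocksdb.memory.fixed-per-slot: 512mb
# With 4 slots, reserve 4 * 512mb of off-heap memory to fit Flink's memory model:
# taskmanager.numberOfTaskSlots: 4
# taskmanager.memory.task.off-heap.size: 2048mb
```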
+Flink also provides two parameters to tune the memory fractions for MemTables and index & filters:
+  - `state.backend.rocksdb.memory.write-buffer-ratio`, `0.5` by default. If the RocksDB memory bounding feature is enabled, 50% of the memory budget is used by the write buffer manager by default.
+  - `state.backend.rocksdb.memory.high-prio-pool-ratio`, `0.1` by default.
+  If the RocksDB memory bounding feature is enabled, 10% of the memory budget is reserved as a high-priority pool for index and filter blocks in the shared block cache by default.
+  With this pool, index and filter blocks do not have to compete with data blocks to stay in the cache, which avoids the performance problems caused by index and filter blocks being evicted frequently.
+  Moreover, the L0 level filter and index blocks are pinned into the cache by default to further mitigate performance problems;
+  see the [RocksDB documentation](https://github.com/facebook/rocksdb/wiki/Block-Cache#caching-index-filter-and-compression-dictionary-blocks) for more details.
+
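As a hedged, illustrative calculation of how the two ratios divide a slot's budget (the class and method names below are made up for illustration, not Flink API):

```java
// Hypothetical helper illustrating how the default ratios split a slot's RocksDB budget.
public class RocksDBBudgetSplit {

    // Portion reserved for the write buffer manager (MemTables).
    static long writeBufferBytes(long totalBytes, double writeBufferRatio) {
        return (long) (totalBytes * writeBufferRatio);
    }

    // Portion reserved as the high-priority pool for index and filter blocks.
    static long highPriorityBytes(long totalBytes, double highPrioPoolRatio) {
        return (long) (totalBytes * highPrioPoolRatio);
    }

    public static void main(String[] args) {
        long perSlot = 512L * 1024 * 1024; // e.g. 512 MB of managed memory per slot
        System.out.println(writeBufferBytes(perSlot, 0.5));  // 268435456 bytes (256 MB) for MemTables
        System.out.println(highPriorityBytes(perSlot, 0.1)); // 53687091 bytes (~51 MB) for index/filters
    }
}
```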
+<span class="label label-info">Note</span> The shared `cache` and `write buffer manager` override any custom block cache and write buffer settings configured via `PredefinedOptions` or `OptionsFactory`.
+
+#### Tuning performance when bounding RocksDB memory usage
+
+A performance regression compared with the previous unlimited-memory behavior may occur if a slot holds too much state.
+If you observe this and your jobs neither run in a containerized environment nor need to care about exceeding the memory budget,
+the easiest way to eliminate the regression is to disable the memory bound for RocksDB, i.e. set `state.backend.rocksdb.memory.managed` to `false`.
+Otherwise, you need to increase the memory available to RocksDB:
+  - Increase the managed memory size via `taskmanager.memory.managed.size`, or the fraction via `taskmanager.memory.managed.fraction`.
+  - Increase `state.backend.rocksdb.memory.fixed-per-slot`, which is not integrated with the task manager's managed memory, and increase `taskmanager.memory.task.off-heap.size` accordingly.
+
+Apart from increasing total memory, you can also tune some RocksDB options:
+  - Decrease the arena block size, which defaults to 1/8 of the write-buffer size. Since the write buffer manager reserves memory at the granularity of an arena block, a smaller block size lowers the chance of a mutable MemTable being turned into an immutable one prematurely.
+  - Increase the maximum number of background flush threads for each DB instance. This flushes immutable MemTables as fast as possible so that writes are not blocked or stalled.
+  Users can tune these with an options factory such as the one below:
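A hedged, untested sketch of what such an options factory might look like, assuming the pre-1.10 `OptionsFactory` interface from `org.apache.flink.contrib.streaming.state` and the RocksDB JNI setters `setMaxBackgroundFlushes` / `setArenaBlockSize`; the interface signatures may differ between Flink versions, and the concrete values are placeholders, not recommendations:

```java
import org.apache.flink.contrib.streaming.state.OptionsFactory;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.DBOptions;

public class TunedRocksDBOptionsFactory implements OptionsFactory {

    @Override
    public DBOptions createDBOptions(DBOptions currentOptions) {
        // More flush threads drain immutable MemTables faster, reducing write stalls.
        return currentOptions.setMaxBackgroundFlushes(4);
    }

    @Override
    public ColumnFamilyOptions createColumnOptions(ColumnFamilyOptions currentOptions) {
        // A smaller arena block makes the write buffer manager's accounting finer-grained,
        // lowering the chance that a mutable MemTable is flipped to immutable prematurely.
        return currentOptions.setArenaBlockSize(4 * 1024 * 1024); // 4 MB; default is writeBufferSize / 8
    }
}
```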
 
 Review comment:
  The information given here is all RocksDB internals. As a user, I could hardly tell in which cases I should decrease the arena block size or increase the background flush threads.
  
  What's more, why do these two configurations matter most? Why not tune other configurations? Are we giving this suggestion based on our own experimental case? Does it fit all cases?
   
  I insist that giving recommendations on tuning Flink settings instead of RocksDB ones would help more.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
