carp84 commented on a change in pull request #10987: [FLINK-14495][docs] 
(EXTENDED) Add documentation for memory control of RocksDB state backend
URL: https://github.com/apache/flink/pull/10987#discussion_r373756867
 
 

 ##########
 File path: docs/ops/state/large_state_tuning.md
 ##########
 @@ -118,98 +118,70 @@ Other state like keyed state is still snapshotted 
asynchronously. Please note th
 The state storage workhorse of many large scale Flink streaming applications 
is the *RocksDB State Backend*.
 The backend scales well beyond main memory and reliably stores large [keyed 
state](../../dev/stream/state/state.html).
 
-Unfortunately, RocksDB's performance can vary with configuration, and there is 
little documentation on how to tune
-RocksDB properly. For example, the default configuration is tailored towards 
SSDs and performs suboptimal
-on spinning disks.
+RocksDB's performance can vary with configuration; this section outlines some best practices for tuning jobs that use the RocksDB State Backend.
 
-**Incremental Checkpoints**
+### Incremental Checkpoints
 
-Incremental checkpoints can dramatically reduce the checkpointing time in 
comparison to full checkpoints, at the cost of a (potentially) longer
-recovery time. The core idea is that incremental checkpoints only record all 
changes to the previous completed checkpoint, instead of
-producing a full, self-contained backup of the state backend. Like this, 
incremental checkpoints build upon previous checkpoints. Flink leverages
-RocksDB's internal backup mechanism in a way that is self-consolidating over 
time. As a result, the incremental checkpoint history in Flink
-does not grow indefinitely, and old checkpoints are eventually subsumed and 
pruned automatically.
+When it comes to reducing the time that checkpoints take, activating 
incremental checkpoints should be one of the first considerations.
+Incremental checkpoints can dramatically reduce the checkpointing time in comparison to full checkpoints, because incremental checkpoints only record the changes relative to the previously completed checkpoint, instead of producing a full, self-contained backup of the state backend.
 
-While we strongly encourage the use of incremental checkpoints for large 
state, please note that this is a new feature and currently not enabled 
-by default. To enable this feature, users can instantiate a 
`RocksDBStateBackend` with the corresponding boolean flag in the constructor 
set to `true`, e.g.:
+See [Incremental Checkpoints in RocksDB]({{ site.baseurl 
}}/ops/state/state_backends.html#incremental-checkpoints) for more background 
information.
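+
+For example, incremental checkpoints could be enabled cluster-wide in `flink-conf.yaml` (a minimal sketch, assuming the standard `state.backend.incremental` option; the same flag can also be set per job through the `RocksDBStateBackend` constructor):
+
+{% highlight yaml %}
+# use the RocksDB state backend and activate incremental checkpoints
+state.backend: rocksdb
+state.backend.incremental: true
+{% endhighlight %}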
 
-{% highlight java %}
-    RocksDBStateBackend backend =
-        new RocksDBStateBackend(filebackend, true);
-{% endhighlight %}
-
-**RocksDB Timers**
+### Timers in RocksDB or on JVM Heap
 
-For RocksDB, a user can chose whether timers are stored on the heap or inside 
RocksDB (default). Heap-based timers can have a better performance for smaller 
numbers of
-timers, while storing timers inside RocksDB offers higher scalability as the 
number of timers in RocksDB can exceed the available main memory (spilling to 
disk).
+Timers are stored in RocksDB by default, which is the more robust and scalable 
choice.
 
-When using RockDB as state backend, the type of timer storage can be selected 
through Flink's configuration via option key 
`state.backend.rocksdb.timer-service.factory`.
-Possible choices are `heap` (to store timers on the heap, default) and 
`rocksdb` (to store timers in RocksDB).
+When performance-tuning jobs that have only a few timers (no windows, not using timers in ProcessFunction), putting those timers on the heap can increase performance.
+Use this feature carefully, as heap-based timers may increase checkpointing times and naturally cannot scale beyond memory.
 
-<span class="label label-info">Note</span> *The combination RocksDB state 
backend with heap-based timers currently does NOT support asynchronous 
snapshots for the timers state.
-Other state like keyed state is still snapshotted asynchronously. Please note 
that this is not a regression from previous versions and will be resolved with 
`FLINK-10026`.*
+See [this section]({{ site.baseurl }}/ops/state/state_backends.html#timers-heap-vs-rocksdb) for details on how to configure heap-based timers.
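+
+As a rough sketch, switching timers to the heap via `flink-conf.yaml` could look like this (based on the `state.backend.rocksdb.timer-service.factory` option; only worthwhile when the number of timers is known to stay small):
+
+{% highlight yaml %}
+# store timers on the JVM heap instead of in RocksDB (default: rocksdb)
+state.backend.rocksdb.timer-service.factory: heap
+{% endhighlight %}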
 
-**Predefined Options**
+### Tuning RocksDB Memory
 
-Flink provides some predefined collections of option for RocksDB for different 
settings, and there existed two ways
-to pass these predefined options to RocksDB:
-  - Configure the predefined options through `flink-conf.yaml` via option key 
`state.backend.rocksdb.predefined-options`.
-    The default value of this option is `DEFAULT` which means 
`PredefinedOptions.DEFAULT`.
-  - Set the predefined options programmatically, e.g. 
`RocksDBStateBackend.setPredefinedOptions(PredefinedOptions.SPINNING_DISK_OPTIMIZED_HIGH_MEM)`.
+The performance of the RocksDB State Backend depends heavily on the amount of memory that it has available. Adding memory can improve performance a lot, as can adjusting which functions the memory goes to.
 
-We expect to accumulate more such profiles over time. Feel free to contribute 
such predefined option profiles when you
-found a set of options that work well and seem representative for certain 
workloads.
+By default, the RocksDB State Backend uses Flink's managed memory budget for RocksDB's buffers and caches (`state.backend.rocksdb.memory.managed: true`). Please refer to [RocksDB Memory Management]({{ site.baseurl }}/ops/state/state_backends.html#memory-management) for background on how that mechanism works.
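+
+A minimal sketch of the related settings in `flink-conf.yaml` (the fraction shown is just the default mentioned below; `taskmanager.memory.managed.fraction` is assumed to be the option that controls the managed memory share):
+
+{% highlight yaml %}
+# let RocksDB use Flink's managed memory budget (default)
+state.backend.rocksdb.memory.managed: true
+# fraction of total Flink memory used as managed memory (default 0.4);
+# increasing it gives RocksDB more memory on TaskManagers with multi-GB process sizes
+taskmanager.memory.managed.fraction: 0.4
+{% endhighlight %}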
 
-<span class="label label-info">Note</span> Predefined options which set 
programmatically would override the one configured via `flink-conf.yaml`.
+To tune memory-related performance issues, the following steps may be helpful:
 
-**Passing Options Factory to RocksDB**
-
-There existed two ways to pass options factory to RocksDB in Flink:
+  - The first step to try and increase performance should be to increase the amount of managed memory. This usually improves the situation a lot, without opening up the complexity of tuning low-level RocksDB options.
+    
+    Especially with large container/process sizes, much of the total memory 
can typically go to RocksDB, unless the application logic requires a lot of JVM 
heap itself. The default managed memory fraction *(0.4)* is conservative and 
can often be increased when using TaskManagers with multi-GB process sizes.
 
-  - Configure options factory through `flink-conf.yaml`. You could set the 
options factory class name via option key 
`state.backend.rocksdb.options-factory`.
-    The default value for this option is 
`org.apache.flink.contrib.streaming.state.DefaultConfigurableOptionsFactory`, 
and all candidate configurable options are defined in 
`RocksDBConfigurableOptions`.
-    Moreover, you could also define your customized and configurable options 
factory class like below and pass the class name to 
`state.backend.rocksdb.options-factory`.
+  - The number of write buffers in RocksDB depends on the number of states you 
have in your application (states across all operators in the pipeline). Each 
state corresponds to one ColumnFamily, which needs its own write buffers. 
Hence, applications with many states typically need more memory for the same 
performance.
 
-    {% highlight java %}
+  - You can try and compare the performance of RocksDB with managed memory to RocksDB with per-column-family memory by setting `state.backend.rocksdb.memory.managed: false`. This can be especially useful for testing against a baseline (assuming no or generous container memory limits) or for testing for regressions compared to earlier versions of Flink.
+  
+    Compared to the managed memory setup (constant memory pool), not using 
managed memory means that RocksDB allocates memory proportional to the number 
of states in the application (memory footprint changes with application 
changes). As a rule of thumb, the non-managed mode uses (unless ColumnFamily 
options are applied) roughly "140MB * num-states-across-all-tasks * num-slots". 
Timers count as state as well!
 
 Review comment:
  Does `140MB` come from the default `max_write_buffer_number * write_buffer_size + arena_block_size`, say `2 * 64MB + 8MB = 136MB`? If so, this is roughly the **_upper bound_** of the memory usage, to be explicit.
