Juha Mynttinen created FLINK-19238:
--------------------------------------
Summary: RocksDB performance issue with low managed memory and
high parallelism
Key: FLINK-19238
URL: https://issues.apache.org/jira/browse/FLINK-19238
Project: Flink
Issue Type: Improvement
Affects Versions: 1.11.1
Reporter: Juha Mynttinen
h2. The issue
When using {{RocksDBStateBackend}}, it's possible to configure RocksDB so that
it flushes the active memtable almost constantly, causing high IO and CPU usage.
This happens because the following check is essentially always true:
[https://github.com/dataArtisans/frocksdb/blob/49bc897d5d768026f1eb816d960c1f2383396ef4/include/rocksdb/write_buffer_manager.h#L47].
h2. Reproducing the issue
To reproduce the issue, the following needs to happen:
* Use the RocksDB state backend
* Use managed memory
* Have a "low" managed memory size
* Have a "high" parallelism (e.g. 5) OR enough operators (the exact count is
unknown)
The easiest way to do all of this is to use
{{StreamExecutionEnvironment.createLocalEnvironment}}, create a simple Flink
job, and set the parallelism "high enough". Nothing else is needed.
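To make this concrete, here's a minimal sketch of such a job. The managed memory size, checkpoint path, and stateful logic below are arbitrary examples (and whether the local MiniCluster honors that exact memory key can depend on the setup), not values from a real workload:
{code:java}
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbFlushRepro {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "Low" managed memory; 64m is an arbitrary example value.
        conf.setString("taskmanager.memory.managed.size", "64m");

        // "High" parallelism, as described above.
        StreamExecutionEnvironment env =
            StreamExecutionEnvironment.createLocalEnvironment(5, conf);
        env.setStateBackend(new RocksDBStateBackend("file:///tmp/checkpoints"));

        env.generateSequence(0, Long.MAX_VALUE)
            .keyBy(value -> value % 1_000)
            .map(new RichMapFunction<Long, Long>() {
                private transient ValueState<Long> sum;

                @Override
                public void open(Configuration parameters) {
                    sum = getRuntimeContext().getState(
                        new ValueStateDescriptor<>("sum", Long.class));
                }

                @Override
                public Long map(Long value) throws Exception {
                    // Keep a running sum per key so RocksDB actually holds state.
                    Long current = sum.value();
                    long next = (current == null ? 0L : current) + value;
                    sum.update(next);
                    return next;
                }
            })
            .print();

        env.execute("RocksDB constant flush repro");
    }
}
{code}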
h2. Background
Arena memory block size is by default 1/8 of the memtable size
[https://github.com/dataArtisans/frocksdb/blob/49bc897d5d768026f1eb816d960c1f2383396ef4/db/column_family.cc#L196].
When the memtable holds any data, it consumes at least one arena block. The
arena block size will be higher than the "mutable limit". The mutable limit is
calculated from the shared write buffer manager size; having low managed memory
and high parallelism pushes the mutable limit to too low a value.
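To illustrate with back-of-the-envelope numbers (these are made-up values, and the exact constants such as the 7/8 factor depend on the RocksDB/FRocksDB version, so treat them as assumptions):
{code:java}
public class ArenaBlockVsMutableLimit {
    public static void main(String[] args) {
        // Made-up example values, not measured from a real setup.
        long managedMemoryPerSlot   = 16L << 20;   // 16 MB of managed memory for one slot
        double writeBufferRatio     = 0.5;         // state.backend.rocksdb.memory.write-buffer-ratio (default 0.5)
        long writeBufferManagerSize = (long) (managedMemoryPerSlot * writeBufferRatio); // 8 MB

        // The mutable limit is a fixed fraction of the write buffer manager size
        // (roughly 7/8 in the sources linked above, if I read them correctly).
        long mutableLimit = writeBufferManagerSize * 7 / 8;        // ~7 MB

        // RocksDB's default write buffer (memtable) size is 64 MB, and the arena
        // block size defaults to 1/8 of that.
        long defaultWriteBufferSize = 64L << 20;
        long arenaBlockSize         = defaultWriteBufferSize / 8;  // 8 MB

        // A memtable holding any data pins at least one arena block, so the very
        // first arena block already exceeds the mutable limit and the flush check
        // stays true more or less permanently.
        System.out.println("flushes constantly: " + (arenaBlockSize > mutableLimit)); // true
    }
}
{code}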
h2. Documentation
In the docs
([https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/large_state_tuning.html]):
"An advanced option (expert mode) to reduce the number of MemTable flushes in
setups with many states, is to tune RocksDB’s ColumnFamily options (arena block
size, max background flush threads, etc.) via a RocksDBOptionsFactory".
This snippet in the docs is probably talking about the issue I'm witnessing. I
think there are two issues here:
1) It's hard/impossible to know what kind of performance one can expect from a
Flink application. Thus, it's hard to know whether one is suffering e.g. from
this performance issue, or whether the system is performing normally (and is
inherently slow).
2) Even if one suspects a performance issue, it's very hard to find its root
cause (the memtable flush happening frequently). To find this out, one would
need to know what the normal flush frequency is.
Also, the doc says "in setups with many states". The same problem is hit when
using just one state but "high" parallelism (e.g. 5).
If the arena block size _ever_ needs to be configured only to "fix" this
issue, it would be best if there _never_ were a need to modify the arena block
size in the first place. What if we stopped mentioning the arena block size in
the docs altogether and focused on the managed memory size, since the managed
memory size is something the user does tune. (A sketch of the options-factory
workaround is included below just for reference.)
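For reference, the "expert mode" workaround the docs hint at would look roughly like the factory below; the 1 MB arena block size is an arbitrary example value:
{code:java}
import java.util.Collection;

import org.apache.flink.contrib.streaming.state.RocksDBOptionsFactory;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.DBOptions;

public class SmallArenaBlockOptionsFactory implements RocksDBOptionsFactory {

    @Override
    public DBOptions createDBOptions(
            DBOptions currentOptions, Collection<AutoCloseable> handlesToClose) {
        return currentOptions;
    }

    @Override
    public ColumnFamilyOptions createColumnOptions(
            ColumnFamilyOptions currentOptions, Collection<AutoCloseable> handlesToClose) {
        // Shrink the arena block size so that a single block no longer exceeds
        // the mutable limit; 1 MB is just an illustrative value.
        return currentOptions.setArenaBlockSize(1L << 20);
    }
}
{code}
The factory would then be registered e.g. via {{RocksDBStateBackend#setRocksDBOptions}} (or, if I remember correctly, the {{state.backend.rocksdb.options-factory}} option). But as argued above, ideally nobody should have to write this.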
h2. The proposed fix
The proposed fix is to log the issue on WARN level and tell the user clearly
what is happening and how to fix it.
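A minimal sketch of what such a warning could look like; the helper name, the place it would be called from, and the wording are all hypothetical, not existing Flink code:
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ArenaBlockSanityCheck {

    private static final Logger LOG = LoggerFactory.getLogger(ArenaBlockSanityCheck.class);

    /** Hypothetical helper; it would be called wherever the RocksDB memory sizes are computed. */
    public static void warnIfFlushingConstantly(long arenaBlockSize, long mutableLimit) {
        if (arenaBlockSize > mutableLimit) {
            LOG.warn(
                "RocksDB arena block size ({} bytes) exceeds the mutable write buffer limit ({} bytes). "
                    + "Every memtable will be flushed almost immediately, causing high IO and CPU usage. "
                    + "Increase the managed memory size, or lower the arena block size via a RocksDBOptionsFactory.",
                arenaBlockSize, mutableLimit);
        }
    }
}
{code}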
--
This message was sent by Atlassian Jira
(v8.3.4#803005)