[
https://issues.apache.org/jira/browse/FLINK-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17004002#comment-17004002
]
Yun Tang commented on FLINK-15368:
----------------------------------
Updated progress:
* Why can the memory usage not be controlled well under the capacity?
** My previous experiments set
[optimizeForPointLookup|https://github.com/facebook/rocksdb/blob/e8263dbdaad0546c54bddd01a8454c2e750a86c2/java/src/main/java/org/rocksdb/ColumnFamilyOptionsInterface.java#L25]
by mistake, and that option caused more memory usage in the cache.
** Even without {{optimizeForPointLookup}}, the write buffer manager would
still exceed its capacity, because a write buffer is *not* flushed while the
active mutable write buffer size is less than half of the write buffer manager
capacity, please refer to the [code
here|https://github.com/facebook/rocksdb/blob/e8263dbdaad0546c54bddd01a8454c2e750a86c2/include/rocksdb/write_buffer_manager.h#L55].
In other words, if we set an LRUCache with a capacity of 400MB and a write
buffer manager with a capacity of 200MB, the memtables could grow to
200MB*1.5=300MB, which might lead us to exceed the total cache capacity.
** Actually, there is still another out-of-capacity risk, since we pin the L0
index & filter blocks in the block cache.
* Why would the Java process crash with a core dump if we set up the LRUCache
with {{strict_capacity_limit}} enabled?
** From my point of view, this is because charging write buffer memory usage to
a strictly limited cache is still not deterministic. RocksDB has no reasonable
fallback if inserting the dummy entries into the cache fails, and I have
already created an issue to report this problem:
[RocksDB-issue/6247|https://github.com/facebook/rocksdb/issues/6247]. Besides,
[RocksDB-pr/5175|https://github.com/facebook/rocksdb/pull/5175] could help to
reduce how far a single LRUCache shard exceeds its memory budget. (The cache
construction I used in these experiments is shown in a snippet below this
list.)
* What shall we do to mitigate the risk of running out of capacity?
** I'm afraid we cannot rely on a strictly capacity-limited cache in Flink 1.10
before its final release. Given the reasons described above for why the write
buffer could exceed its capacity, we could introduce a buffer space between the
off-heap memory capacity and the actual block cache capacity to account for the
extra half of the write buffer manager capacity. Users could also enlarge this
buffer space if extra memory is needed for pinned iterators or indexes. (A
rough sizing sketch is also attached below this list.)
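For reference, the strict-capacity cache in the crashing experiments was
constructed roughly as below via RocksJava (as available in our FRocksDB
build); the capacity and {{numShardBits}} values are only illustrative, not
settings we would ship.
{code:java}
import org.rocksdb.Cache;
import org.rocksdb.LRUCache;
import org.rocksdb.RocksDB;
import org.rocksdb.WriteBufferManager;

public class StrictCacheCrashSetup {
  public static void main(String[] args) {
    RocksDB.loadLibrary();
    // Roughly the failing setup: with strictCapacityLimit = true, the dummy-entry
    // inserts issued by the WriteBufferManager can fail once a shard is full, and
    // RocksDB has no graceful fallback for that failure (see RocksDB-issue/6247).
    final Cache strictCache = new LRUCache(
        400L << 20,   // capacity: 400 MB
        4,            // numShardBits (illustrative)
        true);        // strictCapacityLimit
    final WriteBufferManager writeBufferManager =
        new WriteBufferManager(200L << 20, strictCache);
  }
}
{code}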
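And here is a minimal sketch of the buffer-space idea, again via RocksJava; the
concrete numbers, class name and the pinning calls are assumptions for
illustration, not the final Flink configuration.
{code:java}
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Cache;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.DBOptions;
import org.rocksdb.LRUCache;
import org.rocksdb.RocksDB;
import org.rocksdb.WriteBufferManager;

public class SharedRocksDBMemorySketch {
  public static void main(String[] args) {
    RocksDB.loadLibrary();

    final long offHeapBudget = 400L << 20;        // total off-heap budget per slot, e.g. 400 MB
    final long wbmCapacity   = 200L << 20;        // write buffer manager capacity, e.g. 200 MB

    // Memtables may reach ~1.5x the write buffer manager capacity before a
    // flush is forced, so keep half of the WBM capacity as head room and hand
    // only the rest to the block cache.
    final long headRoom      = wbmCapacity / 2;               // 100 MB
    final long cacheCapacity = offHeapBudget - headRoom;      // 300 MB

    final Cache blockCache = new LRUCache(cacheCapacity);     // no strict_capacity_limit
    final WriteBufferManager writeBufferManager =
        new WriteBufferManager(wbmCapacity, blockCache);

    final DBOptions dbOptions = new DBOptions()
        .setCreateIfMissing(true)
        .setWriteBufferManager(writeBufferManager);

    // Index & filter blocks are cached (and the L0 ones pinned), which is the
    // extra out-of-capacity risk mentioned above.
    final BlockBasedTableConfig tableConfig = new BlockBasedTableConfig()
        .setBlockCache(blockCache)
        .setCacheIndexAndFilterBlocks(true)
        .setPinL0FilterAndIndexBlocksInCache(true);

    final ColumnFamilyOptions cfOptions = new ColumnFamilyOptions()
        .setTableFormatConfig(tableConfig);

    // (Opening the DB and closing the native handles is omitted here.)
  }
}
{code}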
Last but not least, I noticed an obvious performance regression when we enable
memory control for RocksDB. One main reason is that write buffers are flushed
much more frequently than before in a small-cache scenario (a 1GB TM only has
about 300MB of off-heap space, and with 4 slots per TM in the test the RocksDB
instances share a cache of less than 80MB per slot).
The root cause is that the
[arena_block_size|https://github.com/dataArtisans/frocksdb/blob/958f191d3f7276ae59b270f9db8390034d549ee0/db/column_family.cc#L196]
is a bit large when we share the cache among RocksDB instances. If we do not
configure it, the arena block size is set to 1/8 of the write buffer size. As
the default write buffer size is 64MB, RocksDB allocates memory in 64/8=8MB
arena blocks when needed. As the [write buffer manager
doc|https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager#limit-total-memory-of-memtables]
says: "In version 5.6 or higher, the memory is counted as total memory
allocated in arena, even if some of them may not yet be used by memtable."
With such a large arena block the memtables hit the limit much more easily,
even though their actual usage may only be several KBs. One possible solution
is to decrease the arena block size explicitly (see the sketch below).
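A minimal sketch of that possible solution via RocksJava; the 1MB value below
is just an example to illustrate the idea, the proper value still needs to be
figured out.
{code:java}
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.RocksDB;

public class SmallArenaBlockSketch {
  public static void main(String[] args) {
    RocksDB.loadLibrary();
    // Illustrative only: shrink the arena block so that the write buffer manager,
    // which counts whole allocated arena blocks, does not force flushes while the
    // actual memtable content is still only a few KBs.
    final long writeBufferSize = 64L << 20;      // default write buffer size: 64 MB
    final ColumnFamilyOptions cfOptions = new ColumnFamilyOptions()
        .setWriteBufferSize(writeBufferSize)
        .setArenaBlockSize(1L << 20);            // e.g. 1 MB instead of the implicit 64/8 = 8 MB
  }
}
{code}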
> Add end-to-end test for controlling RocksDB memory usage
> --------------------------------------------------------
>
> Key: FLINK-15368
> URL: https://issues.apache.org/jira/browse/FLINK-15368
> Project: Flink
> Issue Type: Sub-task
> Components: Runtime / State Backends
> Affects Versions: 1.10.0
> Reporter: Yu Li
> Assignee: Yun Tang
> Priority: Critical
> Fix For: 1.10.0
>
>
> We need to add an end-to-end test to make sure the RocksDB memory usage
> control works well, especially under the slot sharing case.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)