[jira] [Comment Edited] (FLINK-7289) Memory allocation of RocksDB can be problematic in container environments

Stefan Richter (JIRA) Tue, 01 Aug 2017 11:03:04 -0700

    [ https://issues.apache.org/jira/browse/FLINK-7289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16109411#comment-16109411 ]


Stefan Richter edited comment on FLINK-7289 at 8/1/17 6:01 PM:
---------------------------------------------------------------

Hi,

I understand that cache memory is not freed, and this is exactly what 
I would expect. I would also expect that the consumed cache memory will not go 
down as long as it stays below the cache memory available from the OS 
perspective. If this becomes a problem under YARN, it sounds like a problem 
of the cluster setup to me. Take this opinion with a grain of salt; I am not an 
expert on YARN or container setups. 

Then this should not be an end user problem, but also not a Flink problem. It 
sounds like an administrator and configuration problem to me. For example, this 
caching scenario should also apply to all other filesystem reads / writes and 
not only to RocksDB. Manually dropping OS file caches should never be 
required from any application or user, and if it is, this seems like fixing 
the symptoms of a different problem. We can try to find out more about the root 
problem and then maybe come up with better documentation for the setup and 
configuration. But I would disagree with introducing cache cleaning to Flink, 
because writing to {{/proc/sys/vm/drop_caches}} is an operation that requires 
root privileges and can affect the performance of other processes. From the 
Linux kernel documentation:

{quote}
drop_caches

Writing to this will cause the kernel to drop clean caches, as well as
reclaimable slab objects like dentries and inodes.  Once dropped, their
memory becomes free.

To free pagecache:
        echo 1 > /proc/sys/vm/drop_caches
To free reclaimable slab objects (includes dentries and inodes):
        echo 2 > /proc/sys/vm/drop_caches
To free slab objects and pagecache:
        echo 3 > /proc/sys/vm/drop_caches

This is a non-destructive operation and will not free any dirty objects.
To increase the number of objects freed by this operation, the user may run
`sync' prior to writing to /proc/sys/vm/drop_caches.  This will minimize the
number of dirty objects on the system and create more candidates to be
dropped.

This file is not a means to control the growth of the various kernel caches
(inodes, dentries, pagecache, etc...)  These objects are automatically
reclaimed by the kernel when memory is needed elsewhere on the system.

Use of this file can cause performance problems.  Since it discards cached
objects, it may cost a significant amount of I/O and CPU to recreate the
dropped objects, especially if they were under heavy use.  Because of this,
use outside of a testing or debugging environment is not recommended.
{quote}

Can you provide us with some more information, e.g. a detailed breakdown of the 
OS memory consumption after the process ended and the logs about the killed 
containers?
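
To make the requested breakdown concrete, here is a minimal sketch of how such numbers could be collected. It assumes a Linux host and the usual {{/proc/meminfo}} layout of "Name: value kB" lines; the function names {{parse_meminfo}} and {{memory_breakdown}} are hypothetical helpers, not part of Flink or any library. Treating Buffers plus Cached as the page cache is only a rough approximation.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain
    counts, and this sketch does not distinguish between the two.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_breakdown(fields):
    """Split total memory into free, page cache (approximate) and the rest."""
    total = fields.get("MemTotal", 0)
    free = fields.get("MemFree", 0)
    # Assumption: Buffers + Cached is used as a rough stand-in for page cache.
    page_cache = fields.get("Buffers", 0) + fields.get("Cached", 0)
    return {
        "total_kb": total,
        "free_kb": free,
        "page_cache_kb": page_cache,
        "other_kb": total - free - page_cache,
    }
```

Passing the contents of {{/proc/meminfo}}, read after the process has ended, to {{parse_meminfo}} and then to {{memory_breakdown}} gives a first impression of how much of the "used" memory is reclaimable cache rather than memory held by a process.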


was (Author: srichter):
Hi,

I understand that cache memory is not freed, and this is exactly what 
I would expect. I would also expect that the consumed cache memory will not go 
down as long as it stays below the cache memory available from the OS 
perspective. If this becomes a problem under YARN, it sounds like a problem 
of the cluster setup to me. Take this opinion with a grain of salt; I am not an 
expert on YARN or container setups. 

Then this should not be an end user problem, but also not a Flink problem. It 
sounds like an administrator and configuration problem to me. For example, this 
caching scenario should also apply to all other filesystem reads / writes and 
not only to RocksDB. Manually dropping OS file caches should never be 
required from any application or user, and if it is, this seems like fixing 
the symptoms of a different problem. We can try to find out more about the root 
problem and then maybe come up with better documentation for the setup and 
configuration. But I would disagree with introducing cache cleaning to Flink, 
because writing to {{/proc/sys/vm/drop_caches}} is an operation that requires 
root privileges and can affect the performance of other processes.

Can you provide us with some more information, e.g. a detailed breakdown of the 
OS memory consumption after the process ended and the logs about the killed 
containers?

> Memory allocation of RocksDB can be problematic in container environments
> -------------------------------------------------------------------------
>
>                 Key: FLINK-7289
>                 URL: https://issues.apache.org/jira/browse/FLINK-7289
>             Project: Flink
>          Issue Type: Improvement
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.2.0, 1.3.0, 1.4.0
>            Reporter: Stefan Richter
>
> Flink's RocksDB-based state backend allocates native memory. The amount of 
> memory allocated by RocksDB is not under the control of Flink or the JVM and 
> can (theoretically) grow without limits.
> In container environments, this can be problematic because the process can 
> exceed the memory budget of the container, and the process will get killed. 
> Currently, there is no other option than trusting RocksDB to be well behaved 
> and to follow its memory configurations. However, limiting RocksDB's memory 
> usage is not as easy as setting a single limit parameter. The memory limit is 
> determined by an interplay of several configuration parameters, which is 
> almost impossible to get right for users. Even worse, multiple RocksDB 
> instances can run inside the same process and make reasoning about the 
> configuration also dependent on the Flink job.
> Some information about the memory management in RocksDB can be found here:
> https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB
> https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide
> We should try to figure out ways to help users in one or more of the 
> following ways:
> - Some way to autotune or calculate the RocksDB configuration.
> - Conservative default values.
> - Additional documentation.
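
As an illustration of why the limit depends on several parameters and on the job, here is a rough back-of-the-envelope sketch. It assumes that each RocksDB instance has its own block cache and that memtable memory is bounded by write buffer size times the maximum number of write buffers per column family. The parameter names mirror the RocksDB options {{write_buffer_size}} and {{max_write_buffer_number}}, but the function {{estimate_rocksdb_memory}} itself is hypothetical, and it ignores index and filter blocks as well as blocks pinned by iterators, so the real usage can be higher.

```python
MIB = 1024 * 1024


def estimate_rocksdb_memory(num_instances, block_cache_bytes,
                            write_buffer_bytes, max_write_buffers,
                            column_families=1):
    """Rough estimate of block cache plus memtable memory, in bytes.

    Assumptions (not guarantees): one block cache per instance, and every
    column family may fill max_write_buffers memtables of write_buffer_bytes.
    Index/filter blocks and iterator-pinned blocks are not included.
    """
    memtables = write_buffer_bytes * max_write_buffers * column_families
    per_instance = block_cache_bytes + memtables
    return num_instances * per_instance


# Example: 4 instances in one process, 8 MiB block cache each,
# 64 MiB write buffers with at most 2 buffers per column family.
estimate = estimate_rocksdb_memory(4, 8 * MIB, 64 * MIB, 2)
print(estimate // MIB, "MiB")
```

The number of instances depends on the Flink job, which is why the same RocksDB settings can fit into a container for one job and exceed the budget for another.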



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
