Grzegorz Liter created FLINK-38212:
--------------------------------------

             Summary: OOM during savepoint caused by a potential memory leak in RocksDB related to jemalloc
                 Key: FLINK-38212
                 URL: https://issues.apache.org/jira/browse/FLINK-38212
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
    Affects Versions: 1.20.2, 2.1.0
         Environment: Flink 2.1.0 running in Application mode with Flink 
Operator 1.12.1.

Memory and savepoint related settings:
{code:java}
env.java.opts.taskmanager: '-XX:+UnlockExperimentalVMOptions -XX:+UseStringDeduplication
  -XX:+AlwaysPreTouch -XX:G1HeapRegionSize=16m
  -Xlog:gc*:file=/tmp/gc.log:time,uptime,level,tags
  -XX:SurvivorRatio=6 -XX:G1NewSizePercent=40'
execution.checkpointing.max-concurrent-checkpoints: "1"
execution.checkpointing.snapshot-compression: "true"
fs.s3a.aws.credentials.provider: 
com.amazonaws.auth.WebIdentityTokenCredentialsProvider
fs.s3a.block.size: 
fs.s3a.experimental.input.fadvise: sequential
fs.s3a.path.style.access: "true" 
state.backend.incremental: "true"
state.backend.type: rocksdb 
state.checkpoints.dir: s3p://bucket/checkpoints
state.savepoints.dir: s3p://bucket/savepoints 
taskmanager.memory.jvm-overhead.fraction: "0.1"
taskmanager.memory.jvm-overhead.max: 6g
taskmanager.memory.managed.fraction: "0.4"
taskmanager.memory.network.fraction: "0.05"
taskmanager.network.memory.buffer-debloat.enabled: "true"
taskmanager.numberOfTaskSlots: "12"
...
resource:
  memory: 16g{code}
 
            Reporter: Grzegorz Liter


I am running a job with a snapshot size of about ~17 GB, with compression enabled. I have observed that savepoints often fail because the TM gets killed by Kubernetes for exceeding the memory limit on a pod that had a 30 GB memory limit assigned.

Neither Flink metrics nor the detailed native memory report taken with `jcmd <PID> VM.native_memory detail` indicate any unusual memory increase. The consumed memory is visible only in Kubernetes metrics and in RSS.
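
For reference, a minimal sketch of how such a report can be captured (it assumes Native Memory Tracking is enabled via the JVM flag below, which is not part of the settings listed in the environment section):
{code:java}
# Add to env.java.opts.taskmanager (NMT adds some overhead):
-XX:NativeMemoryTracking=detail

# Then, inside the TM container, against the TaskManager JVM PID:
jcmd <PID> VM.native_memory detail
jcmd <PID> VM.native_memory baseline
jcmd <PID> VM.native_memory detail.diff   # compare against the baseline after triggering a savepoint
{code}
Since the growth shows up only in RSS, native allocations made outside the JVM (e.g. by RocksDB through the allocator) would not appear in this report, which is consistent with the observation above.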

When enough memory is set (plus potentially enough JVM overhead) to leave some breathing room, one snapshot can be taken, but taking subsequent full snapshots reliably leads to OOM.
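
For context, the breathing room mentioned above comes from the JVM overhead budget already shown in the environment section; a sketch of how it can be enlarged (the values below are illustrative, not the ones that were actually tested):
{code:java}
taskmanager.memory.jvm-overhead.fraction: "0.2"   # default is 0.1
taskmanager.memory.jvm-overhead.max: 8g           # raise the cap so the larger fraction can take effect
{code}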

This documentation: 
[switching-the-memory-allocator|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/standalone/docker/#switching-the-memory-allocator]
 led me to try
{code:java}
MALLOC_ARENA_MAX=1
DISABLE_JEMALLOC=true {code}
This configuration made savepoints reliably pass without OOM. I tried setting only one of the two options at a time, but that did not fix the issue.
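
For completeness, a minimal sketch of how these environment variables can be passed to the TaskManager containers when using the Flink Kubernetes Operator (the podTemplate approach below is an assumption about the deployment setup, not necessarily the exact change that was applied):
{code:java}
spec:
  podTemplate:
    spec:
      containers:
        - name: flink-main-container
          env:
            - name: DISABLE_JEMALLOC   # tells the Flink docker entrypoint not to preload jemalloc
              value: "true"
            - name: MALLOC_ARENA_MAX   # limits the number of glibc malloc arenas once glibc is in use
              value: "1"
{code}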

I also tried scaling the pod down to 16 GB of memory: with these options the savepoint was reliably created without any issue, while without them every savepoint fails.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
