[
https://issues.apache.org/jira/browse/FLINK-16267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17044401#comment-17044401
]
Xintong Song commented on FLINK-16267:
--------------------------------------
Another possible cause:
AFAIK, due to a limitation in RocksDB, Flink cannot fully guarantee that the
memory consumption of RocksDBStateBackend stays within the managed memory size.
The current approach relies on a conservative approximation, which is expected
to work well in most cases.
That means RocksDBStateBackend may use more memory than the configured managed
memory, which could also be the cause of this issue.
[~liyu], [~yunta], is there a quick way to verify whether RocksDB memory
consumption has exceeded the managed memory size?
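
As a rough illustration of the budget involved, the sketch below splits the
reported `taskmanager.memory.process.size: 4096m` into its components. The
fractions and bounds are assumptions taken from what I understand to be the
Flink 1.10 defaults (JVM overhead 10% clamped to 192m..1g, managed memory 40%
and network 10% of total Flink memory, 128m each for framework heap and
framework off-heap); only the 4096m process size and the 128m metaspace come
from the report. The function name is hypothetical and this is not Flink's
actual implementation.

```python
def clamp(value, low, high):
    """Restrict value to the inclusive range [low, high]."""
    return max(low, min(value, high))


def flink_1_10_memory_budget(process_mb, metaspace_mb):
    """Approximate TaskManager memory breakdown in MiB.

    All fractions and bounds below are ASSUMED Flink 1.10 defaults,
    not values read from a running cluster.
    """
    # JVM overhead: 10% of process size, clamped to [192m, 1g] (assumed).
    jvm_overhead = clamp(0.1 * process_mb, 192.0, 1024.0)
    # Total Flink memory is what remains after metaspace and JVM overhead.
    total_flink = process_mb - jvm_overhead - metaspace_mb
    # Managed memory: 40% of total Flink memory (assumed). With RocksDB and
    # the default state.backend.rocksdb.memory.managed, this is the budget
    # RocksDB is expected to stay within.
    managed = 0.4 * total_flink
    # Network buffers: 10% of total Flink memory, clamped to [64m, 1g] (assumed).
    network = clamp(0.1 * total_flink, 64.0, 1024.0)
    framework_heap = 128.0      # assumed default
    framework_off_heap = 128.0  # assumed default
    task_heap = (total_flink - managed - network
                 - framework_heap - framework_off_heap)
    return {
        "jvm_overhead": jvm_overhead,
        "jvm_metaspace": float(metaspace_mb),
        "managed": managed,
        "network": network,
        "framework_heap": framework_heap,
        "framework_off_heap": framework_off_heap,
        "task_heap": task_heap,
    }


# Values from the report: process size 4096m, metaspace 128m.
budget = flink_1_10_memory_budget(4096, metaspace_mb=128)
for name, size in budget.items():
    print(f"{name:>20}: {size:8.1f} MiB")
```

Under these assumptions the managed memory comes out at roughly 1423 MiB.
Since the container limit (4096Mi) equals the process size, there is no
headroom: any RocksDB usage beyond the managed budget that the JVM overhead
does not absorb would push the container over its limit.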
> Flink uses more memory than taskmanager.memory.process.size in Kubernetes
> -------------------------------------------------------------------------
>
> Key: FLINK-16267
> URL: https://issues.apache.org/jira/browse/FLINK-16267
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Task
> Affects Versions: 1.10.0
> Reporter: ChangZhuo Chen (陳昌倬)
> Priority: Major
> Attachments: oomkilled_taskmanager.log
>
>
> This issue is from
> [https://stackoverflow.com/questions/60336764/flink-uses-more-memory-than-taskmanager-memory-process-size-in-kubernetes]
> h1. Description
> * In Flink 1.10.0, we use `taskmanager.memory.process.size` to limit the
> resources used by the taskmanagers so that they are not killed by Kubernetes.
> However, many taskmanagers still get `OOMKilled`. The setup is described in
> the following sections.
> * The taskmanager log is in the attachment [^oomkilled_taskmanager.log].
> h2. Kubernetes
> * The Kubernetes setup is the same as described in
> [https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/deployment/kubernetes.html].
> * The following is the resource configuration for the taskmanager deployment
> in Kubernetes:
> {{resources:}}
> {{  requests:}}
> {{    cpu: 1000m}}
> {{    memory: 4096Mi}}
> {{  limits:}}
> {{    cpu: 1000m}}
> {{    memory: 4096Mi}}
> h2. Flink Docker
> * The Flink Docker image is built from the following Dockerfile.
> {{FROM flink:1.10-scala_2.11}}
> {{RUN mkdir -p /opt/flink/plugins/s3 && \}}
> {{ ln -s /opt/flink/opt/flink-s3-fs-presto-1.10.0.jar /opt/flink/plugins/s3/}}
> {{RUN ln -s /opt/flink/opt/flink-metrics-prometheus-1.10.0.jar /opt/flink/lib/}}
> h2. Flink Configuration
> * The following are all memory-related settings in `flink-conf.yaml` in
> 1.10.0:
> {{jobmanager.heap.size: 820m}}
> {{taskmanager.memory.jvm-metaspace.size: 128m}}
> {{taskmanager.memory.process.size: 4096m}}
> * We use RocksDB and we don't set `state.backend.rocksdb.memory.managed` in
> `flink-conf.yaml`.
> ** Use S3 as checkpoint storage.
> * The code uses the DataStream API.
> ** Input and output are both Kafka.
> h2. Project Dependencies
> * The following are our dependencies.
> {{val flinkVersion = "1.10.0"}}
> {{libraryDependencies += "com.squareup.okhttp3" % "okhttp" % "4.2.2"}}
> {{libraryDependencies += "com.typesafe" % "config" % "1.4.0"}}
> {{libraryDependencies += "joda-time" % "joda-time" % "2.10.5"}}
> {{libraryDependencies += "org.apache.flink" %% "flink-connector-kafka" %
> flinkVersion}}
> {{libraryDependencies += "org.apache.flink" % "flink-metrics-dropwizard" %
> flinkVersion}}
> {{libraryDependencies += "org.apache.flink" %% "flink-scala" % flinkVersion
> % "provided"}}
> {{libraryDependencies += "org.apache.flink" %% "flink-statebackend-rocksdb"
> % flinkVersion % "provided"}}
> {{libraryDependencies += "org.apache.flink" %% "flink-streaming-scala" %
> flinkVersion % "provided"}}
> {{libraryDependencies += "org.json4s" %% "json4s-jackson" % "3.6.7"}}
> {{libraryDependencies += "org.log4s" %% "log4s" % "1.8.2"}}
> {{libraryDependencies += "org.rogach" %% "scallop" % "3.3.1"}}
> h2. Previous Flink 1.9.1 Configuration
> * The configuration we used with Flink 1.9.1 is the following. It did not
> result in `OOMKilled`.
> h3. Kubernetes
> {{resources:}}
> {{  requests:}}
> {{    cpu: 1200m}}
> {{    memory: 2G}}
> {{  limits:}}
> {{    cpu: 1500m}}
> {{    memory: 2G}}
> h3. Flink 1.9.1
> {{jobmanager.heap.size: 820m}}
> {{taskmanager.heap.size: 1024m}}