[
https://issues.apache.org/jira/browse/FLINK-16267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17045095#comment-17045095
]
Yu Li edited comment on FLINK-16267 at 2/26/20 2:44 AM:
--------------------------------------------------------
[~czchen] Thanks for the quick response. Could you set
{{state.backend.rocksdb.memory.managed: false}} in your 1.10.0 yaml and check
whether the issue remains? This would help us judge whether the problem lies
in the RocksDB memory management introduced in 1.10.0. If it does, we will
give more suggestions on how to locate the root cause. Thanks.
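For reference, a minimal sketch of the test configuration, keeping the memory settings from the attached flink-conf_1.10.0.yaml unchanged:
{code}
# existing settings from the attached flink-conf_1.10.0.yaml
taskmanager.memory.process.size: 4096m
taskmanager.memory.jvm-metaspace.size: 128m
# add this line for the test run to take RocksDB out of managed memory
state.backend.rocksdb.memory.managed: false
{code}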
Besides, two questions about the configuration:
# From both the K8S resource spec and the yaml configuration we can see that the
memory set for the TM increased from 2GB to 4GB. Could you share the reason behind
this? Is it because of the `OOMKilled` issue and you tried to resolve it by
increasing the memory setting? Or has the job parallelism been changed
accordingly (reduced to half of its previous value)?
# For the 1.9.1 yaml settings, the description says {{taskmanager.heap.size}}
was set to 1024m, while the attached yaml file shows 2000m. Could you double-check
and confirm which one is accurate?
Thanks.
> Flink uses more memory than taskmanager.memory.process.size in Kubernetes
> -------------------------------------------------------------------------
>
> Key: FLINK-16267
> URL: https://issues.apache.org/jira/browse/FLINK-16267
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Task
> Affects Versions: 1.10.0
> Reporter: ChangZhuo Chen (陳昌倬)
> Priority: Major
> Attachments: flink-conf_1.10.0.yaml, flink-conf_1.9.1.yaml,
> oomkilled_taskmanager.log
>
>
> This issue is from
> [https://stackoverflow.com/questions/60336764/flink-uses-more-memory-than-taskmanager-memory-process-size-in-kubernetes]
> h1. Description
> * In Flink 1.10.0, we try to use `taskmanager.memory.process.size` to limit
> the resources used by the taskmanager so that it is not killed by Kubernetes.
> However, we still get lots of taskmanager `OOMKilled` events. The setup is in
> the following sections.
> * The taskmanager log is in attachment [^oomkilled_taskmanager.log].
> h2. Kubernetes
> * The Kubernetes setup is the same as described in
> [https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/deployment/kubernetes.html].
> * The following is the resource configuration for the taskmanager deployment
> in Kubernetes:
> {{resources:}}
> {{ requests:}}
> {{ cpu: 1000m}}
> {{ memory: 4096Mi}}
> {{ limits:}}
> {{ cpu: 1000m}}
> {{ memory: 4096Mi}}
> h2. Flink Docker
> * The Flink docker image is built with the following Dockerfile.
> {{FROM flink:1.10-scala_2.11}}
> {{RUN mkdir -p /opt/flink/plugins/s3 && \}}
> {{    ln -s /opt/flink/opt/flink-s3-fs-presto-1.10.0.jar /opt/flink/plugins/s3/}}
> {{RUN ln -s /opt/flink/opt/flink-metrics-prometheus-1.10.0.jar /opt/flink/lib/}}
> h2. Flink Configuration
> * The following are all memory-related configurations in `flink-conf.yaml`
> in 1.10.0 (a rough breakdown of the 4096m process size is sketched after this
> list):
> {{jobmanager.heap.size: 820m}}
> {{taskmanager.memory.jvm-metaspace.size: 128m}}
> {{taskmanager.memory.process.size: 4096m}}
> * We use RocksDB, and we don't set `state.backend.rocksdb.memory.managed` in
> `flink-conf.yaml`.
> ** We use S3 as checkpoint storage.
> * The code uses the DataStream API.
> ** Input and output are both Kafka.
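> For context, a rough, back-of-the-envelope sketch of how the configured 4096m
> process size would be divided under the Flink 1.10 default fractions (the
> figures below assume the defaults and are not taken from the attached logs):
> {code}
> # taskmanager.memory.process.size: 4096m, split roughly as follows
> # (assuming 1.10 defaults: jvm-overhead 0.1, network 0.1, managed 0.4)
> #   jvm-overhead        ~410m   (0.1 * 4096m, capped between 192m and 1g)
> #   jvm-metaspace        128m   (explicitly configured)
> #   total Flink memory ~3558m   (4096m - 410m - 128m)
> #     network           ~356m   (0.1 of total Flink memory)
> #     managed          ~1423m   (0.4 of total Flink memory; the budget RocksDB
> #                                uses when state.backend.rocksdb.memory.managed
> #                                is true)
> #     framework heap/off-heap: 128m each (defaults)
> #     task heap        ~1523m   (remainder)
> {code}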
> h2. Project Dependencies
> * The following are our dependencies.
> {{val flinkVersion = "1.10.0"}}
> {{libraryDependencies += "com.squareup.okhttp3" % "okhttp" % "4.2.2"}}
> {{libraryDependencies += "com.typesafe" % "config" % "1.4.0"}}
> {{libraryDependencies += "joda-time" % "joda-time" % "2.10.5"}}
> {{libraryDependencies += "org.apache.flink" %% "flink-connector-kafka" %
> flinkVersion}}
> {{libraryDependencies += "org.apache.flink" % "flink-metrics-dropwizard" %
> flinkVersion}}
> {{libraryDependencies += "org.apache.flink" %% "flink-scala" % flinkVersion
> % "provided"}}
> {{libraryDependencies += "org.apache.flink" %% "flink-statebackend-rocksdb"
> % flinkVersion % "provided"}}
> {{libraryDependencies += "org.apache.flink" %% "flink-streaming-scala" %
> flinkVersion % "provided"}}
> {{libraryDependencies += "org.json4s" %% "json4s-jackson" % "3.6.7"}}
> {{libraryDependencies += "org.log4s" %% "log4s" % "1.8.2"}}
> {{libraryDependencies += "org.rogach" %% "scallop" % "3.3.1"}}
> h2. Previous Flink 1.9.1 Configuration
> * The configuration we used in Flink 1.9.1 is the following. It did not
> result in `OOMKilled`.
> h3. Kubernetes
> {{resources:}}
> {{ requests:}}
> {{ cpu: 1200m}}
> {{ memory: 2G}}
> {{ limits:}}
> {{ cpu: 1500m}}
> {{ memory: 2G}}
> h3. Flink 1.9.1
> {{jobmanager.heap.size: 820m}}
> {{taskmanager.heap.size: 1024m}}
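> For comparison, the two versions' main memory settings side by side (values as
> given above and in the attached yaml files; shown only for illustration, not as
> a verified one-to-one mapping between the 1.9.1 and 1.10 memory options):
> {code}
> # Flink 1.9.1 (this section)
> taskmanager.heap.size: 1024m           # 2000m in the attached flink-conf_1.9.1.yaml
> # Flink 1.10.0 (earlier section), which bounds the whole taskmanager process
> taskmanager.memory.process.size: 4096m
> {code}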
--
This message was sent by Atlassian Jira
(v8.3.4#803005)