Hi Chen,

You are right that Flink changed its memory model in Flink 1.10. The memory
model is now more explicitly defined and stricter. You can find information
about it in [1]. For some pointers towards potential problems, please take a
look at [2].
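
To give you a rough idea of what that means in terms of configuration, the
main knobs of the new model live in flink-conf.yaml and look roughly like the
sketch below. The values are only placeholders, not a recommendation:

    taskmanager.memory.process.size: 8g            # total size of the YARN container
    taskmanager.memory.jvm-metaspace.size: 256m    # budget for class metadata
    taskmanager.memory.task.off-heap.size: 128m    # direct/native memory for user code
    taskmanager.memory.jvm-overhead.fraction: 0.1  # headroom for threads, code cache, etc.

Since taskmanager.memory.process.size is roughly what Flink requests from
Yarn, any allocation that pushes the actual process size beyond it will
eventually show up as the container memory kill you are seeing.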

What you need to do is figure out where the non-heap memory is allocated.
Maybe you are using a library which leaks memory, or maybe your code simply
requires more non-heap memory than you have configured the system with.
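
One way to narrow this down (just a sketch, and independent of Flink itself)
is to enable the JVM's native memory tracking on the affected TaskManagers by
appending to your existing env.java.opts.taskmanager value:

    env.java.opts.taskmanager: '<your existing options> -XX:NativeMemoryTracking=summary'

and then taking snapshots inside the container over time with

    jcmd <taskmanager-pid> VM.native_memory summary

where <taskmanager-pid> is the TaskManager's process id. Comparing snapshots
tells you whether the growth comes from Metaspace, thread stacks or
JVM-internal buffers; if the reported total stays flat while the container's
RSS keeps rising, that points to a native library allocating memory outside
of the JVM's bookkeeping.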

If you are using the per-job mode on Yarn without setting
yarn.per-job-cluster.include-user-jar: disabled, then you should not have
any class loader leak problems, because the user code should be part of the
system classpath.
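
For reference, the option accepts ORDER (the default), FIRST, LAST and
DISABLED. With your current setting

    yarn.per-job-cluster.include-user-jar: first

the user jar is put at the beginning of the system classpath, so its classes
are loaded by the system class loader and are never unloaded, but they also
cannot leak a dedicated user code class loader.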

If you set yarn.per-job-cluster.include-user-jar: disabled, then the
TaskExecutor will create a dedicated user code class loader and keep it for
as long as it still has slots allocated for the job.
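
If you want to verify whether classes and class loaders are actually being
unloaded, one option (assuming you are on JDK 8, which your GC flags suggest)
is to add the class loading trace flags to your existing
env.java.opts.taskmanager value:

    env.java.opts.taskmanager: '<your existing options> -XX:+TraceClassLoading -XX:+TraceClassUnloading'

A Metaspace that keeps growing without any unloading messages in the logs
would be a strong hint at a class loader or class metadata leak in a library
or in the user code.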

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/memory/mem_setup_tm.html
[2]
https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/memory/mem_trouble.html

Cheers,
Till

On Sun, Mar 14, 2021 at 12:22 AM Chen Qin <qinnc...@gmail.com> wrote:

> Hi there,
>
> We were using Flink 1.11.2 in production with a large setup. The job
> runs fine for a couple of days and then ends up in a restart loop caused by
> YARN container memory kills. This is not observed when running against
> 1.9.1 with the same settings.
> Here is the JVM environment passed to both the 1.11 and the 1.9.1 jobs:
>
>
>> env.java.opts.taskmanager: '-XX:+UseG1GC -XX:MaxGCPauseMillis=500
>> -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5
>> -XX:InitiatingHeapOccupancyPercent=45 -XX:NewRatio=1
>> -XX:+PrintClassHistogram -XX:+PrintGCDateStamps -XX:+PrintGCDetails
>> -XX:+PrintGCApplicationStoppedTime -Xloggc:<LOG_DIR>/gc.log'
>> env.java.opts.jobmanager: '-XX:+UseG1GC -XX:MaxGCPauseMillis=500
>> -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5
>> -XX:InitiatingHeapOccupancyPercent=45 -XX:NewRatio=1
>> -XX:+PrintClassHistogram -XX:+PrintGCDateStamps -XX:+PrintGCDetails
>> -XX:+PrintGCApplicationStoppedTime -Xloggc:<LOG_DIR>/gc.log'
>>
>
> After a preliminary investigation, we found this does not seem to be related
> to JVM heap space usage or a GC issue. Meanwhile, we observed that JVM
> non-heap usage on some containers keeps rising while the job falls into the
> restart loop, as shown below.
> [image: image.png]
>
> From a configuration perspective, we would like to learn how the task
> manager handles class loading (and unloading?) when we set include-user-jar
> to first. Are there any suggestions on how we can better understand how the
> new memory model introduced in 1.10 affects this issue?
>
>
> cluster.evenly-spread-out-slots: true
> zookeeper.sasl.disable: true
> yarn.per-job-cluster.include-user-jar: first
> yarn.properties-file.location: /usr/local/hadoop/etc/hadoop/
>
>
> Thanks,
> Chen
>
>
