Hi Chen,

You are right that Flink changed its memory model with Flink 1.10. The memory model is now better defined and stricter. You can find information about it in [1]. For some pointers towards potential problems, please take a look at [2].
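To give a rough orientation of the knobs involved: the container size that YARN enforces corresponds to taskmanager.memory.process.size, and the non-heap parts (task off-heap, managed memory, metaspace, JVM overhead) have their own options. A minimal flink-conf.yaml sketch; the sizes below are placeholders, not recommendations for your job:

  # total container memory requested from YARN (heap + off-heap + metaspace + overhead)
  taskmanager.memory.process.size: 4096m
  # off-heap memory available to user code and libraries (direct + native)
  taskmanager.memory.task.off-heap.size: 256m
  # managed memory, e.g. for the RocksDB state backend or batch operators
  taskmanager.memory.managed.fraction: 0.4
  # upper bound for class metadata; loaded classes count against this, not the heap
  taskmanager.memory.jvm-metaspace.size: 256m
  # slack for thread stacks, code cache, GC space, etc.
  taskmanager.memory.jvm-overhead.fraction: 0.1

If YARN kills the container even though the heap looks healthy, the troubleshooting guide in [2] suggests increasing the task off-heap size or the JVM overhead, because native memory that Flink does not account for falls into those budgets.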
What you need to do is to figure out where the non-heap memory is allocated. Maybe you are using a library which leaks some memory, or maybe your code requires more non-heap memory than you have configured the system with (one way to inspect where the native memory goes is sketched at the bottom of this mail, below the quoted message). If you are using the per-job mode on YARN without yarn.per-job-cluster.include-user-jar: disabled, then you should not have any classloader leak problems because the user code should be part of the system classpath. If you set yarn.per-job-cluster.include-user-jar: disabled, then the TaskExecutor will create a user code classloader and keep it as long as the TaskExecutor still has some slots allocated for the job.

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/memory/mem_setup_tm.html
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/memory/mem_trouble.html

Cheers,
Till

On Sun, Mar 14, 2021 at 12:22 AM Chen Qin <qinnc...@gmail.com> wrote:

> Hi there,
>
> We were using Flink 1.11.2 in production with a large setup. The job
> runs fine for a couple of days and then ends up in a restart loop caused
> by YARN container memory kills. This is not observed while running against
> 1.9.1 with the same setting.
> Here is the JVM environment passed to both the 1.11 and the 1.9.1 job:
>
>> env.java.opts.taskmanager: '-XX:+UseG1GC -XX:MaxGCPauseMillis=500
>> -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5
>> -XX:InitiatingHeapOccupancyPercent=45 -XX:NewRatio=1
>> -XX:+PrintClassHistogram -XX:+PrintGCDateStamps -XX:+PrintGCDetails
>> -XX:+PrintGCApplicationStoppedTime -Xloggc:<LOG_DIR>/gc.log'
>> env.java.opts.jobmanager: '-XX:+UseG1GC -XX:MaxGCPauseMillis=500
>> -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5
>> -XX:InitiatingHeapOccupancyPercent=45 -XX:NewRatio=1
>> -XX:+PrintClassHistogram -XX:+PrintGCDateStamps -XX:+PrintGCDetails
>> -XX:+PrintGCApplicationStoppedTime -Xloggc:<LOG_DIR>/gc.log'
>
> After a preliminary investigation, we found this might not be related to
> JVM heap space usage or a GC issue. Meanwhile, we observed that JVM
> non-heap usage on some containers keeps rising while the job falls into
> the restart loop, as shown below.
> [image: image.png]
>
> From a configuration perspective, we would like to learn how the task
> manager handles class loading (and unloading?) when we set
> include-user-jar to first. Are there suggestions on how we can get a
> better understanding of how the new memory model introduced in 1.10
> affects this issue?
>
> cluster.evenly-spread-out-slots: true
> zookeeper.sasl.disable: true
> yarn.per-job-cluster.include-user-jar: first
> yarn.properties-file.location: /usr/local/hadoop/etc/hadoop/
>
> Thanks,
> Chen
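Regarding how to see where the non-heap memory actually goes: one option, which is a standard HotSpot feature rather than anything Flink-specific, is to enable Native Memory Tracking on the TaskManagers and diff the summaries over time. A sketch, assuming you can restart the job with the extra flag and have shell access to the YARN nodes; the angle-bracket placeholder needs to be filled in with the TaskManager's process id:

  # flink-conf.yaml: append to your existing JVM options (NMT adds some overhead)
  env.java.opts.taskmanager: '... -XX:NativeMemoryTracking=summary'

  # on the YARN node, against the running TaskManager JVM:
  jcmd <taskmanager-pid> VM.native_memory baseline
  # ... wait while the non-heap usage grows, then:
  jcmd <taskmanager-pid> VM.native_memory summary.diff

The diff shows which JVM-internal category (Class, Thread, Internal, ...) is growing. Note that NMT only covers memory the JVM itself allocates; allocations made directly by native libraries (e.g. RocksDB) will not show up there.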