Hi Chen,

You are right that Flink changed its memory model with Flink 1.10. The memory model is now better defined and stricter. You can find information about it in [1]. For some pointers towards potential problems, please take a look at [2].
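To give a rough orientation of the knobs involved: the container size that YARN enforces corresponds to taskmanager.memory.process.size, and the non-heap parts (task off-heap, managed memory, metaspace, JVM overhead) have their own options. A minimal flink-conf.yaml sketch; the sizes below are placeholders, not recommendations for your job:

  # total container memory requested from YARN (heap + off-heap + metaspace + overhead)
  taskmanager.memory.process.size: 4096m
  # off-heap memory available to user code and libraries (direct + native)
  taskmanager.memory.task.off-heap.size: 256m
  # managed memory, e.g. for the RocksDB state backend or batch operators
  taskmanager.memory.managed.fraction: 0.4
  # upper bound for class metadata; loaded classes count against this, not the heap
  taskmanager.memory.jvm-metaspace.size: 256m
  # slack for thread stacks, code cache, GC space, etc.
  taskmanager.memory.jvm-overhead.fraction: 0.1

If YARN kills the container even though the heap looks healthy, the troubleshooting guide in [2] suggests increasing the task off-heap size or the JVM overhead, because native memory that Flink does not account for falls into those budgets.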
What you need to do is to figure out where the non-heap memory is allocated. Maybe you are using a library which leaks some memory, or maybe your code requires more non-heap memory than you have configured the system with (one way to inspect where the native memory goes is sketched at the bottom of this mail, below the quoted message). If you are using the per-job mode on YARN without yarn.per-job-cluster.include-user-jar: disabled, then you should not have any classloader leak problems because the user code should be part of the system classpath. If you set yarn.per-job-cluster.include-user-jar: disabled, then the TaskExecutor will create a user code classloader and keep it as long as the TaskExecutor still has some slots allocated for the job.

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/memory/mem_setup_tm.html
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/memory/mem_trouble.html

Cheers,
Till

On Sun, Mar 14, 2021 at 12:22 AM Chen Qin <qinnc...@gmail.com> wrote:

> Hi there,
>
> We were using Flink 1.11.2 in production with a large setup. The job
> runs fine for a couple of days and then ends up in a restart loop caused
> by YARN container memory kills. This is not observed while running against
> 1.9.1 with the same setting.
> Here is the JVM environment passed to both the 1.11 and the 1.9.1 job:
>
>> env.java.opts.taskmanager: '-XX:+UseG1GC -XX:MaxGCPauseMillis=500
>> -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5
>> -XX:InitiatingHeapOccupancyPercent=45 -XX:NewRatio=1
>> -XX:+PrintClassHistogram -XX:+PrintGCDateStamps -XX:+PrintGCDetails
>> -XX:+PrintGCApplicationStoppedTime -Xloggc:<LOG_DIR>/gc.log'
>> env.java.opts.jobmanager: '-XX:+UseG1GC -XX:MaxGCPauseMillis=500
>> -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5
>> -XX:InitiatingHeapOccupancyPercent=45 -XX:NewRatio=1
>> -XX:+PrintClassHistogram -XX:+PrintGCDateStamps -XX:+PrintGCDetails
>> -XX:+PrintGCApplicationStoppedTime -Xloggc:<LOG_DIR>/gc.log'
>
> After a preliminary investigation, we found this might not be related to
> JVM heap space usage or a GC issue. Meanwhile, we observed that JVM
> non-heap usage on some containers keeps rising while the job falls into
> the restart loop, as shown below.
> [image: image.png]
>
> From a configuration perspective, we would like to learn how the task
> manager handles class loading (and unloading?) when we set
> include-user-jar to first. Are there suggestions on how we can get a
> better understanding of how the new memory model introduced in 1.10
> affects this issue?
>
> cluster.evenly-spread-out-slots: true
> zookeeper.sasl.disable: true
> yarn.per-job-cluster.include-user-jar: first
> yarn.properties-file.location: /usr/local/hadoop/etc/hadoop/
>
> Thanks,
> Chen
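Regarding how to see where the non-heap memory actually goes: one option, which is a standard HotSpot feature rather than anything Flink-specific, is to enable Native Memory Tracking on the TaskManagers and diff the summaries over time. A sketch, assuming you can restart the job with the extra flag and have shell access to the YARN nodes; the angle-bracket placeholder needs to be filled in with the TaskManager's process id:

  # flink-conf.yaml: append to your existing JVM options (NMT adds some overhead)
  env.java.opts.taskmanager: '... -XX:NativeMemoryTracking=summary'

  # on the YARN node, against the running TaskManager JVM:
  jcmd <taskmanager-pid> VM.native_memory baseline
  # ... wait while the non-heap usage grows, then:
  jcmd <taskmanager-pid> VM.native_memory summary.diff

The diff shows which JVM-internal category (Class, Thread, Internal, ...) is growing. Note that NMT only covers memory the JVM itself allocates; allocations made directly by native libraries (e.g. RocksDB) will not show up there.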