[ https://issues.apache.org/jira/browse/FLINK-13958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922545#comment-16922545 ]
Alex edited comment on FLINK-13958 at 9/4/19 2:35 PM:
------------------------------------------------------

I think this has the same root cause as FLINK-11402. Specifically, we cannot load a native library more than once in the same JVM process.


was (Author: 1u0):
I think this has the the same root cause as FLINK-11402. Specifically, that we cannot load a native library more than once in the same JVM process.


> Job class loader may not be reused after batch job recovery
> -----------------------------------------------------------
>
>                 Key: FLINK-13958
>                 URL: https://issues.apache.org/jira/browse/FLINK-13958
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>    Affects Versions: 1.9.0
>            Reporter: David Moravek
>            Priority: Major
>
> [https://lists.apache.org/thread.html/e241be9a1a10810a1203786dff3b7386d265fbe8702815a77bad42eb@%3Cdev.flink.apache.org%3E|http://example.com]
> 1) We have a per-job flink cluster
> 2) We use BATCH execution mode + region failover strategy
> Point 1) should imply a single user code class loader per task manager
> (because there is only a single pipeline, which reuses the class loader cached
> in BlobLibraryCacheManager). We need this property, because we have UDFs that
> access C libraries using JNI (I think this may be a fairly common use case when
> dealing with legacy code). [JDK
> internals|https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/ClassLoader.java#L2466]
> make sure that a given native library can be loaded by only a single class
> loader per JVM.
> When region recovery is triggered, vertices that need to recover are first
> reset back to the CREATED state and then rescheduled. In case all tasks in a
> task manager are reset, this results in the [cached class loader being
> released|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/execution/librarycache/BlobLibraryCacheManager.java#L338].
> This unfortunately causes the job to fail, because we try to reload a native
> library in a newly created class loader.
> I believe the correct approach would be not to release the cached class loader
> if the job is recovering, even though there are no tasks currently registered
> with the TM.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
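
To make the proposal in the description concrete, here is a minimal sketch of a reference-counted class loader cache that skips the release while a job is marked as recovering. It is not Flink's BlobLibraryCacheManager: the class RecoveryAwareLoaderCache, its methods and the recovering flag are hypothetical names for illustration, and it assumes the task manager is somehow told that the job is recovering, which the issue does not specify.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch only; names and structure are not taken from Flink. */
final class RecoveryAwareLoaderCache {

    /** Per-job entry: the cached loader, the tasks holding it, and a recovery flag. */
    private static final class Entry {
        final URLClassLoader loader;
        final Set<String> tasks = new HashSet<>();
        boolean recovering;

        Entry(URLClassLoader loader) {
            this.loader = loader;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    /** Registers a task and returns the single cached class loader of its job. */
    synchronized ClassLoader registerTask(String jobId, String taskId, URL[] classpath) {
        Entry entry = entries.computeIfAbsent(jobId, id -> new Entry(
                new URLClassLoader(classpath, RecoveryAwareLoaderCache.class.getClassLoader())));
        entry.tasks.add(taskId);
        return entry.loader;
    }

    /** Unregisters a task; the loader is released only if the job is not recovering. */
    synchronized void unregisterTask(String jobId, String taskId) {
        Entry entry = entries.get(jobId);
        if (entry != null) {
            entry.tasks.remove(taskId);
            releaseIfUnused(jobId, entry);
        }
    }

    /** Assumption: some component signals the start and end of recovery for a job. */
    synchronized void setRecovering(String jobId, boolean recovering) {
        Entry entry = entries.get(jobId);
        if (entry != null) {
            entry.recovering = recovering;
            releaseIfUnused(jobId, entry);
        }
    }

    synchronized boolean hasLoader(String jobId) {
        return entries.containsKey(jobId);
    }

    /** Closes and drops the loader once no task holds it and no recovery is pending. */
    private void releaseIfUnused(String jobId, Entry entry) {
        if (entry.tasks.isEmpty() && !entry.recovering) {
            entries.remove(jobId);
            try {
                entry.loader.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }
}
```

With this shape, a task rescheduled during recovery gets back the same class loader instance, so a native library loaded through it would not be loaded a second time by a new loader. A job that never leaves the recovering state would keep its loader cached indefinitely, so a real change would probably also need a timeout or an explicit job-finished signal.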