[jira] [Commented] (FLINK-13958) Job class loader may not be reused after batch job recovery

Till Rohrmann (Jira) Thu, 05 Sep 2019 04:50:10 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-13958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923320#comment-16923320
 ]


Till Rohrmann commented on FLINK-13958:
---------------------------------------

I think you are describing a valid problem here. Unfortunately, I don't have 
a good idea for a general solution at the moment. 

For the per-job mode, a fix could mean not creating a new user code class 
loader. There have been ideas to bind the user code class loader to the 
lifecycle of a slot. As long as the slot is still allocated to a 
{{JobMaster}}, the system should not free the class loader. However, this 
would not solve all problems either, because the {{TaskExecutor}} could lose 
its connection to the {{JobMaster}}, which causes the slot to be freed. After 
reconnecting to the {{JobMaster}}, it would then create a new class loader.
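
To make the slot-lifecycle idea and its limitation concrete, here is a minimal sketch. It is not Flink code: the class {{SlotBoundClassLoaders}} and all of its methods are hypothetical names made up for illustration, and it assumes a lost {{JobMaster}} connection frees every slot of the job.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: a job's class loader lives as long as the job holds a slot. */
class SlotBoundClassLoaders {

    private final Map<String, ClassLoader> loaderByJob = new HashMap<>();
    private final Map<String, Set<String>> slotsByJob = new HashMap<>();

    /** Allocates a slot to the job and returns the loader shared by all of its slots. */
    ClassLoader allocateSlot(String jobId, String slotId) {
        slotsByJob.computeIfAbsent(jobId, id -> new HashSet<>()).add(slotId);
        return loaderByJob.computeIfAbsent(jobId, id -> new ClassLoader() {});
    }

    /** Frees a slot; the loader is dropped only when the job's last slot is gone. */
    void freeSlot(String jobId, String slotId) {
        Set<String> slots = slotsByJob.get(jobId);
        if (slots == null) {
            return;
        }
        slots.remove(slotId);
        if (slots.isEmpty()) {
            slotsByJob.remove(jobId);
            loaderByJob.remove(jobId);
        }
    }

    /** Assumption: a lost JobMaster connection frees every slot of the job at once. */
    void jobMasterDisconnected(String jobId) {
        slotsByJob.remove(jobId);
        loaderByJob.remove(jobId);
    }
}
```

In this model the loader survives as long as one slot stays allocated, but a call to {{jobMasterDisconnected}} followed by a new allocation yields a different loader instance, which is exactly the remaining gap.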

For the session mode, I think it is super tricky because the system could try 
to deploy tasks belonging to two jobs to the same {{TaskExecutor}}, both of 
which try to load the same C library.
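
The conflict can be illustrated with a small model of the JVM rule that a native library belongs to the first class loader that loads it. This is only a simulation: {{NativeLibraryModel}} and the library path used with it are hypothetical, and the real check happens inside the JDK when {{System.load}} or {{System.loadLibrary}} is called.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of the JVM rule; the real check lives inside the JDK. */
class NativeLibraryModel {

    // Library path -> class loader that loaded it first (JVM-wide state).
    private final Map<String, ClassLoader> owners = new HashMap<>();

    /** Mimics loading a library: fine for the owning loader, an error for any other. */
    void load(String libraryPath, ClassLoader caller) {
        ClassLoader owner = owners.putIfAbsent(libraryPath, caller);
        if (owner != null && owner != caller) {
            throw new UnsatisfiedLinkError(
                    "Native Library " + libraryPath + " already loaded in another classloader");
        }
    }
}
```

With two jobs on one {{TaskExecutor}}, each job has its own user code class loader, so the second job to call {{load}} for the same path fails no matter how long either loader is kept alive.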

> Job class loader may not be reused after batch job recovery
> -----------------------------------------------------------
>
>                 Key: FLINK-13958
>                 URL: https://issues.apache.org/jira/browse/FLINK-13958
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>    Affects Versions: 1.9.0
>            Reporter: David Moravek
>            Priority: Major
>
> [https://lists.apache.org/thread.html/e241be9a1a10810a1203786dff3b7386d265fbe8702815a77bad42eb@%3Cdev.flink.apache.org%3E|http://example.com]
> 1) We have a per-job Flink cluster
> 2) We use BATCH execution mode + region failover strategy
> Point 1) should imply a single user code class loader per task manager 
> (because there is only a single pipeline, which reuses the class loader 
> cached in BlobLibraryCacheManager). We need this property because we have 
> UDFs that access C libraries using JNI (I think this may be a fairly common 
> use case when dealing with legacy code). [JDK 
> internals|https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/ClassLoader.java#L2466]
>  make sure that a single library can only be loaded by a single class loader 
> per JVM.
> When region recovery is triggered, vertices that need to recover are first 
> reset back to the CREATED state and then rescheduled. In case all tasks in a 
> task manager are reset, this results in the [cached class loader being 
> released|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/execution/librarycache/BlobLibraryCacheManager.java#L338].
>  This unfortunately causes job failure, because we try to reload a native 
> library in a newly created class loader.
> I believe the correct approach would be not to release the cached class 
> loader if the job is recovering, even though there are no tasks currently 
> registered with the TM.
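
The approach proposed in the description can be sketched as a reference-counted cache that skips the release while a job is marked as recovering. This is an illustration only and not the actual {{BlobLibraryCacheManager}}: the class {{RecoveryAwareLoaderCache}}, its methods, and the notion of an explicit "recovering" flag are all assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of a per-job loader cache that is aware of recovery. */
class RecoveryAwareLoaderCache {

    /** One cached loader plus the number of tasks currently registered for it. */
    private static final class Entry {
        final ClassLoader loader = new ClassLoader() {};
        int registeredTasks;
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private final Set<String> recoveringJobs = new HashSet<>();

    /** Registers a task and returns the job's cached loader, creating it if needed. */
    ClassLoader registerTask(String jobId) {
        Entry entry = entries.computeIfAbsent(jobId, id -> new Entry());
        entry.registeredTasks++;
        return entry.loader;
    }

    /** Unregisters a task; the loader is released only if nothing else needs it. */
    void unregisterTask(String jobId) {
        Entry entry = entries.get(jobId);
        if (entry == null) {
            return;
        }
        if (entry.registeredTasks > 0) {
            entry.registeredTasks--;
        }
        releaseIfUnused(jobId, entry);
    }

    /** Assumption: something tells the cache that the job is being recovered. */
    void markRecovering(String jobId) {
        recoveringJobs.add(jobId);
    }

    /** Ends recovery; an entry that is still unused is released now. */
    void recoveryFinished(String jobId) {
        recoveringJobs.remove(jobId);
        Entry entry = entries.get(jobId);
        if (entry != null) {
            releaseIfUnused(jobId, entry);
        }
    }

    private void releaseIfUnused(String jobId, Entry entry) {
        if (entry.registeredTasks == 0 && !recoveringJobs.contains(jobId)) {
            entries.remove(jobId);
        }
    }
}
```

Without {{markRecovering}}, unregistering the last task and registering a new one returns a fresh loader, which reproduces the reported failure. With it, the rescheduled tasks get the same loader back. How the task manager would learn that a job is recovering is left open here.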



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
