[ 
https://issues.apache.org/jira/browse/FLINK-13958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923160#comment-16923160
 ] 

Till Rohrmann commented on FLINK-13958:
---------------------------------------

Thanks for reporting this issue [~davidmoravek]. If I'm not mistaken, this 
issue should also arise with any other restart settings, and with streaming 
as well.

A quick question concerning the per-job mode: are you using the per-job mode on 
Yarn? If yes, do you submit the job in detached or attached mode? If it is the 
latter, then Flink actually deploys a session cluster underneath, for legacy 
reasons. I'm asking because, at the moment, the per-job mode (submitting a job 
in detached mode on Yarn or using the container per-job mode) should place all 
dependencies on the system class path (this has other problems, as it does not 
currently support child-first class loading).

> Job class loader may not be reused after batch job recovery
> -----------------------------------------------------------
>
>                 Key: FLINK-13958
>                 URL: https://issues.apache.org/jira/browse/FLINK-13958
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>    Affects Versions: 1.9.0
>            Reporter: David Moravek
>            Priority: Major
>
> [https://lists.apache.org/thread.html/e241be9a1a10810a1203786dff3b7386d265fbe8702815a77bad42eb@%3Cdev.flink.apache.org%3E|http://example.com]
> 1) We have a per-job Flink cluster
> 2) We use BATCH execution mode + region failover strategy
> Point 1) should imply a single user code class loader per task manager (because 
> there is only a single pipeline, which reuses the class loader cached in 
> BlobLibraryCacheManager). We need this property because we have UDFs that 
> access C libraries using JNI (I think this may be a fairly common use case when 
> dealing with legacy code). [JDK 
> internals|https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/ClassLoader.java#L2466]
>  make sure that a given library can be loaded by only a single class loader 
> per JVM.
> When region recovery is triggered, vertices that need to recover are first reset 
> back to the CREATED state and then rescheduled. If all tasks in a task 
> manager are reset, this results in the [cached class loader being 
> released|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/execution/librarycache/BlobLibraryCacheManager.java#L338].
>  This unfortunately causes job failure, because we try to reload a native 
> library in a newly created class loader.
> I believe the correct approach would be not to release the cached class loader 
> if the job is recovering, even though there are no tasks currently registered 
> with the TM.
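
For illustration only, the release behavior described in the issue can be 
sketched as a reference-counted cache. This is a simplified model, not Flink's 
BlobLibraryCacheManager: the class RefCountedLoaderCache and its methods 
register and unregister are hypothetical names invented for this sketch.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a per-job class loader cache; not Flink's actual API. */
final class RefCountedLoaderCache {

    /** One cached loader per job, plus the number of tasks currently using it. */
    private static final class Entry {
        final URLClassLoader loader;
        int refCount;

        Entry(URLClassLoader loader) {
            this.loader = loader;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    /** Registers one task of the job and returns the job's cached class loader. */
    synchronized ClassLoader register(String jobId) {
        Entry entry = entries.get(jobId);
        if (entry == null) {
            // An empty URL list stands in for the job's jars in this sketch.
            entry = new Entry(new URLClassLoader(
                    new URL[0], RefCountedLoaderCache.class.getClassLoader()));
            entries.put(jobId, entry);
        }
        entry.refCount++;
        return entry.loader;
    }

    /** Unregisters one task; the loader is released when no task remains. */
    synchronized void unregister(String jobId) {
        Entry entry = entries.get(jobId);
        if (entry == null) {
            return;
        }
        if (--entry.refCount == 0) {
            entries.remove(jobId);
            try {
                entry.loader.close();
            } catch (IOException ignored) {
                // Nothing useful to do in a sketch.
            }
        }
    }
}
```

In this model, resetting all tasks of a job on a task manager drops the count 
to zero, so the next register call creates a fresh class loader. If the old 
loader had already loaded a native library and has not yet been garbage 
collected, loading the same library from the new loader fails with an 
UnsatisfiedLinkError.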



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
