[
https://issues.apache.org/jira/browse/FLINK-13958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Flink Jira Bot updated FLINK-13958:
-----------------------------------
Labels: auto-deprioritized-major auto-deprioritized-minor (was:
auto-deprioritized-major stale-minor)
Priority: Not a Priority (was: Minor)
This issue was labeled "stale-minor" 7 days ago and has not received any
updates so it is being deprioritized. If this ticket is actually Minor, please
raise the priority and ask a committer to assign you the issue or revive the
public discussion.
> Job class loader may not be reused after batch job recovery
> -----------------------------------------------------------
>
> Key: FLINK-13958
> URL: https://issues.apache.org/jira/browse/FLINK-13958
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Task
> Affects Versions: 1.9.0
> Reporter: David Morávek
> Priority: Not a Priority
> Labels: auto-deprioritized-major, auto-deprioritized-minor
>
> [https://lists.apache.org/thread.html/e241be9a1a10810a1203786dff3b7386d265fbe8702815a77bad42eb@%3Cdev.flink.apache.org%3E|http://example.com]
> 1) We have a per-job Flink cluster
> 2) We use BATCH execution mode + region failover strategy
> Point 1) should imply a single user code class loader per task manager
> (because there is only a single pipeline, which reuses the class loader
> cached in BlobLibraryCacheManager). We need this property because we have
> UDFs that access C libraries using JNI (I think this may be a fairly common
> use case when dealing with legacy code). [JDK
> internals|https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/lang/ClassLoader.java#L2466]
> ensure that a given native library can be loaded by only one class loader
> per JVM.
> When region recovery is triggered, vertices that need to recover are first
> reset back to the CREATED state and then rescheduled. If all tasks in a task
> manager are reset, this results in the [cached class loader being
> released|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/execution/librarycache/BlobLibraryCacheManager.java#L338].
> This unfortunately causes a job failure, because we try to reload a native
> library in a newly created class loader.
> I believe the correct approach would be not to release the cached class
> loader if the job is recovering, even though there are no tasks currently
> registered with the TM.
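
The behaviour described in the issue can be illustrated with a small, self-contained model. This is only a sketch under assumptions: `LoaderCacheSketch`, `registerTask`, `unregisterTask`, `markRecovering` and the `retainWhileRecovering` flag are hypothetical names invented for illustration and are not Flink's BlobLibraryCacheManager API. The model keeps one reference-counted class loader per job, releases it when the last task unregisters, and optionally retains it while the job is flagged as recovering, which is roughly the approach the reporter proposes.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified model of a reference-counted class loader cache (not Flink code). */
final class LoaderCacheSketch {

    /** One cached loader per job, plus the number of tasks currently using it. */
    private static final class Entry {
        final ClassLoader loader = new URLClassLoader(new URL[0]);
        int refCount = 0;
    }

    private final Map<String, Entry> cache = new HashMap<>();
    private final Set<String> recoveringJobs = new HashSet<>();
    private final boolean retainWhileRecovering; // assumption: models the proposed fix

    LoaderCacheSketch(boolean retainWhileRecovering) {
        this.retainWhileRecovering = retainWhileRecovering;
    }

    /** Registers a task and returns the job's class loader, creating it if none is cached. */
    ClassLoader registerTask(String jobId) {
        Entry entry = cache.computeIfAbsent(jobId, k -> new Entry());
        entry.refCount++;
        return entry.loader;
    }

    /** Unregisters a task; the loader is dropped once no task uses it, unless retained. */
    void unregisterTask(String jobId) {
        Entry entry = cache.get(jobId);
        if (entry == null || entry.refCount == 0) {
            return;
        }
        entry.refCount--;
        boolean retained = retainWhileRecovering && recoveringJobs.contains(jobId);
        if (entry.refCount == 0 && !retained) {
            cache.remove(jobId);
        }
    }

    /** Flags a job as recovering; when recovery ends, an unused retained loader is dropped. */
    void markRecovering(String jobId, boolean recovering) {
        if (recovering) {
            recoveringJobs.add(jobId);
            return;
        }
        recoveringJobs.remove(jobId);
        Entry entry = cache.get(jobId);
        if (entry != null && entry.refCount == 0) {
            cache.remove(jobId);
        }
    }

    /** Simulates "all tasks on the TM are reset, then rescheduled" and reports loader reuse. */
    static boolean loaderReusedAfterRecovery(boolean retainWhileRecovering) {
        LoaderCacheSketch cache = new LoaderCacheSketch(retainWhileRecovering);
        ClassLoader before = cache.registerTask("job");
        cache.markRecovering("job", true);
        cache.unregisterTask("job");          // last task on this TM is reset
        ClassLoader after = cache.registerTask("job"); // task is rescheduled
        cache.markRecovering("job", false);
        return before == after;
    }

    public static void main(String[] args) {
        System.out.println("release on zero:   reused = " + loaderReusedAfterRecovery(false));
        System.out.println("retain on recover: reused = " + loaderReusedAfterRecovery(true));
    }
}
```

With release-on-zero, the rescheduled task in this model gets a fresh class loader, which is the situation in which reloading a JNI library fails. With retention during recovery, the same loader instance is handed back. How a real task manager would learn that a job is recovering is left open here.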
--
This message was sent by Atlassian Jira
(v8.20.1#820001)