AHeise commented on a change in pull request #11303: [FLINK-16245] Decoupling user classloader from context classloader. URL: https://github.com/apache/flink/pull/11303#discussion_r393553966
########## File path: flink-runtime/src/main/java/org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders.java ########## @@ -82,4 +89,61 @@ public static ResolveOrder fromString(String resolveOrder) { super(urls, parent); } } + + /** + * Ensures that holding a reference on the context class loader outliving the scope of user code does not prevent + * the user classloader to be garbage collected (FLINK-16245). + * + * <p>This classloader delegates to the actual user classloader. Upon {@link #close()}, the delegate is nulled + * and can be garbage collected. Additional class resolution will be resolved solely through the bootstrap + * classloader and most likely result in ClassNotFound exceptions. + */ + private static class SafetyNetWrapperClassLoader extends URLClassLoader + implements Closeable { + private static final Logger LOG = LoggerFactory.getLogger(SafetyNetWrapperClassLoader.class); + + private FlinkUserCodeClassLoader inner; + + SafetyNetWrapperClassLoader(FlinkUserCodeClassLoader inner) { + super(new URL[0], null); + this.inner = inner; + } + + @Override + public void close() { + if (inner != null) { + try { + inner.close(); + } catch (IOException e) { + LOG.warn("Could not close user classloader", e); + } + } + inner = null; + } + + @Override + protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException { + if (inner == null) { + return super.loadClass(name, resolve); Review comment: > Given that the CL is only closed when all tasks on the TE have stopped, how could it affect a job? Leftover threads, sure, but what happens to those shouldn't affect job execution, and failing as early as possible might actually be a benefit. I actually meant it the other way around. What happens if we have job1 and job2, where the CL of job1 is leaked forward while executing job2. The TM would die taking job2 with it with a somewhat unrelated error. In particular, I'd think it would be pretty confusing/inperformant for larger batch clusters. @StephanEwen WDYT? But since I'm also a fan of failing early, I'd go that way once Stephan agrees (I kinda had the impression that he wanted it to be graceful). ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services