azagrebin opened a new pull request #12446: URL: https://github.com/apache/flink/pull/12446
## What is the purpose of the change

Currently, if
- the JVM Metaspace OOM happens in the user class loader in user threads
- and the OOM is not handled properly and forwarded to the task thread
- or the JVM Metaspace OOM is suppressed, e.g. by catching and just logging `Throwable`

then the TM will not fail, although the situation is basically unrecoverable and Flink failover should kick in.

Ideally, we should not catch the broad `Throwable` and should instead let errors be handled properly in a central place for all threads. This is a big effort, so this PR suggests a smaller change for now. We can wrap the user class loading in a try/catch (because this is the most probable place for the JVM Metaspace OOM) and call the TM fatal error handler on the JVM Metaspace OOM.

The PR puts two ways to do that up for discussion:
- have the existing user class loaders inherit from a base user class loader with error handling (first commit)
- decorate the existing user class loaders with a wrapping class loader with error handling (second commit)

The third commit is temporary, as I [cannot run](https://developercommunity.visualstudio.com/content/problem/1060902/i-have-lost-access-to-my-organisation-and-project.html#) custom CI builds in my Azure account at the moment. The commit runs a test in a loop to rule out the test failure described in [FLINK-16917](https://issues.apache.org/jira/browse/FLINK-16917).
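For discussion, the decorating variant (second commit) could look roughly like the sketch below. It is not the code in this PR: `FatalErrorHandlingClassLoader` and `isMetaspaceOom` are hypothetical names, the `Consumer<Throwable>` is a stand-in for the TM fatal error handler, and detecting the error by the `Metaspace` text in the `OutOfMemoryError` message is an assumption based on the message HotSpot normally reports.

```java
import java.util.function.Consumer;

/** Hypothetical sketch; names are illustrative and not Flink's actual classes. */
final class FatalErrorHandlingClassLoader extends ClassLoader {

    // Stand-in for the TM fatal error handler (assumed shape).
    private final Consumer<Throwable> fatalErrorHandler;

    FatalErrorHandlingClassLoader(ClassLoader wrapped, Consumer<Throwable> fatalErrorHandler) {
        super(wrapped); // the wrapped user class loader does the actual loading
        this.fatalErrorHandler = fatalErrorHandler;
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        try {
            // Delegates to the wrapped loader through the standard parent delegation.
            return super.loadClass(name, resolve);
        } catch (OutOfMemoryError e) {
            if (isMetaspaceOom(e)) {
                // Report centrally so the error cannot be swallowed by user code.
                fatalErrorHandler.accept(e);
            }
            throw e; // always rethrow; the caller still sees the original error
        }
    }

    // Assumption: the JVM reports this condition as "java.lang.OutOfMemoryError: Metaspace".
    static boolean isMetaspaceOom(Throwable t) {
        return t instanceof OutOfMemoryError
                && t.getMessage() != null
                && t.getMessage().contains("Metaspace");
    }
}
```

The handler is notified before the error is rethrown, so even if a user thread later catches and only logs the `Throwable`, the failure has already reached the central handler. Other `OutOfMemoryError` kinds (for example heap space) pass through untouched.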
