azagrebin opened a new pull request #12446:
URL: https://github.com/apache/flink/pull/12446


   ## What is the purpose of the change
   
   Currently, if 
   - a JVM Metaspace OOM happens in the user class loader in a user thread
   - and the OOM is not handled properly and forwarded to the task thread,
   - or the JVM Metaspace OOM is suppressed, e.g. by catching and just logging 
`Throwable`,

   then the TM will not fail, although the situation is basically unrecoverable 
and Flink failover should kick in.
   
   Ideally, we should not catch `Throwable` broadly and should instead let 
errors be handled properly in a central place for all threads. This is a big 
effort, so this PR suggests a smaller change for now.
   
   We can wrap user class loading in a try/catch (because this is the most 
probable place for a JVM Metaspace OOM) and call the TM fatal handler when a 
JVM Metaspace OOM occurs.
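
   A minimal sketch of this idea, not the code in this PR: the class and 
handler names below (`MetaspaceOomGuard`, `GuardedClassLoader`, 
`fatalErrorHandler`) are hypothetical, and detecting the error by looking for 
"Metaspace" in the `OutOfMemoryError` message is an assumption about how 
HotSpot reports it.

```java
import java.util.function.Consumer;

/** Hypothetical sketch; the names here are illustrative, not Flink's actual classes. */
public class MetaspaceOomGuard {

    /** Assumption: a Metaspace OOM is an OutOfMemoryError whose message mentions "Metaspace". */
    public static boolean isMetaspaceOom(Throwable t) {
        return t instanceof OutOfMemoryError
                && t.getMessage() != null
                && t.getMessage().contains("Metaspace");
    }

    /** Class loader that reports Metaspace OOMs raised while loading classes. */
    public static class GuardedClassLoader extends ClassLoader {
        private final Consumer<Throwable> fatalErrorHandler;

        public GuardedClassLoader(ClassLoader parent, Consumer<Throwable> fatalErrorHandler) {
            super(parent);
            this.fatalErrorHandler = fatalErrorHandler;
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            try {
                return super.loadClass(name, resolve);
            } catch (OutOfMemoryError e) {
                if (isMetaspaceOom(e)) {
                    // Notify the (hypothetical) fatal handler so the process can be failed.
                    fatalErrorHandler.accept(e);
                }
                // Always rethrow so the caller still sees the original error.
                throw e;
            }
        }
    }
}
```

   The error is rethrown after the handler is called, so callers that do 
handle it correctly keep working as before.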
   
   This PR proposes two ways to do that, for discussion:
   - have the existing user class loaders inherit from a base user class 
loader with error handling (first commit)
   - decorate the existing user class loaders with a wrapping class loader with 
error handling (second commit)
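
   To make the difference concrete, here is a rough sketch of the decorator 
variant. It is an illustration under assumptions, not the code in the second 
commit: `DecoratorSketch`, `ErrorHandlingClassLoader` and the 
`Consumer<Throwable>` handler are hypothetical names, and resource lookups 
are left out for brevity.

```java
import java.util.function.Consumer;

/** Hypothetical sketch of the decorator option; not the classes from this PR. */
public class DecoratorSketch {

    /** Wraps any existing class loader and adds error handling around class loading. */
    public static class ErrorHandlingClassLoader extends ClassLoader {
        private final ClassLoader inner;
        private final Consumer<Throwable> handler;

        public ErrorHandlingClassLoader(ClassLoader inner, Consumer<Throwable> handler) {
            super(null); // no parent of its own: class lookups go to the wrapped loader
            this.inner = inner;
            this.handler = handler;
        }

        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            try {
                // The protected two-argument method of another loader is not accessible here,
                // so delegate through the public one; "resolve" is ignored in this sketch.
                return inner.loadClass(name);
            } catch (OutOfMemoryError e) {
                // Assumption: a Metaspace OOM carries "Metaspace" in its message.
                if (e.getMessage() != null && e.getMessage().contains("Metaspace")) {
                    handler.accept(e);
                }
                throw e;
            }
        }
    }
}
```

   In the inheritance variant the same try/catch would live in a shared base 
class that the existing loaders extend, whereas the decorator leaves the 
existing loaders untouched and is applied where they are created.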
   
   The third commit is temporary because I currently [cannot 
run](https://developercommunity.visualstudio.com/content/problem/1060902/i-have-lost-access-to-my-organisation-and-project.html#)
 custom CI builds in my Azure account. The commit loops a test to rule out 
the test failure described in 
[FLINK-16917](https://issues.apache.org/jira/browse/FLINK-16917).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

