azagrebin commented on issue #11408: [FLINK-15989][FLINK-16225] Improve direct and metaspace out-of-memory error handling
URL: https://github.com/apache/flink/pull/11408#issuecomment-601657949

Thanks for the explanation, @tillrohrmann.

I agree that we can address other cases of failure handling separately; this PR was supposed to focus on OOMs in user code loading and invocation in `Task#doRun`. The problem in the PR was that, as you mentioned, the `FatalErrorHandler` is called for the user code failure case, only cleanup. I fixed this.

I checked how it looks in the logs with a [simple job](https://github.com/azagrebin/flink/commit/a09b077cc60c8c0194f2d2242f6352a8b7ac7915#diff-ec75df84b8550ffec1ac164ffb9c0909). Generating a real Metaspace OOM is tricky because, depending on the limit, it crashes in various places, including the error handling itself.

`taskmanager.jvm-exit-on-oom` is false by default. Do you think we should change this? I agree that different OOMs can affect other tasks differently, but I am not sure about complicating the logic further with more options.
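To illustrate the point about different OOMs, here is a minimal sketch of how the kinds of `OutOfMemoryError` discussed here could be told apart by their message. It is not the code from this PR: the class `OomClassifier`, the enum `OomKind`, and the method `classify` are hypothetical names, and the matching relies on the assumption that the JVM reports "Metaspace" or a "direct buffer memory" phrase in the error message, which can vary between JVM vendors and versions.

```java
import java.util.Locale;

// Hypothetical helper, for illustration only; not part of Flink or of this PR.
final class OomClassifier {

    enum OomKind { METASPACE, DIRECT, OTHER_OOM, NOT_OOM }

    // Classifies a throwable by inspecting the OutOfMemoryError message.
    // Assumption: the message contains "Metaspace" or "direct buffer memory"
    // (compared case-insensitively); anything else is treated as OTHER_OOM.
    static OomKind classify(Throwable t) {
        if (!(t instanceof OutOfMemoryError)) {
            return OomKind.NOT_OOM;
        }
        String message = t.getMessage();
        if (message == null) {
            return OomKind.OTHER_OOM;
        }
        String lower = message.toLowerCase(Locale.ROOT);
        if (lower.contains("metaspace")) {
            return OomKind.METASPACE;
        }
        if (lower.contains("direct buffer memory")) {
            return OomKind.DIRECT;
        }
        return OomKind.OTHER_OOM;
    }

    public static void main(String[] args) {
        // Synthetic errors stand in for real ones, which are hard to provoke reliably.
        System.out.println(classify(new OutOfMemoryError("Metaspace")));
        System.out.println(classify(new OutOfMemoryError("Direct buffer memory")));
        System.out.println(classify(new OutOfMemoryError("Java heap space")));
        System.out.println(classify(new IllegalStateException("not an OOM")));
    }
}
```

A check like this would only pick the log message or the handling path; whether the process should exit would still depend on `taskmanager.jvm-exit-on-oom`.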
