azagrebin commented on issue #11408: [FLINK-15989][FLINK-16225] Improve direct 
and metaspace out-of-memory error handling
URL: https://github.com/apache/flink/pull/11408#issuecomment-601657949
 
 
   Thanks for the explanation @tillrohrmann 
   
   I agree we can address the other failure handling cases separately; this PR 
was supposed to focus on OOMs in user code loading and invocation in `Task#doRun`. 
The problem in the PR was that, as you mentioned, the `FatalErrorHandler` was 
called for the user code failure case instead of only for cleanup. I fixed this.
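
   A minimal sketch of the intended split, assuming a user code failure should 
only fail the task while a cleanup failure is escalated. `TaskFailureRouting`, 
`FatalHandler` and `Outcome` are hypothetical names for illustration, not Flink's 
actual classes, and the real `Task#doRun` is more involved:

```java
// Hypothetical sketch; none of these names are Flink's actual API.
final class TaskFailureRouting {

    /** Stand-in for a handler that escalates process-wide failures. */
    interface FatalHandler {
        void onFatalError(Throwable t);
    }

    enum Outcome { FINISHED, FAILED, FATAL }

    static Outcome run(Runnable userCode, Runnable cleanup, FatalHandler fatalHandler) {
        Outcome outcome = Outcome.FINISHED;
        try {
            userCode.run();
        } catch (Throwable t) {
            // A user code failure (including an OOM) only fails this task;
            // the fatal handler is deliberately not involved here.
            outcome = Outcome.FAILED;
        }
        try {
            cleanup.run();
        } catch (Throwable t) {
            // A cleanup failure may leave the process in an inconsistent state,
            // so this is the only path that escalates to the fatal handler.
            fatalHandler.onFatalError(t);
            outcome = Outcome.FATAL;
        }
        return outcome;
    }
}
```

   With this shape, an `OutOfMemoryError` thrown by `userCode` yields `FAILED` 
without the fatal handler being invoked, while an exception from `cleanup` 
yields `FATAL`.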
   
   I checked how it looks in the logs with a [simple 
job](https://github.com/azagrebin/flink/commit/a09b077cc60c8c0194f2d2242f6352a8b7ac7915#diff-ec75df84b8550ffec1ac164ffb9c0909).
 Generating a real Metaspace OOM is tricky because, depending on the limit, the 
JVM crashes in various places, including the error handling itself.
   
   `taskmanager.jvm-exit-on-oom` is false by default. Do you think we should 
change this? I agree that different OOMs can affect other tasks differently, but 
I am not sure about complicating the logic further with more options.
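
   To make the trade-off concrete, here is a rough sketch of telling OOM kinds 
apart and gating JVM exit on a single flag. `OomPolicy` and its methods are 
hypothetical and are not how Flink implements this. The message substrings are 
an assumption about HotSpot's error messages and may differ across JDK versions:

```java
// Hypothetical sketch; OomPolicy is not part of Flink.
final class OomPolicy {

    enum Kind { METASPACE, DIRECT, OTHER_OOM, NOT_OOM }

    /** Classifies a throwable by the (assumed, JVM-specific) OOM message. */
    static Kind classify(Throwable t) {
        if (!(t instanceof OutOfMemoryError)) {
            return Kind.NOT_OOM;
        }
        String msg = t.getMessage();
        if (msg == null) {
            return Kind.OTHER_OOM;
        }
        String lower = msg.toLowerCase();
        if (lower.contains("metaspace")) {
            return Kind.METASPACE;
        }
        if (lower.contains("direct buffer memory")) {
            return Kind.DIRECT;
        }
        return Kind.OTHER_OOM;
    }

    /**
     * One flag for all OOM kinds: exitOnOom stands in for the value of
     * taskmanager.jvm-exit-on-oom. Per-kind behaviour would need more options.
     */
    static boolean shouldExitJvm(Throwable t, boolean exitOnOom) {
        return exitOnOom && classify(t) != Kind.NOT_OOM;
    }
}
```

   With the current default (`false`), `shouldExitJvm` never returns `true`. 
Treating Metaspace or direct memory OOMs differently from heap OOMs would mean 
replacing the single boolean with a per-kind policy, which is the extra 
complexity I would like to avoid.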
