tillrohrmann commented on issue #11408: [FLINK-15989][FLINK-16225] Improve direct and metaspace out-of-memory error handling
URL: https://github.com/apache/flink/pull/11408#issuecomment-600520719

I think you are right that OOMs can happen everywhere and that we most likely won't cover all susceptible places with this PR. But I believe this is also ok. The goal of this PR should be to set up a common framework for how to handle these exceptions and how to tell the user about them. Once we have this, it is a matter of finding the most problematic places and adding the respective handling logic. By setting up a framework I mean having a long-term vision of how things should eventually work.

### How do we want to react to OOMs in general?

In order to answer this question I believe we need to distinguish between the different types of OOMs (heap, direct memory, metaspace) and where they are happening (user code, framework). I believe that this point is out of scope for this PR but is still important in order to evolve Flink in the right direction.

#### Framework

At the moment we don't handle OOMs in the framework consistently. Sometimes we call the `FatalErrorHandler` and sometimes we simply return the OOM as a response to RPCs. I think it would be safest to fail the process if an OOM occurs within the framework, because it is hard to guarantee that the framework is still in a consistent state. One question could be whether we want to treat different OOMs differently, for example by continuing in case of a metaspace OOM. This might be an optimization, but in the first version I would be in favour of treating all OOMs the same way.

#### User code

At the moment, we treat OOMs as user code failures unless `taskmanager.jvm-exit-on-oom` has been set to true. In the latter case, we exit the JVM hard. If there is an OOM originating from user code, the least we need to do is to fail the task. I think one can make cases for either killing the process or not.
If the task failed because it exceeded the direct memory limit, then other tasks might not be affected by it. On the other hand, there should always be an option to tell Flink to fail the process in case of OOMs. Maybe a robust solution would be to fail the process by default and allow turning this off.

### What do we want to tell the user?

For certain errors, we want to provide better error messages pointing the user towards a potential solution. This warrants a general facility for adding diagnostics to an exception. Maybe one could introduce `ExceptionUtils#enrichErrorMessage`, which takes an exception and enriches its error message. Alternatively, we could have `ExceptionUtils#logDiagnostics`.

### Where do we react?

Usually, error handling should be done on the level where the error is recoverable or where it leaves the component (before sending it to another component). In the case of OOMs occurring in the context of the framework, we are talking about the fatal error handler. In order to treat OOMs consistently across RPCs and methods which directly call the fatal error handler, we might need to extend the `RpcEndpoint` to allow passing in a `FatalErrorHandler`.

For user code exceptions, the natural fit would be `Task#doRun`, where we react to user code exceptions and fail the task. One thing to add here is that if we treat the user code OOM as non-fatal, then it would also be good to enrich the error message of the exception we are sending to the JM, where it is also logged. This would be an argument for wrapping the original exception or enriching the exception message instead of logging some statements about the exception.

So what I would suggest for this PR is to cover user code OOMs. It could be as easy as wrapping the user code exception with a more meaningful error message (or enriching the error message) and making sure that it is logged the way we want.
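
To make the enrichment idea a bit more concrete, here is a minimal sketch of what an `enrichErrorMessage`-style helper could look like. Everything in it is an assumption for illustration only: the class `OomDiagnostics`, the `OomKind` enum and the hint texts are hypothetical and not existing Flink code, and distinguishing OOM types by matching the HotSpot error messages ("Java heap space", "Direct buffer memory", "Metaspace") is just one possible approach that other JVMs may not support.

```java
/** Hypothetical helper for illustration; not part of Flink's ExceptionUtils. */
final class OomDiagnostics {

    /** The OOM types distinguished in the discussion above. */
    enum OomKind { HEAP, DIRECT, METASPACE, OTHER }

    private OomDiagnostics() {}

    /** Classifies an OOM by its message (assumes HotSpot-style messages). */
    static OomKind classify(OutOfMemoryError oom) {
        String message = oom.getMessage();
        if (message == null) {
            return OomKind.OTHER;
        }
        if (message.contains("Metaspace")) {
            return OomKind.METASPACE;
        }
        if (message.contains("Direct buffer memory")) {
            return OomKind.DIRECT;
        }
        if (message.contains("Java heap space")) {
            return OomKind.HEAP;
        }
        return OomKind.OTHER;
    }

    /**
     * Returns an OOM with an enriched message for known OOM types.
     * Any other throwable is returned unchanged.
     */
    static Throwable enrichErrorMessage(Throwable t) {
        if (!(t instanceof OutOfMemoryError)) {
            return t;
        }
        String hint;
        switch (classify((OutOfMemoryError) t)) {
            case DIRECT:
                // Placeholder hint; a real message would name concrete options.
                hint = "The direct memory limit was exceeded. "
                        + "Consider increasing the configured off-heap memory.";
                break;
            case METASPACE:
                hint = "The metaspace limit was exceeded. "
                        + "Consider increasing the configured metaspace size "
                        + "or check for class loading leaks.";
                break;
            default:
                // No specific advice for heap or unknown OOMs in this sketch.
                return t;
        }
        OutOfMemoryError enriched = new OutOfMemoryError(t.getMessage() + ". " + hint);
        // Keep the original error reachable and preserve its stack trace.
        enriched.initCause(t);
        enriched.setStackTrace(t.getStackTrace());
        return enriched;
    }
}
```

The sketch returns a new `OutOfMemoryError` rather than a different exception type, so that existing checks for OOMs further up the call chain (for example around `taskmanager.jvm-exit-on-oom`) would still see an OOM. Whether wrapping or in-place enrichment is preferable is exactly the kind of decision the framework would need to settle.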
