[GitHub] [flink] tillrohrmann commented on issue #11408: [FLINK-15989][FLINK-16225] Improve direct and metaspace out-of-memory error handling

GitBox Wed, 18 Mar 2020 02:42:50 -0700

tillrohrmann commented on issue #11408: [FLINK-15989][FLINK-16225] Improve 
direct and metaspace out-of-memory error handling
URL: https://github.com/apache/flink/pull/11408#issuecomment-600520719
 
 
   I think you are right that OOMs can happen everywhere and that we most 
likely won't cover all susceptible places with this PR. But I believe this is 
also ok. The goal of this PR should be to set up a common framework for how to 
handle these exceptions and how to tell the user about them. Once we have this, 
it is a matter of finding the most problematic places and adding the respective 
handling logic. By setting up a framework I mean having a long-term vision of 
how things should eventually work.
   
   ### How do we want to react to OOMs in general?
   
   In order to answer this question I believe we need to distinguish between 
the different types of OOMs (heap, direct memory, metaspace) and where they are 
happening (user code, framework).
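
   As a rough sketch of the first distinction, the OOM type can be guessed from the error message. The class `OomClassifier` and the enum `OomKind` are hypothetical names and not part of Flink. The matched strings are the messages HotSpot JVMs commonly emit; other JVMs or versions may word them differently, so this is a heuristic.

```java
import java.util.Locale;

/** Hypothetical helper (not part of Flink): guesses the OOM type from the message. */
final class OomClassifier {

    enum OomKind { HEAP, DIRECT, METASPACE, UNKNOWN }

    private OomClassifier() {}

    /**
     * Classifies an OutOfMemoryError by its message text.
     * Assumption: the JVM uses HotSpot-style messages such as
     * "Java heap space", "Direct buffer memory" or "Metaspace".
     */
    static OomKind classify(OutOfMemoryError error) {
        String message = error.getMessage();
        if (message == null) {
            return OomKind.UNKNOWN;
        }
        String lower = message.toLowerCase(Locale.ROOT);
        if (lower.contains("direct buffer memory")) {
            return OomKind.DIRECT;
        }
        if (lower.contains("metaspace")) {
            return OomKind.METASPACE;
        }
        if (lower.contains("java heap space")) {
            return OomKind.HEAP;
        }
        // Anything else (e.g. thread creation failures) is left unclassified.
        return OomKind.UNKNOWN;
    }
}
```

   Whether the OOM came from user code or from the framework cannot be read off the message; that has to be decided by the place where the error is caught.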
   
   I believe that this point is out of scope for this PR but it is still 
important in order to evolve Flink in the right direction.
   
   #### Framework
   
   At the moment we don't handle OOMs in the framework consistently. Sometimes 
we call the `FatalErrorHandler` and sometimes we simply return the OOM as a 
response to RPCs.
   
   I think it would be safest to fail the process if an OOM occurs within the 
framework, because it is hard to guarantee that the framework is still in a 
consistent state.
   
   One question is whether we want to treat different OOMs differently, for 
example by continuing in case of a metaspace OOM. This might be an optimization, 
but in the first version I would be in favour of treating all OOMs the same way.
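
   A minimal sketch of the "treat all framework OOMs the same" idea could look as follows. `FrameworkOomGuard` and its nested `FatalErrorCallback` are hypothetical stand-ins declared locally so that the example compiles on its own; they are not Flink's actual classes. The sketch assumes that any `OutOfMemoryError` in the cause chain should be escalated instead of being returned as an ordinary response.

```java
/** Hypothetical sketch (not Flink code): escalate every framework OOM the same way. */
final class FrameworkOomGuard {

    /** Local stand-in for a fatal error callback that would terminate the process. */
    interface FatalErrorCallback {
        void onFatalError(Throwable error);
    }

    private FrameworkOomGuard() {}

    /** Returns true if an OutOfMemoryError appears anywhere in the cause chain. */
    static boolean containsOom(Throwable failure) {
        Throwable current = failure;
        int depth = 0;
        // The depth bound protects against cyclic cause chains.
        while (current != null && depth++ < 64) {
            if (current instanceof OutOfMemoryError) {
                return true;
            }
            current = current.getCause();
        }
        return false;
    }

    /**
     * Escalates the failure if it contains an OOM.
     *
     * @return true if the callback was invoked, false if the caller may
     *         report the failure as a normal response
     */
    static boolean escalateIfOom(Throwable failure, FatalErrorCallback callback) {
        if (containsOom(failure)) {
            callback.onFatalError(failure);
            return true;
        }
        return false;
    }
}
```

   The OOM type is deliberately ignored here; distinguishing, say, metaspace OOMs would be a later refinement.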
   
   #### User code
   
   At the moment, we treat OOMs as user code failures unless 
`taskmanager.jvm-exit-on-oom` has been set to true. In the latter case, we exit 
the JVM hard.
   
   If there is an OOM originating from user code, the least we need to do is to 
fail the task. I think one can make cases for either killing the process or 
not. If the task failed because it exceeded the direct memory limit, then other 
tasks might not be affected by it. On the other hand, there should always be an 
option to tell Flink to fail the process in case of OOMs.
   
   Maybe a robust solution would be to fail the process by default and to 
allow users to turn this off.
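
   The proposed default can be sketched as a small policy object. `UserCodeOomPolicy` and `Reaction` are hypothetical names. Note that the default used here (fail the process) is the one suggested above and is the opposite of the current behaviour, where `taskmanager.jvm-exit-on-oom` has to be set to true explicitly.

```java
/** Hypothetical sketch (not Flink code) of the proposed user code OOM policy. */
final class UserCodeOomPolicy {

    enum Reaction { FAIL_TASK, FAIL_PROCESS }

    private final boolean failProcessOnOom;

    UserCodeOomPolicy(boolean failProcessOnOom) {
        this.failProcessOnOom = failProcessOnOom;
    }

    /** Proposed default: fail the process unless the user switches it off. */
    static UserCodeOomPolicy defaults() {
        return new UserCodeOomPolicy(true);
    }

    /** The task always fails; the process only fails for OOMs when enabled. */
    Reaction reactTo(Throwable userCodeFailure) {
        if (userCodeFailure instanceof OutOfMemoryError && failProcessOnOom) {
            return Reaction.FAIL_PROCESS;
        }
        return Reaction.FAIL_TASK;
    }
}
```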
   
   ### What do we want to tell the user?
   
   For certain errors, we want to provide better error messages pointing the 
user towards a potential solution. This calls for a general facility to add 
diagnostics to an exception. Maybe one could introduce 
`ExceptionUtils#enrichErrorMessage` which takes an exception and enriches the 
error message. Alternatively, we could have `ExceptionUtils#logDiagnostics`.
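
   To make the enrichment idea concrete, here is a minimal sketch under stated assumptions. The class `OomMessages` and the method `enrich` are hypothetical and only illustrate what something like `ExceptionUtils#enrichErrorMessage` might do. Since `Throwable` offers no public way to change its message, the sketch creates a new error and carries over the original stack trace.

```java
/** Hypothetical sketch (not Flink code) of enriching an OOM message with a hint. */
final class OomMessages {

    private OomMessages() {}

    /**
     * Returns a new OutOfMemoryError whose message is the original message
     * followed by the given hint. The original stack trace is preserved so
     * that the error still points at the allocation site.
     */
    static OutOfMemoryError enrich(OutOfMemoryError original, String hint) {
        String base = original.getMessage();
        String message = (base == null || base.isEmpty()) ? hint : base + ". " + hint;
        OutOfMemoryError enriched = new OutOfMemoryError(message);
        enriched.setStackTrace(original.getStackTrace());
        return enriched;
    }
}
```

   For example, `OomMessages.enrich(error, "Consider increasing the configured memory limit")` keeps the JVM's text and appends the hint; the concrete hint would depend on the OOM type.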
   
   ### Where do we react?
   
   Usually, error handling should be done on the level where the error is 
recoverable or where it leaves the component (before sending it to another 
component). In the case of OOMs occurring in the context of the framework, we 
are talking about the fatal error handler. In order to treat OOMs consistently 
across RPCs and methods which directly call the fatal error handler, we might 
need to extend the `RpcEndpoint` to allow passing in a `FatalErrorHandler`.
   
   For user code exceptions, the natural fit would be `Task#doRun`, where we 
react to user code exceptions and fail the task. One thing to add here is that 
if we treat the user code OOM as non-fatal, then it would also be good to 
enrich the error message of the exception we are sending to the JM, where it is 
also logged. This would be an argument for wrapping the original exception or 
enriching the exception message instead of logging some statements about the 
exception.
   
   So what I would suggest for this PR is to cover user code OOMs. It could be 
as easy as wrapping the user code exception with a more meaningful error 
message (or enriching the existing one) and making sure that it is logged the 
way we want.
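
   The wrapping variant could be sketched like this. `UserCodeRunner` and `UserCodeOomException` are hypothetical names, and `runUserCode` is only a simplified stand-in for the place where a task invokes user code; the real `Task#doRun` does considerably more. The hint text is a placeholder as well.

```java
/** Hypothetical sketch (not Flink code): wrap user code OOMs at the task boundary. */
final class UserCodeRunner {

    /** Hypothetical wrapper that carries a user-facing hint and keeps the OOM as cause. */
    static final class UserCodeOomException extends RuntimeException {
        UserCodeOomException(String message, OutOfMemoryError cause) {
            super(message, cause);
        }
    }

    private UserCodeRunner() {}

    /**
     * Runs the user code and converts an OutOfMemoryError into an exception
     * with a more meaningful message. The caller is expected to fail the
     * task with the thrown exception and report it onwards.
     */
    static void runUserCode(Runnable userCode) {
        try {
            userCode.run();
        } catch (OutOfMemoryError oom) {
            throw new UserCodeOomException(
                    "User code ran out of memory (" + oom.getMessage()
                            + "). Check the memory configuration of the process.",
                    oom);
        }
    }
}
```

   Because the original error is kept as the cause, whoever logs the wrapped exception still sees the full stack trace of the OOM.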


