Hello,

We use Flink with its native Kubernetes integration, and we have trouble
detecting OOM kills of Task Manager pods: since those pods are managed
entirely by the Job Manager, it's hard to track why they terminated. In
such cases, the resource manager driver logs something like this:

Worker foo-taskmanager-1-1 is terminated. Diagnostics: Pod terminated,
container termination statuses: [flink-main-container(exitCode=137,
reason=OOMKilled, message=null)]

So there is already a component that receives the relevant information, but
I don't know whether it can be exposed through metrics. I briefly looked
through the source code, and it doesn't look like the driver has access to
Flink's metric system, so I wanted to ask here whether it would make sense
to add something like that.
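
For context, the only workaround I can think of right now is scraping that
log line outside of Flink. Below is a rough sketch of what I mean, not
something we run in production. The regex is derived only from the single
log sample above (joined onto one line), so the real format may differ,
and the names parse_termination and count_reasons are hypothetical helpers
of mine, not part of Flink.

```python
import re
from collections import Counter

# Pattern based on the one log sample quoted above; treat it as an assumption.
# It expects the whole message on a single line (no email-style wrapping).
TERMINATION_RE = re.compile(
    r"Worker (?P<worker>\S+) is terminated\."
    r".*?(?P<container>[\w.-]+)\(exitCode=(?P<exit_code>-?\d+),\s*"
    r"reason=(?P<reason>\w+)"
)


def parse_termination(line):
    """Return worker, container, exit code and reason, or None if no match."""
    match = TERMINATION_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["exit_code"] = int(info["exit_code"])
    return info


def count_reasons(lines):
    """Count termination reasons (e.g. OOMKilled) across many log lines."""
    counts = Counter()
    for line in lines:
        info = parse_termination(line)
        if info is not None:
            counts[info["reason"]] += 1
    return counts
```

The counts could then be pushed to whatever metrics backend is in use, but
that depends on log parsing staying in sync with the message format, which
is why a metric exposed by Flink itself would be preferable.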

Regards,
Alexis.
