Hello, we use Flink with its native Kubernetes integration, and we are having trouble detecting OOM kills of TaskManagers: because their pods are managed entirely by the JobManager, it's hard to track why they were terminated. In such cases, the resource manager driver logs something like this:
Worker foo-taskmanager-1-1 is terminated. Diagnostics: Pod terminated, container termination statuses: [flink-main-container(exitCode=137, reason= OOMKilled, message=null)]

So there is already a component that receives the relevant information, but I don't know whether it's possible to expose it through metrics. I briefly looked through the source code, and I don't think the driver has access to Flink's metric system, so I wanted to ask here whether it would make sense to add something like that.

Regards,
Alexis.
