[ https://issues.apache.org/jira/browse/FLINK-35103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weijie Guo updated FLINK-35103:
-------------------------------
    Affects Version/s: 2.1.0

> [Plugin] Enhancing Flink Failure Management in Kubernetes with Dynamic
> Termination Log Integration
> --------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-35103
>                 URL: https://issues.apache.org/jira/browse/FLINK-35103
>             Project: Flink
>          Issue Type: Improvement
>          Components: API / Core
>    Affects Versions: 2.1.0
>            Reporter: SwathiChandrashekar
>            Priority: Not a Priority
>              Labels: pull-request-available
>             Fix For: 2.0.0
>
>         Attachments: Status-pod.png, screenshot-1.png
>
>
> Currently, whenever a Flink failure occurs, we have to triage it manually
> by digging through the Flink logs, even for the initial analysis. It would
> be better if the user/admin received the initial failure information
> directly, before looking into the logs.
> To address this, we've developed a plugin that captures Flink failures, so
> that critical failure data is preserved for subsequent analysis and action.
>
> In Kubernetes environments, troubleshooting pod failures can be challenging
> without checking the pod/Flink logs. Kubernetes offers a mechanism that
> helps here: the /dev/termination-log file.
> [https://kubernetes.io/docs/tasks/debug/debug-application/determine-reason-pod-failure/]
> When a container writes failure information to this file, Kubernetes
> automatically incorporates it into the container status, giving
> administrators and developers insight into the root cause of the failure.
> Our solution uses this Kubernetes feature to integrate Flink failure
> reporting with the container status. Whenever Flink encounters a failure,
> our plugin captures the pertinent failure information and writes it to the
> /dev/termination-log file.
> This ensures that Kubernetes picks up the failure and surfaces it in the
> container status, where monitoring and response mechanisms can act on it.
> Because the plugin relies on native Kubernetes functionality, Flink
> failures are promptly identified and reflected in the pod status. Operators
> can then diagnose and address issues without first searching the logs,
> which shortens debugging and reduces downtime.
>
> In order to keep this plugin generic, it takes no action by default.
> It can be enabled by setting
> *external.log.factory.class :
> org.apache.flink.externalresource.log.K8SSupportTerminationLog*
> in our flink-conf file.
> The plugin will be present in the plugins directory.
> Sample output of the Flink pod container status when there is a Flink
> failure:
> !screenshot-1.png!
> Here, the user can clearly see that there was an auth issue and resolve it
> without checking the complete underlying logs.
>
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
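
The termination-log mechanism described in the issue can be illustrated with a minimal sketch. This is not the plugin's actual code: the class name TerminationLogWriter and its methods are hypothetical and use only the Java standard library. It assumes the container uses the Kubernetes default terminationMessagePath (/dev/termination-log), and it caps the message at 4096 bytes because the kubelet truncates longer termination messages.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical helper for illustration only; not the class shipped with the plugin. */
final class TerminationLogWriter {

    /** Default terminationMessagePath used by Kubernetes. */
    static final Path DEFAULT_PATH = Paths.get("/dev/termination-log");

    /** The kubelet truncates termination messages longer than 4096 bytes. */
    static final int MAX_BYTES = 4096;

    private TerminationLogWriter() {}

    /** Builds a short summary: the top-level exception plus its root cause, if any. */
    static String summarize(Throwable failure) {
        Throwable root = failure;
        while (root.getCause() != null) {
            root = root.getCause();
        }
        StringBuilder sb = new StringBuilder();
        sb.append(failure.getClass().getName()).append(": ").append(failure.getMessage());
        if (root != failure) {
            sb.append(" | root cause: ")
              .append(root.getClass().getName()).append(": ").append(root.getMessage());
        }
        return sb.toString();
    }

    /** Writes the summary to the given file, replacing any previous content. */
    static void write(Path target, Throwable failure) throws IOException {
        byte[] bytes = summarize(failure).getBytes(StandardCharsets.UTF_8);
        if (bytes.length > MAX_BYTES) {
            // Byte-level cut; this may split a multi-byte character at the boundary.
            bytes = Arrays.copyOf(bytes, MAX_BYTES);
        }
        Files.write(target, bytes);
    }

    public static void main(String[] args) throws IOException {
        // Outside a pod, write to a temp file instead of DEFAULT_PATH.
        Path target = args.length > 0
                ? Paths.get(args[0])
                : Files.createTempFile("termination-log", ".txt");
        Throwable failure =
                new IllegalStateException("Job failed", new SecurityException("Auth token expired"));
        write(target, failure);
        System.out.println(new String(Files.readAllBytes(target), StandardCharsets.UTF_8));
    }
}
```

Inside a pod, passing TerminationLogWriter.DEFAULT_PATH as the target would make the summary appear in the container's last terminated state once the container exits, which is the information shown in the pod status.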