[ 
https://issues.apache.org/jira/browse/FLINK-35103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weijie Guo updated FLINK-35103:
-------------------------------
    Affects Version/s: 2.1.0

> [Plugin] Enhancing Flink Failure Management in Kubernetes with Dynamic 
> Termination Log Integration
> --------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-35103
>                 URL: https://issues.apache.org/jira/browse/FLINK-35103
>             Project: Flink
>          Issue Type: Improvement
>          Components: API / Core
>    Affects Versions: 2.1.0
>            Reporter: SwathiChandrashekar
>            Priority: Not a Priority
>              Labels: pull-request-available
>             Fix For: 2.0.0
>
>         Attachments: Status-pod.png, screenshot-1.png
>
>
> Currently, whenever a Flink failure occurs, we have to triage it manually by 
> reading the Flink logs, even for the initial analysis. It would be better if 
> the user or admin received the initial failure information directly, before 
> looking into the logs.
> To address this, we have developed a plugin that captures Flink failure 
> information so that it is preserved for subsequent analysis and action.
>  
> In Kubernetes environments, troubleshooting pod failures is difficult 
> without checking the pod or Flink logs. Kubernetes offers a mechanism that 
> helps with this: the /dev/termination-log file.
> [https://kubernetes.io/docs/tasks/debug/debug-application/determine-reason-pod-failure/]
> When a container writes failure information to this file, Kubernetes 
> includes it in the container status, giving administrators and developers 
> insight into the root cause of the failure.
> Our plugin uses this Kubernetes feature to surface Flink failures in the 
> container status. Whenever Flink encounters a failure, the plugin writes the 
> relevant failure information to the /dev/termination-log file, so the 
> failure is reflected in the pod status and is visible to any monitoring that 
> reads that status.
> Operators can therefore make a first diagnosis from the pod status alone, 
> which shortens debugging.
>  
> To keep this plugin generic, it takes no action by default. It can be 
> enabled by setting
> *external.log.factory.class : 
> org.apache.flink.externalresource.log.K8SSupportTerminationLog*
> in the flink-conf file.
> The plugin will be present in the plugins directory.
> Sample output of the Flink pod container status when there is a Flink failure:
>  !screenshot-1.png! 
> Here the user can clearly see that there was an auth issue and resolve it 
> without checking the complete underlying logs.
>  
>  
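
The termination-log mechanism described in the quoted issue can be sketched
in a few lines of Java. This is only an illustration under stated assumptions,
not the plugin's actual implementation: the class TerminationLogWriter and its
methods are hypothetical names. The only external facts it relies on are
Kubernetes' default terminationMessagePath (/dev/termination-log) and the
4096-byte size after which the node truncates the message.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical helper (not part of Flink): writes a short failure summary. */
final class TerminationLogWriter {

    /** Default Kubernetes terminationMessagePath. */
    static final Path DEFAULT_PATH = Paths.get("/dev/termination-log");

    /** Kubernetes truncates termination messages longer than 4096 bytes. */
    static final int MAX_BYTES = 4096;

    private TerminationLogWriter() {}

    /** Builds "<root cause class>: <message>" as UTF-8, capped at MAX_BYTES. */
    static byte[] summarize(Throwable failure) {
        // Walk the cause chain so the summary names the root cause.
        Throwable root = failure;
        while (root.getCause() != null && root.getCause() != root) {
            root = root.getCause();
        }
        String text = root.getClass().getName();
        if (root.getMessage() != null) {
            text += ": " + root.getMessage();
        }
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        // Plain byte truncation: a multi-byte character at the cut may be split.
        return bytes.length > MAX_BYTES ? Arrays.copyOf(bytes, MAX_BYTES) : bytes;
    }

    /** Overwrites the target file with the failure summary. */
    static void write(Path target, Throwable failure) throws IOException {
        Files.write(target, summarize(failure));
    }
}
```

A failure handler could call
TerminationLogWriter.write(TerminationLogWriter.DEFAULT_PATH, cause) shortly
before the process exits. Kubernetes would then report the text as the
termination message of the terminated container in the pod status.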



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
