[jira] [Commented] (FLINK-35103) [Plugin] Enhancing Flink Failure Management in Kubernetes with Dynamic Termination Log Integration

SwathiChandrashekar (Jira) Tue, 16 Apr 2024 00:09:14 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-35103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17837541#comment-17837541
 ]


SwathiChandrashekar commented on FLINK-35103:
---------------------------------------------

Thanks [~martijnvisser]. I have started a discussion for FLIP-XXX:
[https://docs.google.com/document/d/1tWR0Fi3w7VQeD_9VUORh8EEOva3q-V0XhymTkNaXHOc/edit?usp=sharing]
[DISCUSS] FLIP-XXX: [Plugin] Enhancing Flink Failure Management in 
Kubernetes with Dynamic Termination Log Integration
Please share your input on the proposal.

> [Plugin] Enhancing Flink Failure Management in Kubernetes with Dynamic 
> Termination Log Integration
> --------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-35103
>                 URL: https://issues.apache.org/jira/browse/FLINK-35103
>             Project: Flink
>          Issue Type: Improvement
>          Components: API / Core
>            Reporter: SwathiChandrashekar
>            Priority: Not a Priority
>              Labels: pull-request-available
>             Fix For: 1.20.0
>
>         Attachments: Status-pod.png, screenshot-1.png
>
>
> Currently, whenever there is a Flink failure, we need to triage it manually 
> by looking into the Flink logs, even for the initial analysis. It would be 
> better if the user/admin got the initial failure information directly, 
> before looking into the logs.
> To address this, we've developed a plugin that captures Flink failure 
> information and preserves it for subsequent analysis and action.
>  
> In Kubernetes environments, troubleshooting pod failures can be challenging 
> without checking the pod/Flink logs. Kubernetes offers a mechanism that 
> helps here: the /dev/termination-log file.
> [https://kubernetes.io/docs/tasks/debug/debug-application/determine-reason-pod-failure/]
> When failure information is written to this file, Kubernetes automatically 
> includes it in the container status, giving administrators and developers 
> insight into the root cause of the failure.
> Our solution uses this Kubernetes feature to report Flink failures. Whenever 
> Flink encounters an issue, our plugin captures the pertinent failure 
> information and writes it to the /dev/termination-log file, so that the 
> failure is reflected in the pod status.
> This lets operators diagnose and address issues quickly, without first 
> digging through the logs, which reduces downtime.
>  
> To keep this plugin generic, it takes no action by default. 
> It can be enabled by setting
> *external.log.factory.class : 
> org.apache.flink.externalresource.log.K8SSupportTerminationLog*
> in the flink-conf file.
> The plugin will be present in the plugins directory.
> Sample output of the Flink pod container status when there is a Flink 
> failure:
>  !screenshot-1.png! 
> Here, the user can clearly see that there was an auth issue and resolve it 
> without checking the complete underlying logs.
>  
>  
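
For illustration, the mechanism described in the quoted issue could be 
sketched in Java roughly as follows. This is not the plugin's code: the class 
TerminationLogWriter and its methods are hypothetical names made up for this 
example. The sketch assumes the default Kubernetes termination message path 
/dev/termination-log and the 4096-byte per-container limit beyond which the 
kubelet truncates the message.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of Flink or the proposed plugin. */
class TerminationLogWriter {

    /** Default Kubernetes terminationMessagePath (assumed not to be overridden). */
    static final Path DEFAULT_PATH = Paths.get("/dev/termination-log");

    /** The kubelet truncates termination messages longer than 4096 bytes. */
    static final int MAX_BYTES = 4096;

    /** Builds a short summary: exception class name plus message, if any. */
    static String summarize(Throwable failure) {
        String message = failure.getMessage();
        return failure.getClass().getName() + (message == null ? "" : ": " + message);
    }

    /** Writes the summary to the given path, cut to at most MAX_BYTES bytes. */
    static void write(Path path, Throwable failure) throws IOException {
        byte[] bytes = summarize(failure).getBytes(StandardCharsets.UTF_8);
        if (bytes.length > MAX_BYTES) {
            // A plain byte cut; it may split a multi-byte UTF-8 character.
            bytes = Arrays.copyOf(bytes, MAX_BYTES);
        }
        Files.write(path, bytes);
    }

    public static void main(String[] args) throws IOException {
        // Outside a container, pass a file path as the first argument;
        // without arguments a temporary file stands in for DEFAULT_PATH.
        Path path = args.length > 0
                ? Paths.get(args[0])
                : Files.createTempFile("termination-log", ".txt");
        write(path, new IllegalStateException("Authentication failed"));
        System.out.println(new String(Files.readAllBytes(path), StandardCharsets.UTF_8));
    }
}
```

In a pod, the written text would then be visible in the container's last 
terminated state, for example via kubectl describe pod, as in the screenshot 
referenced in the issue.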



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
