[
https://issues.apache.org/jira/browse/FLINK-35103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17837541#comment-17837541
]
SwathiChandrashekar commented on FLINK-35103:
---------------------------------------------
Thanks [~martijnvisser]. I have started a discussion for FLIP-XXX.
[https://docs.google.com/document/d/1tWR0Fi3w7VQeD_9VUORh8EEOva3q-V0XhymTkNaXHOc/edit?usp=sharing]
[DISCUSS] FLIP-XXX: [Plugin] Enhancing Flink Failure Management in
Kubernetes with Dynamic Termination Log Integration
Please share your input on the proposal.
> [Plugin] Enhancing Flink Failure Management in Kubernetes with Dynamic
> Termination Log Integration
> --------------------------------------------------------------------------------------------------
>
> Key: FLINK-35103
> URL: https://issues.apache.org/jira/browse/FLINK-35103
> Project: Flink
> Issue Type: Improvement
> Components: API / Core
> Reporter: SwathiChandrashekar
> Priority: Not a Priority
> Labels: pull-request-available
> Fix For: 1.20.0
>
> Attachments: Status-pod.png, screenshot-1.png
>
>
> Currently, whenever a Flink failure occurs, we have to triage it manually by
> reading the Flink logs, even for the initial analysis. It would be better if
> the user/admin received the initial failure information directly, before
> looking into the logs.
> To address this, we have developed a plugin that captures Flink failure
> information and preserves it for subsequent analysis and action.
>
> In Kubernetes environments, troubleshooting pod failures is difficult
> without checking the pod/Flink logs. Kubernetes offers a mechanism that
> helps here: the /dev/termination-log file.
> [https://kubernetes.io/docs/tasks/debug/debug-application/determine-reason-pod-failure/]
> When a container writes failure information to this file, Kubernetes
> automatically includes it in the container status, giving administrators and
> developers insight into the root cause of the failure.
> Our plugin uses this Kubernetes feature to surface Flink failures in the
> container status. Whenever Flink encounters a failure, the plugin captures
> the relevant failure information and writes it to the /dev/termination-log
> file, so that the failure is reflected in the pod status and is visible to
> monitoring and response tooling without opening the logs.
> This streamlines debugging: operators can diagnose and address issues
> faster, which reduces downtime.
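As an illustration of the mechanism described above (not the plugin's actual
code), here is a minimal Java sketch that writes a one-line failure summary to
the termination log. The class name TerminationLogWriter and its methods are
hypothetical. The 4096-byte cap is an assumption based on the per-container
termination message limit described in the Kubernetes documentation, and
/dev/termination-log is the Kubernetes default path, which a pod spec may
override.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of Flink or of the proposed plugin. */
public final class TerminationLogWriter {

    /** Kubernetes default termination message path (can be overridden in the pod spec). */
    public static final Path DEFAULT_PATH = Paths.get("/dev/termination-log");

    /** Assumed cap on the message size, in bytes. */
    public static final int MAX_BYTES = 4096;

    private TerminationLogWriter() {}

    /** Builds a one-line summary from the root cause of the failure. */
    public static String summarize(Throwable failure) {
        Throwable root = failure;
        while (root.getCause() != null) {
            root = root.getCause();
        }
        return root.getClass().getName() + ": " + root.getMessage();
    }

    /**
     * Writes the summary to the given file, replacing any previous content.
     * The cut is made at a byte boundary, so a multi-byte UTF-8 character at
     * the very end of an over-long message may be split.
     */
    public static void write(Path target, Throwable failure) throws IOException {
        byte[] bytes = summarize(failure).getBytes(StandardCharsets.UTF_8);
        int length = Math.min(bytes.length, MAX_BYTES);
        Files.write(target, Arrays.copyOf(bytes, length));
    }
}
```

A failure handler could call TerminationLogWriter.write(TerminationLogWriter.DEFAULT_PATH, cause)
just before the process exits. Taking the path as a parameter keeps the sketch
testable outside a container.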
>
> To keep the plugin generic, it takes no action by default.
> It can be enabled by setting
> *external.log.factory.class :
> org.apache.flink.externalresource.log.K8SSupportTerminationLog*
> in the flink-conf file.
> The plugin will be present in the plugins directory.
> Sample output of the Flink pod container status when there is a Flink failure:
> !screenshot-1.png!
> Here, the user can clearly see that there was an auth issue and resolve it
> without checking the complete underlying logs.
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)