[jira] [Created] (FLINK-35103) [Plugin] Enhancing Flink Failure Management in Kubernetes with Dynamic Termination Log Integration
SwathiChandrashekar created FLINK-35103:
---
Summary: [Plugin] Enhancing Flink Failure Management in Kubernetes with Dynamic Termination Log Integration
Key: FLINK-35103
URL: https://issues.apache.org/jira/browse/FLINK-35103
Project: Flink
Issue Type: Improvement
Components: API / Core
Reporter: SwathiChandrashekar
Fix For: 1.20.0
Attachments: Status-pod.png

Currently, whenever a Flink failure occurs, we have to triage it manually by reading the Flink logs, even for the initial analysis. It would be better if the user/admin could see the initial failure information directly, before looking into the logs. To address this, we have developed a plugin that captures Flink failures and preserves the failure information for later analysis and action.

In Kubernetes environments, troubleshooting pod failures is hard without checking the pod/Flink logs. Kubernetes offers a mechanism that helps here: the /dev/termination-log file.
[https://kubernetes.io/docs/tasks/debug/debug-application/determine-reason-pod-failure/]
When a container writes failure information to this file, Kubernetes includes it in the container status, which gives administrators and developers a first indication of the root cause.

Our solution builds on this Kubernetes feature. Whenever Flink encounters a failure, the plugin captures the relevant failure information and writes it to the /dev/termination-log file. Kubernetes then surfaces the failure in the container status, so Flink failures are reflected in the pod status and can be picked up by existing monitoring and response mechanisms. This shortens the debugging process, because operators can diagnose an issue from the pod status before opening the logs.

In order to keep the plugin generic, it takes no action by default. It can be enabled with

*external.log.factory.class : org.apache.flink.externalresource.log.K8SSupportTerminationLog*

This class will be present in the plugins directory.

Please find attached the pod status (Status-pod.png).

--
This message was sent by Atlassian Jira (v8.20.10#820010)
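For readers unfamiliar with the termination-log mechanism that FLINK-35103 relies on, the following is a minimal sketch of the idea, not the plugin's actual code. The class name `TerminationLogSketch` and its methods are hypothetical and are not part of Flink. The sketch assumes the default Kubernetes termination message path (/dev/termination-log) and caps the message at 4096 bytes, the size after which the node truncates a termination message.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical helper (not a Flink class): writes a short failure summary to a termination log. */
final class TerminationLogSketch {

    /** Default path Kubernetes reads unless terminationMessagePath is overridden. */
    static final Path DEFAULT_PATH = Paths.get("/dev/termination-log");

    /** Longer termination messages are truncated by the node, so cap the summary here. */
    static final int MAX_BYTES = 4096;

    private TerminationLogSketch() {}

    /** Builds a one-line summary: the top-level exception plus its root cause, if different. */
    static String summarize(Throwable failure) {
        Throwable root = failure;
        while (root.getCause() != null && root.getCause() != root) {
            root = root.getCause();
        }
        String text = failure.getClass().getName() + ": " + failure.getMessage();
        if (root != failure) {
            text += " (root cause: " + root.getClass().getName() + ": " + root.getMessage() + ")";
        }
        return text;
    }

    /** Encodes text as UTF-8 and cuts it to at most maxBytes without splitting a character. */
    static byte[] truncateUtf8(String text, int maxBytes) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        if (bytes.length <= maxBytes) {
            return bytes;
        }
        int end = maxBytes;
        // Step back while the first excluded byte is a continuation byte (10xxxxxx).
        while (end > 0 && (bytes[end] & 0xC0) == 0x80) {
            end--;
        }
        return Arrays.copyOf(bytes, end);
    }

    /** Writes the summary of the failure to the given file, replacing previous content. */
    static void write(Path target, Throwable failure) throws IOException {
        Files.write(target, truncateUtf8(summarize(failure), MAX_BYTES));
    }

    public static void main(String[] args) throws IOException {
        // Outside a container /dev/termination-log does not exist, so demo with a temp file.
        Path target = Files.exists(DEFAULT_PATH)
                ? DEFAULT_PATH
                : Files.createTempFile("termination-log", ".txt");
        write(target, new RuntimeException("job failed", new IllegalStateException("demo cause")));
        System.out.println(new String(Files.readAllBytes(target), StandardCharsets.UTF_8));
    }
}
```

Inside a pod, whatever was written to the file appears in the container's last terminated state, which is what `kubectl describe pod` and the attached pod status would show. How the actual plugin hooks into Flink's failure handling is not covered by this sketch.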
[jira] [Created] (FLINK-30361) Flink cluster deleted while updating the replicas
SwathiChandrashekar created FLINK-30361:
---
Summary: Flink cluster deleted while updating the replicas
Key: FLINK-30361
URL: https://issues.apache.org/jira/browse/FLINK-30361
Project: Flink
Issue Type: Bug
Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.2.0
Reporter: SwathiChandrashekar

Whenever we update the task manager replicas of a Flink standalone cluster through the Flink CR, the change triggers a redeploy of the whole Flink cluster (delete + create of all components, JM and TM), as any other change to the CR does. This should not be required for a replica update. A replica update should leave the existing pods untouched: only a new TM pod should be added during a scale up, and only a TM pod should be deleted during a scale down.

Example tried: change the TM replicas from 2 to 3.

```
PS C:\Users\cswathi\Documents\hilo\flink-OSS-operator> kubectl get pods -w
NAME                                                              READY   STATUS              RESTARTS   AGE
basic-session-deployment-only-example-5dbbdf5dd8-cq8nb            0/1     ContainerCreating   0          1s
basic-session-deployment-only-example-taskmanager-77854fbb7vzvd   0/1     ContainerCreating   0          1s
basic-session-deployment-only-example-taskmanager-77854fbbg6vzs   0/1     ContainerCreating   0          1s
flink-kubernetes-operator-676897686f-5fc8r                        2/2     Running             0          18m
basic-session-deployment-only-example-5dbbdf5dd8-cq8nb            1/1     Running             0          1s
basic-session-deployment-only-example-taskmanager-77854fbb7vzvd   1/1     Running             0          1s
basic-session-deployment-only-example-taskmanager-77854fbbg6vzs   1/1     Running             0          13s
basic-session-deployment-only-example-taskmanager-77854fbb7vzvd   1/1     Terminating         0          65s
basic-session-deployment-only-example-5dbbdf5dd8-cq8nb            1/1     Terminating         0          65s
basic-session-deployment-only-example-taskmanager-77854fbbg6vzs   1/1     Terminating         0          65s
basic-session-deployment-only-example-taskmanager-77854fbb7vzvd   1/1     Terminating         0          66s
basic-session-deployment-only-example-5dbbdf5dd8-cq8nb            1/1     Terminating         0          66s
basic-session-deployment-only-example-taskmanager-77854fbbg6vzs   1/1     Terminating         0          66s
basic-session-deployment-only-example-taskmanager-77854fbb7vzvd   0/1     Terminating         0          66s
basic-session-deployment-only-example-taskmanager-77854fbb7vzvd   0/1     Terminating         0          66s
basic-session-deployment-only-example-taskmanager-77854fbb7vzvd   0/1     Terminating         0          66s
basic-session-deployment-only-example-taskmanager-77854fbbg6vzs   0/1     Terminating         0          67s
basic-session-deployment-only-example-taskmanager-77854fbbg6vzs   0/1     Terminating         0          67s
basic-session-deployment-only-example-taskmanager-77854fbbg6vzs   0/1     Terminating         0          67s
basic-session-deployment-only-example-5dbbdf5dd8-cq8nb            0/1     Terminating         0          67s
basic-session-deployment-only-example-5dbbdf5dd8-cq8nb            0/1     Terminating         0          67s
basic-session-deployment-only-example-5dbbdf5dd8-cq8nb            0/1     Terminating         0          67s
basic-session-deployment-only-example-588474bf97-nng85            0/1     Pending             0          0s
basic-session-deployment-only-example-588474bf97-nng85            0/1     Pending             0          0s
basic-session-deployment-only-example-588474bf97-nng85            0/1     ContainerCreating   0          0s
basic-session-deployment-only-example-taskmanager-77854fbb5ddxv   0/1     Pending             0          0s
basic-session-deployment-only-example-taskmanager-77854fbb5ddxv   0/1     Pending             0          0s
basic-session-deployment-only-example-taskmanager-77854fbbrfgvz   0/1     Pending             0          0s
basic-session-deployment-only-example-taskmanager-77854fbb57v4t   0/1     Pending             0          0s
basic-session-deployment-only-example-taskmanager-77854fbbrfgvz   0/1     Pending             0          1s
basic-session-deployment-only-example-taskmanager-77854fbb57v4t   0/1     Pending             0          1s
basic-session-deployment-only-example-taskmanager-77854fbb5ddxv   0/1     ContainerCreating   0          1s
basic-session-deployment-only-example-taskmanager-77854fbbrfgvz   0/1     ContainerCreating   0          1s
basic-session-deployment-only-example-taskmanager-77854fbb57v4t   0/1     ContainerCreating   0          1s
basic-session-deployment-only-example-588474bf97-nng85            0/1     ContainerCreating   0          1s
basic-sess