Hi everyone,

I would like to start a discussion on FLINK-38290 [1]: "Application cluster: FINISHED FlinkDeployment falls back to RECONCILING if JM pod is lost/recreated".

Currently, after a job has finished, if the JobManager pod restarts (for example because a spot node goes down), the FlinkDeployment gets stuck in RECONCILING even though the Flink job is known to be in a terminal state.

While investigating, I found FLINK-38845 [2], which added ArchivedApplicationStore to store completed job information. The problem is that it stores its data under /tmp (via the io.tmp.dirs config). This is ephemeral, pod-local storage that gets wiped on pod restart, so the completed job information is lost when the JM pod is recreated.

Also, in the operator code, onTargetNotFound does not check whether the job has already completed; it moves directly from any state to RECONCILING.

Proposed solutions:

1. Add a persistent storage option to ArchivedApplicationStore so that it reads from S3/GCS/HDFS etc. instead of the /tmp dirs, to ensure proper recovery. This would be behind a config option, of course.
2. As an additional fix, check whether the job has terminated before moving it to RECONCILING.

For more details, please refer to the JIRA tickets [1][2].

Looking forward to your feedback and thoughts!

References:
[1] FLINK-38290: https://issues.apache.org/jira/browse/FLINK-38290
[2] FLINK-38845: https://issues.apache.org/jira/browse/FLINK-38845

Best regards,
Royston E Tauro
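
P.S. To make the second proposal a bit more concrete, a minimal sketch of the guard might look like the following. This is not the operator's actual code: JobState, TERMINAL_STATES, and the onTargetNotFound signature shown here are hypothetical stand-ins for illustration, and treating FINISHED, FAILED, and CANCELED as the terminal states is an assumption.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical, self-contained sketch of the "check before RECONCILING" idea.
// None of these names are taken from the operator code base.
class TerminalStateGuard {

    // Simplified stand-in for the job states the operator tracks.
    enum JobState { RUNNING, FINISHED, FAILED, CANCELED, RECONCILING }

    // Assumption: these are the states from which a job never resumes.
    static final Set<JobState> TERMINAL_STATES =
            EnumSet.of(JobState.FINISHED, JobState.FAILED, JobState.CANCELED);

    // Called when the JM deployment cannot be found. If the last observed
    // job state was already terminal, keep it; otherwise fall back to
    // RECONCILING, as happens today for every state.
    static JobState onTargetNotFound(JobState lastObserved) {
        if (TERMINAL_STATES.contains(lastObserved)) {
            return lastObserved;
        }
        return JobState.RECONCILING;
    }

    public static void main(String[] args) {
        System.out.println(onTargetNotFound(JobState.FINISHED));
        System.out.println(onTargetNotFound(JobState.RUNNING));
    }
}
```

With a guard like this, a FINISHED deployment would stay FINISHED when the JM pod is lost, while a RUNNING one would still move to RECONCILING. It only helps if the operator has already recorded the terminal state, which is why the persistent storage option is still worth discussing separately.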
