[
https://issues.apache.org/jira/browse/FLINK-38290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18051145#comment-18051145
]
Royston Tauro commented on FLINK-38290:
---------------------------------------
I have started a mail thread with the proposed fix details; reposting it here.
After exploring, I found that FLINK-38845 added an ArchivedApplicationStore to
store completed job information. The problem is that it stores its data under
/tmp (via the io.tmp.dirs config), which is ephemeral pod-local storage that
gets wiped on pod restart, so completed job information is lost when the JM pod
is recreated.
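For reference, this is roughly the configuration involved today (the value shown
is illustrative; io.tmp.dirs falls back to the JVM's java.io.tmpdir, which is
typically /tmp inside a container):
{code:yaml}
# Pod-local scratch space; its contents do not survive a JM pod recreation.
io.tmp.dirs: /tmp
{code}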
Also, in the operator code, onTargetNotFound does not check whether the job has
already completed; it moves the deployment from any state straight to
RECONCILING.
Proposed solutions:
Add a persistent storage option to ArchivedApplicationStore so that it reads
from S3/GCS/HDFS etc. instead of the io.tmp.dirs location, ensuring proper
recovery across pod restarts. This would sit behind a config option, of course;
a sketch of what that could look like follows.
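Purely as an illustration, mirroring the existing jobmanager.archive.fs.dir
pattern used for the HistoryServer; these option names are hypothetical and
would be settled during review:
{code:yaml}
# Hypothetical options, names TBD.
jobmanager.archived-application-store.persistent: true
jobmanager.archived-application-store.dir: s3://my-bucket/flink/completed-jobs
{code}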
An additional fix could be made in the operator: check whether the job has
already reached a globally terminal state before moving the deployment to
RECONCILING. A minimal sketch of that guard follows.
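This is only a sketch; the class and method names are illustrative, not the
operator's actual API. The only real dependency is Flink's JobStatus enum:
{code:java}
import org.apache.flink.api.common.JobStatus;

// Hypothetical guard for the operator's "target job not found" path.
public final class TerminalStateGuard {

    // Returns true if the deployment should keep its current status instead of
    // being reset to RECONCILING: the job already reached a globally terminal
    // state (e.g. FINISHED, FAILED, CANCELED), so the job being absent from a
    // freshly restarted JM is expected rather than an error.
    public static boolean shouldKeepTerminalStatus(JobStatus lastObservedState) {
        return lastObservedState != null && lastObservedState.isGloballyTerminalState();
    }
}
{code}
In onTargetNotFound, this check would run before the state transition; only when
it returns false would the deployment be moved back to RECONCILING.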
Looking forward to your feedback and thoughts!
If the proposed solution is agreed upon, I would be happy to contribute the fix.
Please feel free to assign the ticket to me.
> Application cluster: FINISHED FlinkDeployment falls back to RECONCILING if JM
> pod is lost/recreated
> ---------------------------------------------------------------------------------------------------
>
> Key: FLINK-38290
> URL: https://issues.apache.org/jira/browse/FLINK-38290
> Project: Flink
> Issue Type: Bug
> Components: Client / Job Submission, Deployment / Kubernetes
> Affects Versions: 1.20.1, kubernetes-operator-1.12.1
> Reporter: Urs Schoenenberger
> Priority: Major
>
> Hi folks,
> we are encountering the following issue, and I believe it's a bug.
> One-line Summary: In an ApplicationCluster, the Operator queries JobManager
> REST API for job status. This API does not have information about FINISHED
> jobs if the JM leader changed / JM restarted. This leads to the Job being
> reset to RECONCILING where it gets stuck.
> Steps to reproduce:
> * Deploy the example FlinkDeployment (
> [https://raw.githubusercontent.com/apache/flink-kubernetes-operator/release-1.12/examples/basic.yaml]
> ) with a bounded job (e.g. examples/streaming/WordCount.jar) and configure
> high-availability.type: "kubernetes" and a high-availability.storageDir (see
> the config sketch after this list).
> * Wait for the FlinkDeployment to reach FINISHED.
> * Kill the JobManager pod. (The way this happens in production use cases is
> e.g. if a node is tainted and scheduled for deletion due to being underused /
> a spot instance goes down / etc).
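> For illustration, the HA settings can be added to the example deployment's
> flinkConfiguration like this (the storageDir value is a placeholder):
> {code:yaml}
> spec:
>   flinkConfiguration:
>     high-availability.type: "kubernetes"
>     high-availability.storageDir: s3://my-bucket/flink/ha
> {code}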
> Observed behaviour:
> * A new JobManager is started.
> * The new pod checks the HA dir and realizes that the job is already
> completed. Log from the StandaloneDispatcher: "Ignoring JobGraph submission
> (...) because the job already reached a globally-terminal state (...)".
> * The operator tries to reconcile the job. In JobStatusObserver, it queries
> the JobManager's REST API (/jobs/overview), but it receives a "not found".
> ** This is because the backend here does not check the HA store, but the
> JobStore instead. This is backed by RAM or a local file, so it is not
> recovered on JM restart.
> * This leads the k8s operator to believe something is wrong with the
> FlinkDeployment, and the FlinkDeployment goes back to state RECONCILING and
> gets stuck there.
>
> This messes with monitoring and alerting among other things.
> We are aware of the HistoryServer and have configured it, but since the
> Operator only checks the JM API, this does not resolve the problem. Could we
> make the JM expose the HA store with finished job information for this
> purpose?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)