Hi Everyone,

I would like to start a discussion on FLINK-38290[1] - Application cluster:
FINISHED FlinkDeployment falls back to RECONCILING if JM pod is
lost/recreated.

Currently, if the JobManager pod restarts after a job has finished (for
example, because a spot node goes down), the FlinkDeployment gets stuck in
RECONCILING even though the job is already known to be in a terminal state.

While investigating, I found FLINK-38845[2], which added
ArchivedApplicationStore to store completed job information. The problem is
that it stores the data under /tmp (via the io.tmp.dirs config). This is
ephemeral pod-local storage that gets wiped on pod restart, so completed
job information is lost when the JM pod is recreated.

Also, in the operator code, onTargetNotFound does not check whether the job
has already completed; it moves directly from any state to RECONCILING.
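
To make the missing check concrete, here is a rough, self-contained sketch
of the transition logic. It is not the operator's actual code: the JobState
enum, the TERMINAL set, and the onTargetNotFound signature below are
simplified stand-ins that I am assuming for illustration only.

```java
import java.util.EnumSet;
import java.util.Set;

final class TerminalStateGuardSketch {

    // Hypothetical, simplified job states (stand-ins, not the operator's types).
    enum JobState { RUNNING, FAILING, FINISHED, FAILED, CANCELED, RECONCILING }

    // Assumption: these are the states after which no further transition is expected.
    static final Set<JobState> TERMINAL =
            EnumSet.of(JobState.FINISHED, JobState.FAILED, JobState.CANCELED);

    // Decide the next state when the JM deployment cannot be found.
    // Without the guard, every input would end up in RECONCILING.
    static JobState onTargetNotFound(JobState lastObserved) {
        if (TERMINAL.contains(lastObserved)) {
            // The job already completed; keep the terminal state.
            return lastObserved;
        }
        return JobState.RECONCILING;
    }

    public static void main(String[] args) {
        System.out.println(onTargetNotFound(JobState.FINISHED)); // FINISHED
        System.out.println(onTargetNotFound(JobState.RUNNING));  // RECONCILING
    }
}
```

The point of the sketch is only that the last observed state is consulted
before falling back to RECONCILING; where that state would come from in the
real operator is left open here.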

Proposed solutions:
1. Add a persistent storage option to ArchivedApplicationStore so that it
   reads from S3/GCS/HDFS etc. instead of /tmp dirs, ensuring proper
   recovery. This would be behind a config option, of course.
2. As an additional fix, check whether the job has already terminated before
   moving it to RECONCILING.
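
For the persistent storage option, the config gating could look roughly
like the sketch below. The option key
"example.archived-application-store.dir" and the helper resolveArchiveDir
are hypothetical names made up for illustration; only io.tmp.dirs comes from
the existing behaviour described above, and the bucket path is a
placeholder.

```java
import java.util.Map;

final class ArchiveDirSketch {

    // Hypothetical option key, NOT an existing Flink option.
    static final String DURABLE_DIR_KEY = "example.archived-application-store.dir";

    // Existing option mentioned above; today the store ends up under this location.
    static final String TMP_DIRS_KEY = "io.tmp.dirs";

    // Pick the durable location when configured, otherwise keep today's behaviour.
    static String resolveArchiveDir(Map<String, String> conf) {
        String durable = conf.get(DURABLE_DIR_KEY);
        if (durable != null && !durable.trim().isEmpty()) {
            return durable.trim();
        }
        // Fallback: pod-local temp storage, lost when the JM pod is recreated.
        return conf.getOrDefault(TMP_DIRS_KEY, System.getProperty("java.io.tmpdir"));
    }

    public static void main(String[] args) {
        // Default behaviour: only io.tmp.dirs is set.
        System.out.println(resolveArchiveDir(Map.of(TMP_DIRS_KEY, "/tmp")));

        // Opt-in behaviour: a durable location overrides the temp dir.
        System.out.println(resolveArchiveDir(Map.of(
                TMP_DIRS_KEY, "/tmp",
                DURABLE_DIR_KEY, "s3://example-bucket/flink/archived")));
    }
}
```

Keeping the temp directory as the fallback means existing deployments would
behave as they do today unless the new option is set explicitly.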

For more details, please refer to the JIRA tickets [1][2].

Looking forward to your feedback and thoughts!

References:

[1] FLINK-38290: https://issues.apache.org/jira/browse/FLINK-38290
[2] FLINK-38845: https://issues.apache.org/jira/browse/FLINK-38845

Best regards,
Royston E Tauro
