chenyuzhi created FLINK-35857:
---------------------------------
Summary: Operator restart failed job without latest checkpoint
Key: FLINK-35857
URL: https://issues.apache.org/jira/browse/FLINK-35857
Project: Flink
Issue Type: Bug
Components: Kubernetes Operator
Affects Versions: kubernetes-operator-1.6.1
Environment: flink kubernetes operator version: 1.6.1
flink version 1.15.2
flink job config:
*execution.shutdown-on-application-finish=false*
Reporter: chenyuzhi
Attachments: image-2024-07-17-15-03-29-618.png,
image-2024-07-17-15-04-32-913.png
We are using the Flink Kubernetes Operator with the following config:
{code:java}
kubernetes.operator.job.restart.failed=true
{code}
We get different restart results for failed jobs in two cases.
Case 1:
A job with periodic checkpointing enabled and an initial checkpoint path. When it
fails, the operator automatically redeploys the deployment with the same job_id
and the latest checkpoint path.
!image-2024-07-17-15-03-29-618.png!
Case 2:
A job with periodic checkpointing enabled but no initial checkpoint. When it
fails, the operator automatically redeploys the deployment with a different
job_id and no initial checkpoint path.
!image-2024-07-17-15-04-32-913.png!
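For reference, the difference between the two cases above can be sketched as a Flink configuration fragment. This is only an illustration: the checkpoint interval, the checkpoint directory and the checkpoint path are placeholders rather than the values from the affected jobs, and it assumes the initial checkpoint is passed to the job via execution.savepoint.path.
{code:java}
# Common to both cases (taken from this report)
kubernetes.operator.job.restart.failed=true
execution.shutdown-on-application-finish=false

# Periodic checkpointing enabled in both cases
# (60s and the directory are placeholder values)
execution.checkpointing.interval=60s
state.checkpoints.dir=s3://<bucket>/checkpoints

# Case 1 only: the job starts from an initial checkpoint (placeholder path)
execution.savepoint.path=s3://<bucket>/checkpoints/<job-id>/chk-<n>

# Case 2: execution.savepoint.path is not set
{code}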
I think the redeploy behaviour in case 2 may cause data inconsistency. For
example, the Kafka source connector may consume data from the earliest/latest
offset instead of resuming from the offsets stored in the latest checkpoint.
Thus I think a job with periodic checkpointing enabled but no initial
checkpoint should be restarted with the same job_id and the latest checkpoint
path, just like case 1.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)