[
https://issues.apache.org/jira/browse/FLINK-35857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gyula Fora closed FLINK-35857.
------------------------------
Fix Version/s: kubernetes-operator-1.10.0
Assignee: chenyuzhi
Resolution: Fixed
merged to main 5296d63c6a948f598530f10304f7846fa5ee6a6a
> Operator restart failed job without latest checkpoint
> -----------------------------------------------------
>
> Key: FLINK-35857
> URL: https://issues.apache.org/jira/browse/FLINK-35857
> Project: Flink
> Issue Type: Bug
> Components: Kubernetes Operator
> Affects Versions: kubernetes-operator-1.6.1
> Environment: flink kubernetes operator version: 1.6.1
> flink version 1.15.2
> flink job config:
> *execution.shutdown-on-application-finish=false*
> Reporter: chenyuzhi
> Assignee: chenyuzhi
> Priority: Major
> Labels: pull-request-available
> Fix For: kubernetes-operator-1.10.0
>
> Attachments: image-2024-07-17-15-03-29-618.png,
> image-2024-07-17-15-04-32-913.png
>
>
> Using the Flink Kubernetes Operator with the config:
> {code:java}
> kubernetes.operator.job.restart.failed=true {code}
> We get different restart results for a failed job in two cases.
> Case1:
> A job with periodic checkpointing enabled and an initial checkpoint path. When
> it fails (with latestCompletedCheckpointId=19434), the operator automatically
> redeploys the deployment with the same job_id and uses the latest checkpoint
> path (CheckpointId=19434) as the initial checkpoint path.
>
> !image-2024-07-17-15-03-29-618.png|width=763,height=301!
>
> Case2:
> A job with periodic checkpointing enabled but no initial checkpoint. When it
> fails (with latestCompletedCheckpointId=30), the operator automatically
> redeploys the deployment with a different job_id and no initial checkpoint path.
> !image-2024-07-17-15-04-32-913.png|width=759,height=287!
>
> In Case2, the redeploy behaviour may cause data inconsistency. For example,
> the Kafka source connector may consume data from the earliest/latest offset
> instead of resuming from the offsets stored in the latest checkpoint.
>
> Thus I think a job with periodic checkpointing enabled but no initial
> checkpoint should be restarted with the same job_id and the latest checkpoint
> path, just like Case1.
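
The behaviour requested above can be sketched roughly as follows. This is only an illustration of the proposed rule, not the operator's actual code and not the merged fix: the class `RestartPlanSketch`, the method `planRestart`, and the example paths are all hypothetical names made up for this sketch. The idea is that the restart keeps the failed job's id and restores from the latest completed checkpoint whenever one exists, whether or not an initial checkpoint path was configured.

```java
import java.util.Optional;

/** Hypothetical sketch; none of these names come from the operator code base. */
final class RestartPlanSketch {

    /** Describes how the failed job would be resubmitted. */
    static final class RestartPlan {
        final String jobId;
        final Optional<String> restorePath;

        RestartPlan(String jobId, Optional<String> restorePath) {
            this.jobId = jobId;
            this.restorePath = restorePath;
        }
    }

    /**
     * Decide how to resubmit a failed job.
     *
     * @param failedJobId          id of the job that failed
     * @param initialPath          checkpoint path from the original deployment, if any
     * @param latestCheckpointPath latest completed checkpoint seen before the failure, if any
     */
    static RestartPlan planRestart(String failedJobId,
                                   Optional<String> initialPath,
                                   Optional<String> latestCheckpointPath) {
        // Prefer the newest completed checkpoint; fall back to the initial path.
        // Only when neither exists does the job start without state.
        Optional<String> restorePath =
                latestCheckpointPath.isPresent() ? latestCheckpointPath : initialPath;
        // Keep the same job id in every case, as requested for Case2.
        return new RestartPlan(failedJobId, restorePath);
    }
}
```

Under this sketch, Case2 (no initial path, latest completed checkpoint 30) would yield the same job id with the checkpoint-30 path as the restore path, which matches what already happens in Case1.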