[jira] [Created] (FLINK-35857) Operator restart failed job without latest checkpoint

chenyuzhi (Jira) Wed, 17 Jul 2024 00:09:17 -0700

chenyuzhi created FLINK-35857:
---------------------------------

             Summary: Operator restart failed job without latest checkpoint
                 Key: FLINK-35857
                 URL: https://issues.apache.org/jira/browse/FLINK-35857
             Project: Flink
          Issue Type: Bug
          Components: Kubernetes Operator
    Affects Versions: kubernetes-operator-1.6.1
         Environment: Flink Kubernetes operator version: 1.6.1

Flink version: 1.15.2

Flink job config:

*execution.shutdown-on-application-finish=false*
            Reporter: chenyuzhi
         Attachments: image-2024-07-17-15-03-29-618.png, 
image-2024-07-17-15-04-32-913.png

Using the Flink Kubernetes operator with the config:
{code:java}
kubernetes.operator.job.restart.failed=true
{code}
we get different restart results for a failed job in two cases.

Case 1:

A job with periodic checkpointing enabled and an initial checkpoint path: when
it fails, the operator automatically redeploys the deployment with the same
job_id and the latest checkpoint path.

 

!image-2024-07-17-15-03-29-618.png!

 

Case 2:

A job with periodic checkpointing enabled but no initial checkpoint: when it
fails, the operator automatically redeploys the deployment with a different
job_id and no initial checkpoint path.

!image-2024-07-17-15-04-32-913.png!

 

I think that in case 2 the redeploy behaviour may cause data inconsistency. For
example, the Kafka source connector may consume data from the earliest/latest
offset rather than resuming from the offsets stored in the latest checkpoint.

 

Thus I think a job with periodic checkpointing enabled but no initial
checkpoint should be restarted with the same job_id and the latest checkpoint
path, just like case 1.
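
To make the requested behaviour concrete, here is a minimal sketch of the restore-path choice I would expect on restart. It is only an illustration: the class name RestorePathSketch, the method chooseRestorePath and the example paths are hypothetical and are not taken from the operator's code base. The assumption is that the operator can see the latest completed checkpoint of the failed job, if one exists.

```java
import java.util.Optional;

// Hypothetical sketch of the expected behaviour; NOT the operator's actual code.
final class RestorePathSketch {

    /**
     * Chooses the path a restarted job should restore from.
     *
     * @param latestCheckpoint latest completed checkpoint of the failed job, if any
     * @param initialPath      checkpoint path the job was first deployed with, if any
     * @return the latest checkpoint when one exists, otherwise the initial path
     */
    static Optional<String> chooseRestorePath(
            Optional<String> latestCheckpoint, Optional<String> initialPath) {
        // A completed checkpoint always wins, whether or not an initial path was set.
        if (latestCheckpoint.isPresent()) {
            return latestCheckpoint;
        }
        // No checkpoint completed yet: fall back to the initial path (may be empty).
        return initialPath;
    }

    public static void main(String[] args) {
        // Example paths are made up for illustration only.
        Optional<String> latest = Optional.of("s3://example-bucket/checkpoints/chk-42");
        Optional<String> initial = Optional.of("s3://example-bucket/checkpoints/chk-1");

        // Case 1: initial path set, job has checkpointed since.
        System.out.println(chooseRestorePath(latest, initial).orElse("<none>"));
        // Case 2: no initial path, but the job has checkpointed.
        System.out.println(chooseRestorePath(latest, Optional.empty()).orElse("<none>"));
        // Job failed before its first checkpoint and had no initial path.
        System.out.println(chooseRestorePath(Optional.empty(), Optional.empty()).orElse("<none>"));
    }
}
```

With this rule, case 1 and case 2 both restore from the latest checkpoint. A job restarts without state only when it had no initial path and failed before completing any checkpoint.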



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
