[ https://issues.apache.org/jira/browse/FLINK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411809#comment-17411809 ]

Piotr Nowojski edited comment on FLINK-23874 at 9/8/21, 9:02 AM:
-----------------------------------------------------------------

The ZK was updated with newer checkpoints; otherwise those checkpoints would have 
failed and we would see errors in the logs. But because you were cancelling the 
job, ZK was cleaned up:
{noformat}
2021-08-18 11:09:16,673 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter [] - Removing /checkpoint-counter/00000000000000000000000000000000 from ZooKeeper
{noformat}
shortly before the failure happened. This is the ordering problem of ZK/YARN 
cleanup that I quoted above and what Till had in mind in his response. It causes 
the problem that if a failure happens while shutting down the job/cleaning up 
YARN, but after ZK has already been cleaned up, your job restarts either with an 
empty state or, in your case, from the savepoint with which you originally 
started the job.
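
To make the failure window concrete, here is a minimal, purely illustrative 
sketch (hypothetical method names, not Flink's actual cancellation code) of the 
ordering described above:
{code:java}
// Illustrative sketch only (hypothetical names, not Flink's shutdown code):
// the failure window described in the comment above.
public class CleanupOrderingSketch {

    public static void main(String[] args) {
        try {
            // (1) HA metadata is removed first because the job is being cancelled,
            //     matching the "Removing /checkpoint-counter/..." log line above.
            removeCheckpointMetadataFromZooKeeper();

            // (2) The YARN application is torn down afterwards; a failure in this
            //     window leaves no checkpoint metadata behind ...
            deregisterYarnApplication();
        } catch (Exception e) {
            // ... so when YARN restarts the application master, recovery finds an
            // empty checkpoint store and falls back to the original savepoint
            // (258 in this report) or to an empty state.
            System.err.println("Failed after ZK cleanup, before YARN teardown: " + e.getMessage());
        }
    }

    private static void removeCheckpointMetadataFromZooKeeper() {
        System.out.println("ZK: removing /checkpoint-counter/<job_id> and /checkpoints/<job_id>");
    }

    private static void deregisterYarnApplication() throws Exception {
        throw new Exception("simulated failure while cleaning up the YARN application");
    }
}
{code}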


> JM did not store latest checkpoint id into Zookeeper, silently
> --------------------------------------------------------------
>
>                 Key: FLINK-23874
>                 URL: https://issues.apache.org/jira/browse/FLINK-23874
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.12.1
>            Reporter: Youjun Yuan
>            Priority: Major
>         Attachments: container_e04_1628083845581_0254_01_000001_jm.log, 
> container_e04_1628083845581_0254_01_000050_tm.log, 
> container_e04_1628083845581_0254_02_000001_jm.log
>
>
> Job manager did not update the latest successful checkpoint id in ZooKeeper 
> (with ZK HA setup), at path /flink/\{app_id}/checkpoints/, so when the JM 
> restarted, the job resumed from a very old position.
>  
> We had a job that was resumed from savepoint 258; after running for a few 
> days, the latest successful checkpoint was about chk 686. When something 
> triggered the JM to restart, it restored state to savepoint 258 instead of 
> chk 686.
> We checked ZooKeeper, and indeed the stored checkpoint was still 258, which 
> means the JM had not stored a checkpoint id in ZooKeeper for a few days, 
> without any error message (see the inspection sketch after the quoted logs 
> below).
>  
> Below are the relevant logs around the restart:
> {quote}
> {{2021-08-18 11:09:16,505 INFO 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed 
> checkpoint 686 for job 00000000000000000000000000000000 (228296 bytes in 827 
> ms).}}
> {quote}
>  
> {quote}2021-08-18 11:10:13,066 INFO 
> org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation
>  [] - Finished restoring from state handle: 
> IncrementalRemoteKeyedStateHandle\{backendIdentifier=c11d290c-617b-4ea5-b7ed-4853272f32a3,
>  keyGroupRange=KeyGroupRange{startKeyGroup=47, endKeyGroup=48}, 
> checkpointId=258, sharedState={}, 
> privateState=\{OPTIONS-000016=ByteStreamStateHandle{handleName='s3://dp-flink/prd/checkpoints/3beb4dd5-9008-4fd0-9910-f564759b466a/1628912020683/00000000000000000000000000000000/shared/20f35556-5c60-4fca-908c-d05d641c2614',
>  dataBytes=15818}, 
> MANIFEST-000006=ByteStreamStateHandle\{handleName='s3://dp-flink/prd/checkpoints/3beb4dd5-9008-4fd0-9910-f564759b466a/1628912020683/00000000000000000000000000000000/shared/3c8e1c2f-616d-4f18-8b07-4a818e3ca110',
>  dataBytes=336}, 
> CURRENT=ByteStreamStateHandle\{handleName='s3://dp-flink/prd/checkpoints/3beb4dd5-9008-4fd0-9910-f564759b466a/1628912020683/00000000000000000000000000000000/shared/1dc5f341-8a73-4e69-96fb-4b026653da6d',
>  dataBytes=16}}, 
> metaStateHandle=ByteStreamStateHandle\{handleName='s3://dp-flink/prd/checkpoints/3beb4dd5-9008-4fd0-9910-f564759b466a/1628912020683/00000000000000000000000000000000/chk-258/0ef57eb3-0f38-45f5-8f3d-3e7b87f5fd15',
>  dataBytes=1704}, registered=false} without rescaling.
> {quote}
>  
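
For reference, here is a minimal sketch of how the stored checkpoints could be 
inspected in ZooKeeper, as the reporter did above. The quorum address and znode 
path are assumptions for illustration; adjust them to the cluster's 
high-availability.zookeeper.quorum and high-availability.zookeeper.path.root 
settings:
{code:java}
// Hypothetical inspection sketch using Apache Curator (which Flink's ZK HA uses).
// The connect string and znode path below are assumptions, not taken from the report.
import java.util.List;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class InspectFlinkHaZnodes {

    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk-host:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();
        try {
            // Children of this znode are the completed-checkpoint handles the
            // JobManager would use on recovery; an entry stuck at chk 258 would
            // match the behaviour described in this issue.
            String checkpointsPath =
                    "/flink/<app_id>/checkpoints/00000000000000000000000000000000";
            List<String> completed = client.getChildren().forPath(checkpointsPath);
            System.out.println("Completed checkpoints in ZK: " + completed);
        } finally {
            client.close();
        }
    }
}
{code}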



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
