[
https://issues.apache.org/jira/browse/FLINK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408904#comment-17408904
]
Piotr Nowojski edited comment on FLINK-23874 at 9/2/21, 2:48 PM:
-----------------------------------------------------------------
Thanks for reporting the issue.
1. Can you provide the JobManager and, potentially, the TaskManager logs?
2. Have you checked which checkpoint directories were written to S3?
[~trohrmann], [~yunta] have you seen something like this in the past? Is there
a way we can manually check what was stored in ZooKeeper?
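For the record, one way to inspect this manually: a minimal read-only sketch that dumps the completed-checkpoint znodes with the plain ZooKeeper client. The connect string and path below are placeholders, assuming the default ZK HA layout where completed checkpoints live under {{<root>/<cluster-id>/checkpoints/<job-id>}} (the reporter's path suggests root {{/flink}} and the app id as cluster id):
{code:java}
import java.util.List;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Read-only dump of the completed-checkpoint znodes.
// Connect string and path are placeholders; adjust them to your HA setup.
public class DumpCheckpointZNodes {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 30_000, event -> {});
        String path = "/flink/<app_id>/checkpoints/00000000000000000000000000000000";
        List<String> children = zk.getChildren(path, false);
        for (String child : children) {
            Stat stat = new Stat();
            byte[] data = zk.getData(path + "/" + child, false, stat);
            // The node name encodes the checkpoint id; the payload is a
            // serialized state handle pointing at the checkpoint metadata,
            // so the mtime shows when the store last wrote this node.
            System.out.printf("%s: %d bytes, mtime=%d%n", child, data.length, stat.getMtime());
        }
        zk.close();
    }
}
{code}
If the bug is what it looks like, the node for checkpoint 258 should have an mtime that is days old even though the job kept completing checkpoints.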
> JM did not store latest checkpoint id into ZooKeeper, silently
> --------------------------------------------------------------
>
> Key: FLINK-23874
> URL: https://issues.apache.org/jira/browse/FLINK-23874
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.12.1
> Reporter: Youjun Yuan
> Priority: Major
>
> The JobManager did not update the latest successful checkpoint id in ZooKeeper
> (with a ZK HA setup) at path /flink/{app_id}/checkpoints/, so when the JM
> restarted, the job resumed from a very old position.
>
> We had a job that was resumed from savepoint 258. After running for a few
> days, the latest successful checkpoint was around chk 686. When something
> triggered the JM to restart, it restored state from savepoint 258 instead of
> chk 686.
> We checked ZooKeeper, and indeed the stored checkpoint was still 258, which
> means the JM had not stored a checkpoint id into ZooKeeper for days, without
> any error message.
>
> Below are the relevant logs around the restart:
> {quote}
> {{2021-08-18 11:09:16,505 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
> checkpoint 686 for job 00000000000000000000000000000000 (228296 bytes in 827
> ms).}}
> {quote}
>
> {quote}2021-08-18 11:10:13,066 INFO
> org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation
> [] - Finished restoring from state handle:
> IncrementalRemoteKeyedStateHandle{backendIdentifier=c11d290c-617b-4ea5-b7ed-4853272f32a3,
> keyGroupRange=KeyGroupRange{startKeyGroup=47, endKeyGroup=48},
> checkpointId=258, sharedState={},
> privateState={OPTIONS-000016=ByteStreamStateHandle{handleName='s3://dp-flink/prd/checkpoints/3beb4dd5-9008-4fd0-9910-f564759b466a/1628912020683/00000000000000000000000000000000/shared/20f35556-5c60-4fca-908c-d05d641c2614',
> dataBytes=15818},
> MANIFEST-000006=ByteStreamStateHandle{handleName='s3://dp-flink/prd/checkpoints/3beb4dd5-9008-4fd0-9910-f564759b466a/1628912020683/00000000000000000000000000000000/shared/3c8e1c2f-616d-4f18-8b07-4a818e3ca110',
> dataBytes=336},
> CURRENT=ByteStreamStateHandle{handleName='s3://dp-flink/prd/checkpoints/3beb4dd5-9008-4fd0-9910-f564759b466a/1628912020683/00000000000000000000000000000000/shared/1dc5f341-8a73-4e69-96fb-4b026653da6d',
> dataBytes=16}},
> metaStateHandle=ByteStreamStateHandle{handleName='s3://dp-flink/prd/checkpoints/3beb4dd5-9008-4fd0-9910-f564759b466a/1628912020683/00000000000000000000000000000000/chk-258/0ef57eb3-0f38-45f5-8f3d-3e7b87f5fd15',
> dataBytes=1704}, registered=false} without rescaling.
> {quote}
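> One way to cross-check which checkpoints actually reached S3: a minimal
> sketch that lists the top-level "directories" under the checkpoint base path
> seen in the restore log above. This assumes the AWS SDK for Java v1 is on
> the classpath and credentials/region come from the default chain; result
> pagination is omitted for brevity:
> {code:java}
> import com.amazonaws.services.s3.AmazonS3;
> import com.amazonaws.services.s3.AmazonS3ClientBuilder;
> import com.amazonaws.services.s3.model.ListObjectsV2Request;
> import com.amazonaws.services.s3.model.ListObjectsV2Result;
>
> // Lists the checkpoint "directories" (chk-<id>/, shared/, taskowned/)
> // under the base path copied from the restore log.
> public class ListCheckpointDirs {
>     public static void main(String[] args) {
>         AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
>         ListObjectsV2Request req = new ListObjectsV2Request()
>                 .withBucketName("dp-flink")
>                 .withPrefix("prd/checkpoints/3beb4dd5-9008-4fd0-9910-f564759b466a/"
>                         + "1628912020683/00000000000000000000000000000000/")
>                 .withDelimiter("/");
>         ListObjectsV2Result result = s3.listObjectsV2(req);
>         // With a delimiter, each immediate sub-prefix appears exactly once.
>         result.getCommonPrefixes().forEach(System.out::println);
>     }
> }
> {code}
> Seeing which chk-* prefixes still exist would show how far checkpointing
> actually progressed on S3, independent of what ZooKeeper recorded.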
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)