[jira] [Commented] (FLINK-23874) JM did not store latest checkpoint id into Zookeeper, silently

Youjun Yuan (Jira) Wed, 08 Sep 2021 01:53:05 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411798#comment-17411798
 ]


Youjun Yuan commented on FLINK-23874:
-------------------------------------

Thanks for the response.

However, I still don't understand why the checkpoint ID in ZK wasn't updated for days.

The job was originally resumed from chk 258 on 14th Aug. We then tried to stop 
the job on 18th Aug, which caused the JM to restart. By then the content in ZK 
should have been updated to ~chk 686, but when I checked ZK on 19th Aug (when 
we realized the issue), it was still chk 258.
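
For anyone who wants to confirm the same gap from their own logs, here is a minimal Python sketch. It only parses the two log line shapes quoted in this issue (the "Completed checkpoint" line and the "Finished restoring from state handle" line); the helper names are hypothetical, the sample lines are shortened copies of the quoted logs, and it does not read ZooKeeper itself.

```python
import re

# Patterns for the two log line shapes quoted in this issue.
COMPLETED_RE = re.compile(
    r"Completed\s+checkpoint\s+(\d+)\s+for\s+job\s+([0-9a-f]{32})"
)
RESTORED_RE = re.compile(r"checkpointId=(\d+)")


def latest_completed_checkpoint(log_text):
    """Return the highest checkpoint id reported as completed, or None."""
    ids = [int(m.group(1)) for m in COMPLETED_RE.finditer(log_text)]
    return max(ids) if ids else None


def restored_checkpoint_ids(log_text):
    """Return the set of checkpoint ids found in 'Finished restoring' lines."""
    ids = set()
    for line in log_text.splitlines():
        if "Finished restoring from state handle" in line:
            ids.update(int(m.group(1)) for m in RESTORED_RE.finditer(line))
    return ids


def checkpoint_gap(log_before_restart, log_after_restart):
    """Return how many checkpoint ids lie between the last completed
    checkpoint and the oldest checkpoint that was restored, or None if
    either side is missing from the logs."""
    latest = latest_completed_checkpoint(log_before_restart)
    restored = restored_checkpoint_ids(log_after_restart)
    if latest is None or not restored:
        return None
    return latest - min(restored)


# Sample lines taken from the logs quoted in this issue. The restore
# line is truncated after sharedState for brevity.
JM_LINE = (
    "2021-08-18 11:09:16,505 INFO "
    "org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - "
    "Completed checkpoint 686 for job 00000000000000000000000000000000 "
    "(228296 bytes in 827 ms)."
)
RESTORE_LINE = (
    "2021-08-18 11:10:13,066 INFO "
    "org.apache.flink.contrib.streaming.state.restore."
    "RocksDBIncrementalRestoreOperation [] - "
    "Finished restoring from state handle: "
    "IncrementalRemoteKeyedStateHandle{backendIdentifier="
    "c11d290c-617b-4ea5-b7ed-4853272f32a3, "
    "keyGroupRange=KeyGroupRange{startKeyGroup=47, endKeyGroup=48}, "
    "checkpointId=258, sharedState={}"
)

print(checkpoint_gap(JM_LINE, RESTORE_LINE))  # 428
```

A non-zero gap right after a JM restart is what we saw here: the last completed checkpoint was 686, but the state was restored from 258.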

> JM did not store latest checkpoint id into Zookeeper, silently
> --------------------------------------------------------------
>
>                 Key: FLINK-23874
>                 URL: https://issues.apache.org/jira/browse/FLINK-23874
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.12.1
>            Reporter: Youjun Yuan
>            Priority: Major
>         Attachments: container_e04_1628083845581_0254_01_000001_jm.log, 
> container_e04_1628083845581_0254_01_000050_tm.log, 
> container_e04_1628083845581_0254_02_000001_jm.log
>
>
> The job manager did not write the latest successful checkpoint ID to ZooKeeper 
> (with ZK HA setup) at path /flink/{app_id}/checkpoints/, so when the JM 
> restarted, the job resumed from a very old position.
>  
> We had a job that was resumed from savepoint 258. After it had run for a few 
> days, the latest successful checkpoint was about chk 686. When something 
> triggered a JM restart, it restored state from savepoint 258 instead of 
> chk 686.
> We checked ZooKeeper, and the stored checkpoint was indeed still 258, which 
> means the JM had not stored a checkpoint ID in ZooKeeper for a few days, and 
> without any error message.
>  
> Below are the relevant logs around the restart:
> {quote}
> 2021-08-18 11:09:16,505 INFO 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed 
> checkpoint 686 for job 00000000000000000000000000000000 (228296 bytes in 827 
> ms).
> {quote}
>  
> {quote}
> 2021-08-18 11:10:13,066 INFO 
> org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation
>  [] - Finished restoring from state handle: 
> IncrementalRemoteKeyedStateHandle{backendIdentifier=c11d290c-617b-4ea5-b7ed-4853272f32a3,
>  keyGroupRange=KeyGroupRange{startKeyGroup=47, endKeyGroup=48}, 
> checkpointId=258, sharedState={}, 
> privateState={OPTIONS-000016=ByteStreamStateHandle{handleName='s3://dp-flink/prd/checkpoints/3beb4dd5-9008-4fd0-9910-f564759b466a/1628912020683/00000000000000000000000000000000/shared/20f35556-5c60-4fca-908c-d05d641c2614',
>  dataBytes=15818}, 
> MANIFEST-000006=ByteStreamStateHandle{handleName='s3://dp-flink/prd/checkpoints/3beb4dd5-9008-4fd0-9910-f564759b466a/1628912020683/00000000000000000000000000000000/shared/3c8e1c2f-616d-4f18-8b07-4a818e3ca110',
>  dataBytes=336}, 
> CURRENT=ByteStreamStateHandle{handleName='s3://dp-flink/prd/checkpoints/3beb4dd5-9008-4fd0-9910-f564759b466a/1628912020683/00000000000000000000000000000000/shared/1dc5f341-8a73-4e69-96fb-4b026653da6d',
>  dataBytes=16}}, 
> metaStateHandle=ByteStreamStateHandle{handleName='s3://dp-flink/prd/checkpoints/3beb4dd5-9008-4fd0-9910-f564759b466a/1628912020683/00000000000000000000000000000000/chk-258/0ef57eb3-0f38-45f5-8f3d-3e7b87f5fd15',
>  dataBytes=1704}, registered=false} without rescaling.
> {quote}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
