[jira] [Commented] (FLINK-28843) Failed to restore from changelog checkpoint in claim mode

Lihe Ma (Jira) Sun, 07 Aug 2022 08:22:07 -0700


    [ https://issues.apache.org/jira/browse/FLINK-28843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17576403#comment-17576403 ]


Lihe Ma commented on FLINK-28843:
---------------------------------

I think the root cause is that IncrementalKeyedStateHandle is not handled
properly: during restore, only KeyGroupsStateHandle is converted to an absolute
path. Maybe we could fix this by converting IncrementalRemoteKeyedStateHandle
in the same way.
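The proposed fix can be sketched as a small, self-contained model. This is not Flink's code: `StateFile`, `KeyGroupsHandle`, `IncrementalHandle`, and `resolveOnRestore` are hypothetical stand-ins for `KeyGroupsStateHandle`, `IncrementalRemoteKeyedStateHandle`, and the restore-time path conversion. The idea shown is that every relative file reference, including those nested inside an incremental handle, is pinned to the directory of the checkpoint that wrote it while that directory is still known.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class RestorePathSketch {

    // Hypothetical file reference: either relative (a bare file name) or absolute.
    record StateFile(Path path, boolean relative) {
        // Pin a relative reference to the directory of the checkpoint that wrote it;
        // absolute references are returned unchanged.
        StateFile toAbsolute(Path owningCheckpointDir) {
            return relative ? new StateFile(owningCheckpointDir.resolve(path), false) : this;
        }
    }

    // Hypothetical stand-in for a handle that points at a single data file.
    record KeyGroupsHandle(StateFile file) {}

    // Hypothetical stand-in for an incremental handle that references several files.
    record IncrementalHandle(List<StateFile> files) {}

    // Already handled today (per the comment): the single file is made absolute.
    static KeyGroupsHandle resolveOnRestore(KeyGroupsHandle h, Path dir) {
        return new KeyGroupsHandle(h.file().toAbsolute(dir));
    }

    // The proposed extension: convert every file nested in the incremental handle too.
    static IncrementalHandle resolveOnRestore(IncrementalHandle h, Path dir) {
        List<StateFile> resolved = new ArrayList<>();
        for (StateFile f : h.files()) {
            resolved.add(f.toAbsolute(dir));
        }
        return new IncrementalHandle(resolved);
    }

    public static void main(String[] args) {
        Path chk1 = Paths.get("base-path1", "chk-1");
        IncrementalHandle restored = resolveOnRestore(
                new IncrementalHandle(List.of(new StateFile(Paths.get("file1"), true))), chk1);
        // The handle now carries the full path under base-path1/chk-1, so a later
        // checkpoint written under base-path2 can reference it without re-resolving.
        System.out.println(restored.files().get(0).path());
    }
}
```

Converting at restore time, rather than when the new checkpoint is written, keeps the fix local to the code path that already knows which checkpoint directory the handles came from.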

> Failed to restore from changelog checkpoint in claim mode
> ---------------------------------------------------------
>
>                 Key: FLINK-28843
>                 URL: https://issues.apache.org/jira/browse/FLINK-28843
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / State Backends
>    Affects Versions: 1.15.0, 1.15.1
>            Reporter: Lihe Ma
>            Priority: Critical
>
> # When native checkpointing and incremental checkpointing are enabled in the
> RocksDB state backend, state data larger than
> state.storage.fs.memory-threshold is stored in a separate data file
> (FileStateHandle, RelativeFileStateHandle, etc.), such as
> base-path1/chk-1/file1, rather than inline in the checkpoint metadata as a
> ByteStreamStateHandle.
> # Then restore the job from base-path1/chk-1 in claim mode using the changelog
> state backend, with the checkpoint directory set to base-path2. The new
> checkpoint is saved in base-path2/chk-2, and it still needs the previous
> checkpoint file (base-path1/chk-1/file1).
> # Then restore the job from base-path2/chk-2 with the changelog state backend.
> Flink tries to read base-path2/chk-2/file1 instead of the actual file
> location, base-path1/chk-1/file1, which leads to a FileNotFoundException and
> the job fails.
>  
> How to reproduce?
>  # Set state.storage.fs.memory-threshold to a small value, like '20b'.
> # Run
> {{org.apache.flink.test.checkpointing.ChangelogPeriodicMaterializationSwitchStateBackendITCase#testSwitchFromDisablingToEnablingInClaimMode}}.
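The path mix-up in step 3 can be illustrated on the local filesystem. This is a hypothetical sketch, not Flink's restore code: `readRelative` assumes a file reference is stored as a bare name and resolved against whichever checkpoint directory is being restored. With plain `java.nio.file`, a missing file surfaces as `NoSuchFileException` rather than the `FileNotFoundException` seen in the job, but the cause is the same.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

public class RelativePathMismatch {

    // Hypothetical: resolve a bare state-file name against a checkpoint directory and read it.
    static byte[] readRelative(Path checkpointDir, String fileName) throws IOException {
        return Files.readAllBytes(checkpointDir.resolve(fileName));
    }

    // Returns true if resolving the name against the newer checkpoint directory fails.
    static boolean reproduces(Path root) throws IOException {
        Path chk1 = Files.createDirectories(root.resolve("base-path1").resolve("chk-1"));
        Path chk2 = Files.createDirectories(root.resolve("base-path2").resolve("chk-2"));
        // Step 1: state above the memory threshold is written to a file under chk-1.
        Files.write(chk1.resolve("file1"), "state".getBytes(StandardCharsets.UTF_8));

        // Resolving against the directory that wrote the file succeeds.
        readRelative(chk1, "file1");

        // Step 3: resolving the same bare name against chk-2 points at a missing file.
        try {
            readRelative(chk2, "file1");
            return false;
        } catch (NoSuchFileException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        // Uses a fresh temporary directory; it is not cleaned up in this sketch.
        Path root = Files.createTempDirectory("flink-28843-sketch");
        System.out.println("mismatch reproduced: " + reproduces(root));
    }
}
```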



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
