[
https://issues.apache.org/jira/browse/FLINK-27127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Abdullah alkhawatrah updated FLINK-27127:
-----------------------------------------
Summary: Local recovery is not triggered on task manager process restart
(was: Local recovery on task manager process restart)
> Local recovery is not triggered on task manager process restart
> ---------------------------------------------------------------
>
> Key: FLINK-27127
> URL: https://issues.apache.org/jira/browse/FLINK-27127
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.15.0
> Reporter: Abdullah alkhawatrah
> Priority: Blocker
>
> Hey,
> I am experimenting with the support of local recovery after process restart
> introduced in 1.15. I am trying this on minikube.
> So far, it seems that every time a pod restarts, remote recovery is triggered.
> I have created a repo with everything needed to test it locally with
> minikube: [https://github.com/akhawatrahTW/flink-local-recovery-test.]
> The readme contains the steps to reproduce.
>
> Based on the documentation, I was expecting to have local recovery triggered
> on pod restarts since the needed configs are set:
> [https://github.com/akhawatrahTW/flink-local-recovery-test/blob/bfef14e45f475ba953a05b50b8829d9d33bdcec6/k8s/flink-configuration-configmap.yaml#L27.]
> So was expecting to see something similar to this in the logs of the
> recreated task manager pod:
> *Expected:*
> {code:java}
> 2022-04-07 09:17:17,637 INFO
> org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation
> [] - Starting to restore from state handle:
> IncrementalLocalKeyedStateHandle{metaDataState=File State:
> file:/pv/tm_flink-taskmanager-2/localState/aid_e56a834e076a6d8f9dc1a2997e97a91a/jid_f88542b420546fadbc94db66b00cb5a0/vtx_20ba6b65f97481d5570070de90e4e791_sti_2/chk_1208/c2756339-8938-4949-84ff-d7ee3f4c55cf
> [479 bytes]}
> DirectoryKeyedStateHandle{directoryStateHandle=DirectoryStateHandle{directory=/pv/tm_flink-taskmanager-2/localState/aid_e56a834e076a6d8f9dc1a2997e97a91a/jid_f88542b420546fadbc94db66b00cb5a0/vtx_20ba6b65f97481d5570070de90e4e791_sti_2/chk_1208/5455302ce9554a1f81365aee368f267e},
> keyGroupRange=KeyGroupRange{startKeyGroup=86, endKeyGroup=127}} without
> rescaling.{code}
>
>
> But for some reason, remote recovery it triggered:
> *Actual:*
> {code:java}
> 2022-04-07 09:17:18,405 INFO
> org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation
> [] - Finished restoring from state handle:
> IncrementalRemoteKeyedStateHandle{backendIdentifier=544f3300-36bd-40a6-9ee3-f78b0e47dfd6,
> stateHandleId=c2753d01-2f6b-49f0-9ca1-df6b54c61490,
> keyGroupRange=KeyGroupRange{startKeyGroup=0, endKeyGroup=42},
> checkpointId=1208,
> sharedState={001526.sst=ByteStreamStateHandle{handleName='f5a113d0-8094-40e7-a1b1-adc4cfc690c2',
> dataBytes=23107},
> 001527.sst=ByteStreamStateHandle{handleName='3806411e-8213-406a-bbd8-e498ab19d118',
> dataBytes=15579},
> 001528.sst=ByteStreamStateHandle{handleName='4fef6ead-1522-4f61-a6ad-399b334b41ca',
> dataBytes=15839},
> 001529.sst=ByteStreamStateHandle{handleName='f1324a0c-3eae-46b0-acc2-c03d32b0c24a',
> dataBytes=16055}},
> privateState={OPTIONS-001237=ByteStreamStateHandle{handleName='2e36d07b-5f91-4c9d-9778-5a16bb6254d5',
> dataBytes=9924},
> MANIFEST-001234=ByteStreamStateHandle{handleName='4c95b38a-4afa-4154-9c89-9518d6384a25',
> dataBytes=27356},
> CURRENT=ByteStreamStateHandle{handleName='17bd5bab-c369-470a-bf29-e76279cef2ba',
> dataBytes=16}},
> metaStateHandle=ByteStreamStateHandle{handleName='15827f44-0ab2-4562-b8eb-812b8d260206',
> dataBytes=479}, registered=false} without rescaling.{code}
>
>
--
This message was sent by Atlassian Jira
(v8.20.1#820001)