[ 
https://issues.apache.org/jira/browse/FLINK-24621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441118#comment-17441118
 ] 

Dawid Wysakowicz commented on FLINK-24621:
------------------------------------------

I believe the user tries to do a Flink upgrade through HA services (e.g. 
zookeeper). Meaning bringing the 1.13.1 cluster down and starting a 1.13.2/3 
cluster with the intention to pick up the job automatically along with the 
latest checkpoint. AFAICT it is not something we support.

It would explain the exception as we use java serialization to put 
{{StateObject(s)}} into the {{CheckpointStore}}. I could reproduce the 
exception in that scenario. Restoring from a checkpoint works as expected, 
because we use {{MetadataSerializer}} to store and restore contents of 
{{StateObjects(s)}} there.

> JobManager fails to recover 1.13.1 checkpoint due to 
> InflightDataRescalingDescriptor
> ------------------------------------------------------------------------------------
>
>                 Key: FLINK-24621
>                 URL: https://issues.apache.org/jira/browse/FLINK-24621
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.13.2, 1.13.3
>            Reporter: Chesnay Schepler
>            Priority: Blocker
>             Fix For: 1.13.4
>
>
> A user reporter on the mailing list of a JM that is unable to read a 1.13.1 
> checkpoint.
> https://lists.apache.org/thread/wnxfpfhr5gkmovjctf7bdf8xmf7qmwlb
> The big question is why the InflightDataRescalingDescriptor is creating 
> problems, because it should not actually be contained in a checkpoint.
> {code}
> Caused by: org.apache.flink.util.FlinkException: Could not retrieve 
> checkpoint 2844 from state handle under checkpointID-0000000000000002844. 
> This indicates that the retrieved state handle is broken. Try cleaning the 
> state handle store.
> at 
> org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore.retrieveCompletedCheckpoint(DefaultCompletedCheckpointStore.java:309)
>  ~[flink-dist_2.12-1.13.3.jar:1.13.3]
> at 
> org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore.recover(DefaultCompletedCheckpointStore.java:151)
>  ~[flink-dist_2.12-1.13.3.jar:1.13.3]
> at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreLatestCheckpointedStateInternal(CheckpointCoordinator.java:1513)
>  ~[flink-dist_2.12-1.13.3.jar:1.13.3]
> at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreInitialCheckpointIfPresent(CheckpointCoordinator.java:1476)
>  ~[flink-dist_2.12-1.13.3.jar:1.13.3]
> at 
> org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.createAndRestoreExecutionGraph(DefaultExecutionGraphFactory.java:134)
>  ~[flink-dist_2.12-1.13.3.jar:1.13.3]
> at 
> org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:342)
>  ~[flink-dist_2.12-1.13.3.jar:1.13.3]
> at 
> org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:190)
>  ~[flink-dist_2.12-1.13.3.jar:1.13.3]
> at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:122)
>  ~[flink-dist_2.12-1.13.3.jar:1.13.3]
> at 
> org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:132)
>  ~[flink-dist_2.12-1.13.3.jar:1.13.3]
> at 
> org.apache.flink.runtime.jobmaster.DefaultSlotPoolServiceSchedulerFactory.createScheduler(DefaultSlotPoolServiceSchedulerFactory.java:110)
>  ~[flink-dist_2.12-1.13.3.jar:1.13.3]
> at 
> org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:340)
>  ~[flink-dist_2.12-1.13.3.jar:1.13.3]
> at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:317) 
> ~[flink-dist_2.12-1.13.3.jar:1.13.3]
> at 
> org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.internalCreateJobMasterService(DefaultJobMasterServiceFactory.java:107)
>  ~[flink-dist_2.12-1.13.3.jar:1.13.3]
> at 
> org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.lambda$createJobMasterService$0(DefaultJobMasterServiceFactory.java:95)
>  ~[flink-dist_2.12-1.13.3.jar:1.13.3]
> at 
> org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:112)
>  ~[flink-dist_2.12-1.13.3.jar:1.13.3]
> at java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source) 
> ~[?:?]
> at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[?:?]
> at java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown
>  Source) ~[?:?]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
> at java.lang.Thread.run(Unknown Source) ~[?:?]
> Caused by: java.io.InvalidClassException: 
> org.apache.flink.runtime.checkpoint.InflightDataRescalingDescriptor$NoRescalingDescriptor;
>  local class incompatible: stream classdesc serialVersionUID = 
> -5544173933105855751, local class serialVersionUID = 1
> at java.io.ObjectStreamClass.initNonProxy(Unknown Source) ~[?:?]
> at java.io.ObjectInputStream.readNonProxyDesc(Unknown Source) ~[?:?]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to