[ 
https://issues.apache.org/jira/browse/FLINK-24621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17441558#comment-17441558
 ] 

Dawid Wysakowicz commented on FLINK-24621:
------------------------------------------

I am decreasing priority, because as far as I can tell, it was most probably 
caused by a scenario such as:
# Start a 1.13.1 cluster with zookeeper HA
# Kill the 1.13.1 cluster
# Start a 1.13.2/3 cluster configured with zookeeper HA (with the same 
cluster_id)
# The 1.13.2/3 cluster tries to restore the previous job
Such a scenario is not something we support atm.

> JobManager fails to recover 1.13.1 checkpoint due to 
> InflightDataRescalingDescriptor
> ------------------------------------------------------------------------------------
>
>                 Key: FLINK-24621
>                 URL: https://issues.apache.org/jira/browse/FLINK-24621
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.13.2, 1.13.3
>            Reporter: Chesnay Schepler
>            Assignee: Dawid Wysakowicz
>            Priority: Blocker
>             Fix For: 1.13.4
>
>
> A user reporter on the mailing list of a JM that is unable to read a 1.13.1 
> checkpoint.
> https://lists.apache.org/thread/wnxfpfhr5gkmovjctf7bdf8xmf7qmwlb
> The big question is why the InflightDataRescalingDescriptor is creating 
> problems, because it should not actually be contained in a checkpoint.
> {code}
> Caused by: org.apache.flink.util.FlinkException: Could not retrieve 
> checkpoint 2844 from state handle under checkpointID-0000000000000002844. 
> This indicates that the retrieved state handle is broken. Try cleaning the 
> state handle store.
> at 
> org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore.retrieveCompletedCheckpoint(DefaultCompletedCheckpointStore.java:309)
>  ~[flink-dist_2.12-1.13.3.jar:1.13.3]
> at 
> org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore.recover(DefaultCompletedCheckpointStore.java:151)
>  ~[flink-dist_2.12-1.13.3.jar:1.13.3]
> at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreLatestCheckpointedStateInternal(CheckpointCoordinator.java:1513)
>  ~[flink-dist_2.12-1.13.3.jar:1.13.3]
> at 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreInitialCheckpointIfPresent(CheckpointCoordinator.java:1476)
>  ~[flink-dist_2.12-1.13.3.jar:1.13.3]
> at 
> org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.createAndRestoreExecutionGraph(DefaultExecutionGraphFactory.java:134)
>  ~[flink-dist_2.12-1.13.3.jar:1.13.3]
> at 
> org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:342)
>  ~[flink-dist_2.12-1.13.3.jar:1.13.3]
> at 
> org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:190)
>  ~[flink-dist_2.12-1.13.3.jar:1.13.3]
> at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:122)
>  ~[flink-dist_2.12-1.13.3.jar:1.13.3]
> at 
> org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:132)
>  ~[flink-dist_2.12-1.13.3.jar:1.13.3]
> at 
> org.apache.flink.runtime.jobmaster.DefaultSlotPoolServiceSchedulerFactory.createScheduler(DefaultSlotPoolServiceSchedulerFactory.java:110)
>  ~[flink-dist_2.12-1.13.3.jar:1.13.3]
> at 
> org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:340)
>  ~[flink-dist_2.12-1.13.3.jar:1.13.3]
> at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:317) 
> ~[flink-dist_2.12-1.13.3.jar:1.13.3]
> at 
> org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.internalCreateJobMasterService(DefaultJobMasterServiceFactory.java:107)
>  ~[flink-dist_2.12-1.13.3.jar:1.13.3]
> at 
> org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.lambda$createJobMasterService$0(DefaultJobMasterServiceFactory.java:95)
>  ~[flink-dist_2.12-1.13.3.jar:1.13.3]
> at 
> org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:112)
>  ~[flink-dist_2.12-1.13.3.jar:1.13.3]
> at java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source) 
> ~[?:?]
> at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[?:?]
> at java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown
>  Source) ~[?:?]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
> at java.lang.Thread.run(Unknown Source) ~[?:?]
> Caused by: java.io.InvalidClassException: 
> org.apache.flink.runtime.checkpoint.InflightDataRescalingDescriptor$NoRescalingDescriptor;
>  local class incompatible: stream classdesc serialVersionUID = 
> -5544173933105855751, local class serialVersionUID = 1
> at java.io.ObjectStreamClass.initNonProxy(Unknown Source) ~[?:?]
> at java.io.ObjectInputStream.readNonProxyDesc(Unknown Source) ~[?:?]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to