uharaqo created FLINK-28206:
-------------------------------
Summary: EOFException on Checkpoint Recovery
Key: FLINK-28206
URL: https://issues.apache.org/jira/browse/FLINK-28206
Project: Flink
Issue Type: Bug
Components: Runtime / Checkpointing
Affects Versions: 1.14.4
Reporter: uharaqo
We have only one Job Manager in Kubernetes and it suddenly got killed without
any logs. A new Job Manager process could not recover from a checkpoint due to
an EOFException.
Task Managers killed themselves since they could not find any Job Manager.
There were no error logs other than that on the Task Manager side.
It looks to me that the checkpoint is corrupted. Is there a way to identify the
cause? What would you recommend us to do to mitigate this problem?
Here's the logs during the recovery phase. (Removed the stacktrace. Please find
that at the bottom.)
{noformat}
{"timestamp":"2022-06-22T17:21:25.870Z","level":"INFO","logger":"org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils","message":"Recovering
checkpoints from
KubernetesStateHandleStore{configMapName='univex-flink-record-collector-46071c6a64e47d1ce828dfe032f943a6-jobmanager-leader'}."}
{"timestamp":"2022-06-22T17:21:25.875Z","level":"INFO","logger":"org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils","message":"Found
1 checkpoints in
KubernetesStateHandleStore{configMapName='univex-flink-record-collector-46071c6a64e47d1ce828dfe032f943a6-jobmanager-leader'}."}
{"timestamp":"2022-06-22T17:21:25.876Z","level":"INFO","logger":"org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils","message":"Trying
to fetch 1 checkpoints from storage."}
{"timestamp":"2022-06-22T17:21:25.876Z","level":"INFO","logger":"org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils","message":"Trying
to retrieve checkpoint 58130."}
{"timestamp":"2022-06-22T17:21:25.901Z","level":"ERROR","logger":"org.apache.flink.runtime.entrypoint.ClusterEntrypoint","message":"Fatal
error occurred in the cluster
entrypoint.","level":"INFO","logger":"org.apache.flink.runtime.entrypoint.ClusterEntrypoint","message":"Shutting
StandaloneSessionClusterEntrypoint down with application status UNKNOWN.
Diagnostics Cluster entrypoint has been closed externally.."}
{"timestamp":"2022-06-22T17:21:25.921Z","level":"INFO","logger":"org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint","message":"Shutting
down rest endpoint."}
{"timestamp":"2022-06-22T17:21:25.922Z","level":"INFO","logger":"org.apache.flink.runtime.blob.BlobServer","message":"Stopped
BLOB server at 0.0.0.0:6124"}
{noformat}
The stacktrace of the ERROR:
{noformat}
org.apache.flink.util.FlinkException: JobMaster for job
46071c6a64e47d1ce828dfe032f943a6 failed.
at
org.apache.flink.runtime.dispatcher.Dispatcher.jobMasterFailed(Dispatcher.java:913)
at
org.apache.flink.runtime.dispatcher.Dispatcher.jobManagerRunnerFailed(Dispatcher.java:473)
at
org.apache.flink.runtime.dispatcher.Dispatcher.handleJobManagerRunnerResult(Dispatcher.java:450)
at
org.apache.flink.runtime.dispatcher.Dispatcher.lambda$runJob$3(Dispatcher.java:427)
at
java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
at
java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
at
java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRunAsync$4(AkkaRpcActor.java:455)
at
org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:455)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:213)
at
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:78)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at akka.actor.Actor.aroundReceive(Actor.scala:537)
at akka.actor.Actor.aroundReceive$(Actor.scala:535)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580)
at akka.actor.ActorCell.invoke(ActorCell.scala:548)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
at akka.dispatch.Mailbox.run(Mailbox.scala:231)
at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
at
java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
at
java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
at
java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
Caused by: org.apache.flink.runtime.client.JobInitializationException: Could
not start the JobMaster.
at
org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess.lambda$new$0(DefaultJobMasterServiceProcess.java:97)
at
java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
at
java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
at
java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
at
java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1705)
at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.CompletionException:
java.lang.RuntimeException:
org.apache.flink.runtime.client.JobExecutionException: Failed to initialize
high-availability completed checkpoint store
at
java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314)
at
java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319)
at
java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1702)
... 3 more
Caused by: java.lang.RuntimeException:
org.apache.flink.runtime.client.JobExecutionException: Failed to initialize
high-availability completed checkpoint store
at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:316)
at
org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:114)
at
java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
... 3 more
Caused by: org.apache.flink.runtime.client.JobExecutionException: Failed to
initialize high-availability completed checkpoint store
at
org.apache.flink.runtime.scheduler.SchedulerUtils.createCompletedCheckpointStoreIfCheckpointingIsEnabled(SchedulerUtils.java:57)
at
org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:180)
at
org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:140)
at
org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:134)
at
org.apache.flink.runtime.jobmaster.DefaultSlotPoolServiceSchedulerFactory.createScheduler(DefaultSlotPoolServiceSchedulerFactory.java:110)
at
org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:346)
at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:323)
at
org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.internalCreateJobMasterService(DefaultJobMasterServiceFactory.java:106)
at
org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.lambda$createJobMasterService$0(DefaultJobMasterServiceFactory.java:94)
at
org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:112)
... 4 more
Caused by: org.apache.flink.util.FlinkException: Could not retrieve checkpoint
58130 from state handle under checkpointID-0000000000000058130. This indicates
that the retrieved state handle is broken. Try cleaning the state handle store.
at
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils.retrieveCompletedCheckpoint(DefaultCompletedCheckpointStoreUtils.java:111)
at
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils.retrieveCompletedCheckpoints(DefaultCompletedCheckpointStoreUtils.java:89)
at
org.apache.flink.kubernetes.utils.KubernetesUtils.createCompletedCheckpointStore(KubernetesUtils.java:314)
at
org.apache.flink.kubernetes.highavailability.KubernetesCheckpointRecoveryFactory.createRecoveredCompletedCheckpointStore(KubernetesCheckpointRecoveryFactory.java:78)
at
org.apache.flink.runtime.scheduler.SchedulerUtils.createCompletedCheckpointStore(SchedulerUtils.java:91)
at
org.apache.flink.runtime.scheduler.SchedulerUtils.createCompletedCheckpointStoreIfCheckpointingIsEnabled(SchedulerUtils.java:54)
... 13 more
Caused by: java.io.EOFException
at
java.base/java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2872)
at
java.base/java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3367)
at
java.base/java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:936)
at java.base/java.io.ObjectInputStream.<init>(ObjectInputStream.java:379)
at
org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.<init>(InstantiationUtil.java:68)
at
org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:612)
at
org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:595)
at
org.apache.flink.runtime.state.RetrievableStreamStateHandle.retrieveState(RetrievableStreamStateHandle.java:59)
at
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils.retrieveCompletedCheckpoint(DefaultCompletedCheckpointStoreUtils.java:102)
... 18 more
{noformat}
--
This message was sent by Atlassian Jira
(v8.20.7#820007)