[https://issues.apache.org/jira/browse/FLINK-19816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17226460#comment-17226460]
Steven Zhen Wu commented on FLINK-19816:
----------------------------------------
This happened again for the same job in production. I noticed that both failures
started with a zookeeper failure. Here are some observations on the sequence of
events. My hypothesis is that a race condition between the recovery from the
zookeeper failure and the failure-rate restart-strategy caused this problem of
restoring a wrong and very old checkpoint. [~trohrmann]
1. Initially, there were some problems with zookeeper that caused the job to
fail:
{code}
2020-10-25 02:35:59,266 WARN
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn
[main-SendThread(ip-100-81-150-64.us-west-2.compute.internal:2181)] - Client
session timed out, have not heard from server in 26676
ms for sessionid 0x363850d0d034d9d
2020-10-25 02:35:59,268 INFO
org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn
[main-SendThread(ip-100-81-150-64.us-west-2.compute.internal:2181)] - Client
session timed out, have not heard from server in 26676
ms for sessionid 0x363850d0d034d9d, closing socket connection and attempting
reconnect
2020-10-25 02:35:59,282 INFO com.netflix.bdp.s3fs.BdpS3FileSystem
[cluster-io-thread-25] - Deleting path:
s3://us-west-2.spaas.prod/checkpoints/r7E1/clapp_avro-clapp_avro_nontvui/1593/233b4938179c06974e4535ac8a868675/chk-210758/16d4e138-4199-4fd2-a014-4b394189f72b
2020-10-25 02:35:59,369 INFO
org.apache.flink.shaded.curator4.org.apache.curator.framework.state.ConnectionStateManager
[main-EventThread] - State change: SUSPENDED
{code}
2. This job is configured with `restart-strategy=failure-rate`, and there were
enough task restarts to hit the terminal condition where canRestart() returns
false. This should eventually bring the Flink job to a halted/terminal state
(see the sketch after the log excerpt below).
{code}
2020-10-25 02:35:59,641 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph
[flink-akka.actor.default-dispatcher-81991] - Job clapp-avro-nontvui
(233b4938179c06974e4535ac8a868675) switched from state FAILING to FAILED.
org.apache.flink.runtime.JobException: Recovery is suppressed by
FailureRateRestartBackoffTimeStrategy(FailureRateRestartBackoffTimeStrategy(failuresIntervalMS=1800000,backoffTimeMS=30000,maxFailuresPerInterval=20)
at
org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116)
at
org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:78)
at
org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:203)
at
org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:185)
at
org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:179)
at
org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:508)
at
org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:49)
at
org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(ExecutionGraph.java:1725)
at
org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1287)
at
org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1255)
at
org.apache.flink.runtime.executiongraph.Execution.fail(Execution.java:954)
at
org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.signalPayloadRelease(SingleLogicalSlot.java:173)
at
org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot.release(SingleLogicalSlot.java:165)
at
org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:732)
at
org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537)
at
org.apache.flink.runtime.jobmaster.slotpool.AllocatedSlot.releasePayload(AllocatedSlot.java:149)
at
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseTaskManagerInternal(SlotPoolImpl.java:818)
at
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.releaseTaskManager(SlotPoolImpl.java:777)
at
org.apache.flink.runtime.jobmaster.JobMaster.disconnectTaskManager(JobMaster.java:435)
at
org.apache.flink.runtime.jobmaster.JobMaster.onStop(JobMaster.java:352)
at
org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:216)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:514)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:176)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at akka.actor.Actor.aroundReceive(Actor.scala:517)
at akka.actor.Actor.aroundReceive$(Actor.scala:515)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.util.FlinkException: Stopping JobMaster for job
clapp-avro-nontvui(233b4938179c06974e4535ac8a868675).
at
org.apache.flink.runtime.jobmaster.JobMaster.onStop(JobMaster.java:349)
... 22 more
2020-10-25 02:35:59,641 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator
[flink-akka.actor.default-dispatcher-81991] - Stopping checkpoint coordinator
for job 233b4938179c06974e4535ac8a868675.
2020-10-25 02:35:59,641 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
[flink-akka.actor.default-dispatcher-81991] - Shutting down
{code}
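For reference, my understanding is that the failure-rate strategy essentially
keeps a sliding window of failure timestamps and refuses further restarts once
maxFailuresPerInterval failures fall within failuresIntervalMS. Below is a
minimal sketch of that idea only; it is not Flink's actual
FailureRateRestartBackoffTimeStrategy implementation, and the class/field names
are made up for illustration.
{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch of the failure-rate check; names are hypothetical.
final class FailureRateSketch {
    private final long failuresIntervalMs;    // e.g. 1800000 (30 min), as in the log above
    private final int maxFailuresPerInterval; // e.g. 20, as in the log above
    private final Deque<Long> failureTimestamps = new ArrayDeque<>();

    FailureRateSketch(long failuresIntervalMs, int maxFailuresPerInterval) {
        this.failuresIntervalMs = failuresIntervalMs;
        this.maxFailuresPerInterval = maxFailuresPerInterval;
    }

    void notifyFailure(long nowMs) {
        failureTimestamps.addLast(nowMs);
    }

    // Once maxFailuresPerInterval failures are younger than failuresIntervalMs,
    // further recovery is suppressed and the job goes to a terminal state.
    boolean canRestart(long nowMs) {
        while (!failureTimestamps.isEmpty()
                && nowMs - failureTimestamps.peekFirst() > failuresIntervalMs) {
            failureTimestamps.removeFirst();
        }
        return failureTimestamps.size() < maxFailuresPerInterval;
    }
}
{code}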
3. Flink then removed the checkpoint metadata and the checkpoint-counter from zk
as it tried to fail/halt the job (see the cleanup sketch after the log excerpt
below):
{code}
2020-10-25 02:36:03,262 INFO
org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore
[flink-akka.actor.default-dispatcher-81991] - Removing
/spaas/clapp_avro-clapp_avro_nontvui/1593/default/checkpoints/233b4938179c06974e4535ac8a868675 from ZooKeeper
2020-10-25 02:36:03,269 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter
[flink-akka.actor.default-dispatcher-81991] - Shutting down.
2020-10-25 02:36:03,269 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter
[flink-akka.actor.default-dispatcher-81991] - Removing
/checkpoint-counter/233b4938179c06974e4535ac8a868675 from ZooKeeper
{code}
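Roughly speaking, that shutdown deletes both the completed-checkpoint handles
and the checkpoint-id counter nodes for the job from zookeeper. The following is
only an illustrative Curator sketch of such a cleanup; the znode paths are
examples based on the log above, not necessarily the exact layout Flink uses.
{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Hypothetical cleanup sketch; connection string and base paths are placeholders.
public class ZkCheckpointCleanupSketch {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk-host:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        String jobId = "233b4938179c06974e4535ac8a868675";
        // Completed-checkpoint handles kept by the checkpoint store
        client.delete().deletingChildrenIfNeeded()
                .forPath("/flink/default/checkpoints/" + jobId);
        // Counter node kept by the checkpoint ID counter
        client.delete().deletingChildrenIfNeeded()
                .forPath("/flink/default/checkpoint-counter/" + jobId);

        client.close();
    }
}
{code}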
4. But somehow Flink tried to start the job again, even though the failure-rate
restart-strategy had reached the terminal state where canRestart() returns
false. Maybe the zookeeper reconnect triggered another path of job recovery?
Because the proper checkpoint info in zookeeper had already been cleaned up in
the previous step, Flink then recovered from a wrong checkpoint (a sketch of the
suspected fallback follows the log excerpt below).
{code}
2020-10-25 02:36:03,291 INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher
[flink-akka.actor.default-dispatcher-82003] - Stopped dispatcher
akka.tcp://[email protected]:36481/user/rpc/dispatcher_1.
2020-10-25 02:36:03,301 INFO
org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess
[cluster-io-thread-3] - Start SessionDispatcherLeaderProcess.
2020-10-25 02:36:03,302 INFO
org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess
[cluster-io-thread-5] - Recover all persisted job graphs.
2020-10-25 02:36:03,307 INFO
org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess
[cluster-io-thread-5] - Trying to recover job with job id
233b4938179c06974e4535ac8a868675.
{code}
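My guess at the resulting fallback: once the completed-checkpoint store in
zookeeper has been wiped, recovery finds 0 checkpoints and falls back to the
savepoint recorded in the recovered job graph, which is how the very old
chk-203531 gets picked up (see the original description below). This is only a
simplified illustration of the suspected decision, not Flink's actual code; all
names are made up.
{code:java}
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the restore decision suspected in step 4.
final class RestoreDecisionSketch {
    static String chooseRestorePath(List<String> completedCheckpointsFromZk,
                                    Optional<String> savepointRestorePath) {
        if (!completedCheckpointsFromZk.isEmpty()) {
            // Normal case: restore from the latest completed checkpoint (chk-210768 here).
            return completedCheckpointsFromZk.get(completedCheckpointsFromZk.size() - 1);
        }
        // The store was wiped in step 3, so "Found 0 checkpoints in ZooKeeper" and the
        // job starts from the old savepoint (chk-203531 in this incident).
        return savepointRestorePath.orElseThrow(
                () -> new IllegalStateException("no checkpoint or savepoint to restore"));
    }
}
{code}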
> Flink restored from a wrong checkpoint (a very old one and not the last
> completed one)
> --------------------------------------------------------------------------------------
>
> Key: FLINK-19816
> URL: https://issues.apache.org/jira/browse/FLINK-19816
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.11.0
> Reporter: Steven Zhen Wu
> Priority: Major
>
> h2. Summary
> Upon failure, it seems that Flink didn't restore from the last completed
> checkpoint. Instead, it restored from a very old checkpoint. As a result, the
> Kafka offsets were no longer valid, which caused the job to replay from the
> beginning because the Kafka consumer's "auto.offset.reset" was set to
> "EARLIEST" (a minimal consumer-config sketch follows at the end of this summary).
> This is an embarrassingly parallel stateless job. Parallelism is over 1,000.
> I have the full log file from jobmanager at INFO level available upon request.
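> For illustration only, here is a minimal sketch of the relevant consumer
> configuration; the broker, topic and group id below are placeholders, and only
> the "auto.offset.reset" setting reflects this job. Because the offsets restored
> from the very old checkpoint were no longer valid, the consumer fell back to
> the earliest available offsets.
> {code:java}
> import java.util.Properties;
>
> import org.apache.flink.api.common.serialization.SimpleStringSchema;
> import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
>
> public class ConsumerConfigSketch {
>     public static void main(String[] args) {
>         Properties props = new Properties();
>         props.setProperty("bootstrap.servers", "kafka-broker:9092"); // placeholder
>         props.setProperty("group.id", "clapp-avro-nontvui");         // placeholder
>         // With "earliest", invalid/out-of-range offsets make the consumer rewind
>         // to the earliest available offset, i.e. a full replay of the topic.
>         props.setProperty("auto.offset.reset", "earliest");
>
>         FlinkKafkaConsumer<String> consumer = new FlinkKafkaConsumer<>(
>                 "input-topic", new SimpleStringSchema(), props); // placeholder topic
>     }
> }
> {code}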
> h2. Sequence of events from the logs
> Just before the failure, checkpoint *210768* completed.
> {code}
> 2020-10-25 02:35:05,970 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator
> [jobmanager-future-thread-5] - Completed checkpoint 210768 for job
> 233b4938179c06974e4535ac8a868675 (4623776 bytes in 120402 ms).
> {code}
> During restart, somehow it decided to restore from a very old checkpoint
> *203531*.
> {code:java}
> 2020-10-25 02:36:03,301 INFO
> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess
> [cluster-io-thread-3] - Start SessionDispatcherLeaderProcess.
> 2020-10-25 02:36:03,302 INFO
> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess
> [cluster-io-thread-5] - Recover all persisted job graphs.
> 2020-10-25 02:36:03,304 INFO com.netflix.bdp.s3fs.BdpS3FileSystem
> [cluster-io-thread-25] - Deleting path:
> s3://<bucket>/checkpoints/XM3B/clapp_avro-clapp_avro_nontvui/1593/233b4938179c06974e4535ac8a868675/chk-210758/c31aec1e-07a7-4193-aa00-3fbe83f9e2e6
> 2020-10-25 02:36:03,307 INFO
> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess
> [cluster-io-thread-5] - Trying to recover job with job id
> 233b4938179c06974e4535ac8a868675.
> 2020-10-25 02:36:03,381 INFO com.netflix.bdp.s3fs.BdpS3FileSystem
> [cluster-io-thread-25] - Deleting path:
> s3://<bucket>/checkpoints/Hh86/clapp_avro-clapp_avro_nontvui/1593/233b4938179c06974e4535ac8a868675/chk-210758/4ab92f70-dfcd-4212-9b7f-bdbecb9257fd
> ...
> 2020-10-25 02:36:03,427 INFO
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
> [flink-akka.actor.default-dispatcher-82003] - Recovering checkpoints from
> ZooKeeper.
> 2020-10-25 02:36:03,432 INFO
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
> [flink-akka.actor.default-dispatcher-82003] - Found 0 checkpoints in
> ZooKeeper.
> 2020-10-25 02:36:03,432 INFO
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
> [flink-akka.actor.default-dispatcher-82003] - Trying to fetch 0 checkpoints
> from storage.
> 2020-10-25 02:36:03,432 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator
> [flink-akka.actor.default-dispatcher-82003] - Starting job
> 233b4938179c06974e4535ac8a868675 from savepoint
> s3://<bucket>/checkpoints/metadata/clapp_avro-clapp_avro_nontvui/1113/47e2a25a8d0b696c7d0d423722bb6f54/chk-203531/_metadata
> ()
> {code}