[
https://issues.apache.org/jira/browse/FLINK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164178#comment-17164178
]
Jiangjie Qin commented on FLINK-18641:
--------------------------------------
[~pnowojski] Synchronously waiting for the hook's future is not sufficient
after the introduction of OperatorCoordinator. This is because when the master
hooks are triggered on the {{ExternallyInducedSource}}, the tasks may start to
checkpoint the operators before the OperatorCoordinator is checkpointed. This
breaks the checkpoint contract between the OperatorCoordinator and the
Operator, which is that the OperatorCoordinator is always checkpointed before
the subtasks/operators are checkpointed.
Otherwise, yes, we only need to wait for the master hooks to finish before
completing the checkpoint.
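
To make the ordering concrete, here is a minimal sketch of the contract described above. It is only an illustration under stated assumptions: {{CheckpointOrderingSketch}} and {{triggerCheckpoint}} are hypothetical names, not Flink classes or methods, and the three steps are stand-ins for the real coordinator snapshot, master hook trigger and task snapshots. It does not claim to show how the CheckpointCoordinator is or should be implemented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical, simplified model of the ordering contract. None of these
// names are Flink classes; each step only records that it ran.
final class CheckpointOrderingSketch {

    // Runs one simulated checkpoint and returns the order in which steps ran.
    static List<String> triggerCheckpoint(long checkpointId) {
        List<String> events = Collections.synchronizedList(new ArrayList<>());

        // Step 1: snapshot the (stand-in) OperatorCoordinator first.
        CompletableFuture<Void> coordinatorDone =
                CompletableFuture.runAsync(() -> events.add("coordinator-" + checkpointId));

        // Step 2: trigger the master hook only after the coordinator snapshot
        // has finished, because the hook may induce the task checkpoints.
        CompletableFuture<Void> hookDone =
                coordinatorDone.thenRun(() -> events.add("master-hook-" + checkpointId));

        // Step 3: the tasks snapshot their operators after the hook was triggered.
        CompletableFuture<Void> tasksDone =
                hookDone.thenRun(() -> events.add("tasks-" + checkpointId));

        // The checkpoint is considered complete once the whole chain is done.
        tasksDone.join();
        return new ArrayList<>(events);
    }

    public static void main(String[] args) {
        System.out.println(triggerCheckpoint(3L));
    }
}
```

Because each step is chained on the completion of the previous one, the task step cannot run before the coordinator step in this sketch, which is the property that waiting on the hook's future alone does not give.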
> "Failure to finalize checkpoint" error in MasterTriggerRestoreHook
> ------------------------------------------------------------------
>
> Key: FLINK-18641
> URL: https://issues.apache.org/jira/browse/FLINK-18641
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.11.0
> Reporter: Brian Zhou
> Priority: Major
>
> https://github.com/pravega/flink-connectors is a Pravega connector for Flink.
> The ReaderCheckpointHook[1] class uses the Flink `MasterTriggerRestoreHook`
> interface to trigger a Pravega checkpoint during Flink checkpoints so that
> data can be recovered. The checkpoint recovery tests run fine on Flink 1.10,
> but on Flink 1.11 they hit the issue below, which causes the tests to time out.
> We suspect it is related to the checkpoint coordinator threading model changes
> in Flink 1.11.
> Error stack trace:
> {code}
> 2020-07-09 15:39:39,999 30945 [jobmanager-future-thread-5] WARN o.a.f.runtime.jobmaster.JobMaster - Error while processing checkpoint acknowledgement message
> org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize the pending checkpoint 3. Failure reason: Failure to finalize checkpoint.
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1033)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:948)
>     at org.apache.flink.runtime.scheduler.SchedulerBase.lambda$acknowledgeCheckpoint$4(SchedulerBase.java:802)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.flink.util.SerializedThrowable: Pending checkpoint has not been fully acknowledged yet
>     at org.apache.flink.util.Preconditions.checkState(Preconditions.java:195)
>     at org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:298)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1021)
>     ... 9 common frames omitted
> {code}
> More details are in this mailing list thread:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Pravega-connector-cannot-recover-from-the-checkpoint-due-to-quot-Failure-to-finalize-checkpoint-quot-td36652.html
> Also in https://github.com/pravega/flink-connectors/issues/387