[
https://issues.apache.org/jira/browse/FLINK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162553#comment-17162553
]
Biao Liu commented on FLINK-18641:
----------------------------------
Thanks [~becket_qin] for analyzing this issue. The asynchronous checkpoint
threading model breaks the assumption that an {{ExternallyInducedSource}} can
trigger a checkpoint before {{MasterTriggerRestoreHook}} completes the trigger
future. We can find a way to guarantee that. However, it is unfriendly to the
scenario of doing some initialization or preparation in
{{MasterTriggerRestoreHook}}, because a checkpoint might be triggered before
the initialization or preparation finishes. Hope nobody uses it like that :(
Currently the semantics of {{ExternallyInducedSource}} are tightly bound to the
implementation of Flink checkpointing, which should be avoided IMO. I think we
should redesign {{ExternallyInducedSource}} as a long-term goal.
To [~becket_qin], do you already have an idea for fixing it? If not, I could
help fix it.
BTW, this change to {{CheckpointCoordinator}} was introduced in 1.10. Is it
possible that the test failure is exposed by the {{OperatorCoordinator}}
change? We added another asynchronous step between triggering the master hooks
and triggering the tasks. I'm not sure whether any {{OperatorCoordinator}} is
added in the Pravega connector test scenario. If not, there is a workaround:
try to complete the future returned by
{{MasterTriggerRestoreHook.triggerCheckpoint}} before triggering the task
checkpoint (I assume there is only one master hook in this case).
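To make the ordering behind that workaround concrete, here is a minimal toy
sketch. It uses only {{java.util.concurrent}} and does not touch Flink's real
classes: {{HookOrderingSketch}}, {{MasterHook}} and {{runCheckpoint}} are
hypothetical names made up for illustration, and the model assumes a single
master hook. It only shows the task trigger being chained after the hook's
future, so that finalization cannot run ahead of the master state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

class HookOrderingSketch {

    /** Hypothetical stand-in for a master hook; this is NOT Flink's MasterTriggerRestoreHook. */
    interface MasterHook {
        CompletableFuture<String> triggerCheckpoint(long checkpointId);
    }

    /** Simulates one checkpoint and returns the events in the order they happened. */
    static List<String> runCheckpoint(long checkpointId, MasterHook hook) {
        List<String> events = new ArrayList<>();

        // Step 1: ask the (single) master hook for its state.
        CompletableFuture<String> masterState = hook.triggerCheckpoint(checkpointId);

        // Steps 2 and 3 are chained on the hook's future, so the task trigger
        // and the finalization can only run after the master state is available.
        CompletableFuture<Void> done = masterState
                .thenAccept(state -> events.add("master-state:" + state))
                .thenRun(() -> events.add("trigger-tasks:" + checkpointId))
                .thenRun(() -> events.add("finalize:" + checkpointId));

        done.join(); // wait for the whole chain before reading the events
        return events;
    }

    public static void main(String[] args) {
        // A hook that completes asynchronously, as a real hook might.
        MasterHook hook = id -> CompletableFuture.supplyAsync(() -> "hook-" + id);
        System.out.println(runCheckpoint(3L, hook));
    }
}
```

In this toy model the hook's future always completes before the task trigger
runs, whether the hook finishes synchronously or on another thread. Whether
the same ordering can be enforced in the connector depends on how the hook and
the source interact, which the sketch does not model.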
> "Failure to finalize checkpoint" error in MasterTriggerRestoreHook
> ------------------------------------------------------------------
>
> Key: FLINK-18641
> URL: https://issues.apache.org/jira/browse/FLINK-18641
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.11.0
> Reporter: Brian Zhou
> Priority: Major
>
> https://github.com/pravega/flink-connectors is a Pravega connector for Flink.
> The ReaderCheckpointHook[1] class uses the Flink `MasterTriggerRestoreHook`
> interface to trigger the Pravega checkpoint during Flink checkpoints to
> ensure data recovery. The checkpoint recovery tests run fine on Flink 1.10,
> but on Flink 1.11 they hit the issue below, which causes the tests to time
> out. We suspect it is related to the checkpoint coordinator threading model
> changes in Flink 1.11.
> Error stacktrace:
> {code}
> 2020-07-09 15:39:39,999 30945 [jobmanager-future-thread-5] WARN o.a.f.runtime.jobmaster.JobMaster - Error while processing checkpoint acknowledgement message
> org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize the pending checkpoint 3. Failure reason: Failure to finalize checkpoint.
> at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1033)
> at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:948)
> at org.apache.flink.runtime.scheduler.SchedulerBase.lambda$acknowledgeCheckpoint$4(SchedulerBase.java:802)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.flink.util.SerializedThrowable: Pending checkpoint has not been fully acknowledged yet
> at org.apache.flink.util.Preconditions.checkState(Preconditions.java:195)
> at org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:298)
> at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1021)
> ... 9 common frames omitted
> {code}
> More detail in this mailing thread:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Pravega-connector-cannot-recover-from-the-checkpoint-due-to-quot-Failure-to-finalize-checkpoint-quot-td36652.html
> Also in https://github.com/pravega/flink-connectors/issues/387