[ https://issues.apache.org/jira/browse/FLINK-25992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490624#comment-17490624 ]
Liu commented on FLINK-25992:
-----------------------------
Thanks [~roman]. I have checked the code and the log. When the job restores from
checkpoint 2, the method reportRestoredCheckpoint in CheckpointStatsTracker is
called, so an update is in progress and the member dirty is set to true. In the
method createSnapshot, the snapshot is skipped (not updated) only when
statsReadWriteLock.tryLock() returns false. Additionally, createSnapshot is also
triggered while awaiting the job status at line 138 of JobDispatcherITCase.
Based on the above, it is hard to pin down the root cause. Your suggestion is
valuable, but I am not sure whether it resolves the problem completely. What
do you think?
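The snapshot behaviour described in the comment (a dirty flag set by reportRestoredCheckpoint, and a createSnapshot that refreshes only when tryLock() succeeds) can be modelled with a small, self-contained sketch. It is a hypothetical simplification, not Flink's CheckpointStatsTracker: the `Snapshot` class, the `restoredCheckpointId` field, and the choice of a plain `ReentrantLock` for `statsReadWriteLock` are all assumptions made for illustration.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical, simplified model of the dirty-flag + tryLock pattern discussed
// above. Names mirror the comment, but this is NOT Flink's CheckpointStatsTracker.
class StatsTrackerSketch {

    // Immutable view handed to readers; -1 means "no checkpoint restored yet".
    static final class Snapshot {
        final long restoredCheckpointId;

        Snapshot(long restoredCheckpointId) {
            this.restoredCheckpointId = restoredCheckpointId;
        }
    }

    // Assumption: a plain ReentrantLock; package-private so a test can hold it.
    final ReentrantLock statsReadWriteLock = new ReentrantLock();

    private volatile boolean dirty = false;
    private long restoredCheckpointId = -1; // guarded by statsReadWriteLock
    private volatile Snapshot latestSnapshot = new Snapshot(-1);

    // Writer side: record the restore and mark the cached snapshot as stale.
    void reportRestoredCheckpoint(long checkpointId) {
        statsReadWriteLock.lock();
        try {
            restoredCheckpointId = checkpointId;
            dirty = true;
        } finally {
            statsReadWriteLock.unlock();
        }
    }

    // Reader side: rebuild only if dirty AND the lock is free right now.
    // If another thread holds the lock (an update in progress), the previous,
    // possibly stale, snapshot is returned instead of blocking.
    Snapshot createSnapshot() {
        Snapshot snapshot = latestSnapshot;
        if (dirty && statsReadWriteLock.tryLock()) {
            try {
                snapshot = new Snapshot(restoredCheckpointId);
                latestSnapshot = snapshot;
                dirty = false;
            } finally {
                statsReadWriteLock.unlock();
            }
        }
        return snapshot;
    }

    public static void main(String[] args) {
        StatsTrackerSketch tracker = new StatsTrackerSketch();
        System.out.println(tracker.createSnapshot().restoredCheckpointId); // prints -1
        tracker.reportRestoredCheckpoint(2);
        System.out.println(tracker.createSnapshot().restoredCheckpointId); // prints 2
    }
}
```

In this model a caller that polls createSnapshot() while another thread holds the lock simply gets the previous snapshot; the next call after the lock is released picks up the change, because dirty stays true until a refresh actually happens.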
> JobDispatcherITCase.testRecoverFromCheckpointAfterLosingAndRegainingLeadership fails on azure
> ---------------------------------------------------------------------------------------------
>
> Key: FLINK-25992
> URL: https://issues.apache.org/jira/browse/FLINK-25992
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Tests
> Affects Versions: 1.15.0
> Reporter: Roman Khachatryan
> Priority: Major
> Labels: test-stability
> Fix For: 1.15.0
>
> Attachments: mvn-2.log
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=30871&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=24c3384f-1bcb-57b3-224f-51bf973bbee8&l=9154
> {code}
> 19:41:35,515 [flink-akka.actor.default-dispatcher-9] WARN org.apache.flink.runtime.taskmanager.Task [] - jobVertex (1/1)#0 (7efdea21f5f95490e02117063ce8a314) switched from RUNNING to FAILED with failure cause: java.lang.RuntimeException: Error while notify checkpoint ABORT.
> at org.apache.flink.runtime.taskmanager.Task.notifyCheckpoint(Task.java:1457)
> at org.apache.flink.runtime.taskmanager.Task.notifyCheckpointAborted(Task.java:1407)
> at org.apache.flink.runtime.taskexecutor.TaskExecutor.abortCheckpoint(TaskExecutor.java:1021)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRpcInvocation$1(AkkaRpcActor.java:316)
> at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83)
> at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:314)
> at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:217)
> at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163)
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
> at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
> at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
> at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
> at akka.actor.Actor.aroundReceive(Actor.scala:537)
> at akka.actor.Actor.aroundReceive$(Actor.scala:535)
> at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580)
> at akka.actor.ActorCell.invoke(ActorCell.scala:548)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
> at akka.dispatch.Mailbox.run(Mailbox.scala:231)
> at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
> at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
> at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
> at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
> at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
> Caused by: java.lang.UnsupportedOperationException: notifyCheckpointAbortAsync not supported by org.apache.flink.runtime.dispatcher.JobDispatcherITCase$AtLeastOneCheckpointInvokable
> at org.apache.flink.runtime.jobgraph.tasks.AbstractInvokable.notifyCheckpointAbortAsync(AbstractInvokable.java:205)
> at org.apache.flink.runtime.taskmanager.Task.notifyCheckpoint(Task.java:1430)
> ... 31 more
> {code}
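The root exception in the quoted trace is thrown by a default method: AbstractInvokable.notifyCheckpointAbortAsync (AbstractInvokable.java:205) rejects the call, naming AtLeastOneCheckpointInvokable, which indicates the test invokable relies on that default rather than overriding it. A plain-Java sketch of the pattern follows. It does not use Flink's API: the classes `InvokableSketch`, `NoAbortSupportInvokable`, and `AbortTolerantInvokable` are hypothetical, the one-argument signature is simplified, and the no-op override is only one possible remedy, not necessarily the fix chosen for FLINK-25992.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;

// Hypothetical model of the failure: a base class whose default
// abort-notification hook throws, so any subclass that does not override it
// fails as soon as an abort notification arrives. NOT Flink's AbstractInvokable.
abstract class InvokableSketch {

    // Default hook: rejects the call and names the concrete class, like the trace.
    Future<Void> notifyCheckpointAbortAsync(long checkpointId) {
        throw new UnsupportedOperationException(
                "notifyCheckpointAbortAsync not supported by " + getClass().getName());
    }
}

// Mirrors a test invokable that never expected a checkpoint to be aborted.
class NoAbortSupportInvokable extends InvokableSketch {}

// One possible remedy (an assumption): override the hook and treat an abort as a no-op.
class AbortTolerantInvokable extends InvokableSketch {
    @Override
    Future<Void> notifyCheckpointAbortAsync(long checkpointId) {
        return CompletableFuture.completedFuture(null);
    }
}

class AbortDemo {
    public static void main(String[] args) throws Exception {
        try {
            new NoAbortSupportInvokable().notifyCheckpointAbortAsync(3L);
        } catch (UnsupportedOperationException e) {
            // Same failure mode as the "Caused by" line in the trace.
            System.out.println("rejected: " + e.getMessage());
        }
        // Completes normally because the subclass overrides the hook.
        new AbortTolerantInvokable().notifyCheckpointAbortAsync(3L).get();
        System.out.println("abort tolerated");
    }
}
```

The trace shows the task switching to FAILED once this exception escapes from Task.notifyCheckpoint; whether a restore or leadership change in the test can legitimately trigger an abort notification is the open question in the thread.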