[
https://issues.apache.org/jira/browse/FLINK-23240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488863#comment-17488863
]
Xintong Song commented on FLINK-23240:
--------------------------------------
[~trohrmann],
I think you have understood the code correctly.
Moreover, although the original RM interface (and some test cases, hence the
`enable-rm-multi-leader-session` property) looks like it was designed to live
through multiple leader sessions, my impression is that we never really
supported that. I cannot recall what the problem was before FLINK-21667.
According to the discussion in this
[PR|https://github.com/apache/flink/pull/15524], we agreed to narrow the
scope of FLINK-21667 to only solving the problem that a non-leading RM may
accidentally change the resources, and decided to support multiple leader
sessions in one process later if needed.
I think at least the Yarn deployment cannot support multiple leader sessions.
Maybe we can revisit this for other deployment modes, and not terminate the
process where multiple leader sessions are feasible.
From my side, I would not consider this a release blocker, because normally
the RM loses leadership only when 1) there's a problem with the leading master
process or 2) the HA services are unstable/unavailable.
- For 1), terminating the process is desired.
- For 2), the job fails anyway. The regression only exists when the HA services
have been down long enough to trigger the loss of leadership, and then come
back online faster than the process can be restarted.
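The two cases above can be written down as a small decision function. This is
only an illustrative sketch: the class, enum, method, and parameter names are
hypothetical and not part of Flink, and the two durations are assumed inputs
that a real deployment would not know in advance.

```java
import java.time.Duration;

/** Hypothetical sketch; none of these names exist in Flink. */
public class LeadershipLossSketch {

    /** The two causes of RM leadership loss discussed above. */
    public enum Cause {
        MASTER_PROCESS_PROBLEM,   // case 1)
        HA_SERVICES_UNAVAILABLE   // case 2)
    }

    /**
     * Returns true when terminating the process on leadership loss is worse
     * than keeping it alive for a later leader session.
     *
     * @param haOutageAfterLoss assumed time the HA services stay down after leadership is lost
     * @param processRestart    assumed time needed to restart the master process
     */
    public static boolean terminationIsRegression(
            Cause cause, Duration haOutageAfterLoss, Duration processRestart) {
        if (cause == Cause.MASTER_PROCESS_PROBLEM) {
            // Case 1): the leading master process is unhealthy, so termination is desired.
            return false;
        }
        // Case 2): a regression only if the HA services come back online
        // faster than the process can be restarted.
        return haOutageAfterLoss.compareTo(processRestart) < 0;
    }

    public static void main(String[] args) {
        Duration restart = Duration.ofSeconds(30);
        System.out.println(terminationIsRegression(
                Cause.MASTER_PROCESS_PROBLEM, Duration.ofSeconds(5), restart));
        System.out.println(terminationIsRegression(
                Cause.HA_SERVICES_UNAVAILABLE, Duration.ofSeconds(5), restart));
        System.out.println(terminationIsRegression(
                Cause.HA_SERVICES_UNAVAILABLE, Duration.ofSeconds(120), restart));
    }
}
```

Only the second call in `main` falls into the narrow window described above,
which is why the regression is expected to be rare.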
> ResumeCheckpointManuallyITCase.testExternalizedFSCheckpointsWithLocalRecoveryZookeeper
> fails on azure
> -----------------------------------------------------------------------------------------------------
>
> Key: FLINK-23240
> URL: https://issues.apache.org/jira/browse/FLINK-23240
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.14.0, 1.15.0
> Reporter: Xintong Song
> Priority: Blocker
> Labels: test-stability
> Fix For: 1.15.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=19872&view=logs&j=b0a398c0-685b-599c-eb57-c8c2a771138e&t=d13f554f-d4b9-50f8-30ee-d49c6fb0b3cc&l=10186
> {code}
> Jul 04 22:17:29 [ERROR] Tests run: 12, Failures: 0, Errors: 1, Skipped: 0,
> Time elapsed: 91.407 s <<< FAILURE! - in
> org.apache.flink.test.checkpointing.ResumeCheckpointManuallyITCase
> Jul 04 22:17:29 [ERROR]
> testExternalizedFSCheckpointsWithLocalRecoveryZookeeper(org.apache.flink.test.checkpointing.ResumeCheckpointManuallyITCase)
> Time elapsed: 31.356 s <<< ERROR!
> Jul 04 22:17:29 java.util.concurrent.ExecutionException:
> java.util.concurrent.TimeoutException: Invocation of public abstract
> java.util.concurrent.CompletableFuture
> org.apache.flink.runtime.webmonitor.RestfulGateway.cancelJob(org.apache.flink.api.common.JobID,org.apache.flink.api.common.time.Time)
> timed out.
> Jul 04 22:17:29 at
> java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
> Jul 04 22:17:29 at
> java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999)
> Jul 04 22:17:29 at
> org.apache.flink.test.checkpointing.ResumeCheckpointManuallyITCase.runJobAndGetExternalizedCheckpoint(ResumeCheckpointManuallyITCase.java:303)
> Jul 04 22:17:29 at
> org.apache.flink.test.checkpointing.ResumeCheckpointManuallyITCase.testExternalizedCheckpoints(ResumeCheckpointManuallyITCase.java:275)
> Jul 04 22:17:29 at
> org.apache.flink.test.checkpointing.ResumeCheckpointManuallyITCase.testExternalizedFSCheckpointsWithLocalRecoveryZookeeper(ResumeCheckpointManuallyITCase.java:215)
> Jul 04 22:17:29 at
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> Jul 04 22:17:29 at
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> Jul 04 22:17:29 at
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Jul 04 22:17:29 at
> java.base/java.lang.reflect.Method.invoke(Method.java:566)
> Jul 04 22:17:29 at
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> Jul 04 22:17:29 at
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> Jul 04 22:17:29 at
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
> Jul 04 22:17:29 at
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> Jul 04 22:17:29 at
> org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
> Jul 04 22:17:29 at
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
> Jul 04 22:17:29 at
> org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> Jul 04 22:17:29 at
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
> Jul 04 22:17:29 at
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
> Jul 04 22:17:29 at
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
> Jul 04 22:17:29 at
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
> Jul 04 22:17:29 at
> org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
> Jul 04 22:17:29 at
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
> Jul 04 22:17:29 at
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
> Jul 04 22:17:29 at
> org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
> Jul 04 22:17:29 at
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
> Jul 04 22:17:29 at
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
> Jul 04 22:17:29 at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> Jul 04 22:17:29 at
> org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> Jul 04 22:17:29 at
> org.junit.runners.ParentRunner.run(ParentRunner.java:413)
> Jul 04 22:17:29 at org.junit.runners.Suite.runChild(Suite.java:128)
> Jul 04 22:17:29 at org.junit.runners.Suite.runChild(Suite.java:27)
> Jul 04 22:17:29 at
> org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
> Jul 04 22:17:29 at
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
> Jul 04 22:17:29 at
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
> Jul 04 22:17:29 at
> org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
> Jul 04 22:17:29 at
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
> Jul 04 22:17:29 at
> org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> Jul 04 22:17:29 at
> org.junit.runners.ParentRunner.run(ParentRunner.java:413)
> Jul 04 22:17:29 at
> org.apache.maven.surefire.junitcore.JUnitCore.run(JUnitCore.java:55)
> Jul 04 22:17:29 at
> org.apache.maven.surefire.junitcore.JUnitCoreWrapper.createRequestAndRun(JUnitCoreWrapper.java:137)
> Jul 04 22:17:29 at
> org.apache.maven.surefire.junitcore.JUnitCoreWrapper.executeEager(JUnitCoreWrapper.java:107)
> Jul 04 22:17:29 at
> org.apache.maven.surefire.junitcore.JUnitCoreWrapper.execute(JUnitCoreWrapper.java:83)
> Jul 04 22:17:29 at
> org.apache.maven.surefire.junitcore.JUnitCoreWrapper.execute(JUnitCoreWrapper.java:75)
> Jul 04 22:17:29 at
> org.apache.maven.surefire.junitcore.JUnitCoreProvider.invoke(JUnitCoreProvider.java:158)
> Jul 04 22:17:29 at
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
> Jul 04 22:17:29 at
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
> Jul 04 22:17:29 at
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
> Jul 04 22:17:29 at
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> Jul 04 22:17:29 Caused by: java.util.concurrent.TimeoutException: Invocation
> of public abstract java.util.concurrent.CompletableFuture
> org.apache.flink.runtime.webmonitor.RestfulGateway.cancelJob(org.apache.flink.api.common.JobID,org.apache.flink.api.common.time.Time)
> timed out.
> Jul 04 22:17:29 at com.sun.proxy.$Proxy30.cancelJob(Unknown Source)
> Jul 04 22:17:29 at
> org.apache.flink.runtime.minicluster.MiniCluster.lambda$cancelJob$7(MiniCluster.java:716)
> Jul 04 22:17:29 at
> java.base/java.util.concurrent.CompletableFuture.uniApplyNow(CompletableFuture.java:680)
> Jul 04 22:17:29 at
> java.base/java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:658)
> Jul 04 22:17:29 at
> java.base/java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:2094)
> Jul 04 22:17:29 at
> org.apache.flink.runtime.minicluster.MiniCluster.runDispatcherCommand(MiniCluster.java:758)
> Jul 04 22:17:29 at
> org.apache.flink.runtime.minicluster.MiniCluster.cancelJob(MiniCluster.java:715)
> Jul 04 22:17:29 at
> org.apache.flink.client.program.MiniClusterClient.cancel(MiniClusterClient.java:83)
> Jul 04 22:17:29 ... 46 more
> Jul 04 22:17:29 Caused by: akka.pattern.AskTimeoutException: Ask timed out on
> [Actor[akka://flink/user/rpc/dispatcher_2#-1806874751]] after [10000 ms].
> Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A
> typical reason for `AskTimeoutException` is that the recipient actor didn't
> send a reply.
> Jul 04 22:17:29 at
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> Jul 04 22:17:29 at
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> Jul 04 22:17:29 at
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
> Jul 04 22:17:29 at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
> Jul 04 22:17:29 at
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> Jul 04 22:17:29 at
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> Jul 04 22:17:29 at
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> Jul 04 22:17:29 at
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
> Jul 04 22:17:29 at
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
> Jul 04 22:17:29 at
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
> Jul 04 22:17:29 at
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
> Jul 04 22:17:29 at java.base/java.lang.Thread.run(Thread.java:834)
> {code}