[
https://issues.apache.org/jira/browse/FLINK-26630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17506103#comment-17506103
]
Matthias Pohl commented on FLINK-26630:
---------------------------------------
I'm linking FLINK-24038 and FLINK-25235 here as causes. Essentially, though,
this was already a weakness of the {{TestingMiniCluster}} before those changes:
even with a single {{EmbeddedHaServices}} instance, you could not test a
recovery, because the {{JobGraphStore}} implementation it uses is a no-op.
> EmbeddedHaServices is not made for recovery on a single instance
> ----------------------------------------------------------------
>
> Key: FLINK-26630
> URL: https://issues.apache.org/jira/browse/FLINK-26630
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.15.0
> Reporter: Matthias Pohl
> Priority: Critical
>
> {{EmbeddedHaServices}} (and {{EmbeddedHaServicesWithLeadershipControl}})
> provide leader election functionality that works within a single JVM. In
> FLINK-25235 we introduced the re-instantiation of
> {{HighAvailabilityServices}} per {{JobManager}} (i.e. per
> {{DispatcherResourceManagerComponent}}) in {{TestingMiniCluster}}. This
> allows the {{HighAvailabilityServices}} to be closed during the shutdown of
> a JM rather than only at the end of the HA cluster's lifetime, which is
> closer to a production environment where each JM has its own HAServices
> instance. That became crucial as part of the work on FLINK-24038, which
> revokes the leadership when the HAServices are closed during a JM shutdown.
> The {{EmbeddedHaServices}}, though, provide a no-op
> {{StandaloneJobGraphStore}} implementation, i.e. no real recovery can be
> tested with the {{TestingMiniCluster}} (this was the case even before the
> change in FLINK-25235). We should still fix that so that users can use the
> {{TestingMiniCluster}} for such purposes. That means providing a
> {{JobGraphStore}} and a {{JobResultStore}} that are shared between the
> different {{HighAvailabilityServices}} instances, and probably the
> checkpoint-related HA components as well.
> Right now, the multi-JM setup of the {{TestingMiniCluster}} is only used in
> {{ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange}}
> where it's bound to the {{ZooKeeperHaServices}}. Therefore, it's not a
> pressing issue for 1.15. But we should fix it as a follow-up.
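
To illustrate the shared-store idea from the description, here is a minimal sketch. It deliberately does not use Flink's real interfaces: {{JobStore}}, {{NoOpJobStore}}, {{InMemoryJobStore}} and {{HaServices}} are hypothetical stand-ins invented for this example, and the job graph is reduced to a plain string. It only shows why a store that outlives a single HA services instance is needed for a second JM to recover anything.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SharedJobGraphStoreSketch {

    /** Hypothetical stand-in for a job graph store; NOT Flink's JobGraphStore interface. */
    interface JobStore {
        void put(String jobId, String jobGraph);

        Collection<String> getJobIds();
    }

    /** Discards every job, like a no-op store: nothing is left to recover. */
    static final class NoOpJobStore implements JobStore {
        @Override
        public void put(String jobId, String jobGraph) {
            // intentionally does nothing
        }

        @Override
        public Collection<String> getJobIds() {
            return Collections.emptyList();
        }
    }

    /** In-memory store intended to be shared by all HA services instances in one JVM. */
    static final class InMemoryJobStore implements JobStore {
        private final Map<String, String> jobs = new ConcurrentHashMap<>();

        @Override
        public void put(String jobId, String jobGraph) {
            jobs.put(jobId, jobGraph);
        }

        @Override
        public Collection<String> getJobIds() {
            return new ArrayList<>(jobs.keySet());
        }
    }

    /** Hypothetical per-JobManager HA services that only hold a reference to a store. */
    static final class HaServices implements AutoCloseable {
        private final JobStore store;

        HaServices(JobStore store) {
            this.store = store;
        }

        JobStore getJobStore() {
            return store;
        }

        @Override
        public void close() {
            // Closing one JM's HA services must not clear the shared store.
        }
    }

    /** Submits a job on the first JM, shuts it down, and returns what the second JM can see. */
    static Collection<String> recoverAfterFailover(JobStore storeForJm1, JobStore storeForJm2) {
        try (HaServices jm1 = new HaServices(storeForJm1)) {
            jm1.getJobStore().put("job-1", "serialized-job-graph");
        }
        try (HaServices jm2 = new HaServices(storeForJm2)) {
            return jm2.getJobStore().getJobIds();
        }
    }

    public static void main(String[] args) {
        JobStore shared = new InMemoryJobStore();
        System.out.println("shared store: " + recoverAfterFailover(shared, shared));
        System.out.println("no-op stores: "
                + recoverAfterFailover(new NoOpJobStore(), new NoOpJobStore()));
    }
}
```

With the shared in-memory store, the second instance still sees the job after the first one is closed; with per-instance no-op stores it sees nothing. An actual fix would have to do the same for the {{JobResultStore}} and probably the checkpoint-related components.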