[
https://issues.apache.org/jira/browse/FLINK-26630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matthias Pohl updated FLINK-26630:
----------------------------------
Description:
{{EmbeddedHaServices}} (and {{EmbeddedHaServicesWithLeadershipControl}})
provide leader election functionality within a single JVM. In FLINK-25235 we
introduced the re-instantiation of {{HighAvailabilityServices}} per
{{JobManager}} (i.e. {{DispatcherResourceManagerComponent}}) in
{{TestingMiniCluster}} so that the {{HighAvailabilityServices}} can be closed
during the shutdown of a JM rather than only at the end of the HA cluster.
This brings the setup closer to a production environment, where each JM has its
own HAServices instance as well. That became crucial as part of FLINK-24038,
which revokes leadership when it closes the HAServices during a JM shutdown.
The {{EmbeddedHaServices}}, though, provide a no-op {{StandaloneJobGraphStore}}
implementation, i.e. no real recovery can be tested with the
{{TestingMiniCluster}} (even before the change of FLINK-25235). We should fix
that so that users can use the {{TestingMiniCluster}} for such purposes.
That means we should provide a {{JobGraphStore}} and a {{JobResultStore}} that
are shared between the different {{HighAvailabilityServices}} instances, and
probably also the checkpoint-related HA components.
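The sharing idea can be sketched in plain Java (a minimal, hypothetical sketch, not the actual Flink interfaces; {{SharedJobGraphStore}} and {{EmbeddedHaServicesSketch}} are made-up names for illustration): the store is created once and handed to each per-JM HA services instance, so closing one instance does not discard state another instance needs for recovery.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory store shared across per-JM HA service instances.
// Unlike a no-op store, submitted graphs survive and can be recovered.
class SharedJobGraphStore {
    private final Map<String, String> graphsByJobId = new ConcurrentHashMap<>();

    void putJobGraph(String jobId, String serializedGraph) {
        graphsByJobId.put(jobId, serializedGraph);
    }

    Optional<String> recoverJobGraph(String jobId) {
        return Optional.ofNullable(graphsByJobId.get(jobId));
    }
}

// Each JobManager gets its own HA services instance (so it can be closed on
// JM shutdown), but all instances reference the same shared store.
class EmbeddedHaServicesSketch implements AutoCloseable {
    private final SharedJobGraphStore sharedStore;

    EmbeddedHaServicesSketch(SharedJobGraphStore sharedStore) {
        this.sharedStore = sharedStore;
    }

    SharedJobGraphStore getJobGraphStore() {
        return sharedStore;
    }

    @Override
    public void close() {
        // Release per-instance resources only; the shared store outlives us.
    }
}
```

With this shape, a graph submitted through one JM's HA services remains recoverable through another JM's instance after the first one is closed, which is exactly what the no-op {{StandaloneJobGraphStore}} prevents today.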
Right now, the multi-JM setup of the {{TestingMiniCluster}} is only used in
{{ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange}},
where it is bound to the {{ZooKeeperHAServices}}. Therefore, it is not a
pressing issue for 1.15, but we should fix it as a follow-up.
was:{{EmbeddedHaServices}} (and {{EmbeddedHaServicesWithLeadershipControl}})
provide leader election functionality to work on a single JVM. In FLINK-25235
we introduced the re-instantiation of {{HighAvailabilityServices}} per
{{JobManager}} (i.e. {{DispatcherResourceManagerComponent}}) to be able to
close the HighAvailabilityServices during the shutdown of a JM and not only at
the end of the HA cluster to get closer to a production environment where each
JM has its own HAServices instance as well (that became crucial as part of the
work of FLINK-24038 which revokes the leadership when it closes the HAServices
during a JM shutdown).
> EmbeddedHaServices is not made for recovery on a single instance
> ----------------------------------------------------------------
>
> Key: FLINK-26630
> URL: https://issues.apache.org/jira/browse/FLINK-26630
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.15.0
> Reporter: Matthias Pohl
> Priority: Critical
>
> {{EmbeddedHaServices}} (and {{EmbeddedHaServicesWithLeadershipControl}})
> provide leader election functionality within a single JVM. In FLINK-25235 we
> introduced the re-instantiation of {{HighAvailabilityServices}} per
> {{JobManager}} (i.e. {{DispatcherResourceManagerComponent}}) in
> {{TestingMiniCluster}} so that the {{HighAvailabilityServices}} can be closed
> during the shutdown of a JM rather than only at the end of the HA cluster.
> This brings the setup closer to a production environment, where each JM has
> its own HAServices instance as well. That became crucial as part of
> FLINK-24038, which revokes leadership when it closes the HAServices during a
> JM shutdown.
> The {{EmbeddedHaServices}}, though, provide a no-op
> {{StandaloneJobGraphStore}} implementation, i.e. no real recovery can be
> tested with the {{TestingMiniCluster}} (even before the change of
> FLINK-25235). We should fix that so that users can use the
> {{TestingMiniCluster}} for such purposes. That means we should provide a
> {{JobGraphStore}} and a {{JobResultStore}} that are shared between the
> different {{HighAvailabilityServices}} instances, and probably also the
> checkpoint-related HA components.
> Right now, the multi-JM setup of the {{TestingMiniCluster}} is only used in
> {{ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange}},
> where it is bound to the {{ZooKeeperHAServices}}. Therefore, it is not a
> pressing issue for 1.15, but we should fix it as a follow-up.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)