[jira] [Commented] (FLINK-26630) EmbeddedHaServices is not made for recovery on a single instance

Matthias Pohl (Jira) Mon, 14 Mar 2022 02:40:05 -0700


    [ https://issues.apache.org/jira/browse/FLINK-26630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17506103#comment-17506103 ]


Matthias Pohl commented on FLINK-26630:
---------------------------------------

I'm linking FLINK-24038 and FLINK-25235 here as causes. Essentially, though, 
this was already a weakness of the {{TestingMiniCluster}} before those changes: 
even with a single {{EmbeddedHaServices}} instance, you could not test a 
recovery because of the no-op {{JobGraphStore}} implementation that's used.
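
To illustrate the difference, here is a minimal, self-contained sketch. It does not use Flink's real classes: {{SimpleJobGraphStore}}, {{NoOpStore}}, {{InMemoryStore}}, {{SimpleHaServices}} and {{recoverableAfterFailover}} are hypothetical stand-ins invented for this example, and job graphs are reduced to plain strings. It only shows why a no-op store (or one store per HA services instance) cannot support recovery, while a store shared between instances can.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

public class SharedJobGraphStoreSketch {

    /** Hypothetical, simplified stand-in for a job graph store; NOT Flink's real interface. */
    interface SimpleJobGraphStore {
        void put(String jobId, String jobGraph);

        Optional<String> recover(String jobId);
    }

    /** Mirrors the no-op behaviour described in this issue: nothing is retained. */
    static final class NoOpStore implements SimpleJobGraphStore {
        @Override
        public void put(String jobId, String jobGraph) {
            // intentionally does nothing
        }

        @Override
        public Optional<String> recover(String jobId) {
            return Optional.empty();
        }
    }

    /** In-memory store that survives as long as something holds a reference to it. */
    static final class InMemoryStore implements SimpleJobGraphStore {
        private final Map<String, String> graphs = new ConcurrentHashMap<>();

        @Override
        public void put(String jobId, String jobGraph) {
            graphs.put(jobId, jobGraph);
        }

        @Override
        public Optional<String> recover(String jobId) {
            return Optional.ofNullable(graphs.get(jobId));
        }
    }

    /** Hypothetical stand-in for one HA services instance owned by one JobManager. */
    static final class SimpleHaServices implements AutoCloseable {
        private final SimpleJobGraphStore store;

        SimpleHaServices(SimpleJobGraphStore store) {
            this.store = store;
        }

        SimpleJobGraphStore getJobGraphStore() {
            return store;
        }

        @Override
        public void close() {
            // The store is passed in from outside, so closing this instance
            // does not discard its contents.
        }
    }

    /**
     * Simulates a failover: the first JM submits a job and shuts down (closing its
     * HA services), then a second JM with its own HA services tries to recover the job.
     */
    static boolean recoverableAfterFailover(
            SimpleJobGraphStore storeOfFirstJm, SimpleJobGraphStore storeOfSecondJm) {
        try (SimpleHaServices first = new SimpleHaServices(storeOfFirstJm)) {
            first.getJobGraphStore().put("job-1", "serialized-job-graph");
        }
        try (SimpleHaServices second = new SimpleHaServices(storeOfSecondJm)) {
            return second.getJobGraphStore().recover("job-1").isPresent();
        }
    }

    public static void main(String[] args) {
        SimpleJobGraphStore shared = new InMemoryStore();
        System.out.println("shared store: " + recoverableAfterFailover(shared, shared));
        System.out.println("no-op stores: "
                + recoverableAfterFailover(new NoOpStore(), new NoOpStore()));
        System.out.println("separate stores: "
                + recoverableAfterFailover(new InMemoryStore(), new InMemoryStore()));
    }
}
```

In this sketch, only the case where both instances receive the same {{InMemoryStore}} lets the second JM find the job. How the real {{EmbeddedHaServices}} would hand out such a shared store is left open here.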

> EmbeddedHaServices is not made for recovery on a single instance
> ----------------------------------------------------------------
>
>                 Key: FLINK-26630
>                 URL: https://issues.apache.org/jira/browse/FLINK-26630
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0
>            Reporter: Matthias Pohl
>            Priority: Critical
>
> {{EmbeddedHaServices}} (and {{EmbeddedHaServicesWithLeadershipControl}}) 
> provide leader election functionality that works within a single JVM. In 
> FLINK-25235 we introduced the re-instantiation of 
> {{HighAvailabilityServices}} per {{JobManager}} (i.e. 
> {{DispatcherResourceManagerComponent}}) in {{TestingMiniCluster}}. This makes 
> it possible to close the {{HighAvailabilityServices}} during the shutdown of a 
> JM and not only at the end of the HA cluster's lifetime, which gets us closer 
> to a production environment where each JM has its own HAServices instance as 
> well. That became crucial as part of the work on FLINK-24038, which revokes 
> the leadership when it closes the HAServices during a JM shutdown.
> The {{EmbeddedHaServices}}, though, provide a no-op 
> {{StandaloneJobGraphStore}} implementation, so no real recovery can be tested 
> with the {{TestingMiniCluster}} (this was the case even before the change in 
> FLINK-25235). We should still fix that to enable users to use the 
> {{TestingMiniCluster}} for such purposes. That means we should provide a 
> {{JobGraphStore}} and a {{JobResultStore}} that are shared between the 
> different {{HighAvailabilityServices}} instances, and probably also share the 
> checkpoint-related HA components.
> Right now, the multi-JM setup of the {{TestingMiniCluster}} is only used in 
> {{ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange}} 
> where it's bound to the {{ZooKeeperHAServices}}. Therefore, it's not a 
> pressing issue for 1.15. But we should fix it as a follow-up.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
