[ 
https://issues.apache.org/jira/browse/FLINK-26630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias Pohl updated FLINK-26630:
----------------------------------
    Description: 
{{EmbeddedHaServices}} (and {{EmbeddedHaServicesWithLeadershipControl}})
provide leader election functionality that works within a single JVM. In
FLINK-25235 we introduced the re-instantiation of {{HighAvailabilityServices}}
per {{JobManager}} (i.e. {{DispatcherResourceManagerComponent}}) in
{{TestingMiniCluster}}. This makes it possible to close the
{{HighAvailabilityServices}} during the shutdown of a JM and not only at the
end of the HA cluster's lifetime, which gets us closer to a production
environment where each JM has its own HAServices instance as well. That became
crucial as part of the work on FLINK-24038, which revokes the leadership when
it closes the HAServices during a JM shutdown.

The {{EmbeddedHaServices}}, though, provide a no-op {{StandaloneJobGraphStore}}
implementation, i.e. no real recovery is testable with the
{{TestingMiniCluster}} (this was already the case before the change of
FLINK-25235). We should still fix that to enable users to use the
{{TestingMiniCluster}} for such purposes. That means that we should provide a
{{JobGraphStore}} and a {{JobResultStore}} that are shared between the
different {{HighAvailabilityServices}} instances, and probably also shared
Checkpoint-related HA components.
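
To illustrate the idea, here is a minimal, self-contained sketch of why a store shared across per-JM HA services instances would make recovery testable, while a no-op store would not. It does not use Flink's actual interfaces: {{SimpleJobStore}}, {{NoOpJobStore}}, {{SharedInMemoryJobStore}} and {{PerJmHaServices}} are hypothetical stand-ins introduced only for this example, and the real {{JobGraphStore}} and {{JobResultStore}} contracts are richer than this.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

public class SharedStoreSketch {

    /** Hypothetical stand-in for a job graph store; NOT Flink's JobGraphStore interface. */
    interface SimpleJobStore {
        void put(String jobId, String jobGraph);

        Optional<String> recover(String jobId);
    }

    /** Forgets everything it is given, so a second JM can never recover a job. */
    static final class NoOpJobStore implements SimpleJobStore {
        @Override
        public void put(String jobId, String jobGraph) {
            // intentionally empty
        }

        @Override
        public Optional<String> recover(String jobId) {
            return Optional.empty();
        }
    }

    /** In-memory store meant to be handed to every HA services instance in the JVM. */
    static final class SharedInMemoryJobStore implements SimpleJobStore {
        private final Map<String, String> jobs = new ConcurrentHashMap<>();

        @Override
        public void put(String jobId, String jobGraph) {
            jobs.put(jobId, jobGraph);
        }

        @Override
        public Optional<String> recover(String jobId) {
            return Optional.ofNullable(jobs.get(jobId));
        }
    }

    /** Hypothetical per-JM HA services wrapper; one instance is created per JobManager. */
    static final class PerJmHaServices implements AutoCloseable {
        private final SimpleJobStore store;

        PerJmHaServices(SimpleJobStore store) {
            this.store = store;
        }

        SimpleJobStore getJobStore() {
            return store;
        }

        @Override
        public void close() {
            // Closing one JM's HA services must not wipe the shared state.
        }
    }

    /** JM 1 submits a job and shuts down; JM 2 then tries to recover that job. */
    static boolean recoverableAfterFailover(SimpleJobStore storeForJm1, SimpleJobStore storeForJm2) {
        try (PerJmHaServices jm1 = new PerJmHaServices(storeForJm1)) {
            jm1.getJobStore().put("job-1", "serialized-job-graph");
        }
        try (PerJmHaServices jm2 = new PerJmHaServices(storeForJm2)) {
            return jm2.getJobStore().recover("job-1").isPresent();
        }
    }

    public static void main(String[] args) {
        boolean withNoOp = recoverableAfterFailover(new NoOpJobStore(), new NoOpJobStore());
        SimpleJobStore shared = new SharedInMemoryJobStore();
        boolean withShared = recoverableAfterFailover(shared, shared);
        System.out.println("no-op stores recoverable: " + withNoOp);
        System.out.println("shared store recoverable: " + withShared);
    }
}
```

In this sketch the shared store outlives each per-JM instance because it is created outside of them and passed in. A real fix would presumably have to wire such shared stores into the embedded HA services; how exactly that would be done is not settled here.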

Right now, the multi-JM setup of the {{TestingMiniCluster}} is only used in 
{{ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange}} 
where it's bound to the {{ZooKeeperHAServices}}. Therefore, it's not a pressing 
issue for 1.15. But we should fix it as a follow-up.

  was:{{EmbeddedHaServices}} (and {{EmbeddedHaServicesWithLeadershipControl}}) 
provide leader election functionality to work on a single JVM. In FLINK-25235 
we introduced the re-instantiation of {{HighAvailabilityServices}} per 
{{JobManager}} (i.e. {{DispatcherResourceManagerComponent}}) to be able to 
close the HighAvailabilityServices during the shutdown of a JM and not only at 
the end of the HA cluster to get closer to a production environment where each 
JM has its own HAServices instance as well (that became crucial as part of the 
work of FLINK-24038 which revokes the leadership when it closes the HAServices 
during a JM shutdown).


> EmbeddedHaServices is not made for recovery on a single instance
> ----------------------------------------------------------------
>
>                 Key: FLINK-26630
>                 URL: https://issues.apache.org/jira/browse/FLINK-26630
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0
>            Reporter: Matthias Pohl
>            Priority: Critical
>
> {{EmbeddedHaServices}} (and {{EmbeddedHaServicesWithLeadershipControl}})
> provide leader election functionality that works within a single JVM. In
> FLINK-25235 we introduced the re-instantiation of
> {{HighAvailabilityServices}} per {{JobManager}} (i.e.
> {{DispatcherResourceManagerComponent}}) in {{TestingMiniCluster}}. This makes
> it possible to close the {{HighAvailabilityServices}} during the shutdown of
> a JM and not only at the end of the HA cluster's lifetime, which gets us
> closer to a production environment where each JM has its own HAServices
> instance as well. That became crucial as part of the work on FLINK-24038,
> which revokes the leadership when it closes the HAServices during a JM
> shutdown.
> The {{EmbeddedHaServices}}, though, provide a no-op
> {{StandaloneJobGraphStore}} implementation, i.e. no real recovery is testable
> with the {{TestingMiniCluster}} (this was already the case before the change
> of FLINK-25235). We should still fix that to enable users to use the
> {{TestingMiniCluster}} for such purposes. That means that we should provide a
> {{JobGraphStore}} and a {{JobResultStore}} that are shared between the
> different {{HighAvailabilityServices}} instances, and probably also shared
> Checkpoint-related HA components.
> Right now, the multi-JM setup of the {{TestingMiniCluster}} is only used in 
> {{ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange}} 
> where it's bound to the {{ZooKeeperHAServices}}. Therefore, it's not a 
> pressing issue for 1.15. But we should fix it as a follow-up.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
