[
https://issues.apache.org/jira/browse/FLINK-10255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Till Rohrmann closed FLINK-10255.
---------------------------------
Resolution: Fixed
Fixed via
1.7.0: 3e5d07ca349a7b010bc47d1cce9b9ad3208f55a6
1.6.1: b9c89d9a7af45f1b605e46c7d736c3bdc9b0d16f
1.5.4: 5a97f12c339ed3c0b6798c9fc0fd17910689099d
> Standby Dispatcher locks submitted JobGraphs
> --------------------------------------------
>
> Key: FLINK-10255
> URL: https://issues.apache.org/jira/browse/FLINK-10255
> Project: Flink
> Issue Type: Bug
> Components: Distributed Coordination
> Affects Versions: 1.5.3, 1.6.0, 1.7.0
> Reporter: Till Rohrmann
> Assignee: Till Rohrmann
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 1.6.1, 1.7.0, 1.5.4
>
>
> Currently, standby {{Dispatchers}} lock submitted {{JobGraphs}} which are
> added to the {{SubmittedJobGraphStore}} if HA mode is enabled. Locking the
> {{JobGraphs}} can prevent their cleanup, leaving the system in an
> inconsistent state.
> The problem is that we recover the job in the
> {{SubmittedJobGraphListener#onAddedJobGraph}} callback, which is also called
> for a newly added {{JobGraph}} if we don't have the leadership. Recovering the
> {{JobGraph}} currently locks it. If the {{Dispatcher}} is not the leader, we
> won't start the job after its recovery. However, we also don't release the
> {{JobGraph}}, leaving it locked.
> There are two possible solutions to the problem. Either we check whether we
> are the leader before recovering jobs, or we say that recovering jobs does not
> lock them and we lock a recovered job only if we can submit it. The latter
> approach has the advantage that it follows a code path quite similar to an
> initial job submission. Moreover, jobs are currently also recovered in other
> places (e.g. {{Dispatcher#grantLeadership}}). In all these places we would
> currently need to release the {{JobGraphs}} if we cannot submit the recovered
> {{JobGraph}}.
> An extension of the first solution could be to stop the
> {{SubmittedJobGraphStore}} while the {{Dispatcher}} is not the leader. Then
> we would have to make sure that no concurrent callback from the
> {{SubmittedJobGraphStore#SubmittedJobGraphListener}} can be executed after
> revoking leadership from the {{Dispatcher}}.
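
The second solution described above (recovery does not lock; the lock is taken only when the job can be submitted) can be illustrated with a minimal, self-contained sketch. This is not the code from the fix commits: SketchJobGraphStore and SketchDispatcher are hypothetical stand-ins, and the in-memory maps only imitate the locking behaviour of an HA store.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory stand-in for a job graph store; not Flink's actual API. */
final class SketchJobGraphStore {
    private final Map<String, String> graphs = new HashMap<>();     // jobId -> job graph
    private final Map<String, Set<String>> locks = new HashMap<>(); // jobId -> lock owners

    void put(String jobId, String graph) {
        graphs.put(jobId, graph);
    }

    /** Reads a job graph WITHOUT locking it (the assumption of this sketch). */
    String recover(String jobId) {
        return graphs.get(jobId);
    }

    void lock(String jobId, String owner) {
        locks.computeIfAbsent(jobId, k -> new HashSet<>()).add(owner);
    }

    void release(String jobId, String owner) {
        Set<String> owners = locks.get(jobId);
        if (owners != null) {
            owners.remove(owner);
        }
    }

    boolean isLocked(String jobId) {
        Set<String> owners = locks.get(jobId);
        return owners != null && !owners.isEmpty();
    }

    /** Cleanup only succeeds if no dispatcher still holds a lock. */
    boolean remove(String jobId) {
        if (isLocked(jobId)) {
            return false;
        }
        locks.remove(jobId);
        return graphs.remove(jobId) != null;
    }
}

/** Hypothetical dispatcher that locks a job graph only when it actually submits the job. */
final class SketchDispatcher {
    private final String name;
    private final SketchJobGraphStore store;
    private final boolean leader;
    private final Set<String> running = new HashSet<>();

    SketchDispatcher(String name, SketchJobGraphStore store, boolean leader) {
        this.name = name;
        this.store = store;
        this.leader = leader;
    }

    /** Called on every dispatcher, leader or standby, when a job graph is added. */
    void onAddedJobGraph(String jobId) {
        String graph = store.recover(jobId); // recovery takes no lock
        if (graph == null || !leader) {
            return; // a standby has nothing to release
        }
        store.lock(jobId, name); // lock only on the submission path
        running.add(jobId);
    }

    /** Releases the lock and cleans up once the job is done. */
    void jobFinished(String jobId) {
        if (running.remove(jobId)) {
            store.release(jobId, name);
            store.remove(jobId);
        }
    }
}
```

In this sketch a standby dispatcher that receives the callback never holds a lock, so the leader's cleanup in jobFinished is not blocked. With lock-on-recover semantics, the standby's lock would remain and remove would return false.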
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)