## What is the purpose of the change

The Dispatcher should only react to the `onAddedJobGraph` signal if it is the leader.
In all other cases the signal should be ignored since the jobs will be recovered once
the Dispatcher becomes the leader.

In order to still support non-blocking job recoveries, this commit serializes all
recovery operations by introducing a `recoveryOperation` future which first needs to
complete before a subsequent operation is started. That way we can avoid race conditions
between granting and revoking leadership as well as the `onAddedJobGraph` signals. This is
important since we can only lock each JobGraph once and, thus, need to make sure that
we don't release a lock of a properly recovered job in a concurrent operation.
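
The serialization idea can be illustrated with a minimal, self-contained sketch. It is not the code in this PR: the class `SerializedRecoverySketch`, the `enqueue` helper, and the event strings are hypothetical names chosen for illustration, and the real Dispatcher does actual recovery work instead of recording strings. The sketch assumes each operation is appended to the previous `recoveryOperation` future, so a grant, a revoke, and an `onAddedJobGraph` signal can never interleave.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for the Dispatcher; not Flink's actual class.
class SerializedRecoverySketch {
    private final Executor executor;
    private final List<String> events = new ArrayList<>();
    // Tail of the operation chain; every new operation waits for this future.
    private CompletableFuture<Void> recoveryOperation = CompletableFuture.completedFuture(null);
    // Only read and written inside chained operations, which never overlap.
    private boolean leader = false;

    SerializedRecoverySketch(Executor executor) {
        this.executor = executor;
    }

    // Appends the operation to the chain so it starts only after all earlier
    // operations completed. Failure handling is omitted in this sketch.
    private synchronized CompletableFuture<Void> enqueue(Runnable operation) {
        recoveryOperation = recoveryOperation.thenRunAsync(operation, executor);
        return recoveryOperation;
    }

    CompletableFuture<Void> grantLeadership() {
        return enqueue(() -> {
            leader = true;
            events.add("recover all jobs");
        });
    }

    CompletableFuture<Void> revokeLeadership() {
        return enqueue(() -> {
            leader = false;
            events.add("release all jobs");
        });
    }

    // The signal is ignored unless this instance is the leader when the
    // operation actually runs.
    CompletableFuture<Void> onAddedJobGraph(String jobId) {
        return enqueue(() -> events.add((leader ? "recover " : "ignore ") + jobId));
    }

    // Snapshot of the recorded events; call after joining the last operation.
    synchronized List<String> events() {
        return new ArrayList<>(events);
    }

    public static void main(String[] args) {
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        try {
            SerializedRecoverySketch dispatcher = new SerializedRecoverySketch(mainThread);
            dispatcher.onAddedJobGraph("job-a");   // arrives before leadership
            dispatcher.grantLeadership();
            dispatcher.onAddedJobGraph("job-b");   // arrives while leader
            dispatcher.revokeLeadership().join();  // wait for the whole chain
            System.out.println(dispatcher.events());
        } finally {
            mainThread.shutdown();
        }
    }
}
```

Because the leadership flag is checked inside the chained operation rather than when the signal arrives, the decision to recover or ignore a job reflects the state left behind by all earlier operations, which is the property needed to avoid releasing a lock that a previous recovery legitimately holds.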

cc @GJL 

## Brief change log

- Only react to `SubmittedJobGraphListener#onAddedJobGraph` when the Dispatcher is the leader
- Serialize recovery operations by introducing a `recoveryOperation` future in order to avoid erroneously unlocking guarded resources

## Verifying this change

- Added `ZooKeeperHADispatcherTest#testStandbyDispatcherJobExecution` and `ZooKeeperHADispatcherTest#testStandbyDispatcherJobRecovery`

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): (no)
  - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
  - The serializers: (no)
  - The runtime per-record code paths (performance sensitive): (no)
  - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
  - The S3 file system connector: (no)

## Documentation

  - Does this pull request introduce a new feature? (no)
  - If yes, how is the feature documented? (not applicable)


[ Full content available at: https://github.com/apache/flink/pull/6678 ]