XComp opened a new pull request, #22769:
URL: https://github.com/apache/flink/pull/22769
## What is the purpose of the change
The service can be closed (in the RpcEndpoint's main thread) while an event is
being processed (in the HA backend client's event thread). This can cause a
deadlock because the service's close method closes the driver while holding the
service's own lock, and the driver's close method then tries to acquire the
driver-side lock. The locks are acquired in the opposite order when the driver
gains leadership and forwards this event to the
`DefaultLeaderElectionService`.
The problem is that the driver is closed within a monitor of the service's
lock. That wasn't the case before FLINK-31733.
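
The lock-order inversion described above can be reproduced in isolation. The following is only a sketch under stated assumptions: `serviceLock`, `driverLock`, and the class and method names are hypothetical stand-ins, not Flink code, and `ReentrantLock.tryLock` with a timeout replaces the `synchronized` monitors so that the example terminates instead of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class LockOrderInversionSketch {

    /** Locks {@code first}, then tries {@code second} once the peer holds its own first lock. */
    static boolean lockInOrder(
            ReentrantLock first,
            ReentrantLock second,
            CountDownLatch bothHoldFirst,
            CountDownLatch bothAttempted) {
        first.lock();
        try {
            // Wait until both threads hold their first lock.
            bothHoldFirst.countDown();
            bothHoldFirst.await();
            // A synchronized block would wait here forever; tryLock gives up instead.
            boolean acquired = second.tryLock(200, TimeUnit.MILLISECONDS);
            if (acquired) {
                second.unlock();
            }
            // Keep holding the first lock until the peer has finished its attempt as well.
            bothAttempted.countDown();
            bothAttempted.await();
            return acquired;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            first.unlock();
        }
    }

    /** Returns {closeThreadGotDriverLock, eventThreadGotServiceLock}. */
    static boolean[] run() {
        ReentrantLock serviceLock = new ReentrantLock(); // hypothetical service-side lock
        ReentrantLock driverLock = new ReentrantLock(); // hypothetical driver-side lock
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothAttempted = new CountDownLatch(2);
        boolean[] acquired = new boolean[2];

        // Stand-in for the RpcEndpoint main thread: service lock first, then driver lock.
        Thread closeThread = new Thread(() -> {
            acquired[0] = lockInOrder(serviceLock, driverLock, bothHoldFirst, bothAttempted);
        });
        // Stand-in for the HA client's event thread: driver lock first, then service lock.
        Thread eventThread = new Thread(() -> {
            acquired[1] = lockInOrder(driverLock, serviceLock, bothHoldFirst, bothAttempted);
        });

        closeThread.start();
        eventThread.start();
        try {
            closeThread.join();
            eventThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return acquired;
    }

    public static void main(String[] args) {
        boolean[] acquired = run();
        System.out.println("close thread got driver lock: " + acquired[0]);
        System.out.println("event thread got service lock: " + acquired[1]);
    }
}
```

Neither thread obtains its second lock in this sketch, which is the state that turns into a permanent deadlock when plain monitors are used.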
Is this a production code issue? No, because we currently use an intermediate
component, `DefaultMultipleComponentLeaderElectionService`, which manages its
own lock and sits between the `DefaultLeaderElectionService` and the
`MultipleComponentLeaderElectionDriver`. Therefore, we don't get into the
situation where the RpcEndpoint main thread and the event thread of the HA
backend client each try to own both the lock of the
`DefaultLeaderElectionService` and the lock of the driver.
We tried to be clever in FLINK-31733 by removing the `running` field from
`DefaultLeaderElectionService`. Without that field, however, we cannot shut
everything down within the `DefaultLeaderElectionService.close()` call while
staying outside of the lock.
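
A minimal sketch of how a `running` flag allows the driver to be closed outside of the service's lock is shown below. It is an illustration only: `Service`, `Driver`, `onGrantLeadership`, and `demo` are hypothetical names and do not mirror the actual `DefaultLeaderElectionService` implementation.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class CloseOutsideLockSketch {

    /** Hypothetical stand-in for the driver; not Flink's driver interface. */
    interface Driver {
        void close();
    }

    /** Hypothetical stand-in for the service. */
    static class Service {
        private final Object lock = new Object();
        private final Driver driver;
        private boolean running = true; // guarded by lock
        private int handledEvents = 0; // guarded by lock

        Service(Driver driver) {
            this.driver = driver;
        }

        /** Called from the driver's event thread. */
        void onGrantLeadership() {
            synchronized (lock) {
                if (!running) {
                    return; // events that arrive after close() are ignored
                }
                handledEvents++;
            }
        }

        int handledEvents() {
            synchronized (lock) {
                return handledEvents;
            }
        }

        void close() {
            synchronized (lock) {
                if (!running) {
                    return; // already closed
                }
                running = false;
            }
            // The service lock is released here, so the driver may take its own
            // lock or deliver a late event without deadlocking.
            driver.close();
        }
    }

    /** Returns true if an event delivered during driver.close() completes and is ignored. */
    static boolean demo() {
        AtomicBoolean eventCompleted = new AtomicBoolean(false);
        Service[] holder = new Service[1];
        Driver driver = () -> {
            // Simulate a leadership event arriving on the event thread during close.
            Thread eventThread = new Thread(() -> {
                holder[0].onGrantLeadership();
                eventCompleted.set(true);
            });
            eventThread.start();
            try {
                eventThread.join(1000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        holder[0] = new Service(driver);
        holder[0].onGrantLeadership(); // handled while the service is running
        holder[0].close();
        return eventCompleted.get() && holder[0].handledEvents() == 1;
    }

    public static void main(String[] args) {
        System.out.println("late event completed and was ignored: " + demo());
    }
}
```

The flag is flipped inside the lock so that concurrent event handlers observe the shutdown, while the potentially blocking `driver.close()` call happens after the monitor has been released.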
## Brief change log
* (Re-)introduces the `DefaultLeaderElectionService#running` field
## Verifying this change
* Added a unit test to reproduce the deadlock
* Extended `TestingLeaderElectionDriver` to support the new use case
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with
`@Public(Evolving)`: no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Kubernetes/Yarn, ZooKeeper: yes
- The S3 file system connector: no
## Documentation
- Does this pull request introduce a new feature? no
- If yes, how is the feature documented? not applicable