[
https://issues.apache.org/jira/browse/FLINK-30195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Weijie Guo updated FLINK-30195:
-------------------------------
Description: As discussed in https://github.com/apache/flink/pull/21137, the
leader election service should not call `contender#grant/revokeLeadership`
under a lock while the same lock can be accessed by the contender. One proposal
to fix this issue is to introduce a dedicated executor to get rid of the nested
lock structure. This would affect all contenders, so we need to carefully check
that no existing contender relies on the current behavior that
`grant/revokeLeadership` is called under the lock. We should also clean up
things like `ResourceManagerServiceImpl.handleLeaderEventExecutor`. Any
better suggestions are also welcome. (was: As discussed in
https://github.com/apache/flink/pull/21137, leader election service should not
call `contender#grant/revokeLeadership` under a lock while the same lock can be
accessed by the contender. We can fix this issue with a dedicated executor to
get rid of the nested lock structure. This would affect all contenders and we
need to carefully check that no existing contenders are relying on the current
behavior that `grant/removeLeadership` are called under lock. We should also
clean up things like `ResourceManagerServiceImpl.handleLeaderEventExecutor`.)
> LeaderElectionService should avoid potential deadlock with leaderContender
> --------------------------------------------------------------------------
>
> Key: FLINK-30195
> URL: https://issues.apache.org/jira/browse/FLINK-30195
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Reporter: Weijie Guo
> Priority: Major
>
> As discussed in https://github.com/apache/flink/pull/21137, the leader
> election service should not call `contender#grant/revokeLeadership` under a
> lock while the same lock can be accessed by the contender. One proposal to
> fix this issue is to introduce a dedicated executor to get rid of the nested
> lock structure. This would affect all contenders, so we need to carefully
> check that no existing contender relies on the current behavior that
> `grant/revokeLeadership` is called under the lock. We should also clean up
> things like `ResourceManagerServiceImpl.handleLeaderEventExecutor`. Any
> better suggestions are also welcome.
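
To make the dedicated-executor proposal concrete, here is a minimal sketch of
the idea. It is not Flink's actual code: `SketchContender`,
`SketchLeaderElectionService`, `leadershipOperationExecutor`, and the
`onGrantLeadership`/`onRevokeLeadership`/`hasLeadership` methods are
hypothetical names chosen for illustration. The point is only that the service
enqueues the contender callback while holding its lock and never invokes it
there, so a contender that calls back into the service cannot form a nested
lock cycle.

```java
import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a leader contender; only the two callbacks matter here. */
interface SketchContender {
    void grantLeadership(UUID leaderSessionId);

    void revokeLeadership();
}

/** Hypothetical, heavily simplified election service (an assumption, not Flink's class). */
final class SketchLeaderElectionService implements AutoCloseable {
    private final Object lock = new Object();

    // Single thread, so grant/revoke events reach the contender in submission order.
    private final ExecutorService leadershipOperationExecutor =
            Executors.newSingleThreadExecutor();

    private SketchContender contender; // guarded by lock
    private UUID issuedSessionId; // guarded by lock

    void start(SketchContender newContender) {
        synchronized (lock) {
            contender = newContender;
        }
    }

    /** Would be invoked by the underlying election driver when leadership is acquired. */
    void onGrantLeadership(UUID sessionId) {
        synchronized (lock) {
            if (contender == null) {
                return;
            }
            issuedSessionId = sessionId;
            final SketchContender target = contender;
            // Only enqueue under the lock; the callback runs later without holding it.
            leadershipOperationExecutor.execute(() -> target.grantLeadership(sessionId));
        }
    }

    /** Would be invoked by the underlying election driver when leadership is lost. */
    void onRevokeLeadership() {
        synchronized (lock) {
            if (contender == null) {
                return;
            }
            issuedSessionId = null;
            final SketchContender target = contender;
            leadershipOperationExecutor.execute(() -> target.revokeLeadership());
        }
    }

    /** Contenders may call this from inside their callbacks; it takes the service lock. */
    boolean hasLeadership(UUID sessionId) {
        synchronized (lock) {
            return sessionId.equals(issuedSessionId);
        }
    }

    @Override
    public void close() {
        leadershipOperationExecutor.shutdown();
        try {
            leadershipOperationExecutor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

One consequence of this design is that a callback may run after leadership has
already changed again, so a contender can no longer assume the service state is
frozen while `grantLeadership` executes. This is the kind of reliance on the
current under-lock behavior that would need to be checked in each existing
contender.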