[
https://issues.apache.org/jira/browse/FLINK-27848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17683099#comment-17683099
]
Matthias Pohl edited comment on FLINK-27848 at 2/1/23 5:09 PM:
---------------------------------------------------------------
I'm reopening this issue to provide forward(?)ports for 1.16 and 1.17.
Refactoring the leader election for FLIP-285/FLINK-26522 is kind of tricky. I'm
trying to slice the code changes into meaningful commits (and ideally dedicated
PRs) to make the review process easier.
I ran into this issue when refactoring the code and merging classes into one,
which also required adapting tests. This revealed the inconsistency/bug in the
{{ZooKeeperLeaderElectionDriver}} implementation. Merging the bugfixes into 1.17
and 1.16 makes the other changes more reasonable/consistent.
More specifically, this bug was revealed in
{{ZooKeeperLeaderElectionTest.testLeaderShouldBeCorrectedWhenOverwritten}}
when changing from the deprecated {{NodeCache}} to {{CuratorCache}}. The
new {{CuratorCacheListener}} lets us be more selective about whether we expect
a node creation or a node change, and this selectivity is what made the test
fail. The previous test implementation worked because we sent the 2nd write
operation after writing the leader information, which caused a node-change
event and, in the end, made the test pass.
> ZooKeeperLeaderElectionDriver keeps writing leader information, using up zxid
> -----------------------------------------------------------------------------
>
> Key: FLINK-27848
> URL: https://issues.apache.org/jira/browse/FLINK-27848
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.15.0, 1.17.0, 1.16.1
> Reporter: Xintong Song
> Assignee: Matthias Pohl
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.15.1
>
>
> After a leadership change, the new leader may keep writing its (identical)
> information to ZK, quickly using up the zxid on ZK.
> The problem is that, in
> {{ZooKeeperLeaderElectionDriver#retrieveLeaderInformationFromZooKeeper}},
> {{leaderElectionEventHandler.onLeaderInformationChange(LeaderInformation.empty())}}
> is called regardless of whether {{childData}} is {{null}}. If it is non-null,
> this causes the driver to keep rewriting the leader information to ZK.
> The problem was introduced in FLINK-24038 and only affects the legacy
> {{ZooKeeperHaServices}}. Thus, only 1.15 is affected.
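The rewrite loop described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not Flink's actual code: {{Node}}, {{Handler}}, {{retrieveBuggy}}, {{retrieveFixed}} and {{simulate}} are hypothetical names, plain strings stand in for {{LeaderInformation}} ({{null}} models the empty case), and the handler is assumed to rewrite its own information whenever the reported information differs from it.

```java
// Simplified model of the rewrite loop; these are NOT Flink's actual classes.
class LeaderInfoLoopSketch {

    /** Hypothetical stand-in for the ZooKeeper leader node; counts writes. */
    static final class Node {
        String data; // null models an empty node
        int writes;
    }

    /** Hypothetical leader-side handler; assumed to rewrite its own info on mismatch. */
    static final class Handler {
        private final String ownInfo;
        private final Node node;

        Handler(String ownInfo, Node node) {
            this.ownInfo = ownInfo;
            this.node = node;
        }

        /** Returns true if a write happened, which would trigger another node event. */
        boolean onLeaderInformationChange(String reportedInfo) {
            if (ownInfo.equals(reportedInfo)) {
                return false; // stored info already matches, nothing to do
            }
            node.data = ownInfo;
            node.writes++;
            return true;
        }
    }

    /** Buggy shape: reports "empty" even if the node already holds data. */
    static boolean retrieveBuggy(Node node, Handler handler) {
        return handler.onLeaderInformationChange(null);
    }

    /** Fixed shape: reports what is actually stored (null only if the node is empty). */
    static boolean retrieveFixed(Node node, Handler handler) {
        return handler.onLeaderInformationChange(node.data);
    }

    /** Processes node events until no write is pending or maxEvents is reached. */
    static int simulate(boolean buggy, int maxEvents) {
        Node node = new Node();
        Handler handler = new Handler("leader-1", node);
        boolean pending = true; // initial event after the leadership change
        for (int events = 0; pending && events < maxEvents; events++) {
            pending = buggy ? retrieveBuggy(node, handler) : retrieveFixed(node, handler);
        }
        return node.writes;
    }

    public static void main(String[] args) {
        System.out.println("buggy writes: " + simulate(true, 5));
        System.out.println("fixed writes: " + simulate(false, 5));
    }
}
```

In this model the buggy variant performs one write per processed event (bounded here only by {{maxEvents}}), while the fixed variant writes once and then stops because the stored information already matches.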
--
This message was sent by Atlassian Jira
(v8.20.10#820010)