[jira] [Comment Edited] (FLINK-27848) ZooKeeperLeaderElectionDriver keeps writing leader information, using up zxid

Matthias Pohl (Jira) Wed, 01 Feb 2023 09:11:58 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-27848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17683099#comment-17683099
 ]


Matthias Pohl edited comment on FLINK-27848 at 2/1/23 5:09 PM:
---------------------------------------------------------------

I'm reopening this issue to provide forward(?)ports for 1.16 and 1.17.

Refactoring the leader election for FLIP-285/FLINK-26522 is kind of tricky. I'm 
trying to slice the code changes into meaningful commits (and ideally dedicated 
PRs) to make the review process easier.

I ran into this issue when refactoring the code and merging classes into one, 
which also required adapting tests. This revealed the inconsistency/bug in the 
ZooKeeperLeaderElectionDriver implementation. Merging the bugfixes into 1.17 
and 1.16 makes the other changes more reasonable/consistent.

More specifically, this bug was revealed in 
{{ZooKeeperLeaderElectionTest.testLeaderShouldBeCorrectedWhenOverwritten}} 
when changing from the deprecated {{NodeCache}} to {{CuratorCache}}. The 
new {{CuratorCacheListener}} lets us be more selective about whether we expect 
a node creation or a node change, which causes a test failure. The previous 
test implementation worked because we sent the 2nd write operation after 
writing the leader information, which caused a node-change event and, in the 
end, made the test pass.
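
The creation-versus-change distinction above can be illustrated with a small, 
self-contained Java sketch. It does not use Curator or Flink; {{EventType}}, 
{{eventsForWrites}}, {{countAnyEvent}} and {{countOnly}} are hypothetical names 
made up for this illustration, and the assumption that the first write creates 
the node while later writes change it is a simplification, not a statement 
about the actual test.

```java
import java.util.ArrayList;
import java.util.List;

public class CacheListenerSketch {

    /** Hypothetical event kinds, loosely modelled on creation vs. change notifications. */
    enum EventType { NODE_CREATED, NODE_CHANGED }

    /** Assumption: the first write creates the node, every later write changes it. */
    static List<EventType> eventsForWrites(int writes) {
        List<EventType> events = new ArrayList<>();
        for (int i = 0; i < writes; i++) {
            events.add(i == 0 ? EventType.NODE_CREATED : EventType.NODE_CHANGED);
        }
        return events;
    }

    /** Undifferentiated listener: every event triggers the same callback. */
    static int countAnyEvent(List<EventType> events) {
        return events.size();
    }

    /** Selective listener: reacts only to the event kind it was registered for. */
    static int countOnly(EventType wanted, List<EventType> events) {
        int hits = 0;
        for (EventType event : events) {
            if (event == wanted) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<EventType> oneWrite = eventsForWrites(1);
        List<EventType> twoWrites = eventsForWrites(2);

        System.out.println("any event, one write:    " + countAnyEvent(oneWrite));
        System.out.println("changes only, one write: " + countOnly(EventType.NODE_CHANGED, oneWrite));
        System.out.println("changes only, two writes: " + countOnly(EventType.NODE_CHANGED, twoWrites));
    }
}
```

In this model, a listener that waits only for change events sees nothing after 
a single write and fires only once a second write arrives, whereas an 
undifferentiated listener fires on the first write already.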


> ZooKeeperLeaderElectionDriver keeps writing leader information, using up zxid
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-27848
>                 URL: https://issues.apache.org/jira/browse/FLINK-27848
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0, 1.17.0, 1.16.1
>            Reporter: Xintong Song
>            Assignee: Matthias Pohl
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.15.1
>
>
> After a leadership change, the new leader may keep writing its information 
> (which is identical) to ZK, causing the zxid on ZK to be used up quickly.
> The problem is that, in 
> {{ZooKeeperLeaderElectionDriver#retrieveLeaderInformationFromZooKeeper}}, 
> {{leaderElectionEventHandler.onLeaderInformationChange(LeaderInformation.empty())}}
>  is called regardless of whether {{childData}} is {{null}}. In the non-null 
> case, this causes the driver to keep re-writing the leader information to ZK.
> The problem was introduced in FLINK-24038 and only affects the legacy 
> {{ZooKeeperHaServices}}. Thus, only 1.15 is affected.
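
The control-flow problem described in the report can be sketched in plain Java. 
This is a simplified model under stated assumptions, not the Flink source: 
{{RecordingHandler}}, {{retrieveBuggy}}, {{retrieveGuarded}} and the 
{{EMPTY}} marker are hypothetical stand-ins, and leader information is 
reduced to a {{String}}.

```java
import java.util.ArrayList;
import java.util.List;

public class LeaderInformationSketch {

    /** Hypothetical marker standing in for "empty leader information". */
    static final String EMPTY = "<empty>";

    /** Hypothetical stand-in for an event handler; it only records what it is told. */
    static final class RecordingHandler {
        final List<String> events = new ArrayList<>();

        void onLeaderInformationChange(String leaderInformation) {
            events.add(leaderInformation);
        }
    }

    /** Shape of the reported problem: the empty notification is sent unconditionally. */
    static void retrieveBuggy(String childData, RecordingHandler handler) {
        if (childData != null) {
            handler.onLeaderInformationChange(childData);
        }
        // Reached even when childData is non-null.
        handler.onLeaderInformationChange(EMPTY);
    }

    /** Guarded variant: the empty notification is sent only when no data exists. */
    static void retrieveGuarded(String childData, RecordingHandler handler) {
        if (childData != null) {
            handler.onLeaderInformationChange(childData);
        } else {
            handler.onLeaderInformationChange(EMPTY);
        }
    }

    public static void main(String[] args) {
        RecordingHandler buggy = new RecordingHandler();
        retrieveBuggy("leader-1", buggy);
        System.out.println("buggy:   " + buggy.events);

        RecordingHandler guarded = new RecordingHandler();
        retrieveGuarded("leader-1", guarded);
        System.out.println("guarded: " + guarded.events);
    }
}
```

In the buggy variant, a leader that reacts to an empty notification by 
re-publishing its own information would perform an extra write on every 
retrieval, which is the kind of repeated write that consumes zxids.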



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
