[
https://issues.apache.org/jira/browse/ZOOKEEPER-3829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110357#comment-17110357
]
Mate Szalay-Beko edited comment on ZOOKEEPER-3829 at 5/18/20, 2:47 PM:
-----------------------------------------------------------------------
[~keliwang] this is a very nice catch! I was also validating your finding from the
other direction: if I add {{localSessionsEnabled=true}} to the config just
sent by [~sundyli], zkCli does not hang (while using his config unchanged I
reproduced the original issue). So having {{localSessionsEnabled=true}} in my
config is why I was unable to reproduce the issue in the first place.
{{localSessionsEnabled=true}} matters only because when local sessions
are enabled, the client is able to connect without having its global
session ID committed. The basic problem is indeed with this log line, as you
wrote in ZOOKEEPER-3830:
{code:java}
2020-05-18 14:08:07,051 [myid:4] - INFO
[QuorumPeer[myid=4](plain=/0.0.0.0:2181)(secure=disabled):Leader@1296] - Have
quorum of supporters, sids: [ [4, 1, 3],[1, 3] ]; starting up and setting last
processed zxid: 0x400000000
{code}
This means that the newly elected leader is not the designated leader ([see
here|https://github.com/apache/zookeeper/blob/c11b7e26bc554b8523dc929761dd28808913f091/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L1310]),
so {{allowedToCommit = false}} is set a few lines later. But no new
leader election is started, since dynamic reconfig is not enabled.
So I think the solution would be to skip the whole {{designatedLeader}} check
when dynamic reconfig is disabled.
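To illustrate the idea, here is a minimal standalone sketch of the decision (this is not the actual Leader.java code; the class, method, and parameter names are hypothetical, chosen only to mirror the fields discussed above):
{code:java}
public class LeaderStartupCheck {

    /**
     * Sketch of the proposed decision: should the newly elected leader be
     * allowed to commit? With dynamic reconfig disabled, the
     * lastSeenQuorumVerifier-based designated-leader check is skipped
     * entirely, because no reconfig (and no re-election) can fix a mismatch.
     */
    static boolean allowedToCommit(boolean reconfigEnabled,
                                   long designatedLeader,
                                   long selfId) {
        if (!reconfigEnabled) {
            // proposed fix: skip the designatedLeader check entirely
            return true;
        }
        return designatedLeader == selfId;
    }

    public static void main(String[] args) {
        // scenario from the log above: sid 4 wins the election, but a stale
        // quorum verifier designates a different sid as leader
        System.out.println(allowedToCommit(false, 1L, 4L)); // prints "true" with the fix
    }
}
{code}
Without the fix (i.e. with the check always applied), the same scenario yields {{allowedToCommit = false}} and the leader silently refuses to commit, which matches the observed hang.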
[~sundyli], I added some debug logs to the commit processor and found that
whenever the "{{Configuring CommitProcessor with XX worker threads}}" log line
is printed, we always create a new {{workerPool}}. I am not sure how
resetting {{workerPool}} to null would solve this issue. Are you sure that
it helps in your case? Maybe we are chasing a different error here :) -
The one I just reproduced with your zoo.cfg in docker compose seems to be
unrelated to {{workerPool}} but related to {{lastSeenQuorumVerifier}} and
dynamic reconfig.
I will create a PR now with the proposed fix (skipping the
{{designatedLeader}} check when dynamic reconfig is disabled), but I first need
some time to check whether the same error affects the master branch and to see
if I can add some unit tests for this.
> Zookeeper refuses request after node expansion
> ----------------------------------------------
>
> Key: ZOOKEEPER-3829
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3829
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.5.6
> Reporter: benwang li
> Priority: Major
> Attachments: d.log, screenshot-1.png
>
>
> It's easy to reproduce this bug.
> {code:java}
>
> Step 1. Deploy 3 nodes A,B,C with configuration A,B,C .
> Step 2. Deploy node `D` with configuration `A,B,C,D` , cluster state is ok
> now.
> Step 3. Restart nodes A,B,C with configuration A,B,C,D, then the leader will
> be D, cluster hangs, but it can accept `mntr` command, other commands like `ls
> /` will be blocked.
> Step 4. Restart node D, cluster state is back to normal now.
>
> {code}
>
> We have looked into the code of the 3.5.6 version, and we found that it may be
> an issue with `workerPool` .
> The `CommitProcessor` shuts down and makes `workerPool` shut down, but
> `workerPool` still exists. It will never work anymore, yet the cluster still
> thinks it's ok.
>
> I think the bug may still exist in the master branch.
> We have tested it on our machines by resetting the `workerPool` to null. If
> that's ok, please assign this issue to me, and then I'll create a PR.
>
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)