[
https://issues.apache.org/jira/browse/ZOOKEEPER-3920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17185594#comment-17185594
]
Andre Price commented on ZOOKEEPER-3920:
----------------------------------------
I'm quite far from being a ZooKeeper expert, so please keep that in mind when
reading the inferences below.
Going through the logs I attached, I noticed slight differences around session
creation (which is what appears to be failing in the negative case).
Here are some logs from when things are working (sid:3 is the leader in my case):
* Follower (sid:1):
{noformat}
2020-08-26 16:11:21,506 [myid:1] - DEBUG
[WorkerSender[myid=1]:QuorumCnxManager@743] - There is a connection already for
server 2
2020-08-26 16:11:28,223 [myid:1] - TRACE
[QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):ZooTrace@70] - i
PROPOSAL f00000001 null
2020-08-26 16:11:28,223 [myid:1] - WARN
[QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):Follower@170] - Got
zxid 0xf00000001 expected 0x1
2020-08-26 16:11:28,226 [myid:1] - TRACE
[QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):ZooTrace@70] - i
COMMIT f00000001 null
2020-08-26 16:11:28,226 [myid:1] - DEBUG
[QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):CommitProcessor@580] -
Committing request:: sessionid:0x20170630c4b0000 type:createSession cxid:0x0
zxid:0xf00000001 txntype:-10 reqpath:n/a
2020-08-26 16:11:28,226 [myid:1] - DEBUG
[CommitProcessor:1:FinalRequestProcessor@110] - Processing request::
sessionid:0x20170630c4b0000 type:createSession cxid:0x0 zxid:0xf00000001
txntype:-10 reqpath:n/a
2020-08-26 16:11:28,227 [myid:1] - TRACE [CommitProcessor:1:ZooTrace@82] -
:Esessionid:0x20170630c4b0000 type:createSession cxid:0x0 zxid:0xf00000001
txntype:-10 reqpath:n/a
2020-08-26 16:11:28,227 [myid:1] - INFO
[CommitProcessor:1:LearnerSessionTracker@116] - Committing global session
0x20170630c4b0000
2020-08-26 16:11:28,227 [myid:1] - DEBUG [CommitProcessor:1:DataTree@1840] -
Digests are matching for Zxid: f00000001, Digest in log and actual tree:
1371985504
2020-08-26 16:11:46,804 [myid:1] - DEBUG
[NIOServerCxnFactory.AcceptThread:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@286]
- Accepted socket connection from /172.17.0.1:55984
2020-08-26 16:11:46,805 [myid:1] - INFO [NIOWorkerThread-4:NIOServerCnxn@507]
- Processing srvr command from /172.17.0.1:55984
2020-08-26 16:11:46,806 [myid:1] - DEBUG [NIOWorkerThread-4:NIOServerCnxn@623]
- Closed socket connection for client /172.17.0.1:55984 (no session established
for client)
2020-08-26 16:12:00,155 [myid:1] - DEBUG
[NIOServerCxnFactory.AcceptThread:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@286]
- Accepted socket connection from /172.17.0.1:56000
2020-08-26 16:12:00,156 [myid:1] - INFO [NIOWorkerThread-1:NIOServerCnxn@507]
- Processing mntr command from /172.17.0.1:56000
2020-08-26 16:12:00,158 [myid:1] - DEBUG [NIOWorkerThread-1:NIOServerCnxn@623]
- Closed socket connection for client /172.17.0.1:56000 (no session established
for client)
2020-08-26 16:12:00,847 [myid:1] - TRACE
[QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):ZooTrace@70] - i
PROPOSAL f00000002 null
2020-08-26 16:12:00,851 [myid:1] - TRACE
[QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):ZooTrace@70] - i
COMMIT f00000002 null
2020-08-26 16:12:00,851 [myid:1] - DEBUG
[QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):CommitProcessor@580] -
Committing request:: sessionid:0x20170630c4b0000 type:closeSession cxid:0x0
zxid:0xf00000002 txntype:-11 reqpath:n/a
2020-08-26 16:12:00,852 [myid:1] - DEBUG
[CommitProcessor:1:FinalRequestProcessor@110] - Processing request::
sessionid:0x20170630c4b0000 type:closeSession cxid:0x0 zxid:0xf00000002
txntype:-11 reqpath:n/a{noformat}
* Leader (sid:3):
{noformat}
2020-08-26 16:11:28,220 [myid:3] - TRACE [ProcessThread(sid:3
cport:-1)::ZooTrace@82] - :Psessionid:0x20170630c4b0000 type:createSession
cxid:0x0 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
2020-08-26 16:11:28,221 [myid:3] - DEBUG [ProcessThread(sid:3
cport:-1)::SessionTrackerImpl@278] - Adding session 0x20170630c4b0000
2020-08-26 16:11:28,221 [myid:3] - TRACE [ProcessThread(sid:3
cport:-1)::ZooTrace@70] - SessionTrackerImpl --- Adding session
0x20170630c4b0000 30000
2020-08-26 16:11:28,222 [myid:3] - DEBUG [ProcessThread(sid:3
cport:-1)::PrepRequestProcessor@1111] - Digest got from data tree is: 1371985504
2020-08-26 16:11:28,222 [myid:3] - DEBUG [ProcessThread(sid:3
cport:-1)::CommitProcessor@592] - Processing request::
sessionid:0x20170630c4b0000 type:createSession cxid:0x0 zxid:0xf00000001
txntype:-10 reqpath:n/a
2020-08-26 16:11:28,222 [myid:3] - DEBUG [ProcessThread(sid:3
cport:-1)::Leader@1250] - Proposing:: sessionid:0x20170630c4b0000
type:createSession cxid:0x0 zxid:0xf00000001 txntype:-10 reqpath:n/a
2020-08-26 16:11:28,225 [myid:3] - TRACE
[LearnerHandler-/10.4.12.167:57912:ZooTrace@70] - i ACK f00000001 null
2020-08-26 16:11:28,225 [myid:3] - TRACE
[LearnerHandler-/10.4.12.167:57912:Leader@983] - Ack zxid: 0xf00000001
2020-08-26 16:11:28,225 [myid:3] - TRACE
[LearnerHandler-/10.4.27.188:37872:ZooTrace@70] - i ACK f00000001 null
2020-08-26 16:11:28,225 [myid:3] - TRACE
[LearnerHandler-/10.4.12.167:57912:Leader@986] - outstanding proposal:
0xf00000001
2020-08-26 16:11:28,225 [myid:3] - TRACE
[LearnerHandler-/10.4.12.167:57912:Leader@988] - outstanding proposals all
2020-08-26 16:11:28,226 [myid:3] - TRACE [SyncThread:3:Leader@983] - Ack zxid:
0xf00000001
2020-08-26 16:11:28,226 [myid:3] - TRACE [SyncThread:3:Leader@986] -
outstanding proposal: 0xf00000001
2020-08-26 16:11:28,226 [myid:3] - TRACE [SyncThread:3:Leader@988] -
outstanding proposals all
2020-08-26 16:11:28,226 [myid:3] - DEBUG [SyncThread:3:CommitProcessor@580] -
Committing request:: sessionid:0x20170630c4b0000 type:createSession cxid:0x0
zxid:0xf00000001 txntype:-10 reqpath:n/a
2020-08-26 16:11:28,226 [myid:3] - TRACE
[LearnerHandler-/10.4.27.188:37872:Leader@983] - Ack zxid: 0xf00000001
2020-08-26 16:11:28,226 [myid:3] - TRACE
[LearnerHandler-/10.4.27.188:37872:Leader@988] - outstanding proposals all
2020-08-26 16:11:28,226 [myid:3] - DEBUG
[LearnerHandler-/10.4.27.188:37872:Leader@1001] - outstanding is 0
2020-08-26 16:11:28,226 [myid:3] - DEBUG
[CommitProcessor:3:FinalRequestProcessor@110] - Processing request::
sessionid:0x20170630c4b0000 type:createSession cxid:0x0 zxid:0xf00000001
txntype:-10 reqpath:n/a
2020-08-26 16:11:28,226 [myid:3] - TRACE [CommitProcessor:3:ZooTrace@82] -
:Esessionid:0x20170630c4b0000 type:createSession cxid:0x0 zxid:0xf00000001
txntype:-10 reqpath:n/a
2020-08-26 16:11:28,227 [myid:3] - INFO
[CommitProcessor:3:LeaderSessionTracker@104] - Committing global session
0x20170630c4b0000
2020-08-26 16:11:28,227 [myid:3] - DEBUG [CommitProcessor:3:DataTree@1840] -
Digests are matching for Zxid: f00000001, Digest in log and actual tree:
1371985504
2020-08-26 16:11:35,265 [myid:3] - DEBUG
[NIOServerCxnFactory.AcceptThread:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@286]
- Accepted socket connection from /172.17.0.1:38216
2020-08-26 16:11:35,266 [myid:3] - INFO [NIOWorkerThread-3:NIOServerCnxn@507]
- Processing srvr command from /172.17.0.1:38216
2020-08-26 16:11:35,267 [myid:3] - DEBUG [NIOWorkerThread-3:NIOServerCnxn@623]
- Closed socket connection for client /172.17.0.1:38216 (no session established
for client)
2020-08-26 16:12:00,792 [myid:3] - DEBUG
[NIOServerCxnFactory.AcceptThread:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@286]
- Accepted socket connection from /172.17.0.1:38224
2020-08-26 16:12:00,793 [myid:3] - INFO [NIOWorkerThread-4:NIOServerCnxn@507]
- Processing mntr command from /172.17.0.1:38224
2020-08-26 16:12:00,798 [myid:3] - DEBUG [NIOWorkerThread-4:NIOServerCnxn@623]
- Closed socket connection for client /172.17.0.1:38224 (no session established
for client)
2020-08-26 16:12:00,845 [myid:3] - TRACE
[SessionTracker:SessionTrackerImpl@221] - Session closing: 0x20170630c4b0000
2020-08-26 16:12:00,846 [myid:3] - INFO [SessionTracker:ZooKeeperServer@600] -
Expiring session 0x20170630c4b0000, timeout of 30000ms exceeded
2020-08-26 16:12:00,846 [myid:3] - INFO
[RequestThrottler:QuorumZooKeeperServer@158] - Submitting global closeSession
request for session 0x20170630c4b0000
2020-08-26 16:12:00,846 [myid:3] - TRACE [ProcessThread(sid:3
cport:-1)::ZooTrace@82] - :Psessionid:0x20170630c4b0000 type:closeSession
cxid:0x0 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
2020-08-26 16:12:00,846 [myid:3] - TRACE [ProcessThread(sid:3
cport:-1)::SessionTrackerImpl@221] - Session closing: 0x20170630c4b0000
2020-08-26 16:12:00,846 [myid:3] - DEBUG [ProcessThread(sid:3
cport:-1)::PrepRequestProcessor@1111] - Digest got from data tree is: 1371985504
2020-08-26 16:12:00,846 [myid:3] - DEBUG [ProcessThread(sid:3
cport:-1)::CommitProcessor@592] - Processing request::
sessionid:0x20170630c4b0000 type:closeSession cxid:0x0 zxid:0xf00000002
txntype:-11 reqpath:n/a
2020-08-26 16:12:00,847 [myid:3] - DEBUG [ProcessThread(sid:3
cport:-1)::Leader@1250] - Proposing:: sessionid:0x20170630c4b0000
type:closeSession cxid:0x0 zxid:0xf00000002 txntype:-11 reqpath:n/a
2020-08-26 16:12:00,848 [myid:3] - TRACE [SyncThread:3:Leader@983] - Ack zxid:
0xf00000002
2020-08-26 16:12:00,848 [myid:3] - TRACE [SyncThread:3:Leader@986] -
outstanding proposal: 0xf00000002
2020-08-26 16:12:00,848 [myid:3] - TRACE [SyncThread:3:Leader@988] -
outstanding proposals all
2020-08-26 16:12:00,850 [myid:3] - TRACE
[LearnerHandler-/10.4.12.167:57912:ZooTrace@70] - i ACK f00000002 null
2020-08-26 16:12:00,850 [myid:3] - TRACE
[LearnerHandler-/10.4.12.167:57912:Leader@983] - Ack zxid: 0xf00000002
2020-08-26 16:12:00,850 [myid:3] - TRACE
[LearnerHandler-/10.4.12.167:57912:Leader@986] - outstanding proposal:
0xf00000002
2020-08-26 16:12:00,850 [myid:3] - TRACE
[LearnerHandler-/10.4.12.167:57912:Leader@988] - outstanding proposals all
2020-08-26 16:12:00,850 [myid:3] - DEBUG
[LearnerHandler-/10.4.12.167:57912:CommitProcessor@580] - Committing request::
sessionid:0x20170630c4b0000 type:closeSession cxid:0x0 zxid:0xf00000002
txntype:-11 reqpath:n/a
2020-08-26 16:12:00,851 [myid:3] - TRACE
[LearnerHandler-/10.4.27.188:37872:ZooTrace@70] - i ACK f00000002 null
2020-08-26 16:12:00,851 [myid:3] - TRACE
[LearnerHandler-/10.4.27.188:37872:Leader@983] - Ack zxid: 0xf00000002
2020-08-26 16:12:00,851 [myid:3] - TRACE
[LearnerHandler-/10.4.27.188:37872:Leader@988] - outstanding proposals all
2020-08-26 16:12:00,851 [myid:3] - DEBUG
[LearnerHandler-/10.4.27.188:37872:Leader@1001] - outstanding is 0
2020-08-26 16:12:00,851 [myid:3] - DEBUG
[CommitProcessor:3:FinalRequestProcessor@110] - Processing request::
sessionid:0x20170630c4b0000 type:closeSession cxid:0x0 zxid:0xf00000002
txntype:-11 reqpath:n/a
2020-08-26 16:12:00,851 [myid:3] - TRACE [CommitProcessor:3:ZooTrace@82] -
:Esessionid:0x20170630c4b0000 type:closeSession cxid:0x0 zxid:0xf00000002
txntype:-11 reqpath:n/a
2020-08-26 16:12:00,851 [myid:3] - DEBUG
[CommitProcessor:3:SessionTrackerImpl@232] - Removing session 0x20170630c4b0000
2020-08-26 16:12:00,851 [myid:3] - TRACE [CommitProcessor:3:ZooTrace@70] -
SessionTrackerImpl --- Removing session 0x20170630c4b0000{noformat}
This leader log spans the first to the last mention of what appears to be a
successful session, id 0x20170630c4b0000.
In the negative case, this is what I see:
* Follower (sid:1):
{noformat}
2020-08-26 16:09:53,282 [myid:1] - TRACE
[QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):ZooTrace@70] - i
PROPOSAL e00000001 null
2020-08-26 16:09:53,282 [myid:1] - WARN
[QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):Follower@170] - Got
zxid 0xe00000001 expected 0x1{noformat}
*In the follower logs there is no COMMIT entry from ZooTrace around this time,
and no mention of the createSession request at all.*
* Leader (sid:2):
{noformat}
2020-08-26 16:09:53,277 [myid:2] - DEBUG
[NIOWorkerThread-4:SessionTrackerImpl@278] - Adding session 0x201706194f00000
2020-08-26 16:09:53,278 [myid:2] - TRACE [NIOWorkerThread-4:ZooTrace@70] -
SessionTrackerImpl --- Adding session 0x201706194f00000 30000
2020-08-26 16:09:53,278 [myid:2] - DEBUG
[NIOWorkerThread-4:ZooKeeperServer@1394] - Client attempting to establish new
session: session = 0x201706194f00000, zxid = 0x0, timeout = 30000, address =
/127.0.0.1:43616
2020-08-26 16:09:53,278 [myid:2] - TRACE [ProcessThread(sid:2
cport:-1)::ZooTrace@82] - :Psessionid:0x201706194f00000 type:createSession
cxid:0x0 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
2020-08-26 16:09:53,278 [myid:2] - TRACE [ProcessThread(sid:2
cport:-1)::ZooTrace@70] - SessionTrackerImpl --- Existing session
0x201706194f00000 30000
2020-08-26 16:09:53,278 [myid:2] - DEBUG [ProcessThread(sid:2
cport:-1)::PrepRequestProcessor@1111] - Digest got from data tree is: 1371985504
2020-08-26 16:09:53,279 [myid:2] - DEBUG [ProcessThread(sid:2
cport:-1)::CommitProcessor@592] - Processing request::
sessionid:0x201706194f00000 type:createSession cxid:0x0 zxid:0xe00000001
txntype:-10 reqpath:n/a
2020-08-26 16:09:53,279 [myid:2] - DEBUG [ProcessThread(sid:2
cport:-1)::Leader@1250] - Proposing:: sessionid:0x201706194f00000
type:createSession cxid:0x0 zxid:0xe00000001 txntype:-10 reqpath:n/a
2020-08-26 16:09:53,284 [myid:2] - TRACE
[LearnerHandler-/10.4.27.76:49898:ZooTrace@70] - i ACK e00000001 null
2020-08-26 16:09:53,284 [myid:2] - TRACE
[LearnerHandler-/10.4.12.167:43880:ZooTrace@70] - i ACK e00000001 null
2020-08-26 16:09:59,838 [myid:2] - DEBUG
[NIOServerCxnFactory.AcceptThread:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@286]
- Accepted socket connection from /172.17.0.1:59228
2020-08-26 16:09:59,839 [myid:2] - INFO [NIOWorkerThread-1:NIOServerCnxn@507] -
Processing srvr command from /172.17.0.1:59228
2020-08-26 16:09:59,840 [myid:2] - DEBUG [NIOWorkerThread-1:NIOServerCnxn@623]
- Closed socket connection for client /172.17.0.1:59228 (no session established
for client)
2020-08-26 16:10:00,250 [myid:2] - DEBUG
[NIOServerCxnFactory.AcceptThread:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@286]
- Accepted socket connection from /172.17.0.1:59232
2020-08-26 16:10:00,251 [myid:2] - INFO [NIOWorkerThread-2:NIOServerCnxn@507] -
Processing mntr command from /172.17.0.1:59232
2020-08-26 16:10:00,255 [myid:2] - DEBUG [NIOWorkerThread-2:NIOServerCnxn@623]
- Closed socket connection for client /172.17.0.1:59232 (no session established
for client)
2020-08-26 16:10:16,859 [myid:2] - TRACE
[SessionTracker:SessionTrackerImpl@221] - Session closing: 0x201705b99040002
2020-08-26 16:10:16,859 [myid:2] - INFO [SessionTracker:ZooKeeperServer@600] -
Expiring session 0x201705b99040002, timeout of 30000ms exceeded
2020-08-26 16:10:16,860 [myid:2] - INFO
[RequestThrottler:QuorumZooKeeperServer@158] - Submitting global closeSession
request for session 0x201705b99040002
2020-08-26 16:10:16,860 [myid:2] - TRACE [ProcessThread(sid:2
cport:-1)::ZooTrace@82] - :Psessionid:0x201705b99040002 type:closeSession
cxid:0x0 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
2020-08-26 16:10:16,860 [myid:2] - TRACE [ProcessThread(sid:2
cport:-1)::SessionTrackerImpl@221] - Session closing: 0x201705b99040002
2020-08-26 16:10:16,860 [myid:2] - DEBUG [ProcessThread(sid:2
cport:-1)::PrepRequestProcessor@1111] - Digest got from data tree is: 1371985504
2020-08-26 16:10:16,860 [myid:2] - DEBUG [ProcessThread(sid:2
cport:-1)::CommitProcessor@592] - Processing request::
sessionid:0x201705b99040002 type:closeSession cxid:0x0 zxid:0xe00000002
txntype:-11 reqpath:n/a
2020-08-26 16:10:16,860 [myid:2] - DEBUG [ProcessThread(sid:2
cport:-1)::Leader@1250] - Proposing:: sessionid:0x201705b99040002
type:closeSession cxid:0x0 zxid:0xe00000002 txntype:-11 reqpath:n/a
2020-08-26 16:10:16,863 [myid:2] - TRACE
[LearnerHandler-/10.4.27.76:49898:ZooTrace@70] - i ACK e00000002 null
2020-08-26 16:10:16,863 [myid:2] - TRACE
[LearnerHandler-/10.4.12.167:43880:ZooTrace@70] - i ACK e00000002 null
2020-08-26 16:10:24,858 [myid:2] - TRACE
[SessionTracker:SessionTrackerImpl@221] - Session closing: 0x201706194f00000
2020-08-26 16:10:24,859 [myid:2] - INFO [SessionTracker:ZooKeeperServer@600] -
Expiring session 0x201706194f00000, timeout of 30000ms exceeded
2020-08-26 16:10:24,859 [myid:2] - INFO
[RequestThrottler:QuorumZooKeeperServer@158] - Submitting global closeSession
request for session 0x201706194f00000
2020-08-26 16:10:24,859 [myid:2] - TRACE [ProcessThread(sid:2
cport:-1)::ZooTrace@82] - :Psessionid:0x201706194f00000 type:closeSession
cxid:0x0 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
2020-08-26 16:10:24,860 [myid:2] - TRACE [ProcessThread(sid:2
cport:-1)::SessionTrackerImpl@221] - Session closing: 0x201706194f00000
2020-08-26 16:10:24,860 [myid:2] - DEBUG [ProcessThread(sid:2
cport:-1)::PrepRequestProcessor@1111] - Digest got from data tree is: 1371985504
2020-08-26 16:10:24,860 [myid:2] - DEBUG [ProcessThread(sid:2
cport:-1)::CommitProcessor@592] - Processing request::
sessionid:0x201706194f00000 type:closeSession cxid:0x0 zxid:0xe00000003
txntype:-11 reqpath:n/a
2020-08-26 16:10:24,860 [myid:2] - DEBUG [ProcessThread(sid:2
cport:-1)::Leader@1250] - Proposing:: sessionid:0x201706194f00000
type:closeSession cxid:0x0 zxid:0xe00000003 txntype:-11 reqpath:n/a
2020-08-26 16:10:24,859 [myid:2] - DEBUG [ConnnectionExpirer:NIOServerCnxn@623]
- Closed socket connection for client /127.0.0.1:43616 which had sessionid
0x201706194f00000{noformat}
These leader logs span the first to the last mention of what appears to be a
failed session, id 0x201706194f00000. What seems to be missing are the
*Leader@983* entries, which would indicate a successful acknowledgement of the
proposal (and would likely unblock the rest of the createSession flow).
So I took a look at the code that should be logging that:
[https://github.com/apache/zookeeper/blob/release-3.6.1/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L978-L983].
In that processAck method you can see it exits early if the leader is not
*allowedToCommit*. So I then checked in which cases that flag might be set to
false (a simplified sketch of this logic follows the list below). There are two:
* In Leader.startZkServer
[https://github.com/apache/zookeeper/blob/release-3.6.1/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L1518],
which looks like it is invoked when the server starts leading.
* In
[https://github.com/apache/zookeeper/blob/release-3.6.1/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/Leader.java#L944],
which appears to be hit when committing a reconfig (and reconfiguration is
disabled in my case).
Since the repro using the stack.yml from the above Docker page was simple
enough, I decided to build a one-off ZooKeeper image with some extra logging. I
added logging to indicate which (if any) of the above cases was being hit and
disallowing the leader from committing; it was indeed being hit, in
startZkServer. I then added more logging to help diagnose why the designated
leader might be different from the current server. These log statements can be
seen in this commit:
[https://github.com/obsoleted/zookeeper/commit/f008d4b0f5345aef409013eef4a23fdd3dba5ddf]
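For anyone following along, the extra logging is roughly of this shape (the
real diff is in the commit above; the accessor names here are my approximation
of the QuorumPeer/QuorumVerifier API, so treat it as illustrative):
{code:java}
// Rough shape of the extra logging in the commit above. The accessors used
// here (getQuorumVerifier(), getVotingMembers(), getId(), getQuorumAddress(),
// the public 'addr' field) are my best guess at the QuorumPeer/QuorumVerifier
// API -- illustrative only, not a verbatim diff.
import org.apache.zookeeper.server.quorum.QuorumPeer;
import org.apache.zookeeper.server.quorum.flexible.QuorumVerifier;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class LeaderAddressLogging {
    private static final Logger LOG = LoggerFactory.getLogger(LeaderAddressLogging.class);

    static void logLeaderAddressInfo(QuorumPeer self) {
        QuorumVerifier qv = self.getQuorumVerifier();
        LOG.warn("QV: {}", qv);                               // full verifier (server.N lines + version)
        LOG.warn("VotingMembers: {}", qv.getVotingMembers()); // sid -> QuorumServer map
        LOG.warn("OwnAddr: {}", qv.getVotingMembers()
                                  .get(self.getId()).addr);   // address the QV holds for this sid
        LOG.warn("Self QAddr: {}", self.getQuorumAddress());  // address this server thinks it has
    }
}
{code}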
Using this custom image, I ran through the repro with the attached [^stack.yml]
(I upped it to 5 ZooKeeper nodes as part of other experimentation). Here are
the logs of interest:
Working case:
{noformat}
zoo5_1 | 2020-08-27 04:12:42,406 [myid:5] - INFO
[QuorumPeer[myid=5](plain=0.0.0.0:2181)(secure=disabled):Leader@1509] - Have
quorum of supporters, sids: [[1, 2, 5],[1, 2, 5]]; starting up and setting last
processed zxid: 0x100000000
zoo5_1 | 2020-08-27 04:12:42,407 [myid:5] - WARN
[QuorumPeer[myid=5](plain=0.0.0.0:2181)(secure=disabled):Leader@850] - QV:
server.1=zoo1:2888:3888:participant;0.0.0.0:2181
zoo5_1 | server.2=zoo2:2888:3888:participant;0.0.0.0:2181
zoo5_1 | server.3=zoo3:2888:3888:participant;0.0.0.0:2181
zoo5_1 | server.4=zoo4:2888:3888:participant;0.0.0.0:2181
zoo5_1 | server.5=0.0.0.0:2888:3888:participant;0.0.0.0:2181
zoo5_1 | version=100000000
zoo5_1 | 2020-08-27 04:12:42,409 [myid:5] - WARN
[QuorumPeer[myid=5](plain=0.0.0.0:2181)(secure=disabled):Leader@851] -
VotingMembers:
{1=zoo1:2888:3888:participant;0.0.0.0:2181,2=zoo2:2888:3888:participant;0.0.0.0:2181,
3=zoo3:2888:3888:participant;0.0.0.0:2181,
4=zoo4:2888:3888:participant;0.0.0.0:2181,5=0.0.0.0:2888:3888:participant;0.0.0.0:2181}
zoo5_1 | 2020-08-27 04:12:42,410 [myid:5] - WARN
[QuorumPeer[myid=5](plain=0.0.0.0:2181)(secure=disabled):Leader@852] -
OwnAdder: /0.0.0.0:2888
zoo5_1 | 2020-08-27 04:12:42,411 [myid:5] - WARN
[QuorumPeer[myid=5](plain=0.0.0.0:2181)(secure=disabled):Leader@853] - Self
QAddr: /0.0.0.0:2888{noformat}
In this case there were no messages about being disallowed from committing, and
the self address matches the address the QuorumVerifier holds for the same id.
Not working case:
{noformat}
zoo4_1 | 2020-08-27 04:15:29,650 [myid:4] - INFO
[QuorumPeer[myid=4](plain=0.0.0.0:2181)(secure=disabled):Leader@1509] - Have
quorum of supporters, sids: [[1, 3, 4],[1, 3, 4]]; starting up and setting last
processed zxid: 0x200000000
zoo4_1 | 2020-08-27 04:15:29,651 [myid:4] - WARN
[QuorumPeer[myid=4](plain=0.0.0.0:2181)(secure=disabled):Leader@850] - QV:
server.1=zoo1:2888:3888:participant;0.0.0.0:2181
zoo4_1 | server.2=zoo2:2888:3888:participant;0.0.0.0:2181
zoo4_1 | server.3=zoo3:2888:3888:participant;0.0.0.0:2181
zoo4_1 | server.4=zoo4:2888:3888:participant;0.0.0.0:2181
zoo4_1 | server.5=0.0.0.0:2888:3888:participant;0.0.0.0:2181
zoo4_1 | version=100000000
zoo4_1 | 2020-08-27 04:15:29,653 [myid:4] - WARN
[QuorumPeer[myid=4](plain=0.0.0.0:2181)(secure=disabled):Leader@851] -
VotingMembers: {1=zoo1:2888:3888:participant;0.0.0.0:2181,
2=zoo2:2888:3888:participant;0.0.0.0:2181,
3=zoo3:2888:3888:participant;0.0.0.0:2181,
4=zoo4:2888:3888:participant;0.0.0.0:2181,
5=0.0.0.0:2888:3888:participant;0.0.0.0:2181}
zoo4_1 | 2020-08-27 04:15:29,654 [myid:4] - WARN
[QuorumPeer[myid=4](plain=0.0.0.0:2181)(secure=disabled):Leader@852] -
OwnAdder: zoo4/172.18.0.4:2888
zoo4_1 | 2020-08-27 04:15:29,655 [myid:4] - WARN
[QuorumPeer[myid=4](plain=0.0.0.0:2181)(secure=disabled):Leader@853] - Self
QAddr: /0.0.0.0:2888
zoo4_1 | 2020-08-27 04:15:29,655 [myid:4] - WARN
[QuorumPeer[myid=4](plain=0.0.0.0:2181)(secure=disabled):Leader@1526] - Not
designated leader. Not allowed to commit. designatedLeader: 1, self: 4{noformat}
In this case the addresses do not match, and as a result the leader is not
allowed to commit.
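To restate the mechanism as I understand it: when a newly elected leader starts
up, it only treats itself as the designated leader if the address the
QuorumVerifier holds for its own sid matches its own quorum address.
Paraphrased (again an approximation, not the verbatim Leader source):
{code:java}
// Paraphrase of the check I believe is failing -- same caveat as before about
// the exact accessor names being my best guess at the API.
import org.apache.zookeeper.server.quorum.QuorumPeer;
import org.apache.zookeeper.server.quorum.flexible.QuorumVerifier;

class DesignatedLeaderSketch {
    // True only if the address the QuorumVerifier holds for our own sid equals
    // our actual quorum address. With the stored config, the QV entry for sid 4
    // is 0.0.0.0:2888 (written from the previous leader's point of view) while
    // the local address is zoo4/172.18.0.4:2888, so this returns false, another
    // sid is picked as designated leader, and startZkServer() then clears
    // allowedToCommit ("designatedLeader: 1, self: 4" in the log above).
    static boolean selfIsDesignatedLeader(QuorumVerifier qv, QuorumPeer self) {
        QuorumPeer.QuorumServer me = qv.getVotingMembers().get(self.getId());
        return me != null && me.addr.equals(self.getQuorumAddress());
    }
}
{code}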
Other interesting notes:
* The QuorumVerifier printout seems to be from the perspective of server.5 (the
first leader), since it is server.5 whose IP is set to 0.0.0.0, even though
server.4 is the leader (and the one printing these logs).
* This matches what is in zoo.cfg.dynamic.next and the /zookeeper/config znode.
* Rebuilding the whole cluster without the 0.0.0.0 references and restarting
the whole demo works better: leaders are able to change. So one workaround
might be to avoid using 0.0.0.0 for the local server and instead use proper
hostnames or IP addresses that are resolvable by the other servers (see the
illustrative zoo.cfg after this list). But I'm not sure how to guarantee that
the addresses will always resolve the same way.
* I am at a loss as to how to safely remove this potentially bad config from
the cluster, or why it is being used at all when dynamic configuration isn't.
I'm curious whether there is a way to force it to be replaced with a more
"correct" configuration once one is known. In our case we have a couple of
clusters that got into this state while attempting the upgrade mentioned above.
We can roll back versions to keep them operational, but the problematic znode
(/zookeeper/config) remains, so I'm not sure whether moving to a newer version
with a fixed static config will work or whether the stored value will win. I'll
likely test this to try to get our clusters back to as normal a state as
possible.
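For reference, the server section of the workaround configuration looks roughly
like this. The hostnames come from my docker-compose repro and the exact lines
are only illustrative; the point is that every server, including the local one,
is listed with an address the other servers can resolve instead of 0.0.0.0:
{noformat}
# server section of zoo.cfg -- identical on every node, no 0.0.0.0 quorum
# addresses (client binding left as in the original setup)
server.1=zoo1:2888:3888;0.0.0.0:2181
server.2=zoo2:2888:3888;0.0.0.0:2181
server.3=zoo3:2888:3888;0.0.0.0:2181
server.4=zoo4:2888:3888;0.0.0.0:2181
server.5=zoo5:2888:3888;0.0.0.0:2181
{noformat}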
Again, I don't really understand the internals of ZooKeeper all that well, so
I'm just making educated guesses. More guidance on properly configuring the
server addresses would definitely be appreciated, especially in a Docker
environment where many factors affect how an address or IP is resolved (and to
what).
> Zookeeper clients timeout after leader change
> ---------------------------------------------
>
> Key: ZOOKEEPER-3920
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3920
> Project: ZooKeeper
> Issue Type: Bug
> Components: quorum, server
> Affects Versions: 3.6.1
> Reporter: Andre Price
> Priority: Major
> Attachments: stack.yml, zk_repro.zip
>
>
> [Sorry I believe this is a dupe of
> https://issues.apache.org/jira/browse/ZOOKEEPER-3828 and potentially
> https://issues.apache.org/jira/browse/ZOOKEEPER-3466
> But I am not able to attach files there for some reason, so I am creating a new
> issue, which hopefully allows me to]
> We are encountering an issue where failing over from the leader results in
> zookeeper clients not being able to connect successfully. They timeout
> waiting for a response from the server. We are attempting to upgrade some
> existing zookeeper clusters from 3.4.14 to 3.6.1 (not sure if relevant, but
> stating it in case it helps with pinpointing the issue), which is effectively
> blocked by this issue. We perform the rolling upgrade (followers first, then leader
> last) and it seems to go successfully by all indicators. But we end up in the
> state described in this issue where if the leader changes (either due to
> restart or stopping) the cluster does not seem able to start new sessions.
> I've gathered some TRACE logs from our servers and will attach in the hopes
> they can help figure this out.
> Attached zk_repro.zip which contains the following:
> * zoo.cfg used in one of the instances (they are all the same except for the
> local server's ip being 0.0.0.0 in each)
> * zoo.cfg.dynamic.next (don't think this is used anywhere but is written by
> zookeeper at some point - I think when the first 3.6.1 container becomes
> leader based on the value – the file is in all containers and is the same in
> all as well)
> * s\{1,2,3}_zk.log - logs from each of the 3 servers. Estimated time of
> repro start indicated by "// REPRO START" text and whitespace in logs
> * repro_steps.txt - rough steps executed that result in the server logs
> attached
>
> I'll summarize the repro here also:
> # Initially it appears to be a healthy 3 node ensemble all running 3.6.1.
> Server ids are 1,2,3 and 3 is the leader. Dynamic config/reconfiguration is
> disabled.
> # invoke srvr on each node (to verify setup and also create bookmark in logs)
> # Do a zkCli get of /zookeeper/quota which succeeds
> # Do a restart of the leader (to same image/config) (server 2 now becomes
> leader, 3 is back as follower)
> # Try to perform the same zkCli get which times out (this get is done within
> the container)
> # Try to perform the same zkCli get but from another machine, this also
> times out
> # Invoke srvr on each node again (to verify that 2 is now the
> leader/bookmark)
> # Do a restart of server 2 (3 becomes leader, 2 follower)
> # Do a zkCli get of /zookeeper/quota which succeeds
> # Invoke srvr on each node again (to verify that 3 is leader)
> I tried to keep the other ZK traffic to a minimum but there are likely some
> periodic mntr requests mixed from our metrics scraper.