[
https://issues.apache.org/jira/browse/ZOOKEEPER-3182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16664232#comment-16664232
]
Fangmin Lv commented on ZOOKEEPER-3182:
---------------------------------------
[~andorm] I'm trying to go through this problem to see if this is a real
correctness issue or by designed behavior.
So first, let's see if we're on the same page about the designed behavior. The
setWatches request is treated as a read request in Zeus, ZK only guarantees
read after write for the same session on the same server, if it reconnects to a
different server, that server might be slow, either slow syncing with leader or
slow due to GC, it may not get the commits from leader on time, so it doesn't
guarantee you'll see it unless issuing sync even the follower it connect to is
already in broadcasting state.
This seems to be a test problem, not the code, and we should check if we can
issue sync before setWatches to fix the flaky test here.
> Race condition when follower syncing with leader and starting to serve
> requests
> -------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-3182
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3182
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.6.0
> Reporter: Andor Molnar
> Priority: Critical
>
> This issue is probably introduced by ZOOKEEPER-2024 where 2 seperate queues
> have been implemented in CommitProcessor to improve performance.
> [~abrahamfine] 's analysis is accurate on GitHub:
> https://github.com/apache/zookeeper/pull/300
> He was trying to introduce synchronization between Learner.syncWithLeader()
> and CommitProcessor to wait for in-flight requests to be committed before
> accepting client requests.
> In the affected unit test ({{testNodeDataChanged}}) there's a race between
> reconnecting client's setWatches request and updates coming from the leader
> according to the following logs:
> {noformat}
> 2018-10-25 13:59:58,556 [myid:] - DEBUG
> [FollowerRequestProcessor:1:CommitProcessor@424] - Processing request::
> sessionid:0x10005d8fc4d0000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe
> txntype:unknown reqpath:n/a
> 2018-10-25 13:59:58,556 [myid:] - DEBUG
> [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request::
> sessionid:0x10005d8fc4d0000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe
> txntype:unknown reqpath:n/a
> ...
> 2018-10-25 13:59:58,557 [myid:] - DEBUG
> [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request::
> sessionid:0x20005d8f8a40000 type:delete cxid:0x1 zxid:0x100000004 txntype:2
> reqpath:n/a
> ...
> 2018-10-25 13:59:58,561 [myid:] - DEBUG
> [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request::
> sessionid:0x20005d8f8a40000 type:create cxid:0x2 zxid:0x100000005 txntype:1
> reqpath:n/a
> 2018-10-25 13:59:58,561 [myid:127.0.0.1:11231] - DEBUG
> [main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@864] - Got
> WatchedEvent state:SyncConnected type:NodeDeleted path:/test-changed for
> sessionid 0x10005d8fc4d0000
> {noformat}
> {{setWatches}} request is processed before {{delete}} and {{create}}, hence
> the client receives NodeDeleted event.
> In the working scenario it looks like:
> {noformat}
> 2018-10-25 14:04:55,247 [myid:] - DEBUG
> [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request::
> sessionid:0x20005dd88110000 type:delete cxid:
> 0x1 zxid:0x100000004 txntype:2 reqpath:n/a
> 2018-10-25 14:04:55,249 [myid:] - DEBUG
> [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request::
> sessionid:0x20005dd88110000 type:create cxid:
> 0x2 zxid:0x100000005 txntype:1 reqpath:n/a
> ...
> 2018-10-25 14:04:56,314 [myid:] - DEBUG
> [FollowerRequestProcessor:1:CommitProcessor@424] - Processing request::
> sessionid:0x10005dd88110000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe
> txntype:unknown reqpath:n/a
> 2018-10-25 14:04:56,315 [myid:] - DEBUG
> [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request::
> sessionid:0x10005dd88110000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe
> txntype:unknown reqpath:n/a
> ...
> 2018-10-25 14:04:56,316 [myid:127.0.0.1:11231] - DEBUG
> [main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@842] - Got
> notification sessionid:0x10005dd88110000
> 2018-10-25 14:04:56,316 [myid:127.0.0.1:11231] - DEBUG
> [main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@864] - Got
> WatchedEvent state:SyncConnected type:NodeDataChanged path:/test-changed for
> sessionid 0x10005dd88110000
> {noformat}
> {{delete}} and {{create}} requests happen way before {{setWatches}} comes in
> (even before the client connection is established) and client receives
> NodeDataChanged event only.
> Abe's approach unfortunately raises the following concerns:
> - modifies CommitProcessor's code which might affect performance and
> correctness ([~shralex] raised on ZOOKEEPER-2807),
> - we experienced deadlocks while testing the patch:
> https://github.com/apache/zookeeper/pull/300
> As a consequence I raised this Jira to capture the experiences and to put the
> unit test on Ignore list, because currently I'm not sure about whether this
> is a real issue or a non-backward compatible change in 3.6 with the gain of a
> huge performance improvement.
> Either way I don't want this flaky test to influence contributions, so I'll
> mark as Ignored on trunk until the issue is resolved.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)