[
https://issues.apache.org/jira/browse/ZOOKEEPER-3182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andor Molnar updated ZOOKEEPER-3182:
------------------------------------
Description:
This issue is probably introduced by ZOOKEEPER-2024 where 2 seperate queues
have been implemented in CommitProcessor to improve performance. [~abrahamfine]
's analysis is accurate on GitHub: https://github.com/apache/zookeeper/pull/300
He was trying to introduce synchronization between Learner.syncWithLeader() and
CommitProcessor to wait for in-flight requests to be committed before accepting
client requests.
In the affected unit test ({{testNodeDataChanged}}) there's a race between
reconnecting client's setWatches request and updates coming from the leader
according to the following logs:
{noformat}
2018-10-25 13:59:58,556 [myid:] - DEBUG
[FollowerRequestProcessor:1:CommitProcessor@424] - Processing request::
sessionid:0x10005d8fc4d0000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe
txntype:unknown reqpath:n/a
2018-10-25 13:59:58,556 [myid:] - DEBUG
[CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request::
sessionid:0x10005d8fc4d0000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe
txntype:unknown reqpath:n/a
...
2018-10-25 13:59:58,557 [myid:] - DEBUG
[CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request::
sessionid:0x20005d8f8a40000 type:delete cxid:0x1 zxid:0x100000004 txntype:2
reqpath:n/a
...
2018-10-25 13:59:58,561 [myid:] - DEBUG
[CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request::
sessionid:0x20005d8f8a40000 type:create cxid:0x2 zxid:0x100000005 txntype:1
reqpath:n/a
2018-10-25 13:59:58,561 [myid:127.0.0.1:11231] - DEBUG
[main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@864] - Got WatchedEvent
state:SyncConnected type:NodeDeleted path:/test-changed for sessionid
0x10005d8fc4d0000
{noformat}
{{setWatches}} request is processed before {{delete}} and {{create}}, hence the
client receives NodeDeleted event.
In the working scenario it looks like:
{noformat}
2018-10-25 14:04:55,247 [myid:] - DEBUG
[CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request::
sessionid:0x20005dd88110000 type:delete cxid:
0x1 zxid:0x100000004 txntype:2 reqpath:n/a
2018-10-25 14:04:55,249 [myid:] - DEBUG
[CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request::
sessionid:0x20005dd88110000 type:create cxid:
0x2 zxid:0x100000005 txntype:1 reqpath:n/a
...
2018-10-25 14:04:56,314 [myid:] - DEBUG
[FollowerRequestProcessor:1:CommitProcessor@424] - Processing request::
sessionid:0x10005dd88110000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe
txntype:unknown reqpath:n/a
2018-10-25 14:04:56,315 [myid:] - DEBUG
[CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request::
sessionid:0x10005dd88110000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe
txntype:unknown reqpath:n/a
...
2018-10-25 14:04:56,316 [myid:127.0.0.1:11231] - DEBUG
[main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@842] - Got notification
sessionid:0x10005dd88110000
2018-10-25 14:04:56,316 [myid:127.0.0.1:11231] - DEBUG
[main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@864] - Got WatchedEvent
state:SyncConnected type:NodeDataChanged path:/test-changed for sessionid
0x10005dd88110000
{noformat}
{{delete}} and {{create}} requests happen way before {{setWatches}} comes in
(even before the client connection is established) and client receives
NodeDataChanged event only.
Abe's approach unfortunately raises the following concerns:
- modifies CommitProcessor's code which might affect performance and
correctness ([~shralex] raised on ZOOKEEPER-2807),
- we experienced deadlocks while testing the patch:
https://github.com/apache/zookeeper/pull/300
As a consequence I raised this Jira to capture the experiences and to put the
unit test on Ignore list, because currently I'm not sure about whether this is
a real issue or a non-backward compatible change in 3.6 with the gain of a huge
performance improvement.
Either way I don't want this flaky test to influence contributions, so I'll
mark as Ignored on trunk until the issue is resolved.
was:
This issue is probably introduced by ZOOKEEPER-2024 where 2 seperate queues
have been implemented in CommitProcessor to improve performance.
[~abrahamfine] 's analysis is accurate on GitHub:
https://github.com/apache/zookeeper/pull/300
He was trying to introduce synchronization between Learner.syncWithLeader() and
CommitProcessor to wait for in-flight requests to be committed before accepting
client requests.
In the affected unit test (testNodeDataChanged) there's a race between
reconnecting client's setWatches request and updates coming from the leader
according to the following logs:
{noformat}
2018-10-25 13:59:58,556 [myid:] - DEBUG
[FollowerRequestProcessor:1:CommitProcessor@424] - Processing request::
sessionid:0x10005d8fc4d0000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe
txntype:unknown reqpath:n/a
2018-10-25 13:59:58,556 [myid:] - DEBUG
[CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request::
sessionid:0x10005d8fc4d0000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe
txntype:unknown reqpath:n/a
...
2018-10-25 13:59:58,557 [myid:] - DEBUG
[CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request::
sessionid:0x20005d8f8a40000 type:delete cxid:0x1 zxid:0x100000004 txntype:2
reqpath:n/a
...
2018-10-25 13:59:58,561 [myid:] - DEBUG
[CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request::
sessionid:0x20005d8f8a40000 type:create cxid:0x2 zxid:0x100000005 txntype:1
reqpath:n/a
2018-10-25 13:59:58,561 [myid:127.0.0.1:11231] - DEBUG
[main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@864] - Got WatchedEvent
state:SyncConnected type:NodeDeleted path:/test-changed for sessionid
0x10005d8fc4d0000
{noformat}
{{setWatches}} request is processed before {{delete}} and {{create}}, hence the
client receives NodeDeleted event.
In the working scenario it looks like:
{noformat}
2018-10-25 14:04:55,247 [myid:] - DEBUG
[CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request::
sessionid:0x20005dd88110000 type:delete cxid:
0x1 zxid:0x100000004 txntype:2 reqpath:n/a
2018-10-25 14:04:55,249 [myid:] - DEBUG
[CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request::
sessionid:0x20005dd88110000 type:create cxid:
0x2 zxid:0x100000005 txntype:1 reqpath:n/a
...
2018-10-25 14:04:56,314 [myid:] - DEBUG
[FollowerRequestProcessor:1:CommitProcessor@424] - Processing request::
sessionid:0x10005dd88110000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe
txntype:unknown reqpath:n/a
2018-10-25 14:04:56,315 [myid:] - DEBUG
[CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request::
sessionid:0x10005dd88110000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe
txntype:unknown reqpath:n/a
...
2018-10-25 14:04:56,316 [myid:127.0.0.1:11231] - DEBUG
[main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@842] - Got notification
sessionid:0x10005dd88110000
2018-10-25 14:04:56,316 [myid:127.0.0.1:11231] - DEBUG
[main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@864] - Got WatchedEvent
state:SyncConnected type:NodeDataChanged path:/test-changed for sessionid
0x10005dd88110000
{noformat}
{{delete}} and {{create}} requests happen way before {{setWatches}} comes in
(even before the client connection is established.
Abe's approach unfortunately raises the following concerns:
- modifies CommitProcessor's code which might affect performance and
correctness ([~shralex] raised on ZOOKEEPER-2807),
- we experienced deadlocks while testing the patch:
https://github.com/apache/zookeeper/pull/300
As a consequence I raised this Jira the capture the experiences and to put the
unit test on Ignore list, because currenty I'm not sure about whether this is a
real issue or a non-backward compatible change in 3.6 with the gain of a huge
performance improvement.
Either way I don't want this flaky test to influence contributions, so I'll
mark as Ignored on trunk until the issue is resolved.
> Race condition when follower syncing with leader and starting to serve
> requests
> -------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-3182
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3182
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.6.0
> Reporter: Andor Molnar
> Priority: Critical
>
> This issue is probably introduced by ZOOKEEPER-2024 where 2 seperate queues
> have been implemented in CommitProcessor to improve performance.
> [~abrahamfine] 's analysis is accurate on GitHub:
> https://github.com/apache/zookeeper/pull/300
> He was trying to introduce synchronization between Learner.syncWithLeader()
> and CommitProcessor to wait for in-flight requests to be committed before
> accepting client requests.
> In the affected unit test ({{testNodeDataChanged}}) there's a race between
> reconnecting client's setWatches request and updates coming from the leader
> according to the following logs:
> {noformat}
> 2018-10-25 13:59:58,556 [myid:] - DEBUG
> [FollowerRequestProcessor:1:CommitProcessor@424] - Processing request::
> sessionid:0x10005d8fc4d0000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe
> txntype:unknown reqpath:n/a
> 2018-10-25 13:59:58,556 [myid:] - DEBUG
> [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request::
> sessionid:0x10005d8fc4d0000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe
> txntype:unknown reqpath:n/a
> ...
> 2018-10-25 13:59:58,557 [myid:] - DEBUG
> [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request::
> sessionid:0x20005d8f8a40000 type:delete cxid:0x1 zxid:0x100000004 txntype:2
> reqpath:n/a
> ...
> 2018-10-25 13:59:58,561 [myid:] - DEBUG
> [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request::
> sessionid:0x20005d8f8a40000 type:create cxid:0x2 zxid:0x100000005 txntype:1
> reqpath:n/a
> 2018-10-25 13:59:58,561 [myid:127.0.0.1:11231] - DEBUG
> [main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@864] - Got
> WatchedEvent state:SyncConnected type:NodeDeleted path:/test-changed for
> sessionid 0x10005d8fc4d0000
> {noformat}
> {{setWatches}} request is processed before {{delete}} and {{create}}, hence
> the client receives NodeDeleted event.
> In the working scenario it looks like:
> {noformat}
> 2018-10-25 14:04:55,247 [myid:] - DEBUG
> [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request::
> sessionid:0x20005dd88110000 type:delete cxid:
> 0x1 zxid:0x100000004 txntype:2 reqpath:n/a
> 2018-10-25 14:04:55,249 [myid:] - DEBUG
> [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request::
> sessionid:0x20005dd88110000 type:create cxid:
> 0x2 zxid:0x100000005 txntype:1 reqpath:n/a
> ...
> 2018-10-25 14:04:56,314 [myid:] - DEBUG
> [FollowerRequestProcessor:1:CommitProcessor@424] - Processing request::
> sessionid:0x10005dd88110000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe
> txntype:unknown reqpath:n/a
> 2018-10-25 14:04:56,315 [myid:] - DEBUG
> [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request::
> sessionid:0x10005dd88110000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe
> txntype:unknown reqpath:n/a
> ...
> 2018-10-25 14:04:56,316 [myid:127.0.0.1:11231] - DEBUG
> [main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@842] - Got
> notification sessionid:0x10005dd88110000
> 2018-10-25 14:04:56,316 [myid:127.0.0.1:11231] - DEBUG
> [main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@864] - Got
> WatchedEvent state:SyncConnected type:NodeDataChanged path:/test-changed for
> sessionid 0x10005dd88110000
> {noformat}
> {{delete}} and {{create}} requests happen way before {{setWatches}} comes in
> (even before the client connection is established) and client receives
> NodeDataChanged event only.
> Abe's approach unfortunately raises the following concerns:
> - modifies CommitProcessor's code which might affect performance and
> correctness ([~shralex] raised on ZOOKEEPER-2807),
> - we experienced deadlocks while testing the patch:
> https://github.com/apache/zookeeper/pull/300
> As a consequence I raised this Jira to capture the experiences and to put the
> unit test on Ignore list, because currently I'm not sure about whether this
> is a real issue or a non-backward compatible change in 3.6 with the gain of a
> huge performance improvement.
> Either way I don't want this flaky test to influence contributions, so I'll
> mark as Ignored on trunk until the issue is resolved.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)