[
https://issues.apache.org/jira/browse/ZOOKEEPER-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15402819#comment-15402819
]
Artem Golotin commented on ZOOKEEPER-1863:
------------------------------------------
Hi! I was able to reproduce the same issue as mentioned in
[ZOOKEEPER-2151|https://issues.apache.org/jira/browse/ZOOKEEPER-2151]. Here is
the output from the stat command:
{noformat}
Zookeeper version: 3.5.1-alpha--1, built on 07/08/2016 17:08 GMT
Clients:
/0:0:0:0:0:0:0:1:59312[0](queued=0,recved=1,sent=0)
/10.2.29.7:34266[0](queued=0,recved=1,sent=0)
/10.2.29.7:55996[0](queued=0,recved=1,sent=0)
/10.2.81.3:48640[0](queued=0,recved=1,sent=0)
Latency min/avg/max: 0/1/826
Received: 2199007
Sent: 935511
Connections: 4
Outstanding: 48112
Zxid: 0x100011d4e
Mode: leader
Node count: 171
{noformat}
Notice the number of outstanding connections....
After examining the code I found that the patch that have been submitted with
this issue was applied to the code. With my limited experience I do not believe
that I will be able to reproduce the issue, but I am saving the full stdout
zookeeper output and can provide it if needed. However, I cannot provide you
with a jstack dump, because when I try to do so (even with -F option), it
generates the following exception:
{noformat}
Error attaching to process: sun.jvm.hotspot.debugger.DebuggerException: Can't
attach to the process
{noformat}
Unfortunately, the issue seems to have happened on Sunday when I was not able
to check the health of my cluster or perform any immediate fixes. I do have
several Kazoo clients + Kafka + Secor that are using zookeeper, and as all
those are failing now with zookeeper outputting the following logs:
{noformat}
2016-08-01 20:45:51,828 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.81.3:50656
2016-08-01 20:45:51,829 - INFO [NIOWorkerThread-4:ZooKeeperServer@931] -
Client attempting to establish new session at /10.2.81.3:50656
2016-08-01 20:45:53,723 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:35862
2016-08-01 20:45:53,725 - INFO [NIOWorkerThread-3:ZooKeeperServer@931] -
Client attempting to establish new session at /10.2.29.7:35862
2016-08-01 20:45:59,171 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.81.3:50658
2016-08-01 20:45:59,172 - INFO [NIOWorkerThread-1:ZooKeeperServer@931] -
Client attempting to establish new session at /10.2.81.3:50658
2016-08-01 20:46:00,223 - INFO [ConnnectionExpirer:NIOServerCnxn@1007] -
Closed socket connection for client /10.2.29.7:35828 which had sessionid
0x10044f682c45bdd
2016-08-01 20:46:00,223 - INFO [ConnnectionExpirer:NIOServerCnxn@1007] -
Closed socket connection for client /10.2.81.3:50656 which had sessionid
0x10044f682c45bdf
2016-08-01 20:46:05,013 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:35898
2016-08-01 20:46:05,028 - INFO [NIOWorkerThread-2:NIOServerCnxn@836] -
Processing ruok command from /10.2.29.7:35898
2016-08-01 20:46:05,029 - INFO [NIOWorkerThread-2:NIOServerCnxn@1007] - Closed
socket connection for client /10.2.29.7:35898 (no session established for
client)
2016-08-01 20:46:05,146 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:35900
2016-08-01 20:46:05,153 - INFO [NIOWorkerThread-4:NIOServerCnxn@836] -
Processing srvr command from /10.2.29.7:35900
2016-08-01 20:46:05,153 - INFO [NIOWorkerThread-4:NIOServerCnxn@1007] - Closed
socket connection for client /10.2.29.7:35900 (no session established for
client)
2016-08-01 20:46:05,617 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:35904
2016-08-01 20:46:05,617 - INFO [NIOWorkerThread-3:NIOServerCnxn@836] -
Processing ruok command from /10.2.29.7:35904
2016-08-01 20:46:05,618 - INFO [NIOWorkerThread-3:NIOServerCnxn@1007] - Closed
socket connection for client /10.2.29.7:35904 (no session established for
client)
2016-08-01 20:46:05,634 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:35906
2016-08-01 20:46:05,634 - INFO [NIOWorkerThread-1:NIOServerCnxn@836] -
Processing srvr command from /10.2.29.7:35906
2016-08-01 20:46:05,636 - INFO [NIOWorkerThread-1:NIOServerCnxn@1007] - Closed
socket connection for client /10.2.29.7:35906 (no session established for
client)
2016-08-01 20:46:05,644 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:35908
2016-08-01 20:46:05,644 - INFO [NIOWorkerThread-2:ZooKeeperServer@931] -
Client attempting to establish new session at /10.2.29.7:35908
2016-08-01 20:46:06,718 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.81.3:50660
2016-08-01 20:46:06,719 - INFO [NIOWorkerThread-4:ZooKeeperServer@931] -
Client attempting to establish new session at /10.2.81.3:50660
2016-08-01 20:46:10,224 - INFO [ConnnectionExpirer:NIOServerCnxn@1007] -
Closed socket connection for client /10.2.81.3:50658 which had sessionid
0x10044f682c45be1
2016-08-01 20:46:10,225 - INFO [ConnnectionExpirer:NIOServerCnxn@1007] -
Closed socket connection for client /10.2.29.7:35862 which had sessionid
0x10044f682c45be0
2016-08-01 20:46:13,933 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.81.3:50662
2016-08-01 20:46:13,933 - INFO [NIOWorkerThread-3:ZooKeeperServer@931] -
Client attempting to establish new session at /10.2.81.3:50662
2016-08-01 20:46:16,030 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:35942
2016-08-01 20:46:16,032 - INFO [NIOWorkerThread-1:ZooKeeperServer@931] -
Client attempting to establish new session at /10.2.29.7:35942
2016-08-01 20:46:18,646 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.29.7:57696
2016-08-01 20:46:18,647 - INFO [NIOWorkerThread-2:ZooKeeperServer@931] -
Client attempting to establish new session at /10.2.29.7:57696
2016-08-01 20:46:20,223 - INFO [ConnnectionExpirer:NIOServerCnxn@1007] -
Closed socket connection for client /10.2.29.7:35908 which had sessionid
0x10044f682c45be2
2016-08-01 20:46:20,224 - INFO [ConnnectionExpirer:NIOServerCnxn@1007] -
Closed socket connection for client /10.2.81.3:50662 which had sessionid
0x10044f682c45be4
2016-08-01 20:46:20,225 - INFO [ConnnectionExpirer:NIOServerCnxn@1007] -
Closed socket connection for client /10.2.81.3:50660 which had sessionid
0x10044f682c45be3
2016-08-01 20:46:21,255 - INFO
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
- Accepted socket connection from /10.2.81.3:50664
2016-08-01 20:46:21,256 - INFO [NIOWorkerThread-4:ZooKeeperServer@931] -
Client attempting to establish new session at /10.2.81.3:50664
{noformat}
Also, I would like to point out that I am running zookeeper as a part of
kubernetes cluster, so the issue may be caused by kubernetes internal proxy,
but I do not know why that would cause an issue...
> Race condition in commit processor leading to out of order request
> completion, xid mismatch on client.
> ------------------------------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-1863
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1863
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.5.0
> Reporter: Dutch T. Meyer
> Assignee: Dutch T. Meyer
> Priority: Blocker
> Fix For: 3.5.0
>
> Attachments: ZOOKEEPER-1863.patch, ZOOKEEPER-1863.patch,
> ZOOKEEPER-1863.patch, ZOOKEEPER-1863.patch, ZOOKEEPER-1863.patch,
> ZOOKEEPER-1863.patch, ZOOKEEPER-1863.patch, stack.17512
>
>
> In CommitProcessor.java processor, if we are at the primary request handler
> on line 167:
> {noformat}
> while (!stopped && !isWaitingForCommit() &&
> !isProcessingCommit() &&
> (request = queuedRequests.poll()) != null) {
> if (needCommit(request)) {
> nextPending.set(request);
> } else {
> sendToNextProcessor(request);
> }
> }
> {noformat}
> A request can be handled in this block and be quickly processed and completed
> on another thread. If queuedRequests is empty, we then exit the block. Next,
> before this thread makes any more progress, we can get 2 more requests, one
> get_children(say), and a sync placed on queuedRequests for the processor.
> Then, if we are very unlucky, the sync request can complete and this object's
> commit() routine is called (from FollowerZookeeperServer), which places the
> sync request on the previously empty committedRequests queue. At that point,
> this thread continues.
> We reach line 182, which is a check on sync requests.
> {noformat}
> if (!stopped && !isProcessingRequest() &&
> (request = committedRequests.poll()) != null) {
> {noformat}
> Here we are not processing any requests, because the original request has
> completed. We haven't dequeued either the read or the sync request in this
> processor. Next, the poll above will pull the sync request off the queue, and
> in the following block, the sync will get forwarded to the next processor.
> This is a problem because the read request hasn't been forwarded yet, so
> requests are now out of order.
> I've been able to reproduce this bug reliably by injecting a
> Thread.sleep(5000) between the two blocks above to make the race condition
> far more likely, then in a client program.
> {noformat}
> zoo_aget_children(zh, "/", 0, getchildren_cb, NULL);
> //Wait long enough for queuedRequests to drain
> sleep(1);
> zoo_aget_children(zh, "/", 0, getchildren_cb, &th_ctx[0]);
> zoo_async(zh, "/", sync_cb, &th_ctx[0]);
> {noformat}
> When this bug is triggered, 3 things can happen:
> 1) Clients will see requests complete out of order and fail on xid mismatches.
> 2) Kazoo in particular doesn't handle this runtime exception well, and can
> orphan outstanding requests.
> 3) I've seen zookeeper servers deadlock, likely because the commit cannot be
> completed, which can wedge the commit processor.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)