[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15402819#comment-15402819
 ] 

Artem Golotin commented on ZOOKEEPER-1863:
------------------------------------------

Hi! I was able to reproduce the same issue as mentioned in 
[ZOOKEEPER-2151|https://issues.apache.org/jira/browse/ZOOKEEPER-2151]. Here is 
the output from the stat command: 
{noformat}
Zookeeper version: 3.5.1-alpha--1, built on 07/08/2016 17:08 GMT
Clients:
 /0:0:0:0:0:0:0:1:59312[0](queued=0,recved=1,sent=0)
 /10.2.29.7:34266[0](queued=0,recved=1,sent=0)
 /10.2.29.7:55996[0](queued=0,recved=1,sent=0)
 /10.2.81.3:48640[0](queued=0,recved=1,sent=0)

Latency min/avg/max: 0/1/826
Received: 2199007
Sent: 935511
Connections: 4
Outstanding: 48112
Zxid: 0x100011d4e
Mode: leader
Node count: 171
{noformat}
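For anyone wanting to watch this counter over time, the output above can also be fetched and checked programmatically. Below is a minimal Java sketch (the `StatCheck` class and `parseOutstanding` helper are my own illustration, not part of ZooKeeper's API) that sends the `stat` four-letter word over a plain TCP socket and extracts the Outstanding counter:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.Socket;

public class StatCheck {
    // Send the `stat` four-letter word and return the server's raw reply.
    static String stat(String host, int port) throws IOException {
        try (Socket s = new Socket(host, port)) {
            s.getOutputStream().write("stat".getBytes());
            s.getOutputStream().flush();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            InputStream in = s.getInputStream();
            byte[] buf = new byte[4096];
            for (int n; (n = in.read(buf)) > 0; ) {
                out.write(buf, 0, n);
            }
            return out.toString();
        }
    }

    // Extract the "Outstanding: N" counter from the reply; -1 if absent.
    static long parseOutstanding(String statOutput) {
        for (String line : statOutput.split("\n")) {
            if (line.startsWith("Outstanding:")) {
                return Long.parseLong(
                    line.substring("Outstanding:".length()).trim());
            }
        }
        return -1;
    }
}
```

On a healthy server the Outstanding count normally hovers near zero; a value like 48112 that never drains is the symptom worth alerting on.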

Notice the number of outstanding requests.
After examining the code, I confirmed that the patch submitted with this issue 
had been applied. With my limited experience I do not expect to be able to 
reproduce the issue on demand, but I am saving the full ZooKeeper stdout 
output and can provide it if needed. However, I cannot provide a jstack dump, 
because attempting one (even with the -F option) fails with the following 
exception:
{noformat}
Error attaching to process: sun.jvm.hotspot.debugger.DebuggerException: Can't 
attach to the process
{noformat}

Unfortunately, the issue seems to have occurred on Sunday, when I was not able 
to check the health of my cluster or apply any immediate fixes. I have 
several Kazoo clients + Kafka + Secor using ZooKeeper, and all of them are now 
failing while ZooKeeper outputs the following logs:

{noformat}
2016-08-01 20:45:51,828 - INFO  
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /10.2.81.3:50656
2016-08-01 20:45:51,829 - INFO  [NIOWorkerThread-4:ZooKeeperServer@931] - 
Client attempting to establish new session at /10.2.81.3:50656
2016-08-01 20:45:53,723 - INFO  
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /10.2.29.7:35862
2016-08-01 20:45:53,725 - INFO  [NIOWorkerThread-3:ZooKeeperServer@931] - 
Client attempting to establish new session at /10.2.29.7:35862
2016-08-01 20:45:59,171 - INFO  
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /10.2.81.3:50658
2016-08-01 20:45:59,172 - INFO  [NIOWorkerThread-1:ZooKeeperServer@931] - 
Client attempting to establish new session at /10.2.81.3:50658
2016-08-01 20:46:00,223 - INFO  [ConnnectionExpirer:NIOServerCnxn@1007] - 
Closed socket connection for client /10.2.29.7:35828 which had sessionid 
0x10044f682c45bdd
2016-08-01 20:46:00,223 - INFO  [ConnnectionExpirer:NIOServerCnxn@1007] - 
Closed socket connection for client /10.2.81.3:50656 which had sessionid 
0x10044f682c45bdf
2016-08-01 20:46:05,013 - INFO  
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /10.2.29.7:35898
2016-08-01 20:46:05,028 - INFO  [NIOWorkerThread-2:NIOServerCnxn@836] - 
Processing ruok command from /10.2.29.7:35898
2016-08-01 20:46:05,029 - INFO  [NIOWorkerThread-2:NIOServerCnxn@1007] - Closed 
socket connection for client /10.2.29.7:35898 (no session established for 
client)
2016-08-01 20:46:05,146 - INFO  
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /10.2.29.7:35900
2016-08-01 20:46:05,153 - INFO  [NIOWorkerThread-4:NIOServerCnxn@836] - 
Processing srvr command from /10.2.29.7:35900
2016-08-01 20:46:05,153 - INFO  [NIOWorkerThread-4:NIOServerCnxn@1007] - Closed 
socket connection for client /10.2.29.7:35900 (no session established for 
client)
2016-08-01 20:46:05,617 - INFO  
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /10.2.29.7:35904
2016-08-01 20:46:05,617 - INFO  [NIOWorkerThread-3:NIOServerCnxn@836] - 
Processing ruok command from /10.2.29.7:35904
2016-08-01 20:46:05,618 - INFO  [NIOWorkerThread-3:NIOServerCnxn@1007] - Closed 
socket connection for client /10.2.29.7:35904 (no session established for 
client)
2016-08-01 20:46:05,634 - INFO  
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /10.2.29.7:35906
2016-08-01 20:46:05,634 - INFO  [NIOWorkerThread-1:NIOServerCnxn@836] - 
Processing srvr command from /10.2.29.7:35906
2016-08-01 20:46:05,636 - INFO  [NIOWorkerThread-1:NIOServerCnxn@1007] - Closed 
socket connection for client /10.2.29.7:35906 (no session established for 
client)
2016-08-01 20:46:05,644 - INFO  
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /10.2.29.7:35908
2016-08-01 20:46:05,644 - INFO  [NIOWorkerThread-2:ZooKeeperServer@931] - 
Client attempting to establish new session at /10.2.29.7:35908
2016-08-01 20:46:06,718 - INFO  
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /10.2.81.3:50660
2016-08-01 20:46:06,719 - INFO  [NIOWorkerThread-4:ZooKeeperServer@931] - 
Client attempting to establish new session at /10.2.81.3:50660
2016-08-01 20:46:10,224 - INFO  [ConnnectionExpirer:NIOServerCnxn@1007] - 
Closed socket connection for client /10.2.81.3:50658 which had sessionid 
0x10044f682c45be1
2016-08-01 20:46:10,225 - INFO  [ConnnectionExpirer:NIOServerCnxn@1007] - 
Closed socket connection for client /10.2.29.7:35862 which had sessionid 
0x10044f682c45be0
2016-08-01 20:46:13,933 - INFO  
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /10.2.81.3:50662
2016-08-01 20:46:13,933 - INFO  [NIOWorkerThread-3:ZooKeeperServer@931] - 
Client attempting to establish new session at /10.2.81.3:50662
2016-08-01 20:46:16,030 - INFO  
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /10.2.29.7:35942
2016-08-01 20:46:16,032 - INFO  [NIOWorkerThread-1:ZooKeeperServer@931] - 
Client attempting to establish new session at /10.2.29.7:35942
2016-08-01 20:46:18,646 - INFO  
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /10.2.29.7:57696
2016-08-01 20:46:18,647 - INFO  [NIOWorkerThread-2:ZooKeeperServer@931] - 
Client attempting to establish new session at /10.2.29.7:57696
2016-08-01 20:46:20,223 - INFO  [ConnnectionExpirer:NIOServerCnxn@1007] - 
Closed socket connection for client /10.2.29.7:35908 which had sessionid 
0x10044f682c45be2
2016-08-01 20:46:20,224 - INFO  [ConnnectionExpirer:NIOServerCnxn@1007] - 
Closed socket connection for client /10.2.81.3:50662 which had sessionid 
0x10044f682c45be4
2016-08-01 20:46:20,225 - INFO  [ConnnectionExpirer:NIOServerCnxn@1007] - 
Closed socket connection for client /10.2.81.3:50660 which had sessionid 
0x10044f682c45be3
2016-08-01 20:46:21,255 - INFO  
[NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /10.2.81.3:50664
2016-08-01 20:46:21,256 - INFO  [NIOWorkerThread-4:ZooKeeperServer@931] - 
Client attempting to establish new session at /10.2.81.3:50664
{noformat}
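To quantify the churn visible in excerpts like the one above, here is a small, hypothetical (not ZooKeeper-provided) counter over the raw log text. Note that "ConnnectionExpirer", with three n's, is copied exactly as the thread name appears in the log:

```java
public class ChurnCount {
    // Count how many connections were accepted and how many the expirer
    // closed in a log excerpt. A high expired/accepted ratio over a short
    // window suggests clients are reconnecting faster than their sessions
    // can be kept alive.
    static int[] counts(String log) {
        int accepted = 0, expired = 0;
        for (String line : log.split("\n")) {
            if (line.contains("Accepted socket connection")) {
                accepted++;
            } else if (line.contains("ConnnectionExpirer")
                    && line.contains("Closed socket connection")) {
                expired++;
            }
        }
        return new int[] { accepted, expired };
    }
}
```

In the roughly 30-second excerpt above, almost every accepted session is expired within about 10 seconds, which is why the Kazoo/Kafka/Secor clients never make progress.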

Also, I would like to point out that I am running ZooKeeper as part of a 
Kubernetes cluster, so the issue may be caused by the Kubernetes internal 
proxy, though I do not know why that would cause a problem.

> Race condition in commit processor leading to out of order request 
> completion, xid mismatch on client.
> ------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1863
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1863
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.5.0
>            Reporter: Dutch T. Meyer
>            Assignee: Dutch T. Meyer
>            Priority: Blocker
>             Fix For: 3.5.0
>
>         Attachments: ZOOKEEPER-1863.patch, ZOOKEEPER-1863.patch, 
> ZOOKEEPER-1863.patch, ZOOKEEPER-1863.patch, ZOOKEEPER-1863.patch, 
> ZOOKEEPER-1863.patch, ZOOKEEPER-1863.patch, stack.17512
>
>
> In CommitProcessor.java processor, if we are at the primary request handler 
> on line 167:
> {noformat}
>                 while (!stopped && !isWaitingForCommit() &&
>                        !isProcessingCommit() &&
>                        (request = queuedRequests.poll()) != null) {
>                     if (needCommit(request)) {
>                         nextPending.set(request);
>                     } else {
>                         sendToNextProcessor(request);
>                     }
>                 }
> {noformat}
> A request can be handled in this block and be quickly processed and completed 
> on another thread. If queuedRequests is empty, we then exit the block. Next, 
> before this thread makes any more progress, we can get 2 more requests, one 
> get_children(say), and a sync placed on queuedRequests for the processor. 
> Then, if we are very unlucky, the sync request can complete and this object's 
> commit() routine is called (from FollowerZookeeperServer), which places the 
> sync request on the previously empty committedRequests queue. At that point, 
> this thread continues.
> We reach line 182, which is a check on sync requests.
> {noformat}
>                 if (!stopped && !isProcessingRequest() &&
>                     (request = committedRequests.poll()) != null) {
> {noformat}
> Here we are not processing any requests, because the original request has 
> completed. We haven't dequeued either the read or the sync request in this 
> processor. Next, the poll above will pull the sync request off the queue, and 
> in the following block, the sync will get forwarded to the next processor.
> This is a problem because the read request hasn't been forwarded yet, so 
> requests are now out of order.
> I've been able to reproduce this bug reliably by injecting a 
> Thread.sleep(5000) between the two blocks above to make the race condition 
> far more likely, then running the following in a client program:
> {noformat}
>         zoo_aget_children(zh, "/", 0, getchildren_cb, NULL);
>         //Wait long enough for queuedRequests to drain
>         sleep(1);
>         zoo_aget_children(zh, "/", 0, getchildren_cb, &th_ctx[0]);
>         zoo_async(zh, "/", sync_cb, &th_ctx[0]);
> {noformat}
> When this bug is triggered, 3 things can happen:
> 1) Clients will see requests complete out of order and fail on xid mismatches.
> 2) Kazoo in particular doesn't handle this runtime exception well, and can 
> orphan outstanding requests.
> 3) I've seen zookeeper servers deadlock, likely because the commit cannot be 
> completed, which can wedge the commit processor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)