ZooKeeper-trunk-solaris - Build # 801 - Still Failing
See https://builds.apache.org/job/ZooKeeper-trunk-solaris/801/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 216696 lines...] [junit] 2014-01-26 09:05:05,073 [myid:] - INFO [main:SyncRequestProcessor@190] - Shutting down [junit] 2014-01-26 09:05:05,073 [myid:] - INFO [ProcessThread(sid:0 cport:-1)::PrepRequestProcessor@156] - PrepRequestProcessor exited loop! [junit] 2014-01-26 09:05:05,073 [myid:] - INFO [SyncThread:0:SyncRequestProcessor@168] - SyncRequestProcessor exited! [junit] 2014-01-26 09:05:05,074 [myid:] - INFO [main:FinalRequestProcessor@454] - shutdown of request processor complete [junit] 2014-01-26 09:05:05,074 [myid:] - INFO [main:FourLetterWordMain@43] - connecting to 127.0.0.1 11221 [junit] 2014-01-26 09:05:05,075 [myid:] - INFO [main:JMXEnv@142] - ensureOnly:[] [junit] 2014-01-26 09:05:05,076 [myid:] - INFO [main:ClientBase@443] - STARTING server [junit] 2014-01-26 09:05:05,076 [myid:] - INFO [main:ClientBase@364] - CREATING server instance 127.0.0.1:11221 [junit] 2014-01-26 09:05:05,077 [myid:] - INFO [main:NIOServerCnxnFactory@670] - Configuring NIO connection handler with 10s sessionless connection timeout, 2 selector thread(s), 16 worker threads, and 64 kB direct buffers. [junit] 2014-01-26 09:05:05,078 [myid:] - INFO [main:NIOServerCnxnFactory@683] - binding to port 0.0.0.0/0.0.0.0:11221 [junit] 2014-01-26 09:05:05,078 [myid:] - INFO [main:ClientBase@339] - STARTING server instance 127.0.0.1:11221 [junit] 2014-01-26 09:05:05,079 [myid:] - INFO [main:ZooKeeperServer@149] - Created server with tickTime 3000 minSessionTimeout 6000 maxSessionTimeout 6 datadir /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/ZooKeeper-trunk-solaris/trunk/build/test/tmp/test3988846879111801612.junit.dir/version-2 snapdir /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/ZooKeeper-trunk-solaris/trunk/build/test/tmp/test3988846879111801612.junit.dir/version-2 [junit] 2014-01-26 09:05:05,080 [myid:] - INFO [main:FileSnap@83] - Reading snapshot /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/ZooKeeper-trunk-solaris/trunk/build/test/tmp/test3988846879111801612.junit.dir/version-2/snapshot.b [junit] 2014-01-26 09:05:05,082 [myid:] - INFO [main:FileTxnSnapLog@297] - Snapshotting: 0xb to /zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/ZooKeeper-trunk-solaris/trunk/build/test/tmp/test3988846879111801612.junit.dir/version-2/snapshot.b [junit] 2014-01-26 09:05:05,084 [myid:] - INFO [main:FourLetterWordMain@43] - connecting to 127.0.0.1 11221 [junit] 2014-01-26 09:05:05,084 [myid:] - INFO [NIOServerCxnFactory.AcceptThread:0.0.0.0/0.0.0.0:11221:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /127.0.0.1:62892 [junit] 2014-01-26 09:05:05,085 [myid:] - INFO [NIOWorkerThread-1:NIOServerCnxn@835] - Processing stat command from /127.0.0.1:62892 [junit] 2014-01-26 09:05:05,085 [myid:] - INFO [NIOWorkerThread-1:NIOServerCnxn$StatCommand@684] - Stat command output [junit] 2014-01-26 09:05:05,086 [myid:] - INFO [NIOWorkerThread-1:NIOServerCnxn@1006] - Closed socket connection for client /127.0.0.1:62892 (no session established for client) [junit] 2014-01-26 09:05:05,086 [myid:] - INFO [main:JMXEnv@224] - ensureParent:[InMemoryDataTree, StandaloneServer_port] [junit] 2014-01-26 09:05:05,087 [myid:] - INFO [main:JMXEnv@241] - expect:InMemoryDataTree [junit] 2014-01-26 09:05:05,087 [myid:] - INFO [main:JMXEnv@245] - found:InMemoryDataTree org.apache.ZooKeeperService:name0=StandaloneServer_port-1,name1=InMemoryDataTree [junit] 2014-01-26 09:05:05,088 [myid:] - INFO [main:JMXEnv@241] - expect:StandaloneServer_port [junit] 2014-01-26 09:05:05,088 [myid:] - INFO [main:JMXEnv@245] - found:StandaloneServer_port org.apache.ZooKeeperService:name0=StandaloneServer_port-1 [junit] 2014-01-26 09:05:05,088 [myid:] - INFO [main:JUnit4ZKTestRunner$LoggedInvokeMethod@55] - Memory used 13365 [junit] 2014-01-26 09:05:05,088 [myid:] - INFO [main:JUnit4ZKTestRunner$LoggedInvokeMethod@60] - Number of threads 24 [junit] 2014-01-26 09:05:05,089 [myid:] - INFO [main:JUnit4ZKTestRunner$LoggedInvokeMethod@65] - FINISHED TEST METHOD testQuota [junit] 2014-01-26 09:05:05,089 [myid:] - INFO [main:ClientBase@520] - tearDown starting [junit] 2014-01-26 09:05:05,150 [myid:] - INFO [main:ZooKeeper@954] - Session: 0x143cdcb894e closed [junit] 2014-01-26 09:05:05,150 [myid:] - INFO [main-EventThread:ClientCnxn$EventThread@533] - EventThread shut down [junit] 2014-01-26 09:05:05,151 [myid:] - INFO [main:ClientBase@490] - STOPPING server [junit] 2014-01-26 09:05:05,152 [myid:] - INFO
ZooKeeper-3.4-WinVS2008_java - Build # 418 - Still Failing
See https://builds.apache.org/job/ZooKeeper-3.4-WinVS2008_java/418/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 182575 lines...] [junit] 2014-01-26 09:29:20,625 [myid:] - INFO [main:ClientBase@364] - CREATING server instance 127.0.0.1:11221 [junit] 2014-01-26 09:29:20,626 [myid:] - INFO [main:NIOServerCnxnFactory@94] - binding to port 0.0.0.0/0.0.0.0:11221 [junit] 2014-01-26 09:29:20,627 [myid:] - INFO [main:ClientBase@339] - STARTING server instance 127.0.0.1:11221 [junit] 2014-01-26 09:29:20,627 [myid:] - INFO [main:ZooKeeperServer@162] - Created server with tickTime 3000 minSessionTimeout 6000 maxSessionTimeout 6 datadir f:\hudson\hudson-slave\workspace\ZooKeeper-3.4-WinVS2008_java\branch-3.4\build\test\tmp\test7812772836565187963.junit.dir\version-2 snapdir f:\hudson\hudson-slave\workspace\ZooKeeper-3.4-WinVS2008_java\branch-3.4\build\test\tmp\test7812772836565187963.junit.dir\version-2 [junit] 2014-01-26 09:29:20,630 [myid:] - INFO [main:FourLetterWordMain@43] - connecting to 127.0.0.1 11221 [junit] 2014-01-26 09:29:20,631 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11221:NIOServerCnxnFactory@197] - Accepted socket connection from /127.0.0.1:52074 [junit] 2014-01-26 09:29:20,632 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11221:NIOServerCnxn@827] - Processing stat command from /127.0.0.1:52074 [junit] 2014-01-26 09:29:20,632 [myid:] - INFO [Thread-4:NIOServerCnxn$StatCommand@663] - Stat command output [junit] 2014-01-26 09:29:20,634 [myid:] - INFO [Thread-4:NIOServerCnxn@1007] - Closed socket connection for client /127.0.0.1:52074 (no session established for client) [junit] 2014-01-26 09:29:20,635 [myid:] - INFO [main:JMXEnv@225] - ensureParent:[InMemoryDataTree, StandaloneServer_port] [junit] 2014-01-26 09:29:20,636 [myid:] - INFO [main:JMXEnv@242] - expect:InMemoryDataTree [junit] 2014-01-26 09:29:20,637 [myid:] - INFO [main:JMXEnv@246] - found:InMemoryDataTree org.apache.ZooKeeperService:name0=StandaloneServer_port-1,name1=InMemoryDataTree [junit] 2014-01-26 09:29:20,637 [myid:] - INFO [main:JMXEnv@242] - expect:StandaloneServer_port [junit] 2014-01-26 09:29:20,637 [myid:] - INFO [main:JMXEnv@246] - found:StandaloneServer_port org.apache.ZooKeeperService:name0=StandaloneServer_port-1 [junit] 2014-01-26 09:29:20,637 [myid:] - INFO [main:JUnit4ZKTestRunner$LoggedInvokeMethod@55] - Memory used 9365 [junit] 2014-01-26 09:29:20,637 [myid:] - INFO [main:JUnit4ZKTestRunner$LoggedInvokeMethod@60] - Number of threads 21 [junit] 2014-01-26 09:29:20,638 [myid:] - INFO [main:JUnit4ZKTestRunner$LoggedInvokeMethod@65] - FINISHED TEST METHOD testQuota [junit] 2014-01-26 09:29:20,638 [myid:] - INFO [main:ClientBase@520] - tearDown starting [junit] 2014-01-26 09:29:20,784 [myid:] - INFO [main-SendThread(127.0.0.1:11221):ClientCnxn$SendThread@849] - Socket connection established to 127.0.0.1/127.0.0.1:11221, initiating session [junit] 2014-01-26 09:29:20,784 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11221:NIOServerCnxnFactory@197] - Accepted socket connection from /127.0.0.1:52066 [junit] 2014-01-26 09:29:20,784 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11221:ZooKeeperServer@861] - Client attempting to renew session 0x143cde1b6cc at /127.0.0.1:52066 [junit] 2014-01-26 09:29:20,785 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11221:ZooKeeperServer@617] - Established session 0x143cde1b6cc with negotiated timeout 3 for client /127.0.0.1:52066 [junit] 2014-01-26 09:29:20,785 [myid:] - INFO [main-SendThread(127.0.0.1:11221):ClientCnxn$SendThread@1228] - Session establishment complete on server 127.0.0.1/127.0.0.1:11221, sessionid = 0x143cde1b6cc, negotiated timeout = 3 [junit] 2014-01-26 09:29:20,786 [myid:] - INFO [ProcessThread(sid:0 cport:-1)::PrepRequestProcessor@494] - Processed session termination for sessionid: 0x143cde1b6cc [junit] 2014-01-26 09:29:20,786 [myid:] - INFO [SyncThread:0:FileTxnLog@199] - Creating new log file: log.c [junit] 2014-01-26 09:29:20,789 [myid:] - INFO [main:ZooKeeper@684] - Session: 0x143cde1b6cc closed [junit] 2014-01-26 09:29:20,789 [myid:] - INFO [main:ClientBase@490] - STOPPING server [junit] 2014-01-26 09:29:20,789 [myid:] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11221:NIOServerCnxn@357] - caught end of stream exception [junit] EndOfStreamException: Unable to read additional data from client sessionid 0x143cde1b6cc, likely client has closed socket [junit] at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228) [junit] at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208) [junit] at
[jira] [Commented] (ZOOKEEPER-1830) Support command line shell for removing watches
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13882312#comment-13882312 ] Rakesh R commented on ZOOKEEPER-1830: - Thanks again [~rgs] for the review and your time. Hi [~phunt], Please have a look at it and would like to know your thoughts. Support command line shell for removing watches --- Key: ZOOKEEPER-1830 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1830 Project: ZooKeeper Issue Type: Sub-task Reporter: Rakesh R Assignee: Rakesh R Priority: Critical Fix For: 3.5.0 Attachments: ZOOKEEPER-1830.patch, ZOOKEEPER-1830.patch This JIRA to discuss the command line shell for removing watches. Makes it easier to do ad-hoc testing. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
Re: [jira] [Commented] (ZOOKEEPER-1863) Race condition in commit processor leading to out of order request completion, xid mismatch on client.
Actually probably accounts for both places. Sounds like you know this Thawan, question is why do we process a commit that we're not waiting on? When should we do that? On Jan 26, 2014 11:28 AM, cami...@apache.org wrote: I think I've identified the problem. In CommitProcessor line 203, we set the nextPending to null. But we never set it to null in the else case below on lines 205-210. It doesn't appear to be removed anywhere else in the file so the processor thinks it is always waiting for commit and will never continue. That seems to match the stack anyway. Not sure how this causes the xid mismatch yet though. On Jan 24, 2014 10:35 PM, Thawan Kooburat (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/ZOOKEEPER-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13881636#comment-13881636] Thawan Kooburat commented on ZOOKEEPER-1863: I have seen a Commit Processor getting stuck in our prod (which run our internal branch) I spent a few days digging into the problem but couldn't locate the root cause. The sequence of action that you put in description is very unlikely to occur in quorum mode. First, the Follower/ObserverReuestProcessor which is in the front of the CommitProcessor put a request into queuedRequests even before sending it out to the leader. It need at least a network round trip ( or a full quorum vote) before the same request will comeback from a leader and get put into commitRequest. This is the assumption that even the original CommitProcessor (prior to ZOOKEEPER-1505) rely on. However, a combination of bad thread scheduling and long GC pause might break this assumption. Sync request is special unlike other write request because it doesn't require quorum voting, but I still don't think it matter in this case. Again, since I saw this in prod but I am unable to repro it. I did add a background thread to detect a request stuck in nextPending for extended period of time and kill the server if it is the case. I can post the patch if we are able unable find the root cause. You can also capture a heap dump of server to inspect which request get stuck (at nextPending) and correlated the possible event. Race condition in commit processor leading to out of order request completion, xid mismatch on client. -- Key: ZOOKEEPER-1863 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1863 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.5.0 Reporter: Dutch T. Meyer Priority: Blocker Attachments: stack.17512 In CommitProcessor.java processor, if we are at the primary request handler on line 167: {noformat} while (!stopped !isWaitingForCommit() !isProcessingCommit() (request = queuedRequests.poll()) != null) { if (needCommit(request)) { nextPending.set(request); } else { sendToNextProcessor(request); } } {noformat} A request can be handled in this block and be quickly processed and completed on another thread. If queuedRequests is empty, we then exit the block. Next, before this thread makes any more progress, we can get 2 more requests, one get_children(say), and a sync placed on queuedRequests for the processor. Then, if we are very unlucky, the sync request can complete and this object's commit() routine is called (from FollowerZookeeperServer), which places the sync request on the previously empty committedRequests queue. At that point, this thread continues. We reach line 182, which is a check on sync requests. {noformat} if (!stopped !isProcessingRequest() (request = committedRequests.poll()) != null) { {noformat} Here we are not processing any requests, because the original request has completed. We haven't dequeued either the read or the sync request in this processor. Next, the poll above will pull the sync request off the queue, and in the following block, the sync will get forwarded to the next processor. This is a problem because the read request hasn't been forwarded yet, so requests are now out of order. I've been able to reproduce this bug reliably by injecting a Thread.sleep(5000) between the two blocks above to make the race condition far more likely, then in a client program. {noformat} zoo_aget_children(zh, /, 0, getchildren_cb, NULL); //Wait long enough for queuedRequests to drain sleep(1); zoo_aget_children(zh, /, 0, getchildren_cb, th_ctx[0]); zoo_async(zh, /, sync_cb, th_ctx[0]); {noformat} When this bug is triggered, 3 things can happen: 1)
Re: [jira] [Commented] (ZOOKEEPER-1863) Race condition in commit processor leading to out of order request completion, xid mismatch on client.
I think I've identified the problem. In CommitProcessor line 203, we set the nextPending to null. But we never set it to null in the else case below on lines 205-210. It doesn't appear to be removed anywhere else in the file so the processor thinks it is always waiting for commit and will never continue. That seems to match the stack anyway. Not sure how this causes the xid mismatch yet though. On Jan 24, 2014 10:35 PM, Thawan Kooburat (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/ZOOKEEPER-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13881636#comment-13881636] Thawan Kooburat commented on ZOOKEEPER-1863: I have seen a Commit Processor getting stuck in our prod (which run our internal branch) I spent a few days digging into the problem but couldn't locate the root cause. The sequence of action that you put in description is very unlikely to occur in quorum mode. First, the Follower/ObserverReuestProcessor which is in the front of the CommitProcessor put a request into queuedRequests even before sending it out to the leader. It need at least a network round trip ( or a full quorum vote) before the same request will comeback from a leader and get put into commitRequest. This is the assumption that even the original CommitProcessor (prior to ZOOKEEPER-1505) rely on. However, a combination of bad thread scheduling and long GC pause might break this assumption. Sync request is special unlike other write request because it doesn't require quorum voting, but I still don't think it matter in this case. Again, since I saw this in prod but I am unable to repro it. I did add a background thread to detect a request stuck in nextPending for extended period of time and kill the server if it is the case. I can post the patch if we are able unable find the root cause. You can also capture a heap dump of server to inspect which request get stuck (at nextPending) and correlated the possible event. Race condition in commit processor leading to out of order request completion, xid mismatch on client. -- Key: ZOOKEEPER-1863 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1863 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.5.0 Reporter: Dutch T. Meyer Priority: Blocker Attachments: stack.17512 In CommitProcessor.java processor, if we are at the primary request handler on line 167: {noformat} while (!stopped !isWaitingForCommit() !isProcessingCommit() (request = queuedRequests.poll()) != null) { if (needCommit(request)) { nextPending.set(request); } else { sendToNextProcessor(request); } } {noformat} A request can be handled in this block and be quickly processed and completed on another thread. If queuedRequests is empty, we then exit the block. Next, before this thread makes any more progress, we can get 2 more requests, one get_children(say), and a sync placed on queuedRequests for the processor. Then, if we are very unlucky, the sync request can complete and this object's commit() routine is called (from FollowerZookeeperServer), which places the sync request on the previously empty committedRequests queue. At that point, this thread continues. We reach line 182, which is a check on sync requests. {noformat} if (!stopped !isProcessingRequest() (request = committedRequests.poll()) != null) { {noformat} Here we are not processing any requests, because the original request has completed. We haven't dequeued either the read or the sync request in this processor. Next, the poll above will pull the sync request off the queue, and in the following block, the sync will get forwarded to the next processor. This is a problem because the read request hasn't been forwarded yet, so requests are now out of order. I've been able to reproduce this bug reliably by injecting a Thread.sleep(5000) between the two blocks above to make the race condition far more likely, then in a client program. {noformat} zoo_aget_children(zh, /, 0, getchildren_cb, NULL); //Wait long enough for queuedRequests to drain sleep(1); zoo_aget_children(zh, /, 0, getchildren_cb, th_ctx[0]); zoo_async(zh, /, sync_cb, th_ctx[0]); {noformat} When this bug is triggered, 3 things can happen: 1) Clients will see requests complete out of order and fail on xid mismatches. 2) Kazoo in particular doesn't handle this runtime
[jira] [Commented] (ZOOKEEPER-1863) Race condition in commit processor leading to out of order request completion, xid mismatch on client.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13882424#comment-13882424 ] Camille Fournier commented on ZOOKEEPER-1863: - OK Apologies, I (somewhat) understand why we don't care when nextPending doesn't match. So, I've played around with reproducing this. When I change the FollowerRequestProcessor to do this: case OpCode.sync: zks.pendingSyncs.add(request); zks.getFollower().request(request); zks.sync(); //this is what the leader will eventually cause us to call break; , randomly sleep the commit processor, and run the async hammer test with some extra reads and sleeps (just trying to force a reproduction), I can get the xid out of sync exception: java.io.IOException: Xid out of order. Got Xid 5 with err 0 expected Xid 1 for a packet with details: clientPath:/test- serverPath:/test- finished:false header:: 1,1 replyHeader:: 0,0,-4 request:: '/test-,,v{s{31,s{'world,'anyone}}},2 response:: at org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:914) at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:99) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:362) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1193) Note that this fails in the same way with the same changes in 3.4, so I don't think this is a new thing. Have yet to get it to hang or any other behavior except for the xid out of order. Feels like we may need to do some additional tracking of syncs inside the commit processor... Race condition in commit processor leading to out of order request completion, xid mismatch on client. -- Key: ZOOKEEPER-1863 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1863 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.5.0 Reporter: Dutch T. Meyer Priority: Blocker Attachments: stack.17512 In CommitProcessor.java processor, if we are at the primary request handler on line 167: {noformat} while (!stopped !isWaitingForCommit() !isProcessingCommit() (request = queuedRequests.poll()) != null) { if (needCommit(request)) { nextPending.set(request); } else { sendToNextProcessor(request); } } {noformat} A request can be handled in this block and be quickly processed and completed on another thread. If queuedRequests is empty, we then exit the block. Next, before this thread makes any more progress, we can get 2 more requests, one get_children(say), and a sync placed on queuedRequests for the processor. Then, if we are very unlucky, the sync request can complete and this object's commit() routine is called (from FollowerZookeeperServer), which places the sync request on the previously empty committedRequests queue. At that point, this thread continues. We reach line 182, which is a check on sync requests. {noformat} if (!stopped !isProcessingRequest() (request = committedRequests.poll()) != null) { {noformat} Here we are not processing any requests, because the original request has completed. We haven't dequeued either the read or the sync request in this processor. Next, the poll above will pull the sync request off the queue, and in the following block, the sync will get forwarded to the next processor. This is a problem because the read request hasn't been forwarded yet, so requests are now out of order. I've been able to reproduce this bug reliably by injecting a Thread.sleep(5000) between the two blocks above to make the race condition far more likely, then in a client program. {noformat} zoo_aget_children(zh, /, 0, getchildren_cb, NULL); //Wait long enough for queuedRequests to drain sleep(1); zoo_aget_children(zh, /, 0, getchildren_cb, th_ctx[0]); zoo_async(zh, /, sync_cb, th_ctx[0]); {noformat} When this bug is triggered, 3 things can happen: 1) Clients will see requests complete out of order and fail on xid mismatches. 2) Kazoo in particular doesn't handle this runtime exception well, and can orphan outstanding requests. 3) I've seen zookeeper servers deadlock, likely because the commit cannot be completed, which can wedge the commit processor. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (ZOOKEEPER-1833) fix windows build
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13882510#comment-13882510 ] Flavio Junqueira commented on ZOOKEEPER-1833: - One strange thing I noticed while analyzing the logs of build 417 is that the format seems to be wrong: {noformat} [WorkerReceiver[myid=3]:FastLeaderElection@597] - Notification: 73657276 (message format version),... {noformat} I haven't observed this while running on unix and from the code it doesn't seem possible, unless some messages are being sent partially. fix windows build - Key: ZOOKEEPER-1833 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1833 Project: ZooKeeper Issue Type: Bug Affects Versions: 3.4.5 Reporter: Michi Mutsuzaki Assignee: Michi Mutsuzaki Priority: Blocker Fix For: 3.4.6 Attachments: TEST-org.apache.zookeeper.test.QuorumTest.zip, ZOOKEEPER-1833-b3.4.patch, ZOOKEEPER-1833.patch, ZOOKEEPER-1833.patch A bunch of 3.4 tests are failing on windows. {noformat} [junit] 2013-12-06 08:40:59,692 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testEarlyLeaderAbandonment [junit] 2013-12-06 08:41:10,472 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testHighestZxidJoinLate [junit] 2013-12-06 08:45:31,085 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testUpdatingEpoch [junit] 2013-12-06 08:55:34,630 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testObserversHammer [junit] 2013-12-06 08:55:59,889 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testAsyncExistsFailure_NoNode [junit] 2013-12-06 08:56:00,571 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testAsyncGetACL [junit] 2013-12-06 08:56:02,626 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testAsyncGetChildrenEmpty [junit] 2013-12-06 08:56:03,491 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testAsyncGetChildrenSingle [junit] 2013-12-06 08:56:11,276 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testAsyncGetChildrenTwo [junit] 2013-12-06 08:56:13,878 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testAsyncGetChildrenFailure_NoNode [junit] 2013-12-06 08:56:16,294 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testAsyncGetChildren2Empty [junit] 2013-12-06 08:56:18,622 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testAsyncGetChildren2Single [junit] 2013-12-06 08:56:21,224 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testAsyncGetChildren2Two [junit] 2013-12-06 08:56:23,738 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testAsyncGetChildren2Failure_NoNode [junit] 2013-12-06 08:56:26,058 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testAsyncGetData [junit] 2013-12-06 08:56:28,482 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testAsyncGetDataFailure_NoNode [junit] 2013-12-06 08:57:35,527 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testStartupFailureCreate [junit] 2013-12-06 08:57:38,645 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testStartupFailureSet [junit] 2013-12-06 08:57:41,261 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testStartupFailureSnapshot [junit] 2013-12-06 08:59:22,222 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testClientWithWatcherObj [junit] 2013-12-06 09:00:05,592 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testClientCleanup [junit] 2013-12-06 09:01:24,113 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testBindByAddress [junit] 2013-12-06 09:02:14,123 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testClientwithoutWatcherObj [junit] 2013-12-06 09:05:56,461 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testZeroWeightQuorum [junit] 2013-12-06 09:08:18,747 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testResyncByDiffAfterFollowerCrashes [junit] 2013-12-06 09:09:42,271 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testFourLetterWords [junit] 2013-12-06 09:14:03,770 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testLE [junit] 2013-12-06 09:46:30,002 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testHierarchicalQuorum [junit] 2013-12-06 09:50:26,912 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testHammerBasic [junit] 2013-12-06 09:51:07,604 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testQuotaWithQuorum [junit] 2013-12-06 09:52:41,515 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testNull [junit] 2013-12-06 09:53:22,648 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testDeleteWithChildren [junit] 2013-12-06 09:56:49,061 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testClientwithoutWatcherObj [junit] 2013-12-06 09:58:27,705 [myid:] - INFO [main:ZKTestCase$1@65] - FAILED testGetView [junit] 2013-12-06
[jira] [Created] (ZOOKEEPER-1867) Bug on ZkDatabaseCorruptionTest
Flavio Junqueira created ZOOKEEPER-1867: --- Summary: Bug on ZkDatabaseCorruptionTest Key: ZOOKEEPER-1867 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1867 Project: ZooKeeper Issue Type: Sub-task Reporter: Flavio Junqueira If I'm reading the test case testCorruption right, it seems to depend on server 5 being elected, but if it is not the case, then it fails waiting for a server to be up. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Updated] (ZOOKEEPER-1867) Bug on ZkDatabaseCorruptionTest
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Flavio Junqueira updated ZOOKEEPER-1867: Attachment: ZOOKEEPER-1867.patch ZOOKEEPER-1867-b3.4.patch Simple fix to not wait for the leaderSid server. Bug on ZkDatabaseCorruptionTest --- Key: ZOOKEEPER-1867 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1867 Project: ZooKeeper Issue Type: Sub-task Reporter: Flavio Junqueira Fix For: 3.4.6 Attachments: ZOOKEEPER-1867-b3.4.patch, ZOOKEEPER-1867.patch If I'm reading the test case testCorruption right, it seems to depend on server 5 being elected, but if it is not the case, then it fails waiting for a server to be up. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (ZOOKEEPER-1867) Bug on ZkDatabaseCorruptionTest
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13882586#comment-13882586 ] Rakesh R commented on ZOOKEEPER-1867: - +1 looks fine for me. I'm giving 'submit patch' to see jenkins report. Bug on ZkDatabaseCorruptionTest --- Key: ZOOKEEPER-1867 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1867 Project: ZooKeeper Issue Type: Sub-task Reporter: Flavio Junqueira Fix For: 3.4.6 Attachments: ZOOKEEPER-1867-b3.4.patch, ZOOKEEPER-1867.patch If I'm reading the test case testCorruption right, it seems to depend on server 5 being elected, but if it is not the case, then it fails waiting for a server to be up. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
Success: ZOOKEEPER-1867 PreCommit Build #1899
Jira: https://issues.apache.org/jira/browse/ZOOKEEPER-1867 Build: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1899/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 292852 lines...] [exec] BUILD SUCCESSFUL [exec] Total time: 0 seconds [exec] [exec] [exec] [exec] [exec] +1 overall. Here are the results of testing the latest attachment [exec] http://issues.apache.org/jira/secure/attachment/12625308/ZOOKEEPER-1867.patch [exec] against trunk revision 1561236. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. [exec] [exec] +1 core tests. The patch passed core unit tests. [exec] [exec] +1 contrib tests. The patch passed contrib unit tests. [exec] [exec] Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1899//testReport/ [exec] Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1899//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html [exec] Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1899//console [exec] [exec] This message is automatically generated. [exec] [exec] [exec] == [exec] == [exec] Adding comment to Jira. [exec] == [exec] == [exec] [exec] [exec] Comment added. [exec] f14a36f2ed21715f76da184bb5d563621eb8cf42 logged out [exec] [exec] [exec] == [exec] == [exec] Finished build. [exec] == [exec] == [exec] [exec] BUILD SUCCESSFUL Total time: 35 minutes 44 seconds Archiving artifacts Recording test results Description set: ZOOKEEPER-1867 Email was triggered for: Success Sending email for trigger: Success ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Commented] (ZOOKEEPER-1867) Bug on ZkDatabaseCorruptionTest
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13882596#comment-13882596 ] Hadoop QA commented on ZOOKEEPER-1867: -- +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12625308/ZOOKEEPER-1867.patch against trunk revision 1561236. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1899//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1899//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1899//console This message is automatically generated. Bug on ZkDatabaseCorruptionTest --- Key: ZOOKEEPER-1867 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1867 Project: ZooKeeper Issue Type: Sub-task Reporter: Flavio Junqueira Assignee: Flavio Junqueira Fix For: 3.4.6, 3.5.0 Attachments: ZOOKEEPER-1867-b3.4.patch, ZOOKEEPER-1867.patch If I'm reading the test case testCorruption right, it seems to depend on server 5 being elected, but if it is not the case, then it fails waiting for a server to be up. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
ZooKeeper_branch33_solaris - Build # 778 - Still Failing
See https://builds.apache.org/job/ZooKeeper_branch33_solaris/778/ ### ## LAST 60 LINES OF THE CONSOLE ### Started by timer Building remotely on solaris1 in workspace /export/home/hudson/hudson-slave/workspace/ZooKeeper_branch33_solaris FATAL: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel hudson.remoting.RequestAbortedException: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel at hudson.remoting.RequestAbortedException.wrapForRethrow(RequestAbortedException.java:41) at hudson.remoting.RequestAbortedException.wrapForRethrow(RequestAbortedException.java:34) at hudson.remoting.Request.call(Request.java:174) at hudson.remoting.Channel.call(Channel.java:722) at hudson.EnvVars.getRemote(EnvVars.java:396) at hudson.model.Computer.getEnvironment(Computer.java:908) at jenkins.model.CoreEnvironmentContributor.buildEnvironmentFor(CoreEnvironmentContributor.java:29) at hudson.model.Run.getEnvironment(Run.java:2191) at hudson.model.AbstractBuild.getEnvironment(AbstractBuild.java:914) at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:781) at hudson.model.AbstractProject.checkout(AbstractProject.java:1414) at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:652) at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:88) at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:561) at hudson.model.Run.execute(Run.java:1678) at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46) at hudson.model.ResourceController.execute(ResourceController.java:88) at hudson.model.Executor.run(Executor.java:231) Caused by: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel at hudson.remoting.Request.abort(Request.java:299) at hudson.remoting.Channel.terminate(Channel.java:782) at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:69) Caused by: java.io.IOException: Unexpected termination of the channel at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50) Caused by: java.io.EOFException at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2598) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1318) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at hudson.remoting.Command.readFrom(Command.java:92) at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:71) at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48) ### ## FAILED TESTS (if any) ## No tests ran.
ZooKeeper_branch34_solaris - Build # 779 - Still Failing
See https://builds.apache.org/job/ZooKeeper_branch34_solaris/779/ ### ## LAST 60 LINES OF THE CONSOLE ### Started by timer Building remotely on solaris1 in workspace /export/home/hudson/hudson-slave/workspace/ZooKeeper_branch34_solaris FATAL: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel hudson.remoting.RequestAbortedException: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel at hudson.remoting.RequestAbortedException.wrapForRethrow(RequestAbortedException.java:41) at hudson.remoting.RequestAbortedException.wrapForRethrow(RequestAbortedException.java:34) at hudson.remoting.Request.call(Request.java:174) at hudson.remoting.Channel.call(Channel.java:722) at hudson.EnvVars.getRemote(EnvVars.java:396) at hudson.model.Computer.getEnvironment(Computer.java:908) at jenkins.model.CoreEnvironmentContributor.buildEnvironmentFor(CoreEnvironmentContributor.java:29) at hudson.model.Run.getEnvironment(Run.java:2191) at hudson.model.AbstractBuild.getEnvironment(AbstractBuild.java:914) at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:781) at hudson.model.AbstractProject.checkout(AbstractProject.java:1414) at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:652) at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:88) at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:561) at hudson.model.Run.execute(Run.java:1678) at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46) at hudson.model.ResourceController.execute(ResourceController.java:88) at hudson.model.Executor.run(Executor.java:231) Caused by: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel at hudson.remoting.Request.abort(Request.java:299) at hudson.remoting.Channel.terminate(Channel.java:782) at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:69) Caused by: java.io.IOException: Unexpected termination of the channel at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50) Caused by: java.io.EOFException at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2598) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1318) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at hudson.remoting.Command.readFrom(Command.java:92) at hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:71) at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48) ### ## FAILED TESTS (if any) ## No tests ran.
Jenkins build is back to normal : bookkeeper-trunk #526
See https://builds.apache.org/job/bookkeeper-trunk/526/