ZooKeeper-trunk-solaris - Build # 801 - Still Failing

2014-01-26 Thread Apache Jenkins Server
See https://builds.apache.org/job/ZooKeeper-trunk-solaris/801/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 216696 lines...]
[junit] 2014-01-26 09:05:05,073 [myid:] - INFO  
[main:SyncRequestProcessor@190] - Shutting down
[junit] 2014-01-26 09:05:05,073 [myid:] - INFO  [ProcessThread(sid:0 
cport:-1)::PrepRequestProcessor@156] - PrepRequestProcessor exited loop!
[junit] 2014-01-26 09:05:05,073 [myid:] - INFO  
[SyncThread:0:SyncRequestProcessor@168] - SyncRequestProcessor exited!
[junit] 2014-01-26 09:05:05,074 [myid:] - INFO  
[main:FinalRequestProcessor@454] - shutdown of request processor complete
[junit] 2014-01-26 09:05:05,074 [myid:] - INFO  
[main:FourLetterWordMain@43] - connecting to 127.0.0.1 11221
[junit] 2014-01-26 09:05:05,075 [myid:] - INFO  [main:JMXEnv@142] - 
ensureOnly:[]
[junit] 2014-01-26 09:05:05,076 [myid:] - INFO  [main:ClientBase@443] - 
STARTING server
[junit] 2014-01-26 09:05:05,076 [myid:] - INFO  [main:ClientBase@364] - 
CREATING server instance 127.0.0.1:11221
[junit] 2014-01-26 09:05:05,077 [myid:] - INFO  
[main:NIOServerCnxnFactory@670] - Configuring NIO connection handler with 10s 
sessionless connection timeout, 2 selector thread(s), 16 worker threads, and 64 
kB direct buffers.
[junit] 2014-01-26 09:05:05,078 [myid:] - INFO  
[main:NIOServerCnxnFactory@683] - binding to port 0.0.0.0/0.0.0.0:11221
[junit] 2014-01-26 09:05:05,078 [myid:] - INFO  [main:ClientBase@339] - 
STARTING server instance 127.0.0.1:11221
[junit] 2014-01-26 09:05:05,079 [myid:] - INFO  [main:ZooKeeperServer@149] 
- Created server with tickTime 3000 minSessionTimeout 6000 maxSessionTimeout 
6 datadir 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/ZooKeeper-trunk-solaris/trunk/build/test/tmp/test3988846879111801612.junit.dir/version-2
 snapdir 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/ZooKeeper-trunk-solaris/trunk/build/test/tmp/test3988846879111801612.junit.dir/version-2
[junit] 2014-01-26 09:05:05,080 [myid:] - INFO  [main:FileSnap@83] - 
Reading snapshot 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/ZooKeeper-trunk-solaris/trunk/build/test/tmp/test3988846879111801612.junit.dir/version-2/snapshot.b
[junit] 2014-01-26 09:05:05,082 [myid:] - INFO  [main:FileTxnSnapLog@297] - 
Snapshotting: 0xb to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/ZooKeeper-trunk-solaris/trunk/build/test/tmp/test3988846879111801612.junit.dir/version-2/snapshot.b
[junit] 2014-01-26 09:05:05,084 [myid:] - INFO  
[main:FourLetterWordMain@43] - connecting to 127.0.0.1 11221
[junit] 2014-01-26 09:05:05,084 [myid:] - INFO  
[NIOServerCxnFactory.AcceptThread:0.0.0.0/0.0.0.0:11221:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /127.0.0.1:62892
[junit] 2014-01-26 09:05:05,085 [myid:] - INFO  
[NIOWorkerThread-1:NIOServerCnxn@835] - Processing stat command from 
/127.0.0.1:62892
[junit] 2014-01-26 09:05:05,085 [myid:] - INFO  
[NIOWorkerThread-1:NIOServerCnxn$StatCommand@684] - Stat command output
[junit] 2014-01-26 09:05:05,086 [myid:] - INFO  
[NIOWorkerThread-1:NIOServerCnxn@1006] - Closed socket connection for client 
/127.0.0.1:62892 (no session established for client)
[junit] 2014-01-26 09:05:05,086 [myid:] - INFO  [main:JMXEnv@224] - 
ensureParent:[InMemoryDataTree, StandaloneServer_port]
[junit] 2014-01-26 09:05:05,087 [myid:] - INFO  [main:JMXEnv@241] - 
expect:InMemoryDataTree
[junit] 2014-01-26 09:05:05,087 [myid:] - INFO  [main:JMXEnv@245] - 
found:InMemoryDataTree 
org.apache.ZooKeeperService:name0=StandaloneServer_port-1,name1=InMemoryDataTree
[junit] 2014-01-26 09:05:05,088 [myid:] - INFO  [main:JMXEnv@241] - 
expect:StandaloneServer_port
[junit] 2014-01-26 09:05:05,088 [myid:] - INFO  [main:JMXEnv@245] - 
found:StandaloneServer_port 
org.apache.ZooKeeperService:name0=StandaloneServer_port-1
[junit] 2014-01-26 09:05:05,088 [myid:] - INFO  
[main:JUnit4ZKTestRunner$LoggedInvokeMethod@55] - Memory used 13365
[junit] 2014-01-26 09:05:05,088 [myid:] - INFO  
[main:JUnit4ZKTestRunner$LoggedInvokeMethod@60] - Number of threads 24
[junit] 2014-01-26 09:05:05,089 [myid:] - INFO  
[main:JUnit4ZKTestRunner$LoggedInvokeMethod@65] - FINISHED TEST METHOD testQuota
[junit] 2014-01-26 09:05:05,089 [myid:] - INFO  [main:ClientBase@520] - 
tearDown starting
[junit] 2014-01-26 09:05:05,150 [myid:] - INFO  [main:ZooKeeper@954] - 
Session: 0x143cdcb894e closed
[junit] 2014-01-26 09:05:05,150 [myid:] - INFO  
[main-EventThread:ClientCnxn$EventThread@533] - EventThread shut down
[junit] 2014-01-26 09:05:05,151 [myid:] - INFO  [main:ClientBase@490] - 
STOPPING server
[junit] 2014-01-26 09:05:05,152 [myid:] - INFO  

ZooKeeper-3.4-WinVS2008_java - Build # 418 - Still Failing

2014-01-26 Thread Apache Jenkins Server
See https://builds.apache.org/job/ZooKeeper-3.4-WinVS2008_java/418/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 182575 lines...]
[junit] 2014-01-26 09:29:20,625 [myid:] - INFO  [main:ClientBase@364] - 
CREATING server instance 127.0.0.1:11221
[junit] 2014-01-26 09:29:20,626 [myid:] - INFO  
[main:NIOServerCnxnFactory@94] - binding to port 0.0.0.0/0.0.0.0:11221
[junit] 2014-01-26 09:29:20,627 [myid:] - INFO  [main:ClientBase@339] - 
STARTING server instance 127.0.0.1:11221
[junit] 2014-01-26 09:29:20,627 [myid:] - INFO  [main:ZooKeeperServer@162] 
- Created server with tickTime 3000 minSessionTimeout 6000 maxSessionTimeout 
6 datadir 
f:\hudson\hudson-slave\workspace\ZooKeeper-3.4-WinVS2008_java\branch-3.4\build\test\tmp\test7812772836565187963.junit.dir\version-2
 snapdir 
f:\hudson\hudson-slave\workspace\ZooKeeper-3.4-WinVS2008_java\branch-3.4\build\test\tmp\test7812772836565187963.junit.dir\version-2
[junit] 2014-01-26 09:29:20,630 [myid:] - INFO  
[main:FourLetterWordMain@43] - connecting to 127.0.0.1 11221
[junit] 2014-01-26 09:29:20,631 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11221:NIOServerCnxnFactory@197] - 
Accepted socket connection from /127.0.0.1:52074
[junit] 2014-01-26 09:29:20,632 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11221:NIOServerCnxn@827] - Processing 
stat command from /127.0.0.1:52074
[junit] 2014-01-26 09:29:20,632 [myid:] - INFO  
[Thread-4:NIOServerCnxn$StatCommand@663] - Stat command output
[junit] 2014-01-26 09:29:20,634 [myid:] - INFO  
[Thread-4:NIOServerCnxn@1007] - Closed socket connection for client 
/127.0.0.1:52074 (no session established for client)
[junit] 2014-01-26 09:29:20,635 [myid:] - INFO  [main:JMXEnv@225] - 
ensureParent:[InMemoryDataTree, StandaloneServer_port]
[junit] 2014-01-26 09:29:20,636 [myid:] - INFO  [main:JMXEnv@242] - 
expect:InMemoryDataTree
[junit] 2014-01-26 09:29:20,637 [myid:] - INFO  [main:JMXEnv@246] - 
found:InMemoryDataTree 
org.apache.ZooKeeperService:name0=StandaloneServer_port-1,name1=InMemoryDataTree
[junit] 2014-01-26 09:29:20,637 [myid:] - INFO  [main:JMXEnv@242] - 
expect:StandaloneServer_port
[junit] 2014-01-26 09:29:20,637 [myid:] - INFO  [main:JMXEnv@246] - 
found:StandaloneServer_port 
org.apache.ZooKeeperService:name0=StandaloneServer_port-1
[junit] 2014-01-26 09:29:20,637 [myid:] - INFO  
[main:JUnit4ZKTestRunner$LoggedInvokeMethod@55] - Memory used 9365
[junit] 2014-01-26 09:29:20,637 [myid:] - INFO  
[main:JUnit4ZKTestRunner$LoggedInvokeMethod@60] - Number of threads 21
[junit] 2014-01-26 09:29:20,638 [myid:] - INFO  
[main:JUnit4ZKTestRunner$LoggedInvokeMethod@65] - FINISHED TEST METHOD testQuota
[junit] 2014-01-26 09:29:20,638 [myid:] - INFO  [main:ClientBase@520] - 
tearDown starting
[junit] 2014-01-26 09:29:20,784 [myid:] - INFO  
[main-SendThread(127.0.0.1:11221):ClientCnxn$SendThread@849] - Socket 
connection established to 127.0.0.1/127.0.0.1:11221, initiating session
[junit] 2014-01-26 09:29:20,784 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11221:NIOServerCnxnFactory@197] - 
Accepted socket connection from /127.0.0.1:52066
[junit] 2014-01-26 09:29:20,784 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11221:ZooKeeperServer@861] - Client 
attempting to renew session 0x143cde1b6cc at /127.0.0.1:52066
[junit] 2014-01-26 09:29:20,785 [myid:] - INFO  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11221:ZooKeeperServer@617] - Established 
session 0x143cde1b6cc with negotiated timeout 3 for client 
/127.0.0.1:52066
[junit] 2014-01-26 09:29:20,785 [myid:] - INFO  
[main-SendThread(127.0.0.1:11221):ClientCnxn$SendThread@1228] - Session 
establishment complete on server 127.0.0.1/127.0.0.1:11221, sessionid = 
0x143cde1b6cc, negotiated timeout = 3
[junit] 2014-01-26 09:29:20,786 [myid:] - INFO  [ProcessThread(sid:0 
cport:-1)::PrepRequestProcessor@494] - Processed session termination for 
sessionid: 0x143cde1b6cc
[junit] 2014-01-26 09:29:20,786 [myid:] - INFO  
[SyncThread:0:FileTxnLog@199] - Creating new log file: log.c
[junit] 2014-01-26 09:29:20,789 [myid:] - INFO  [main:ZooKeeper@684] - 
Session: 0x143cde1b6cc closed
[junit] 2014-01-26 09:29:20,789 [myid:] - INFO  [main:ClientBase@490] - 
STOPPING server
[junit] 2014-01-26 09:29:20,789 [myid:] - WARN  
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11221:NIOServerCnxn@357] - caught end of 
stream exception
[junit] EndOfStreamException: Unable to read additional data from client 
sessionid 0x143cde1b6cc, likely client has closed socket
[junit] at 
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
[junit] at 
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
[junit] at 

[jira] [Commented] (ZOOKEEPER-1830) Support command line shell for removing watches

2014-01-26 Thread Rakesh R (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13882312#comment-13882312
 ] 

Rakesh R commented on ZOOKEEPER-1830:
-

Thanks again [~rgs] for the review and your time.

Hi [~phunt], Please have a look at it and would like to know your thoughts.

 Support command line shell for removing watches
 ---

 Key: ZOOKEEPER-1830
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1830
 Project: ZooKeeper
  Issue Type: Sub-task
Reporter: Rakesh R
Assignee: Rakesh R
Priority: Critical
 Fix For: 3.5.0

 Attachments: ZOOKEEPER-1830.patch, ZOOKEEPER-1830.patch


 This JIRA to discuss the command line shell for removing watches. Makes it 
 easier to do ad-hoc testing.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: [jira] [Commented] (ZOOKEEPER-1863) Race condition in commit processor leading to out of order request completion, xid mismatch on client.

2014-01-26 Thread Camille Fournier
Actually probably accounts for both places. Sounds like you know this
Thawan, question is why do we process a commit that we're not waiting on?
When should we do that?
On Jan 26, 2014 11:28 AM, cami...@apache.org wrote:

I think I've identified the problem.
In CommitProcessor line 203, we set the nextPending to null. But we never
set it to null in the else case below on lines 205-210. It doesn't appear
to be removed anywhere else in the file so the processor thinks it is
always waiting for commit and will never continue.
That seems to match the stack anyway. Not sure how this causes the xid
mismatch yet though.
On Jan 24, 2014 10:35 PM, Thawan Kooburat (JIRA) j...@apache.org wrote:


[
https://issues.apache.org/jira/browse/ZOOKEEPER-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13881636#comment-13881636]

Thawan Kooburat commented on ZOOKEEPER-1863:


I have seen a Commit Processor getting stuck in our prod (which run our
internal branch) I spent a few days digging into the problem but couldn't
locate the root cause.

The sequence of action that you put in description is very unlikely to
occur in quorum mode.  First, the Follower/ObserverReuestProcessor which is
in the front of the CommitProcessor put a request into queuedRequests even
before sending it out to the leader.   It need at least a network round
trip ( or a full quorum vote) before the same request will comeback from a
leader and get put into commitRequest.  This is the assumption that even
the original CommitProcessor (prior to ZOOKEEPER-1505) rely on. However, a
combination of bad thread scheduling and long GC pause might break this
assumption.

Sync request is special unlike other write request because it doesn't
require quorum voting, but I still don't think it matter in this case.

Again, since I saw this in prod but I am unable to repro it. I did add a
background thread to detect a request stuck in nextPending for extended
period of time and kill the server if it is the case.  I can post the patch
if we are able unable find the root cause.

You can also capture a heap dump of server to inspect which request get
stuck (at nextPending) and correlated the possible event.




 Race condition in commit processor leading to out of order request
completion, xid mismatch on client.

--

 Key: ZOOKEEPER-1863
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1863
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.5.0
Reporter: Dutch T. Meyer
Priority: Blocker
 Attachments: stack.17512


 In CommitProcessor.java processor, if we are at the primary request
handler on line 167:
 {noformat}
 while (!stopped  !isWaitingForCommit() 
!isProcessingCommit() 
(request = queuedRequests.poll()) != null) {
 if (needCommit(request)) {
 nextPending.set(request);
 } else {
 sendToNextProcessor(request);
 }
 }
 {noformat}
 A request can be handled in this block and be quickly processed and
completed on another thread. If queuedRequests is empty, we then exit the
block. Next, before this thread makes any more progress, we can get 2 more
requests, one get_children(say), and a sync placed on queuedRequests for
the processor. Then, if we are very unlucky, the sync request can complete
and this object's commit() routine is called (from
FollowerZookeeperServer), which places the sync request on the previously
empty committedRequests queue. At that point, this thread continues.
 We reach line 182, which is a check on sync requests.
 {noformat}
 if (!stopped  !isProcessingRequest() 
 (request = committedRequests.poll()) != null) {
 {noformat}
 Here we are not processing any requests, because the original request has
completed. We haven't dequeued either the read or the sync request in this
processor. Next, the poll above will pull the sync request off the queue,
and in the following block, the sync will get forwarded to the next
processor.
 This is a problem because the read request hasn't been forwarded yet, so
requests are now out of order.
 I've been able to reproduce this bug reliably by injecting a
Thread.sleep(5000) between the two blocks above to make the race condition
far more likely, then in a client program.
 {noformat}
 zoo_aget_children(zh, /, 0, getchildren_cb, NULL);
 //Wait long enough for queuedRequests to drain
 sleep(1);
 zoo_aget_children(zh, /, 0, getchildren_cb, th_ctx[0]);
 zoo_async(zh, /, sync_cb, th_ctx[0]);
 {noformat}
 When this bug is triggered, 3 things can happen:
 1) 

Re: [jira] [Commented] (ZOOKEEPER-1863) Race condition in commit processor leading to out of order request completion, xid mismatch on client.

2014-01-26 Thread Camille Fournier
I think I've identified the problem.
In CommitProcessor line 203, we set the nextPending to null. But we never
set it to null in the else case below on lines 205-210. It doesn't appear
to be removed anywhere else in the file so the processor thinks it is
always waiting for commit and will never continue.
That seems to match the stack anyway. Not sure how this causes the xid
mismatch yet though.
On Jan 24, 2014 10:35 PM, Thawan Kooburat (JIRA) j...@apache.org wrote:


 [
 https://issues.apache.org/jira/browse/ZOOKEEPER-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13881636#comment-13881636]

 Thawan Kooburat commented on ZOOKEEPER-1863:
 

 I have seen a Commit Processor getting stuck in our prod (which run our
 internal branch) I spent a few days digging into the problem but couldn't
 locate the root cause.

 The sequence of action that you put in description is very unlikely to
 occur in quorum mode.  First, the Follower/ObserverReuestProcessor which is
 in the front of the CommitProcessor put a request into queuedRequests even
 before sending it out to the leader.   It need at least a network round
 trip ( or a full quorum vote) before the same request will comeback from a
 leader and get put into commitRequest.  This is the assumption that even
 the original CommitProcessor (prior to ZOOKEEPER-1505) rely on. However, a
 combination of bad thread scheduling and long GC pause might break this
 assumption.

 Sync request is special unlike other write request because it doesn't
 require quorum voting, but I still don't think it matter in this case.

 Again, since I saw this in prod but I am unable to repro it. I did add a
 background thread to detect a request stuck in nextPending for extended
 period of time and kill the server if it is the case.  I can post the patch
 if we are able unable find the root cause.

 You can also capture a heap dump of server to inspect which request get
 stuck (at nextPending) and correlated the possible event.




  Race condition in commit processor leading to out of order request
 completion, xid mismatch on client.
 
 --
 
  Key: ZOOKEEPER-1863
  URL:
 https://issues.apache.org/jira/browse/ZOOKEEPER-1863
  Project: ZooKeeper
   Issue Type: Bug
   Components: server
 Affects Versions: 3.5.0
 Reporter: Dutch T. Meyer
 Priority: Blocker
  Attachments: stack.17512
 
 
  In CommitProcessor.java processor, if we are at the primary request
 handler on line 167:
  {noformat}
  while (!stopped  !isWaitingForCommit() 
 !isProcessingCommit() 
 (request = queuedRequests.poll()) != null) {
  if (needCommit(request)) {
  nextPending.set(request);
  } else {
  sendToNextProcessor(request);
  }
  }
  {noformat}
  A request can be handled in this block and be quickly processed and
 completed on another thread. If queuedRequests is empty, we then exit the
 block. Next, before this thread makes any more progress, we can get 2 more
 requests, one get_children(say), and a sync placed on queuedRequests for
 the processor. Then, if we are very unlucky, the sync request can complete
 and this object's commit() routine is called (from
 FollowerZookeeperServer), which places the sync request on the previously
 empty committedRequests queue. At that point, this thread continues.
  We reach line 182, which is a check on sync requests.
  {noformat}
  if (!stopped  !isProcessingRequest() 
  (request = committedRequests.poll()) != null) {
  {noformat}
  Here we are not processing any requests, because the original request
 has completed. We haven't dequeued either the read or the sync request in
 this processor. Next, the poll above will pull the sync request off the
 queue, and in the following block, the sync will get forwarded to the next
 processor.
  This is a problem because the read request hasn't been forwarded yet, so
 requests are now out of order.
  I've been able to reproduce this bug reliably by injecting a
 Thread.sleep(5000) between the two blocks above to make the race condition
 far more likely, then in a client program.
  {noformat}
  zoo_aget_children(zh, /, 0, getchildren_cb, NULL);
  //Wait long enough for queuedRequests to drain
  sleep(1);
  zoo_aget_children(zh, /, 0, getchildren_cb, th_ctx[0]);
  zoo_async(zh, /, sync_cb, th_ctx[0]);
  {noformat}
  When this bug is triggered, 3 things can happen:
  1) Clients will see requests complete out of order and fail on xid
 mismatches.
  2) Kazoo in particular doesn't handle this runtime 

[jira] [Commented] (ZOOKEEPER-1863) Race condition in commit processor leading to out of order request completion, xid mismatch on client.

2014-01-26 Thread Camille Fournier (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13882424#comment-13882424
 ] 

Camille Fournier commented on ZOOKEEPER-1863:
-

OK Apologies, I (somewhat) understand why we don't care when nextPending 
doesn't match.

So, I've played around with reproducing this. When I change the 
FollowerRequestProcessor to do this:
  case OpCode.sync:
zks.pendingSyncs.add(request);
zks.getFollower().request(request);
zks.sync(); //this is what the leader will eventually cause 
us to call
break;
 , randomly sleep the commit processor, and run the async hammer test with some 
extra reads and sleeps (just trying to force a reproduction), I can get the xid 
out of sync exception:
java.io.IOException: Xid out of order. Got Xid 5 with err 0 expected Xid 1 for 
a packet with details: clientPath:/test- serverPath:/test- finished:false 
header:: 1,1  replyHeader:: 0,0,-4  request:: 
'/test-,,v{s{31,s{'world,'anyone}}},2  response::  
at 
org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:914)
at 
org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:99)
at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:362)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1193)

Note that this fails in the same way with the same changes in 3.4, so I don't 
think this is a new thing.
Have yet to get it to hang or any other behavior except for the xid out of 
order.
Feels like we may need to do some additional tracking of syncs inside the 
commit processor...


 Race condition in commit processor leading to out of order request 
 completion, xid mismatch on client.
 --

 Key: ZOOKEEPER-1863
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1863
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.5.0
Reporter: Dutch T. Meyer
Priority: Blocker
 Attachments: stack.17512


 In CommitProcessor.java processor, if we are at the primary request handler 
 on line 167:
 {noformat}
 while (!stopped  !isWaitingForCommit() 
!isProcessingCommit() 
(request = queuedRequests.poll()) != null) {
 if (needCommit(request)) {
 nextPending.set(request);
 } else {
 sendToNextProcessor(request);
 }
 }
 {noformat}
 A request can be handled in this block and be quickly processed and completed 
 on another thread. If queuedRequests is empty, we then exit the block. Next, 
 before this thread makes any more progress, we can get 2 more requests, one 
 get_children(say), and a sync placed on queuedRequests for the processor. 
 Then, if we are very unlucky, the sync request can complete and this object's 
 commit() routine is called (from FollowerZookeeperServer), which places the 
 sync request on the previously empty committedRequests queue. At that point, 
 this thread continues.
 We reach line 182, which is a check on sync requests.
 {noformat}
 if (!stopped  !isProcessingRequest() 
 (request = committedRequests.poll()) != null) {
 {noformat}
 Here we are not processing any requests, because the original request has 
 completed. We haven't dequeued either the read or the sync request in this 
 processor. Next, the poll above will pull the sync request off the queue, and 
 in the following block, the sync will get forwarded to the next processor.
 This is a problem because the read request hasn't been forwarded yet, so 
 requests are now out of order.
 I've been able to reproduce this bug reliably by injecting a 
 Thread.sleep(5000) between the two blocks above to make the race condition 
 far more likely, then in a client program.
 {noformat}
 zoo_aget_children(zh, /, 0, getchildren_cb, NULL);
 //Wait long enough for queuedRequests to drain
 sleep(1);
 zoo_aget_children(zh, /, 0, getchildren_cb, th_ctx[0]);
 zoo_async(zh, /, sync_cb, th_ctx[0]);
 {noformat}
 When this bug is triggered, 3 things can happen:
 1) Clients will see requests complete out of order and fail on xid mismatches.
 2) Kazoo in particular doesn't handle this runtime exception well, and can 
 orphan outstanding requests.
 3) I've seen zookeeper servers deadlock, likely because the commit cannot be 
 completed, which can wedge the commit processor.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (ZOOKEEPER-1833) fix windows build

2014-01-26 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13882510#comment-13882510
 ] 

Flavio Junqueira commented on ZOOKEEPER-1833:
-

One strange thing I noticed while analyzing the logs of build 417 is that the 
format seems to be wrong:

{noformat}
 [WorkerReceiver[myid=3]:FastLeaderElection@597] - Notification: 73657276 
(message format version),...
{noformat}

I haven't observed this while running on unix and from the code it doesn't seem 
possible, unless some messages are being sent partially.

 fix windows build
 -

 Key: ZOOKEEPER-1833
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1833
 Project: ZooKeeper
  Issue Type: Bug
Affects Versions: 3.4.5
Reporter: Michi Mutsuzaki
Assignee: Michi Mutsuzaki
Priority: Blocker
 Fix For: 3.4.6

 Attachments: TEST-org.apache.zookeeper.test.QuorumTest.zip, 
 ZOOKEEPER-1833-b3.4.patch, ZOOKEEPER-1833.patch, ZOOKEEPER-1833.patch


 A bunch of 3.4 tests are failing on windows.
 {noformat}
 [junit] 2013-12-06 08:40:59,692 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testEarlyLeaderAbandonment
 [junit] 2013-12-06 08:41:10,472 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testHighestZxidJoinLate
 [junit] 2013-12-06 08:45:31,085 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testUpdatingEpoch
 [junit] 2013-12-06 08:55:34,630 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testObserversHammer
 [junit] 2013-12-06 08:55:59,889 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testAsyncExistsFailure_NoNode
 [junit] 2013-12-06 08:56:00,571 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testAsyncGetACL
 [junit] 2013-12-06 08:56:02,626 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testAsyncGetChildrenEmpty
 [junit] 2013-12-06 08:56:03,491 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testAsyncGetChildrenSingle
 [junit] 2013-12-06 08:56:11,276 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testAsyncGetChildrenTwo
 [junit] 2013-12-06 08:56:13,878 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testAsyncGetChildrenFailure_NoNode
 [junit] 2013-12-06 08:56:16,294 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testAsyncGetChildren2Empty
 [junit] 2013-12-06 08:56:18,622 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testAsyncGetChildren2Single
 [junit] 2013-12-06 08:56:21,224 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testAsyncGetChildren2Two
 [junit] 2013-12-06 08:56:23,738 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testAsyncGetChildren2Failure_NoNode
 [junit] 2013-12-06 08:56:26,058 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testAsyncGetData
 [junit] 2013-12-06 08:56:28,482 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testAsyncGetDataFailure_NoNode
 [junit] 2013-12-06 08:57:35,527 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testStartupFailureCreate
 [junit] 2013-12-06 08:57:38,645 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testStartupFailureSet
 [junit] 2013-12-06 08:57:41,261 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testStartupFailureSnapshot
 [junit] 2013-12-06 08:59:22,222 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testClientWithWatcherObj
 [junit] 2013-12-06 09:00:05,592 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testClientCleanup
 [junit] 2013-12-06 09:01:24,113 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testBindByAddress
 [junit] 2013-12-06 09:02:14,123 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testClientwithoutWatcherObj
 [junit] 2013-12-06 09:05:56,461 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testZeroWeightQuorum
 [junit] 2013-12-06 09:08:18,747 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testResyncByDiffAfterFollowerCrashes
 [junit] 2013-12-06 09:09:42,271 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testFourLetterWords
 [junit] 2013-12-06 09:14:03,770 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testLE
 [junit] 2013-12-06 09:46:30,002 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testHierarchicalQuorum
 [junit] 2013-12-06 09:50:26,912 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testHammerBasic
 [junit] 2013-12-06 09:51:07,604 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testQuotaWithQuorum
 [junit] 2013-12-06 09:52:41,515 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testNull
 [junit] 2013-12-06 09:53:22,648 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testDeleteWithChildren
 [junit] 2013-12-06 09:56:49,061 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testClientwithoutWatcherObj
 [junit] 2013-12-06 09:58:27,705 [myid:] - INFO  [main:ZKTestCase$1@65] - 
 FAILED testGetView
 [junit] 2013-12-06 

[jira] [Created] (ZOOKEEPER-1867) Bug on ZkDatabaseCorruptionTest

2014-01-26 Thread Flavio Junqueira (JIRA)
Flavio Junqueira created ZOOKEEPER-1867:
---

 Summary: Bug on ZkDatabaseCorruptionTest
 Key: ZOOKEEPER-1867
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1867
 Project: ZooKeeper
  Issue Type: Sub-task
Reporter: Flavio Junqueira


If I'm reading the test case testCorruption right, it seems to depend on server 
5 being elected, but if it is not the case, then it fails waiting for a server 
to be up.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Updated] (ZOOKEEPER-1867) Bug on ZkDatabaseCorruptionTest

2014-01-26 Thread Flavio Junqueira (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Flavio Junqueira updated ZOOKEEPER-1867:


Attachment: ZOOKEEPER-1867.patch
ZOOKEEPER-1867-b3.4.patch

Simple fix to not wait for the leaderSid server.

 Bug on ZkDatabaseCorruptionTest
 ---

 Key: ZOOKEEPER-1867
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1867
 Project: ZooKeeper
  Issue Type: Sub-task
Reporter: Flavio Junqueira
 Fix For: 3.4.6

 Attachments: ZOOKEEPER-1867-b3.4.patch, ZOOKEEPER-1867.patch


 If I'm reading the test case testCorruption right, it seems to depend on 
 server 5 being elected, but if it is not the case, then it fails waiting for 
 a server to be up.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (ZOOKEEPER-1867) Bug on ZkDatabaseCorruptionTest

2014-01-26 Thread Rakesh R (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13882586#comment-13882586
 ] 

Rakesh R commented on ZOOKEEPER-1867:
-

+1 looks fine for me. I'm giving 'submit patch' to see jenkins report.

 Bug on ZkDatabaseCorruptionTest
 ---

 Key: ZOOKEEPER-1867
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1867
 Project: ZooKeeper
  Issue Type: Sub-task
Reporter: Flavio Junqueira
 Fix For: 3.4.6

 Attachments: ZOOKEEPER-1867-b3.4.patch, ZOOKEEPER-1867.patch


 If I'm reading the test case testCorruption right, it seems to depend on 
 server 5 being elected, but if it is not the case, then it fails waiting for 
 a server to be up.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Success: ZOOKEEPER-1867 PreCommit Build #1899

2014-01-26 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/ZOOKEEPER-1867
Build: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1899/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 292852 lines...]
 [exec] BUILD SUCCESSFUL
 [exec] Total time: 0 seconds
 [exec] 
 [exec] 
 [exec] 
 [exec] 
 [exec] +1 overall.  Here are the results of testing the latest attachment 
 [exec]   
http://issues.apache.org/jira/secure/attachment/12625308/ZOOKEEPER-1867.patch
 [exec]   against trunk revision 1561236.
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 3 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
(version 1.3.9) warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.
 [exec] 
 [exec] +1 core tests.  The patch passed core unit tests.
 [exec] 
 [exec] +1 contrib tests.  The patch passed contrib unit tests.
 [exec] 
 [exec] Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1899//testReport/
 [exec] Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1899//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
 [exec] Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1899//console
 [exec] 
 [exec] This message is automatically generated.
 [exec] 
 [exec] 
 [exec] 
==
 [exec] 
==
 [exec] Adding comment to Jira.
 [exec] 
==
 [exec] 
==
 [exec] 
 [exec] 
 [exec] Comment added.
 [exec] f14a36f2ed21715f76da184bb5d563621eb8cf42 logged out
 [exec] 
 [exec] 
 [exec] 
==
 [exec] 
==
 [exec] Finished build.
 [exec] 
==
 [exec] 
==
 [exec] 
 [exec] 

BUILD SUCCESSFUL
Total time: 35 minutes 44 seconds
Archiving artifacts
Recording test results
Description set: ZOOKEEPER-1867
Email was triggered for: Success
Sending email for trigger: Success



###
## FAILED TESTS (if any) 
##
All tests passed

[jira] [Commented] (ZOOKEEPER-1867) Bug on ZkDatabaseCorruptionTest

2014-01-26 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13882596#comment-13882596
 ] 

Hadoop QA commented on ZOOKEEPER-1867:
--

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12625308/ZOOKEEPER-1867.patch
  against trunk revision 1561236.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1899//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1899//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/1899//console

This message is automatically generated.

 Bug on ZkDatabaseCorruptionTest
 ---

 Key: ZOOKEEPER-1867
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1867
 Project: ZooKeeper
  Issue Type: Sub-task
Reporter: Flavio Junqueira
Assignee: Flavio Junqueira
 Fix For: 3.4.6, 3.5.0

 Attachments: ZOOKEEPER-1867-b3.4.patch, ZOOKEEPER-1867.patch


 If I'm reading the test case testCorruption right, it seems to depend on 
 server 5 being elected, but if it is not the case, then it fails waiting for 
 a server to be up.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


ZooKeeper_branch33_solaris - Build # 778 - Still Failing

2014-01-26 Thread Apache Jenkins Server
See https://builds.apache.org/job/ZooKeeper_branch33_solaris/778/

###
## LAST 60 LINES OF THE CONSOLE 
###
Started by timer
Building remotely on solaris1 in workspace 
/export/home/hudson/hudson-slave/workspace/ZooKeeper_branch33_solaris
FATAL: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected 
termination of the channel
hudson.remoting.RequestAbortedException: 
hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected 
termination of the channel
at 
hudson.remoting.RequestAbortedException.wrapForRethrow(RequestAbortedException.java:41)
at 
hudson.remoting.RequestAbortedException.wrapForRethrow(RequestAbortedException.java:34)
at hudson.remoting.Request.call(Request.java:174)
at hudson.remoting.Channel.call(Channel.java:722)
at hudson.EnvVars.getRemote(EnvVars.java:396)
at hudson.model.Computer.getEnvironment(Computer.java:908)
at 
jenkins.model.CoreEnvironmentContributor.buildEnvironmentFor(CoreEnvironmentContributor.java:29)
at hudson.model.Run.getEnvironment(Run.java:2191)
at hudson.model.AbstractBuild.getEnvironment(AbstractBuild.java:914)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:781)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1414)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:652)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:88)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:561)
at hudson.model.Run.execute(Run.java:1678)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:231)
Caused by: hudson.remoting.RequestAbortedException: java.io.IOException: 
Unexpected termination of the channel
at hudson.remoting.Request.abort(Request.java:299)
at hudson.remoting.Channel.terminate(Channel.java:782)
at 
hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:69)
Caused by: java.io.IOException: Unexpected termination of the channel
at 
hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
Caused by: java.io.EOFException
at 
java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2598)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1318)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at hudson.remoting.Command.readFrom(Command.java:92)
at 
hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:71)
at 
hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)



###
## FAILED TESTS (if any) 
##
No tests ran.

ZooKeeper_branch34_solaris - Build # 779 - Still Failing

2014-01-26 Thread Apache Jenkins Server
See https://builds.apache.org/job/ZooKeeper_branch34_solaris/779/

###
## LAST 60 LINES OF THE CONSOLE 
###
Started by timer
Building remotely on solaris1 in workspace 
/export/home/hudson/hudson-slave/workspace/ZooKeeper_branch34_solaris
FATAL: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected 
termination of the channel
hudson.remoting.RequestAbortedException: 
hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected 
termination of the channel
at 
hudson.remoting.RequestAbortedException.wrapForRethrow(RequestAbortedException.java:41)
at 
hudson.remoting.RequestAbortedException.wrapForRethrow(RequestAbortedException.java:34)
at hudson.remoting.Request.call(Request.java:174)
at hudson.remoting.Channel.call(Channel.java:722)
at hudson.EnvVars.getRemote(EnvVars.java:396)
at hudson.model.Computer.getEnvironment(Computer.java:908)
at 
jenkins.model.CoreEnvironmentContributor.buildEnvironmentFor(CoreEnvironmentContributor.java:29)
at hudson.model.Run.getEnvironment(Run.java:2191)
at hudson.model.AbstractBuild.getEnvironment(AbstractBuild.java:914)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:781)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1414)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:652)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:88)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:561)
at hudson.model.Run.execute(Run.java:1678)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:231)
Caused by: hudson.remoting.RequestAbortedException: java.io.IOException: 
Unexpected termination of the channel
at hudson.remoting.Request.abort(Request.java:299)
at hudson.remoting.Channel.terminate(Channel.java:782)
at 
hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:69)
Caused by: java.io.IOException: Unexpected termination of the channel
at 
hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:50)
Caused by: java.io.EOFException
at 
java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2598)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1318)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at hudson.remoting.Command.readFrom(Command.java:92)
at 
hudson.remoting.ClassicCommandTransport.read(ClassicCommandTransport.java:71)
at 
hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:48)



###
## FAILED TESTS (if any) 
##
No tests ran.

Jenkins build is back to normal : bookkeeper-trunk #526

2014-01-26 Thread Apache Jenkins Server
See https://builds.apache.org/job/bookkeeper-trunk/526/