[jira] [Commented] (ZOOKEEPER-2099) Using txnlog to sync a learner can corrupt the learner's datatree

2016-10-20 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593443#comment-15593443
 ] 

Michael Han commented on ZOOKEEPER-2099:


Just did a search on my archived build mails - I see a good number of tests 
failing from time to time with 'KeeperErrorCode = ConnectionLoss'. I think the 
test cases should be made more fault tolerant to such false negatives. I agree 
that we should not blindly retry; retries should be added on a case-by-case 
basis. Let me dig more into what this new test and those previously failing 
tests did.

> Using txnlog to sync a learner can corrupt the learner's datatree
> -
>
> Key: ZOOKEEPER-2099
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2099
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.6.0
>Reporter: Santeri (Santtu) Voutilainen
>Assignee: Martin Kuchta
> Attachments: ZOOKEEPER-2099-repro.patch, ZOOKEEPER-2099.patch
>
>
> When a learner sync's with the leader, it is possible for the leader to send 
> the learner a DIFF that does NOT contain all the transactions between the 
> learner's zxid and the leader's zxid, thus resulting in a corrupted 
> datatree on the learner.
> For this to occur, the leader must have sync'd with a previous leader using a 
> SNAP and the zxid requested by the learner must still exist in the current 
> leader's txnlog files.
> This issue was introduced by ZOOKEEPER-1413.
> *Scenario*
> A sample sequence in which this issue occurs:
> # Hosts H1 and H2 disconnect from the current leader H3 (crash, network 
> partition, etc).  The last zxid on these hosts is Z1.
> # Additional transactions occur on the cluster resulting in the latest zxid 
> being Z2.
> # Host H1 recovers and connects to H3 to sync and sends Z1 as part of its 
> FOLLOWERINFO or OBSERVERINFO packet.
> # The leader, H3, decides to send a SNAP because a) it does not have the 
> necessary records in the in-mem committed log, AND b) the size of the 
> required txnlog to send is larger than the limit.
> # Host H1 successfully sync's with the leader (H3). At this point H1's 
> txnlogs have records up to and including Z1 as well as Z2 and up.  It does 
> NOT have records between Z1 and Z2.
> # Host H3 fails; a leader election occurs and H1 is chosen as the leader
> # Host H2 recovers and connects to H1 to sync and sends Z1 in its 
> FOLLOWERINFO/OBSERVERINFO packet
> # The leader, H1, determines it can send a DIFF.  It concludes this because 
> although it does not have the necessary records in its in-memory commit log, 
> it does have Z1 in its txnlog and the size of the log is less than the limit. 
>  H1 ends up with a different size calculation than H3 because H1 is missing 
> all the records between Z1 and Z2 so it has less log to send.
> # H2 receives the DIFF and applies the records to its data tree. Depending on 
> the type of transactions that occurred between Z1 and Z2 it may not hit any 
> errors when applying these records.
> H2 now has a corrupted view of the data tree because it is missing all the 
> changes made by the transactions between Z1 and Z2.
> *Recovery*
> The way to recover from this situation is to delete the data/snap directory 
> contents from the affected hosts and have them resync with the leader at 
> which point they will receive a SNAP since they will appear as empty hosts.
> *Workaround*
> A quick workaround for anyone concerned about this issue is to disable sync 
> from the txnlog by changing the database size limit to 0.  This is a code 
> change as it is not a configurable setting.
> *Potential fixes*
> There are several ways of fixing this.  A few options:
> * Delete all snaps and txnlog files on a host when it receives a SNAP from 
> the leader
> * Invalidate sync from txnlog after receiving a SNAP. This state must also be 
> persisted on-disk so that the txnlogs with the gap cannot be used to provide 
> a DIFF even after restart.  A couple ways in which the state could be 
> persisted:
> ** Write a file (for example: loggap.) in the data dir indicating that 
> the host was sync'd with a SNAP and thus txnlogs might be missing. Presence 
> of these files would be checked when reading txnlogs.
> ** Write a new record into the txnlog file as "sync'd-by-snap-from-leader" 
> marker. Readers of the txnlog would then check for presence of this record 
> when iterating through it and act appropriately.
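The leader's DIFF-vs-SNAP decision described in the scenario can be modeled with a short sketch. This is an illustrative model, not ZooKeeper's actual LearnerHandler logic; the class, field, and method names here are invented for the sketch. The bug corresponds to the second branch: after H1 itself was synced by SNAP, its txnlog between Z1 and Z2 is missing, so its size estimate under-counts and it wrongly concludes a txnlog-based DIFF is safe.

```java
// Illustrative model of the leader's sync-strategy decision (names invented;
// not the real LearnerHandler code). A leader with a gapped txnlog computes a
// smaller txnlogBytes than a leader with a complete log, which is how H1 can
// choose DIFF_FROM_TXNLOG where H3 chose SNAP.
public class SyncStrategy {
    public enum Kind { DIFF_FROM_COMMITTED_LOG, DIFF_FROM_TXNLOG, SNAP }

    private final long minCommittedZxid; // oldest zxid in the in-memory committed log
    private final long txnlogBytes;      // bytes of txnlog that would need to be sent
    private final long sizeLimitBytes;   // txnlog sync size limit (0 disables txnlog sync)

    public SyncStrategy(long minCommittedZxid, long txnlogBytes, long sizeLimitBytes) {
        this.minCommittedZxid = minCommittedZxid;
        this.txnlogBytes = txnlogBytes;
        this.sizeLimitBytes = sizeLimitBytes;
    }

    public Kind decide(long learnerZxid) {
        if (learnerZxid >= minCommittedZxid) {
            return Kind.DIFF_FROM_COMMITTED_LOG; // all needed records are in memory
        }
        if (sizeLimitBytes > 0 && txnlogBytes <= sizeLimitBytes) {
            return Kind.DIFF_FROM_TXNLOG;        // the path that can send a gapped DIFF
        }
        return Kind.SNAP;                        // fall back to a full snapshot
    }
}
```

Under this model, the workaround of setting the database size limit to 0 means the txnlog branch is never taken, so a learner behind the committed log always receives a SNAP.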



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Failed: ZOOKEEPER-2080 PreCommit Build #3496

2016-10-20 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
Build: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3496/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 444809 lines...]
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
(version 2.0.3) warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.
 [exec] 
 [exec] -1 core tests.  The patch failed core unit tests.
 [exec] 
 [exec] +1 contrib tests.  The patch passed contrib unit tests.
 [exec] 
 [exec] Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3496//testReport/
 [exec] Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3496//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
 [exec] Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3496//console
 [exec] 
 [exec] This message is automatically generated.
 [exec] 
 [exec] 
 [exec] 
==
 [exec] 
==
 [exec] Adding comment to Jira.
 [exec] 
==
 [exec] 
==
 [exec] 
 [exec] 
 [exec] Comment added.
 [exec] ed3d5c765eba27c665beb23d9b0fb180b9e3cc8b logged out
 [exec] 
 [exec] 
 [exec] 
==
 [exec] 
==
 [exec] Finished build.
 [exec] 
==
 [exec] 
==
 [exec] 
 [exec] 
 [exec] mv: 
'/home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-Build/patchprocess' 
and 
'/home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-Build/patchprocess' 
are the same file

BUILD FAILED
/home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-Build/build.xml:1605: 
exec returned: 2

Total time: 19 minutes 39 seconds
Build step 'Execute shell' marked build as failure
Archiving artifacts
Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7
Compressed 550.73 KB of artifacts by 34.9% relative to #3494
Recording test results
Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7
Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7
[description-setter] Description set: ZOOKEEPER-2080
Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7
Email was triggered for: Failure - Any
Sending email for trigger: Failure - Any
Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7
Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7
Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7
Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7



###
## FAILED TESTS (if any) 
##
1 tests failed.
FAILED:  org.apache.zookeeper.server.quorum.Zab1_0Test.testNormalObserverRun

Error Message:
Timeout occurred. Please note the time in the report does not reflect the time 
until the timeout.

Stack Trace:
junit.framework.AssertionFailedError: Timeout occurred. Please note the time in 
the report does not reflect the time until the timeout.
at java.lang.Thread.run(Thread.java:745)




[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-10-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593437#comment-15593437
 ] 

Hadoop QA commented on ZOOKEEPER-2080:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12834574/ZOOKEEPER-2080.patch
  against trunk revision cef5978969bedfe066f903834a9ea4af6d508844.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 2.0.3) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3496//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3496//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3496//console

This message is automatically generated.

> ReconfigRecoveryTest fails intermittently
> -
>
> Key: ZOOKEEPER-2080
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2080
> Project: ZooKeeper
>  Issue Type: Sub-task
>Reporter: Ted Yu
>Assignee: Michael Han
> Fix For: 3.5.3, 3.6.0
>
> Attachments: ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, ZOOKEEPER-2080.patch, 
> ZOOKEEPER-2080.patch, jacoco-ZOOKEEPER-2080.unzip-grows-to-70MB.7z, 
> repro-20150816.log, threaddump.log
>
>
> I got the following test failure on MacBook with trunk code:
> {code}
> Testcase: testCurrentObserverIsParticipantInNewConfig took 93.628 sec
>   FAILED
> waiting for server 2 being up
> junit.framework.AssertionFailedError: waiting for server 2 being up
>   at 
> org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig(ReconfigRecoveryTest.java:529)
>   at 
> org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {code}





[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-10-20 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593405#comment-15593405
 ] 

Michael Han commented on ZOOKEEPER-2080:


I've also left some comments on the PR about the refactoring. The patch has 
been stress tested on an internal Jenkins with the previously failing tests, 
and so far all tests have passed (500+ runs).



[GitHub] zookeeper pull request #92: ZOOKEEPER-2080: Fix deadlock in dynamic reconfig...

2016-10-20 Thread hanm
GitHub user hanm opened a pull request:

https://github.com/apache/zookeeper/pull/92

ZOOKEEPER-2080: Fix deadlock in dynamic reconfiguration.

Use explicit fine grained locks for synchronizing access to QuorumVerifier 
states in QuorumPeer.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hanm/zookeeper ZOOKEEPER-2080

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/zookeeper/pull/92.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #92


commit cfb4d73a94f23c6e047f75d400372caa495f197f
Author: Michael Han 
Date:   2016-10-20T22:37:22Z

Use explicit fine grained lock for maintaining a consistent global view of 
QuorumVerifier in QuorumPeer.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-10-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593386#comment-15593386
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2080:
---

GitHub user hanm opened a pull request:

https://github.com/apache/zookeeper/pull/92

ZOOKEEPER-2080: Fix deadlock in dynamic reconfiguration.

Use explicit fine grained locks for synchronizing access to QuorumVerifier 
states in QuorumPeer.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hanm/zookeeper ZOOKEEPER-2080

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/zookeeper/pull/92.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #92


commit cfb4d73a94f23c6e047f75d400372caa495f197f
Author: Michael Han 
Date:   2016-10-20T22:37:22Z

Use explicit fine grained lock for maintaining a consistent global view of 
QuorumVerifier in QuorumPeer.






[GitHub] zookeeper pull request #91: Use explicit fine grained locks for synchronizin...

2016-10-20 Thread hanm
Github user hanm closed the pull request at:

https://github.com/apache/zookeeper/pull/91




[GitHub] zookeeper pull request #91: Use explicit fine grained locks for synchronizin...

2016-10-20 Thread hanm
GitHub user hanm opened a pull request:

https://github.com/apache/zookeeper/pull/91

Use explicit fine grained locks for synchronizing access to QuorumVerifier 
states in QuorumPeer



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hanm/zookeeper ZOOKEEPER-2080

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/zookeeper/pull/91.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #91


commit cfb4d73a94f23c6e047f75d400372caa495f197f
Author: Michael Han 
Date:   2016-10-20T22:37:22Z

Use explicit fine grained lock for maintaining a consistent global view of 
QuorumVerifier in QuorumPeer.






[jira] [Commented] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-10-20 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593380#comment-15593380
 ] 

Michael Han commented on ZOOKEEPER-2080:


Getting back on this issue. I have another perspective - rather than 
synchronizing on the entire QuorumPeer, we could use fine grained locks to 
specifically guard the set of states affiliated with QuorumVerifier, namely:
* Quorum verifier.
* Last seen quorum verifier.
* Some additional fields that appertain to QV and last QV, such as quorum 
address, election port, and client address.

As for what to guard, after examining the code paths and call graphs I feel 
guarding individual fields (QV, last QV, quorum address, etc.) is enough - if I 
am wrong, then we will need to define a compound state and introduce a new 
class that encapsulates all these states. We should also make sure that every 
access to these states is guarded by the same lock. 

bq. But this update would have to synchronize on both QP and QCM, so not sure 
if the same problem exists as above.
I feel the same way - this solution just moves the lock around =).

bq. Do we need to update all view listeners in a critical section?
I think yes, if we want to maintain a globally consistent view of the QV 
states. When the value of QV in QP is updated, QP needs to fire events to 
notify subscribers of the change, and without locking, by the time an event 
reaches a subscriber its view of the world might already have changed. We 
might also need locking in each listener's event callbacks, to avoid any 
potential race condition within a listener. But in this case the likelihood of 
running into a race / deadlock sounds lower, as we are no longer synchronizing 
on a single QP.

In any case, I feel refactoring QuorumPeer to use fine grained locking is 
necessary. Attaching a patch to express my idea.
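A minimal sketch of the fine-grained-lock idea, under stated assumptions: the field and method names below are invented (the real QuorumPeer differs), and String stand-ins replace the actual QuorumVerifier and address types. One dedicated lock object guards the verifier, the last seen verifier, and the derived address fields, instead of synchronizing on the whole QuorumPeer.

```java
// Sketch: one dedicated lock guards all QuorumVerifier-related state, so a
// config update is atomic with respect to readers, without holding the
// QuorumPeer-wide monitor. Names are illustrative, not ZooKeeper's.
public class QuorumPeerSketch {
    private final Object qvLock = new Object(); // guards every field below

    private String quorumVerifier;         // stand-in for a QuorumVerifier instance
    private String lastSeenQuorumVerifier;
    private String quorumAddress;          // "host:port" stand-ins for derived fields
    private String electionAddress;

    // Compound update: verifier and derived addresses change together, so a
    // reader never sees a new verifier paired with an old config's addresses.
    public void setQuorumVerifier(String qv, String quorumAddr, String electionAddr) {
        synchronized (qvLock) {
            this.lastSeenQuorumVerifier = this.quorumVerifier;
            this.quorumVerifier = qv;
            this.quorumAddress = quorumAddr;
            this.electionAddress = electionAddr;
        }
    }

    public String getQuorumVerifier() {
        synchronized (qvLock) { return quorumVerifier; }
    }

    public String getLastSeenQuorumVerifier() {
        synchronized (qvLock) { return lastSeenQuorumVerifier; }
    }

    public String getQuorumAddress() {
        synchronized (qvLock) { return quorumAddress; }
    }
}
```

The key point of the design is that every access path to these fields takes the same lock, which is what breaks the lock-ordering cycle that synchronizing on the whole QuorumPeer can create.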



[jira] [Updated] (ZOOKEEPER-2080) ReconfigRecoveryTest fails intermittently

2016-10-20 Thread Michael Han (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Han updated ZOOKEEPER-2080:
---
Attachment: ZOOKEEPER-2080.patch



[jira] [Commented] (ZOOKEEPER-2099) Using txnlog to sync a learner can corrupt the learner's datatree

2016-10-20 Thread Martin Kuchta (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593379#comment-15593379
 ] 

Martin Kuchta commented on ZOOKEEPER-2099:
--

As a start, I think I should change the test to use waitForOne on the client 
being used for the operation after shutting down server 4, instead of 
waitForServerUp. That will at least make sure the test waits for the right 
thing before moving on and creating paths. I do expect the clients to be 
disconnected when taking down server 4 (since that should be the current 
leader), and if that's the only source of connection loss, the test might not 
need to retry.

As far as I can tell, most other tests don't retry on connection loss. Should 
they?
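Case-by-case retry on ConnectionLoss, as discussed in this thread, could look like the sketch below. This is a hypothetical helper, not an existing ZooKeeper test utility; the ConnectionLoss class here is a stand-in for KeeperException.ConnectionLossException, and a real test would perform its zk operation inside the Callable.

```java
import java.util.concurrent.Callable;

// Hypothetical retry helper for test operations that may hit a transient
// connection loss. Retries are opted into per call site, not applied blindly,
// so genuine failures still surface after maxAttempts.
public class RetryOnLoss {
    // Stand-in for KeeperException.ConnectionLossException.
    public static class ConnectionLoss extends Exception {}

    public static <T> T call(Callable<T> op, int maxAttempts) throws Exception {
        ConnectionLoss last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (ConnectionLoss e) {
                last = e;                       // transient: client may reconnect
                Thread.sleep(100L * attempt);   // simple linear backoff
            }
        }
        if (last != null) {
            throw last;                         // exhausted: surface the real failure
        }
        throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
}
```

A test would then wrap only the operation known to race with a leader shutdown, e.g. `RetryOnLoss.call(() -> zk.create(path, data, acl, mode), 3)`, leaving all other assertions strict.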





[jira] [Commented] (ZOOKEEPER-2099) Using txnlog to sync a learner can corrupt the learner's datatree

2016-10-20 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593326#comment-15593326
 ] 

Michael Han commented on ZOOKEEPER-2099:


Should we catch the connection loss exception and make sure the ZK handle used 
in the test is in the connected state? I think all ZK handles were initially in 
the connected state (after calling waitForAll), but maybe for some reason one 
or more connections were lost due to networking or other issues.



[jira] [Comment Edited] (ZOOKEEPER-2099) Using txnlog to sync a learner can corrupt the learner's datatree

2016-10-20 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593326#comment-15593326
 ] 

Michael Han edited comment on ZOOKEEPER-2099 at 10/20/16 11:16 PM:
---

Should we catch the connection loss exception and retry connecting, to make 
sure the ZK handle used in the test is in the connected state? I think all ZK 
handles were initially in the connected state (after calling waitForAll), but 
maybe for some reason one or more connections were lost due to networking or 
other issues.



> Using txnlog to sync a learner can corrupt the learner's datatree
> -
>
> Key: ZOOKEEPER-2099
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2099
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.5.0, 3.6.0
>Reporter: Santeri (Santtu) Voutilainen
>Assignee: Martin Kuchta
> Attachments: ZOOKEEPER-2099-repro.patch, ZOOKEEPER-2099.patch
>
>
> When a learner sync's with the leader, it is possible for the leader to send 
> the learner a DIFF that does NOT contain all the transactions between the 
> learner's zxid and the leader's zxid, thus resulting in a corrupted 
> datatree on the learner.
> For this to occur, the leader must have sync'd with a previous leader using a 
> SNAP and the zxid requested by the learner must still exist in the current 
> leader's txnlog files.
> This issue was introduced by ZOOKEEPER-1413.
> *Scenario*
> A sample sequence in which this issue occurs:
> # Hosts H1 and H2 disconnect from the current leader H3 (crash, network 
> partition, etc).  The last zxid on these hosts is Z1.
> # Additional transactions occur on the cluster resulting in the latest zxid 
> being Z2.
> # Host H1 recovers and connects to H3 to sync and sends Z1 as part of its 
> FOLLOWERINFO or OBSERVERINFO packet.
> # The leader, H3, decides to send a SNAP because a) it does not have the 
> necessary records in the in-mem committed log, AND b) the size of the 
> required txnlog to send is larger than the limit.
> # Host H1 successfully sync's with the leader (H3). At this point H1's 
> txnlogs have records up to and including Z1 as well as Z2 and up.  It does 
> NOT have records between Z1 and Z2.
> # Host H3 fails; a leader election occurs and H1 is chosen as the leader
> # Host H2 recovers and connects to H1 to sync and sends Z1 in its 
> FOLLOWERINFO/OBSERVERINFO packet
> # The leader, H1, determines it can send a DIFF.  It concludes this because 
> although it does not have the necessary records in its in-memory commit log, 
> it does have Z1 in its txnlog and the size of the log is less than the limit. 
>  H1 ends up with a different size calculation than H3 because H1 is missing 
> all the records between Z1 and Z2 so it has less log to send.
> # H2 receives the DIFF and applies the records to its data tree. Depending on 
> the type of transactions that occurred between Z1 and Z2 it may not hit any 
> errors when applying these records.
> H2 now has a corrupted view of the data tree because it is missing all the 
> changes made by the transactions between Z1 and Z2.
> *Recovery*
> The way to recover from this situation is to delete the data/snap directory 
> contents from the affected hosts and have them resync with the leader at 
> which point they will receive a SNAP since they will appear as empty hosts.
> *Workaround*
> A quick workaround for anyone concerned about this issue is to disable sync 
> from the txnlog by changing the database size limit to 0.  This is a code 
> change as it is not a configurable setting.
> *Potential fixes*
> There are several ways of fixing this. A few options:
> * Delete all snaps and txnlog files on a host when it receives a SNAP from 
> the leader
> * Invalidate sync from txnlog after receiving a SNAP. This state must also be 
> persisted on-disk so that the txnlogs with the gap cannot be used to provide 
> a DIFF even after restart.  A couple ways in which the state could be 
> persisted:
> ** Write a file (for example: loggap.) in the data dir indicating that 
> the host was sync'd with a SNAP and thus txnlogs might be missing. Presence 
> of these files would be checked when reading txnlogs.
> ** Write a new record into the txnlog file as "sync'd-by-snap-from-leader" 
> marker. Readers of the txnlog would then check for presence of this record 
> when iterating through it and act appropriately.




[jira] [Commented] (ZOOKEEPER-2099) Using txnlog to sync a learner can corrupt the learner's datatree

2016-10-20 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593316#comment-15593316
 ] 

Michael Han commented on ZOOKEEPER-2099:


Ah, never mind, that is the test added in the patch :)



[jira] [Commented] (ZOOKEEPER-2099) Using txnlog to sync a learner can corrupt the learner's datatree

2016-10-20 Thread Michael Han (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593314#comment-15593314
 ] 

Michael Han commented on ZOOKEEPER-2099:


I can't find the other failed test - 
org.apache.zookeeper.server.quorum.QuorumPeerMainTest.testTransactionLogGap - 
in the code base... where is this test coming from?



[jira] [Updated] (ZOOKEEPER-1045) Support Quorum Peer mutual authentication via SASL

2016-10-20 Thread Michael Han (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Han updated ZOOKEEPER-1045:
---
Attachment: ZOOKEEPER-1045 Test Plan.pdf

Test plan updated, now covering the new additions in the patch (i.e. the 
authorization piece). Anyone interested in this JIRA could help with testing by 
executing some of the test cases there.

> Support Quorum Peer mutual authentication via SASL
> --
>
> Key: ZOOKEEPER-1045
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1045
> Project: ZooKeeper
>  Issue Type: New Feature
>  Components: quorum, security
>Reporter: Eugene Koontz
>Assignee: Rakesh R
>Priority: Critical
> Fix For: 3.4.10, 3.5.3
>
> Attachments: 0001-ZOOKEEPER-1045-br-3-4.patch, 
> 1045_failing_phunt.tar.gz, HOST_RESOLVER-ZK-1045.patch, 
> TEST-org.apache.zookeeper.server.quorum.auth.QuorumAuthUpgradeTest.txt, 
> ZK-1045-test-case-failure-logs.zip, ZOOKEEPER-1045 Test Plan.pdf, 
> ZOOKEEPER-1045-00.patch, ZOOKEEPER-1045-Rolling Upgrade Design Proposal.pdf, 
> ZOOKEEPER-1045-br-3-4.patch, ZOOKEEPER-1045-br-3-4.patch, 
> ZOOKEEPER-1045-br-3-4.patch, ZOOKEEPER-1045-br-3-4.patch, 
> ZOOKEEPER-1045-br-3-4.patch, ZOOKEEPER-1045-br-3-4.patch, 
> ZOOKEEPER-1045-br-3-4.patch, ZOOKEEPER-1045-br-3-4.patch, 
> ZOOKEEPER-1045-br-3-4.patch, ZOOKEEPER-1045TestValidationDesign.pdf, 
> org.apache.zookeeper.server.quorum.auth.QuorumAuthUpgradeTest.testRollingUpgrade.log
>
>
> ZOOKEEPER-938 addresses mutual authentication between clients and servers. 
> This bug, on the other hand, is for authentication among quorum peers. 
> Hopefully much of the work done on SASL integration with Zookeeper for 
> ZOOKEEPER-938 can be used as a foundation for this enhancement.
> Review board: https://reviews.apache.org/r/47354/





[jira] [Commented] (ZOOKEEPER-2597) Add script to merge PR from Apache git repo to Github

2016-10-20 Thread Edward Ribeiro (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593241#comment-15593241
 ] 

Edward Ribeiro commented on ZOOKEEPER-2597:
---

I am resuming today, should update the PR really soon.

[~breed], please let me know if you have further suggestions. I will get to 
your review points promptly. Thanks. PS: do you agree with my comments about 
the merge vs. cherry-pick usage in the PR? Does it make sense?

> Add script to merge PR from Apache git repo to Github
> -
>
> Key: ZOOKEEPER-2597
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2597
> Project: ZooKeeper
>  Issue Type: Improvement
>Reporter: Edward Ribeiro
>Assignee: Edward Ribeiro
>Priority: Minor
> Attachments: ZOOKEEPER-2597.patch
>
>
> A port of kafka-merge-pr.py to work on the ZooKeeper repo.





[jira] [Commented] (ZOOKEEPER-2597) Add script to merge PR from Apache git repo to Github

2016-10-20 Thread Flavio Junqueira (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593078#comment-15593078
 ] 

Flavio Junqueira commented on ZOOKEEPER-2597:
-

[~breed] [~eribeiro] guys, it would be cool to have this done.



[jira] [Updated] (ZOOKEEPER-2597) Add script to merge PR from Apache git repo to Github

2016-10-20 Thread Flavio Junqueira (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Flavio Junqueira updated ZOOKEEPER-2597:

Assignee: Edward Ribeiro



[jira] [Commented] (ZOOKEEPER-2099) Using txnlog to sync a learner can corrupt the learner's datatree

2016-10-20 Thread Martin Kuchta (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593051#comment-15593051
 ] 

Martin Kuchta commented on ZOOKEEPER-2099:
--

Looks like the test needs to be hardened a bit. I think I see the issue - 
QuorumBase.waitForServerUp doesn't guarantee that the client the test is using 
to create the nodes is also connected. I ran the test a few dozen times on my 
machine and saw no failures, but that's obviously not good enough.

As for the testLE failure, that seems to be a known flaky test (ZOOKEEPER-1932).
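
One way to harden the test for the issue described above is to wait on the client side as well as the server side. The sketch below is illustrative only: `wait_until` is a generic polling helper, and the predicate passed to it stands in for something like checking `ZooKeeper.getState().isConnected()` on the client handle.

```python
import time

# Generic polling helper: QuorumBase.waitForServerUp checks the server side
# only; a hardened test would also wait until the client handle reports
# connected before issuing operations.
def wait_until(predicate, timeout=30.0, interval=0.1):
    """Poll predicate() until it returns True or the timeout expires.
    Returns True on success, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False
```

Using a deadline based on a monotonic clock keeps the wait bounded even if individual predicate calls are slow, so a never-connecting client still fails the test deterministically instead of hanging.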



[jira] [Commented] (ZOOKEEPER-2099) Using txnlog to sync a learner can corrupt the learner's datatree

2016-10-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593016#comment-15593016
 ] 

Hadoop QA commented on ZOOKEEPER-2099:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12834543/ZOOKEEPER-2099.patch
  against trunk revision cef5978969bedfe066f903834a9ea4af6d508844.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 9 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 2.0.3) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3495//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3495//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3495//console

This message is automatically generated.


Failed: ZOOKEEPER-2099 PreCommit Build #3495

2016-10-20 Thread Apache Jenkins Server
Jira: https://issues.apache.org/jira/browse/ZOOKEEPER-2099
Build: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3495/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 411846 lines...]
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
(version 2.0.3) warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.
 [exec] 
 [exec] -1 core tests.  The patch failed core unit tests.
 [exec] 
 [exec] +1 contrib tests.  The patch passed contrib unit tests.
 [exec] 
 [exec] Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3495//testReport/
 [exec] Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3495//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
 [exec] Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3495//console
 [exec] 
 [exec] This message is automatically generated.
 [exec] 
 [exec] 
 [exec] 
==
 [exec] 
==
 [exec] Adding comment to Jira.
 [exec] 
==
 [exec] 
==
 [exec] 
 [exec] 
 [exec] Comment added.
 [exec] 9cf4a0b0d586cd99655a7c4736f37aea43951f32 logged out
 [exec] 
 [exec] 
 [exec] 
==
 [exec] 
==
 [exec] Finished build.
 [exec] 
==
 [exec] 
==
 [exec] 
 [exec] 
 [exec] mv: 
‘/home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-Build/patchprocess’ 
and 
‘/home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-Build/patchprocess’ 
are the same file

BUILD FAILED
/home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-Build/build.xml:1605: 
exec returned: 1

Total time: 13 minutes 50 seconds
Build step 'Execute shell' marked build as failure
Archiving artifacts
Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7
Compressed 550.71 KB of artifacts by 23.2% relative to #3494
Recording test results
Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7
Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7
[description-setter] Description set: ZOOKEEPER-2099
Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7
Email was triggered for: Failure - Any
Sending email for trigger: Failure - Any
Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7
Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7
Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7
Setting JDK_1_7_LATEST__HOME=/home/jenkins/tools/java/latest1.7



###
## FAILED TESTS (if any) 
##
2 tests failed.
FAILED:  
org.apache.zookeeper.server.quorum.QuorumPeerMainTest.testTransactionLogGap

Error Message:
KeeperErrorCode = ConnectionLoss for /x

Stack Trace:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss for /x
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1415)
at 
org.apache.zookeeper.server.quorum.QuorumPeerMainTest.testTransactionLogGap(QuorumPeerMainTest.java:373)
at 
org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:79)


FAILED:  org.apache.zookeeper.test.LETest.testLE

Error Message:
Threads didn't join

Stack Trace:
junit.framework.AssertionFailedError: Threads didn't join
at org.apache.zookeeper.test.LETest.testLE(LETest.java:123)
at 
org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:79)




[jira] [Updated] (ZOOKEEPER-2099) Using txnlog to sync a learner can corrupt the learner's datatree

2016-10-20 Thread Martin Kuchta (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Kuchta updated ZOOKEEPER-2099:
-
Attachment: ZOOKEEPER-2099.patch

Attached is an initial attempt at fixing this.

I was investigating an issue with massive inconsistency between ZK servers and 
developed my own test to reproduce it. I originally thought our issue was 
isolated to ephemeral znodes, so I didn't find this JIRA in my searches, but 
the test I wrote ended up being almost exactly the same as the one provided by 
[~svoutil]. I did find another detail in my investigation that seems important: 
this bug not only causes the leader to skip sending the correct updates to the 
follower, but the leader can also incorrectly tell the follower to truncate its 
log, resulting in the loss of even more transactions on the follower. This 
requires a slightly different sequence of events (a few more steps). I found it 
easier to use my original test as a base and make some improvements using the 
test submitted here, which is why the test I'm submitting looks so different.

h4. Test

Here's a description of what the test does:

# Enable forceSnapSync
# Set a very high snapshotSizeFactor to guarantee a DIFF if forceSnapSync is off
# Start 5 servers
# Create a baseline znode /w (not necessary, but shows where the data loss 
starts)
# Shutdown SID 4
# Create /x while SID 4 is down
# Shutdown SID 0
# Create /y while SIDs 0 and 4 are down
# Start SID 4 (which receives a SNAP from the current leader because of 
forceSnapSync=true)
# Create /z while SID 0 is down
# Disable forceSnapSync
# Shutdown current leader - SID 4 becomes leader
# Start SID 0 (which receives a TRUNC from SID 4 without the fix and a SNAP 
with the fix)
# Check for the presence of all znodes on all servers (without the fix, SID 0 
is missing /x and /y)

More detail on what goes wrong in step 13:

(Using W = the zxid of the transaction which creates /w, X for /x, etc.)

At this point, SID 4 has W and Z in its log and a snapshot containing the 
updates from W, X, and Y. It tries to sync with SID 0 (whose last zxid is X, 
since SID 0 was down for Y and Z), and iterates through its log until it finds 
a zxid > X. It then looks back at the previous log entry (W), sees that W < X, 
and tells SID 0 to truncate its log to W. After this, it starts sending 
updates from Z onward. SID 0 therefore deletes X and never receives Y. The 
only correct thing for SID 4 to do here is to send a snapshot.
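The faulty log walk can be modeled in a few lines of Java. This is a toy 
sketch under assumptions, not the real LearnerHandler/ZKDatabase code: zxids 
are plain longs (W=1, X=2, Y=3, Z=4), SID 4's log holds only W and Z, and all 
method and field names here (naiveDecide, fixedDecide, lastSnapReceived) are 
hypothetical stand-ins for the logic described in this comment:

```java
import java.util.List;

// Toy model of the leader's sync decision (hypothetical names; the real logic
// lives in LearnerHandler/ZKDatabase). Zxids are plain longs: W=1, X=2, Y=3, Z=4.
public class SyncDecision {
    enum Op { DIFF, TRUNC, SNAP }

    // Buggy walk: scan the leader's txnlog for the first zxid past the
    // follower, then TRUNC back to the previous entry if it is older than the
    // follower. It never asks whether the log is contiguous in between.
    static Op naiveDecide(List<Long> leaderLog, long followerZxid) {
        long prev = -1;
        for (long zxid : leaderLog) {
            if (zxid > followerZxid) {
                return (prev < followerZxid) ? Op.TRUNC : Op.DIFF;
            }
            prev = zxid;
        }
        return Op.DIFF;
    }

    // Guarded walk: if the follower is behind the zxid of the last snapshot
    // this server received, the log has a gap there and only SNAP is safe.
    static Op fixedDecide(List<Long> leaderLog, long followerZxid,
                          long lastSnapReceived) {
        if (followerZxid < lastSnapReceived) {
            return Op.SNAP;
        }
        return naiveDecide(leaderLog, followerZxid);
    }

    public static void main(String[] args) {
        long W = 1, X = 2, Y = 3, Z = 4;
        List<Long> sid4Log = List.of(W, Z);  // X and Y arrived via a snapshot
        System.out.println(naiveDecide(sid4Log, X));     // TRUNC: X and Y are lost
        System.out.println(fixedDecide(sid4Log, X, Y));  // SNAP: safe
    }
}
```

Running this prints TRUNC for the unguarded walk and SNAP once the guard is in 
place, matching the behavior of step 13 with and without the fix.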

h4. Fix

The approach of writing the zxid of the last SNAP received to a file in the 
data dir and checking that value when syncing a follower seems best. The patch 
adds code to ZKDatabase to handle this file (called lastSnapReceived). 
LearnerHandler checks the lastSnapReceived value in syncFollower, and if it 
falls in the range of transactions the follower needs, a snapshot is sent 
instead of a DIFF or TRUNC.

We desperately need this fix because of the massive issues the bug is causing, 
so I will be doing as much testing as I can around it before fixing our 
internal version of ZK. It would be great to also get it polished to a state 
where it could be included in a future 3.5.x version.

Some big points to discuss:
* What should ZKDatabase/Learner do if it can't create or write to the file? 
It currently doesn't handle any exceptions, which will result in the Learner 
stopping. This ensures correctness, but introduces another way for a Learner 
to fail.
* What should ZKDatabase/LearnerHandler do if it can't read the file? 
LearnerHandler currently catches all exceptions and falls back to sending a 
SNAP. This is always correct, but there will be performance loss in syncing new 
learners if the file becomes unreadable/corrupted somehow.
* Is there risk with upgrades or downgrades? It doesn't seem like there should 
be. Versions without the fix will just ignore the file if it's present in their 
data dir. Upgrading from a version without the fix to a version with the fix 
will result in the file being written when initializing the ZKDatabase.
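The read-failure fallback in the second point can be made explicit. A minimal 
sketch with hypothetical names, assuming the file stores the zxid as decimal 
text and using Long.MAX_VALUE as a sentinel so the followerZxid < 
lastSnapReceived check always chooses SNAP:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch of the conservative fallback: an unreadable or corrupt
// lastSnapReceived file degrades to "always send SNAP", which is always
// correct but costs sync performance for new learners.
public class LastSnapReceived {
    // Sentinel: every follower zxid compares below this, forcing SNAP.
    static final long FORCE_SNAP = Long.MAX_VALUE;

    static long readOrForceSnap(Path file) {
        try {
            return Long.parseLong(new String(Files.readAllBytes(file)).trim());
        } catch (IOException | NumberFormatException e) {
            // Can't trust the log's continuity; fall back to full snapshots.
            return FORCE_SNAP;
        }
    }

    public static void main(String[] args) {
        // Missing file: the sentinel comes back instead of an exception.
        System.out.println(readOrForceSnap(Paths.get("/nonexistent/lastSnapReceived")));
    }
}
```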

Smaller points I couldn't decide on:
* Is it acceptable to enforce snapLog being non-null when constructing a 
ZKDatabase now? I had to modify some unit tests, but I liked that better than a 
test-only null check in the constructor.
* Should zookeeper.forceSnapSync and zookeeper.snapshotSizeFactor be settable 
system properties? The property names were included in the relevant classes but 
never used, and I wasn't sure if that was intended or not.
* Is IOUtils a good home for writeLongToFileAtomic since QuorumPeer and 
ZKDatabase both need that logic now?
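On the IOUtils question: one common shape for a writeLongToFileAtomic helper 
is write-to-temp-file-then-rename, so readers never observe a half-written 
value. This is a sketch of that pattern, not the patch's actual 
implementation:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical write-to-temp-then-rename helper; the rename is atomic on the
// same filesystem, so the target file is always either the old or new value.
public class AtomicLongFile {
    public static void writeLongToFileAtomic(Path file, long value) throws IOException {
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        Files.write(tmp, Long.toString(value).getBytes(StandardCharsets.US_ASCII));
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING,
                   StandardCopyOption.ATOMIC_MOVE);
    }

    public static long readLongFromFile(Path file) throws IOException {
        return Long.parseLong(new String(Files.readAllBytes(file),
                                         StandardCharsets.US_ASCII).trim());
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempDirectory("zkdb").resolve("lastSnapReceived");
        writeLongToFileAtomic(f, 0x300000007L);
        System.out.println(Long.toHexString(readLongFromFile(f)));  // 300000007
    }
}
```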

Patch generated against master; it also seems to apply to branch-3.5. Not 
needed in branch-3.4 since the issue was introduced in 3.5.0.


> Using txnlog to sync a learner can corrupt the learner's datatree
> -
>
> Key: ZOOKEEPER-2099
> URL: 

[jira] [Commented] (ZOOKEEPER-761) Remove *synchronous* calls from the *single-threaded* C clieant API, since they are documented not to work

2016-10-20 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15592526#comment-15592526
 ] 

ASF GitHub Bot commented on ZOOKEEPER-761:
--

GitHub user breed opened a pull request:

https://github.com/apache/zookeeper/pull/90

ZOOKEEPER-761: Remove *synchronous* calls from the *single-threaded* C 
client API

the synchronous calls from a single-threaded client do not work. this patch
makes using them in a single-threaded client a compilation error.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/breed/zookeeper ZOOKEEPER-761

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/zookeeper/pull/90.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #90


commit af04dd6fc0b74ab723e6ce449c0e80cc73df
Author: Ben Reed 
Date:   2016-10-19T17:36:44Z

ZOOKEEPER-761: Remove *synchronous* calls from the *single-threaded* C 
client API

the synchronous calls from a single-threaded client do not work. this patch
makes using them in a single-threaded client a compilation error.




> Remove *synchronous* calls from the *single-threaded* C clieant API, since 
> they are documented not to work
> --
>
> Key: ZOOKEEPER-761
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-761
> Project: ZooKeeper
>  Issue Type: Improvement
>  Components: c client
>Affects Versions: 3.1.1, 3.2.2
> Environment: RHEL 4u8 (Linux).  The issue is not OS-specific though.
>Reporter: Jozef Hatala
>Assignee: Pierre Habouzit
>Priority: Minor
> Fix For: 3.5.3, 3.6.0
>
> Attachments: fix-sync-apis-in-st-adaptor.patch, 
> fix-sync-apis-in-st-adaptor.v2.patch
>
>
> Since the synchronous calls are 
> [known|http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#Using+the+C+Client]
>  to be unimplemented in the single threaded version of the client library 
> libzookeeper_st.so, I believe it would be helpful to users of the 
> library if that information were also obvious from the header file.
> Anecdotally more than one of us here made the mistake of starting by using 
> the synchronous calls with the single-threaded library, and we found 
> ourselves debugging it.  An early warning would have been greatly appreciated.
> 1. Could you please add warnings to the doxygen blocks of all synchronous 
> calls saying that they are not available in the single-threaded API.  This 
> cannot be safely done with {{#ifdef THREADED}}, obviously, because the same 
> header file is included whichever client library implementation one is 
> compiling for.
> 2. Could you please bracket the implementation of all synchronous calls in 
> zookeeper.c with {{#ifdef THREADED}} and {{#endif}}, so that those symbols 
> are not present in libzookeeper_st.so?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] zookeeper pull request #90: ZOOKEEPER-761: Remove *synchronous* calls from t...

2016-10-20 Thread breed


[jira] [Commented] (ZOOKEEPER-2614) Port ZOOKEEPER-1576 to branch3.4

2016-10-20 Thread Edward Ribeiro (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15592492#comment-15592492
 ] 

Edward Ribeiro commented on ZOOKEEPER-2614:
---

Yup, my fault. I should have checked whether it was backported to 3.4 back 
then (it's an old patch). Could any committer please check and commit 
[~vishk]'s patch?

It LGTM, but a second opinion would be great. Thanks!

> Port ZOOKEEPER-1576 to branch3.4
> 
>
> Key: ZOOKEEPER-2614
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2614
> Project: ZooKeeper
>  Issue Type: Bug
>Affects Versions: 3.4.9
>Reporter: Vishal Khandelwal
>Assignee: Vishal Khandelwal
> Fix For: 3.4.9
>
> Attachments: ZOOKEEPER-2614.branch-3.4.00.patch
>
>
> ZOOKEEPER-1576 handles UnknownHostException, and it is good to have this 
> change for the 3.4 branch as well. Porting the changes to 3.4 after 
> resolving the conflicts.





ZooKeeper_branch35_solaris - Build # 290 - Still Failing

2016-10-20 Thread Apache Jenkins Server
See https://builds.apache.org/job/ZooKeeper_branch35_solaris/290/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 10421 lines...]
[junit] 2016-10-20 16:19:39,428 [myid:] - INFO  [main:JMXEnv@245] - 
expect:InMemoryDataTree
[junit] 2016-10-20 16:19:39,428 [myid:] - INFO  [main:JMXEnv@249] - 
found:InMemoryDataTree 
org.apache.ZooKeeperService:name0=StandaloneServer_port11231,name1=InMemoryDataTree
[junit] 2016-10-20 16:19:39,428 [myid:] - INFO  [main:JMXEnv@245] - 
expect:StandaloneServer_port
[junit] 2016-10-20 16:19:39,428 [myid:] - INFO  [main:JMXEnv@249] - 
found:StandaloneServer_port 
org.apache.ZooKeeperService:name0=StandaloneServer_port11231
[junit] 2016-10-20 16:19:39,429 [myid:] - INFO  [main:ClientBase@462] - 
Client test setup finished
[junit] 2016-10-20 16:19:39,429 [myid:] - INFO  [main:ZooKeeper@855] - 
Initiating client connection, connectString=127.0.0.1:11231 
sessionTimeout=3 
watcher=org.apache.zookeeper.test.ClientBase$CountdownWatcher@158803f
[junit] 2016-10-20 16:19:39,430 [myid:127.0.0.1:11231] - INFO  
[main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@1113] - Opening socket 
connection to server 127.0.0.1/127.0.0.1:11231. Will not attempt to 
authenticate using SASL (unknown error)
[junit] 2016-10-20 16:19:39,430 [myid:127.0.0.1:11231] - INFO  
[main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@948] - Socket 
connection established, initiating session, client: null, server: null
[junit] 2016-10-20 16:19:39,430 [myid:] - INFO  
[NIOServerCxnFactory.AcceptThread:0.0.0.0/0.0.0.0:11231:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /127.0.0.1:49405
[junit] 2016-10-20 16:19:39,499 [myid:] - INFO  
[NIOWorkerThread-2:ZooKeeperServer@995] - Client attempting to establish new 
session at /127.0.0.1:49405
[junit] 2016-10-20 16:19:39,499 [myid:] - INFO  
[SyncThread:0:FileTxnLog@204] - Creating new log file: log.1
[junit] 2016-10-20 16:19:39,502 [myid:] - INFO  
[SyncThread:0:ZooKeeperServer@709] - Established session 0x1245fdedabe with 
negotiated timeout 3 for client /127.0.0.1:49405
[junit] 2016-10-20 16:19:39,502 [myid:127.0.0.1:11231] - INFO  
[main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@1381] - Session 
establishment complete on server null, sessionid = 0x1245fdedabe, 
negotiated timeout = 3
[junit] 2016-10-20 16:19:39,503 [myid:] - INFO  [main:JMXEnv@117] - 
expect:0x1245fdedabe
[junit] 2016-10-20 16:19:39,504 [myid:] - INFO  [main:JMXEnv@120] - 
found:0x1245fdedabe 
org.apache.ZooKeeperService:name0=StandaloneServer_port11231,name1=Connections,name2=127.0.0.1,name3=0x1245fdedabe
[junit] 2016-10-20 16:19:39,504 [myid:] - INFO  [Time-limited 
test:JUnit4ZKTestRunner$LoggedInvokeMethod@77] - RUNNING TEST METHOD testCreate
[junit] 2016-10-20 16:19:39,513 [myid:] - INFO  [Time-limited 
test:JUnit4ZKTestRunner$LoggedInvokeMethod@82] - Memory used 10281
[junit] 2016-10-20 16:19:39,513 [myid:] - INFO  [Time-limited 
test:JUnit4ZKTestRunner$LoggedInvokeMethod@87] - Number of threads 41
[junit] 2016-10-20 16:19:39,513 [myid:] - INFO  [Time-limited 
test:JUnit4ZKTestRunner$LoggedInvokeMethod@102] - FINISHED TEST METHOD 
testCreate
[junit] 2016-10-20 16:19:39,513 [myid:] - INFO  [main:ClientBase@543] - 
tearDown starting
[junit] 2016-10-20 16:19:39,514 [myid:] - INFO  [ProcessThread(sid:0 
cport:11231)::PrepRequestProcessor@647] - Processed session termination for 
sessionid: 0x1245fdedabe
[junit] 2016-10-20 16:19:39,515 [myid:] - INFO  
[NIOWorkerThread-1:MBeanRegistry@128] - Unregister MBean 
[org.apache.ZooKeeperService:name0=StandaloneServer_port11231,name1=Connections,name2=127.0.0.1,name3=0x1245fdedabe]
[junit] 2016-10-20 16:19:39,515 [myid:] - INFO  [main:ZooKeeper@1313] - 
Session: 0x1245fdedabe closed
[junit] 2016-10-20 16:19:39,515 [myid:] - INFO  [main:ClientBase@513] - 
STOPPING server
[junit] 2016-10-20 16:19:39,515 [myid:] - INFO  
[main-EventThread:ClientCnxn$EventThread@513] - EventThread shut down for 
session: 0x1245fdedabe
[junit] 2016-10-20 16:19:39,515 [myid:] - INFO  
[NIOWorkerThread-1:NIOServerCnxn@607] - Closed socket connection for client 
/127.0.0.1:49405 which had sessionid 0x1245fdedabe
[junit] 2016-10-20 16:19:39,516 [myid:] - INFO  
[ConnnectionExpirer:NIOServerCnxnFactory$ConnectionExpirerThread@583] - 
ConnnectionExpirerThread interrupted
[junit] 2016-10-20 16:19:39,516 [myid:] - INFO  
[NIOServerCxnFactory.AcceptThread:0.0.0.0/0.0.0.0:11231:NIOServerCnxnFactory$AcceptThread@219]
 - accept thread exitted run method
[junit] 2016-10-20 16:19:39,516 [myid:] - INFO  
[NIOServerCxnFactory.SelectorThread-0:NIOServerCnxnFactory$SelectorThread@420] 
- selector thread exitted run method
[junit] 2016-10-20 16:19:39,517 

ZooKeeper_branch35_openjdk7 - Build # 270 - Failure

2016-10-20 Thread Apache Jenkins Server
See https://builds.apache.org/job/ZooKeeper_branch35_openjdk7/270/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 464540 lines...]
[junit] at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
[junit] at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:357)
[junit] at 
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1214)
[junit] 2016-10-20 10:19:47,553 [myid:127.0.0.1:22118] - INFO  
[main-SendThread(127.0.0.1:22118):ClientCnxn$SendThread@1113] - Opening socket 
connection to server 127.0.0.1/127.0.0.1:22118. Will not attempt to 
authenticate using SASL (unknown error)
[junit] 2016-10-20 10:19:47,553 [myid:127.0.0.1:22118] - WARN  
[main-SendThread(127.0.0.1:22118):ClientCnxn$SendThread@1235] - Session 
0x200754833ca for server 127.0.0.1/127.0.0.1:22118, unexpected error, 
closing socket connection and attempting reconnect
[junit] java.net.ConnectException: Connection refused
[junit] at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
[junit] at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
[junit] at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:357)
[junit] at 
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1214)
[junit] 2016-10-20 10:19:48,609 [myid:127.0.0.1:21994] - INFO  
[main-SendThread(127.0.0.1:21994):ClientCnxn$SendThread@1113] - Opening socket 
connection to server 127.0.0.1/127.0.0.1:21994. Will not attempt to 
authenticate using SASL (unknown error)
[junit] 2016-10-20 10:19:48,610 [myid:127.0.0.1:21994] - WARN  
[main-SendThread(127.0.0.1:21994):ClientCnxn$SendThread@1235] - Session 
0x10075433863 for server 127.0.0.1/127.0.0.1:21994, unexpected error, 
closing socket connection and attempting reconnect
[junit] java.net.ConnectException: Connection refused
[junit] at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
[junit] at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
[junit] at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:357)
[junit] at 
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1214)
[junit] 2016-10-20 10:19:48,729 [myid:127.0.0.1:22115] - INFO  
[main-SendThread(127.0.0.1:22115):ClientCnxn$SendThread@1113] - Opening socket 
connection to server 127.0.0.1/127.0.0.1:22115. Will not attempt to 
authenticate using SASL (unknown error)
[junit] 2016-10-20 10:19:48,730 [myid:127.0.0.1:22115] - WARN  
[main-SendThread(127.0.0.1:22115):ClientCnxn$SendThread@1235] - Session 
0x100754833ca for server 127.0.0.1/127.0.0.1:22115, unexpected error, 
closing socket connection and attempting reconnect
[junit] java.net.ConnectException: Connection refused
[junit] at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
[junit] at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
[junit] at 
org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:357)
[junit] at 
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1214)
[junit] 2016-10-20 10:19:48,799 [myid:127.0.0.1:22142] - INFO  
[main-SendThread(127.0.0.1:22142):ClientCnxn$SendThread@1113] - Opening socket 
connection to server 127.0.0.1/127.0.0.1:22142. Will not attempt to 
authenticate using SASL (unknown error)
[junit] 2016-10-20 10:19:48,800 [myid:127.0.0.1:22142] - INFO  
[main-SendThread(127.0.0.1:22142):ClientCnxn$SendThread@948] - Socket 
connection established, initiating session, client: /127.0.0.1:32794, server: 
127.0.0.1/127.0.0.1:22142
[junit] 2016-10-20 10:19:48,800 [myid:] - WARN  [New I/O worker 
#4090:NettyServerCnxn@399] - Closing connection to /127.0.0.1:32794
[junit] java.io.IOException: ZK down
[junit] at 
org.apache.zookeeper.server.NettyServerCnxn.receiveMessage(NettyServerCnxn.java:336)
[junit] at 
org.apache.zookeeper.server.NettyServerCnxnFactory$CnxnChannelHandler.processMessage(NettyServerCnxnFactory.java:244)
[junit] at 
org.apache.zookeeper.server.NettyServerCnxnFactory$CnxnChannelHandler.messageReceived(NettyServerCnxnFactory.java:166)
[junit] at 
org.jboss.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:88)
[junit] at 
org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
[junit] at 
org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
[junit] at 
org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
[junit] at 
org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
[junit] at 

ZooKeeper_branch35_jdk7 - Build # 705 - Still Failing

2016-10-20 Thread Apache Jenkins Server
See https://builds.apache.org/job/ZooKeeper_branch35_jdk7/705/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 35 lines...]
 [echo] contrib: zkfuse

clean:
 [echo] contrib: zkperl

clean:
 [echo] contrib: zkpython

clean:
 [echo] contrib: zktreeutil

clean:
 [echo] contrib: ZooInspector

clean-recipes:

clean:

clean:
 [echo] recipes: election

clean:
 [echo] recipes: lock

clean:
 [echo] recipes: queue

clean:

init:
[mkdir] Created dir: 
/home/jenkins/jenkins-slave/workspace/ZooKeeper_branch35_jdk7/build/classes
[mkdir] Created dir: 
/home/jenkins/jenkins-slave/workspace/ZooKeeper_branch35_jdk7/build/lib
[mkdir] Created dir: 
/home/jenkins/jenkins-slave/workspace/ZooKeeper_branch35_jdk7/build/package/lib
[mkdir] Created dir: 
/home/jenkins/jenkins-slave/workspace/ZooKeeper_branch35_jdk7/build/test/lib

ivy-download:
  [get] Getting: 
https://repo1.maven.org/maven2/org/apache/ivy/ivy/2.4.0/ivy-2.4.0.jar
  [get] To: 
/home/jenkins/jenkins-slave/workspace/ZooKeeper_branch35_jdk7/src/java/lib/ivy-2.4.0.jar
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (sharedRuntime.cpp:814), pid=26759, tid=140328636253952
#  guarantee(cb->is_adapter_blob() || cb->is_method_handles_adapter_blob()) 
failed: exception happened outside interpreter, nmethods and vtable stubs (1)
#
# JRE version: Java(TM) SE Runtime Environment (7.0_80-b15) (build 1.7.0_80-b15)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.80-b11 mixed mode linux-amd64 
compressed oops)
# Failed to write core dump. Core dumps have been disabled. To enable core 
dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# 
/home/jenkins/jenkins-slave/workspace/ZooKeeper_branch35_jdk7/hs_err_pid26759.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
Build step 'Invoke Ant' marked build as failure
Recording test results
ERROR: Step ‘Publish JUnit test result report’ failed: No test report files 
were found. Configuration error?
Email was triggered for: Failure - Any
Sending email for trigger: Failure - Any



###
## FAILED TESTS (if any) 
##
No tests ran.

ZooKeeper-trunk-solaris - Build # 1355 - Still Failing

2016-10-20 Thread Apache Jenkins Server
See https://builds.apache.org/job/ZooKeeper-trunk-solaris/1355/

###
## LAST 60 LINES OF THE CONSOLE 
###
[...truncated 443604 lines...]
[junit] 2016-10-20 08:21:52,776 [myid:] - INFO  [main:ClientBase@386] - 
CREATING server instance 127.0.0.1:11222
[junit] 2016-10-20 08:21:52,776 [myid:] - INFO  
[main:NIOServerCnxnFactory@673] - Configuring NIO connection handler with 10s 
sessionless connection timeout, 2 selector thread(s), 16 worker threads, and 64 
kB direct buffers.
[junit] 2016-10-20 08:21:52,777 [myid:] - INFO  
[main:NIOServerCnxnFactory@686] - binding to port 0.0.0.0/0.0.0.0:11222
[junit] 2016-10-20 08:21:52,778 [myid:] - INFO  [main:ClientBase@361] - 
STARTING server instance 127.0.0.1:11222
[junit] 2016-10-20 08:21:52,778 [myid:] - INFO  [main:ZooKeeperServer@889] 
- minSessionTimeout set to 6000
[junit] 2016-10-20 08:21:52,778 [myid:] - INFO  [main:ZooKeeperServer@898] 
- maxSessionTimeout set to 6
[junit] 2016-10-20 08:21:52,778 [myid:] - INFO  [main:ZooKeeperServer@159] 
- Created server with tickTime 3000 minSessionTimeout 6000 maxSessionTimeout 
6 datadir 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/ZooKeeper-trunk-solaris/build/test/tmp/test4058106769337100180.junit.dir/version-2
 snapdir 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/ZooKeeper-trunk-solaris/build/test/tmp/test4058106769337100180.junit.dir/version-2
[junit] 2016-10-20 08:21:52,779 [myid:] - INFO  [main:FileSnap@83] - 
Reading snapshot 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/ZooKeeper-trunk-solaris/build/test/tmp/test4058106769337100180.junit.dir/version-2/snapshot.b
[junit] 2016-10-20 08:21:52,782 [myid:] - INFO  [main:FileTxnSnapLog@306] - 
Snapshotting: 0xb to 
/zonestorage/hudson_solaris/home/hudson/hudson-slave/workspace/ZooKeeper-trunk-solaris/build/test/tmp/test4058106769337100180.junit.dir/version-2/snapshot.b
[junit] 2016-10-20 08:21:52,783 [myid:] - ERROR [main:ZooKeeperServer@501] 
- ZKShutdownHandler is not registered, so ZooKeeper server won't take any 
action on ERROR or SHUTDOWN server state changes
[junit] 2016-10-20 08:21:52,784 [myid:] - INFO  
[main:FourLetterWordMain@85] - connecting to 127.0.0.1 11222
[junit] 2016-10-20 08:21:52,784 [myid:] - INFO  
[NIOServerCxnFactory.AcceptThread:0.0.0.0/0.0.0.0:11222:NIOServerCnxnFactory$AcceptThread@296]
 - Accepted socket connection from /127.0.0.1:41620
[junit] 2016-10-20 08:21:52,789 [myid:] - INFO  
[NIOWorkerThread-1:NIOServerCnxn@485] - Processing stat command from 
/127.0.0.1:41620
[junit] 2016-10-20 08:21:52,789 [myid:] - INFO  
[NIOWorkerThread-1:StatCommand@49] - Stat command output
[junit] 2016-10-20 08:21:52,790 [myid:] - INFO  
[NIOWorkerThread-1:NIOServerCnxn@607] - Closed socket connection for client 
/127.0.0.1:41620 (no session established for client)
[junit] 2016-10-20 08:21:52,790 [myid:] - INFO  [main:JMXEnv@228] - 
ensureParent:[InMemoryDataTree, StandaloneServer_port]
[junit] 2016-10-20 08:21:52,791 [myid:] - INFO  [main:JMXEnv@245] - 
expect:InMemoryDataTree
[junit] 2016-10-20 08:21:52,791 [myid:] - INFO  [main:JMXEnv@249] - 
found:InMemoryDataTree 
org.apache.ZooKeeperService:name0=StandaloneServer_port11222,name1=InMemoryDataTree
[junit] 2016-10-20 08:21:52,791 [myid:] - INFO  [main:JMXEnv@245] - 
expect:StandaloneServer_port
[junit] 2016-10-20 08:21:52,792 [myid:] - INFO  [main:JMXEnv@249] - 
found:StandaloneServer_port 
org.apache.ZooKeeperService:name0=StandaloneServer_port11222
[junit] 2016-10-20 08:21:52,792 [myid:] - INFO  
[main:JUnit4ZKTestRunner$LoggedInvokeMethod@82] - Memory used 17666
[junit] 2016-10-20 08:21:52,792 [myid:] - INFO  
[main:JUnit4ZKTestRunner$LoggedInvokeMethod@87] - Number of threads 24
[junit] 2016-10-20 08:21:52,792 [myid:] - INFO  
[main:JUnit4ZKTestRunner$LoggedInvokeMethod@102] - FINISHED TEST METHOD 
testQuota
[junit] 2016-10-20 08:21:52,792 [myid:] - INFO  [main:ClientBase@543] - 
tearDown starting
[junit] 2016-10-20 08:21:52,862 [myid:] - INFO  [main:ZooKeeper@1315] - 
Session: 0x1245e296ec8 closed
[junit] 2016-10-20 08:21:52,862 [myid:] - INFO  
[main-EventThread:ClientCnxn$EventThread@513] - EventThread shut down for 
session: 0x1245e296ec8
[junit] 2016-10-20 08:21:52,862 [myid:] - INFO  [main:ClientBase@513] - 
STOPPING server
[junit] 2016-10-20 08:21:52,863 [myid:] - INFO  
[NIOServerCxnFactory.AcceptThread:0.0.0.0/0.0.0.0:11222:NIOServerCnxnFactory$AcceptThread@219]
 - accept thread exitted run method
[junit] 2016-10-20 08:21:52,863 [myid:] - INFO  
[NIOServerCxnFactory.SelectorThread-1:NIOServerCnxnFactory$SelectorThread@420] 
- selector thread exitted run method
[junit] 2016-10-20 08:21:52,863 [myid:] - INFO