[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss

2018-12-26 Thread Brian Nixon (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729136#comment-16729136
 ] 

Brian Nixon commented on ZOOKEEPER-2872:


Now that the patch is merged, is there any further work to do here?

> Interrupted snapshot sync causes data loss
> --
>
> Key: ZOOKEEPER-2872
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2872
> Project: ZooKeeper
>  Issue Type: Bug
>  Components: server
>Affects Versions: 3.4.10, 3.5.3, 3.6.0
>Reporter: Brian Nixon
>Priority: Major
>
> There is a way for observers to permanently lose data from their local data 
> tree while remaining members in good standing with the ensemble and 
> continuing to serve client traffic when the following chain of events occurs.
> 1. The observer dies in epoch N from machine failure.
> 2. The observer comes back up in epoch N+1 and requests a snapshot sync to 
> catch up.
> 3. The machine powers off before the snapshot is synced to disk and after 
> some txns have been logged (depending on the OS, this can happen!).
> 4. The observer comes back a second time and replays its most recent snapshot 
> (epoch <= N) as well as the txn logs (epoch N+1). 
> 5. A diff sync is requested from the leader and the observer broadcasts 
> availability.
> In this scenario, any commits from epoch N that the observer did not receive 
> before it died the first time will never be exposed to the observer and no 
> part of the ensemble will complain. 
> This situation is not unique to observers and can happen to any learner. As a 
> simple fix, fsync-ing the snapshots received from the leader will avoid the 
> case of missing snapshots causing data loss.
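
The fix proposed in the description can be illustrated with a minimal sketch. The class and method names below are hypothetical and are not taken from the ZooKeeper code base; the sketch only assumes that a snapshot is a byte array written to a local file, and shows forcing it to the storage device before the write is treated as complete.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

/** Hypothetical helper for illustration only; not ZooKeeper's snapshot code. */
final class DurableSnapshotWriter {

    /** Writes the snapshot bytes and fsyncs them before returning. */
    static void writeAndSync(File target, byte[] snapshotBytes) throws IOException {
        try (FileOutputStream out = new FileOutputStream(target)) {
            out.write(snapshotBytes);
            out.flush();
            // Ask the OS to push the file contents to the storage device,
            // so a power loss after this point should not lose the snapshot.
            out.getFD().sync();
        }
    }
}
```

Without the `sync()` call the data may still sit in the OS page cache when the machine powers off, which is the window described in step 3.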



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss

2017-08-17 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131747#comment-16131747
 ] 

Hudson commented on ZOOKEEPER-2872:
---

SUCCESS: Integrated in Jenkins build ZooKeeper-trunk #3503 (See 
[https://builds.apache.org/job/ZooKeeper-trunk/3503/])
ZOOKEEPER-2872: Interrupted snapshot sync causes data loss (hanm: rev 
0706b40afad079f19fe9f76c99bbb7ec69780dbd)
* (edit) src/java/test/org/apache/zookeeper/server/quorum/Zab1_0Test.java
* (edit) src/java/test/org/apache/zookeeper/test/TruncateTest.java
* (edit) src/java/main/org/apache/zookeeper/server/quorum/Learner.java
* (edit) src/java/main/org/apache/zookeeper/server/persistence/SnapShot.java
* (edit) src/java/main/org/apache/zookeeper/server/persistence/FileSnap.java
* (edit) src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java
* (edit) 
src/java/main/org/apache/zookeeper/server/persistence/FileTxnSnapLog.java




[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss

2017-08-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131705#comment-16131705
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2872:
---

Github user hanm commented on the issue:

https://github.com/apache/zookeeper/pull/333
  
Committed to master: 0706b40afad079f19fe9f76c99bbb7ec69780dbd

The JIRA will be resolved after the merge conflicts are fixed and the change 
is committed to branch-3.4 and branch-3.5.




[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss

2017-08-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131702#comment-16131702
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2872:
---

Github user asfgit closed the pull request at:

https://github.com/apache/zookeeper/pull/333




[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss

2017-08-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131683#comment-16131683
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2872:
---

Github user hanm commented on the issue:

https://github.com/apache/zookeeper/pull/333
  
>> it seems best to keep snapshot taking a lighter weight operation.

Sounds reasonable.

>> I am unable to reproduce the test failure in Zab1_0Test

I think it's a flaky test. Filed ZOOKEEPER-2877 for this.




[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss

2017-08-17 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16130996#comment-16130996
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2872:
---

Github user enixon commented on the issue:

https://github.com/apache/zookeeper/pull/333
  
I am unable to reproduce the test failure in Zab1_0Test




[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss

2017-08-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129973#comment-16129973
 ] 

Hadoop QA commented on ZOOKEEPER-2872:
--

-1 overall.  GitHub Pull Request  Build
  

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 6 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 3.0.1) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/942//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/942//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/942//console

This message is automatically generated.



[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss

2017-08-16 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129963#comment-16129963
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2872:
---

Github user enixon commented on the issue:

https://github.com/apache/zookeeper/pull/333
  
We contemplated doing an fsync for every snapshot and decided against it. 
You would take a guaranteed I/O spike each time. That's fine when you're just 
syncing with the quorum, but during normal operation it seems best to keep 
snapshot taking a lighter-weight operation.




[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss

2017-08-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16124803#comment-16124803
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2872:
---

Github user hanm commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/333#discussion_r132832594
  
--- Diff: src/java/main/org/apache/zookeeper/server/quorum/Learner.java ---
@@ -364,6 +364,7 @@ protected void syncWithLeader(long newLeaderZxid) throws Exception{
 readPacket(qp);
 LinkedList packetsCommitted = new LinkedList();
 LinkedList packetsNotCommitted = new LinkedList();
+boolean syncSnapshot = false;
--- End diff --

We can move this variable definition up so it's clustered with the 
`snapshotNeeded` boolean.

Another possibility is to get rid of this variable and use the existing 
`snapshotNeeded` - but that will fsync the snapshot for a TRUNC sync, which the 
existing patch will not. 
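
The difference between the two flags can be sketched as follows. This is an illustration under stated assumptions, not the code in `Learner.java`: the `SyncType` enum and both helper methods are hypothetical, and the DIFF/SNAP/TRUNC behaviour is inferred from the comment above (a snapshot is needed after SNAP and TRUNC, but only SNAP triggers the fsync in the patch).

```java
/** Hypothetical model of the two flags discussed in the review; not ZooKeeper code. */
final class SyncSnapshotSketch {

    /** Assumed names for the ways a leader can bring a learner up to date. */
    enum SyncType { DIFF, SNAP, TRUNC }

    /** Assumption: a snapshot is taken after SNAP and TRUNC, not after DIFF. */
    static boolean snapshotNeeded(SyncType type) {
        return type == SyncType.SNAP || type == SyncType.TRUNC;
    }

    /** Per the comment above, the patch fsyncs the snapshot only for SNAP. */
    static boolean syncSnapshot(SyncType type) {
        return type == SyncType.SNAP;
    }
}
```

Reusing `snapshotNeeded` for the fsync decision would make the two methods identical, which is how TRUNC would pick up an fsync it does not get today.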




[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss

2017-08-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16124805#comment-16124805
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2872:
---

Github user hanm commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/333#discussion_r132832602
  
--- Diff: src/java/main/org/apache/zookeeper/server/quorum/Learner.java ---
@@ -364,6 +364,7 @@ protected void syncWithLeader(long newLeaderZxid) throws Exception{
 readPacket(qp);
 LinkedList packetsCommitted = new LinkedList();
 LinkedList packetsNotCommitted = new LinkedList();
+boolean syncSnapshot = false;
--- End diff --

Another possibility, as I just commented, is to get rid of this variable and 
always fsync the snapshot on serialization.




[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss

2017-08-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16124802#comment-16124802
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2872:
---

Github user hanm commented on the issue:

https://github.com/apache/zookeeper/pull/333
  
I am now wondering why we should not fsync the snapshot in all cases. It 
seems to be a useful property to have for snapshot serialization, and it would 
make the code simpler. Were there performance considerations that led to 
applying the fsync only when it's a SNAP sync?




[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss

2017-08-11 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123755#comment-16123755
 ] 

Hadoop QA commented on ZOOKEEPER-2872:
--

+1 overall.  GitHub Pull Request  Build
  

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 6 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 3.0.1) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/939//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/939//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/939//console

This message is automatically generated.



[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss

2017-08-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122684#comment-16122684
 ] 

Hadoop QA commented on ZOOKEEPER-2872:
--

-1 overall.  GitHub Pull Request  Build
  

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 6 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 3.0.1) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/938//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/938//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/938//console

This message is automatically generated.



[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122615#comment-16122615
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2872:
---

Github user enixon commented on the issue:

https://github.com/apache/zookeeper/pull/333
  
AtomicFileOutputStream performs an fsync when the stream is closed, using the 
following call:
"((FileOutputStream) out).getFD().sync();"

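
The fsync-on-close idea can be sketched roughly as below. This is not the `AtomicFileOutputStream` implementation; the class name, the `.tmp` suffix, and the rename step are assumptions for illustration. The sketch writes to a temporary sibling file, fsyncs it through the same `getFD().sync()` call quoted above, and only then moves it into place.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical write-then-rename helper; not ZooKeeper's AtomicFileOutputStream. */
final class AtomicWriteSketch {

    static void writeAtomically(Path target, byte[] data) throws IOException {
        // Assumed naming scheme: a ".tmp" sibling next to the final file.
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        try (FileOutputStream out = new FileOutputStream(tmp.toFile())) {
            out.write(data);
            out.flush();
            // Force the temporary file's contents to disk before it becomes visible.
            out.getFD().sync();
        }
        // Readers see either no file or the complete, already-synced file.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

A reader that finds the target file can then assume its contents were synced, at the cost of one fsync per write.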







[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss

2017-08-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122093#comment-16122093
 ] 

Hadoop QA commented on ZOOKEEPER-2872:
--

-1 overall.  GitHub Pull Request  Build
  

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 6 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs (version 3.0.1) 
warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/937//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/937//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/937//console

This message is automatically generated.






[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss

2017-08-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122076#comment-16122076
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2872:
---

GitHub user enixon opened a pull request:

https://github.com/apache/zookeeper/pull/333

ZOOKEEPER-2872: Interrupted snapshot sync causes data loss



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/enixon/zookeeper snap-sync

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/zookeeper/pull/333.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #333


commit 39bd1a3eb9171a014845fff97648341cbfb40a11
Author: Brian Nixon 
Date:   2017-08-01T20:25:51Z

ZOOKEEPER-2872: Interrupted snapshot sync causes data loss






