[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729136#comment-16729136 ]

Brian Nixon commented on ZOOKEEPER-2872:
----------------------------------------

Now that the patch is merged, was there any further work here?

> Interrupted snapshot sync causes data loss
> ------------------------------------------
>
>                 Key: ZOOKEEPER-2872
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2872
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.10, 3.5.3, 3.6.0
>            Reporter: Brian Nixon
>            Priority: Major
>
> There is a way for observers to permanently lose data from their local data
> tree while remaining members of good standing with the ensemble and
> continuing to serve client traffic when the following chain of events occurs.
> 1. The observer dies in epoch N from machine failure.
> 2. The observer comes back up in epoch N+1 and requests a snapshot sync to
> catch up.
> 3. The machine powers off before the snapshot is synced to disc and after
> some txn's have been logged (depending on the OS, this can happen!).
> 4. The observer comes back a second time and replays its most recent snapshot
> (epoch <= N) as well as the txn logs (epoch N+1).
> 5. A diff sync is requested from the leader and the observer broadcasts
> availability.
> In this scenario, any commits from epoch N that the observer did not receive
> before it died the first time will never be exposed to the observer and no
> part of the ensemble will complain.
> This situation is not unique to observers and can happen to any learner. As a
> simple fix, fsync-ing the snapshots received from the leader will avoid the
> case of missing snapshots causing data loss.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
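
For readers following the thread, the fix proposed in the description (fsync the snapshot received from the leader before relying on it) can be illustrated with a minimal Java sketch. This is not the ZooKeeper patch itself: the class name `DurableSnapshotSketch`, the method `writeSnapshot`, and the `fsync` parameter are hypothetical names chosen for illustration. The only real API relied on is `FileDescriptor.sync()`, which asks the OS to flush its buffers for that file to the storage device.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Path;

// Hypothetical illustration only; not code from the ZooKeeper patch.
final class DurableSnapshotSketch {

    /**
     * Writes an already-serialized data tree to {@code file}.
     * When {@code fsync} is true, the bytes are forced to the device before
     * the method returns, so a power loss afterwards cannot leave the learner
     * with newer txn logs but no matching snapshot (step 3 of the scenario).
     */
    static void writeSnapshot(Path file, byte[] serializedTree, boolean fsync)
            throws IOException {
        try (FileOutputStream out = new FileOutputStream(file.toFile())) {
            out.write(serializedTree);
            out.flush();              // push Java-level buffers to the OS
            if (fsync) {
                out.getFD().sync();   // push OS buffers to the storage device
            }
        }
    }
}
```

Without the `sync()` call, the write may still sit in the OS page cache when the machine powers off, which is the window the issue describes.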
[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131747#comment-16131747 ]

Hudson commented on ZOOKEEPER-2872:
-----------------------------------

SUCCESS: Integrated in Jenkins build ZooKeeper-trunk #3503 (See [https://builds.apache.org/job/ZooKeeper-trunk/3503/])
ZOOKEEPER-2872: Interrupted snapshot sync causes data loss (hanm: rev 0706b40afad079f19fe9f76c99bbb7ec69780dbd)
* (edit) src/java/test/org/apache/zookeeper/server/quorum/Zab1_0Test.java
* (edit) src/java/test/org/apache/zookeeper/test/TruncateTest.java
* (edit) src/java/main/org/apache/zookeeper/server/quorum/Learner.java
* (edit) src/java/main/org/apache/zookeeper/server/persistence/SnapShot.java
* (edit) src/java/main/org/apache/zookeeper/server/persistence/FileSnap.java
* (edit) src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java
* (edit) src/java/main/org/apache/zookeeper/server/persistence/FileTxnSnapLog.java
[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131705#comment-16131705 ]

ASF GitHub Bot commented on ZOOKEEPER-2872:
-------------------------------------------

Github user hanm commented on the issue:

    https://github.com/apache/zookeeper/pull/333

    Committed to master: 0706b40afad079f19fe9f76c99bbb7ec69780dbd
    Pending JIRA resolve after fixing merge conflicts and commit into branch-3.4 and 3.5.
[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131702#comment-16131702 ]

ASF GitHub Bot commented on ZOOKEEPER-2872:
-------------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/zookeeper/pull/333
[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131683#comment-16131683 ]

ASF GitHub Bot commented on ZOOKEEPER-2872:
-------------------------------------------

Github user hanm commented on the issue:

    https://github.com/apache/zookeeper/pull/333

    >> it seems best to keep snapshot taking a lighter weight operation.
    Sounds reasonable.

    >> I am unable to reproduce the test failure in Zab1_0Test
    I think it's a flaky test. Filed ZOOKEEPER-2877 for this.
[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16130996#comment-16130996 ]

ASF GitHub Bot commented on ZOOKEEPER-2872:
-------------------------------------------

Github user enixon commented on the issue:

    https://github.com/apache/zookeeper/pull/333

    I am unable to reproduce the test failure in Zab1_0Test
[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129973#comment-16129973 ]

Hadoop QA commented on ZOOKEEPER-2872:
--------------------------------------

-1 overall.  GitHub Pull Request Build

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 3.0.1) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/942//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/942//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/942//console

This message is automatically generated.
[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16129963#comment-16129963 ]

ASF GitHub Bot commented on ZOOKEEPER-2872:
-------------------------------------------

Github user enixon commented on the issue:

    https://github.com/apache/zookeeper/pull/333

    We contemplated doing an fsync for every snapshot and decided against. You're taking a guaranteed io spike each time. That's fine when you're just syncing with the quorum but during normal operation, it seems best to keep snapshot taking a lighter weight operation.
[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16124803#comment-16124803 ]

ASF GitHub Bot commented on ZOOKEEPER-2872:
-------------------------------------------

Github user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/333#discussion_r132832594

    --- Diff: src/java/main/org/apache/zookeeper/server/quorum/Learner.java ---
    @@ -364,6 +364,7 @@ protected void syncWithLeader(long newLeaderZxid) throws Exception{
                 readPacket(qp);
                 LinkedList packetsCommitted = new LinkedList();
                 LinkedList packetsNotCommitted = new LinkedList();
    +            boolean syncSnapshot = false;
    --- End diff --

    We can move this variable definition up so it's clustered with the `snapshotNeeded` boolean. Another possibility is to get rid of this variable and use the existing `snapshotNeeded`, but that would fsync the snapshot for a TRUNC sync as well, which the existing patch does not.
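
The distinction drawn in this review comment can be made concrete with a small Java sketch. It is an illustration under stated assumptions, not the code in `Learner.java`: the enum `SyncMode` and the class `SyncFlagsSketch` are hypothetical, and the rule that a DIFF sync needs no snapshot is an assumption not stated in the comment. What the comment does state is that `snapshotNeeded` is also set for a TRUNC sync, while the patch's `syncSnapshot` flag is not.

```java
// Hypothetical illustration only; names are not taken from Learner.java.
final class SyncFlagsSketch {

    /** The three ways a leader can bring a learner up to date. */
    enum SyncMode { DIFF, SNAP, TRUNC }

    /** Assumed rule: a snapshot is taken after SNAP and TRUNC, not after DIFF. */
    static boolean snapshotNeeded(SyncMode mode) {
        return mode != SyncMode.DIFF;
    }

    /** Per the review comment: only a SNAP sync forces the snapshot to disk. */
    static boolean syncSnapshot(SyncMode mode) {
        return mode == SyncMode.SNAP;
    }
}
```

Reusing `snapshotNeeded` in place of `syncSnapshot` would therefore change behavior in exactly one case, TRUNC, where the first flag is true and the second is false.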
[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16124805#comment-16124805 ]

ASF GitHub Bot commented on ZOOKEEPER-2872:
-------------------------------------------

Github user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/333#discussion_r132832602

    --- Diff: src/java/main/org/apache/zookeeper/server/quorum/Learner.java ---
    @@ -364,6 +364,7 @@ protected void syncWithLeader(long newLeaderZxid) throws Exception{
                 readPacket(qp);
                 LinkedList packetsCommitted = new LinkedList();
                 LinkedList packetsNotCommitted = new LinkedList();
    +            boolean syncSnapshot = false;
    --- End diff --

    Another possibility, as I just commented, is to get rid of this variable and always fsync snapshot serialization.
[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16124802#comment-16124802 ]

ASF GitHub Bot commented on ZOOKEEPER-2872:
-------------------------------------------

Github user hanm commented on the issue:

    https://github.com/apache/zookeeper/pull/333

    I am now wondering why we should not fsync when taking a snapshot in all cases. It seems to be a useful property to have for snapshot serialization, and it will make the code simpler. Were there any performance considerations that led to the conclusion of only fsyncing the snapshot when it's a SNAP sync?
[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123755#comment-16123755 ]

Hadoop QA commented on ZOOKEEPER-2872:
--------------------------------------

+1 overall.  GitHub Pull Request Build

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 3.0.1) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/939//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/939//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/939//console

This message is automatically generated.
[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122684#comment-16122684 ]

Hadoop QA commented on ZOOKEEPER-2872:
--------------------------------------

-1 overall.  GitHub Pull Request Build

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 3.0.1) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/938//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/938//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/938//console

This message is automatically generated.
[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122615#comment-16122615 ]

ASF GitHub Bot commented on ZOOKEEPER-2872:
-------------------------------------------

Github user enixon commented on the issue:

    https://github.com/apache/zookeeper/pull/333

    AtomicFileOutputStream performs an fsync when the stream is closed with the following.
    "((FileOutputStream) out).getFD().sync();"
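
The sync-on-close behavior described in this comment can be sketched with a simplified stand-in. This is not ZooKeeper's `AtomicFileOutputStream`; the class `SyncOnCloseStreamSketch`, the `.tmp` suffix, and the rename step are assumptions made for illustration. Only the `((FileOutputStream) out).getFD().sync()` call is quoted from the comment.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Hypothetical stand-in; not ZooKeeper's AtomicFileOutputStream.
final class SyncOnCloseStreamSketch extends FilterOutputStream {
    private final File target;
    private final File tmp;
    private boolean closed;

    SyncOnCloseStreamSketch(File target) throws IOException {
        // All writes go to a temporary sibling file until close() succeeds.
        super(new FileOutputStream(tmpFor(target)));
        this.target = target.getAbsoluteFile();
        this.tmp = tmpFor(target);
    }

    private static File tmpFor(File target) {
        File abs = target.getAbsoluteFile();
        return new File(abs.getParentFile(), abs.getName() + ".tmp");
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len); // avoid FilterOutputStream's byte-at-a-time loop
    }

    @Override
    public void close() throws IOException {
        if (closed) {
            return;
        }
        closed = true;
        try {
            out.flush();
            ((FileOutputStream) out).getFD().sync(); // the call quoted above
        } finally {
            out.close();
        }
        // Reached only if flush, sync and close all succeeded: publish the file.
        Files.move(tmp.toPath(), target.toPath(), StandardCopyOption.ATOMIC_MOVE);
    }
}
```

The design point is ordering: the data is forced to disk before the file becomes visible under its final name, so a reader never finds a final-named file whose contents were still only in the page cache. Whether `ATOMIC_MOVE` replaces an existing target is platform-specific, so the sketch assumes the target does not exist yet.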
[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122093#comment-16122093 ]

Hadoop QA commented on ZOOKEEPER-2872:
--------------------------------------

-1 overall.  GitHub Pull Request Build

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 3.0.1) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/937//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/937//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/937//console

This message is automatically generated.
[jira] [Commented] (ZOOKEEPER-2872) Interrupted snapshot sync causes data loss
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16122076#comment-16122076 ]

ASF GitHub Bot commented on ZOOKEEPER-2872:
-------------------------------------------

GitHub user enixon opened a pull request:

    https://github.com/apache/zookeeper/pull/333

    ZOOKEEPER-2872: Interrupted snapshot sync causes data loss

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/enixon/zookeeper snap-sync

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zookeeper/pull/333.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #333

commit 39bd1a3eb9171a014845fff97648341cbfb40a11
Author: Brian Nixon
Date: 2017-08-01T20:25:51Z

    ZOOKEEPER-2872: Interrupted snapshot sync causes data loss

> Interrupted snapshot sync causes data loss
> ------------------------------------------
>
> Key: ZOOKEEPER-2872
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2872
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.10, 3.5.3, 3.6.0
> Reporter: Brian Nixon
>
> There is a way for observers to permanently lose data from their local data
> tree while remaining members of good standing with the ensemble and
> continuing to serve client traffic when the following chain of events occurs.
> 1. The observer dies in epoch N from machine failure.
> 2. The observer comes back up in epoch N+1 and requests a snapshot sync to
> catch up.
> 3. The machine powers off before the snapshot is synced to disc and after
> some txn's have been logged (depending on the OS, this can happen!).
> 4. The observer comes back a second time and replays its most recent snapshot
> (epoch <= N) as well as the txn logs (epoch N+1).
> 5. A diff sync is requested from the leader and the observer broadcasts
> availability.
> In this scenario, any commits from epoch N that the observer did not receive
> before it died the first time will never be exposed to the observer and no
> part of the ensemble will complain.
> This situation is not unique to observers and can happen to any learner. As a
> simple fix, fsync-ing the snapshots received from the leader will avoid the
> case of missing snapshots causing data loss.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
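
Editorial illustration of the fix proposed in the issue description (fsync-ing a snapshot before the learner proceeds). This is a minimal sketch under stated assumptions, not the code from pull request #333: the class SnapshotFsyncSketch and the helper writeSnapshotDurably are hypothetical names, and the snapshot is modelled as an opaque byte array. Only standard JDK calls are used; FileDescriptor.sync() is the call that asks the OS to push buffered data to the storage device.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SnapshotFsyncSketch {

    /**
     * Hypothetical helper (not a ZooKeeper API): writes the serialized
     * snapshot and forces it to stable storage before returning, so that
     * a later power loss cannot leave newer txn logs on disk without the
     * snapshot they depend on.
     */
    static void writeSnapshotDurably(Path file, byte[] serializedTree) throws IOException {
        try (FileOutputStream out = new FileOutputStream(file.toFile())) {
            out.write(serializedTree);
            out.flush();          // drain any user-space buffering
            out.getFD().sync();   // fsync: block until the OS reports the data is on the device
        }
    }

    public static void main(String[] args) throws IOException {
        Path snap = Files.createTempFile("snapshot-sketch", ".bin");
        byte[] data = "tree-state".getBytes(StandardCharsets.UTF_8);
        writeSnapshotDurably(snap, data);
        System.out.println("bytes written: " + Files.size(snap));
        Files.delete(snap);
    }
}
```

In the scenario from the description, the point of the sync is ordering: the snapshot received in step 2 has to be durable before any epoch N+1 txns are logged, otherwise the replay in step 4 starts from a snapshot that predates the missed commits.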