[GitHub] zookeeper pull request #647: [ZOOKEEPER-3125] Fixing pzxid consistent issue ...

2018-11-11 Thread lvfangmin
Github user lvfangmin closed the pull request at:

https://github.com/apache/zookeeper/pull/647


---


[GitHub] zookeeper pull request #647: ZOOKEEPER-3125: Fixing pzxid consistent issue w...

2018-09-28 Thread lvfangmin
Github user lvfangmin commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/647#discussion_r221391402
  
--- Diff: src/java/main/org/apache/zookeeper/server/DataTree.java ---
@@ -527,6 +527,24 @@ public void deleteNode(String path, long zxid)
 int lastSlash = path.lastIndexOf('/');
 String parentName = path.substring(0, lastSlash);
 String childName = path.substring(lastSlash + 1);
+
+// The child might already be deleted during taking fuzzy snapshot,
+// but we still need to update the pzxid here before throw 
exception
+// for no such child
+DataNode parent = nodes.get(parentName);
+if (parent == null) {
+throw new KeeperException.NoNodeException();
+}
+synchronized (parent) {
+parent.removeChild(childName);
+// Only update pzxid when the zxid is larger than the current 
pzxid,
+// otherwise we might override higher pzxid set by a following 
create 
+// Txn, which could cause the cversion and pzxid inconsistent
+if (zxid > parent.stat.getPzxid()) {
--- End diff --

This is a new bug I found recently, the previous change in ZOOKEEPER-3125 
has a bug which could revert pzxid. Given it's a known issue in that Jira, I 
think it's reasonable to fix it in the same patch here before porting back to 
3.5.


---


[GitHub] zookeeper pull request #647: ZOOKEEPER-3125: Fixing pzxid consistent issue w...

2018-09-28 Thread lvfangmin
Github user lvfangmin commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/647#discussion_r221390760
  
--- Diff: 
src/java/test/org/apache/zookeeper/server/quorum/FuzzySnapshotRelatedTest.java 
---
@@ -162,6 +167,98 @@ public void process(String path) {
 new String(zk[followerA].getData(node2, null, null)));
 }
 
+/**
+ * It's possibel during SNAP sync, the parent is serialized before the
+ * child get deleted during sending the snapshot over.
+ *
+ * In which case, we need to make sure the pzxid get correctly updated
+ * when applying the txns received.
+ */
+@Test
+public void testPZxidUpdatedDuringSnapSyncing() throws Exception {
+LOG.info("Enable force snapshot sync");
+System.setProperty(LearnerHandler.FORCE_SNAP_SYNC, "true");
+
+final String parent = "/testPZxidUpdatedWhenDeletingNonExistNode";
+final String child = parent + "/child";
+createEmptyNode(zk[leaderId], parent);
+createEmptyNode(zk[leaderId], child);
+
+LOG.info("shutdown follower {}", followerA);
+mt[followerA].shutdown();
+QuorumPeerMainTest.waitForOne(zk[followerA], States.CONNECTING);
+
+LOG.info("Set up ZKDatabase to catch the node serializing in 
DataTree");
+addSerializeListener(leaderId, parent, child);
+
+LOG.info("Restart follower A to trigger a SNAP sync with leader");
+mt[followerA].start();
+QuorumPeerMainTest.waitForOne(zk[followerA], States.CONNECTED);
+
+LOG.info("Check and make sure the pzxid of the parent is the same 
" +
+"on leader and follower A");
+compareStat(parent, leaderId, followerA);
+}
+
+/**
+ * It's possible during taking fuzzy snapshot, the parent is serialized
+ * before the child get deleted in the fuzzy range.
+ *
+ * In which case, we need to make sure the pzxid get correctly updated
+ * when replaying the txns.
+ */
+@Test
+public void testPZxidUpdatedWhenLoadingSnapshot() throws Exception {
+
+final String parent = "/testPZxidUpdatedDuringTakingSnapshot";
+final String child = parent + "/child";
+createEmptyNode(zk[followerA], parent);
+createEmptyNode(zk[followerA], child);
+
+LOG.info("Set up ZKDatabase to catch the node serializing in 
DataTree");
+addSerializeListener(followerA, parent, child);
+
+LOG.info("Take snapshot on follower A");
+ZooKeeperServer zkServer = 
mt[followerA].main.quorumPeer.getActiveServer();
+zkServer.takeSnapshot();
+
+LOG.info("Restarting follower A to load snapshot");
+mt[followerA].shutdown();
+mt[followerA].start();
+QuorumPeerMainTest.waitForOne(zk[followerA], States.CONNECTED);
+
+LOG.info("Check and make sure the pzxid of the parent is the same 
" +
+"on leader and follower A");
+compareStat(parent, leaderId, followerA);
+}
+
+private void addSerializeListener(int sid, String parent, String 
child) {
+final ZooKeeper zkClient = zk[followerA];
+CustomDataTree dt =
+(CustomDataTree) 
mt[sid].main.quorumPeer.getZkDb().getDataTree();
+dt.addListener(parent, new NodeSerializeListener() {
+@Override
+public void nodeSerialized(String path) {
+try {
+zkClient.delete(child, -1);
+LOG.info("Deleted the child node after the parent is 
serialized");
+} catch (Exception e) {
+LOG.error("Error when deleting node {}", e);
+}
+}
+});
+}
+
+private void compareStat(String path, int sid, int compareWithSid) 
throws Exception {
+Stat stat1 = new Stat();
+zk[sid].getData(path, null, stat1);
+
+Stat stat2 = new Stat();
+zk[compareWithSid].getData(path, null, stat2);
--- End diff --

I made a comment on that before sending this diff out, I was trying to hear 
the opinion before making the test change here. Seems we agreed on not using 
RetryRule, I'll improve this test class.


---


[GitHub] zookeeper pull request #647: ZOOKEEPER-3125: Fixing pzxid consistent issue w...

2018-09-28 Thread anmolnar
Github user anmolnar commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/647#discussion_r221201278
  
--- Diff: 
src/java/test/org/apache/zookeeper/server/quorum/FuzzySnapshotRelatedTest.java 
---
@@ -162,6 +167,98 @@ public void process(String path) {
 new String(zk[followerA].getData(node2, null, null)));
 }
 
+/**
+ * It's possibel during SNAP sync, the parent is serialized before the
+ * child get deleted during sending the snapshot over.
+ *
+ * In which case, we need to make sure the pzxid get correctly updated
+ * when applying the txns received.
+ */
+@Test
+public void testPZxidUpdatedDuringSnapSyncing() throws Exception {
+LOG.info("Enable force snapshot sync");
+System.setProperty(LearnerHandler.FORCE_SNAP_SYNC, "true");
+
+final String parent = "/testPZxidUpdatedWhenDeletingNonExistNode";
+final String child = parent + "/child";
+createEmptyNode(zk[leaderId], parent);
+createEmptyNode(zk[leaderId], child);
+
+LOG.info("shutdown follower {}", followerA);
+mt[followerA].shutdown();
+QuorumPeerMainTest.waitForOne(zk[followerA], States.CONNECTING);
+
+LOG.info("Set up ZKDatabase to catch the node serializing in 
DataTree");
+addSerializeListener(leaderId, parent, child);
+
+LOG.info("Restart follower A to trigger a SNAP sync with leader");
+mt[followerA].start();
+QuorumPeerMainTest.waitForOne(zk[followerA], States.CONNECTED);
+
+LOG.info("Check and make sure the pzxid of the parent is the same 
" +
+"on leader and follower A");
+compareStat(parent, leaderId, followerA);
+}
+
+/**
+ * It's possible during taking fuzzy snapshot, the parent is serialized
+ * before the child get deleted in the fuzzy range.
+ *
+ * In which case, we need to make sure the pzxid get correctly updated
+ * when replaying the txns.
+ */
+@Test
+public void testPZxidUpdatedWhenLoadingSnapshot() throws Exception {
+
+final String parent = "/testPZxidUpdatedDuringTakingSnapshot";
+final String child = parent + "/child";
+createEmptyNode(zk[followerA], parent);
+createEmptyNode(zk[followerA], child);
+
+LOG.info("Set up ZKDatabase to catch the node serializing in 
DataTree");
+addSerializeListener(followerA, parent, child);
+
+LOG.info("Take snapshot on follower A");
+ZooKeeperServer zkServer = 
mt[followerA].main.quorumPeer.getActiveServer();
+zkServer.takeSnapshot();
+
+LOG.info("Restarting follower A to load snapshot");
+mt[followerA].shutdown();
+mt[followerA].start();
+QuorumPeerMainTest.waitForOne(zk[followerA], States.CONNECTED);
+
+LOG.info("Check and make sure the pzxid of the parent is the same 
" +
+"on leader and follower A");
+compareStat(parent, leaderId, followerA);
+}
+
+private void addSerializeListener(int sid, String parent, String 
child) {
+final ZooKeeper zkClient = zk[followerA];
+CustomDataTree dt =
+(CustomDataTree) 
mt[sid].main.quorumPeer.getZkDb().getDataTree();
+dt.addListener(parent, new NodeSerializeListener() {
+@Override
+public void nodeSerialized(String path) {
+try {
+zkClient.delete(child, -1);
+LOG.info("Deleted the child node after the parent is 
serialized");
+} catch (Exception e) {
+LOG.error("Error when deleting node {}", e);
+}
+}
+});
+}
+
+private void compareStat(String path, int sid, int compareWithSid) 
throws Exception {
+Stat stat1 = new Stat();
+zk[sid].getData(path, null, stat1);
+
+Stat stat2 = new Stat();
+zk[compareWithSid].getData(path, null, stat2);
--- End diff --

@hanm mentioned a flaky test issue related to this part:
https://github.com/apache/zookeeper/pull/605/files#r219628119
I don't see answer in the original PR, how are you dealing with the problem?


---


[GitHub] zookeeper pull request #647: ZOOKEEPER-3125: Fixing pzxid consistent issue w...

2018-09-28 Thread anmolnar
Github user anmolnar commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/647#discussion_r221200308
  
--- Diff: src/java/main/org/apache/zookeeper/server/DataTree.java ---
@@ -527,6 +527,24 @@ public void deleteNode(String path, long zxid)
 int lastSlash = path.lastIndexOf('/');
 String parentName = path.substring(0, lastSlash);
 String childName = path.substring(lastSlash + 1);
+
+// The child might already be deleted during taking fuzzy snapshot,
+// but we still need to update the pzxid here before throw 
exception
+// for no such child
+DataNode parent = nodes.get(parentName);
+if (parent == null) {
+throw new KeeperException.NoNodeException();
+}
+synchronized (parent) {
+parent.removeChild(childName);
+// Only update pzxid when the zxid is larger than the current 
pzxid,
+// otherwise we might override higher pzxid set by a following 
create 
+// Txn, which could cause the cversion and pzxid inconsistent
+if (zxid > parent.stat.getPzxid()) {
--- End diff --

Looks like this is the fix you're talking about and which will be fixed 
separately in 3.6. Is that correct?


---


[GitHub] zookeeper pull request #647: ZOOKEEPER-3125: Fixing pzxid consistent issue w...

2018-09-28 Thread anmolnar
Github user anmolnar commented on a diff in the pull request:

https://github.com/apache/zookeeper/pull/647#discussion_r221200641
  
--- Diff: src/java/test/org/apache/zookeeper/server/DataTreeTest.java ---
@@ -153,6 +153,31 @@ public void testIncrementCversion() throws Exception {
 (newCversion == prevCversion + 1 && newPzxid == prevPzxid 
+ 1));
 }
 
+@Test
+public void testPzxidUpdatedWhenDeletingNonExistNode() throws 
Exception {
--- End diff --

This new test is also part of the fix.


---


[GitHub] zookeeper pull request #647: ZOOKEEPER-3125: Fixing pzxid consistent issue w...

2018-09-28 Thread lvfangmin
GitHub user lvfangmin opened a pull request:

https://github.com/apache/zookeeper/pull/647

ZOOKEEPER-3125: Fixing pzxid consistent issue when replaying a txn for a 
deleted node

Port this to 3.5, this also fixed the issue where the pzxid might be 
overwritten by a smaller one when replaying deleteNode txn, which will be fixed 
in a separate PR for 3.6.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lvfangmin/zookeeper ZOOKEEPER-3125-3.5

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/zookeeper/pull/647.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #647


commit 90ef67cf79c0eaec4b8b3389a8711cce8ba25255
Author: Fangmin Lyu 
Date:   2018-09-12T11:29:13Z

ZOOKEEPER-3125: Fixing pzxid consistent issue when replaying a txn for a 
deleted node

Author: Fangmin Lyu 

Reviewers: br...@apache.org, an...@apache.org

Closes #605 from lvfangmin/ZOOKEEPER-3125




---