[jira] [Created] (ZOOKEEPER-3607) Potential data inconsistency due to the inconsistency between ZKDatabase.committedLog and dataTree in Trunc sync.
Jiafu Jiang created ZOOKEEPER-3607: -- Summary: Potential data inconsistency due to the inconsistency between ZKDatabase.committedLog and dataTree in Trunc sync. Key: ZOOKEEPER-3607 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3607 Project: ZooKeeper Issue Type: Bug Components: quorum Affects Versions: 3.4.14 Reporter: Jiafu Jiang I will describe the problem with a detailed example.
1. Suppose we have three zk servers: zk1, zk2, and zk3. zk1 and zk2 are online, zk3 is offline, and zk1 is the leader.
2. In a TRUNC sync, zk1 sends a TRUNC request to zk2, then sends the remaining proposals from the committedLog. *When the follower zk2 receives these proposals, it applies them directly to its dataTree, but not to its committedLog.*
3. After the data sync phase, zk1 may continue to send zk2 more committed proposals, and these are applied to both the dataTree and the committedLog of zk2.
4. Then zk1 fails, zk3 restarts successfully, and zk2 becomes the leader.
5. The leader zk2 sends a TRUNC request to zk3, then the remaining proposals from its committedLog. But the proposals zk2 received from zk1 during the earlier TRUNC sync (as described above) are not in its committedLog, so they are never sent to zk3.
6. Now zk2 and zk3 are inconsistent: some data exists in zk2's dataTree but not in zk3's dataTree.
-- This message was sent by Atlassian Jira (v8.3.4#803005)
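The divergence in steps 2–5 can be sketched with a toy model. All class, field, and method names below are illustrative stand-ins for the real ZKDatabase/sync code, not the actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the reported scenario; names are illustrative only.
public class TruncSyncModel {
    static class Peer {
        final List<Long> dataTree = new ArrayList<>();      // applied transactions
        final List<Long> committedLog = new ArrayList<>();  // recent proposals kept for syncing others

        // Buggy sync path (step 2): proposals received during TRUNC sync
        // are applied to the dataTree only.
        void applySyncProposalBuggy(long zxid) {
            dataTree.add(zxid);
        }

        // Candidate fix: keep committedLog consistent with the dataTree.
        void applySyncProposalFixed(long zxid) {
            dataTree.add(zxid);
            committedLog.add(zxid);
        }

        // Normal commit after the sync phase (step 3): both are updated.
        void commit(long zxid) {
            dataTree.add(zxid);
            committedLog.add(zxid);
        }
    }

    // Transactions a stale follower would never receive if this peer,
    // as the new leader, syncs it purely from committedLog (step 5).
    static List<Long> missingTxns(Peer newLeader) {
        List<Long> missing = new ArrayList<>(newLeader.dataTree);
        missing.removeAll(newLeader.committedLog);
        return missing;
    }

    public static void main(String[] args) {
        Peer zk2 = new Peer();
        zk2.applySyncProposalBuggy(1L); // received from zk1 during TRUNC sync
        zk2.commit(2L);                 // committed normally after sync
        System.out.println("missing with bug: " + missingTxns(zk2)); // [1]

        Peer zk2Fixed = new Peer();
        zk2Fixed.applySyncProposalFixed(1L);
        zk2Fixed.commit(2L);
        System.out.println("missing with fix: " + missingTxns(zk2Fixed)); // []
    }
}
```

With the buggy path, zxid 1 is in zk2's dataTree but not its committedLog, so a follower synced from the committedLog alone never sees it; keeping both structures in step with each other closes the gap.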
[jira] [Created] (ZOOKEEPER-3393) Read-only file system may make the whole ZooKeeper cluster unavailable.
Jiafu Jiang created ZOOKEEPER-3393: -- Summary: Read-only file system may make the whole ZooKeeper cluster unavailable. Key: ZOOKEEPER-3393 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3393 Project: ZooKeeper Issue Type: Bug Components: leaderElection, server Affects Versions: 3.4.14, 3.4.12 Reporter: Jiafu Jiang Say we have 3 nodes: zk1, zk2, and zk3; zk3 is the leader. If the file system holding the leader's ZooKeeper data directory becomes read-only due to some hardware error, the leader will exit and a new election begins. But the election can keep looping, because the new leader may be zk3 again, and zk3 will fail to write its epoch to disk on the read-only file system. Since we have 3 nodes, if only one of them has a problem, should the ZooKeeper cluster remain available? If the answer is yes, then we ought to fix this problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
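One direction for a fix would be a pre-election probe of the data directory: a server that cannot actually write there drops out of the election instead of winning leadership again and again. The sketch below is hypothetical, not existing ZooKeeper code; it uses a real create-and-delete write because a read-only mount is often only detected on an actual write:

```java
import java.io.File;
import java.io.IOException;

// Hypothetical pre-election probe for the scenario above.
public class DataDirProbe {
    // Returns true if we can actually create a file in dataDir.
    static boolean isWritable(File dataDir) {
        try {
            File probe = File.createTempFile("zk-probe", ".tmp", dataDir);
            probe.delete(); // clean up; we only cared that the create succeeded
            return true;
        } catch (IOException e) {
            return false; // e.g. EROFS on a read-only filesystem
        }
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        if (!isWritable(dir)) {
            // A server that cannot persist its epoch should not volunteer
            // for leadership.
            System.err.println("data dir not writable; refusing to join election");
        }
    }
}
```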
[jira] [Updated] (ZOOKEEPER-3266) ZooKeeper Java client blocks for a very long time.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiafu Jiang updated ZOOKEEPER-3266: --- Description: I found that the ZooKeeper Java client blocked; the related call stack is shown below:
"Election thread-20" #20 prio=5 os_prio=0 tid=0x7f7deeadfd80 nid=0x5ec3 in Object.wait() [0x7f7ddd5d8000]
   java.lang.Thread.State: WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   at java.lang.Object.wait(Object.java:502)
   at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1411)
   - locked <0xe04b63b0> (a org.apache.zookeeper.ClientCnxn$Packet)
   at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1177)
   at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1210)
   at com.sugon.parastor.zookeeper.ZooKeeperClient.exists(ZooKeeperClient.java:643)
I also found that the blocked process did not have a SendThread thread. A normal process using the ZooKeeper Java client should have one, like below:
"Thread-0-SendThread(ofs_zk1:2181)" #23 daemon prio=5 os_prio=0 tid=0x7f8c540379c0 nid=0x739 runnable [0x7f8c5ad71000]
   java.lang.Thread.State: RUNNABLE
   at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
   at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
   at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
   at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
   - locked <0xe00287a8> (a sun.nio.ch.Util$3)
   - locked <0xe0028798> (a java.util.Collections$UnmodifiableSet)
   - locked <0xe0028750> (a sun.nio.ch.EPollSelectorImpl)
   at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
   at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:349)
   at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1145)
So, could the missing SendThread cause the exists method to block? I'm not sure.
was: I found that ZooKeeper java client blocked, and the related call stack was showing below: "Election thread-20" #20 prio=5 os_prio=0 tid=0x7f7deeadfd80 nid=0x5ec3 in Object.wait() [0x7f7ddd5d8000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:502) at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1411) - locked <0xe04b63b0> (a org.apache.zookeeper.ClientCnxn$Packet) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1177) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1210) at com.sugon.parastor.zookeeper.ZooKeeperClient.exists(ZooKeeperClient.java:643) And I also found that the block process did not have the SendThread. It seems like a normal process that have ZooKeeper java client should have a SendThread, like below: "Thread-0-SendThread(ofs_zk1:2181)" #23 daemon prio=5 os_prio=0 tid=0x7f8c540379c0 nid=0x739 runnable [0x7f8c5ad71000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86) - locked <0xe00287a8> (a sun.nio.ch.Util$3) - locked <0xe0028798> (a java.util.Collections$UnmodifiableSet) - locked <0xe0028750> (a sun.nio.ch.EPollSelectorImpl) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:349) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1145) So, will the missing of the SendThread cause the blocking of exist method?? I'm not sure. > ZooKeeper Java client blocks for a very long time. 
> -- > > Key: ZOOKEEPER-3266 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3266 > Project: ZooKeeper > Issue Type: Bug > Components: java client > Affects Versions: 3.4.13 > Reporter: Jiafu Jiang > Priority: Major > > I found that the ZooKeeper Java client blocked; the related call stack is shown below: > "Election thread-20" #20 prio=5 os_prio=0 tid=0x7f7deeadfd80 nid=0x5ec3 in Object.wait() [0x7f7ddd5d8000] > java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:502) > at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1411) > - locked <0xe04b63b0> (a org.apache.zookeeper.ClientCnxn$Packet) > at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1177) > at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1210) > at com.sugon.parastor.zookeeper.ZooKeeperClient.exists(ZooKeeperClient.java:643) >
[jira] [Created] (ZOOKEEPER-3266) ZooKeeper Java client blocks for a very long time.
Jiafu Jiang created ZOOKEEPER-3266: -- Summary: ZooKeeper Java client blocks for a very long time. Key: ZOOKEEPER-3266 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3266 Project: ZooKeeper Issue Type: Bug Components: java client Affects Versions: 3.4.13 Reporter: Jiafu Jiang I found that the ZooKeeper Java client blocked; the related call stack is shown below:
"Election thread-20" #20 prio=5 os_prio=0 tid=0x7f7deeadfd80 nid=0x5ec3 in Object.wait() [0x7f7ddd5d8000]
   java.lang.Thread.State: WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   at java.lang.Object.wait(Object.java:502)
   at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1411)
   - locked <0xe04b63b0> (a org.apache.zookeeper.ClientCnxn$Packet)
   at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1177)
   at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1210)
   at com.sugon.parastor.zookeeper.ZooKeeperClient.exists(ZooKeeperClient.java:643)
I also found that the blocked process did not have a SendThread. A normal process using the ZooKeeper Java client should have one, like below:
"Thread-0-SendThread(ofs_zk1:2181)" #23 daemon prio=5 os_prio=0 tid=0x7f8c540379c0 nid=0x739 runnable [0x7f8c5ad71000]
   java.lang.Thread.State: RUNNABLE
   at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
   at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
   at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
   at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
   - locked <0xe00287a8> (a sun.nio.ch.Util$3)
   - locked <0xe0028798> (a java.util.Collections$UnmodifiableSet)
   - locked <0xe0028750> (a sun.nio.ch.EPollSelectorImpl)
   at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
   at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:349)
   at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1145)
So, could the missing SendThread cause the exists method to block? I'm not sure.
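Whatever killed the SendThread, a defensive client-side pattern is to bound every potentially blocking ZooKeeper call with a timeout, so a dead I/O thread cannot hang the caller forever. This is a generic sketch, not ZooKeeper code; `blockingCall` stands in for a call like `zk.exists(...)`:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Bound a blocking call with a client-side timeout.
public class BoundedCall {
    static <T> T callWithTimeout(Callable<T> call, long timeoutMs) throws Exception {
        ExecutorService ex = Executors.newSingleThreadExecutor();
        try {
            Future<T> f = ex.submit(call);
            try {
                return f.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException te) {
                f.cancel(true); // interrupt the stuck call
                throw te;
            } catch (ExecutionException ee) {
                throw new Exception(ee.getCause());
            }
        } finally {
            ex.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulated call that never returns, like the reported exists().
        Callable<Boolean> blockingCall = () -> { Thread.sleep(60_000); return true; };
        try {
            callWithTimeout(blockingCall, 200);
        } catch (TimeoutException te) {
            System.out.println("call timed out; recreate the ZooKeeper handle");
        }
    }
}
```

On timeout the caller can close and recreate the ZooKeeper handle rather than wait indefinitely on a client whose SendThread is gone.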
[jira] [Updated] (ZOOKEEPER-3231) Purge task may lose data when we have many invalid snapshots.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiafu Jiang updated ZOOKEEPER-3231: --- Description: I read the ZooKeeper source code, and I find that the purge task uses FileTxnSnapLog#findNRecentSnapshots to find snapshots, but this method does not check whether the snapshots are valid. Consider a worst case: a ZooKeeper server may have many invalid snapshots, and when a purge task begins, it will use the zxid in the last snapshot's name to purge old snapshots and transaction logs; then we may lose data. I think we should use FileSnap#findNValidSnapshots(int) instead of FileSnap#findNRecentSnapshots in FileTxnSnapLog#findNRecentSnapshots, but I am not sure. was: I read the ZooKeeper source code, and I find that the purge task uses FileTxnSnapLog#findNRecentSnapshots to find snapshots, but this method does not check whether the snapshots are valid. Consider a worst case: a ZooKeeper server may have many invalid snapshots, and when a purge task begins, it will use the zxid in the last snapshot's name to purge old snapshots and transaction logs; then we may lose data. I think we should use FileSnap#findNValidSnapshots(int) instead of FileSnap#findNRecentSnapshots in FileTxnSnapLog#findNRecentSnapshots. I am not sure. > Purge task may lose data when we have many invalid snapshots. > -- > > Key: ZOOKEEPER-3231 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3231 > Project: ZooKeeper > Issue Type: Bug > Components: server > Affects Versions: 3.5.4, 3.4.13 > Reporter: Jiafu Jiang > Priority: Major > > I read the ZooKeeper source code, and I find that the purge task uses FileTxnSnapLog#findNRecentSnapshots to find snapshots, but this method does not check whether the snapshots are valid. > Consider a worst case: a ZooKeeper server may have many invalid snapshots, and when a purge task begins, it will use the zxid in the last snapshot's name to purge old snapshots and transaction logs; then we may lose data. > I think we should use FileSnap#findNValidSnapshots(int) instead of FileSnap#findNRecentSnapshots in FileTxnSnapLog#findNRecentSnapshots, but I am not sure.
[jira] [Updated] (ZOOKEEPER-3231) Purge task may lose data when we have many invalid snapshots.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiafu Jiang updated ZOOKEEPER-3231: --- Summary: Purge task may lose data when we have many invalid snapshots. (was: Purge task may lose data when we have many invalid snapshot files.) > Purge task may lose data when we have many invalid snapshots. > -- > > Key: ZOOKEEPER-3231 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3231 > Project: ZooKeeper > Issue Type: Bug > Components: server > Affects Versions: 3.5.4, 3.4.13 > Reporter: Jiafu Jiang > Priority: Major > > I read the ZooKeeper source code, and I find that the purge task uses FileTxnSnapLog#findNRecentSnapshots to find snapshots, but this method does not check whether the snapshots are valid. > Consider a worst case: a ZooKeeper server may have many invalid snapshots, and when a purge task begins, it will use the zxid in the last snapshot's name to purge old snapshots and transaction logs; then we may lose data. > I think we should use FileSnap#findNValidSnapshots(int) instead of FileSnap#findNRecentSnapshots in FileTxnSnapLog#findNRecentSnapshots. I am not sure.
[jira] [Updated] (ZOOKEEPER-3231) Purge task may lose data when we have many invalid snapshot files.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiafu Jiang updated ZOOKEEPER-3231: --- Description: I read the ZooKeeper source code, and I find that the purge task uses FileTxnSnapLog#findNRecentSnapshots to find snapshots, but this method does not check whether the snapshots are valid. Consider a worst case: a ZooKeeper server may have many invalid snapshots, and when a purge task begins, it will use the zxid in the last snapshot file name to purge old snapshots or transaction logs; then we may lose data. I think we should use FileSnap#findNValidSnapshots(int) instead of FileSnap#findNRecentSnapshots in FileTxnSnapLog#findNRecentSnapshots. I am not sure. was: I read the ZooKeeper source code, and I find that the purge task uses FileTxnSnapLog#findNRecentSnapshots to find snapshots, but this method does not check whether the snapshots are valid. Consider a worst case: a ZooKeeper server may have many invalid snapshots, and when a purge task begins, is will use the zxid in the last snapshot file name to purge old snapshots or transaction logs; then we may lose data. I think we should use FileSnap#findNValidSnapshots(int) instead of FileSnap#findNRecentSnapshots in FileTxnSnapLog#findNRecentSnapshots. I am not sure. > Purge task may lose data when we have many invalid snapshot files. > --- > > Key: ZOOKEEPER-3231 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3231 > Project: ZooKeeper > Issue Type: Bug > Components: server > Affects Versions: 3.5.4, 3.4.13 > Reporter: Jiafu Jiang > Priority: Major > > I read the ZooKeeper source code, and I find that the purge task uses FileTxnSnapLog#findNRecentSnapshots to find snapshots, but this method does not check whether the snapshots are valid. > Consider a worst case: a ZooKeeper server may have many invalid snapshots, and when a purge task begins, it will use the zxid in the last snapshot file name to purge old snapshots or transaction logs; then we may lose data. > I think we should use FileSnap#findNValidSnapshots(int) instead of FileSnap#findNRecentSnapshots in FileTxnSnapLog#findNRecentSnapshots. I am not sure.
[jira] [Updated] (ZOOKEEPER-3231) Purge task may lose data when we have many invalid snapshot files.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiafu Jiang updated ZOOKEEPER-3231: --- Description: I read the ZooKeeper source code, and I find that the purge task uses FileTxnSnapLog#findNRecentSnapshots to find snapshots, but this method does not check whether the snapshots are valid. Consider a worst case: a ZooKeeper server may have many invalid snapshots, and when a purge task begins, it will use the zxid in the last snapshot's name to purge old snapshots and transaction logs; then we may lose data. I think we should use FileSnap#findNValidSnapshots(int) instead of FileSnap#findNRecentSnapshots in FileTxnSnapLog#findNRecentSnapshots. I am not sure. was: I read the ZooKeeper source code, and I find that the purge task uses FileTxnSnapLog#findNRecentSnapshots to find snapshots, but this method does not check whether the snapshots are valid. Consider a worst case: a ZooKeeper server may have many invalid snapshots, and when a purge task begins, it will use the zxid in the last snapshot file name to purge old snapshots or transaction logs; then we may lose data. I think we should use FileSnap#findNValidSnapshots(int) instead of FileSnap#findNRecentSnapshots in FileTxnSnapLog#findNRecentSnapshots. I am not sure. > Purge task may lose data when we have many invalid snapshot files. > --- > > Key: ZOOKEEPER-3231 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3231 > Project: ZooKeeper > Issue Type: Bug > Components: server > Affects Versions: 3.5.4, 3.4.13 > Reporter: Jiafu Jiang > Priority: Major > > I read the ZooKeeper source code, and I find that the purge task uses FileTxnSnapLog#findNRecentSnapshots to find snapshots, but this method does not check whether the snapshots are valid. > Consider a worst case: a ZooKeeper server may have many invalid snapshots, and when a purge task begins, it will use the zxid in the last snapshot's name to purge old snapshots and transaction logs; then we may lose data. > I think we should use FileSnap#findNValidSnapshots(int) instead of FileSnap#findNRecentSnapshots in FileTxnSnapLog#findNRecentSnapshots. I am not sure.
[jira] [Created] (ZOOKEEPER-3231) Purge task may lose data when we have many invalid snapshot files.
Jiafu Jiang created ZOOKEEPER-3231: -- Summary: Purge task may lose data when we have many invalid snapshot files. Key: ZOOKEEPER-3231 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3231 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.4.13, 3.5.4 Reporter: Jiafu Jiang I read the ZooKeeper source code, and I find that the purge task uses FileTxnSnapLog#findNRecentSnapshots to find snapshots, but this method does not check whether the snapshots are valid. Consider a worst case: a ZooKeeper server may have many invalid snapshots, and when a purge task begins, it will use the zxid in the last snapshot file name to purge old snapshots or transaction logs; then we may lose data. I think we should use FileSnap#findNValidSnapshots(int) instead of FileSnap#findNRecentSnapshots in FileTxnSnapLog#findNRecentSnapshots. I am not sure.
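The proposed fix amounts to filtering the candidate snapshots through a validity check before picking the purge point. The real FileSnap#findNValidSnapshots validates the snapshot header; the sketch below is only a toy version that rejects empty files, with illustrative names and signatures:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy version of "find N valid snapshots": filter before choosing
// what to keep, so an invalid (here: empty) snapshot can never become
// the purge reference point.
public class SnapshotFilter {
    static boolean looksValid(File snap) {
        return snap.length() > 0; // a 0-byte snapshot is certainly unusable
    }

    // Up to n most recent valid snapshots, newest first, assuming file
    // names of the form snapshot.<zxid> so that name order is zxid order.
    static List<File> findNValidSnapshots(List<File> snaps, int n) {
        List<File> valid = new ArrayList<>();
        snaps.stream()
             .filter(SnapshotFilter::looksValid)
             .sorted(Comparator.comparing(File::getName).reversed())
             .limit(n)
             .forEach(valid::add);
        return valid;
    }
}
```

With this filter, a directory full of truncated snapshots cannot cause the purge task to delete the transaction logs that are still the only valid copy of the data.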
[jira] [Commented] (ZOOKEEPER-3220) The snapshot is not saved to disk and may cause data inconsistency.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729312#comment-16729312 ] Jiafu Jiang commented on ZOOKEEPER-3220: [~nixon] Thanks very much! > The snapshot is not saved to disk and may cause data inconsistency. > --- > > Key: ZOOKEEPER-3220 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3220 > Project: ZooKeeper > Issue Type: Bug > Components: server > Affects Versions: 3.4.12, 3.4.13 > Reporter: Jiafu Jiang > Priority: Critical > > We know that the ZooKeeper server calls fsync to make sure that log data has been successfully saved to disk, but it does not call fsync to make sure that a snapshot has been successfully saved, which may cause potential problems: closing a file descriptor does not guarantee that the data has been written to disk (see [http://man7.org/linux/man-pages/man2/close.2.html#notes] for details). > > If the snapshot is not successfully saved to disk, it may lead to data inconsistency. Here is my example, which is also a real problem I have met. > 1. I deployed a 3-node ZooKeeper cluster: zk1, zk2, and zk3; zk2 was the leader. > 2. Both zk1 and zk2 had the log records log1 ~ logX, where X was the zxid. > 3. The machine of zk1 restarted, and during the reboot, log(X+1) ~ logY were saved to the log files of both zk2 (leader) and zk3 (follower). > 4. After zk1 restarted successfully, it found itself to be a follower, and it began to synchronize data with the leader. The leader sent a snapshot (records log1 ~ logY) to zk1; zk1 then saved the snapshot to local disk by calling the method ZooKeeperServer.takeSnapshot. But unfortunately, when the method returned, the snapshot data was not yet on disk. In fact the snapshot file was created, but its size was 0. > 5. zk1 finished the synchronization and began to accept new requests from the leader. Say log records log(Y+1) ~ logZ were accepted by zk1 and saved to its log file. With fsync, zk1 could make sure the log data was not lost. > 6. zk1 restarted again. Since the snapshot's size was 0, it was not used, so zk1 recovered from the log files alone. But the records log(X+1) ~ logY were lost!
[jira] [Commented] (ZOOKEEPER-3220) The snapshot is not saved to disk and may cause data inconsistency.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729308#comment-16729308 ] Jiafu Jiang commented on ZOOKEEPER-3220: [~maoling] "Why did this situation happen? Is the disk full?" No, but the machine restarted. "Do you see some logs about *FileTxnSnapLog#save* at that time?" There was no error log; in fact, during the machine reboot, some of the follower's log was missing. But from the leader's log, the follower had received a snapshot and had begun to receive further transaction logs, so *FileTxnSnapLog#save on the follower must have succeeded, yet the data was not on disk!* *2. Even this situation, where the size of the snapshot is 0, could not cause data inconsistency.* Yes, I know. ZooKeeper recovers its data from both logs and snapshots. But if a ZooKeeper follower believes a snapshot is saved, it believes that all the data in the snapshot is on disk (when in fact it may not be), and it will begin to receive the logs that come after the snapshot. If the snapshot is invalid, the ZooKeeper server will recover data from the logs only, and some data will be missing, because that data is saved only in the snapshot. > The snapshot is not saved to disk and may cause data inconsistency. > --- > > Key: ZOOKEEPER-3220 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3220 > Project: ZooKeeper > Issue Type: Bug > Components: server > Affects Versions: 3.4.12, 3.4.13 > Reporter: Jiafu Jiang > Priority: Critical > > We know that the ZooKeeper server calls fsync to make sure that log data has been successfully saved to disk, but it does not call fsync to make sure that a snapshot has been successfully saved, which may cause potential problems: closing a file descriptor does not guarantee that the data has been written to disk (see [http://man7.org/linux/man-pages/man2/close.2.html#notes] for details). > > If the snapshot is not successfully saved to disk, it may lead to data inconsistency. Here is my example, which is also a real problem I have met. > 1. I deployed a 3-node ZooKeeper cluster: zk1, zk2, and zk3; zk2 was the leader. > 2. Both zk1 and zk2 had the log records log1 ~ logX, where X was the zxid. > 3. The machine of zk1 restarted, and during the reboot, log(X+1) ~ logY were saved to the log files of both zk2 (leader) and zk3 (follower). > 4. After zk1 restarted successfully, it found itself to be a follower, and it began to synchronize data with the leader. The leader sent a snapshot (records log1 ~ logY) to zk1; zk1 then saved the snapshot to local disk by calling the method ZooKeeperServer.takeSnapshot. But unfortunately, when the method returned, the snapshot data was not yet on disk. In fact the snapshot file was created, but its size was 0. > 5. zk1 finished the synchronization and began to accept new requests from the leader. Say log records log(Y+1) ~ logZ were accepted by zk1 and saved to its log file. With fsync, zk1 could make sure the log data was not lost. > 6. zk1 restarted again. Since the snapshot's size was 0, it was not used, so zk1 recovered from the log files alone. But the records log(X+1) ~ logY were lost!
[jira] [Commented] (ZOOKEEPER-3220) The snapshot is not saved to disk and may cause data inconsistency.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728572#comment-16728572 ] Jiafu Jiang commented on ZOOKEEPER-3220: In my environment, the save method returned successfully, which means no exception was thrown. But the data was not on disk! That is the problem I want to report! And yes, the snapshot with size 0 was invalid and was skipped when the ZooKeeper server restarted again. > The snapshot is not saved to disk and may cause data inconsistency. > --- > > Key: ZOOKEEPER-3220 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3220 > Project: ZooKeeper > Issue Type: Bug > Components: server > Affects Versions: 3.4.12, 3.4.13 > Reporter: Jiafu Jiang > Priority: Critical > > We know that the ZooKeeper server calls fsync to make sure that log data has been successfully saved to disk, but it does not call fsync to make sure that a snapshot has been successfully saved, which may cause potential problems: closing a file descriptor does not guarantee that the data has been written to disk (see [http://man7.org/linux/man-pages/man2/close.2.html#notes] for details). > > If the snapshot is not successfully saved to disk, it may lead to data inconsistency. Here is my example, which is also a real problem I have met. > 1. I deployed a 3-node ZooKeeper cluster: zk1, zk2, and zk3; zk2 was the leader. > 2. Both zk1 and zk2 had the log records log1 ~ logX, where X was the zxid. > 3. The machine of zk1 restarted, and during the reboot, log(X+1) ~ logY were saved to the log files of both zk2 (leader) and zk3 (follower). > 4. After zk1 restarted successfully, it found itself to be a follower, and it began to synchronize data with the leader. The leader sent a snapshot (records log1 ~ logY) to zk1; zk1 then saved the snapshot to local disk by calling the method ZooKeeperServer.takeSnapshot. But unfortunately, when the method returned, the snapshot data was not yet on disk. In fact the snapshot file was created, but its size was 0. > 5. zk1 finished the synchronization and began to accept new requests from the leader. Say log records log(Y+1) ~ logZ were accepted by zk1 and saved to its log file. With fsync, zk1 could make sure the log data was not lost. > 6. zk1 restarted again. Since the snapshot's size was 0, it was not used, so zk1 recovered from the log files alone. But the records log(X+1) ~ logY were lost!
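The durability step the report argues for looks like the following in Java: force the written snapshot to stable storage with fsync (FileDescriptor.sync()) before trusting it, since close() alone does not guarantee the bytes reached the disk. This is a generic sketch, not the actual ZooKeeperServer.takeSnapshot code:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

// Write a file and only return once the kernel has persisted it.
public class DurableWrite {
    static void writeDurably(File f, byte[] data) throws IOException {
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write(data);
            out.flush();
            out.getFD().sync(); // fsync: blocks until the data is on stable storage
        }
    }

    public static void main(String[] args) throws IOException {
        File snap = File.createTempFile("snapshot", ".bin");
        writeDurably(snap, new byte[]{1, 2, 3});
        System.out.println("persisted " + snap.length() + " bytes");
    }
}
```

Had takeSnapshot synced the snapshot this way before the follower acknowledged the sync, the 0-byte snapshot in step 4 could not have been mistaken for a durable one.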
[jira] [Updated] (ZOOKEEPER-3220) The snapshot is not saved to disk and may cause data inconsistency.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiafu Jiang updated ZOOKEEPER-3220: --- Description: We know that the ZooKeeper server calls fsync to make sure that log data has been durably saved to disk. But it does not call fsync to make sure that a snapshot has been durably saved, which may cause problems, since closing a file descriptor does not guarantee that its data has reached disk; see [http://man7.org/linux/man-pages/man2/close.2.html#notes] for details. If a snapshot is not durably saved, data inconsistency can follow. Here is an example, which is also a real problem I have met. 1. I deployed a 3-node ZooKeeper cluster: zk1, zk2, and zk3; zk2 was the leader. 2. Both zk1 and zk2 had the log records log1 ~ logX, where X is the zxid. 3. The machine of zk1 restarted, and during the reboot, log(X+1) ~ logY were saved to the log files of both zk2 (leader) and zk3 (follower). 4. After zk1 restarted successfully, it found itself to be a follower and began to synchronize data with the leader. The leader sent a snapshot (records log1 ~ logY) to zk1, and zk1 saved the snapshot to local disk by calling ZooKeeperServer.takeSnapshot. Unfortunately, when the method returned, the snapshot data had not yet reached disk: the snapshot file had been created, but its size was 0. 5. zk1 finished the synchronization and began to accept new requests from the leader. Say log records log(Y+1) ~ logZ were accepted by zk1 and saved to its log file; thanks to fsync, zk1 could make sure this log data was not lost. 6. zk1 restarted again. Since the snapshot's size was 0, it was not used, so zk1 recovered from its log files alone. But the records log(X+1) ~ logY were lost!
> The snapshot is not saved to disk and may cause data inconsistency.
> ---
>
> Key: ZOOKEEPER-3220
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3220
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.12, 3.4.13
> Reporter: Jiafu Jiang
> Priority: Critical
[jira] [Created] (ZOOKEEPER-3220) Snapshot is not written to disk and cause data inconsistency.
Jiafu Jiang created ZOOKEEPER-3220: -- Summary: Snapshot is not written to disk and cause data inconsistency. Key: ZOOKEEPER-3220 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3220 Project: ZooKeeper Issue Type: Bug Components: server Affects Versions: 3.4.13, 3.4.12 Reporter: Jiafu Jiang We know that the ZooKeeper server calls fsync to make sure that log data has been durably saved to disk. But it does not call fsync to make sure that a snapshot has been durably saved, which may cause problems, since closing a file descriptor does not guarantee that its data has reached disk; see [http://man7.org/linux/man-pages/man2/close.2.html#notes] for details. If a snapshot is not durably saved, data inconsistency can follow. Here is an example, which is also a real problem I have met. 1. I deployed a 3-node ZooKeeper cluster: zk1, zk2, and zk3; zk2 was the leader. 2. Both zk1 and zk2 had the log records log1 ~ logX, where X is the zxid. 3. The machine of zk1 restarted, and during the reboot, log(X+1) ~ logY were saved to the log files of both zk2 (leader) and zk3 (follower). 4. After zk1 restarted successfully, it found itself to be a follower and began to synchronize with the leader. The leader sent a snapshot (records log1 ~ logY) to zk1, and zk1 saved the snapshot to local disk by calling ZooKeeperServer.takeSnapshot. Unfortunately, when the method returned, the snapshot data had not yet reached disk: the snapshot file had been created, but its size was 0. 5. zk1 finished the synchronization and began to accept new requests from the leader. Say log records log(Y+1) ~ logZ were accepted by zk1 and saved to its log file; thanks to fsync, zk1 could make sure this log data was not lost. 6. zk1 restarted again. Since the snapshot's size was 0, it was not used, so zk1 recovered from its log files alone. But the records log(X+1) ~ logY were lost!
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
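[Editorial note: the fix the report implies can be sketched as follows. This is a minimal illustration, not ZooKeeper's actual code; `DurableSnapshotWriter` is a hypothetical helper. The point is to call force() (fsync) on the snapshot's channel before closing it, so a crash right after takeSnapshot() returns cannot leave a zero-length snapshot file behind.]

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: write snapshot bytes and flush them to stable
// storage before close. close() alone does not guarantee durability,
// per the NOTES section of close(2).
public final class DurableSnapshotWriter {
    public static void write(Path snapshotFile, byte[] serializedDataTree) throws IOException {
        try (FileChannel ch = FileChannel.open(snapshotFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(serializedDataTree);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            // force(true) also flushes file metadata (e.g. the file size),
            // which matters here since the observed failure was a 0-byte file.
            ch.force(true);
        }
    }
}
```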
[jira] [Updated] (ZOOKEEPER-3220) The snapshot is not saved to disk and may cause data inconsistency.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiafu Jiang updated ZOOKEEPER-3220: --- Summary: The snapshot is not saved to disk and may cause data inconsistency. (was: Snapshot is not written to disk and cause data inconsistency.)
> The snapshot is not saved to disk and may cause data inconsistency.
> ---
>
> Key: ZOOKEEPER-3220
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3220
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.12, 3.4.13
> Reporter: Jiafu Jiang
> Priority: Critical
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
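[Editorial note: step 6 of the report relies on a zero-length snapshot being ignored at recovery. An illustrative guard for that behavior (an assumption for illustration, not ZooKeeper's actual snapshot deserialization logic; `SnapshotValidity` is a hypothetical name):]

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical guard: treat an empty snapshot file as absent. Recovery
// then falls back to the transaction logs, which is exactly why the data
// that existed only in the (never-flushed) snapshot was lost.
public final class SnapshotValidity {
    public static boolean usable(Path snapshotFile) throws IOException {
        return Files.exists(snapshotFile) && Files.size(snapshotFile) > 0;
    }
}
```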
[jira] [Updated] (ZOOKEEPER-3099) ZooKeeper cluster is unavailable for session_timeout time due to network partition in a three-node environment.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiafu Jiang updated ZOOKEEPER-3099: --- Summary: ZooKeeper cluster is unavailable for session_timeout time due to network partition in a three-node environment. (was: ZooKeeper cluster is unavailable for session_timeout time when the leader shutdown in a three-node environment. )
> ZooKeeper cluster is unavailable for session_timeout time due to network
> partition in a three-node environment.
> --
>
> Key: ZOOKEEPER-3099
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3099
> Project: ZooKeeper
> Issue Type: Bug
> Components: c client, java client
> Affects Versions: 3.4.11, 3.5.4, 3.4.12, 3.4.13
> Reporter: Jiafu Jiang
> Priority: Major
>
> The default readTimeout of the ZooKeeper client is 2/3 * session_timeout, and the default connectTimeout is session_timeout / hostProvider.size(). If the ZooKeeper cluster has 3 nodes, connectTimeout is therefore 1/3 * session_timeout.
>
> Suppose we have three ZooKeeper servers deployed: zk1, zk2, zk3, and zk3 is currently the leader. Client c1 is connected to zk2 (a follower). We then shut down the network of zk3 (the leader); at the same time, client c1 begins to write some data to ZooKeeper. After a (syncLimit * tickTime) timeout, zk2 disconnects from the leader, begins a new election, and becomes the new leader.
>
> The write operation cannot succeed while the old leader is down. It takes at most readTimeout for c1 to discover the failure, after which c1 tries to choose another ZooKeeper server. Unfortunately, c1 may choose zk3, which is unreachable, so c1 spends connectTimeout finding out that zk3 is unusable. Notice that readTimeout + connectTimeout = session_timeout in my case (three-node cluster).
>
> Therefore, in this case, the ZooKeeper cluster is unavailable to the client for a full session timeout when only one ZooKeeper server is down.
>
> I have some suggestions:
> # The HostProvider used by the ZooKeeper client could be made selectable via an argument.
> # readTimeout could also be made configurable.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
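[Editorial note: the timeout arithmetic in the report above can be made concrete with a small sketch. The class and method names are hypothetical; the 2/3 and 1/N defaults are as described in the report.]

```java
// Worst-case client outage for the reported scenario: the client first
// waits out readTimeout = 2/3 * sessionTimeout on its dead connection,
// then spends connectTimeout = sessionTimeout / serverCount probing the
// partitioned server. For a 3-node ensemble the two sum to a full
// session timeout.
public final class ClientOutageMath {
    public static int worstCaseMillis(int sessionTimeoutMs, int serverCount) {
        int readTimeoutMs = sessionTimeoutMs * 2 / 3;
        int connectTimeoutMs = sessionTimeoutMs / serverCount;
        return readTimeoutMs + connectTimeoutMs;
    }
}
```

With a 30 s session timeout and 3 servers this yields 20 s + 10 s = 30 s, matching the report's claim that readTimeout + connectTimeout = session_timeout; with more servers the worst case shrinks.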
[jira] [Updated] (ZOOKEEPER-3099) ZooKeeper cluster is unavailable for session_timeout time due to network partition in a three-node environment.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiafu Jiang updated ZOOKEEPER-3099: --- Description: The default readTimeout of the ZooKeeper client is 2/3 * session_timeout, and the default connectTimeout is session_timeout / hostProvider.size(). If the ZooKeeper cluster has 3 nodes, connectTimeout is therefore 1/3 * session_timeout. Suppose we have three ZooKeeper servers deployed: zk1, zk2, zk3, and zk3 is currently the leader. Client c1 is connected to zk2 (a follower). We then shut down the network of zk3 (the leader); at the same time, client c1 begins to write some data to ZooKeeper. After a (syncLimit * tickTime) timeout, zk2 disconnects from the leader, begins a new election, and becomes the new leader. The write operation cannot succeed while the old leader is unavailable. It takes at most readTimeout for c1 to discover the failure, after which c1 tries to choose another ZooKeeper server. Unfortunately, c1 may choose zk3, which is unreachable, so c1 spends connectTimeout finding out that zk3 is unusable. Notice that readTimeout + connectTimeout = session_timeout in my case (three-node cluster). Therefore, in this case, the ZooKeeper cluster is unavailable for a full session timeout when only one ZooKeeper server is unreachable due to network partition. I have some suggestions: # The HostProvider used by the ZooKeeper client could be made selectable via an argument. # readTimeout could also be made configurable.
> ZooKeeper cluster is unavailable for session_timeout time due to network
> partition in a three-node environment.
> --
>
> Key: ZOOKEEPER-3099
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3099
> Project: ZooKeeper
> Issue Type: Bug
> Components: c client, java client
> Affects Versions: 3.4.11, 3.5.4, 3.4.12, 3.4.13
> Reporter: Jiafu Jiang
> Priority: Major
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ZOOKEEPER-3099) ZooKeeper cluster is unavailable for session_timeout time due to network partition in a three-node environment.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16649635#comment-16649635 ] Jiafu Jiang commented on ZOOKEEPER-3099: [~lvfangmin] thanks for your advice. I have changed the title and the description.
> ZooKeeper cluster is unavailable for session_timeout time due to network
> partition in a three-node environment.
> --
>
> Key: ZOOKEEPER-3099
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3099
> Project: ZooKeeper
> Issue Type: Bug
> Components: c client, java client
> Affects Versions: 3.4.11, 3.5.4, 3.4.12, 3.4.13
> Reporter: Jiafu Jiang
> Priority: Major
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ZOOKEEPER-3099) ZooKeeper cluster is unavailable for session_timeout time due to network partition in a three-node environment.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-3099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiafu Jiang updated ZOOKEEPER-3099: --- Description: (updated; same description as quoted above)
> ZooKeeper cluster is unavailable for session_timeout time due to network
> partition in a three-node environment.
> --
>
> Key: ZOOKEEPER-3099
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3099
> Project: ZooKeeper
> Issue Type: Bug
> Components: c client, java client
> Affects Versions: 3.4.11, 3.5.4, 3.4.12, 3.4.13
> Reporter: Jiafu Jiang
> Priority: Major
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ZOOKEEPER-3099) ZooKeeper cluster is unavailable for session_timeout time when the leader shutdown in a three-node environment.
Jiafu Jiang created ZOOKEEPER-3099: -- Summary: ZooKeeper cluster is unavailable for session_timeout time when the leader shutdown in a three-node environment. Key: ZOOKEEPER-3099 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3099 Project: ZooKeeper Issue Type: Bug Components: c client, java client Affects Versions: 3.4.13, 3.4.12, 3.5.4, 3.4.11 Reporter: Jiafu Jiang The default readTimeout of the ZooKeeper client is 2/3 * session_timeout, and the default connectTimeout is session_timeout / hostProvider.size(). If the ZooKeeper cluster has 3 nodes, connectTimeout is therefore 1/3 * session_timeout. Suppose we have three ZooKeeper servers deployed: zk1, zk2, zk3, and zk3 is currently the leader. Client c1 is connected to zk2 (a follower). We then shut down the network of zk3 (the leader); at the same time, client c1 begins to write some data to ZooKeeper. After a (syncLimit * tickTime) timeout, zk2 disconnects from the leader, begins a new election, and becomes the new leader. The write operation cannot succeed while the old leader is down. It takes at most readTimeout for c1 to discover the failure, after which c1 tries to choose another ZooKeeper server. Unfortunately, c1 may choose zk3, which is unreachable, so c1 spends connectTimeout finding out that zk3 is unusable. Notice that readTimeout + connectTimeout = session_timeout in my case (three-node cluster). Therefore, in this case, the ZooKeeper cluster is unavailable for a full session timeout when only one ZooKeeper server is down. I have some suggestions: # The HostProvider used by the ZooKeeper client could be made selectable via an argument. # readTimeout could also be made configurable. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ZOOKEEPER-2701) Timeout for RecvWorker is too long
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544918#comment-16544918 ] Jiafu Jiang commented on ZOOKEEPER-2701: I read the source code of ZooKeeper 3.4.12, and I found that a SendWorker or RecvWorker only finishes when an IOException happens. When network problems occur, the OS may or may not discover the dead connection in time, especially when the socket timeout is infinite. This leads to the problem that ZooKeeper takes *several* minutes to elect a new leader.
> Timeout for RecvWorker is too long
> --
>
> Key: ZOOKEEPER-2701
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2701
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.4.8, 3.4.9, 3.4.10, 3.4.11
> Environment: Centos6.5
> ZooKeeper 3.4.8
> Reporter: Jiafu Jiang
> Priority: Major
>
> Environment:
> I deploy ZooKeeper in a cluster of three nodes. Each node has three network interfaces (eth0, eth1, eth2).
> Hostname is used instead of IP address in zoo.cfg, and quorumListenOnAllIPs=true.
> Problem:
> I start three ZooKeeper servers (node A, node B, and node C) one by one; when the leader election finishes, node B is the leader.
> Then I shut down one network interface of node A with the command "ifdown eth0".
> The ZooKeeper server on node A loses its connections to node B and node C. In my test, it took about 20 minutes for the ZooKeeper server on node A to notice this and call QuorumServer.recreateSocketAddress to re-resolve the hostname.
> I read the source code, and I found this code in
> {code:java|title=QuorumCnxManager.java:|borderStyle=solid}
> class RecvWorker extends ZooKeeperThread {
> Long sid;
> Socket sock;
> volatile boolean running = true;
> final DataInputStream din;
> final SendWorker sw;
> RecvWorker(Socket sock, DataInputStream din, Long sid, SendWorker sw)
> {
> super("RecvWorker:" + sid);
> this.sid = sid;
> this.sock = sock;
> this.sw = sw;
> this.din = din;
> try {
> // OK to wait until socket disconnects while reading.
> sock.setSoTimeout(0);
> } catch (IOException e) {
> LOG.error("Error while accessing socket for " + sid, e);
> closeSocket(sock);
> running = false;
> }
> }
> ...
> }
> {code}
> I notice that the socket timeout is set to 0 in the RecvWorker constructor. I think this is reasonable when the IP address of a ZooKeeper server never changes, but considering that the IP address of each ZooKeeper server may change, we had better set a finite timeout here.
> I think this is a problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
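[Editorial note: the change the reporter argues for can be sketched as follows. This is an assumption for illustration, not the committed fix; `RecvTimeoutPolicy` and its methods are hypothetical names. The idea is to derive a finite SO_TIMEOUT from the quorum settings instead of passing 0 (infinite), so a RecvWorker blocked on a dead address is eventually forced out of read() and the hostname can be re-resolved.]

```java
import java.io.IOException;
import java.net.Socket;

// Hypothetical policy: a finite read timeout for election connections,
// e.g. tickTime * syncLimit, floored at one second so a tiny tickTime
// cannot produce an unusably small timeout.
public final class RecvTimeoutPolicy {
    public static int readTimeoutMillis(int tickTimeMs, int syncLimit) {
        return Math.max(1000, tickTimeMs * syncLimit);
    }

    public static void apply(Socket sock, int tickTimeMs, int syncLimit) throws IOException {
        // Replaces the original sock.setSoTimeout(0): a read that stalls
        // longer than this now throws SocketTimeoutException instead of
        // blocking until the OS notices the dead connection.
        sock.setSoTimeout(readTimeoutMillis(tickTimeMs, syncLimit));
    }
}
```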
[jira] [Updated] (ZOOKEEPER-2701) Timeout for RecvWorker is too long
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiafu Jiang updated ZOOKEEPER-2701: --- Priority: Trivial (was: Minor)
> Timeout for RecvWorker is too long
> --
>
> Key: ZOOKEEPER-2701
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2701
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.4.8, 3.4.9, 3.4.10, 3.4.11
> Environment: Centos6.5
> ZooKeeper 3.4.8
> Reporter: Jiafu Jiang
> Priority: Trivial
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ZOOKEEPER-2701) Timeout for RecvWorker is too long
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiafu Jiang updated ZOOKEEPER-2701: --- Priority: Major (was: Trivial)
> Timeout for RecvWorker is too long
> --
>
> Key: ZOOKEEPER-2701
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2701
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.4.8, 3.4.9, 3.4.10, 3.4.11
> Environment: Centos6.5
> ZooKeeper 3.4.8
> Reporter: Jiafu Jiang
> Priority: Major
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ZOOKEEPER-2701) Timeout for RecvWorker is too long
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544897#comment-16544897 ] Jiafu Jiang commented on ZOOKEEPER-2701:

I removed the following code:

{code:java}
try {
    // OK to wait until socket disconnects while reading.
    sock.setSoTimeout(0);
} catch (IOException e) {
    LOG.error("Error while accessing socket for " + sid, e);
    closeSocket(sock);
    running = false;
}
{code}

and it works fine in my test environment.

> Timeout for RecvWorker is too long
> ----------------------------------
>
> Key: ZOOKEEPER-2701
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2701
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.4.8, 3.4.9, 3.4.10, 3.4.11
> Environment: CentOS 6.5
> ZooKeeper 3.4.8
> Reporter: Jiafu Jiang
> Priority: Minor
>
> Environment:
> I deploy ZooKeeper in a cluster of three nodes. Each node has three network
> interfaces (eth0, eth1, eth2).
> Hostnames are used instead of IP addresses in zoo.cfg, and
> quorumListenOnAllIPs=true.
> Problem:
> I start three ZooKeeper servers (node A, node B, and node C) one by one;
> when the leader election finishes, node B is the leader.
> Then I shut down one network interface of node A with the command "ifdown eth0".
> The ZooKeeper server on node A loses its connections to node B and node C. In
> my test, it takes about 20 minutes before the ZooKeeper server on node A
> notices the event and calls QuorumServer.recreateSocketAddress to
> re-resolve the hostname.
> Reading the source code, I find the following:
> {code:java|title=QuorumCnxManager.java|borderStyle=solid}
> class RecvWorker extends ZooKeeperThread {
>     Long sid;
>     Socket sock;
>     volatile boolean running = true;
>     final DataInputStream din;
>     final SendWorker sw;
>
>     RecvWorker(Socket sock, DataInputStream din, Long sid, SendWorker sw) {
>         super("RecvWorker:" + sid);
>         this.sid = sid;
>         this.sock = sock;
>         this.sw = sw;
>         this.din = din;
>         try {
>             // OK to wait until socket disconnects while reading.
>             sock.setSoTimeout(0);
>         } catch (IOException e) {
>             LOG.error("Error while accessing socket for " + sid, e);
>             closeSocket(sock);
>             running = false;
>         }
>     }
>     ...
> }
> {code}
> I notice that the socket timeout is set to 0 in the RecvWorker constructor.
> This is reasonable when the IP address of a ZooKeeper server never changes,
> but since the IP address of a ZooKeeper server may change, we should
> probably set a timeout here.
> I think this is a problem.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
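The SO_TIMEOUT behavior the report describes can be seen in isolation with a minimal sketch. This is not ZooKeeper code: the class name, port choice, and 200 ms timeout are arbitrary for illustration. A loopback peer accepts the connection but never writes, standing in for a server whose address has gone stale; with `setSoTimeout(0)` the read would block indefinitely, while a finite timeout raises `SocketTimeoutException` so the reader can give up and re-resolve.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SoTimeoutDemo {

    /**
     * Opens a loopback connection on which the peer never sends anything,
     * sets a finite read timeout, and reports whether the read timed out.
     */
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        try (ServerSocket server = new ServerSocket(0);
             Socket sock = new Socket("127.0.0.1", server.getLocalPort());
             Socket peer = server.accept()) {
            // With setSoTimeout(0), as in RecvWorker, this read would block
            // forever; a finite timeout lets the reader notice a dead peer.
            sock.setSoTimeout(timeoutMillis);
            try {
                sock.getInputStream().read();
                return false; // the peer sent data or closed; no timeout
            } catch (SocketTimeoutException e) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readTimesOut(200)
                ? "read timed out as expected"
                : "unexpected: read completed");
    }
}
```

Note that a timeout alone is not a full fix: the caller must also handle the `SocketTimeoutException` by deciding whether the peer is merely idle or actually unreachable.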
[jira] [Commented] (ZOOKEEPER-2701) Timeout for RecvWorker is too long
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544893#comment-16544893 ] Jiafu Jiang commented on ZOOKEEPER-2701:

I removed the following code:

{code:java}
try {
    // OK to wait until socket disconnects while reading.
    sock.setSoTimeout(0);
} catch (IOException e) {
    LOG.error("Error while accessing socket for " + sid, e);
    closeSocket(sock);
    running = false;
}
{code}

and it works fine in my test environment.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Issue Comment Deleted] (ZOOKEEPER-2701) Timeout for RecvWorker is too long
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiafu Jiang updated ZOOKEEPER-2701:
---
Comment: was deleted (was: I remove the following code: try { // OK to wait until socket disconnects while reading. sock.setSoTimeout(0); } catch (IOException e) { LOG.error("Error while accessing socket for " + sid, e); closeSocket(sock); running = false; } And I find it works fine in my test environment.)
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ZOOKEEPER-2701) Timeout for RecvWorker is too long
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16544893#comment-16544893 ] Jiafu Jiang edited comment on ZOOKEEPER-2701 at 7/16/18 7:32 AM:
---
I removed the following code:

{code:java}
try {
    // OK to wait until socket disconnects while reading.
    sock.setSoTimeout(0);
} catch (IOException e) {
    LOG.error("Error while accessing socket for " + sid, e);
    closeSocket(sock);
    running = false;
}
{code}

and it works fine in my test environment.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ZOOKEEPER-2701) Timeout for RecvWorker is too long
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiafu Jiang updated ZOOKEEPER-2701:
---
Affects Version/s: 3.4.11
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ZOOKEEPER-2930) Leader cannot be elected due to network timeout of some members.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16381347#comment-16381347 ] Jiafu Jiang commented on ZOOKEEPER-2930:

I hope this problem can be fixed in version 3.4.X, since 3.4.X is the stable release line.

> Leader cannot be elected due to network timeout of some members.
> ----------------------------------------------------------------
>
> Key: ZOOKEEPER-2930
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2930
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection, quorum, server
> Affects Versions: 3.4.10, 3.5.3, 3.4.11, 3.5.4, 3.4.12
> Environment: Java 8
> ZooKeeper 3.4.11 (from GitHub)
> CentOS 6.5
> Reporter: Jiafu Jiang
> Priority: Critical
> Attachments: zoo.cfg, zookeeper1.log, zookeeper2.log
>
> I deploy a ZooKeeper cluster with three nodes:
> ofs_zk1: 20.10.11.101, 30.10.11.101
> ofs_zk2: 20.10.11.102, 30.10.11.102
> ofs_zk3: 20.10.11.103, 30.10.11.103
> I shut down the network interfaces of ofs_zk2 using the "ifdown eth0 eth1"
> command.
> A new leader should be elected within a few seconds, but in fact ofs_zk1 and
> ofs_zk3 just keep holding elections again and again, and neither of them can
> become the new leader.
> I changed the log level to DEBUG (the default is INFO) and restarted the
> ZooKeeper servers on ofs_zk1 and ofs_zk2 again, but that does not fix the
> problem.
> Reading the logs and the ZooKeeper source code, I think I have found the
> reason.
> When the potential leader (say, ofs_zk3) begins the election
> (FastLeaderElection.lookForLeader()), it sends notifications to all the
> servers.
> When it fails to receive any notification within a timeout, it resends the
> notifications and doubles the timeout. This repeats until a notification is
> received or the timeout reaches a maximum value.
> FastLeaderElection.sendNotifications() just puts the notification messages
> into a queue and returns. The WorkerSender is responsible for sending the
> notifications.
> The WorkerSender processes the notifications one by one by passing them to
> QuorumCnxManager. Here comes the problem: QuorumCnxManager.toSend() blocks
> for a long time when a notification is sent to ofs_zk2 (whose network is
> down), so notifications that belong to ofs_zk1 are also blocked for a long
> time. The repeated notifications from FastLeaderElection.sendNotifications()
> just make things worse.
> Here is the related source code:
> {code:java}
> public void toSend(Long sid, ByteBuffer b) {
>     /*
>      * If sending message to myself, then simply enqueue it (loopback).
>      */
>     if (this.mySid == sid) {
>         b.position(0);
>         addToRecvQueue(new Message(b.duplicate(), sid));
>     /*
>      * Otherwise send to the corresponding thread to send.
>      */
>     } else {
>         /*
>          * Start a new connection if doesn't have one already.
>          */
>         ArrayBlockingQueue<ByteBuffer> bq = new ArrayBlockingQueue<ByteBuffer>(SEND_CAPACITY);
>         ArrayBlockingQueue<ByteBuffer> bqExisting = queueSendMap.putIfAbsent(sid, bq);
>         if (bqExisting != null) {
>             addToSendQueue(bqExisting, b);
>         } else {
>             addToSendQueue(bq, b);
>         }
>
>         // This may block!!!
>         connectOne(sid);
>     }
> }
> {code}
> Therefore, when ofs_zk3 believes that it is the leader, it begins to wait for
> the epoch ACK, but ofs_zk1 never receives the notification (which says the
> leader is ofs_zk3) because ofs_zk3 has not actually sent it (it may still sit
> in the send queue of the WorkerSender). In the end, the potential leader
> ofs_zk3 fails to receive the epoch ACK within the timeout, so it quits being
> leader and begins a new election.
> The log files of ofs_zk1 and ofs_zk3 are attached.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
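The blocking pattern described above, where `toSend()` enqueues the message but then calls a possibly slow `connectOne()` on the caller's thread, can be contrasted with a non-blocking variant. The sketch below is hypothetical and not ZooKeeper's actual fix: the class name `NonBlockingSender` and its structure are illustrative, with only the queue-per-peer idea and the `toSend`/`connectOne` names borrowed from the quoted code. The key change is that the connection attempt is handed to an executor, so a dead peer cannot stall notifications aimed at reachable peers.

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class NonBlockingSender {
    // One bounded queue per peer server id, mirroring queueSendMap above.
    // Capacity 1 is a simplification: here a full queue drops the offer,
    // whereas the real code replaces the buffered message.
    private final Map<Long, BlockingQueue<ByteBuffer>> queueSendMap =
            new ConcurrentHashMap<>();
    // Connection attempts run here instead of on the caller's thread.
    private final ExecutorService connector = Executors.newCachedThreadPool();

    public void toSend(Long sid, ByteBuffer b) {
        queueSendMap.computeIfAbsent(sid, id -> new ArrayBlockingQueue<>(1))
                    .offer(b);
        // The potentially slow connect no longer blocks the caller, so a
        // peer with a dead network cannot delay sends to other peers.
        connector.submit(() -> connectOne(sid));
    }

    protected void connectOne(Long sid) {
        // Placeholder for the real (blocking) connection logic.
    }
}
```

With this shape, `toSend()` returns in microseconds even when `connectOne()` for one server id hangs on a TCP connect timeout, which is exactly the failure mode the report attributes to the election loop.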
[jira] [Updated] (ZOOKEEPER-2930) Leader cannot be elected due to network timeout of some members.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiafu Jiang updated ZOOKEEPER-2930:
---
Affects Version/s: 3.5.4
                   3.5.3
-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ZOOKEEPER-2930) Leader cannot be elected due to network timeout of some members.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiafu Jiang updated ZOOKEEPER-2930:
---
Affects Version/s: 3.4.12
-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ZOOKEEPER-2962) The function queueEmpty() in FastLeaderElection.Messenger is not used, should be removed.
Jiafu Jiang created ZOOKEEPER-2962:
---
Summary: The function queueEmpty() in FastLeaderElection.Messenger is not used, should be removed.
Key: ZOOKEEPER-2962
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2962
Project: ZooKeeper
Issue Type: Improvement
Components: leaderElection
Affects Versions: 3.4.11
Reporter: Jiafu Jiang
Priority: Minor
-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ZOOKEEPER-2930) Leader cannot be elected due to network timeout of some members.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiafu Jiang updated ZOOKEEPER-2930:
---
Priority: Critical (was: Major)
-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ZOOKEEPER-2930) Leader cannot be elected due to network timeout of some members.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiafu Jiang updated ZOOKEEPER-2930:
---
Affects Version/s: 3.4.11
-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ZOOKEEPER-2930) Leader cannot be elected due to network timeout of some members.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiafu Jiang updated ZOOKEEPER-2930:
---
Component/s: server
             quorum
-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ZOOKEEPER-2930) Leader cannot be elected due to network timeout of some members.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16238828#comment-16238828 ] Jiafu Jiang commented on ZOOKEEPER-2930:
I suggest that there could be more than one WorkerSender in FastLeaderElection, so that a network failure of some ZooKeeper servers would not delay the notifications to the others.
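A rough sketch of that suggestion (not a ZooKeeper patch; the PerPeerSender class and the transport callback are invented for this demo): giving every peer its own bounded queue and sender thread means a peer whose network is down blocks only its own thread, never the caller and never the notifications to healthy peers.

{code:java}
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;

// Sketch of per-peer sender threads. The transport callback stands in for
// connectOne() plus the socket write; it may block, but only on its own thread.
public class PerPeerSender {
    private final ConcurrentMap<Long, BlockingQueue<String>> queues = new ConcurrentHashMap<>();
    private final BiConsumer<Long, String> transport;

    public PerPeerSender(BiConsumer<Long, String> transport) {
        this.transport = transport;
    }

    /** Enqueues and returns immediately, as sendNotifications() expects. */
    public void toSend(long sid, String msg) {
        BlockingQueue<String> q = queues.computeIfAbsent(sid, id -> {
            BlockingQueue<String> created = new ArrayBlockingQueue<>(16);
            Thread sender = new Thread(() -> {
                try {
                    while (true) {
                        // May block on a dead peer, but only this peer's thread.
                        transport.accept(id, created.take());
                    }
                } catch (InterruptedException ignored) { }
            }, "sender-" + id);
            sender.setDaemon(true);
            sender.start();
            return created;
        });
        q.offer(msg); // bounded queue: excess notifications are dropped, the caller never blocks
    }
}
{code}

In the failing scenario described above, the notifications addressed to ofs_zk1 would then flow independently of the stalled connection attempt to ofs_zk2.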
[jira] [Updated] (ZOOKEEPER-2930) Leader cannot be elected due to network timeout of some members.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiafu Jiang updated ZOOKEEPER-2930:
---
Summary: Leader cannot be elected due to network timeout of some members. (was: Leader cannot be elected due to network timeout of some member.)
[jira] [Updated] (ZOOKEEPER-2930) Leader cannot be elected due to network timeout of some member.
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiafu Jiang updated ZOOKEEPER-2930:
---
Attachment: zookeeper1.log
            zookeeper2.log
            zoo.cfg

zookeeper1.log : ofs_zk1
zookeeper2.log : ofs_zk3
[jira] [Created] (ZOOKEEPER-2930) Leader cannot be elected due to network timeout of some member.
Jiafu Jiang created ZOOKEEPER-2930:
--
Summary: Leader cannot be elected due to network timeout of some member.
Key: ZOOKEEPER-2930
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2930
Project: ZooKeeper
Issue Type: Bug
Components: leaderElection
Affects Versions: 3.4.10
Environment: Java 8
ZooKeeper 3.4.11 (from GitHub)
CentOS 6.5
Reporter: Jiafu Jiang
Priority: Major
[jira] [Created] (ZOOKEEPER-2923) The comment of the variable matchSyncs in class CommitProcessor has a mistake.
Jiafu Jiang created ZOOKEEPER-2923: -- Summary: The comment of the variable matchSyncs in class CommitProcessor has a mistake. Key: ZOOKEEPER-2923 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2923 Project: ZooKeeper Issue Type: Bug Components: quorum Affects Versions: 3.5.3, 3.4.10 Reporter: Jiafu Jiang Priority: Minor

The comment on the variable matchSyncs in class CommitProcessor says:

{code:java}
/**
 * This flag indicates whether we need to wait for a response to come back from the
 * leader or we just let the sync operation flow through like a read. The flag will
 * be true if the CommitProcessor is in a Leader pipeline.
 */
boolean matchSyncs;
{code}

I searched the source code and found that matchSyncs is false if the CommitProcessor is in a Leader pipeline, and true if it is in a Follower pipeline. Therefore I think the comment should be modified to match the code. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
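A corrected comment could read as follows. This is only a sketch of the fix described above; the exact wording is up to the committer:

{code:java}
/**
 * This flag indicates whether we need to wait for a response to come back from the
 * leader or we just let the sync operation flow through like a read. The flag will
 * be false if the CommitProcessor is in a Leader pipeline, and true if it is in a
 * Follower pipeline.
 */
boolean matchSyncs;
{code}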
[jira] [Commented] (ZOOKEEPER-1626) Zookeeper C client should be tolerant of clock adjustments
[ https://issues.apache.org/jira/browse/ZOOKEEPER-1626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143671#comment-16143671 ] Jiafu Jiang commented on ZOOKEEPER-1626: Has this problem been fixed in the 3.4.x releases?
> Zookeeper C client should be tolerant of clock adjustments
> ---
>
> Key: ZOOKEEPER-1626
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1626
> Project: ZooKeeper
> Issue Type: Sub-task
> Components: c client
> Reporter: Colin P. McCabe
> Assignee: Colin P. McCabe
> Fix For: 3.5.1, 3.6.0
>
> Attachments: ZOOKEEPER-1366.001.patch, ZOOKEEPER-1366.002.patch, ZOOKEEPER-1366.003.patch, ZOOKEEPER-1366.004.patch, ZOOKEEPER-1366.006.patch, ZOOKEEPER-1366.007.patch, ZOOKEEPER-1626.patch
>
> The Zookeeper C client should use monotonic time when available, in order to be more tolerant of time adjustments.
-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (ZOOKEEPER-2802) Zookeeper C client hang @wait_sync_completion
[ https://issues.apache.org/jira/browse/ZOOKEEPER-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16143651#comment-16143651 ] Jiafu Jiang commented on ZOOKEEPER-2802: [~yihao] I have the same problem. Have you found a solution?
> Zookeeper C client hang @wait_sync_completion
> -
>
> Key: ZOOKEEPER-2802
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2802
> Project: ZooKeeper
> Issue Type: Bug
> Components: c client
> Affects Versions: 3.4.6
> Environment: DISTRIB_DESCRIPTION="Ubuntu 14.04.2 LTS"
> Reporter: yihao yang
> Priority: Critical
> Attachments: zookeeper.out.2017.05.31-10.06.23
>
> I was using the zookeeper 3.4.6 C client to access one zookeeper server in a VM. The VM environment is not stable and I get a lot of EXPIRED_SESSION_STATE events. I will create another session to ZK when I get an expired event. I also have a read/write lock to protect session reads (get/list/... on zk) and writes (connect, close, reconnect zhandle).
> The problem is the session got an EXPIRED_SESSION_STATE event, and when it tried to hold the write lock and reconnect the session, it found there was a thread holding the read lock (which was operating a sync list on zk). See the stack below:
> GDB stack:
> Thread 7 (Thread 0x7f838a43a700 (LWP 62845)):
> #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> #1 0x00636033 in wait_sync_completion (sc=sc@entry=0x7f8344000af0) at src/mt_adaptor.c:85
> #2 0x00633248 in zoo_wget_children2_ (zh=, path=0x7f83440677a8 "/dict/objects/__services/RLS-GSE/_static_nodes", watcher=0x0, watcherCtx=0x13e6310, strings=0x7f838a4397b0, stat=0x7f838a4398d0) at src/zookeeper.c:3630
> #3 0x0045e6ff in ZooKeeperContext::getChildren (this=0x13e6310, path=..., children=children@entry=0x7f838a439890, stat=stat@entry=0x7f838a4398d0) at zookeeper_context.cpp:xxx
> This sync list call didn't return ZINVALIDSTATE but hung. Does anyone know the problem?
-- This message was sent by Atlassian JIRA (v6.4.14#64029)