Another option would be for the cluster to compare times and note when one member seems to be lagging. Restoration of that lag would then be less remarkable.
I believe that the pattern of these problems is a slow slippage behind and a sudden jump forward. On Thu, Aug 19, 2010 at 7:51 AM, Vishal K <vishalm...@gmail.com> wrote: > Hi, > > I remember Ben had opened a jira for clock jumps earlier: > https://issues.apache.org/jira/browse/ZOOKEEPER-366. It is not uncommon to > have clocks jump forward in virtualized environments. > > It is desirable to modify ZooKeeper to handle this situation (as much as > possible) internally. It would need to be done for both client - server > connections and server - server connections. One obvious solution is to > retry a few times (send ping) after getting a timeout. Another way is to > count the number of pings that have been sent after receiving the timeout. > If number of pings do not match the expected number (say 5 ping attempt > should be finished for a 5 sec timeout), then wait till all the pings are > finished. In effect do not completely rely on the clock. Any comments? > > -Vishal > > On Thu, Aug 19, 2010 at 3:52 AM, Qing Yan <qing...@gmail.com> wrote: > > > Oh.. our servers are also running in a virtualized environment. > > > > On Thu, Aug 19, 2010 at 2:58 PM, Martin Waite <waite....@gmail.com> > wrote: > > > > > Hi, > > > > > > I have tripped over similar problems testing Red Hat Cluster in > > virtualised > > > environments. I don't know whether recent linux kernels have improved > > > their > > > interaction with VMWare, but in our environments clock drift caused by > > lost > > > ticks can be substantial, requiring NTP to sometimes jump the clock > > rather > > > than control acceleration. In one of our internal production rigs, > the > > > local NTP servers themselves were virtualised - causing absolute mayhem > > > when > > > heavy loads hit the other guests on the same physical hosts. > > > > > > The effect on RHCS (v2.0) is quite dramatic. A forward jump in time by > > 10 > > > seconds always causes a member to prematurely time-out on a network > read, > > > causing the member to drop out and trigger a cluster reconfiguration. > > > Apparently NTP is integrated with RHCS version 3, but I don't know what > > is > > > meant by that. > > > > > > I guess this post is not entirely relevent to ZK, but I am just making > > the > > > point that virtualisation (of NTP servers and or clients) can cause > > > repeated > > > premature timeouts. On Linux, I believe that there is a class of > timers > > > provided that is immune to this, but I doubt that there is a platform > > > independent way of coping with this. > > > > > > My 2p. > > > > > > regards, > > > Martin > > > > > > On 18 August 2010 16:53, Patrick Hunt <ph...@apache.org> wrote: > > > > > > > Do you expect the time to be "wrong" frequently? If ntp is running it > > > > should never get out of sync more than a small amount. As long as > this > > is > > > > less than ~your timeout you should be fine. > > > > > > > > Patrick > > > > > > > > > > > > On 08/18/2010 01:04 AM, Qing Yan wrote: > > > > > > > >> Hi, > > > >> > > > >> The testcase is fairly simple. We have a client which connects to > > ZK, > > > >> registers an ephemeral node and watches on it. Now change the client > > > >> machine's time - session killed.. > > > >> > > > >> Here is the log: > > > >> > > > >> *2010-08-18 04:24:57,782 INFO > > > >> com.taobao.timetunnel2.cluster.service.AgentService: Host name > > > >> kgbtest1.corp.alimama.com > > > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client > > > >> environment:zookeeper.version=3.2.2-888565, built on 12/08/2009 > 21:51 > > > GMT > > > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client > > > >> environment:host.name=kgbtest1.corp.alimama.com > > > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client > > > >> environment:java.version=1.6.0_13 > > > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client > > > >> environment:java.vendor=Sun Microsystems Inc. > > > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client > > > >> environment:java.home=/usr/java/jdk1.6.0_13/jre > > > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client > > > >> > > > >> > > > > > > environment:java.class.path=/home/admin/TimeTunnel2/cluster/bin/../conf/agent/:/home/admin/TimeTunnel2/cluster/bin/../lib/slf4j-log4j12-1.5.2.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/slf4j-api-1.5.2.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/timetunnel2-cluster-0.0.1-SNAPSHOT.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/zookeeper-3.2.2.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/log4j-1.2.14.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/gson-1.4.jar:/home/admin/TimeTunnel2/cluster/bin/../lib/zk-recipes.jar > > > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client > > > >> > > > >> > > > > > > environment:java.library.path=/usr/java/jdk1.6.0_13/jre/lib/amd64/server:/usr/java/jdk1.6.0_13/jre/lib/amd64:/usr/java/jdk1.6.0_13/jre/../lib/amd64:/usr/java/packages/lib/amd64:/lib:/usr/lib > > > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client > > > >> environment:java.io.tmpdir=/tmp > > > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client > > > >> environment:java.compiler=<NA> > > > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client > > > >> environment:os.name=Linux > > > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client > > > >> environment:os.arch=amd64 > > > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client > > > >> environment:os.version=2.6.18-164.el5 > > > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client > > > >> environment:user.name=admin > > > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client > > > >> environment:user.home=/home/admin > > > >> 2010-08-18 04:24:57,789 INFO org.apache.zookeeper.ZooKeeper: Client > > > >> environment:user.dir=/home/admin/TimeTunnel2/cluster/log > > > >> 2010-08-18 04:24:57,790 INFO org.apache.zookeeper.ZooKeeper: > > Initiating > > > >> client connection, connectString= > xentest10-vm5.corp.alimama.com:2181, > > > >> xentest10-vm6.corp.alimama.com:2181, > > xentest10-vm9.corp.alimama.com:2181 > > > >> sessionTimeout=600000 > > > >> watcher=com.taobao.timetunnel2.cluster.service.agentserv...@48d6c16c > > > >> 2010-08-18 04:24:57,791 INFO org.apache.zookeeper.ClientCnxn: > > > >> zookeeper.disableAutoWatchReset is false > > > >> 2010-08-18 04:24:57,817 INFO org.apache.zookeeper.ClientCnxn: > > Attempting > > > >> connection to server > xentest10-vm5.corp.alimama.com/10.254.5.160:2181 > > > >> 2010-08-18 04:24:57,818 INFO org.apache.zookeeper.ClientCnxn: > Priming > > > >> connection to java.nio.channels.SocketChannel[connected local=/ > > > >> 10.254.2.160:48536 remote= > > > >> xentest10-vm5.corp.alimama.com/10.254.5.160:2181] > > > >> 2010-08-18 04:24:57,821 INFO org.apache.zookeeper.ClientCnxn: Server > > > >> connection successful > > > >> 2010-08-18 04:24:57,832 INFO > > > >> com.taobao.timetunnel2.cluster.service.AgentService: Register... > > > >> 2010-08-18 04:24:57,834 DEBUG org.apache.zookeeper.ClientCnxn: > Reading > > > >> reply > > > >> sessionid:0x12a8413f71a0003, packet:: clientPath:null > serverPath:null > > > >> finished:false header:: -8,101 replyHeader:: -8,-1,0 request:: > > > >> 0,v{},v{},v{} response:: null > > > >> 2010-08-18 04:24:57,846 DEBUG org.apache.zookeeper.ClientCnxn: > Reading > > > >> reply > > > >> sessionid:0x12a8413f71a0003, packet:: clientPath:null > serverPath:null > > > >> finished:false header:: 1,1 replyHeader:: 1,8589934649,-110 > > request:: > > > >> '/agent,,v{s{31,s{'world,'anyone}}},0 response:: > > > >> 2010-08-18 04:24:57,853 DEBUG > > > >> com.taobao.timetunnel2.cluster.zookeeper.operation.PathCreator: Path > > > >> /agent > > > >> already exist > > > >> 2010-08-18 04:24:57,882 DEBUG org.apache.zookeeper.ClientCnxn: > Reading > > > >> reply > > > >> sessionid:0x12a8413f71a0003, packet:: clientPath:null > serverPath:null > > > >> finished:false header:: 2,1 replyHeader:: 2,8589934650,-110 > > request:: > > > >> '/agent/status,,v{s{31,s{'world,'anyone}}},0 response:: > > > >> 2010-08-18 04:24:57,882 DEBUG > > > >> com.taobao.timetunnel2.cluster.zookeeper.operation.PathCreator: Path > > > >> /agent/status already exist > > > >> 2010-08-18 04:24:57,886 DEBUG org.apache.zookeeper.ClientCnxn: > Reading > > > >> reply > > > >> sessionid:0x12a8413f71a0003, packet:: clientPath:null > serverPath:null > > > >> finished:false header:: 3,1 replyHeader:: 3,8589934651,0 request:: > > > >> '/agent/status/kgbtest1.corp.alimama.com > > ,,v{s{31,s{'world,'anyone}}},1 > > > >> response:: '/agent/status/kgbtest1.corp.alimama.com > > > >> 2010-08-18 04:24:57,886 DEBUG > > > >> com.taobao.timetunnel2.cluster.zookeeper.operation.PathCreator: Path > > > >> /agent/status/kgbtest1.corp.alimama.com creation OK > > > >> 2010-08-18 04:24:57,886 INFO > > > >> com.taobao.timetunnel2.cluster.service.AgentService: Watch for -> > > > >> /agent/status/kgbtest1.corp.alimama.com > > > >> 2010-08-18 04:24:57,892 DEBUG org.apache.zookeeper.ClientCnxn: > Reading > > > >> reply > > > >> sessionid:0x12a8413f71a0003, packet:: clientPath:null > serverPath:null > > > >> finished:false header:: 4,4 replyHeader:: 4,-1,0 request:: > > > >> '/agent/status/ > > > >> kgbtest1.corp.alimama.com,T response:: > > > >> > > > >> > > > > > > ,s{8589934651,8589934651,1282116165777,1282116165777,0,0,0,84024764344696835,0,0,8589934651} > > > >> > > > >> 2010-08-18 04:24:57,892 INFO > > > >> com.taobao.timetunnel2.cluster.service.AgentService: Data changed > > > >> @/agent/status/kgbtest1.corp.alimama.com > > > >> 2010-08-18 04:25:11,226 DEBUG org.apache.zookeeper.ClientCnxn: Got > > ping > > > >> response for sessionid:0x12a8413f71a0003 after 1ms > > > >> 2010-08-18 04:25:24,560 DEBUG org.apache.zookeeper.ClientCnxn: Got > > ping > > > >> response for sessionid:0x12a8413f71a0003 after 2ms > > > >> 2010-08-18 04:25:37,893 DEBUG org.apache.zookeeper.ClientCnxn: Got > > ping > > > >> response for sessionid:0x12a8413f71a0003 after 1ms > > > >> 2010-08-18 04:19:44,853 DEBUG org.apache.zookeeper.ClientCnxn: Got > > > >> notification sessionid:0x12a8413f71a0003 > > > >> 2010-08-18 04:19:44,854 DEBUG org.apache.zookeeper.ClientCnxn: Got > > > >> WatchedEvent: Znode change. Path: > > > >> /agent/status/kgbtest1.corp.alimama.comType: NodeDeleted for > sessionid > > > >> 0x12a8413f71a0003 > > > >> 2010-08-18 04:19:44,854 DEBUG > > > >> com.taobao.timetunnel2.cluster.zookeeper.operation.PathDataWatcher: > > > >> getChildren event =>WatchedEvent: Znode change. Path: /agent/status/ > > > >> kgbtest1.corp.alimama.com Type: NodeDeleted > > > >> 2010-08-18 04:19:44,855 WARN org.apache.zookeeper.ClientCnxn: > > Exception > > > >> closing session 0x12a8413f71a0003 to > > > sun.nio.ch.selectionkeyi...@7ca3d4cf > > > >> java.io.IOException: Read error rc = -1 > > java.nio.DirectByteBuffer[pos=0 > > > >> lim=4 cap=4] > > > >> at > > > >> > > > >> > > > > > > org.apache.zookeeper.ClientCnxn$SendThread.doIO(zookeeper:ClientCnxn.java):701) > > > >> at > > > >> > > > >> > > > > > > org.apache.zookeeper.ClientCnxn$SendThread.run(zookeeper:ClientCnxn.java):945) > > > >> 2010-08-18 04:19:45,591 INFO org.apache.zookeeper.ClientCnxn: > > Attempting > > > >> connection to server > xentest10-vm6.corp.alimama.com/10.254.5.161:2181 > > > >> 2010-08-18 04:19:45,591 INFO org.apache.zookeeper.ClientCnxn: > Priming > > > >> connection to java.nio.channels.SocketChannel[connected local=/ > > > >> 10.254.2.160:45776 remote= > > > >> xentest10-vm6.corp.alimama.com/10.254.5.161:2181] > > > >> 2010-08-18 04:19:45,591 INFO org.apache.zookeeper.ClientCnxn: Server > > > >> connection successful > > > >> 2010-08-18 04:19:45,593 WARN org.apache.zookeeper.ClientCnxn: > > Exception > > > >> closing session 0x12a8413f71a0003 to > > > sun.nio.ch.selectionkeyi...@2fdb7df8 > > > >> java.io.IOException: Session Expired > > > >> at > > > >> > > > >> > > > > > > org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(zookeeper:ClientCnxn.java):589) > > > >> at > > > >> > > > >> > > > > > > org.apache.zookeeper.ClientCnxn$SendThread.doIO(zookeeper:ClientCnxn.java):709) > > > >> at > > > >> > > > >> > > > > > > org.apache.zookeeper.ClientCnxn$SendThread.run(zookeeper:ClientCnxn.java):945) > > > >> > > > >> * I guess ZK might rely on timestamp to keep sessions alive, but > we > > > >> have > > > >> NTP daemon running so machine time can get changed > > > >> automatically, is there a conflict? > > > >> > > > >> > > > > > >