Hi Kishore, Kanak, any updates?
On Thu, May 19, 2016 at 4:13 PM, kishore g <g.kish...@gmail.com> wrote:
> Thanks Shawn. Will review it tonight. Kanak, it would be great if you can
> take a look at it as well.
>
> On Thu, May 19, 2016 at 3:45 PM, Neutron sharc <neutronsh...@gmail.com> wrote:
>
>> Hi Helix team,
>>
>> I uploaded a PR to fix this bug: https://github.com/apache/helix/pull/44
>>
>> Thanks.
>>
>> On Wed, May 18, 2016 at 11:01 PM, Neutron sharc <neutronsh...@gmail.com> wrote:
>> > Hi Kanak,
>> >
>> > The same problem with the zk helix lock has re-appeared, and I found some
>> > clues about the potential bug. It causes all threads competing for the
>> > same zk helix lock to become blocked.
>> >
>> > In my test there are two java threads blocked while trying to grab the zk
>> > lock (thread 15 and thread 19).
>> >
>> > Here are the related logs from just before the threads blocked (inlined
>> > with my comments):
>> >
>> > [INFO 2016-05-18 22:19:54,057 com.hcd.hcdadmin.M1Rebalancer:70]
>> > rebalancer thread 15 before zklock
>> >   => T15 enters.
>> >
>> > [DEBUG 2016-05-18 22:19:54,069 org.apache.helix.lock.zk.WriteLock:193]
>> > Created id: /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911661-0000000078
>> >   => T15 creates its znode; T15 is the smallest, so it owns the lock.
>> >
>> > [INFO 2016-05-18 22:19:54,071 com.hcd.hcdadmin.M1Rebalancer:70]
>> > rebalancer thread 19 before zklock
>> >   => T19 enters.
>> >
>> > [INFO 2016-05-18 22:19:54,071 com.hcd.hcdadmin.M1Rebalancer:72]
>> > rebalancer thread 15 start computing for controller host1_admin
>> >   => T15 performs its work.
>> >
>> > [DEBUG 2016-05-18 22:19:54,080 org.apache.helix.lock.zk.WriteLock:193]
>> > Created id: /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911662-0000000079
>> >   => T19 creates its znode.
>> >
>> > [DEBUG 2016-05-18 22:19:54,081 org.apache.helix.lock.zk.WriteLock:233]
>> > watching less than me node:
>> > /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911661-0000000078
>> >   => T19 found its predecessor to wait for, which is T15.
>> >
>> > [WARN 2016-05-18 22:19:54,084 org.apache.helix.lock.zk.WriteLock:239]
>> > Could not find the stats for less than me:
>> > /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911661-0000000078
>> >   => T19 calls zookeeper.exists() to register a watcher on T15's znode,
>> > but T15 has called unlock() and deleted that znode at the same moment,
>> > so no watch is set. T19 then falls through to the while (id == null)
>> > check; because T19's id is not null, the loop exits and T19's
>> > LockZooKeeperOperation.execute() returns false. T19 blocks in wait(),
>> > hoping somebody else will notify it. But T19 now holds the smallest
>> > znode, so nobody else can grab the lock and wake up T19; T19 blocks,
>> > and every subsequent caller blocks behind it.
>> >
>> > The code that leads to the problem is here:
>> > https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/lock/zk/WriteLock.java#L238
>> >
>> > One possible fix is to just set id back to null at line 240 and let the
>> > while() loop retry:
>> > https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/lock/zk/WriteLock.java#L240
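>> >
>> > To make the idea concrete, here is a rough sketch of the change against
>> > the watch-registration branch of LockZooKeeperOperation.execute() (names
>> > are as I read the current WriteLock.java and may be slightly off; the
>> > actual patch in the PR may be shaped differently):
>> >
>> >     // Try to register a watch on the predecessor znode.
>> >     Stat stat = zookeeper.exists(lastChildId, new LockWatcher());
>> >     if (stat != null) {
>> >         // Watch is set; return false so the caller waits until the
>> >         // predecessor is deleted and lockAcquired() fires.
>> >         return Boolean.FALSE;
>> >     } else {
>> >         // The predecessor was deleted between getChildren() and
>> >         // exists(), so no watch was set and nothing will ever wake us.
>> >         LOG.warn("Could not find the stats for less than me: " + lastChildId);
>> >         // Proposed fix: reset id so the enclosing do/while loop runs
>> >         // again, re-reads the children, and either acquires the lock
>> >         // or watches a predecessor that still exists.
>> >         id = null;
>> >     }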
>> >
>> > [INFO 2016-05-18 22:19:54,092 com.hcd.hcdadmin.M1Rebalancer:125]
>> > rebalancer thread 15 released zklock and returned
>> >   => T15 deleted its znode a short while ago and returns from this method.
>> >
>> > [INFO 2016-05-18 22:19:54,179 com.hcd.hcdadmin.M1Rebalancer:70]
>> > rebalancer thread 15 before zklock
>> >   => T15 calls this method again.
>> >
>> > [DEBUG 2016-05-18 22:19:54,191 org.apache.helix.lock.zk.WriteLock:193]
>> > Created id: /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911663-0000000080
>> >   => T15 creates a new znode.
>> >
>> > [DEBUG 2016-05-18 22:19:54,193 org.apache.helix.lock.zk.WriteLock:233]
>> > watching less than me node:
>> > /shawn1/LOCKS/RESOURCE_Pool0/x-72233245264911662-0000000079
>> >   => T15 finds T19's znode to be the smallest, so it waits for T19.
>> > Nobody will ever wake up T19, so T15 is blocked as well.
>> >
>> > Any comments appreciated. Thanks.
>> >
>> > -Neutronsharc
>> >
>> > On Sat, May 14, 2016 at 5:20 PM, Neutron sharc <neutronsh...@gmail.com> wrote:
>> >> We increased the max connections allowed per client on the zk server side.
>> >> The problem is gone now.
>> >>
>> >> On Tue, May 10, 2016 at 2:50 PM, Neutron sharc <neutronsh...@gmail.com> wrote:
>> >>> Hi Kanak, thanks for the reply.
>> >>>
>> >>> The problem goes away if we set a constraint of 1 on "STATE_TRANSITION"
>> >>> for the resource. If we allow multiple state transitions to execute
>> >>> concurrently on the resource, then this zklock problem occurs.
>> >>>
>> >>> Btw, we run multiple participants in the same JVM in our test. In
>> >>> other words, there are multiple java threads in the same JVM competing
>> >>> for the zklock.
>> >>>
>> >>> We haven't profiled ZKHelixLock._listener.lockAcquired() since we
>> >>> bypassed this problem using the constraint. Will revisit it later.
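>> >>>
>> >>> For reference, the constraint we applied looks roughly like the fragment
>> >>> below (sketched against the Helix message-constraint API; the zk address
>> >>> and the constraint id "throttleStateTransition" are placeholders, and the
>> >>> attribute names are from memory):
>> >>>
>> >>>     import org.apache.helix.HelixAdmin;
>> >>>     import org.apache.helix.manager.zk.ZKHelixAdmin;
>> >>>     import org.apache.helix.model.ClusterConstraints.ConstraintAttribute;
>> >>>     import org.apache.helix.model.ClusterConstraints.ConstraintType;
>> >>>     import org.apache.helix.model.builder.ConstraintItemBuilder;
>> >>>
>> >>>     // Allow at most one in-flight STATE_TRANSITION message for the resource.
>> >>>     HelixAdmin admin = new ZKHelixAdmin("zkhost:2181");
>> >>>     ConstraintItemBuilder builder = new ConstraintItemBuilder();
>> >>>     builder.addConstraintAttribute(ConstraintAttribute.MESSAGE_TYPE.toString(), "STATE_TRANSITION");
>> >>>     builder.addConstraintAttribute(ConstraintAttribute.RESOURCE.toString(), "Pool0");
>> >>>     builder.addConstraintAttribute(ConstraintAttribute.CONSTRAINT_VALUE.toString(), "1");
>> >>>     admin.setConstraint("shawn1", ConstraintType.MESSAGE_CONSTRAINT,
>> >>>         "throttleStateTransition", builder.build());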
>> >>>
>> >>> On Mon, May 9, 2016 at 8:28 PM, Kanak Biscuitwala <kana...@hotmail.com> wrote:
>> >>>> Hi,
>> >>>>
>> >>>> ZKHelixLock is a thin wrapper around the ZooKeeper WriteLock recipe
>> >>>> (which was last changed over 5 years ago). We haven't extensively
>> >>>> tested it in production, but we haven't seen it fail to return as
>> >>>> described.
>> >>>>
>> >>>> Do you know if ZKHelixLock._listener.lockAcquired() is ever called?
>> >>>>
>> >>>> Feel free to examine the code here:
>> >>>> https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/lock/zk/ZKHelixLock.java
>> >>>>
>> >>>>> From: neutronsh...@gmail.com
>> >>>>> Date: Mon, 9 May 2016 14:26:43 -0700
>> >>>>> Subject: calling ZKHelixLock from state machine transition
>> >>>>> To: dev@helix.apache.org
>> >>>>>
>> >>>>> Hi Helix team,
>> >>>>>
>> >>>>> We observed an issue in a state machine transition handler:
>> >>>>>
>> >>>>> // statemodel.java:
>> >>>>>
>> >>>>> public void offlineToSlave(Message message, NotificationContext context) {
>> >>>>>   // Do the work to start a local shard, then save the new shard
>> >>>>>   // info to the resource config under the cluster-wide lock.
>> >>>>>   ZKHelixLock zklock = new ZKHelixLock(clusterId, resource, zkclient);
>> >>>>>   try {
>> >>>>>     zklock.lock();  // ==> will block here
>> >>>>>
>> >>>>>     ZNRecord record = zkclient.readData(scope.getZkPath(), true);
>> >>>>>     // ... update record fields ...
>> >>>>>     zkclient.writeData(scope.getZkPath(), record);
>> >>>>>   } finally {
>> >>>>>     zklock.unlock();
>> >>>>>   }
>> >>>>> }
>> >>>>>
>> >>>>> After several invocations of this method, zklock.lock() no longer
>> >>>>> returns (so the lock is not acquired), and the state machine threads
>> >>>>> become blocked.
>> >>>>>
>> >>>>> At the zk path "<cluster>/LOCKS/RESOURCE_resource" I see several
>> >>>>> znodes, representing the outstanding lock requests.
>> >>>>>
>> >>>>> Is there any special care we should take when using the zk lock?
>> >>>>> Thanks.
>> >>>>>
>> >>>>> -neutronsharc
>> >>>>
>>