Can you describe the change then? Because kill session doesn't seem to now ensure that ephemeral nodes bound the the killed session disappear in a timely manner On Feb 7, 2016 8:03 PM, "Jordan Zimmerman" <[email protected]> wrote:
> Are we using a new zookeeper? > > > In Curator 3.0 with “new” connection state handling which simulates a > Session timeout. However, all tests are run twice. First with the old > handling and then with the new. > > Or did something change with our implementation of KillSession.kill()? > > > KillSession did change though. A change I made to ZK got added in 3.5 and > we now use that. > > -Jordan > > On Feb 7, 2016, at 1:55 PM, Scott Blum <[email protected]> wrote: > > Are we using a new zookeeper? Or did something change with our > implementation of KillSession.kill()? > > Or maybe there's a timing issue with Curator's ConnectionStateManager > State change: LOST? I don't understand how we could get a LOST event > without the ephemeral node attached to that session having disappeared? > > On Sun, Feb 7, 2016 at 10:59 AM, Jordan Zimmerman < > [email protected]> wrote: > >> I don’t know if anything changed in ZooKeeper itself. I know that the >> connection states changed in Curator, but Curator now tests both the old >> mode and the new mode and they both fail here. >> >> -JZ >> >> On Feb 7, 2016, at 1:06 AM, Scott Blum <[email protected]> wrote: >> >> I need to analyze this a bit deeper, but what I'm seeing on the 3.0 >> branch is that the ephemeral node /test/me created in testKilledSession() >> really >> isn't disappearing when it should. >> >> After the session loss and the reconnect, /test still shows 2 children >> [foo, me] and /test/me still returns a node. >> >> Any idea why the timing here would have changed? >> >> On Sat, Feb 6, 2016 at 1:41 PM, Scott Blum <[email protected]> wrote: >> >>> https://issues.apache.org/jira/browse/CURATOR-302 >>> >>> I need to trace through what's really going on under the hood rather >>> than band-aid the test. Should be able to in next couple of days. >>> >>> On Sat, Feb 6, 2016 at 7:43 AM, Jordan Zimmerman < >>> [email protected]> wrote: >>> >>>> Any update on this? I think you should create a Jira for it. >>>> >>>> -Jordan >>>> >>>> >>>> On Feb 5, 2016, at 12:30 PM, Scott Blum <[email protected]> wrote: >>>> >>>> BTW, this test passes on master... so it's some kind of 3.0 vs. master >>>> issue. I think I'm going to just have to dump in a ton of log messages and >>>> see what differs. >>>> >>>> On Fri, Feb 5, 2016 at 12:25 PM, Jordan Zimmerman < >>>> [email protected]> wrote: >>>> >>>>> OK - please create a new Issue in Jira for this. >>>>> >>>>> -Jordan >>>>> >>>>> On Feb 5, 2016, at 12:24 PM, Scott Blum <[email protected]> wrote: >>>>> >>>>> BTW: this is broken on CURATOR-3.0 as well, so it appears to have been >>>>> broken for a while. Maybe I'll have to git bisect... >>>>> >>>>> On Fri, Feb 5, 2016 at 12:22 PM, Scott Blum <[email protected]> >>>>> wrote: >>>>> >>>>>> Okay, so I looked into this for a bit, and I hit kind of a wall. I >>>>>> think there is a legit bug/race in TreeCache, and the following patch >>>>>> *should* remedy: >>>>>> >>>>>> diff --git >>>>>> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java >>>>>> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java >>>>>> index df4403c..a4a022b 100644 >>>>>> --- >>>>>> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java >>>>>> +++ >>>>>> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java >>>>>> @@ -303,7 +303,6 @@ public class TreeCache implements Closeable >>>>>> void wasDeleted() throws Exception >>>>>> { >>>>>> ChildData oldChildData = childData.getAndSet(null); >>>>>> - >>>>>> >>>>>> client.watches().remove(this).ofType(WatcherType.Any).locally().inBackground().forPath(path); >>>>>> ConcurrentMap<String, TreeNode> childMap = >>>>>> children.getAndSet(null); >>>>>> if ( childMap != null ) >>>>>> { >>>>>> @@ -807,8 +806,16 @@ public class TreeCache implements Closeable >>>>>> case RECONNECTED: >>>>>> try >>>>>> { >>>>>> + outstandingOps.incrementAndGet(); >>>>>> root.wasReconnected(); >>>>>> >>>>>> publishEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED); >>>>>> + if ( outstandingOps.decrementAndGet() == 0 ) >>>>>> + { >>>>>> + if ( isInitialized.compareAndSet(false, true) ) >>>>>> + { >>>>>> + >>>>>> publishEvent(TreeCacheEvent.Type.INITIALIZED); >>>>>> + } >>>>>> + } >>>>>> } >>>>>> catch ( Exception e ) >>>>>> { >>>>>> >>>>>> That should guarantee that the initialized event gets deferred until >>>>>> all outstanding refreshes finish.. but it's not. Something seems to have >>>>>> changed under the hood in how background events are getting sent to >>>>>> TreeCache, and I don't really understand it yet. And running the >>>>>> debugger >>>>>> seems to affect the timing, like something racy is going on. :( >>>>>> >>>>>> >>>>>> On Fri, Feb 5, 2016 at 11:57 AM, Scott Blum <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> Ok, that is kind of weird. I'll take a look. >>>>>>> >>>>>>> On Fri, Feb 5, 2016 at 4:58 AM, Jordan Zimmerman < >>>>>>> [email protected]> wrote: >>>>>>> >>>>>>>> No, sorry. The last few lines of the test currently are: >>>>>>>> >>>>>>>> >>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", >>>>>>>> "data".getBytes()); >>>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED); >>>>>>>> >>>>>>>> This fails. But, if I switch them it works: >>>>>>>> >>>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED); >>>>>>>> >>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", >>>>>>>> "data".getBytes()); >>>>>>>> >>>>>>>> On Feb 5, 2016, at 2:57 AM, Scott Blum <[email protected]> >>>>>>>> wrote: >>>>>>>> >>>>>>>> So you end up with 2 initialized events? >>>>>>>> >>>>>>>> You mean this? >>>>>>>> >>>>>>>> assertEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED); >>>>>>>> + assertEvent(TreeCacheEvent.Type.INITIALIZED); >>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", >>>>>>>> "data".getBytes()); >>>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED); >>>>>>>> >>>>>>>> Seems weird if there are two, but I can help look. >>>>>>>> >>>>>>>> On Thu, Feb 4, 2016 at 10:48 PM, Jordan Zimmerman < >>>>>>>> [email protected]> wrote: >>>>>>>> >>>>>>>>> Hey Scott, >>>>>>>>> >>>>>>>>> In this branch, TestTreeCache.testKilledSession() is failing at: >>>>>>>>> >>>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", >>>>>>>>> "data".getBytes()); >>>>>>>>> >>>>>>>>> However, if I change the two asserts to: >>>>>>>>> >>>>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED); >>>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", >>>>>>>>> "data".getBytes()); >>>>>>>>> >>>>>>>>> it works. Does that make any sense? >>>>>>>>> >>>>>>>>> -Jordan >>>>>>>>> >>>>>>>>> > On Feb 4, 2016, at 9:23 PM, Jordan Zimmerman < >>>>>>>>> [email protected]> wrote: >>>>>>>>> > >>>>>>>>> > Devs, >>>>>>>>> > >>>>>>>>> > In trying to fix the bad log message "Failed to find watcher” >>>>>>>>> (which turns out to be a ZK client issue), I realize that the >>>>>>>>> NamespaceWatcher and WatcherWrapper stuff could be improved. I’m still >>>>>>>>> working on getting all tests to pass but I’d appreciate more sets of >>>>>>>>> eyes >>>>>>>>> on this change. Please review carefully if you can. >>>>>>>>> > >>>>>>>>> > https://github.com/apache/curator/pull/131 >>>>>>>>> > >>>>>>>>> > -Jordan >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>>> >>>> >>>> >>> >> >> > >
