That totally fixed it! Thanks, Jordan! https://github.com/apache/curator/pull/132
On Mon, Feb 8, 2016 at 12:08 PM, Jordan Zimmerman < [email protected]> wrote: > Well, you could add back an alternate version KillSession that works as it > used it. I’m OK with that. > > -JZ > > > On Feb 8, 2016, at 12:06 PM, Scott Blum <[email protected]> wrote: > > Okay, I took a look, and I couldn't see how to use > timing.forSessionSleep() in this test correctly. It's not the test harness > I'm trying to block. What I'm trying to test involves seeing how TreeCache > reacts to a *real* session loss. The fact that KillSession isn't a real > session loss anymore is therefore impeding my ability to test what I was > trying to test before. > > Let me explain. The following manual test does what I want: > > 1) Connect a TreeCache to a remote ZK server through an ssh tunnel. > 2) Follow the node creation steps in testKilledSession() > 3) Kill the tunnel, wait for session loss, restore the tunnel. > > So I guess my question is, is there any way to "really" kill the session, > server side? Or should I formulate this test completely differently? > > Another way to look at this is that I can't synchronize my test steps to > TreeCache's internal operations effectively. > > > On Sun, Feb 7, 2016 at 10:36 PM, Scott Blum <[email protected]> wrote: > >> Oooh... let me try that. >> >> On Sun, Feb 7, 2016 at 8:29 PM, Jordan Zimmerman < >> [email protected]> wrote: >> >>> FYI - I added a new method to Timing to help with this: >>> >>> forSessionSleep() >>> >>> The new version of KillSession inserts an eventOfDeath directly into the >>> client. So, no, it’s not a real session kill like it used to be. But, now >>> it’s more reliable. Use timing.forSessionSleep() to wait for session >>> timeout. >>> >>> -JZ >>> >>> >>> On Feb 7, 2016, at 8:21 PM, Scott Blum <[email protected]> wrote: >>> >>> Can you describe the change then? Because kill session doesn't seem to >>> now ensure that ephemeral nodes bound the the killed session disappear in a >>> timely manner >>> On Feb 7, 2016 8:03 PM, "Jordan Zimmerman" <[email protected]> >>> wrote: >>> >>>> Are we using a new zookeeper? >>>> >>>> >>>> In Curator 3.0 with “new” connection state handling which simulates a >>>> Session timeout. However, all tests are run twice. First with the old >>>> handling and then with the new. >>>> >>>> Or did something change with our implementation of KillSession.kill()? >>>> >>>> >>>> KillSession did change though. A change I made to ZK got added in 3.5 >>>> and we now use that. >>>> >>>> -Jordan >>>> >>>> On Feb 7, 2016, at 1:55 PM, Scott Blum <[email protected]> wrote: >>>> >>>> Are we using a new zookeeper? Or did something change with our >>>> implementation of KillSession.kill()? >>>> >>>> Or maybe there's a timing issue with Curator's ConnectionStateManager >>>> State change: LOST? I don't understand how we could get a LOST event >>>> without the ephemeral node attached to that session having disappeared? >>>> >>>> On Sun, Feb 7, 2016 at 10:59 AM, Jordan Zimmerman < >>>> [email protected]> wrote: >>>> >>>>> I don’t know if anything changed in ZooKeeper itself. I know that the >>>>> connection states changed in Curator, but Curator now tests both the old >>>>> mode and the new mode and they both fail here. >>>>> >>>>> -JZ >>>>> >>>>> On Feb 7, 2016, at 1:06 AM, Scott Blum <[email protected]> wrote: >>>>> >>>>> I need to analyze this a bit deeper, but what I'm seeing on the 3.0 >>>>> branch is that the ephemeral node /test/me created in >>>>> testKilledSession() really isn't disappearing when it should. >>>>> >>>>> After the session loss and the reconnect, /test still shows 2 children >>>>> [foo, me] and /test/me still returns a node. >>>>> >>>>> Any idea why the timing here would have changed? >>>>> >>>>> On Sat, Feb 6, 2016 at 1:41 PM, Scott Blum <[email protected]> >>>>> wrote: >>>>> >>>>>> https://issues.apache.org/jira/browse/CURATOR-302 >>>>>> >>>>>> I need to trace through what's really going on under the hood rather >>>>>> than band-aid the test. Should be able to in next couple of days. >>>>>> >>>>>> On Sat, Feb 6, 2016 at 7:43 AM, Jordan Zimmerman < >>>>>> [email protected]> wrote: >>>>>> >>>>>>> Any update on this? I think you should create a Jira for it. >>>>>>> >>>>>>> -Jordan >>>>>>> >>>>>>> >>>>>>> On Feb 5, 2016, at 12:30 PM, Scott Blum <[email protected]> >>>>>>> wrote: >>>>>>> >>>>>>> BTW, this test passes on master... so it's some kind of 3.0 vs. >>>>>>> master issue. I think I'm going to just have to dump in a ton of log >>>>>>> messages and see what differs. >>>>>>> >>>>>>> On Fri, Feb 5, 2016 at 12:25 PM, Jordan Zimmerman < >>>>>>> [email protected]> wrote: >>>>>>> >>>>>>>> OK - please create a new Issue in Jira for this. >>>>>>>> >>>>>>>> -Jordan >>>>>>>> >>>>>>>> On Feb 5, 2016, at 12:24 PM, Scott Blum <[email protected]> >>>>>>>> wrote: >>>>>>>> >>>>>>>> BTW: this is broken on CURATOR-3.0 as well, so it appears to have >>>>>>>> been broken for a while. Maybe I'll have to git bisect... >>>>>>>> >>>>>>>> On Fri, Feb 5, 2016 at 12:22 PM, Scott Blum <[email protected]> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Okay, so I looked into this for a bit, and I hit kind of a wall. >>>>>>>>> I think there is a legit bug/race in TreeCache, and the following >>>>>>>>> patch >>>>>>>>> *should* remedy: >>>>>>>>> >>>>>>>>> diff --git >>>>>>>>> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java >>>>>>>>> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java >>>>>>>>> index df4403c..a4a022b 100644 >>>>>>>>> --- >>>>>>>>> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java >>>>>>>>> +++ >>>>>>>>> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java >>>>>>>>> @@ -303,7 +303,6 @@ public class TreeCache implements Closeable >>>>>>>>> void wasDeleted() throws Exception >>>>>>>>> { >>>>>>>>> ChildData oldChildData = childData.getAndSet(null); >>>>>>>>> - >>>>>>>>> >>>>>>>>> client.watches().remove(this).ofType(WatcherType.Any).locally().inBackground().forPath(path); >>>>>>>>> ConcurrentMap<String, TreeNode> childMap = >>>>>>>>> children.getAndSet(null); >>>>>>>>> if ( childMap != null ) >>>>>>>>> { >>>>>>>>> @@ -807,8 +806,16 @@ public class TreeCache implements Closeable >>>>>>>>> case RECONNECTED: >>>>>>>>> try >>>>>>>>> { >>>>>>>>> + outstandingOps.incrementAndGet(); >>>>>>>>> root.wasReconnected(); >>>>>>>>> >>>>>>>>> publishEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED); >>>>>>>>> + if ( outstandingOps.decrementAndGet() == 0 ) >>>>>>>>> + { >>>>>>>>> + if ( isInitialized.compareAndSet(false, true) >>>>>>>>> ) >>>>>>>>> + { >>>>>>>>> + >>>>>>>>> publishEvent(TreeCacheEvent.Type.INITIALIZED); >>>>>>>>> + } >>>>>>>>> + } >>>>>>>>> } >>>>>>>>> catch ( Exception e ) >>>>>>>>> { >>>>>>>>> >>>>>>>>> That should guarantee that the initialized event gets deferred >>>>>>>>> until all outstanding refreshes finish.. but it's not. Something >>>>>>>>> seems to >>>>>>>>> have changed under the hood in how background events are getting sent >>>>>>>>> to >>>>>>>>> TreeCache, and I don't really understand it yet. And running the >>>>>>>>> debugger >>>>>>>>> seems to affect the timing, like something racy is going on. :( >>>>>>>>> >>>>>>>>> >>>>>>>>> On Fri, Feb 5, 2016 at 11:57 AM, Scott Blum <[email protected] >>>>>>>>> > wrote: >>>>>>>>> >>>>>>>>>> Ok, that is kind of weird. I'll take a look. >>>>>>>>>> >>>>>>>>>> On Fri, Feb 5, 2016 at 4:58 AM, Jordan Zimmerman < >>>>>>>>>> [email protected]> wrote: >>>>>>>>>> >>>>>>>>>>> No, sorry. The last few lines of the test currently are: >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", >>>>>>>>>>> "data".getBytes()); >>>>>>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED); >>>>>>>>>>> >>>>>>>>>>> This fails. But, if I switch them it works: >>>>>>>>>>> >>>>>>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED); >>>>>>>>>>> >>>>>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", >>>>>>>>>>> "data".getBytes()); >>>>>>>>>>> >>>>>>>>>>> On Feb 5, 2016, at 2:57 AM, Scott Blum <[email protected]> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>> So you end up with 2 initialized events? >>>>>>>>>>> >>>>>>>>>>> You mean this? >>>>>>>>>>> >>>>>>>>>>> assertEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED) >>>>>>>>>>> ; >>>>>>>>>>> + assertEvent(TreeCacheEvent.Type.INITIALIZED); >>>>>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, >>>>>>>>>>> "/test/me", "data".getBytes()); >>>>>>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED); >>>>>>>>>>> >>>>>>>>>>> Seems weird if there are two, but I can help look. >>>>>>>>>>> >>>>>>>>>>> On Thu, Feb 4, 2016 at 10:48 PM, Jordan Zimmerman < >>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hey Scott, >>>>>>>>>>>> >>>>>>>>>>>> In this branch, TestTreeCache.testKilledSession() is failing at: >>>>>>>>>>>> >>>>>>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, >>>>>>>>>>>> "/test/me", "data".getBytes()); >>>>>>>>>>>> >>>>>>>>>>>> However, if I change the two asserts to: >>>>>>>>>>>> >>>>>>>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED); >>>>>>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, >>>>>>>>>>>> "/test/me", "data".getBytes()); >>>>>>>>>>>> >>>>>>>>>>>> it works. Does that make any sense? >>>>>>>>>>>> >>>>>>>>>>>> -Jordan >>>>>>>>>>>> >>>>>>>>>>>> > On Feb 4, 2016, at 9:23 PM, Jordan Zimmerman < >>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>> > >>>>>>>>>>>> > Devs, >>>>>>>>>>>> > >>>>>>>>>>>> > In trying to fix the bad log message "Failed to find watcher” >>>>>>>>>>>> (which turns out to be a ZK client issue), I realize that the >>>>>>>>>>>> NamespaceWatcher and WatcherWrapper stuff could be improved. I’m >>>>>>>>>>>> still >>>>>>>>>>>> working on getting all tests to pass but I’d appreciate more sets >>>>>>>>>>>> of eyes >>>>>>>>>>>> on this change. Please review carefully if you can. >>>>>>>>>>>> > >>>>>>>>>>>> > https://github.com/apache/curator/pull/131 >>>>>>>>>>>> > >>>>>>>>>>>> > -Jordan >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>> >>>>> >>>> >>>> >>> >> > >
