Okay, I took a look, and I couldn't see how to use timing.forSessionSleep() in this test correctly. It's not the test harness I'm trying to block. What I'm trying to test involves seeing how TreeCache reacts to a *real* session loss. The fact that KillSession isn't a real session loss anymore is therefore impeding my ability to test what I was trying to test before.
Let me explain. The following manual test does what I want: 1) Connect a TreeCache to a remote ZK server through an ssh tunnel. 2) Follow the node creation steps in testKilledSession() 3) Kill the tunnel, wait for session loss, restore the tunnel. So I guess my question is, is there any way to "really" kill the session, server side? Or should I formulate this test completely differently? Another way to look at this is that I can't synchronize my test steps to TreeCache's internal operations effectively. On Sun, Feb 7, 2016 at 10:36 PM, Scott Blum <[email protected]> wrote: > Oooh... let me try that. > > On Sun, Feb 7, 2016 at 8:29 PM, Jordan Zimmerman < > [email protected]> wrote: > >> FYI - I added a new method to Timing to help with this: >> >> forSessionSleep() >> >> The new version of KillSession inserts an eventOfDeath directly into the >> client. So, no, it’s not a real session kill like it used to be. But, now >> it’s more reliable. Use timing.forSessionSleep() to wait for session >> timeout. >> >> -JZ >> >> >> On Feb 7, 2016, at 8:21 PM, Scott Blum <[email protected]> wrote: >> >> Can you describe the change then? Because kill session doesn't seem to >> now ensure that ephemeral nodes bound the the killed session disappear in a >> timely manner >> On Feb 7, 2016 8:03 PM, "Jordan Zimmerman" <[email protected]> >> wrote: >> >>> Are we using a new zookeeper? >>> >>> >>> In Curator 3.0 with “new” connection state handling which simulates a >>> Session timeout. However, all tests are run twice. First with the old >>> handling and then with the new. >>> >>> Or did something change with our implementation of KillSession.kill()? >>> >>> >>> KillSession did change though. A change I made to ZK got added in 3.5 >>> and we now use that. >>> >>> -Jordan >>> >>> On Feb 7, 2016, at 1:55 PM, Scott Blum <[email protected]> wrote: >>> >>> Are we using a new zookeeper? Or did something change with our >>> implementation of KillSession.kill()? >>> >>> Or maybe there's a timing issue with Curator's ConnectionStateManager >>> State change: LOST? I don't understand how we could get a LOST event >>> without the ephemeral node attached to that session having disappeared? >>> >>> On Sun, Feb 7, 2016 at 10:59 AM, Jordan Zimmerman < >>> [email protected]> wrote: >>> >>>> I don’t know if anything changed in ZooKeeper itself. I know that the >>>> connection states changed in Curator, but Curator now tests both the old >>>> mode and the new mode and they both fail here. >>>> >>>> -JZ >>>> >>>> On Feb 7, 2016, at 1:06 AM, Scott Blum <[email protected]> wrote: >>>> >>>> I need to analyze this a bit deeper, but what I'm seeing on the 3.0 >>>> branch is that the ephemeral node /test/me created in >>>> testKilledSession() really isn't disappearing when it should. >>>> >>>> After the session loss and the reconnect, /test still shows 2 children >>>> [foo, me] and /test/me still returns a node. >>>> >>>> Any idea why the timing here would have changed? >>>> >>>> On Sat, Feb 6, 2016 at 1:41 PM, Scott Blum <[email protected]> >>>> wrote: >>>> >>>>> https://issues.apache.org/jira/browse/CURATOR-302 >>>>> >>>>> I need to trace through what's really going on under the hood rather >>>>> than band-aid the test. Should be able to in next couple of days. >>>>> >>>>> On Sat, Feb 6, 2016 at 7:43 AM, Jordan Zimmerman < >>>>> [email protected]> wrote: >>>>> >>>>>> Any update on this? I think you should create a Jira for it. >>>>>> >>>>>> -Jordan >>>>>> >>>>>> >>>>>> On Feb 5, 2016, at 12:30 PM, Scott Blum <[email protected]> >>>>>> wrote: >>>>>> >>>>>> BTW, this test passes on master... so it's some kind of 3.0 vs. >>>>>> master issue. I think I'm going to just have to dump in a ton of log >>>>>> messages and see what differs. >>>>>> >>>>>> On Fri, Feb 5, 2016 at 12:25 PM, Jordan Zimmerman < >>>>>> [email protected]> wrote: >>>>>> >>>>>>> OK - please create a new Issue in Jira for this. >>>>>>> >>>>>>> -Jordan >>>>>>> >>>>>>> On Feb 5, 2016, at 12:24 PM, Scott Blum <[email protected]> >>>>>>> wrote: >>>>>>> >>>>>>> BTW: this is broken on CURATOR-3.0 as well, so it appears to have >>>>>>> been broken for a while. Maybe I'll have to git bisect... >>>>>>> >>>>>>> On Fri, Feb 5, 2016 at 12:22 PM, Scott Blum <[email protected]> >>>>>>> wrote: >>>>>>> >>>>>>>> Okay, so I looked into this for a bit, and I hit kind of a wall. I >>>>>>>> think there is a legit bug/race in TreeCache, and the following patch >>>>>>>> *should* remedy: >>>>>>>> >>>>>>>> diff --git >>>>>>>> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java >>>>>>>> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java >>>>>>>> index df4403c..a4a022b 100644 >>>>>>>> --- >>>>>>>> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java >>>>>>>> +++ >>>>>>>> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java >>>>>>>> @@ -303,7 +303,6 @@ public class TreeCache implements Closeable >>>>>>>> void wasDeleted() throws Exception >>>>>>>> { >>>>>>>> ChildData oldChildData = childData.getAndSet(null); >>>>>>>> - >>>>>>>> >>>>>>>> client.watches().remove(this).ofType(WatcherType.Any).locally().inBackground().forPath(path); >>>>>>>> ConcurrentMap<String, TreeNode> childMap = >>>>>>>> children.getAndSet(null); >>>>>>>> if ( childMap != null ) >>>>>>>> { >>>>>>>> @@ -807,8 +806,16 @@ public class TreeCache implements Closeable >>>>>>>> case RECONNECTED: >>>>>>>> try >>>>>>>> { >>>>>>>> + outstandingOps.incrementAndGet(); >>>>>>>> root.wasReconnected(); >>>>>>>> >>>>>>>> publishEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED); >>>>>>>> + if ( outstandingOps.decrementAndGet() == 0 ) >>>>>>>> + { >>>>>>>> + if ( isInitialized.compareAndSet(false, true) ) >>>>>>>> + { >>>>>>>> + >>>>>>>> publishEvent(TreeCacheEvent.Type.INITIALIZED); >>>>>>>> + } >>>>>>>> + } >>>>>>>> } >>>>>>>> catch ( Exception e ) >>>>>>>> { >>>>>>>> >>>>>>>> That should guarantee that the initialized event gets deferred >>>>>>>> until all outstanding refreshes finish.. but it's not. Something >>>>>>>> seems to >>>>>>>> have changed under the hood in how background events are getting sent >>>>>>>> to >>>>>>>> TreeCache, and I don't really understand it yet. And running the >>>>>>>> debugger >>>>>>>> seems to affect the timing, like something racy is going on. :( >>>>>>>> >>>>>>>> >>>>>>>> On Fri, Feb 5, 2016 at 11:57 AM, Scott Blum <[email protected]> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Ok, that is kind of weird. I'll take a look. >>>>>>>>> >>>>>>>>> On Fri, Feb 5, 2016 at 4:58 AM, Jordan Zimmerman < >>>>>>>>> [email protected]> wrote: >>>>>>>>> >>>>>>>>>> No, sorry. The last few lines of the test currently are: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", >>>>>>>>>> "data".getBytes()); >>>>>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED); >>>>>>>>>> >>>>>>>>>> This fails. But, if I switch them it works: >>>>>>>>>> >>>>>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED); >>>>>>>>>> >>>>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", >>>>>>>>>> "data".getBytes()); >>>>>>>>>> >>>>>>>>>> On Feb 5, 2016, at 2:57 AM, Scott Blum <[email protected]> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> So you end up with 2 initialized events? >>>>>>>>>> >>>>>>>>>> You mean this? >>>>>>>>>> >>>>>>>>>> assertEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED); >>>>>>>>>> + assertEvent(TreeCacheEvent.Type.INITIALIZED); >>>>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, >>>>>>>>>> "/test/me", "data".getBytes()); >>>>>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED); >>>>>>>>>> >>>>>>>>>> Seems weird if there are two, but I can help look. >>>>>>>>>> >>>>>>>>>> On Thu, Feb 4, 2016 at 10:48 PM, Jordan Zimmerman < >>>>>>>>>> [email protected]> wrote: >>>>>>>>>> >>>>>>>>>>> Hey Scott, >>>>>>>>>>> >>>>>>>>>>> In this branch, TestTreeCache.testKilledSession() is failing at: >>>>>>>>>>> >>>>>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, >>>>>>>>>>> "/test/me", "data".getBytes()); >>>>>>>>>>> >>>>>>>>>>> However, if I change the two asserts to: >>>>>>>>>>> >>>>>>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED); >>>>>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, >>>>>>>>>>> "/test/me", "data".getBytes()); >>>>>>>>>>> >>>>>>>>>>> it works. Does that make any sense? >>>>>>>>>>> >>>>>>>>>>> -Jordan >>>>>>>>>>> >>>>>>>>>>> > On Feb 4, 2016, at 9:23 PM, Jordan Zimmerman < >>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>> > >>>>>>>>>>> > Devs, >>>>>>>>>>> > >>>>>>>>>>> > In trying to fix the bad log message "Failed to find watcher” >>>>>>>>>>> (which turns out to be a ZK client issue), I realize that the >>>>>>>>>>> NamespaceWatcher and WatcherWrapper stuff could be improved. I’m >>>>>>>>>>> still >>>>>>>>>>> working on getting all tests to pass but I’d appreciate more sets >>>>>>>>>>> of eyes >>>>>>>>>>> on this change. Please review carefully if you can. >>>>>>>>>>> > >>>>>>>>>>> > https://github.com/apache/curator/pull/131 >>>>>>>>>>> > >>>>>>>>>>> > -Jordan >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>> >>>> >>>> >>> >>> >> >
