Re: PLEASE REVIEW - Major re-work of Watcher wrappers

Scott Blum Mon, 08 Feb 2016 09:07:43 -0800

Okay, I took a look, and I couldn't see how to use timing.forSessionSleep()
in this test correctly.  It's not the test harness I'm trying to block.
What I'm trying to test involves seeing how TreeCache reacts to a *real*
session loss.  The fact that KillSession isn't a real session loss anymore
is therefore impeding my ability to test what I was trying to test before.


Let me explain.  The following manual test does what I want:

1) Connect a TreeCache to a remote ZK server through an ssh tunnel.
2) Follow the node creation steps in testKilledSession()
3) Kill the tunnel, wait for session loss, restore the tunnel.

So I guess my question is, is there any way to "really" kill the session,
server side?  Or should I formulate this test completely differently?

Another way to look at this is that I can't synchronize my test steps to
TreeCache's internal operations effectively.


On Sun, Feb 7, 2016 at 10:36 PM, Scott Blum <[email protected]> wrote:

> Oooh... let me try that.
>
> On Sun, Feb 7, 2016 at 8:29 PM, Jordan Zimmerman <
> [email protected]> wrote:
>
>> FYI - I added a new method to Timing to help with this:
>>
>> forSessionSleep()
>>
>> The new version of KillSession inserts an eventOfDeath directly into the
>> client. So, no, it’s not a real session kill like it used to be. But, now
>> it’s more reliable. Use timing.forSessionSleep() to wait for session
>> timeout.
>>
>> -JZ
>>
>>
>> On Feb 7, 2016, at 8:21 PM, Scott Blum <[email protected]> wrote:
>>
>> Can you describe the change then? Because kill session doesn't seem to
>> now ensure that ephemeral nodes bound the the killed session disappear in a
>> timely manner
>> On Feb 7, 2016 8:03 PM, "Jordan Zimmerman" <[email protected]>
>> wrote:
>>
>>> Are we using a new zookeeper?
>>>
>>>
>>> In Curator 3.0 with “new” connection state handling which simulates a
>>> Session timeout. However, all tests are run twice. First with the old
>>> handling and then with the new.
>>>
>>> Or did something change with our implementation of KillSession.kill()?
>>>
>>>
>>> KillSession did change though. A change I made to ZK got added in 3.5
>>> and we now use that.
>>>
>>> -Jordan
>>>
>>> On Feb 7, 2016, at 1:55 PM, Scott Blum <[email protected]> wrote:
>>>
>>> Are we using a new zookeeper?  Or did something change with our
>>> implementation of KillSession.kill()?
>>>
>>> Or maybe there's a timing issue with Curator's ConnectionStateManager
>>> State change: LOST?  I don't understand how we could get a LOST event
>>> without the ephemeral node attached to that session having disappeared?
>>>
>>> On Sun, Feb 7, 2016 at 10:59 AM, Jordan Zimmerman <
>>> [email protected]> wrote:
>>>
>>>> I don’t know if anything changed in ZooKeeper itself. I know that the
>>>> connection states changed in Curator, but Curator now tests both the old
>>>> mode and the new mode and they both fail here.
>>>>
>>>> -JZ
>>>>
>>>> On Feb 7, 2016, at 1:06 AM, Scott Blum <[email protected]> wrote:
>>>>
>>>> I need to analyze this a bit deeper, but what I'm seeing on the 3.0
>>>> branch is that the ephemeral node /test/me created in
>>>> testKilledSession() really isn't disappearing when it should.
>>>>
>>>> After the session loss and the reconnect, /test still shows 2 children
>>>> [foo, me] and /test/me still returns a node.
>>>>
>>>> Any idea why the timing here would have changed?
>>>>
>>>> On Sat, Feb 6, 2016 at 1:41 PM, Scott Blum <[email protected]>
>>>> wrote:
>>>>
>>>>> https://issues.apache.org/jira/browse/CURATOR-302
>>>>>
>>>>> I need to trace through what's really going on under the hood rather
>>>>> than band-aid the test.  Should be able to in next couple of days.
>>>>>
>>>>> On Sat, Feb 6, 2016 at 7:43 AM, Jordan Zimmerman <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Any update on this? I think you should create a Jira for it.
>>>>>>
>>>>>> -Jordan
>>>>>>
>>>>>>
>>>>>> On Feb 5, 2016, at 12:30 PM, Scott Blum <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> BTW, this test passes on master... so it's some kind of 3.0 vs.
>>>>>> master issue.  I think I'm going to just have to dump in a ton of log
>>>>>> messages and see what differs.
>>>>>>
>>>>>> On Fri, Feb 5, 2016 at 12:25 PM, Jordan Zimmerman <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> OK - please create a new Issue in Jira for this.
>>>>>>>
>>>>>>> -Jordan
>>>>>>>
>>>>>>> On Feb 5, 2016, at 12:24 PM, Scott Blum <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> BTW: this is broken on CURATOR-3.0 as well, so it appears to have
>>>>>>> been broken for a while.  Maybe I'll have to git bisect...
>>>>>>>
>>>>>>> On Fri, Feb 5, 2016 at 12:22 PM, Scott Blum <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Okay, so I looked into this for a bit, and I hit kind of a wall.  I
>>>>>>>> think there is a legit bug/race in TreeCache, and the following patch
>>>>>>>> *should* remedy:
>>>>>>>>
>>>>>>>> diff --git
>>>>>>>> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>>>> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>>>> index df4403c..a4a022b 100644
>>>>>>>> ---
>>>>>>>> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>>>> +++
>>>>>>>> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>>>> @@ -303,7 +303,6 @@ public class TreeCache implements Closeable
>>>>>>>>          void wasDeleted() throws Exception
>>>>>>>>          {
>>>>>>>>              ChildData oldChildData = childData.getAndSet(null);
>>>>>>>> -
>>>>>>>>  
>>>>>>>> client.watches().remove(this).ofType(WatcherType.Any).locally().inBackground().forPath(path);
>>>>>>>>              ConcurrentMap<String, TreeNode> childMap =
>>>>>>>> children.getAndSet(null);
>>>>>>>>              if ( childMap != null )
>>>>>>>>              {
>>>>>>>> @@ -807,8 +806,16 @@ public class TreeCache implements Closeable
>>>>>>>>          case RECONNECTED:
>>>>>>>>              try
>>>>>>>>              {
>>>>>>>> +                outstandingOps.incrementAndGet();
>>>>>>>>                  root.wasReconnected();
>>>>>>>>
>>>>>>>>  publishEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>>>>>>> +                if ( outstandingOps.decrementAndGet() == 0 )
>>>>>>>> +                {
>>>>>>>> +                    if ( isInitialized.compareAndSet(false, true) )
>>>>>>>> +                    {
>>>>>>>> +
>>>>>>>>  publishEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>> +                    }
>>>>>>>> +                }
>>>>>>>>              }
>>>>>>>>              catch ( Exception e )
>>>>>>>>              {
>>>>>>>>
>>>>>>>> That should guarantee that the initialized event gets deferred
>>>>>>>> until all outstanding refreshes finish.. but it's not.  Something 
>>>>>>>> seems to
>>>>>>>> have changed under the hood in how background events are getting sent 
>>>>>>>> to
>>>>>>>> TreeCache, and I don't really understand it yet.  And running the 
>>>>>>>> debugger
>>>>>>>> seems to affect the timing, like something racy is going on. :(
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Feb 5, 2016 at 11:57 AM, Scott Blum <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Ok, that is kind of weird.  I'll take a look.
>>>>>>>>>
>>>>>>>>> On Fri, Feb 5, 2016 at 4:58 AM, Jordan Zimmerman <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> No, sorry. The last few lines of the test currently are:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", 
>>>>>>>>>> "data".getBytes());
>>>>>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>>
>>>>>>>>>> This fails. But, if I switch them it works:
>>>>>>>>>>
>>>>>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>>
>>>>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", 
>>>>>>>>>> "data".getBytes());
>>>>>>>>>>
>>>>>>>>>> On Feb 5, 2016, at 2:57 AM, Scott Blum <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> So you end up with 2 initialized events?
>>>>>>>>>>
>>>>>>>>>> You mean this?
>>>>>>>>>>
>>>>>>>>>>          assertEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>>>>>>>>> +        assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>>          assertEvent(TreeCacheEvent.Type.NODE_REMOVED,
>>>>>>>>>> "/test/me", "data".getBytes());
>>>>>>>>>>          assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>>
>>>>>>>>>> Seems weird if there are two, but I can help look.
>>>>>>>>>>
>>>>>>>>>> On Thu, Feb 4, 2016 at 10:48 PM, Jordan Zimmerman <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey Scott,
>>>>>>>>>>>
>>>>>>>>>>> In this branch, TestTreeCache.testKilledSession() is failing at:
>>>>>>>>>>>
>>>>>>>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED,
>>>>>>>>>>> "/test/me", "data".getBytes());
>>>>>>>>>>>
>>>>>>>>>>> However, if I change the two asserts to:
>>>>>>>>>>>
>>>>>>>>>>>         assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED,
>>>>>>>>>>> "/test/me", "data".getBytes());
>>>>>>>>>>>
>>>>>>>>>>> it works. Does that make any sense?
>>>>>>>>>>>
>>>>>>>>>>> -Jordan
>>>>>>>>>>>
>>>>>>>>>>> > On Feb 4, 2016, at 9:23 PM, Jordan Zimmerman <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>> >
>>>>>>>>>>> > Devs,
>>>>>>>>>>> >
>>>>>>>>>>> > In trying to fix the bad log message "Failed to find watcher”
>>>>>>>>>>> (which turns out to be a ZK client issue), I realize that the
>>>>>>>>>>> NamespaceWatcher and WatcherWrapper stuff could be improved. I’m 
>>>>>>>>>>> still
>>>>>>>>>>> working on getting all tests to pass but I’d appreciate more sets 
>>>>>>>>>>> of eyes
>>>>>>>>>>> on this change. Please review carefully if you can.
>>>>>>>>>>> >
>>>>>>>>>>> > https://github.com/apache/curator/pull/131
>>>>>>>>>>> >
>>>>>>>>>>> > -Jordan
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>

Re: PLEASE REVIEW - Major re-work of Watcher wrappers

Reply via email to