Re: PLEASE REVIEW - Major re-work of Watcher wrappers

Jordan Zimmerman Mon, 08 Feb 2016 09:09:09 -0800

Well, you could add back an alternate version KillSession that works as it used 
it. I’m OK with that.


-JZ

> On Feb 8, 2016, at 12:06 PM, Scott Blum <[email protected]> wrote:
> 
> Okay, I took a look, and I couldn't see how to use timing.forSessionSleep() 
> in this test correctly.  It's not the test harness I'm trying to block.  What 
> I'm trying to test involves seeing how TreeCache reacts to a real session 
> loss.  The fact that KillSession isn't a real session loss anymore is 
> therefore impeding my ability to test what I was trying to test before.
> 
> Let me explain.  The following manual test does what I want:
> 
> 1) Connect a TreeCache to a remote ZK server through an ssh tunnel.
> 2) Follow the node creation steps in testKilledSession()
> 3) Kill the tunnel, wait for session loss, restore the tunnel.
> 
> So I guess my question is, is there any way to "really" kill the session, 
> server side?  Or should I formulate this test completely differently?
> 
> Another way to look at this is that I can't synchronize my test steps to 
> TreeCache's internal operations effectively.
> 
> 
> On Sun, Feb 7, 2016 at 10:36 PM, Scott Blum <[email protected] 
> <mailto:[email protected]>> wrote:
> Oooh... let me try that.
> 
> On Sun, Feb 7, 2016 at 8:29 PM, Jordan Zimmerman <[email protected] 
> <mailto:[email protected]>> wrote:
> FYI - I added a new method to Timing to help with this:
> forSessionSleep()
> The new version of KillSession inserts an eventOfDeath directly into the 
> client. So, no, it’s not a real session kill like it used to be. But, now 
> it’s more reliable. Use timing.forSessionSleep() to wait for session timeout.
> 
> -JZ
> 
> 
>> On Feb 7, 2016, at 8:21 PM, Scott Blum <[email protected] 
>> <mailto:[email protected]>> wrote:
>> 
>> Can you describe the change then? Because kill session doesn't seem to now 
>> ensure that ephemeral nodes bound the the killed session disappear in a 
>> timely manner
>> 
>> On Feb 7, 2016 8:03 PM, "Jordan Zimmerman" <[email protected] 
>> <mailto:[email protected]>> wrote:
>>> Are we using a new zookeeper?
>> 
>> 
>> In Curator 3.0 with “new” connection state handling which simulates a 
>> Session timeout. However, all tests are run twice. First with the old 
>> handling and then with the new.
>> 
>>> Or did something change with our implementation of KillSession.kill()?
>> 
>> KillSession did change though. A change I made to ZK got added in 3.5 and we 
>> now use that.
>> 
>> -Jordan
>> 
>>> On Feb 7, 2016, at 1:55 PM, Scott Blum <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>> 
>>> Are we using a new zookeeper?  Or did something change with our 
>>> implementation of KillSession.kill()?
>>> 
>>> Or maybe there's a timing issue with Curator's ConnectionStateManager State 
>>> change: LOST?  I don't understand how we could get a LOST event without the 
>>> ephemeral node attached to that session having disappeared?
>>> 
>>> On Sun, Feb 7, 2016 at 10:59 AM, Jordan Zimmerman 
>>> <[email protected] <mailto:[email protected]>> wrote:
>>> I don’t know if anything changed in ZooKeeper itself. I know that the 
>>> connection states changed in Curator, but Curator now tests both the old 
>>> mode and the new mode and they both fail here.
>>> 
>>> -JZ
>>> 
>>>> On Feb 7, 2016, at 1:06 AM, Scott Blum <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>> 
>>>> I need to analyze this a bit deeper, but what I'm seeing on the 3.0 branch 
>>>> is that the ephemeral node /test/me created in testKilledSession() really 
>>>> isn't disappearing when it should.
>>>> 
>>>> After the session loss and the reconnect, /test still shows 2 children 
>>>> [foo, me] and /test/me still returns a node.
>>>> 
>>>> Any idea why the timing here would have changed?
>>>> 
>>>> On Sat, Feb 6, 2016 at 1:41 PM, Scott Blum <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>> https://issues.apache.org/jira/browse/CURATOR-302 
>>>> <https://issues.apache.org/jira/browse/CURATOR-302>
>>>> 
>>>> I need to trace through what's really going on under the hood rather than 
>>>> band-aid the test.  Should be able to in next couple of days.
>>>> 
>>>> On Sat, Feb 6, 2016 at 7:43 AM, Jordan Zimmerman 
>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>> Any update on this? I think you should create a Jira for it.
>>>> 
>>>> -Jordan
>>>> 
>>>> 
>>>>> On Feb 5, 2016, at 12:30 PM, Scott Blum <[email protected] 
>>>>> <mailto:[email protected]>> wrote:
>>>>> 
>>>>> BTW, this test passes on master... so it's some kind of 3.0 vs. master 
>>>>> issue.  I think I'm going to just have to dump in a ton of log messages 
>>>>> and see what differs.
>>>>> 
>>>>> On Fri, Feb 5, 2016 at 12:25 PM, Jordan Zimmerman 
>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>> OK - please create a new Issue in Jira for this.
>>>>> 
>>>>> -Jordan
>>>>> 
>>>>>> On Feb 5, 2016, at 12:24 PM, Scott Blum <[email protected] 
>>>>>> <mailto:[email protected]>> wrote:
>>>>>> 
>>>>>> BTW: this is broken on CURATOR-3.0 as well, so it appears to have been 
>>>>>> broken for a while.  Maybe I'll have to git bisect...
>>>>>> 
>>>>>> On Fri, Feb 5, 2016 at 12:22 PM, Scott Blum <[email protected] 
>>>>>> <mailto:[email protected]>> wrote:
>>>>>> Okay, so I looked into this for a bit, and I hit kind of a wall.  I 
>>>>>> think there is a legit bug/race in TreeCache, and the following patch 
>>>>>> *should* remedy:
>>>>>> 
>>>>>> diff --git 
>>>>>> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>>  
>>>>>> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>> index df4403c..a4a022b 100644
>>>>>> --- 
>>>>>> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>> +++ 
>>>>>> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>> @@ -303,7 +303,6 @@ public class TreeCache implements Closeable
>>>>>>          void wasDeleted() throws Exception
>>>>>>          {
>>>>>>              ChildData oldChildData = childData.getAndSet(null);
>>>>>> -            
>>>>>> client.watches().remove(this).ofType(WatcherType.Any).locally().inBackground().forPath(path);
>>>>>>              ConcurrentMap<String, TreeNode> childMap = 
>>>>>> children.getAndSet(null);
>>>>>>              if ( childMap != null )
>>>>>>              {
>>>>>> @@ -807,8 +806,16 @@ public class TreeCache implements Closeable
>>>>>>          case RECONNECTED:
>>>>>>              try
>>>>>>              {
>>>>>> +                outstandingOps.incrementAndGet();
>>>>>>                  root.wasReconnected();
>>>>>>                  
>>>>>> publishEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>>>>> +                if ( outstandingOps.decrementAndGet() == 0 )
>>>>>> +                {
>>>>>> +                    if ( isInitialized.compareAndSet(false, true) )
>>>>>> +                    {
>>>>>> +                        publishEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>> +                    }
>>>>>> +                }
>>>>>>              }
>>>>>>              catch ( Exception e )
>>>>>>              {
>>>>>> 
>>>>>> That should guarantee that the initialized event gets deferred until all 
>>>>>> outstanding refreshes finish.. but it's not.  Something seems to have 
>>>>>> changed under the hood in how background events are getting sent to 
>>>>>> TreeCache, and I don't really understand it yet.  And running the 
>>>>>> debugger seems to affect the timing, like something racy is going on. :(
>>>>>> 
>>>>>> 
>>>>>> On Fri, Feb 5, 2016 at 11:57 AM, Scott Blum <[email protected] 
>>>>>> <mailto:[email protected]>> wrote:
>>>>>> Ok, that is kind of weird.  I'll take a look.
>>>>>> 
>>>>>> On Fri, Feb 5, 2016 at 4:58 AM, Jordan Zimmerman 
>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>> No, sorry. The last few lines of the test currently are:
>>>>>> 
>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", 
>>>>>> "data".getBytes());
>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>> 
>>>>>> This fails. But, if I switch them it works:
>>>>>> 
>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", 
>>>>>> "data".getBytes());
>>>>>> 
>>>>>>> On Feb 5, 2016, at 2:57 AM, Scott Blum <[email protected] 
>>>>>>> <mailto:[email protected]>> wrote:
>>>>>>> 
>>>>>>> So you end up with 2 initialized events?
>>>>>>> 
>>>>>>> You mean this?
>>>>>>> 
>>>>>>>          assertEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>>>>>> +        assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>          assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", 
>>>>>>> "data".getBytes());
>>>>>>>          assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>> 
>>>>>>> Seems weird if there are two, but I can help look.
>>>>>>> 
>>>>>>> On Thu, Feb 4, 2016 at 10:48 PM, Jordan Zimmerman 
>>>>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>>>> Hey Scott,
>>>>>>> 
>>>>>>> In this branch, TestTreeCache.testKilledSession() is failing at:
>>>>>>> 
>>>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", 
>>>>>>> "data".getBytes());
>>>>>>> 
>>>>>>> However, if I change the two asserts to:
>>>>>>> 
>>>>>>>         assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", 
>>>>>>> "data".getBytes());
>>>>>>> 
>>>>>>> it works. Does that make any sense?
>>>>>>> 
>>>>>>> -Jordan
>>>>>>> 
>>>>>>> > On Feb 4, 2016, at 9:23 PM, Jordan Zimmerman 
>>>>>>> > <[email protected] <mailto:[email protected]>> 
>>>>>>> > wrote:
>>>>>>> >
>>>>>>> > Devs,
>>>>>>> >
>>>>>>> > In trying to fix the bad log message "Failed to find watcher” (which 
>>>>>>> > turns out to be a ZK client issue), I realize that the 
>>>>>>> > NamespaceWatcher and WatcherWrapper stuff could be improved. I’m 
>>>>>>> > still working on getting all tests to pass but I’d appreciate more 
>>>>>>> > sets of eyes on this change. Please review carefully if you can.
>>>>>>> >
>>>>>>> > https://github.com/apache/curator/pull/131 
>>>>>>> > <https://github.com/apache/curator/pull/131>
>>>>>>> >
>>>>>>> > -Jordan
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
> 
> 
>

Re: PLEASE REVIEW - Major re-work of Watcher wrappers

Reply via email to