TestInterProcessSemaphoreCluster.testCluster() is failling (assertion at line 294). Again, it seems like some sort of race condition with the watcher removal.
When I run it in Eclipse, it fails maybe 25% of the time. When it fails it seems that it's got something to do with watcher removal. When the test passes, this error is not logged. org.apache.zookeeper.KeeperException$NoWatcherException: KeeperErrorCode = No such watcher for /foo/bar/lock/leases at org.apache.zookeeper.ZooKeeper$ZKWatchManager.containsWatcher(ZooKeeper.java:377) at org.apache.zookeeper.ZooKeeper$ZKWatchManager.removeWatcher(ZooKeeper.java:252) at org.apache.zookeeper.WatchDeregistration.unregister(WatchDeregistration.java:58) at org.apache.zookeeper.ClientCnxn.finishPacket(ClientCnxn.java:712) at org.apache.zookeeper.ClientCnxn.access$1500(ClientCnxn.java:97) at org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:948) at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:99) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1236) Is it possible it's something to do with the way that the cluster is restarted at line 282? The old cluster is not shutdown, a new one is just created. On Wed, Jun 1, 2016 at 10:44 AM, Jordan Zimmerman < [email protected]> wrote: > I’ll try to address this as part of CURATOR-333 > > > On May 31, 2016, at 7:08 PM, Cameron McKenzie <[email protected]> > wrote: > > > > Maybe we need to look at some way of providing a hook for tests to wait > > reliably for asynch tasks to finish? > > > > The latest round of tests ran OK. One test failed on an unrelated thing > > (ConnectionLoss), but this appears to be a transient thing as it's worked > > ok the next time around. > > > > I will start getting a release together. Thanks for you help with the > > updated tests. > > cheers > > > > On Wed, Jun 1, 2016 at 9:12 AM, Jordan Zimmerman < > [email protected] > >> wrote: > > > >> The problem is in-flight watchers and async background calls. There’s no > >> way to cancel these and they can take time to occur - even after a > recipe > >> instance is closed. > >> > >> -Jordan > >> > >>> On May 31, 2016, at 5:11 PM, Cameron McKenzie <[email protected]> > >> wrote: > >>> > >>> Ok, running it again now. > >>> > >>> Is the problem that the watcher clean up for the recipes is done > >>> asynchronously after they are closed? > >>> > >>> On Wed, Jun 1, 2016 at 1:35 AM, Jordan Zimmerman < > >> [email protected] > >>>> wrote: > >>> > >>>> OK - please try now. I added a loop in the “no watchers” checker. If > >> there > >>>> are remaining watchers, it will sleep a bit and try again. > >>>> > >>>> -Jordan > >>>> > >>>>> On May 31, 2016, at 1:33 AM, Cameron McKenzie < > [email protected]> > >>>> wrote: > >>>>> > >>>>> Looks like these failures are intermittent. Running them directly in > >>>>> Eclipse they seem to be passing. I will run the whole thing again in > >> the > >>>>> morning and see how it goes. > >>>>> > >>>>> On Tue, May 31, 2016 at 2:29 PM, Cameron McKenzie < > >>>> [email protected]> > >>>>> wrote: > >>>>> > >>>>>> There are still 2 tests failing for me: > >>>>>> > >>>>>> FAILURE! - in > >>>>>> org.apache.curator.framework.recipes.cache.TestPathChildrenCache > >>>>>> > >>>> > >> > testKilledSession(org.apache.curator.framework.recipes.cache.TestPathChildrenCache) > >>>>>> Time elapsed: 17.488 sec <<< FAILURE! > >>>>>> java.lang.AssertionError: One or more child watchers are still > >>>> registered: > >>>>>> [/test] > >>>>>> at > >>>>>> > >>>> > >> > org.apache.curator.framework.imps.TestCleanState.closeAndTestClean(TestCleanState.java:53) > >>>>>> at > >>>>>> > >>>> > >> > org.apache.curator.framework.recipes.cache.TestPathChildrenCache.testKilledSession(TestPathChildrenCache.java:707) > >>>>>> > >>>>>> FAILURE! - in > >>>>>> > >>>> > >> > org.apache.curator.framework.recipes.locks.TestInterProcessSemaphoreCluster > >>>>>> > >>>> > >> > testCluster(org.apache.curator.framework.recipes.locks.TestInterProcessSemaphoreCluster) > >>>>>> Time elapsed: 96.641 sec <<< FAILURE! > >>>>>> java.lang.AssertionError: expected [true] but found [false] > >>>>>> at org.testng.Assert.fail(Assert.java:94) > >>>>>> at org.testng.Assert.failNotEquals(Assert.java:494) > >>>>>> at org.testng.Assert.assertTrue(Assert.java:42) > >>>>>> at org.testng.Assert.assertTrue(Assert.java:52) > >>>>>> at > >>>>>> > >>>> > >> > org.apache.curator.framework.recipes.locks.TestInterProcessSemaphoreCluster.testCluster(TestInterProcessSemaphoreCluster.java:294) > >>>>>> > >>>>>> Failed tests: > >>>>>> > >>>>>> > >>>> > >> > org.apache.curator.framework.recipes.cache.TestPathChildrenCache.testKilledSession(org.apache.curator.framework.recipes.cache.TestPathChildrenCache) > >>>>>> Run 1: TestPathChildrenCache.testKilledSession:707 One or more child > >>>>>> watchers are still registered: [/test] > >>>>>> Run 2: PASS > >>>>>> > >>>>>> TestInterProcessSemaphoreCluster.testCluster:294 expected [true] but > >>>>>> found [false] > >>>>>> > >>>>>> Tests run: 495, Failures: 2, Errors: 0, Skipped: 0 > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> On Tue, May 31, 2016 at 12:52 PM, Cameron McKenzie < > >>>> [email protected] > >>>>>>> wrote: > >>>>>> > >>>>>>> Thanks, CURATOR-332 wasn't pushed. I will run the tests against > that, > >>>> and > >>>>>>> if it's all good will merge into CURATOR-3.0 > >>>>>>> cheers > >>>>>>> > >>>>>>> On Tue, May 31, 2016 at 12:32 PM, Jordan Zimmerman < > >>>>>>> [email protected]> wrote: > >>>>>>> > >>>>>>>> Actually - I don’t remember if branch CURATOR-332 is merged yet. I > >>>>>>>> made/pushed my changes in CURATOR-332 > >>>>>>>> > >>>>>>>> -jordan > >>>>>>>> > >>>>>>>>> On May 26, 2016, at 10:04 PM, Cameron McKenzie < > >>>> [email protected]> > >>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>> I'm still seeing 6 failed tests that seem related to the same > stuff > >>>>>>>> after > >>>>>>>>> merging your fix: > >>>>>>>>> > >>>>>>>>> Failed tests: > >>>>>>>>> > >>>>>>>> > >>>> > >> > org.apache.curator.framework.recipes.cache.TestPathChildrenCache.testBasics(org.apache.curator.framework.recipes.cache.TestPathChildrenCache) > >>>>>>>>> Run 1: TestPathChildrenCache.testBasics:863 One or more child > >>>> watchers > >>>>>>>>> are still registered: [/test] > >>>>>>>>> Run 2: TestPathChildrenCache.testBasics:863 One or more child > >>>> watchers > >>>>>>>>> are still registered: [/test] > >>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>> > >> > org.apache.curator.framework.recipes.cache.TestPathChildrenCache.testBasicsOnTwoCachesWithSameExecutor(org.apache.curator.framework.recipes.cache.TestPathChildrenCache) > >>>>>>>>> Run 1: > >>>> TestPathChildrenCache.testBasicsOnTwoCachesWithSameExecutor:934 > >>>>>>>>> One or more child watchers are still registered: [/test] > >>>>>>>>> Run 2: > >>>> TestPathChildrenCache.testBasicsOnTwoCachesWithSameExecutor:934 > >>>>>>>>> One or more child watchers are still registered: [/test] > >>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>> > >> > org.apache.curator.framework.recipes.cache.TestPathChildrenCache.testEnsurePath(org.apache.curator.framework.recipes.cache.TestPathChildrenCache) > >>>>>>>>> Run 1: TestPathChildrenCache.testEnsurePath:363 One or more child > >>>>>>>>> watchers are still registered: [/one/two/three] > >>>>>>>>> Run 2: TestPathChildrenCache.testEnsurePath:363 One or more child > >>>>>>>>> watchers are still registered: [/one/two/three] > >>>>>>>>> > >>>>>>>>> TestInterProcessSemaphoreCluster.testCluster:294 expected [true] > >> but > >>>>>>>>> found [false] > >>>>>>>>> > >>>>>>>> > >>>> > >> > org.apache.curator.framework.recipes.shared.TestSharedCount.testMultiClientVersioned(org.apache.curator.framework.recipes.shared.TestSharedCount) > >>>>>>>>> Run 1: PASS > >>>>>>>>> Run 2: TestSharedCount.testMultiClientVersioned:256 One or more > >> data > >>>>>>>>> watchers are still registered: [/count] > >>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>> > >> > org.apache.curator.framework.recipes.shared.TestSharedCount.testSimple(org.apache.curator.framework.recipes.shared.TestSharedCount) > >>>>>>>>> Run 1: TestSharedCount.testSimple:174 One or more data watchers > are > >>>>>>>> still > >>>>>>>>> registered: [/count] > >>>>>>>>> Run 2: TestSharedCount.testSimple:174 One or more data watchers > are > >>>>>>>> still > >>>>>>>>> registered: [/count] > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> Tests run: 491, Failures: 6, Errors: 0, Skipped: 0 > >>>>>>>>> > >>>>>>>>> > >>>>>>>>> On Fri, May 27, 2016 at 3:30 AM, Jordan Zimmerman < > >>>>>>>>> [email protected]> wrote: > >>>>>>>>> > >>>>>>>>>> I see the problem. The fix is not simple though so I’ll spend > some > >>>>>>>> time on > >>>>>>>>>> it. The TL;DR is that exists watchers are still supposed to get > >> set > >>>>>>>> when > >>>>>>>>>> there is a KeeperException.NoNode and the code isn’t handling > it. > >>>> But, > >>>>>>>>>> while I was looking at the code I realized there are some > >>>> significant > >>>>>>>>>> additional problems. Curator, here, is trying to mirror what > >>>>>>>> ZooKeeper does > >>>>>>>>>> internally which is insanely complicated. In hindsight, the > whole > >> ZK > >>>>>>>>>> watcher mechanism should’ve been decoupled from the mutator > APIs. > >>>>>>>> But, of > >>>>>>>>>> course, that’s easy for me to say now. > >>>>>>>>>> > >>>>>>>>>> -Jordan > >>>>>>>>>> > >>>>>>>>>>> On May 26, 2016, at 1:10 AM, Cameron McKenzie < > >>>>>>>> [email protected]> > >>>>>>>>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>> Thanks Scott, > >>>>>>>>>>> Those tests are now passing for me. > >>>>>>>>>>> > >>>>>>>>>>> Jordan, testNodeCache:testBasics() is failing consistently on > the > >>>> 3.0 > >>>>>>>>>>> branch. It appears that this is actually potentially a bug in > the > >>>>>>>>>>> NodeCache. It ends up leaking a Watcher reference. I've had a > >> quick > >>>>>>>> look > >>>>>>>>>>> through, but I haven't dived in in any detail. It's the > >>>>>>>>>>> WatcherRemovalManager stuff I think. If you've got time, can > you > >>>>>>>> have a > >>>>>>>>>>> look? If not, let me know and I'll do some more digging. > >>>>>>>>>>> > >>>>>>>>>>> cheers > >>>>>>>>>>> > >>>>>>>>>>> On Thu, May 26, 2016 at 11:47 AM, Cameron McKenzie < > >>>>>>>>>> [email protected]> > >>>>>>>>>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>>> Thanks Scott. > >>>>>>>>>>>> > >>>>>>>>>>>> Push the fix to master and merge it into 3.0. > >>>>>>>>>>>> > >>>>>>>>>>>> Then I guess, I'll push new versions of 2.11 and 3.2 onto > Nexus. > >>>>>>>>>>>> cheers > >>>>>>>>>>>> > >>>>>>>>>>>> On Thu, May 26, 2016 at 11:44 AM, Scott Blum < > >>>> [email protected] > >>>>>>>>> > >>>>>>>>>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> Alright, I have a fix, but it wants to be applied to both > >> master > >>>>>>>> and > >>>>>>>>>> 3.0. > >>>>>>>>>>>>> Where should I push the fix? > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Wed, May 25, 2016 at 6:10 PM, Cameron McKenzie < > >>>>>>>>>> [email protected] > >>>>>>>>>>>>>> > >>>>>>>>>>>>> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>>> Thanks Scott, > >>>>>>>>>>>>>> If you just checkout the CURATOR-3.0 branch, they are > failing > >>>>>>>> there. > >>>>>>>>>>>>>> cheers > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Thu, May 26, 2016 at 2:06 AM, Scott Blum < > >>>>>>>> [email protected]> > >>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Sure, what SHA are they failing at Cam? > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> On Wed, May 25, 2016 at 9:36 AM, Jordan Zimmerman < > >>>>>>>>>>>>>>> [email protected]> wrote: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Scott can you take a look? > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> -Jordan > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> On May 25, 2016, at 4:35 AM, Cameron McKenzie < > >>>>>>>>>>>>>> [email protected]> > >>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Tree cache tests are still failing. I've tried a few > times > >>>> but > >>>>>>>> no > >>>>>>>>>>>>>> love: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> > >>>>>>>> TestTreeCacheEventOrdering>TestEventOrdering.testEventOrdering:151 > >>>>>>>>>>>>>>>> actual 6 > >>>>>>>>>>>>>>>>> expected -31: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> I will have a look into what's going on in the morning. > >> Given > >>>>>>>> that > >>>>>>>>>>>>>>> these > >>>>>>>>>>>>>>>>> may take some messing about to fix up, do we just want to > >>>> vote > >>>>>>>> on > >>>>>>>>>>>>>>> 2.11.0 > >>>>>>>>>>>>>>>>> separately, as that is all ready to go? > >>>>>>>>>>>>>>>>> cheers > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> On Wed, May 25, 2016 at 5:34 PM, Jordan Zimmerman < > >>>>>>>>>>>>>>>>> [email protected]> wrote: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Great news. Thanks. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> ==================== > >>>>>>>>>>>>>>>>>> Jordan Zimmerman > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> On May 25, 2016, at 12:37 AM, Cameron McKenzie < > >>>>>>>>>>>>>>> [email protected] > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> I have fixed up the test case, all good now. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> On Wed, May 25, 2016 at 1:45 PM, Cameron McKenzie < > >>>>>>>>>>>>>>>>>> [email protected]> > >>>>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> Looks like it was introduced with the schema > validation > >>>>>>>> stuff. > >>>>>>>>>>>>> It > >>>>>>>>>>>>>>> now > >>>>>>>>>>>>>>>>>> does > >>>>>>>>>>>>>>>>>>>> an ACL check prior to the backgrounding call. Because > >> the > >>>>>>>> unit > >>>>>>>>>>>>>> test > >>>>>>>>>>>>>>>>>> uses a > >>>>>>>>>>>>>>>>>>>> bogus ACL provider it just throws an exception > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> final String adjustedPath = > >>>>>>>>>>>>>>>>>>>> adjustPath(client.fixForNamespace(givenPath, > >>>>>>>>>>>>>>>>>> createMode.isSequential())); > >>>>>>>>>>>>>>>>>>>> List<ACL> aclList = acling.getAclList(adjustedPath); > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>> > >> client.getSchemaSet().getSchema(givenPath).validateCreate(createMode, > >>>>>>>>>>>>>>>>>> data, > >>>>>>>>>>>>>>>>>>>> aclList); > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> String returnPath = null; > >>>>>>>>>>>>>>>>>>>> if ( backgrounding.inBackground() ) > >>>>>>>>>>>>>>>>>>>> { > >>>>>>>>>>>>>>>>>>>> pathInBackground(adjustedPath, data, givenPath); > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> So, I guess the answer is to get the test to force a > >>>> failure > >>>>>>>>>>>>> in a > >>>>>>>>>>>>>>>>>>>> different way. With the UnhandledErrorListener, the > >>>>>>>>>>>>> expectation is > >>>>>>>>>>>>>>>> that > >>>>>>>>>>>>>>>>>>>> this only gets called on backgrounding operations? > >>>>>>>>>>>>>>>>>>>> cheers > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> On Wed, May 25, 2016 at 1:39 PM, Cameron McKenzie < > >>>>>>>>>>>>>>>>>> [email protected]> > >>>>>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> Same on the master branch, but it passes there, so > >> maybe > >>>>>>>>>>>>>> something > >>>>>>>>>>>>>>>> has > >>>>>>>>>>>>>>>>>>>>> legitimately broken the test. Will let you know if I > >> get > >>>>>>>>>>>>> stuck. > >>>>>>>>>>>>>>>>>>>>> cheers > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> On Wed, May 25, 2016 at 1:36 PM, Jordan Zimmerman < > >>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Let me know if you need help. > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> It might be a bad merge. Have you compared it to the > >>>>>>>> master > >>>>>>>>>>>>>>> branch? > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> -JZ > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>>> On May 24, 2016, at 10:31 PM, Cameron McKenzie < > >>>>>>>>>>>>>>>>>> [email protected]> > >>>>>>>>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> Guys, > >>>>>>>>>>>>>>>>>>>>>>> There's a test > >>>> TestFrameworkBackground:testErrorListener > >>>>>>>>>>>>> that > >>>>>>>>>>>>>> is > >>>>>>>>>>>>>>>>>>>>>> failing > >>>>>>>>>>>>>>>>>>>>>>> consistently on the CURATOR-3.0 branch. > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> I can't see how it has ever worked. It seems to try > >> and > >>>>>>>>>>>>> provoke > >>>>>>>>>>>>>>> an > >>>>>>>>>>>>>>>>>>>>>> error > >>>>>>>>>>>>>>>>>>>>>>> via a bad ACL provider. > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> But this ACL provider is called by the > >>>> CreateBuilderImpl > >>>>>>>>>>>>> prior > >>>>>>>>>>>>>> to > >>>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>>>>>>>> backgrounding call, which means that the exception > >> that > >>>>>>>> it > >>>>>>>>>>>>>> throws > >>>>>>>>>>>>>>>>>>>>>> happens > >>>>>>>>>>>>>>>>>>>>>>> in the main Thread of the unit test. So, it just > >> throws > >>>>>>>> an > >>>>>>>>>>>>>>>>>>>>>>> UnsupportedOperationException which is propogated > up > >>>> the > >>>>>>>>>>>>> stack > >>>>>>>>>>>>>> at > >>>>>>>>>>>>>>>>>> which > >>>>>>>>>>>>>>>>>>>>>>> point the unit test fails. > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> So, I will look at fixing this, but I just don't > >>>>>>>> understand > >>>>>>>>>>>>> how > >>>>>>>>>>>>>>> it > >>>>>>>>>>>>>>>>>> ever > >>>>>>>>>>>>>>>>>>>>>>> worked? > >>>>>>>>>>>>>>>>>>>>>>> cheers > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>> > >>>> > >> > >> > >
