I forgot to fill in the name of the test that produces the connection errors below: it is testFirstServerDown in Zookeeper_simpleSystem (TestClient.cc).

-Flavio

> On 04 Jul 2016, at 23:53, Flavio Junqueira <[email protected]> wrote:
>
>>
>> On 04 Jul 2016, at 22:01, Michael Han <[email protected]> wrote:
>>
>> Both Java and C unit tests coming with 3.5.2-alpha passed for me in 5 runs.
>> Are the failed tests deterministically reproducible?
>
> They fail consistently for me. When I run xxx, I get this output in the logs,
> which is weird because it looks like the client is trying 127.0.0.1:22181
> only once and after that it only tries 127.0.0.1:22182, it sounds wrong to me:
>
> 2016-07-04 15:04:08,523:33750:ZOO_INFO@zookeeper_init_internal@1111:
> Initiating client connection, host=127.0.0.1:22182,127.0.0.1:22181
> sessionTimeout=10000 watcher=0x447050 sessionId=0 sessionPasswd=<null>
> context=0x7fff8e504910 flags=0
> 2016-07-04 15:04:09,523:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket
> [127.0.0.1:22182] zk retcode=-4, errno=111(Connection refused): server
> refused to accept the client
> 2016-07-04 15:04:09,523:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket
> [127.0.0.1:22181] zk retcode=-4, errno=111(Connection refused): server
> refused to accept the client
> 2016-07-04 15:04:09,523:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket
> [127.0.0.1:22182] zk retcode=-4, errno=111(Connection refused): server
> refused to accept the client
> 2016-07-04 15:04:09,523:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket
> [127.0.0.1:22182] zk retcode=-4, errno=111(Connection refused): server
> refused to accept the client
> 2016-07-04 15:04:09,523:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket
> [127.0.0.1:22182] zk retcode=-4, errno=111(Connection refused): server
> refused to accept the client
> 2016-07-04 15:04:09,523:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket
> [127.0.0.1:22182] zk retcode=-4, errno=111(Connection refused): server
> refused to accept the client
> 2016-07-04 15:04:09,523:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket
> [127.0.0.1:22182] zk retcode=-4, errno=111(Connection refused): server
> refused to accept the client
> 2016-07-04 15:04:09,523:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket
> [127.0.0.1:22182] zk retcode=-4, errno=111(Connection refused): server
> refused to accept the client
> 2016-07-04 15:04:09,523:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket
> [127.0.0.1:22182] zk retcode=-4, errno=111(Connection refused): server
> refused to accept the client
> 2016-07-04 15:04:09,524:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket
> [127.0.0.1:22182] zk retcode=-4, errno=111(Connection refused): server
> refused to accept the client
> 2016-07-04 15:04:09,524:33750:ZOO_ERROR@handle_socket_error_msg@2350: Socket
> [127.0.0.1:22182] zk retcode=-4, errno=111(Connection refused): server
> refused to accept the client
> <This line keeps repeating until the test times out>
>
> Also, if you check ZK-2463, it looks like the multi tests are failing
> silently. They are timing out, but the framework isn't picking it up. I
> haven't had a chance to look at these multi tests to determine whether it is
> timing or what.
>
>> If not, it seems we
>> have more flaky tests related to threading / timing that need to be taken
>> care of, and they don't sound like blockers for the release to me.
>>
>
> From what I can tell, none of these issues are new, so I have no reason to
> suspect that an issue we resolved for 3.5.2 is introducing these problems. If
> we are to be strict, then we cannot release it, but I'd say we benefit from
> it still being alpha and proceed. We are solving a number of issues that are
> good to have out. For 3.5.3, I think we really need to spend some time on the
> C client.
>
> -Flavio
>
>> On Sun, Jul 3, 2016 at 9:48 PM, Rakesh Radhakrishnan <[email protected]>
>> wrote:
>>
>>>>> I'm suggesting as a blocker for 3.5.3, I think we should proceed with
>>>>> 3.5.2 as is and give some love to the C client in the next release.
>>>
>>> Since the current release is alpha I also feel it's OK to go ahead with RC1
>>> and address the C client issue in 3.5.3. That way we'll get more folks
>>> trying it out and stabilize 3.5 version eventually. Probably will listen to
>>> others' opinions as well.
>>>
>>> -Rakesh
>>>
>>> On Mon, Jul 4, 2016 at 12:32 AM, Flavio Junqueira <[email protected]> wrote:
>>>
>>>>
>>>>> On 03 Jul 2016, at 17:53, Chris Nauroth <[email protected]> wrote:
>>>>>
>>>>> For my part, I got a successful full test run from RC1 before starting the
>>>>> [VOTE]. The problem with the silent failure of multi tests could have
>>>>> snuck past me easily though. (Flavio, thank you for filing
>>>>> ZOOKEEPER-2463.) I'm curious to hear test results from others who are
>>>>> trying RC1.
>>>>
>>>> The test failures seem to be related to test timing, not bugs, but I
>>>> haven't been able to confirm for the last two I mentioned. Granted that
>>>> timing is in some sense a bug, all I'm saying is that it doesn't seem to
>>>> indicate a regression or anything.
>>>>
>>>>>
>>>>> It looks like we also need an issue to track updating the copyright notice
>>>>> in the docs. I don't believe this is an ASF compliance problem in the
>>>>> same way that an erroneous NOTICE file would be, so I propose that we
>>>>> address it in 3.5.3.
>>>>
>>>> Agreed, we need an issue for that.
>>>>
>>>>>
>>>>> Flavio, you suggested filing a blocker for the ZooKeeperQuorumServer.cc
>>>>> failure. Did you want that targeted to 3.5.2 or 3.5.3?
>>>>>
>>>>
>>>> I'm suggesting as a blocker for 3.5.3, I think we should proceed with
>>>> 3.5.2 as is and give some love to the C client in the next release.
>>>>
>>>>> Overall, how are people feeling about the RC1 [VOTE] at this point? Is
>>>>> anyone considering a -1, or shall we proceed (keeping in mind it's an
>>>>> alpha) with the intent of fixing things in a more rapid 3.5.3 release
>>>>> cycle?
>>>>
>>>> I'd say we proceed.
>>>>
>>>> -Flavio
>>>>
>>>>>
>>>>> On 7/3/16, 8:43 AM, "Flavio Junqueira" <[email protected]> wrote:
>>>>>
>>>>>> The issue with the TestReconfigServer test is that the client port is
>>>>>> still used and we get a bind exception, which prevents the server from
>>>>>> starting. To verify this locally, I simply added some code to retry and
>>>>>> it works fine with that fix. Going forward we need a better fix.
>>>>>>
>>>>>> I haven't been able to figure out yet the issue with the
>>>>>> Zookeeper_simpleSystem tests.
>>>>>>
>>>>>> I have also found something strange with the multi tests. I have created
>>>>>> ZK-2463 for this problem and made it a blocker for 3.5.3.
>>>>>>
>>>>>> -Flavio
>>>>>>
>>>>>>> On 03 Jul 2016, at 15:25, Flavio Junqueira <[email protected]> wrote:
>>>>>>>
>>>>>>> I have spun a new ubuntu VM to check the C failures. I get three
>>>>>>> failures with the new installation:
>>>>>>>
>>>>>>> Zookeeper_simpleSystem::testFirstServerDown : assertion : elapsed 10911
>>>>>>> tests/TestClient.cc:411: Assertion: equality assertion failed
>>>>>>> [Expected: -101, Actual : -4]
>>>>>>> tests/TestClient.cc:322: Assertion: assertion failed [Expression:
>>>>>>> ctx.waitForConnected(zk)]
>>>>>>> Failures !!!
>>>>>>> Run: 43 Failure total: 2 Failures: 2 Errors: 0
>>>>>>>
>>>>>>> TestReconfigServer::testRemoveFollower/usr/bin/java
>>>>>>> ZooKeeper JMX enabled by default
>>>>>>> Using config: ./../../build/test/test-cppunit/conf/0.conf
>>>>>>> Starting zookeeper ... FAILED TO START
>>>>>>> zktest-mt: tests/ZooKeeperQuorumServer.cc:61: void
>>>>>>> ZooKeeperQuorumServer::start(): Assertion `system(command.c_str()) == 0'
>>>>>>> failed.
>>>>>>> /bin/bash: line 5: 47059 Aborted (core dumped)
>>>>>>> ZKROOT=./../.. CLASSPATH=$CLASSPATH:$CLOVER_HOME/lib/clover.jar
>>>>>>> ${dir}$tst
>>>>>>>
>>>>>>> -Flavio
>>>>>>>
>>>>>>>> On 03 Jul 2016, at 15:19, Edward Ribeiro <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi Flavio,
>>>>>>>>
>>>>>>>> On Sun, Jul 3, 2016 at 5:54 AM, Flavio Junqueira <[email protected]>
>>>>>>>> wrote:
>>>>>>>> Hey Eddie,
>>>>>>>>
>>>>>>>> A few comments on your points:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> - the copyright notice is still dating "2008-2013". Is it worth
>>>>>>>>> updating to the current year?
>>>>>>>>
>>>>>>>> Where are you seeing this? The NOTICE file is correct from what I can
>>>>>>>> see.
>>>>>>>>
>>>>>>>> Oops, sorry. I was referring to the PDFs and HTMLs in the docs/
>>>>>>>> folder. Even after running "ant docs" the footnote has "2008-2013"
>>>>>>>> copyright. Images attached.
>>>>>>>>
>>>>>>>>> - I consistently ran into a test error equal to the one at
>>>>>>>>> https://builds.apache.org/job/ZooKeeper-trunk/2982/console
>>>>>>>>
>>>>>>>> I think this is ZK-2152, which Chris has moved to 3.5.3, so even
>>>>>>>> though it isn't ideal, it is expected.
>>>>>>>>
>>>>>>>> Got it. :)
>>>>>>>>
>>>>>>>>> - Also this one:
>>>>>>>>>
>>>>>>>>> https://mail-archives.apache.org/mod_mbox/zookeeper-dev/201601.mbox/%3C1279938263.1283.1453526737790.JavaMail.jenkins@crius%3E
>>>>>>>>>
>>>>>>>>
>>>>>>>> I don't know if there is a jira for this one. If not, better create
>>>>>>>> one and make it a blocker.
>>>>>>>>
>>>>>>>> Okay, gonna look for and do this.
>>>>>>>>
>>>>>>>>> - In fact, there were 14 failing tests total (I suspect all of them
>>>>>>>>> related to the C tests). Any ideas? A couple of flaky tests?
>>>>>>>>>
>>>>>>>>
>>>>>>>> In general, having a release with so many tests failing is bad. I
>>>>>>>> didn't get these test failures, so it would be great to report them or
>>>>>>>> make sure that there are jiras for them.
>>>>>>>>
>>>>>>>> Right. I was only skeptical of my own tests because I ran the unit
>>>>>>>> tests on a relatively old Ubuntu version, even though it was Java 1.7.
>>>>>>>> So, I am running the tests on a newer Linux soon just to make sure it
>>>>>>>> was not a false negative.
>>>>>>>>
>>>>>>>> Test failures are possibly an indication that something is bad with
>>>>>>>> the RC, so I wouldn't have +1'ed it if I had observed all those. It
>>>>>>>> might be ok given that this is still labeled alpha.
>>>>>>>>
>>>>>>>> Excuse me. I only +1'ed because I suspect the errors are restricted
>>>>>>>> to the C binding and my Ubuntu version, etc. But I should have
>>>>>>>> researched further before giving +1, nevertheless. Point taken. :)
>>>>>>>>
>>>>>>>> Edward
>>
>> --
>> Cheers
>> Michael.
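
A note on the TestReconfigServer workaround mentioned in the thread: the retry code Flavio added locally is not shown anywhere above, so what follows is only a generic sketch of the idea (retry the bind while the port is still held), written in Python rather than in the C++ test harness. The helper name bind_with_retry and its parameters are made up for illustration and do not come from the ZooKeeper sources.

```python
import errno
import socket
import time


def bind_with_retry(port, host="127.0.0.1", attempts=5, delay=0.2):
    """Bind a TCP socket, retrying while the address is still in use.

    Hypothetical helper for illustration only; not taken from the
    ZooKeeper test code.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            return sock
        except OSError as exc:
            sock.close()
            # Re-raise unrelated errors, or give up once the retry
            # budget is spent.
            if exc.errno != errno.EADDRINUSE or attempt == attempts:
                raise
            time.sleep(delay)
```

A fixed delay like this only papers over the race between the old server releasing the port and the new one starting, which is presumably why the message says a better fix is needed going forward.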
