Thanks! I assigned 2778 to myself.

ZOOKEEPER-2778: A port of the current state of my patch to the master branch is
in https://github.com/apache/zookeeper/pull/719. Be aware that a couple of
changes needed in 3.5 aren't needed in master:
https://github.com/apache/zookeeper/pull/707/files#diff-7a209d890686bcba351d758b64b22a7dR413
and
https://github.com/apache/zookeeper/pull/707/files#diff-b2dd09c58f745da275fee3c6d8681503R974
(both are obviated by cleanups that have already taken place on master).
ZOOKEEPER-1636: By "clean" I just mean "in isolation"; previously I had stacked
this patch in a branch on top of the 2778 work.

ZOOKEEPER-1818: PR #714 is a port of Fangmin's patch to 3.5 (which split off
before the refactor from termCondition to getVoteTracker). PR #718 is Fangmin's
patch unchanged, just cherry-picked onto current master and poked until we got
a green Jenkins build.

"Address already in use":
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2708/consoleText
(search for BindException). You generally have to look at the raw consoleText
in order to find these. I don't see any way of getting at the untruncated text
for
https://builds.apache.org/job/ZooKeeper_branch35_jdk8/1195/testReport/junit/org.apache.zookeeper.server.quorum/StandaloneDisabledTest/startSingleServerTest/ ,
but I suspect there's a similar BindException hidden inside
"...[truncated 395348 chars]..."

On Fri, Nov 23, 2018 at 1:50 AM Andor Molnar <an...@apache.org> wrote:
>
> Hi Michael,
>
> I added you to the contributors list in Jira, now you can assign tickets to
> yourself.
>
> 3.5
> ~~~
> ZOOKEEPER-2778 - I already accepted the patch, but I'd like to kindly ask
> you to create a separate pull request for the master branch, which I can
> backport to 3.5 after committing it. This will help us follow the standard
> procedure for making changes.
>
> ZOOKEEPER-1636 - Thanks for picking it up, I'll review your patch shortly.
> Btw I'm not sure what you mean by a "clean" pull request.
>
> ZOOKEEPER-1818 - This issue is already taken care of by Fangmin (PR #703),
> why have you created the new PR?
>
> Flakies
> ~~~~~~~
> We're already aware of the downside of the PortAssignment class, but haven't
> really seen too many "Address already in use" problems in tests. (Except in
> Java 11 builds, but those are unrelated.) Would you please provide some
> evidence for your findings, with links to the builds that you're talking
> about and specific error messages?
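As a concrete illustration of the race behind those BindExceptions, here is a
hedged sketch of the check-then-use pattern this thread blames: a port is
verified free by binding and immediately closing a probe socket, which leaves
a window in which any concurrent process can claim the port before the test
binds it for real. (This is an illustration only, not ZooKeeper's actual
PortAssignment code; the class and method names are made up.)

```java
import java.io.IOException;
import java.net.ServerSocket;

public class PortProbeSketch {

    // Bind and close a probe socket; the returned port was free at the
    // moment of the check, but is unreserved from this point on.
    static int probeFreePort() throws IOException {
        try (ServerSocket probe = new ServerSocket(0)) { // 0 = OS picks a port
            return probe.getLocalPort();
        }
    }

    public static void main(String[] args) throws IOException {
        int port = probeFreePort();
        // ... race window: anything else on this host may bind `port` here ...
        try (ServerSocket server = new ServerSocket(port)) {
            System.out.println("bound " + server.getLocalPort());
        }
        // If the port was stolen during the window, the bind above throws
        // java.net.BindException: Address already in use.
    }
}
```

In a dedicated network namespace (as suggested later in this thread), nothing
else can bind in that window, which is why containerizing the tests would
isolate any remaining failures to the test suite itself.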
>
> Thanks,
> Andor
>
>
>
>
> > On 2018. Nov 22., at 23:20, Michael K. Edwards <m.k.edwa...@gmail.com>
> > wrote:
> >
> > For what it's worth, builds 2732 and 2733 ran concurrently on H19, and
> > both failed for what I think are resource-conflict reasons. It would
> > probably help to modify the PreCommit-ZOOKEEPER-github-pr-build queue
> > so that it doesn't attempt concurrent builds on the same
> > (uncontainerized) host.
> > On Thu, Nov 22, 2018 at 1:44 PM Michael K. Edwards
> > <m.k.edwa...@gmail.com> wrote:
> >>
> >> Thanks for the guidance. Feel free to assign ZOOKEEPER-2778 to me (I
> >> don't seem to be able to do it myself). I've updated that pull
> >> request against 3.5 to address all reviewer comments. When it looks
> >> ready to land, I'll port it to master as well.
> >>
> >> I have updated ZOOKEEPER-1636 and ZOOKEEPER-1818 with clean pull
> >> requests based on Thawan's and Fangmin's patches. I'll poke at them
> >> until they build green, and try to handle anything reviewers bring up.
> >>
> >> With regard to flaky tests: a fair fraction of spurious test failures
> >> appear to result from failure to bind a dynamically-assigned
> >> client/election/quorum port. The prevailing hypothesis is that
> >> something else, running concurrently on the machine, is binding the
> >> port in between the check in PortAssignment (which binds it, to verify
> >> that it's not otherwise in use, and then closes that socket to free it
> >> again) and the subsequent use as a service port. If that's the case,
> >> then we could eliminate this class of test failures by running the
> >> tests inside a container (with a dedicated network namespace). Any
> >> failures of this kind that persist in a containerized test setup are
> >> the test fighting itself, not fighting unrelated concurrent processes.
> >> On Thu, Nov 22, 2018 at 8:23 AM Andor Molnar <an...@cloudera.com> wrote:
> >>>
> >>> Hi Michael!
> >>>
> >>> Thanks for the great help to get 3.5 out of the door.
> >>> We're getting closer with each commit.
> >>>
> >>> You asked a lot of questions in your email, which I'm trying to answer,
> >>> but I believe the best approach is to deal with one problem at a time.
> >>> Especially in email communication it's not ideal to mix different
> >>> topics, because it makes things hard to follow.
> >>>
> >>> I'll focus on the 3.5 release in this thread, according to the subject.
> >>> There's another thread, btw, that I usually update every so often, but
> >>> your list is pretty much accurate too. I use the following query for
> >>> 3.5 blockers:
> >>>
> >>> project = ZooKeeper AND resolution = Unresolved AND fixVersion = 3.5.5
> >>> AND priority in (blocker, critical) ORDER BY priority DESC, key ASC
> >>>
> >>> ZOOKEEPER-1818 - Fangmin is working on it and a patch is available on
> >>> github.
> >>> ZOOKEEPER-2778 - You're working on it, patch is available. You should
> >>> assign the Jira to yourself to avoid somebody else picking it up.
> >>> ZOOKEEPER-1636 - An ancient C issue which has a patch available in
> >>> Jira. I'm planning to rebase it on master, but didn't have a chance yet.
> >>>
> >>> All of the others are Maven/Doc related, which Tamas and Norbert are
> >>> working on.
> >>>
> >>> Flaky tests are related, but we don't tackle them as a blocker issue.
> >>> Here's the umbrella Jira that I've created to track the progress:
> >>> https://issues.apache.org/jira/browse/ZOOKEEPER-3170
> >>>
> >>> Feel free to pick up any of the open ones or create new ones if you
> >>> think it's necessary. It's generally better to open individual Jiras
> >>> for every issue you're working on and discuss the details in them. You
> >>> can open an email thread too, if you find it convenient, but Jira is
> >>> preferred.
> >>>
> >>> Preferred workflow: Open Jira -> GitHub PR -> Commit to master ->
> >>> Backport to 3.5/3.4 if necessary -> Close Jira.
> >>>
> >>> Thank you for your contribution again!
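Incidentally, the PortAssignment race discussed in this thread could in
principle be closed by handing the test a still-open socket rather than a bare
port number, so the port is never released between the check and its real use.
A hedged sketch follows; the class and method names are hypothetical, and
applying this to ZooKeeper's tests would require the server factories to
accept a pre-bound socket.

```java
import java.io.IOException;
import java.net.ServerSocket;

public class HeldPortAllocator {

    // Reserve an ephemeral port by keeping the listening socket open;
    // the kernel guarantees no other process can bind it meanwhile.
    static ServerSocket reserve() throws IOException {
        return new ServerSocket(0);
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket held = reserve()) {
            // A test would pass `held` itself to the server under test,
            // instead of closing it and re-binding the port number.
            System.out.println("reserved " + held.getLocalPort());
        }
    }
}
```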
> >>>
> >>> Andor
> >>>
> >>>
> >>>
> >>> On Thu, Nov 22, 2018 at 12:51 PM Michael K. Edwards
> >>> <m.k.edwa...@gmail.com> wrote:
> >>>>
> >>>> I think it's mostly a problem in CI, where other processes on the same
> >>>> machine may compete for the port range, producing spurious Jenkins
> >>>> failures. The only failures I'm seeing locally are unrelated SSL
> >>>> issues.
> >>>> On Thu, Nov 22, 2018 at 3:45 AM Enrico Olivelli <eolive...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> Il giorno gio 22 nov 2018 alle ore 12:44 Michael K. Edwards
> >>>>> <m.k.edwa...@gmail.com> ha scritto:
> >>>>>>
> >>>>>> I'm glad to be able to help.
> >>>>>>
> >>>>>> It appears as though some of the "flaky tests" result from another
> >>>>>> process stealing a server port between the time that it is assigned
> >>>>>> (in org.apache.zookeeper.PortAssignment.unique()) and the time that
> >>>>>> it is bound.
> >>>>>
> >>>>> You can try running tests using a single thread; this will "mitigate"
> >>>>> the problem a bit.
> >>>>>
> >>>>> Enrico
> >>>>>
> >>>>> This happened, for example, in
> >>>>>> https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2708/;
> >>>>>> looking in the console text, I found:
> >>>>>>
> >>>>>> [exec] [junit] 2018-11-22 00:18:30,336 [myid:] - INFO
> >>>>>> [QuorumPeerListener:QuorumCnxManager$Listener@884] - My election bind
> >>>>>> port: localhost/127.0.0.1:19459
> >>>>>> [exec] [junit] 2018-11-22 00:18:30,337 [myid:] - INFO
> >>>>>> [QuorumPeer[myid=1](plain=/127.0.0.1:19457)(secure=disabled):NettyServerCnxnFactory@493]
> >>>>>> - binding to port localhost/127.0.0.1:19466
> >>>>>> [exec] [junit] 2018-11-22 00:18:30,337 [myid:] - ERROR
> >>>>>> [QuorumPeer[myid=1](plain=/127.0.0.1:19457)(secure=disabled):NettyServerCnxnFactory@497]
> >>>>>> - Error while reconfiguring
> >>>>>> [exec] [junit] org.jboss.netty.channel.ChannelException:
> >>>>>> Failed to bind to: localhost/127.0.0.1:19466
> >>>>>> [exec] [junit] at
> >>>>>> org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
> >>>>>> [exec] [junit] at
> >>>>>> org.apache.zookeeper.server.NettyServerCnxnFactory.reconfigure(NettyServerCnxnFactory.java:494)
> >>>>>> [exec] [junit] at
> >>>>>> org.apache.zookeeper.server.quorum.QuorumPeer.processReconfig(QuorumPeer.java:1947)
> >>>>>> [exec] [junit] at
> >>>>>> org.apache.zookeeper.server.quorum.Follower.processPacket(Follower.java:154)
> >>>>>> [exec] [junit] at
> >>>>>> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:93)
> >>>>>> [exec] [junit] at
> >>>>>> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1263)
> >>>>>> [exec] [junit] Caused by: java.net.BindException: Address
> >>>>>> already in use
> >>>>>> [exec] [junit] at sun.nio.ch.Net.bind0(Native Method)
> >>>>>> [exec] [junit] at sun.nio.ch.Net.bind(Net.java:433)
> >>>>>> [exec] [junit] at sun.nio.ch.Net.bind(Net.java:425)
> >>>>>> [exec] [junit] at
> >>>>>> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
> >>>>>> [exec] [junit] at
> >>>>>> sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
> >>>>>> [exec] [junit] at
> >>>>>> org.jboss.netty.channel.socket.nio.NioServerBoss$RegisterTask.run(NioServerBoss.java:193)
> >>>>>> [exec] [junit] at
> >>>>>> org.jboss.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:391)
> >>>>>> [exec] [junit] at
> >>>>>> org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:315)
> >>>>>> [exec] [junit] at
> >>>>>> org.jboss.netty.channel.socket.nio.NioServerBoss.run(NioServerBoss.java:42)
> >>>>>> [exec] [junit] at
> >>>>>> org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
> >>>>>> [exec] [junit] at
> >>>>>> org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
> >>>>>> [exec] [junit] at
> >>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >>>>>> [exec] [junit] at
> >>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >>>>>> [exec] [junit] at java.lang.Thread.run(Thread.java:748)
> >>>>>>
> >>>>>> We currently log-and-swallow this exception (and many others) down in
> >>>>>> NettyServerCnxnFactory.reconfigure() and
> >>>>>> NIOServerCnxnFactory.reconfigure(), which is ... not ideal.
> >>>>>>
> >>>>>> How should we handle a bind failure in the real world? It seems like
> >>>>>> we ought to throw a BindException out at least as far as the caller
> >>>>>> of QuorumPeer.processReconfig(). That's either
> >>>>>> Follower/Leader/Learner/Observer or FastLeaderElection. Presumably
> >>>>>> they should immediately go read-only when they can't bind the client
> >>>>>> port?
> >>>>>> On Thu, Nov 22, 2018 at 1:23 AM Enrico Olivelli <eolive...@gmail.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> Thank you very much, Michael.
> >>>>>>> I am following and reviewing your patches.
> >>>>>>>
> >>>>>>> Enrico
> >>>>>>> Il giorno gio 22 nov 2018 alle ore 10:14 Michael K. Edwards
> >>>>>>> <m.k.edwa...@gmail.com> ha scritto:
> >>>>>>>>
> >>>>>>>> Hmm. Jira's a bit of a boneyard, isn't it? And timeouts in flaky
> >>>>>>>> tests are a problem.
> >>>>>>>>
> >>>>>>>> I scrubbed through the open bugs and picked the ones that looked to
> >>>>>>>> me like they might deserve attention for 3.5.5 or soon thereafter.
> >>>>>>>> They're all on my watchlist:
> >>>>>>>> https://issues.apache.org/jira/issues/?filter=-1&jql=watcher%20%3D%20mkedwards%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20created%20ASC
> >>>>>>>> (I'm not counting the Ant->Maven transition in that, which I don't
> >>>>>>>> know much about.)
> >>>>>>>>
> >>>>>>>> I'm trying out some more verbose logging for the junit tests, to
> >>>>>>>> try to understand test flakiness. But the Jenkins pre-commit
> >>>>>>>> pipeline appears to be down?
> >>>>>>>> https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/
> >>>>>>>> On Wed, Nov 21, 2018 at 2:29 PM Michael K. Edwards
> >>>>>>>> <m.k.edwa...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Looks like we're really close. Can I help?
> >>>>>>>>>
> >>>>>>>>> I think this is the list of release blockers:
> >>>>>>>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ZooKeeper%20and%20resolution%20%3D%20Unresolved%20and%20fixVersion%20%3D%203.5.5%20AND%20priority%20in%20(blocker%2C%20critical)%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC
> >>>>>>>>>
> >>>>>>>>> I currently see 7 issues in that search, of which 4 are aspects of
> >>>>>>>>> the ongoing switch from ant to maven. Setting that aside for the
> >>>>>>>>> moment, there are 3 critical bugs:
> >>>>>>>>>
> >>>>>>>>> ZOOKEEPER-2778 Potential server deadlock between follower sync
> >>>>>>>>> with leader and follower receiving external connection requests.
> >>>>>>>>>
> >>>>>>>>> ZOOKEEPER-1636 c-client crash when zoo_amulti failed
> >>>>>>>>>
> >>>>>>>>> ZOOKEEPER-1818 Fix don't care for trunk
> >>>>>>>>>
> >>>>>>>>> I put them in that order because that's the order in which I've
> >>>>>>>>> stacked the fixes in
> >>>>>>>>> https://github.com/mkedwards/zookeeper/tree/branch-3.5. Then on
> >>>>>>>>> top of that, I've updated the versions of the external library
> >>>>>>>>> dependencies I think it's important to update: Jetty, Jackson, and
> >>>>>>>>> BouncyCastle. The result seems to be a green build in Jenkins:
> >>>>>>>>> https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2705/
> >>>>>>>>>
> >>>>>>>>> Are these fixes in principle landable on the 3.5 branch, or do
> >>>>>>>>> they have to go to master first? Does master need help to build
> >>>>>>>>> green before these can land there? Are there other bugs that are
> >>>>>>>>> similarly critical to fix, and not tagged for 3.5.5 in Jira?
> >>>>>>>>> Is there other
> >>>>>>>>> testing that I can help with? Are more hands needed on the Maven
> >>>>>>>>> work?
> >>>>>>>>>
> >>>>>>>>> Thanks for all the work that goes into keeping ZooKeeper healthy
> >>>>>>>>> and advancing; it's a critical infrastructure component in several
> >>>>>>>>> systems I help develop and operate, and I like being able to rely
> >>>>>>>>> on it.
> >>>>>>>>>
> >>>>>>>>> Cheers,
> >>>>>>>>> - Michael
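On the log-and-swallow question raised in this thread (bind failures buried
inside NettyServerCnxnFactory.reconfigure()): here is a hedged sketch of the
alternative of propagating the failure to the caller so the peer can react,
for example by going read-only. The names rebind() and RebindFailedException
are hypothetical, not ZooKeeper's actual API.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class ReconfigureErrorSketch {

    static class RebindFailedException extends IOException {
        RebindFailedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Attempt to bind the new client address, propagating failure upward
    // instead of merely logging it.
    static ServerSocket rebind(InetSocketAddress addr) throws RebindFailedException {
        try {
            ServerSocket s = new ServerSocket();
            s.bind(addr);
            return s;
        } catch (IOException e) {
            throw new RebindFailedException("failed to bind " + addr, e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hold a port open, then show that a second bind attempt surfaces
        // as an exception the caller can handle, rather than vanishing
        // into a log line.
        try (ServerSocket first = rebind(new InetSocketAddress("127.0.0.1", 0))) {
            InetSocketAddress taken =
                    new InetSocketAddress("127.0.0.1", first.getLocalPort());
            try {
                rebind(taken);
            } catch (RebindFailedException e) {
                System.out.println("caller saw: " + e.getMessage());
            }
        }
    }
}
```

With this shape, the caller of QuorumPeer.processReconfig() (or whichever
layer is chosen) gets to decide the recovery policy in one place instead of
the failure disappearing inside each connection factory.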