On Thu, Nov 22, 2018 at 12:44, Michael K. Edwards <m.k.edwa...@gmail.com> wrote:
>
> I'm glad to be able to help.
>
> It appears as though some of the "flaky tests" result from another
> process stealing a server port between the time that it is assigned
> (in org.apache.zookeeper.PortAssignment.unique()) and the time that it
> is bound.
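
For readers following along, the race described above can be sketched in plain Java. This is only an illustration under stated assumptions, not ZooKeeper code: the class and method names (PortRaceSketch, bindEphemeral, isTaken) are hypothetical. It holds a loopback port the way another process might, then shows that a second bind to the same port fails with java.net.BindException, the exception reported later in this thread. Binding to port 0 lets the OS pick and bind the port in one step, which is one common way to avoid the gap between assignment and bind; whether that fits the test harness is a separate question.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class PortRaceSketch {

    /** Binds a listening socket to an OS-chosen loopback port (port 0). */
    public static ServerSocket bindEphemeral() throws IOException {
        ServerSocket socket = new ServerSocket();
        socket.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
        return socket;
    }

    /** Returns true if binding to the given loopback port fails with BindException. */
    public static boolean isTaken(int port) throws IOException {
        try (ServerSocket probe = new ServerSocket()) {
            probe.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), port));
            return false; // bind succeeded, so nobody else held the port
        } catch (BindException e) {
            return true;  // "Address already in use" or similar
        }
    }

    public static void main(String[] args) throws IOException {
        // The holder plays the role of the process that grabbed the port first.
        try (ServerSocket holder = bindEphemeral()) {
            int port = holder.getLocalPort();
            System.out.println("port " + port + " taken: " + isTaken(port));
        }
    }
}
```

A port number that was only reserved on paper gives no such guarantee: anything can bind it between the reservation and the real bind, which is the window described above.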

You can try running the tests in a single thread; this will "mitigate" the problem a bit.

Enrico

> This happened, for example, in
> https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2708/;
> looking in the console text, I found:
>
> [exec] [junit] 2018-11-22 00:18:30,336 [myid:] - INFO [QuorumPeerListener:QuorumCnxManager$Listener@884] - My election bind port: localhost/127.0.0.1:19459
> [exec] [junit] 2018-11-22 00:18:30,337 [myid:] - INFO [QuorumPeer[myid=1](plain=/127.0.0.1:19457)(secure=disabled):NettyServerCnxnFactory@493] - binding to port localhost/127.0.0.1:19466
> [exec] [junit] 2018-11-22 00:18:30,337 [myid:] - ERROR [QuorumPeer[myid=1](plain=/127.0.0.1:19457)(secure=disabled):NettyServerCnxnFactory@497] - Error while reconfiguring
> [exec] [junit] org.jboss.netty.channel.ChannelException: Failed to bind to: localhost/127.0.0.1:19466
> [exec] [junit]     at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
> [exec] [junit]     at org.apache.zookeeper.server.NettyServerCnxnFactory.reconfigure(NettyServerCnxnFactory.java:494)
> [exec] [junit]     at org.apache.zookeeper.server.quorum.QuorumPeer.processReconfig(QuorumPeer.java:1947)
> [exec] [junit]     at org.apache.zookeeper.server.quorum.Follower.processPacket(Follower.java:154)
> [exec] [junit]     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:93)
> [exec] [junit]     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1263)
> [exec] [junit] Caused by: java.net.BindException: Address already in use
> [exec] [junit]     at sun.nio.ch.Net.bind0(Native Method)
> [exec] [junit]     at sun.nio.ch.Net.bind(Net.java:433)
> [exec] [junit]     at sun.nio.ch.Net.bind(Net.java:425)
> [exec] [junit]     at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
> [exec] [junit]     at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
> [exec] [junit]     at org.jboss.netty.channel.socket.nio.NioServerBoss$RegisterTask.run(NioServerBoss.java:193)
> [exec] [junit]     at org.jboss.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:391)
> [exec] [junit]     at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:315)
> [exec] [junit]     at org.jboss.netty.channel.socket.nio.NioServerBoss.run(NioServerBoss.java:42)
> [exec] [junit]     at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
> [exec] [junit]     at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
> [exec] [junit]     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [exec] [junit]     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [exec] [junit]     at java.lang.Thread.run(Thread.java:748)
>
> We currently log-and-swallow this exception (and many others) down in
> NettyServerCnxnFactory.reconfigure() and
> NIOServerCnxnFactory.reconfigure(), which is ... not ideal.
>
> How should we handle a bind failure in the real world? Seems like we
> ought to throw a BindException out at least as far as the caller of
> QuorumPeer.processReconfig(). That's either
> Follower/Leader/Learner/Observer or FastLeaderElection. Presumably
> they should immediately go read-only when they can't bind the client
> port?
> On Thu, Nov 22, 2018 at 1:23 AM Enrico Olivelli <eolive...@gmail.com> wrote:
> >
> > Thank you very much Michael
> > I am following and reviewing your patches
> >
> > Enrico
> > On Thu, Nov 22, 2018 at 10:14, Michael K. Edwards
> > <m.k.edwa...@gmail.com> wrote:
> > >
> > > Hmm. Jira's a bit of a boneyard, isn't it? And timeouts in flaky
> > > tests are a problem.
> > >
> > > I scrubbed through the open bugs and picked the ones that looked to me
> > > like they might deserve attention for 3.5.5 or soon thereafter.
> > > They're all on my watchlist:
> > > https://issues.apache.org/jira/issues/?filter=-1&jql=watcher%20%3D%20mkedwards%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20created%20ASC
> > > (I'm not counting the Ant->Maven transition in that, which I don't
> > > know much about.)
> > >
> > > I'm trying out some more verbose logging for the junit tests, to try
> > > to understand test flakiness. But the Jenkins pre-commit pipeline
> > > appears to be down?
> > > https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/
> > > On Wed, Nov 21, 2018 at 2:29 PM Michael K. Edwards
> > > <m.k.edwa...@gmail.com> wrote:
> > > >
> > > > Looks like we're really close. Can I help?
> > > >
> > > > I think this is the list of release blockers:
> > > > https://issues.apache.org/jira/issues/?jql=project%20%3D%20ZooKeeper%20and%20resolution%20%3D%20Unresolved%20and%20fixVersion%20%3D%203.5.5%20AND%20priority%20in%20(blocker%2C%20critical)%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC
> > > >
> > > > I currently see 7 issues in that search, of which 4 are aspects of the
> > > > ongoing switch from ant to maven. Setting that aside for the moment,
> > > > there are 3 critical bugs:
> > > >
> > > > ZOOKEEPER-2778 Potential server deadlock between follower sync with
> > > > leader and follower receiving external connection requests.
> > > >
> > > > ZOOKEEPER-1636 c-client crash when zoo_amulti failed
> > > >
> > > > ZOOKEEPER-1818 Fix don't care for trunk
> > > >
> > > > I put them in that order because that's the order in which I've
> > > > stacked the fixes in
> > > > https://github.com/mkedwards/zookeeper/tree/branch-3.5. Then on top
> > > > of that, I've updated the versions of the external library
> > > > dependencies I think it's important to update: Jetty, Jackson, and
> > > > BouncyCastle. The result seems to be a green build in Jenkins:
> > > > https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2705/
> > > >
> > > > Are these fixes in principle landable on the 3.5 branch, or do they
> > > > have to go to master first? Does master need help to build green
> > > > before these can land there? Are there other bugs that are similarly
> > > > critical to fix, and not tagged for 3.5.5 in Jira? Is there other
> > > > testing that I can help with? Are more hands needed on the Maven
> > > > work?
> > > >
> > > > Thanks for all the work that goes into keeping ZooKeeper healthy and
> > > > advancing; it's a critical infrastructure component in several systems
> > > > I help develop and operate, and I like being able to rely on it.
> > > >
> > > > Cheers,
> > > > - Michael