I'm glad to be able to help.

It appears as though some of the "flaky tests" result from another
process stealing a server port between the time that it is assigned
(in org.apache.zookeeper.PortAssignment.unique()) and the time that it
is bound.  This happened, for example, in
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2708/;
looking in the console text, I found:

     [exec]     [junit] 2018-11-22 00:18:30,336 [myid:] - INFO
[QuorumPeerListener:QuorumCnxManager$Listener@884] - My election bind
port: localhost/127.0.0.1:19459
     [exec]     [junit] 2018-11-22 00:18:30,337 [myid:] - INFO
[QuorumPeer[myid=1](plain=/127.0.0.1:19457)(secure=disabled):NettyServerCnxnFactory@493]
- binding to port localhost/127.0.0.1:19466
     [exec]     [junit] 2018-11-22 00:18:30,337 [myid:] - ERROR
[QuorumPeer[myid=1](plain=/127.0.0.1:19457)(secure=disabled):NettyServerCnxnFactory@497]
- Error while reconfiguring
     [exec]     [junit] org.jboss.netty.channel.ChannelException:
Failed to bind to: localhost/127.0.0.1:19466
     [exec]     [junit] at
org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
     [exec]     [junit] at
org.apache.zookeeper.server.NettyServerCnxnFactory.reconfigure(NettyServerCnxnFactory.java:494)
     [exec]     [junit] at
org.apache.zookeeper.server.quorum.QuorumPeer.processReconfig(QuorumPeer.java:1947)
     [exec]     [junit] at
org.apache.zookeeper.server.quorum.Follower.processPacket(Follower.java:154)
     [exec]     [junit] at
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:93)
     [exec]     [junit] at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1263)
     [exec]     [junit] Caused by: java.net.BindException: Address
already in use
     [exec]     [junit] at sun.nio.ch.Net.bind0(Native Method)
     [exec]     [junit] at sun.nio.ch.Net.bind(Net.java:433)
     [exec]     [junit] at sun.nio.ch.Net.bind(Net.java:425)
     [exec]     [junit] at
sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
     [exec]     [junit] at
sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
     [exec]     [junit] at
org.jboss.netty.channel.socket.nio.NioServerBoss$RegisterTask.run(NioServerBoss.java:193)
     [exec]     [junit] at
org.jboss.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:391)
     [exec]     [junit] at
org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:315)
     [exec]     [junit] at
org.jboss.netty.channel.socket.nio.NioServerBoss.run(NioServerBoss.java:42)
     [exec]     [junit] at
org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
     [exec]     [junit] at
org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
     [exec]     [junit] at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
     [exec]     [junit] at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
     [exec]     [junit] at java.lang.Thread.run(Thread.java:748)
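
As a rough illustration of that window (not ZooKeeper code; the class and
method names here are hypothetical), the sketch below has one socket stand in
for the "other process" that takes the port after it was assigned, and then
shows the late bind failing with the same java.net.BindException:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.BindException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical demo class; it does not use PortAssignment or any ZooKeeper API.
class PortRaceSketch {

    /** Returns true if binding a port that something else already holds fails. */
    static boolean bindFailsWhenPortStolen() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Stand-in for the other process: grab an ephemeral port and keep it open.
        try (ServerSocket thief = new ServerSocket(0, 50, loopback)) {
            // Pretend this number was handed to our test before the thief bound it.
            int assignedPort = thief.getLocalPort();
            try (ServerSocket late = new ServerSocket()) {
                late.setReuseAddress(false);
                late.bind(new InetSocketAddress(loopback, assignedPort));
                return false; // bind unexpectedly succeeded
            } catch (BindException e) {
                return true;  // "Address already in use", as in the log above
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("late bind failed: " + bindFailsWhenPortStolen());
    }
}
```

The point is only that handing out a port number reserves nothing: whoever
calls bind() first wins, and the loser finds out at bind time.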

We currently log-and-swallow this exception (and many others) down in
NettyServerCnxnFactory.reconfigure() and
NIOServerCnxnFactory.reconfigure(), which is ... not ideal.

How should we handle a bind failure in the real world?  Seems like we
ought to throw a BindException out at least as far as the caller of
QuorumPeer.processReconfig().  That's either
Follower/Leader/Learner/Observer or FastLeaderElection.  Presumably
they should immediately go read-only when they can't bind the client
port?
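
A minimal sketch of the propagate-instead-of-swallow idea, assuming a plain
java.net.ServerSocket rather than the real NIO/Netty factories (RebindSketch,
rebind(), and failureReachesCaller() are hypothetical names, not existing
ZooKeeper methods). It binds the new address first and only then swaps, so a
failed bind leaves the old listener in place and the exception reaches the
caller:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.BindException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical stand-in for a connection factory; not ZooKeeper code.
class RebindSketch {
    private ServerSocket current;

    /** Binds addr, then swaps listeners. Any bind failure is thrown to the caller. */
    void rebind(InetSocketAddress addr) throws IOException {
        ServerSocket next = new ServerSocket();
        try {
            next.setReuseAddress(false);
            next.bind(addr);
        } catch (IOException e) {
            next.close();
            throw e; // propagate rather than log-and-swallow
        }
        ServerSocket old = current;
        current = next;
        if (old != null) {
            old.close();
        }
    }

    /** Returns the currently bound port, or -1 if nothing is bound. */
    int port() {
        return current == null ? -1 : current.getLocalPort();
    }

    void close() {
        try {
            if (current != null) {
                current.close();
            }
        } catch (IOException ignored) {
            // best-effort cleanup in a demo
        }
    }

    /** Hypothetical caller: true if the failure was visible and the old port survived. */
    static boolean failureReachesCaller() {
        InetAddress lo = InetAddress.getLoopbackAddress();
        RebindSketch server = new RebindSketch();
        try (ServerSocket blocker = new ServerSocket(0, 50, lo)) {
            server.rebind(new InetSocketAddress(lo, 0)); // initial bind succeeds
            int before = server.port();
            try {
                server.rebind(new InetSocketAddress(lo, blocker.getLocalPort()));
                return false; // should not get here: the port is taken
            } catch (BindException e) {
                // The caller now knows, and can decide what to do next.
                return server.port() == before;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            server.close();
        }
    }
}
```

What the caller should then do (retry, go read-only, give up) is the open
question; the sketch only shows the failure becoming visible to it.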

On Thu, Nov 22, 2018 at 1:23 AM Enrico Olivelli <eolive...@gmail.com> wrote:
>
> Thank you very much, Michael.
> I am following and reviewing your patches.
>
> Enrico
> On Thu, Nov 22, 2018 at 10:14 Michael K. Edwards
> <m.k.edwa...@gmail.com> wrote:
> >
> > Hmm.  Jira's a bit of a boneyard, isn't it?  And timeouts in flaky
> > tests are a problem.
> >
> > I scrubbed through the open bugs and picked the ones that looked to me
> > like they might deserve attention for 3.5.5 or soon thereafter.
> > They're all on my watchlist:
> > https://issues.apache.org/jira/issues/?filter=-1&jql=watcher%20%3D%20mkedwards%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20created%20ASC
> > (I'm not counting the Ant->Maven transition in that, which I don't
> > know much about.)
> >
> > I'm trying out some more verbose logging for the junit tests, to try
> > to understand test flakiness.  But the Jenkins pre-commit pipeline
> > appears to be down?
> > https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/
> >
> > On Wed, Nov 21, 2018 at 2:29 PM Michael K. Edwards
> > <m.k.edwa...@gmail.com> wrote:
> > >
> > > Looks like we're really close.  Can I help?
> > >
> > > I think this is the list of release blockers:
> > > https://issues.apache.org/jira/issues/?jql=project%20%3D%20ZooKeeper%20and%20resolution%20%3D%20Unresolved%20and%20fixVersion%20%3D%203.5.5%20AND%20priority%20in%20(blocker%2C%20critical)%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC
> > >
> > > I currently see 7 issues in that search, of which 4 are aspects of the
> > > ongoing switch from Ant to Maven.  Setting that aside for the moment,
> > > there are 3 critical bugs:
> > >
> > > ZOOKEEPER-2778  Potential server deadlock between follower sync with
> > > leader and follower receiving external connection requests.
> > >
> > > ZOOKEEPER-1636  c-client crash when zoo_amulti failed
> > >
> > > ZOOKEEPER-1818  Fix don't care for trunk
> > >
> > > I put them in that order because that's the order in which I've
> > > stacked the fixes in
> > > https://github.com/mkedwards/zookeeper/tree/branch-3.5.  Then on top
> > > of that, I've updated the versions of the external library
> > > dependencies I think it's important to update: Jetty, Jackson, and
> > > BouncyCastle.  The result seems to be a green build in Jenkins:
> > > https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2705/
> > >
> > > Are these fixes in principle landable on the 3.5 branch, or do they
> > > have to go to master first?  Does master need help to build green
> > > before these can land there?  Are there other bugs that are similarly
> > > critical to fix, and not tagged for 3.5.5 in Jira?  Is there other
> > > testing that I can help with?  Are more hands needed on the Maven
> > > work?
> > >
> > > Thanks for all the work that goes into keeping ZooKeeper healthy and
> > > advancing; it's a critical infrastructure component in several systems
> > > I help develop and operate, and I like being able to rely on it.
> > >
> > > Cheers,
> > > - Michael
