Thanks! I assigned 2778 to myself.

ZOOKEEPER-2778: A port of the current state of my patch to the master branch is
in https://github.com/apache/zookeeper/pull/719. Be aware that a couple of
changes needed in 3.5 aren't needed in master:
https://github.com/apache/zookeeper/pull/707/files#diff-7a209d890686bcba351d758b64b22a7dR413
and
https://github.com/apache/zookeeper/pull/707/files#diff-b2dd09c58f745da275fee3c6d8681503R974
(both are obviated by cleanups that have already taken place on master).
ZOOKEEPER-1636: By "clean" I just mean "in isolation"; previously I had stacked
this patch in a branch on top of the 2778 work.

ZOOKEEPER-1818: PR #714 is a port of Fangmin's patch to 3.5 (which split off
before the refactor from termCondition to getVoteTracker). PR #718 is Fangmin's
patch unchanged, just cherry-picked onto current master and poked until we got
a green Jenkins build.

"Address already in use":
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2708/consoleText
(search for BindException). You generally have to look at the raw consoleText
in order to find these. I don't see any way of getting at the untruncated text
for
https://builds.apache.org/job/ZooKeeper_branch35_jdk8/1195/testReport/junit/org.apache.zookeeper.server.quorum/StandaloneDisabledTest/startSingleServerTest/ ,
but I suspect there's a similar BindException hidden inside
"...[truncated 395348 chars]..."

On Fri, Nov 23, 2018 at 1:50 AM Andor Molnar <an...@apache.org> wrote:
>
> Hi Michael,
>
> I added you to the contributors list in Jira, now you can assign tickets to
> yourself.
>
> 3.5
> ~~~
> ZOOKEEPER-2778 - I already accepted the patch, but I'd like to kindly ask
> you to create a separate pull request for the master branch, which I can
> backport to 3.5 after committing it. This will help us follow the standard
> procedure for making changes.
>
> ZOOKEEPER-1636 - Thanks for picking it up, I'll review your patch shortly.
> Btw I'm not sure what you mean by a "clean" pull request.
>
> ZOOKEEPER-1818 - This issue is already taken care of by Fangmin (PR #703),
> why have you created the new PR?
>
> Flakies
> ~~~~~~~
> We're already aware of the downside of the PortAssignment class, but haven't
> really seen too many "Address already in use" problems in tests. (Except in
> Java 11 builds, but those are unrelated.) Would you please provide some
> evidence for your findings, with links to the builds that you're talking
> about and specific error messages?
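As a concrete illustration of the race behind those BindExceptions, here is a
hedged sketch of the check-then-use pattern this thread blames: a port is
verified free by binding and immediately closing a probe socket, which leaves
a window in which any concurrent process can claim the port before the test
binds it for real. (This is an illustration only, not ZooKeeper's actual
PortAssignment code; the class and method names are made up.)

```java
import java.io.IOException;
import java.net.ServerSocket;

public class PortProbeSketch {

    // Bind and close a probe socket; the returned port was free at the
    // moment of the check, but is unreserved from this point on.
    static int probeFreePort() throws IOException {
        try (ServerSocket probe = new ServerSocket(0)) { // 0 = OS picks a port
            return probe.getLocalPort();
        }
    }

    public static void main(String[] args) throws IOException {
        int port = probeFreePort();
        // ... race window: anything else on this host may bind `port` here ...
        try (ServerSocket server = new ServerSocket(port)) {
            System.out.println("bound " + server.getLocalPort());
        }
        // If the port was stolen during the window, the bind above throws
        // java.net.BindException: Address already in use.
    }
}
```

In a dedicated network namespace (as suggested later in this thread), nothing
else can bind in that window, which is why containerizing the tests would
isolate any remaining failures to the test suite itself.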
>
> Thanks,
> Andor
>
>
>
>
> > On 2018. Nov 22., at 23:20, Michael K. Edwards <m.k.edwa...@gmail.com>
> > wrote:
> >
> > For what it's worth, builds 2732 and 2733 ran concurrently on H19, and
> > both failed for what I think are resource-conflict reasons. It would
> > probably help to modify the PreCommit-ZOOKEEPER-github-pr-build queue
> > so that it doesn't attempt concurrent builds on the same
> > (uncontainerized) host.
> > On Thu, Nov 22, 2018 at 1:44 PM Michael K. Edwards
> > <m.k.edwa...@gmail.com> wrote:
> >>
> >> Thanks for the guidance. Feel free to assign ZOOKEEPER-2778 to me (I
> >> don't seem to be able to do it myself). I've updated that pull
> >> request against 3.5 to address all reviewer comments. When it looks
> >> ready to land, I'll port it to master as well.
> >>
> >> I have updated ZOOKEEPER-1636 and ZOOKEEPER-1818 with clean pull
> >> requests based on Thawan's and Fangmin's patches. I'll poke at them
> >> until they build green, and try to handle anything reviewers bring up.
> >>
> >> With regard to flaky tests: a fair fraction of spurious test failures
> >> appear to result from failure to bind a dynamically-assigned
> >> client/election/quorum port. The prevailing hypothesis is that
> >> something else, running concurrently on the machine, is binding the
> >> port in between the check in PortAssignment (which binds it, to verify
> >> that it's not otherwise in use, and then closes that socket to free it
> >> again) and the subsequent use as a service port. If that's the case,
> >> then we could eliminate this class of test failures by running the
> >> tests inside a container (with a dedicated network namespace). Any
> >> failures of this kind that persist in a containerized test setup are
> >> the test fighting itself, not fighting unrelated concurrent processes.
> >> On Thu, Nov 22, 2018 at 8:23 AM Andor Molnar <an...@cloudera.com> wrote:
> >>>
> >>> Hi Michael!
> >>>
> >>> Thanks for the great help to get 3.5 out of the door.
> >>> We're getting closer with each commit.
> >>>
> >>> You asked a lot of questions in your email, which I'm trying to answer,
> >>> but I believe the best approach is to deal with one problem at a time.
> >>> Especially in email communication it's not ideal to mix different
> >>> topics, because it makes things hard to follow.
> >>>
> >>> I'll focus on the 3.5 release in this thread, according to the subject.
> >>> There's another thread, btw, that I usually update every so often, but
> >>> your list is pretty much accurate too. I use the following query for
> >>> 3.5 blockers:
> >>>
> >>> project = ZooKeeper AND resolution = Unresolved AND fixVersion = 3.5.5
> >>> AND priority in (blocker, critical) ORDER BY priority DESC, key ASC
> >>>
> >>> ZOOKEEPER-1818 - Fangmin is working on it and a patch is available on
> >>> github.
> >>> ZOOKEEPER-2778 - You're working on it, patch is available. You should
> >>> assign the Jira to yourself to avoid somebody else picking it up.
> >>> ZOOKEEPER-1636 - An ancient C issue which has a patch available in
> >>> Jira. I'm planning to rebase it on master, but didn't have a chance yet.
> >>>
> >>> All of the others are Maven/Doc related, which Tamas and Norbert are
> >>> working on.
> >>>
> >>> Flaky tests are related, but we don't tackle them as a blocker issue.
> >>> Here's the umbrella Jira that I've created to track the progress:
> >>> https://issues.apache.org/jira/browse/ZOOKEEPER-3170
> >>>
> >>> Feel free to pick up any of the open ones or create new ones if you
> >>> think it's necessary. It's generally better to open individual Jiras
> >>> for every issue you're working on and discuss the details in them. You
> >>> can open an email thread too, if you find it convenient, but Jira is
> >>> preferred.
> >>>
> >>> Preferred workflow: Open Jira -> GitHub PR -> Commit to master ->
> >>> Backport to 3.5/3.4 if necessary -> Close Jira.
> >>>
> >>> Thank you for your contribution again!
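Incidentally, the PortAssignment race discussed in this thread could in
principle be closed by handing the test a still-open socket rather than a bare
port number, so the port is never released between the check and its real use.
A hedged sketch follows; the class and method names are hypothetical, and
applying this to ZooKeeper's tests would require the server factories to
accept a pre-bound socket.

```java
import java.io.IOException;
import java.net.ServerSocket;

public class HeldPortAllocator {

    // Reserve an ephemeral port by keeping the listening socket open;
    // the kernel guarantees no other process can bind it meanwhile.
    static ServerSocket reserve() throws IOException {
        return new ServerSocket(0);
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket held = reserve()) {
            // A test would pass `held` itself to the server under test,
            // instead of closing it and re-binding the port number.
            System.out.println("reserved " + held.getLocalPort());
        }
    }
}
```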
> >>>
> >>> Andor
> >>>
> >>>
> >>>
> >>> On Thu, Nov 22, 2018 at 12:51 PM Michael K. Edwards
> >>> <m.k.edwa...@gmail.com> wrote:
> >>>>
> >>>> I think it's mostly a problem in CI, where other processes on the same
> >>>> machine may compete for the port range, producing spurious Jenkins
> >>>> failures. The only failures I'm seeing locally are unrelated SSL
> >>>> issues.
> >>>> On Thu, Nov 22, 2018 at 3:45 AM Enrico Olivelli <eolive...@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>> Il giorno gio 22 nov 2018 alle ore 12:44 Michael K. Edwards
> >>>>> <m.k.edwa...@gmail.com> ha scritto:
> >>>>>>
> >>>>>> I'm glad to be able to help.
> >>>>>>
> >>>>>> It appears as though some of the "flaky tests" result from another
> >>>>>> process stealing a server port between the time that it is assigned
> >>>>>> (in org.apache.zookeeper.PortAssignment.unique()) and the time that
> >>>>>> it is bound.
> >>>>>
> >>>>> You can try running tests using a single thread; this will "mitigate"
> >>>>> the problem a bit.
> >>>>>
> >>>>> Enrico
> >>>>>
> >>>>> This happened, for example, in
> >>>>>> https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2708/;
> >>>>>> looking in the console text, I found:
> >>>>>>
> >>>>>> [exec] [junit] 2018-11-22 00:18:30,336 [myid:] - INFO
> >>>>>> [QuorumPeerListener:QuorumCnxManager$Listener@884] - My election bind
> >>>>>> port: localhost/127.0.0.1:19459
> >>>>>> [exec] [junit] 2018-11-22 00:18:30,337 [myid:] - INFO
> >>>>>> [QuorumPeer[myid=1](plain=/127.0.0.1:19457)(secure=disabled):NettyServerCnxnFactory@493]
> >>>>>> - binding to port localhost/127.0.0.1:19466
> >>>>>> [exec] [junit] 2018-11-22 00:18:30,337 [myid:] - ERROR
> >>>>>> [QuorumPeer[myid=1](plain=/127.0.0.1:19457)(secure=disabled):NettyServerCnxnFactory@497]
> >>>>>> - Error while reconfiguring
> >>>>>> [exec] [junit] org.jboss.netty.channel.ChannelException:
> >>>>>> Failed to bind to: localhost/127.0.0.1:19466
> >>>>>> [exec] [junit] at
> >>>>>> org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
> >>>>>> [exec] [junit] at
> >>>>>> org.apache.zookeeper.server.NettyServerCnxnFactory.reconfigure(NettyServerCnxnFactory.java:494)
> >>>>>> [exec] [junit] at
> >>>>>> org.apache.zookeeper.server.quorum.QuorumPeer.processReconfig(QuorumPeer.java:1947)
> >>>>>> [exec] [junit] at
> >>>>>> org.apache.zookeeper.server.quorum.Follower.processPacket(Follower.java:154)
> >>>>>> [exec] [junit] at
> >>>>>> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:93)
> >>>>>> [exec] [junit] at
> >>>>>> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1263)
> >>>>>> [exec] [junit] Caused by: java.net.BindException: Address
> >>>>>> already in use
> >>>>>> [exec] [junit] at sun.nio.ch.Net.bind0(Native Method)
> >>>>>> [exec] [junit] at sun.nio.ch.Net.bind(Net.java:433)
> >>>>>> [exec] [junit] at sun.nio.ch.Net.bind(Net.java:425)
> >>>>>> [exec] [junit] at
> >>>>>> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
> >>>>>> [exec] [junit] at
> >>>>>> sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
> >>>>>> [exec] [junit] at
> >>>>>> org.jboss.netty.channel.socket.nio.NioServerBoss$RegisterTask.run(NioServerBoss.java:193)
> >>>>>> [exec] [junit] at
> >>>>>> org.jboss.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:391)
> >>>>>> [exec] [junit] at
> >>>>>> org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:315)
> >>>>>> [exec] [junit] at
> >>>>>> org.jboss.netty.channel.socket.nio.NioServerBoss.run(NioServerBoss.java:42)
> >>>>>> [exec] [junit] at
> >>>>>> org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
> >>>>>> [exec] [junit] at
> >>>>>> org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
> >>>>>> [exec] [junit] at
> >>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> >>>>>> [exec] [junit] at
> >>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> >>>>>> [exec] [junit] at java.lang.Thread.run(Thread.java:748)
> >>>>>>
> >>>>>> We currently log-and-swallow this exception (and many others) down in
> >>>>>> NettyServerCnxnFactory.reconfigure() and
> >>>>>> NIOServerCnxnFactory.reconfigure(), which is ... not ideal.
> >>>>>>
> >>>>>> How should we handle a bind failure in the real world? It seems like
> >>>>>> we ought to throw a BindException out at least as far as the caller
> >>>>>> of QuorumPeer.processReconfig(). That's either
> >>>>>> Follower/Leader/Learner/Observer or FastLeaderElection. Presumably
> >>>>>> they should immediately go read-only when they can't bind the client
> >>>>>> port?
> >>>>>> On Thu, Nov 22, 2018 at 1:23 AM Enrico Olivelli <eolive...@gmail.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> Thank you very much, Michael.
> >>>>>>> I am following and reviewing your patches.
> >>>>>>>
> >>>>>>> Enrico
> >>>>>>> Il giorno gio 22 nov 2018 alle ore 10:14 Michael K. Edwards
> >>>>>>> <m.k.edwa...@gmail.com> ha scritto:
> >>>>>>>>
> >>>>>>>> Hmm. Jira's a bit of a boneyard, isn't it? And timeouts in flaky
> >>>>>>>> tests are a problem.
> >>>>>>>>
> >>>>>>>> I scrubbed through the open bugs and picked the ones that looked to
> >>>>>>>> me like they might deserve attention for 3.5.5 or soon thereafter.
> >>>>>>>> They're all on my watchlist:
> >>>>>>>> https://issues.apache.org/jira/issues/?filter=-1&jql=watcher%20%3D%20mkedwards%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20created%20ASC
> >>>>>>>> (I'm not counting the Ant->Maven transition in that, which I don't
> >>>>>>>> know much about.)
> >>>>>>>>
> >>>>>>>> I'm trying out some more verbose logging for the junit tests, to
> >>>>>>>> try to understand test flakiness. But the Jenkins pre-commit
> >>>>>>>> pipeline appears to be down?
> >>>>>>>> https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/
> >>>>>>>> On Wed, Nov 21, 2018 at 2:29 PM Michael K. Edwards
> >>>>>>>> <m.k.edwa...@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Looks like we're really close. Can I help?
> >>>>>>>>>
> >>>>>>>>> I think this is the list of release blockers:
> >>>>>>>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ZooKeeper%20and%20resolution%20%3D%20Unresolved%20and%20fixVersion%20%3D%203.5.5%20AND%20priority%20in%20(blocker%2C%20critical)%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC
> >>>>>>>>>
> >>>>>>>>> I currently see 7 issues in that search, of which 4 are aspects of
> >>>>>>>>> the ongoing switch from ant to maven. Setting that aside for the
> >>>>>>>>> moment, there are 3 critical bugs:
> >>>>>>>>>
> >>>>>>>>> ZOOKEEPER-2778 Potential server deadlock between follower sync
> >>>>>>>>> with leader and follower receiving external connection requests.
> >>>>>>>>>
> >>>>>>>>> ZOOKEEPER-1636 c-client crash when zoo_amulti failed
> >>>>>>>>>
> >>>>>>>>> ZOOKEEPER-1818 Fix don't care for trunk
> >>>>>>>>>
> >>>>>>>>> I put them in that order because that's the order in which I've
> >>>>>>>>> stacked the fixes in
> >>>>>>>>> https://github.com/mkedwards/zookeeper/tree/branch-3.5. Then on
> >>>>>>>>> top of that, I've updated the versions of the external library
> >>>>>>>>> dependencies I think it's important to update: Jetty, Jackson, and
> >>>>>>>>> BouncyCastle. The result seems to be a green build in Jenkins:
> >>>>>>>>> https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2705/
> >>>>>>>>>
> >>>>>>>>> Are these fixes in principle landable on the 3.5 branch, or do
> >>>>>>>>> they have to go to master first? Does master need help to build
> >>>>>>>>> green before these can land there? Are there other bugs that are
> >>>>>>>>> similarly critical to fix, and not tagged for 3.5.5 in Jira?
> >>>>>>>>> Is there other
> >>>>>>>>> testing that I can help with? Are more hands needed on the Maven
> >>>>>>>>> work?
> >>>>>>>>>
> >>>>>>>>> Thanks for all the work that goes into keeping ZooKeeper healthy
> >>>>>>>>> and advancing; it's a critical infrastructure component in several
> >>>>>>>>> systems I help develop and operate, and I like being able to rely
> >>>>>>>>> on it.
> >>>>>>>>>
> >>>>>>>>> Cheers,
> >>>>>>>>> - Michael
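On the log-and-swallow question raised in this thread (bind failures buried
inside NettyServerCnxnFactory.reconfigure()): here is a hedged sketch of the
alternative of propagating the failure to the caller so the peer can react,
for example by going read-only. The names rebind() and RebindFailedException
are hypothetical, not ZooKeeper's actual API.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class ReconfigureErrorSketch {

    static class RebindFailedException extends IOException {
        RebindFailedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Attempt to bind the new client address, propagating failure upward
    // instead of merely logging it.
    static ServerSocket rebind(InetSocketAddress addr) throws RebindFailedException {
        try {
            ServerSocket s = new ServerSocket();
            s.bind(addr);
            return s;
        } catch (IOException e) {
            throw new RebindFailedException("failed to bind " + addr, e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hold a port open, then show that a second bind attempt surfaces
        // as an exception the caller can handle, rather than vanishing
        // into a log line.
        try (ServerSocket first = rebind(new InetSocketAddress("127.0.0.1", 0))) {
            InetSocketAddress taken =
                    new InetSocketAddress("127.0.0.1", first.getLocalPort());
            try {
                rebind(taken);
            } catch (RebindFailedException e) {
                System.out.println("caller saw: " + e.getMessage());
            }
        }
    }
}
```

With this shape, the caller of QuorumPeer.processReconfig() (or whichever
layer is chosen) gets to decide the recovery policy in one place instead of
the failure disappearing inside each connection factory.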