Thanks for the feedback folks! I'll post a flag default switch shortly.

On Wed, Aug 24, 2016 at 12:20 PM, Joshua Cohen <jco...@apache.org> wrote:

> I have this enabled in a test cluster and have not noticed any issues with
> it yet. I'd like to roll it out to production before we drop the old code
> though.
>

Agreed. This deserves caution, and fwict the jvm leader code is ~never in
the refactor path; so even though I too am eager to delete the code, it is
not an active refactoring burden.

> On Wed, Aug 24, 2016 at 1:10 PM, Zameer Manji <zma...@apache.org> wrote:
>
>> Could we change the default and drop the old code at the same time? I
>> don't see any benefit of letting that hang around.
>>
>> I have not tested this code yet, but I hope to do it soon.
>>
>> On Wed, Aug 24, 2016 at 5:19 AM, Erb, Stephan
>> <stephan....@blue-yonder.com> wrote:
>>
>> > The curator backend has been working well for us so far. I believe it
>> > is safe to make it the default for the next release, and to drop the
>> > old code in the release after that.
>> >
>> > *From: *John Sirois <jsir...@apache.org>
>> > *Reply-To: *"u...@aurora.apache.org" <u...@aurora.apache.org>,
>> > "jsir...@apache.org" <jsir...@apache.org>
>> > *Date: *Thursday 7 July 2016 at 01:13
>> > *To: *Martin Hrabovčin <martin.hrabov...@gmail.com>
>> > *Cc: *"dev@aurora.apache.org" <dev@aurora.apache.org>, Jake Farrell
>> > <jfarr...@apache.org>, "u...@aurora.apache.org" <u...@aurora.apache.org>
>> > *Subject: *Re: [FEEDBACK] Transitioning Aurora leader election to
>> > Apache Curator (`-zk_use_curator`)
>> >
>> > Now that 0.15.0 has been released, I thought I'd check in on any
>> > progress folks have made with testing/deploying the 0.14.0+ with the
>> > Aurora Scheduler `-zk_use_curator` flag in-place.
>> >
>> > There has been 1 fix that will go out in the 0.16.0 release to reduce
>> > logger noise on shutdown [1][2] but I have heard no negative (or
>> > positive) feedback otherwise.
>> >
>> > [1] https://issues.apache.org/jira/browse/AURORA-1729
>> > [2] https://reviews.apache.org/r/49578/
>> >
>> > On Thu, Jun 16, 2016 at 1:18 PM, John Sirois <jsir...@apache.org> wrote:
>> >
>> > On Thu, Jun 16, 2016 at 12:03 AM, Martin Hrabovčin
>> > <martin.hrabov...@gmail.com> wrote:
>> >
>> > How should be this flag rolled to existing running cluster? Can it be
>> > done using rolling update instance by instance or we need to stop the
>> > whole cluster and then bring all nodes with new flag?
>> >
>> > I recommend a whole cluster down, upgrade + new flag, up.
>> >
>> > A rolling update should work, but will likely be rocky. My analysis:
>> >
>> > The Aurora leader election consists of 2 components, the actual leader
>> > election and the resulting advertisement by the leader of itself as the
>> > Aurora service endpoint. These 2 components each use zookeeper and of
>> > the 2 I only ensured that the advertisement was compatible with old
>> > releases (old clients). The leader election portion is completely
>> > internal to the Aurora scheduler instances vying for leadership and,
>> > under Curator, uses a different (enhanced), zookeeper node scheme. As a
>> > result, this is what could happen in a slow roll:
>> >
>> > before upgrade: 0: old-lead, 1: old-follow, 2: old-follow
>> > upgrade 0: new-lead, 1: old-lead, 2: old-follow
>> >
>> > Here, node 0 will see itself as leader and nodes 1 and 2 will see node
>> > 1 as leader. The result will be both node 0 and node 1 attempting to
>> > read the mesos distributed log. Now the log uses its own leader
>> > election and the reader must be the leader as things stand, so the
>> > Aurora-level leadership "tie" will be broken by one of the 2
>> > Aurora-level leaders failing to become the mesos distributed log
>> > leader, and that node will restart its lifecycle - ie flap. This will
>> > continue to be the case with second node upgrade and will not stabilize
>> > until the 3rd node is upgraded.
>> >
>> > 2016-06-16 5:03 GMT+02:00 Jake Farrell <jfarr...@apache.org>:
>> >
>> > +1, will enable on our test clusters to help verify
>> >
>> > -Jake
>> >
>> > On Tue, Jun 14, 2016 at 7:43 PM, John Sirois <jsir...@apache.org> wrote:
>> >
>> > > I'd like to move forward with
>> > > https://issues.apache.org/jira/browse/AURORA-1669 asap; ie: removing
>> > > legacy (Twitter) commons zookeeper libraries used for Aurora leader
>> > > election in favor of Apache Curator libraries. The change submitted
>> > > in https://reviews.apache.org/r/46286/ is now live in Aurora 0.14.0
>> > > and Apache Curator based service discovery can be enabled with the
>> > > Aurora scheduler flag `-zk_use_curator`. I'd like feedback from users
>> > > who enable this option. If you have a test cluster where you can
>> > > enable `-zk_use_curator` and exercise leader failure and failover,
>> > > I'd be grateful. If you have moved to using this option in production
>> > > with demonstrable improvements or even maintenance of status quo, I'd
>> > > also be grateful for this news. If you've found regressions or new
>> > > bugs, I'd love to know about those as well.
>> > >
>> > > Thanks in advance to all those who find time to test this out on real
>> > > systems!
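
For anyone weighing a slow roll, the quoted analysis can be sketched as a toy model. This is only an illustration, not Aurora or Curator code: the names `elect_leaders` and `rolling_upgrade` are hypothetical, no ZooKeeper is involved, and it assumes the lowest-numbered scheduler in each election scheme wins. It only shows that two incompatible election schemes yield two Aurora-level leaders until every node is upgraded.

```python
# Toy model of the slow-roll scenario. Assumption: legacy and Curator
# elections use different ZooKeeper node schemes, so each group of
# schedulers elects its own leader without seeing the other group.

def elect_leaders(schemes):
    """Return {scheme: leader_index}, one leader per election scheme in use.

    `schemes` is a list such as ["curator", "legacy", "legacy"], indexed by
    scheduler instance. Simplification: the lowest-indexed member of each
    group wins its election.
    """
    leaders = {}
    for index, scheme in enumerate(schemes):
        leaders.setdefault(scheme, index)
    return leaders


def rolling_upgrade(node_count=3):
    """Yield (upgraded_count, leaders) for each step of a one-at-a-time roll."""
    schemes = ["legacy"] * node_count
    yield 0, elect_leaders(schemes)
    for index in range(node_count):
        schemes[index] = "curator"
        yield index + 1, elect_leaders(schemes)


if __name__ == "__main__":
    for upgraded, leaders in rolling_upgrade():
        # More than one leader means both will try to read the replicated
        # log, and the loser restarts (flaps).
        status = "stable" if len(leaders) == 1 else "contended, expect flapping"
        print(f"{upgraded} upgraded: leaders={leaders} ({status})")
```

With three schedulers the model reports a single leader before the roll and after the third upgrade, and two competing leaders in between, which matches the recommendation to take the whole cluster down rather than roll.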