I managed to deploy this code in a test cluster and observed no issues. I still advocate dropping the old code when we change the default, but I understand the concern that doing so is risky.
On Mon, Aug 29, 2016 at 1:39 PM, John Sirois <jsir...@apache.org> wrote:

> Thanks for the feedback folks! I'll post a flag default switch shortly.
>
> On Wed, Aug 24, 2016 at 12:20 PM, Joshua Cohen <jco...@apache.org> wrote:
>
> > I have this enabled in a test cluster and have not noticed any issues with
> > it yet. I'd like to roll it out to production before we drop the old code
> > though.
>
> Agreed. This deserves caution, and fwict the jvm leader code is ~never in
> the refactor path; so even though I too am eager to delete the code, it is
> not an active refactoring burden.
>
> > On Wed, Aug 24, 2016 at 1:10 PM, Zameer Manji <zma...@apache.org> wrote:
> >
> >> Could we change the default and drop the old code at the same time? I
> >> don't see any benefit of letting that hang around.
> >>
> >> I have not tested this code yet, but I hope to do it soon.
> >>
> >> On Wed, Aug 24, 2016 at 5:19 AM, Erb, Stephan
> >> <stephan....@blue-yonder.com> wrote:
> >>
> >> > The curator backend has been working well for us so far. I believe it is
> >> > safe to make it the default for the next release, and to drop the old
> >> > code in the release after that.
> >> >
> >> > *From: *John Sirois <jsir...@apache.org>
> >> > *Reply-To: *"u...@aurora.apache.org" <u...@aurora.apache.org>, "jsir...@apache.org" <jsir...@apache.org>
> >> > *Date: *Thursday 7 July 2016 at 01:13
> >> > *To: *Martin Hrabovčin <martin.hrabov...@gmail.com>
> >> > *Cc: *"dev@aurora.apache.org" <dev@aurora.apache.org>, Jake Farrell <jfarr...@apache.org>, "u...@aurora.apache.org" <u...@aurora.apache.org>
> >> > *Subject: *Re: [FEEDBACK] Transitioning Aurora leader election to Apache Curator (`-zk_use_curator`)
> >> >
> >> > Now that 0.15.0 has been released, I thought I'd check in on any progress
> >> > folks have made with testing/deploying the 0.14.0+ with the Aurora
> >> > Scheduler `-zk_use_curator` flag in-place.
> >> >
> >> > There has been 1 fix that will go out in the 0.16.0 release to reduce
> >> > logger noise on shutdown [1][2] but I have heard no negative (or positive)
> >> > feedback otherwise.
> >> >
> >> > [1] https://issues.apache.org/jira/browse/AURORA-1729
> >> > [2] https://reviews.apache.org/r/49578/
> >> >
> >> > On Thu, Jun 16, 2016 at 1:18 PM, John Sirois <jsir...@apache.org> wrote:
> >> >
> >> > On Thu, Jun 16, 2016 at 12:03 AM, Martin Hrabovčin <martin.hrabov...@gmail.com> wrote:
> >> >
> >> > How should be this flag rolled to existing running cluster? Can it be done
> >> > using rolling update instance by instance or we need to stop the whole
> >> > cluster and then bring all nodes with new flag?
> >> >
> >> > I recommend a whole cluster down, upgrade + new flag, up.
> >> >
> >> > A rolling update should work, but will likely be rocky. My analysis:
> >> >
> >> > The Aurora leader election consists of 2 components, the actual leader
> >> > election and the resulting advertisement by the leader of itself as the
> >> > Aurora service endpoint. These 2 components each use zookeeper and of the
> >> > 2 I only ensured that the advertisement was compatible with old releases
> >> > (old clients). The leader election portion is completely internal to the
> >> > Aurora scheduler instances vying for leadership and, under Curator, uses a
> >> > different (enhanced), zookeeper node scheme. As a result, this is what
> >> > could happen in a slow roll:
> >> >
> >> > before upgrade: 0: old-lead, 1: old-follow, 2: old-follow
> >> >
> >> > upgrade 0: new-lead, 1: old-lead, 2: old-follow
> >> >
> >> > Here, node 0 will see itself as leader and nodes 1 and 2 will see node 1
> >> > as leader. The result will be both node 0 and node 1 attempting to read
> >> > the mesos distributed log. Now the log uses its own leader election and
> >> > the reader must be the leader as things stand, so the Aurora-level
> >> > leadership "tie" will be broken by one of the 2 Aurora-level leaders
> >> > failing to become the mesos distributed log leader, and that node will
> >> > restart its lifecycle - ie flap. This will continue to be the case with
> >> > second node upgrade and will not stabilize until the 3rd node is upgraded.
> >> >
> >> > 2016-06-16 5:03 GMT+02:00 Jake Farrell <jfarr...@apache.org>:
> >> >
> >> > +1, will enable on our test clusters to help verify
> >> >
> >> > -Jake
> >> >
> >> > On Tue, Jun 14, 2016 at 7:43 PM, John Sirois <jsir...@apache.org> wrote:
> >> >
> >> > > I'd like to move forward with
> >> > > https://issues.apache.org/jira/browse/AURORA-1669 asap; ie: removing
> >> > > legacy (Twitter) commons zookeeper libraries used for Aurora leader
> >> > > election in favor of Apache Curator libraries. The change submitted in
> >> > > https://reviews.apache.org/r/46286/ is now live in Aurora 0.14.0 and
> >> > > Apache Curator based service discovery can be enabled with the Aurora
> >> > > scheduler flag `-zk_use_curator`. I'd like feedback from users who
> >> > > enable this option. If you have a test cluster where you can enable
> >> > > `-zk_use_curator` and exercise leader failure and failover, I'd be
> >> > > grateful. If you have moved to using this option in production with
> >> > > demonstrable improvements or even maintenance of status quo, I'd also be
> >> > > grateful for this news. If you've found regressions or new bugs, I'd
> >> > > love to know about those as well.
> >> > >
> >> > > Thanks in advance to all those who find time to test this out on real
> >> > > systems!
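
For anyone weighing a rolling update against a full restart, the slow-roll analysis quoted in this thread can be sketched as a toy Python model. This is only an illustration under assumptions, not Aurora or Curator code: the function names `leaders` and `rolling_upgrade` are hypothetical, and the model assumes that within each ZooKeeper node scheme the lowest-numbered scheduler wins, which real elections do not guarantee. It models only the Aurora-level election, not the mesos distributed log election that breaks the tie.

```python
from typing import Dict, List


def leaders(schemes: List[str]) -> Dict[str, int]:
    """Return the node elected leader within each election scheme.

    Assumption for illustration only: inside one scheme, the
    lowest-numbered node wins. Old and new schemes cannot see each
    other, so each scheme with at least one member elects its own leader.
    """
    elected: Dict[str, int] = {}
    for node, scheme in enumerate(schemes):
        # Record only the first (lowest-numbered) node seen per scheme.
        elected.setdefault(scheme, node)
    return elected


def rolling_upgrade(node_count: int) -> List[Dict[str, int]]:
    """Upgrade nodes one at a time, recording the leaders after each step."""
    schemes = ["old"] * node_count
    history = [leaders(schemes)]
    for node in range(node_count):
        schemes[node] = "new"  # this node restarts with -zk_use_curator
        history.append(leaders(schemes))
    return history


if __name__ == "__main__":
    for step, elected in enumerate(rolling_upgrade(3)):
        state = "stable" if len(elected) == 1 else "two leaders, expect flapping"
        print(f"after {step} upgrade(s): {elected} -> {state}")
```

In this model, every intermediate step of a 3-node roll has one leader per scheme, and only the final upgrade returns the cluster to a single leader, which matches the recommendation to take the whole cluster down, upgrade with the new flag, and bring it back up.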