Re: [FEEDBACK] Transitioning Aurora leader election to Apache Curator (`-zk_use_curator`)

John Sirois Mon, 29 Aug 2016 13:40:13 -0700

Thanks for the feedback folks! I'll post a flag default switch shortly.

On Wed, Aug 24, 2016 at 12:20 PM, Joshua Cohen <[email protected]> wrote:


> I have this enabled in a test cluster and have not noticed any issues with
> it yet. I'd like to roll it out to production before we drop the old code
> though.
>

Agreed.  This deserves caution, and from what I can tell the JVM leader code
is almost never in the refactor path, so even though I too am eager to delete
the code, it is not an active refactoring burden.


> On Wed, Aug 24, 2016 at 1:10 PM, Zameer Manji <[email protected]> wrote:
>
>> Could we change the default and drop the old code at the same time? I
>> don't see any benefit in letting that hang around.
>>
>> I have not tested this code yet, but I hope to do it soon.
>>
>> On Wed, Aug 24, 2016 at 5:19 AM, Erb, Stephan <
>> [email protected]>
>> wrote:
>>
>> > The curator backend has been working well for us so far. I believe it is
>> > safe to make it the default for the next release, and to drop the old
>> > code in the release after that.
>> >
>> >
>> >
>> > *From: *John Sirois <[email protected]>
>> > *Reply-To: *"[email protected]" <[email protected]>, "
>> > [email protected]" <[email protected]>
>> > *Date: *Thursday 7 July 2016 at 01:13
>> > *To: *Martin Hrabovčin <[email protected]>
>> > *Cc: *"[email protected]" <[email protected]>, Jake Farrell <
>> > [email protected]>, "[email protected]" <[email protected]>
>> > *Subject: *Re: [FEEDBACK] Transitioning Aurora leader election to Apache
>> > Curator (`-zk_use_curator`)
>> >
>> >
>> >
>> > Now that 0.15.0 has been released, I thought I'd check in on any
>> > progress folks have made with testing/deploying 0.14.0+ with the Aurora
>> > Scheduler `-zk_use_curator` flag in place.
>> >
>> > There has been 1 fix that will go out in the 0.16.0 release to reduce
>> > logger noise on shutdown [1][2], but I have heard no negative (or
>> > positive) feedback otherwise.
>> >
>> >
>> >
>> > [1] https://issues.apache.org/jira/browse/AURORA-1729
>> >
>> > [2] https://reviews.apache.org/r/49578/
>> >
>> >
>> >
>> > On Thu, Jun 16, 2016 at 1:18 PM, John Sirois <[email protected]>
>> wrote:
>> >
>> >
>> >
>> >
>> >
>> > On Thu, Jun 16, 2016 at 12:03 AM, Martin Hrabovčin <
>> > [email protected]> wrote:
>> >
>> > How should this flag be rolled out to an existing running cluster? Can
>> > it be done using a rolling update, instance by instance, or do we need
>> > to stop the whole cluster and then bring all nodes up with the new flag?
>> >
>> >
>> >
>> > I recommend taking the whole cluster down, upgrading and adding the new
>> > flag, then bringing it back up.
>> >
>> >
>> >
>> > A rolling update should work, but will likely be rocky.  My analysis:
>> >
>> >
>> >
>> > The Aurora leader election consists of 2 components: the actual leader
>> > election and the resulting advertisement by the leader of itself as the
>> > Aurora service endpoint.  Each of these 2 components uses ZooKeeper, and
>> > of the 2 I only ensured that the advertisement was compatible with old
>> > releases (old clients). The leader election portion is completely
>> > internal to the Aurora scheduler instances vying for leadership and,
>> > under Curator, uses a different (enhanced) ZooKeeper node scheme.  As a
>> > result, this is what could happen in a slow roll:
>> >
>> >
>> >
>> > before upgrade: 0: old-lead, 1: old-follow, 2: old-follow
>> >
>> > upgrade 0: new-lead, 1: old-lead, 2: old-follow
>> >
>> >
>> >
>> > Here, node 0 will see itself as leader, while nodes 1 and 2 will see
>> > node 1 as leader. The result will be both node 0 and node 1 attempting
>> > to read the Mesos distributed log.  The log uses its own leader election
>> > and, as things stand, the reader must be the log leader, so the
>> > Aurora-level leadership "tie" will be broken by one of the 2
>> > Aurora-level leaders failing to become the Mesos distributed log leader;
>> > that node will restart its lifecycle, i.e. flap.  This will continue
>> > after the second node is upgraded and will not stabilize until the 3rd
>> > node is upgraded.
>> >
>> >
>> >
>> >
>> >
>> > 2016-06-16 5:03 GMT+02:00 Jake Farrell <[email protected]>:
>> >
>> > +1, will enable on our test clusters to help verify
>> >
>> > -Jake
>> >
>> >
>> > On Tue, Jun 14, 2016 at 7:43 PM, John Sirois <[email protected]>
>> wrote:
>> >
>> > > I'd like to move forward with
>> > > https://issues.apache.org/jira/browse/AURORA-1669 asap, i.e. removing
>> > > the legacy (Twitter) commons ZooKeeper libraries used for Aurora leader
>> > > election in favor of Apache Curator libraries. The change submitted in
>> > > https://reviews.apache.org/r/46286/ is now live in Aurora 0.14.0, and
>> > > Apache Curator based service discovery can be enabled with the Aurora
>> > > scheduler flag `-zk_use_curator`.  I'd like feedback from users who
>> > > enable this option.  If you have a test cluster where you can enable
>> > > `-zk_use_curator` and exercise leader failure and failover, I'd be
>> > > grateful. If you have moved to using this option in production with
>> > > demonstrable improvements or even maintenance of the status quo, I'd
>> > > also be grateful for this news. If you've found regressions or new
>> > > bugs, I'd love to know about those as well.
>> > >
>> > > Thanks in advance to all those who find time to test this out on real
>> > > systems!
>> > >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>>
>
>
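
For anyone weighing a rolling update, the slow-roll scenario in the quoted
analysis can be sketched as a toy model. This is only an illustration under
stated assumptions, not Aurora code: it does not talk to ZooKeeper or the
Mesos log, the names `aurora_leaders` and `slow_roll` are made up for this
example, and "lowest node index wins" is a stand-in for however each election
scheme actually orders its candidates. The one thing it does model is that
old and new schedulers elect leaders in separate node schemes and so cannot
see each other.

```python
from typing import Dict, List


def aurora_leaders(schemes: List[str]) -> Dict[str, int]:
    """Return the Aurora-level leader (node index) for each election scheme.

    Assumption: within one scheme the lowest-indexed node wins. Nodes on
    different schemes never contend with each other, so each scheme that has
    at least one member produces its own leader.
    """
    leaders: Dict[str, int] = {}
    for node, scheme in enumerate(schemes):
        # The first node seen for a scheme becomes that scheme's leader.
        leaders.setdefault(scheme, node)
    return leaders


def slow_roll(node_count: int = 3) -> List[Dict[str, int]]:
    """Upgrade nodes one at a time and record the leaders after each step."""
    schemes = ["old"] * node_count
    history = [aurora_leaders(schemes)]  # state before any upgrade
    for node in range(node_count):
        schemes[node] = "new"  # restart this node with the new flag
        history.append(aurora_leaders(schemes))
    return history


if __name__ == "__main__":
    for step, leaders in enumerate(slow_roll()):
        # More than one Aurora-level leader means two schedulers will both
        # try to read the distributed log, and one of them will flap.
        status = "stable" if len(leaders) == 1 else "two leaders, expect flapping"
        print(f"after {step} upgrade(s): {leaders} ({status})")
```

In this model there are two Aurora-level leaders after the first and second
upgrades and a single leader again only once all three nodes run the new
scheme, which is why a full stop, upgrade and start is the smoother path.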
