Now that 0.15.0 has been released, I thought I'd check in on any progress folks have made testing or deploying 0.14.0+ with the Aurora scheduler `-zk_use_curator` flag in place. One fix to reduce logger noise on shutdown [1][2] will go out in the 0.16.0 release, but otherwise I have heard no feedback, negative or positive.

[1] https://issues.apache.org/jira/browse/AURORA-1729
[2] https://reviews.apache.org/r/49578/

On Thu, Jun 16, 2016 at 1:18 PM, John Sirois <jsir...@apache.org> wrote:

> On Thu, Jun 16, 2016 at 12:03 AM, Martin Hrabovčin <martin.hrabov...@gmail.com> wrote:
>
>> How should be this flag rolled to existing running cluster? Can it be
>> done using rolling update instance by instance or we need to stop the whole
>> cluster and then bring all nodes with new flag?
>
> I recommend a whole cluster down, upgrade + new flag, up.
>
> A rolling update should work, but will likely be rocky. My analysis:
>
> The Aurora leader election consists of 2 components, the actual leader
> election and the resulting advertisement by the leader of itself as the
> Aurora service endpoint. These 2 components each use zookeeper and of the
> 2 I only ensured that the advertisement was compatible with old releases
> (old clients). The leader election portion is completely internal to the
> Aurora scheduler instances vying for leadership and, under Curator, uses a
> different (enhanced), zookeeper node scheme. As a result, this is what
> could happen in a slow roll:
>
> before upgrade: 0: old-lead, 1: old-follow, 2: old-follow
> upgrade 0: new-lead, 1: old-lead, 2: old-follow
>
> Here, node 0 will see itself as leader and nodes 1 and 2 will see node 1
> as leader. The result will be both node 0 and node 1 attempting to read the
> mesos distributed log. Now the log uses its own leader election and the
> reader must be the leader as things stand, so the Aurora-level leadership
> "tie" will be broken by one of the 2 Aurora-level leaders failing to become
> the mesos distributed log leader, and that node will restart its lifecycle
> - ie flap. This will continue to be the case with second node upgrade and
> will not stabilize until the 3rd node is upgraded.
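
For anyone weighing a rolling update, the slow-roll analysis quoted above can be sketched as a toy model. This is only an illustration under stated assumptions, not Aurora or Curator code: the names `leaders` and `slow_roll` are hypothetical, nodes are upgraded in index order, and within one election scheme the lowest-numbered participant is assumed to win. It models only the Aurora-level election, not the mesos distributed log tie-break.

```python
from typing import Dict, List

# Labels for the two incompatible election schemes (illustrative names only).
OLD, CURATOR = "old", "curator"


def leaders(schemes: List[str]) -> Dict[str, int]:
    """Return the node each election scheme would pick, keyed by scheme.

    Assumption: within a scheme the lowest-numbered participant wins. Nodes
    on different schemes cannot see each other, so each scheme elects its
    own leader independently.
    """
    elected: Dict[str, int] = {}
    for node, scheme in enumerate(schemes):
        elected.setdefault(scheme, node)
    return elected


def slow_roll(node_count: int) -> List[Dict[str, int]]:
    """Upgrade nodes one at a time, in index order, recording leaders per step."""
    steps = []
    for upgraded in range(node_count + 1):
        schemes = [CURATOR] * upgraded + [OLD] * (node_count - upgraded)
        steps.append(leaders(schemes))
    return steps


if __name__ == "__main__":
    for upgraded, elected in enumerate(slow_roll(3)):
        status = "stable" if len(elected) == 1 else "two leaders, flapping likely"
        print(f"{upgraded} upgraded: {elected} -> {status}")
```

In this model a 3-node cluster has a single Aurora-level leader only before the roll starts and after the last node is upgraded; every intermediate step has one leader per scheme, which is the window in which the flapping described above would occur.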
>
>> 2016-06-16 5:03 GMT+02:00 Jake Farrell <jfarr...@apache.org>:
>>
>>> +1, will enable on our test clusters to help verify
>>>
>>> -Jake
>>>
>>> On Tue, Jun 14, 2016 at 7:43 PM, John Sirois <jsir...@apache.org> wrote:
>>>
>>> > I'd like to move forward with
>>> > https://issues.apache.org/jira/browse/AURORA-1669 asap; ie: removing
>>> > legacy (Twitter) commons zookeeper libraries used for Aurora leader
>>> > election in favor of Apache Curator libraries. The change submitted in
>>> > https://reviews.apache.org/r/46286/ is now live in Aurora 0.14.0 and
>>> > Apache Curator based service discovery can be enabled with the Aurora
>>> > scheduler flag `-zk_use_curator`. I'd like feedback from users who
>>> > enable this option. If you have a test cluster where you can enable
>>> > `-zk_use_curator` and exercise leader failure and failover, I'd be
>>> > grateful. If you have moved to using this option in production with
>>> > demonstrable improvements or even maintenance of status quo, I'd also
>>> > be grateful for this news. If you've found regressions or new bugs,
>>> > I'd love to know about those as well.
>>> >
>>> > Thanks in advance to all those who find time to test this out on real
>>> > systems!