Re: [DISCUSS] VR upgrade downtime reduction

Daan Hoogland Wed, 07 Feb 2018 00:59:18 -0800

Reading all the reactions I am getting wary of all the possible solutions
that we have.
 We do have a fragile VR and Remi's way seems the only one to stabilise it.
It also answers the question on which of my two tactics we should follow.
 Wido's objection may be valid, but services that are not started are not
crashing and thus should not hinder him.
 As for Wei's changes, I think the most important one is in the PR I ported
forward to master, using his older commit. I mentioned it in
>     [1] https://github.com/apache/cloudstack/pull/2435
I am looking forward to any of your PRs as well, Wei.


 Making all VRs redundant is a bit of a hack and the biggest risk in it is
making sure that only one will get started.

 There is one point I'd like consensus on: we have only one system
template, and we are well served by letting it have only one form as VR. Do
we agree on that?

comments, flames, questions, regards,


On Tue, Feb 6, 2018 at 9:04 PM, Wei ZHOU <ustcweiz...@gmail.com> wrote:

> Hi Remi,
>
> Actually in our fork, there are more changes than restartnetwork and
> restart vpc, similar as your changes.
> (1) edit networks from offering with single VR to offerings with RVR, will
> hack VR (set new guest IP, start keepalived and conntrackd, blablabla)
> (2) restart vpc from single VR to RVR. similar changes will be made.
> The downtime is around 5s. However, these changes are based on 4.7.1; we
> are not sure if they still work in 4.11.
>
> We have lots of changes; we will port them to 4.11 LTS and create
> PRs in the next months.
>
> -Wei
>
>
> 2018-02-06 14:47 GMT+01:00 Remi Bergsma <rberg...@schubergphilis.com>:
>
> > Hi Daan,
> >
> > In my opinion the biggest issue is the fact that there are a lot of
> > different code paths: VPC versus non-VPC, VPC versus redundant-VPC, etc.
> > That's why you cannot simply switch from a single VPC to a redundant VPC
> > for example.
> >
> > For SBP, we mitigated that in Cosmic by converting all non-VPCs to a VPC
> > with a single tier and made sure all features are supported. Next we
> merged
> > the single and redundant VPC code paths. The idea here is that redundancy
> > or not should only be a difference in the number of routers. Code should
> be
> > the same. A single router, is also "master" but there just is no
> "backup".
> >
> > That simplifies things A LOT, as keepalived is now the master of the
> whole
> > thing. No more assigning ip addresses in Python, but leave that to
> > keepalived instead. Lots of code deleted. Easier to maintain, way more
> > stable. We just released Cosmic 6 that has this feature and are now
> rolling
> > it out in production. Looking good so far. This change unlocks a lot of
> > possibilities, like live upgrading from a single VPC to a redundant one
> > (and back). In the end, if the redundant VPC is rock solid, you most
> likely
> > don't even want single VPCs any more. But that will come.
> >
> > As I said, we're rolling this out as we speak. In a few weeks when
> > everything is upgraded I can share what we learned and how well it works.
> > CloudStack could use a similar approach.
> >
> > Kind Regards,
> > Remi
> >
> >
> >
> > On 05/02/2018, 16:44, "Daan Hoogland" <daan.hoogl...@gmail.com> wrote:
> >
> >     Hi devs,
> >
> >     I have recently (re-)submitted two PRs, one by Wei [1] and one by
> Remi
> > [2],
> >     that reduce downtime for redundant routers and redundant VPCs
> > respectively.
> >     (please review those)
> >     Now from customers we hear that they also want to reduce downtime for
> >     regular VRs so as we discussed this we came to two possible solutions
> > that
> >     we want to implement one of:
> >
> >     1. start and configure a new router before destroying the old one and
> > then
> >     as a last minute action stop the old one.
> >     2. make all routers start up redundancy services but for regular
> > routers
> >     start only one until an upgrade is required at which time a new,
> second
> >     router can be started before killing the old one.
> >
> >     obviously both solutions have their merits, so I want to have your
> > input
> >     to make the broadest supported implementation.
> >     -1 means there will be an overlap or a small delay and
> >     interruption of service.
> >     +1 It can be argued, "they got what they paid for".
> >     -2 means an overhead in memory usage by the router by the extra
> >     services running on it.
> >     +2 the number of router-varieties will be further reduced.
> >
> >     -1&-2 We have to deal with potentially large upgrade steps from way
> > before
> >     the cloudstack era even and might be stuck to 1 because of that,
> > needing to
> >     hack around it. Any dealing with older VRs, pre 4.5 and especially
> pre
> > 4.0
> >     will be hard.
> >
> >     I am not cross posting though this might be one of these occasions
> > where it
> >     is appropriate to include users@. Just my puristic inhibitions.
> >
> >     Of course I have preferences but can you share your thoughts, please?
> >     
> >     And don't forget to review Wei's [1] and Remi's [2] work please.
> >
> >     [1] https://github.com/apache/cloudstack/pull/2435
> >     [2] https://github.com/apache/cloudstack/pull/2436
> >
> >     --
> >     Daan
> >
> >
> >
>



-- 
Daan
