Re: [users] Avoid rebooting payload modules after losing system controller

Mathivanan Naickan Palanivelu Tue, 22 Dec 2015 03:09:07 -0800

Hi Tony,

Yes, it has been agreed to support more than two controllers in a phased manner.


The feature tracked in ticket http://sourceforge.net/p/opensaf/tickets/79/ will be
available as part of the next major release, 5.0, in April 2016.
Ticket #79 will provide support for configuring spare system controllers,
i.e. every time a controller dies, a spare will be promoted to ACTIVE.
In your case, one of the line cards can be configured as a spare!
Please see the attachment in the ticket for an overview of the feature.
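
As a rough illustration of the promotion idea (my own sketch in Python, not
OpenSAF code and not taken from the ticket): the Node class, the preference
field and the function names below are hypothetical, and the real election
in #79 may work differently. The sketch only shows that when the active
controller dies, the surviving spare with the best preference takes over.

```python
from dataclasses import dataclass
from typing import List, Optional

ACTIVE = "ACTIVE"
SPARE = "SPARE"
FAILED = "FAILED"


@dataclass
class Node:
    name: str
    role: str
    preference: int  # hypothetical field: lower value = preferred controller


def promote_spare(nodes: List[Node]) -> Optional[Node]:
    """Return the ACTIVE node, promoting the preferred spare if none is active."""
    for node in nodes:
        if node.role == ACTIVE:
            return node
    spares = [n for n in nodes if n.role == SPARE]
    if not spares:
        return None  # no controller left at all
    chosen = min(spares, key=lambda n: n.preference)
    chosen.role = ACTIVE
    return chosen


def fail_node(nodes: List[Node], name: str) -> Optional[Node]:
    """Mark the named node as failed, then re-run the promotion step."""
    for node in nodes:
        if node.name == name:
            node.role = FAILED
    return promote_spare(nodes)
```

With a server as the preferred controller and two line cards as spares,
failing the server promotes the first line card, and failing that one
promotes the second.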


At the same time, it has been agreed to also merge (in 5.0) the solution in
https://sourceforge.net/u/anders-w/opensaf-headless/ (currently
maintained as a fork) into the main branch. This feature
will be configurable, so it can be turned on or off.
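
To make the on/off behaviour concrete, here is a small Python sketch (again
my own illustration, not OpenSAF code). It is based on the
IMMSV_SC_ABSENCE_ALLOWED setting that Anders Widell describes in the quoted
mail further down: the number of seconds the payloads wait for the system
controllers to come back. The function name, the string return values, and
the assumptions that a value of 0 means "feature off" and that a payload
reboots once the wait has expired are mine.

```python
def payload_action(absence_allowed_s: int, seconds_without_sc: float) -> str:
    """Decide what a payload does while no system controller is present.

    absence_allowed_s: configured wait in seconds (assumed: 0 = feature off).
    seconds_without_sc: time elapsed since the last controller was lost.
    """
    if absence_allowed_s <= 0:
        # Feature turned off: reboot as soon as the controllers are lost.
        return "reboot"
    if seconds_without_sc < absence_allowed_s:
        # Still inside the allowed window: keep the applications running.
        return "keep_running"
    # Assumed behaviour: the controllers did not return in time, so reboot.
    return "reboot"
```

For example, with a wait of 900 seconds a payload that has been without a
controller for 10 seconds keeps running, while with the feature off it
reboots right away.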


Subsequently, 5.1 will have support for a mix of multiple standbys + spares.

The details of this have been updated in the following two tickets:
http://sourceforge.net/p/opensaf/tickets/439/
and
http://sourceforge.net/p/opensaf/tickets/1170/


Cheers,
Mathi.






----- [email protected] wrote:

> Hi Mathi,
> Any updates on this topic?
> thanks
> —
> tony
> 
> 
> > On Oct 14, 2015, at 7:50 AM, Mathivanan Naickan Palanivelu
> <[email protected]> wrote:
> > 
> > Tony,
> > 
> > We (in the TLC) have discussed this before. At this point in time
> > you indeed have the solution that AndersW has pointed to.
> > 
> > There are also other discussions going on w.r.t multiple-standbys
> etc
> > for the next major release. Will tag you in the appropriate ticket
> > (one such ticket is http://sourceforge.net/p/opensaf/tickets/1170/)
> > once those discussions conclude.
> > 
> > Cheers,
> > Mathi.
> > 
> > ----- [email protected] wrote:
> > 
> >> Thanks all, this is a good discussion.
> >> 
> >> Let me again state my use case, which as I said I don’t think is
> >> unique, because selfishly I want to make sure that this is covered
> >> :-)
> >> 
> >> I want to keep the system running in the case where both
> controller
> >> fail, or more specifically in my case, when the only server in the
> >> system fails (cost reduced system).
> >> 
> >> The "don’t reboot until a server re-appears" would work for me.  I
> >> think it’s a poorer solution but it would be workable (Mathi’s
> >> suggestion).  Especially if it’s available sooner.
> >> 
> >> A better option is to allow one of the linecards to take over the
> >> controller role when the preferred one dies.  Since linecards are
> >> assumed to be transitory (they may or may not be there or could be
> >> removed) so I don’t want to have to pick a linecard for this role
> at
> >> boot time.   Much better would be to allow some number (more than
> 2)
> >> nodes to be controllers (e.g. Active, Standby, Spare, Spare …). 
> Then
> >> OSAF takes responsibility for electing the next Active/Standby
> when
> >> necessary.  There would have to be a concept of preference such
> that
> >> the Server node is chosen given a choice.
> >> 
> >> For the solutions discussed so far…
> >> 
> >> Making the OSAF processes restartable and be active on either
> >> controller, is good and will increase the MTBF of an OSAF system. 
> >> However it won’t fix my problem if there are still only two
> >> controllers allowed.
> >> 
> >> I’m not familiar with the “cattle” concept.
> >> 
> >> I wasn’t advocating the “headless” solution only to the extent that
> it
> >> solved my immediate problem.  Even if we did have a “headless” mode
> it
> >> would be a temporary situation until the failed controller could
> be
> >> replaced.
> >> 
> >> thanks
> >> —
> >> tony
> >> 
> >> 
> >>> On Oct 14, 2015, at 6:44 AM, Mathivanan Naickan Palanivelu
> >> <[email protected]> wrote:
> >>> 
> >>> In a clustered environment, all nodes need to be in the same
> >> consistent view
> >>> from a membership perspective. 
> >>> So, loss of a leader will indeed result in state changes to the
> >> nodes
> >>> following the leader. Therefore there cannot be independent
> >> lifecycles
> >>> for some nodes and other nodes. 
> >>> B.T.W the restart mentioned below meant - 'transparent restart 
> >>> of OpenSAF' without affecting applications and for this the
> >> application's
> >>> CLC-CLI scripts need to handle.
> >>> 
> >>> While we could be inclined to support (unique!) use-case such as
> >> this, 
> >>> it is the solution(headless fork) that would be under the lens!
> ;-)
> >> 
> >>> Let's discuss this!
> >>> 
> >>> Cheers,
> >>> Mathi.
> >>> 
> >>> 
> >>> ----- [email protected] wrote:
> >>> 
> >>>> Yes, this is yet another approach. But it is also another
> use-case
> >> for
> >>>> 
> >>>> the headless feature. When we have moved the system controllers
> out
> >> of
> >>>> 
> >>>> the cluster (into the cloud infrastructure), I would expect
> >>>> controllers 
> >>>> and payloads to have independent life cycles. You have servers
> >> (i.e. 
> >>>> system controllers), and clients (payloads). They can be
> installed
> >> and
> >>>> 
> >>>> upgraded separately from each other, and I wouldn't expect a
> >> restart
> >>>> of 
> >>>> the servers to cause all the clients to restart as well, in the
> >> same
> >>>> way 
> >>>> as I don't expect my web browser to restart just because
> >> the
> >>>> web 
> >>>> server has crashed.
> >>>> 
> >>>> / Anders Widell
> >>>> 
> >>>> On 10/13/2015 03:54 PM, Mathivanan Naickan Palanivelu wrote:
> >>>>> I don't think this is a case of cattle! Even in those scenarios
> >>>>> the cloud management stacks, the  "**controller" software
> >> themselves
> >>>> are 'placed' on physical nodes
> >>>>> in appropriate redundancy models and not inside those cattle
> VMs!
> >>>>> 
> >>>>> I think the case here is about avoiding rebooting of the node!
> >>>>> This can be achieved by setting the NOACTIVE timer to a longer
> >> value
> >>>> till OpenSAF on the controller comes back up.
> >>>>> Upon detecting that the controllers are up, some entity on the
> >> local
> >>>> node restarts OpenSAF (/etc/init.d/opensafd restart)
> >>>>> And ensure the CLC-CLI scripts of the applications
> differentiate
> >>>> usual restart versus this spoof-restart!
> >>>>> 
> >>>>> Mathi.
> >>>>> 
> >>>>>> -----Original Message-----
> >>>>>> From: Anders Widell [mailto:[email protected]]
> >>>>>> Sent: Tuesday, October 13, 2015 5:36 PM
> >>>>>> To: Anders Björnerstedt; Tony Hart
> >>>>>> Cc: [email protected]
> >>>>>> Subject: Re: [users] Avoid rebooting payload modules after
> >> losing
> >>>> system
> >>>>>> controller
> >>>>>> 
> >>>>>> Yes, I agree that the best fit for this feature is an
> >> application
> >>>> using either the
> >>>>>> NWay-Active or the No-Redundancy models, and where you view
> the
> >>>>>> system more as a collection of nodes rather than as a cluster.
> >> This
> >>>> kind of
> >>>>>> architecture is quite common when you write applications for
> >> cloud.
> >>>> The
> >>>>>> redundancy models are suitable for scaling, and the
> architecture
> >>>> fits into the
> >>>>>> "cattle" philosophy which is common in cloud.
> >>>>>> Such an application can tolerate any number of node failures,
> >> and
> >>>> the
> >>>>>> remaining nodes would still be able to continue functioning
> and
> >>>> provide their
> >>>>>> service. However, if we put the OpenSAF middleware on the
> nodes
> >> it
> >>>>>> becomes the weakest link, since OpenSAF will reboot all the
> >> nodes
> >>>> just
> >>>>>> because the two controller nodes fail. What a pity on a system
> >> with
> >>>> one
> >>>>>> hundred nodes!
> >>>>>> 
> >>>>>> / Anders Widell
> >>>>>> 
> >>>>>> On 10/13/2015 01:19 PM, Anders Björnerstedt wrote:
> >>>>>>> 
> >>>>>>> On 10/13/2015 12:27 PM, Anders Widell wrote:
> >>>>>>>> The possibility to have more than two system controllers
> (one
> >>>> active
> >>>>>>>> + several standby and/or spare controller nodes) is also
> >>>> something
> >>>>>>>> that has been investigated. For scalability reasons, we
> >> probably
> >>>>>>>> can't turn all nodes into standby controllers in a large
> >> cluster
> >>>> -
> >>>>>>>> but it may be feasible to have a system with one or several
> >>>> standby
> >>>>>>>> controllers and the rest of the nodes are spares that are
> >> ready
> >>>> to
> >>>>>>>> take an active or standby assignment when needed.
> >>>>>>>> 
> >>>>>>>> However, the "headless" feature will still be needed in some
> >>>> systems
> >>>>>>>> where you need dedicated controller node(s).
> >>>>>>> That sounds as if some deployments have a special requirement
> >> that
> >>>> can
> >>>>>>> only be supported by the headless feature.
> >>>>>>> But you also have to say that the headless feature places
> >>>>>>> anti-requirements on the deployments/applications that are to
> >> use
> >>>> it.
> >>>>>>> 
> >>>>>>> For example not needing cluster coherence among the payloads.
> >>>>>>> 
> >>>>>>> If the payloads only run independent application instances
> >> where
> >>>> each
> >>>>>>> instance is implemented at one processor or at least does not
> >>>>>>> communicate in any state-sensitive way with peer processes at
> >>>> other
> >>>>>>> payloads; and no such instance is unique or if it is unique
> it
> >> is
> >>>>>>> still expendable (non critical to the service), then it could
> >>>> work.
> >>>>>>> 
> >>>>>>> It is important that the deployments that end up thinking they
> >> need
> >>>> the
> >>>>>>> headless feature also understand what they lose with the
> >>>> headless
> >>>>>>> feature and that this loss is acceptable for that deployment.
> >>>>>>> 
> >>>>>>> So headless is not a fancy feature needed by some exclusive
> and
> >>>> picky
> >>>>>>> subset of applications.
> >>>>>>> It is a relaxation that drops all requirements on distributed
> >>>>>>> consistency and may be acceptable to some applications with
> >>>> weaker
> >>>>>>> demands so they can accept the anti requirements.
> >>>>>>> 
> >>>>>>> Besides requiring "dedicated" controller nodes, the
> deployment
> >>>> must of
> >>>>>>> course NOT require any *availability* of those dedicated
> >>>> controller
> >>>>>>> nodes, i.e. not have any requirements on service availability
> >> in
> >>>>>>> general.
> >>>>>>> 
> >>>>>>> It may work for some "dumb" applications that are stateless,
> >> or
> >>>> state
> >>>>>>> stable (frozen in state), or have no requirements on
> >> availability
> >>>> of
> >>>>>>> state. In other words some applications that really don't need
> >>>> SAF.
> >>>>>>> 
> >>>>>>> They may still want to use SAF as a way of managing and
> >> monitoring
> >>>> the
> >>>>>>> system when it happens to be healthy, but can live with  long
> >>>> periods
> >>>>>>> of not being able to manage or monitor that system, which can
> >> then
> >>>> be
> >>>>>>> degrading in any way that is possible.
> >>>>>>> 
> >>>>>>> 
> >>>>>>> /AndersBJ
> >>>>>>> 
> >>>>>>> 
> >>>>>>> 
> >>>>>>>> / Anders Widell
> >>>>>>>> 
> >>>>>>>> On 10/13/2015 12:07 PM, Tony Hart wrote:
> >>>>>>>>> Understood.  The assumption is that this is temporary but
> we
> >>>> allow
> >>>>>>>>> the payloads to continue to run (with reduced osaf
> >>>> functionality)
> >>>>>>>>> until a replacement controller is found.  At that point
> they
> >>>> can
> >>>>>>>>> reboot to get the system back into sync.
> >>>>>>>>> 
> >>>>>>>>> Or allow more than 2 controllers in the system so we can
> have
> >>>> one or
> >>>>>>>>> more usually-payload cards be controllers to reduce the
> >>>> probability
> >>>>>>>>> of no-controllers to an acceptable level.
> >>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>>> On Oct 12, 2015, at 11:05 AM, Anders Björnerstedt
> >>>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>> 
> >>>>>>>>>> The headless state is also vulnerable to split-brain
> >>>> scenarios.
> >>>>>>>>>> That is network partitions and joins can occur and will
> not
> >> be
> >>>>>>>>>> detected as such and thus not handled properly (isolated)
> >> when
> >>>> they
> >>>>>>>>>> occur.
> >>>>>>>>>> Basically you can  not be sure you have a continuously
> >>>> coherent
> >>>>>>>>>> cluster while in the headless state.
> >>>>>>>>>> 
> >>>>>>>>>> On paper you may get a very resilient system in the sense
> >> that
> >>>> It
> >>>>>>>>>> "stays up"  and replies on ping etc.
> >>>>>>>>>> But typically a customer wants not just availability but
> >>>> reliable
> >>>>>>>>>> behavior also.
> >>>>>>>>>> 
> >>>>>>>>>> /AndersBj
> >>>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>> -----Original Message-----
> >>>>>>>>>> From: Anders Björnerstedt
> >>>>>> [mailto:[email protected]]
> >>>>>>>>>> Sent: den 12 oktober 2015 16:42
> >>>>>>>>>> To: Anders Widell; Tony Hart;
> >>>> [email protected]
> >>>>>>>>>> Subject: Re: [users] Avoid rebooting payload modules after
> >>>> losing
> >>>>>>>>>> system controller
> >>>>>>>>>> 
> >>>>>>>>>> Note that this headless variant  is a very questionable
> >>>> feature.
> >>>>>>>>>> This for the reasons explained earlier, i.e. you *will* 
> get
> >> a
> >>>>>>>>>> reduction in service availability.
> >>>>>>>>>> It was never accepted into OpenSAF for that reason.
> >>>>>>>>>> 
> >>>>>>>>>> On top of that the unreliability will typically not be
> >>>>>>>>>> explicit/handled. That is the operator will probably not
> >> even
> >>>> know
> >>>>>>>>>> what is working and what is not during the SC absence
> since
> >>>> the
> >>>>>>>>>> alarm/notification  function is gone. No OpenSAF director
> >>>> services
> >>>>>>>>>> are executing.
> >>>>>>>>>> 
> >>>>>>>>>> It is truly a headless system, i.e. a zombie system and
> thus
> >>>> not
> >>>>>>>>>> working at full monitoring and availability functionality.
> >>>>>>>>>> It begs the question of what OpenSAF and SAF is there for
> in
> >>>> the
> >>>>>>>>>> first place.
> >>>>>>>>>> 
> >>>>>>>>>> The SCs don’t have to run any special software and don’t
> >> have
> >>>> to
> >>>>>>>>>> have any special hardware.
> >>>>>>>>>> They do need file system access, at least for a cluster
> >>>> restart,
> >>>>>>>>>> but not necessarily to handle single SC failure.
> >>>>>>>>>> The headless variant when headless is also in that
> >>>>>>>>>> not-able-to-cluster-restart also, but with even less
> >>>> functionality.
> >>>>>>>>>> 
> >>>>>>>>>> An SC can of course run other (non OpenSAF specific)
> >> software. 
> >>>> And
> >>>>>>>>>> the two SCs don’t necessarily have to be symmetric in
> terms
> >> of
> >>>>>>>>>> software.
> >>>>>>>>>> 
> >>>>>>>>>> Providing file system access via NFS is typically a non
> >> issue.
> >>>> They
> >>>>>>>>>> have three nodes. Ergo  they should be able to assign two
> of
> >>>> them
> >>>>>>>>>> the role of SC in the OpenSAF domain.
> >>>>>>>>>> 
> >>>>>>>>>> /AndersBj
> >>>>>>>>>> 
> >>>>>>>>>> -----Original Message-----
> >>>>>>>>>> From: Anders Widell [mailto:[email protected]]
> >>>>>>>>>> Sent: den 12 oktober 2015 16:08
> >>>>>>>>>> To: Tony Hart; [email protected]
> >>>>>>>>>> Subject: Re: [users] Avoid rebooting payload modules after
> >>>> losing
> >>>>>>>>>> system controller
> >>>>>>>>>> 
> >>>>>>>>>> We have actually implemented something very similar to
> what
> >> you
> >>>> are
> >>>>>>>>>> talking about. With this feature, the payloads can survive
> >>>> without
> >>>>>>>>>> a cluster restart even if both system controllers restart
> >> (or
> >>>> the
> >>>>>>>>>> single system controller, in your case). If you want to
> try
> >> it
> >>>> out,
> >>>>>>>>>> you can clone this Mercurial repository:
> >>>>>>>>>> 
> >>>>>>>>>> https://sourceforge.net/u/anders-w/opensaf-headless/
> >>>>>>>>>> 
> >>>>>>>>>> To enable the feature, set the variable
> >>>>>> IMMSV_SC_ABSENCE_ALLOWED in
> >>>>>>>>>> immd.conf to the amount of seconds you wish the payloads
> to
> >>>> wait
> >>>>>>>>>> for the system controllers to come back. Note: we have
> only
> >>>>>>>>>> implemented this feature for the "core" OpenSAF services
> >> (plus
> >>>>>>>>>> CKPT), so you need to disable the optional services.
> >>>>>>>>>> 
> >>>>>>>>>> / Anders Widell
> >>>>>>>>>> 
> >>>>>>>>>> On 10/11/2015 02:30 PM, Tony Hart wrote:
> >>>>>>>>>>> We have been using opensaf in our product for a couple of
> >>>> years
> >>>>>>>>>>> now.  One of the issues we have is the fact that payload
> >>>> cards
> >>>>>>>>>>> reboot when the system controllers are lost.  Although
> our
> >>>> payload
> >>>>>>>>>>> card hardware will continue to perform its functions
> whilst
> >>>> the
> >>>>>>>>>>> software is down (which is desirable) the functions that
> >> the
> >>>>>>>>>>> software performs are obviously not performed (which is
> not
> >>>>>>>>>>> desirable).
> >>>>>>>>>>> 
> >>>>>>>>>>> Why would we lose both controllers, surely that is a
> rare
> >>>>>>>>>>> circumstance?  Not if you only have one controller to
> begin
> >>>> with.
> >>>>>>>>>>> Removing the second controller is a significant cost
> saving
> >>>> for us
> >>>>>>>>>>> so we want to support a product that only has one
> >> controller. 
> >>>> The
> >>>>>>>>>>> most significant impediment to that is the loss of
> payload
> >>>>>>>>>>> software functions when the system controller fails.
> >>>>>>>>>>> 
> >>>>>>>>>>> I’m looking for suggestions from this email list as to
> what
> >>>> could
> >>>>>>>>>>> be done for this issue.
> >>>>>>>>>>> 
> >>>>>>>>>>> One suggestion, that would work for us, is if we could
> >>>> convince
> >>>>>>>>>>> the payload card to only reboot when the controller
> >> reappears
> >>>>>>>>>>> after a loss rather than when the loss initially occurs. 
> >> Is
> >>>> that
> >>>>>>>>>>> possible?
> >>>>>>>>>>> 
> >>>>>>>>>>> Another possibility is if we could support more than 2
> >>>>>>>>>>> controllers, for example if we could support 4 (one
> active
> >> and
> >>>> 3
> >>>>>>>>>>> standbys) that would also provide a solution for us (our
> >>>> current
> >>>>>>>>>>> payloads would instead become controllers). I know that
> >> this
> >>>> is
> >>>>>>>>>>> not currently possible with opensaf.
> >>>>>>>>>>> 
> >>>>>>>>>>> thanks for any suggestions,
> >>>>>>>>>>> —
> >>>>>>>>>>> tony
> >>>>>>>>>>> 
> >>>>
> ------------------------------------------------------------------
> >>>>>>>>>>> ----
> >>>>>>>>>>> 
> >>>>>>>>>>> -------- _______________________________________________
> >>>>>>>>>>> Opensaf-users mailing list
> >>>>>>>>>>> [email protected]
> >>>>>>>>>>>
> https://lists.sourceforge.net/lists/listinfo/opensaf-users
> >>>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>> 
> >>>>>>> 
> >>>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>>> 

