Tony,

We (in the TLC) have discussed this before. At this point in time you indeed have the solution that AndersW has pointed to.
There are also other discussions going on w.r.t. multiple standbys etc. for the next major release. I will tag you in the appropriate ticket (one such ticket is http://sourceforge.net/p/opensaf/tickets/1170/) once those discussions conclude.

Cheers,
Mathi.

----- [email protected] wrote:
> Thanks all, this is a good discussion.
>
> Let me again state my use case, which as I said I don't think is unique, because selfishly I want to make sure that this is covered :-)
>
> I want to keep the system running in the case where both controllers fail, or more specifically in my case, when the only server in the system fails (cost-reduced system).
>
> The "don't reboot until a server re-appears" approach would work for me. I think it's a poorer solution but it would be workable (Mathi's suggestion), especially if it's available sooner.
>
> A better option is to allow one of the linecards to take over the controller role when the preferred one dies. Since linecards are assumed to be transitory (they may or may not be there, or could be removed), I don't want to have to pick a linecard for this role at boot time. Much better would be to allow some number (more than 2) of nodes to be controllers (e.g. Active, Standby, Spare, Spare …). Then OSAF takes responsibility for electing the next Active/Standby when necessary. There would have to be a concept of preference such that the Server node is chosen given a choice.
>
> For the solutions discussed so far…
>
> Making the OSAF processes restartable and able to be active on either controller is good and will increase the MTBF of an OSAF system. However it won't fix my problem if there are still only two controllers allowed.
>
> I'm not familiar with the "cattle" concept.
>
> I wasn't advocating the "headless" solution, only to the extent that it solved my immediate problem. Even if we did have a "headless" mode it would be a temporary situation until the failed controller could be replaced.
>
> thanks
> —
> tony
>
> On Oct 14, 2015, at 6:44 AM, Mathivanan Naickan Palanivelu <[email protected]> wrote:
> >
> > In a clustered environment, all nodes need to be in the same consistent view from a membership perspective. So, loss of a leader will indeed result in state changes to the nodes following the leader. Therefore there cannot be independent lifecycles for some nodes and other nodes. B.T.W. the restart mentioned below meant a 'transparent restart of OpenSAF' without affecting applications, and for this the application's CLC-CLI scripts need to handle it.
> >
> > While we could be inclined to support a (unique!) use-case such as this, it is the solution (headless fork) that would be under the lens! ;-) Let's discuss this!
> >
> > Cheers,
> > Mathi.
> >
> > ----- [email protected] wrote:
> >
> >> Yes, this is yet another approach. But it is also another use-case for the headless feature. When we have moved the system controllers out of the cluster (into the cloud infrastructure), I would expect controllers and payloads to have independent life cycles. You have servers (i.e. system controllers) and clients (payloads). They can be installed and upgraded separately from each other, and I wouldn't expect a restart of the servers to cause all the clients to restart as well, in the same way as I don't expect my web browser to restart just because the web server has crashed.
> >>
> >> / Anders Widell
> >>
> >> On 10/13/2015 03:54 PM, Mathivanan Naickan Palanivelu wrote:
> >>> I don't think this is a case of cattle! Even in those scenarios, in the cloud management stacks the "controller" software itself is 'placed' on physical nodes in appropriate redundancy models, and not inside those cattle VMs!
> >>>
> >>> I think the case here is about avoiding rebooting of the node! This can be achieved by setting the NOACTIVE timer to a longer value, until OpenSAF on the controller comes back up. Upon detecting that the controllers are up, some entity on the local node restarts OpenSAF (/etc/init.d/opensafd restart). And ensure the CLC-CLI scripts of the applications differentiate a usual restart versus this spoof-restart!
> >>>
> >>> Mathi.
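To make that suggestion concrete: a CLC-CLI INSTANTIATE script could branch on a flag left behind by whatever entity performs the opensafd restart once the controllers are back (that entity would touch the flag just before restarting OpenSAF). This is only a sketch; the marker file, its path, and the application commands are hypothetical and not part of OpenSAF.

    #!/bin/sh
    # Sketch of a CLC-CLI INSTANTIATE script that tells a normal AMF
    # instantiation apart from the "spoof restart" performed after the
    # controllers come back. The marker file and the application commands
    # are assumptions made for this example; they are not part of OpenSAF.
    SPOOF_MARKER=/var/run/myapp_spoof_restart

    if [ -f "$SPOOF_MARKER" ]; then
        # OpenSAF was restarted underneath a still-running application:
        # re-attach to the existing instance instead of starting a new one.
        rm -f "$SPOOF_MARKER"
        /opt/myapp/bin/myapp --reattach &
    else
        # Ordinary instantiation ordered by AMF: start a fresh instance.
        /opt/myapp/bin/myapp --start &
    fi
    exit 0

The only point of the sketch is that the script must be able to tell an AMF-ordered instantiation apart from the spoof-restart, so the still-running application is re-attached rather than started a second time.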
> >>>
> >>>> -----Original Message-----
> >>>> From: Anders Widell [mailto:[email protected]]
> >>>> Sent: Tuesday, October 13, 2015 5:36 PM
> >>>> To: Anders Björnerstedt; Tony Hart
> >>>> Cc: [email protected]
> >>>> Subject: Re: [users] Avoid rebooting payload modules after losing system controller
> >>>>
> >>>> Yes, I agree that the best fit for this feature is an application using either the NWay-Active or the No-Redundancy model, and where you view the system more as a collection of nodes rather than as a cluster. This kind of architecture is quite common when you write applications for the cloud. The redundancy models are suitable for scaling, and the architecture fits into the "cattle" philosophy which is common in the cloud. Such an application can tolerate any number of node failures, and the remaining nodes would still be able to continue functioning and provide their service. However, if we put the OpenSAF middleware on the nodes it becomes the weakest link, since OpenSAF will reboot all the nodes just because the two controller nodes fail. What a pity on a system with one hundred nodes!
> >>>>
> >>>> / Anders Widell
> >>>>
> >>>> On 10/13/2015 01:19 PM, Anders Björnerstedt wrote:
> >>>>> On 10/13/2015 12:27 PM, Anders Widell wrote:
> >>>>>> The possibility to have more than two system controllers (one active + several standby and/or spare controller nodes) is also something that has been investigated. For scalability reasons we probably can't turn all nodes into standby controllers in a large cluster, but it may be feasible to have a system with one or several standby controllers where the rest of the nodes are spares that are ready to take an active or standby assignment when needed.
> >>>>>>
> >>>>>> However, the "headless" feature will still be needed in some systems where you need dedicated controller node(s).
> >>>>>
> >>>>> That sounds as if some deployments have a special requirement that can only be supported by the headless feature. But you also have to say that the headless feature places anti-requirements on the deployments/applications that are to use it.
> >>>>>
> >>>>> For example, not needing cluster coherence among the payloads.
> >>>>>
> >>>>> If the payloads only run independent application instances, where each instance is implemented on one processor or at least does not communicate in any state-sensitive way with peer processes on other payloads, and no such instance is unique, or if it is unique it is still expendable (non-critical to the service), then it could work.
> >>>>>
> >>>>> It is important that the deployments that end up thinking they need the headless feature also understand what they lose with the headless feature, and that this loss is acceptable for that deployment.
> >>>>>
> >>>>> So headless is not a fancy feature needed by some exclusive and picky subset of applications. It is a relaxation that drops all requirements on distributed consistency and may be acceptable to some applications with weaker demands, so that they can accept the anti-requirements.
> >>>>>
> >>>>> Besides requiring "dedicated" controller nodes, the deployment must of course NOT require any *availability* of those dedicated controller nodes, i.e. not have any requirements on service availability in general.
> >>>>>
> >>>>> It may work for some "dumb" applications that are stateless, or state-stable (frozen in state), or have no requirements on availability of state. In other words, some applications that really don't need SAF.
> >>>>>
> >>>>> They may still want to use SAF as a way of managing and monitoring the system when it happens to be healthy, but can live with long periods of not being able to manage or monitor that system, which can then be degrading in any way that is possible.
> >>>>>
> >>>>> /AndersBJ
> >>>>>
> >>>>>> / Anders Widell
> >>>>>>
> >>>>>> On 10/13/2015 12:07 PM, Tony Hart wrote:
> >>>>>>> Understood. The assumption is that this is temporary, but we allow the payloads to continue to run (with reduced OSAF functionality) until a replacement controller is found. At that point they can reboot to get the system back into sync.
> >>>>>>>
> >>>>>>> Or allow more than 2 controllers in the system, so we can have one or more usually-payload cards be controllers to reduce the probability of no controllers to an acceptable level.
> >>>>>>>
> >>>>>>>> On Oct 12, 2015, at 11:05 AM, Anders Björnerstedt <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> The headless state is also vulnerable to split-brain scenarios. That is, network partitions and joins can occur and will not be detected as such, and thus not handled properly (isolated) when they occur. Basically you cannot be sure you have a continuously coherent cluster while in the headless state.
> >>>>>>>>
> >>>>>>>> On paper you may get a very resilient system in the sense that it "stays up" and replies on ping etc. But typically a customer wants not just availability but reliable behavior also.
> >>>>>>>>
> >>>>>>>> /AndersBj
> >>>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Anders Björnerstedt [mailto:[email protected]]
> >>>>>>>> Sent: 12 October 2015 16:42
> >>>>>>>> To: Anders Widell; Tony Hart; [email protected]
> >>>>>>>> Subject: Re: [users] Avoid rebooting payload modules after losing system controller
> >>>>>>>>
> >>>>>>>> Note that this headless variant is a very questionable feature, for the reasons explained earlier, i.e. you *will* get a reduction in service availability. It was never accepted into OpenSAF for that reason.
> >>>>>>>>
> >>>>>>>> On top of that, the unreliability will typically not be explicit/handled. That is, the operator will probably not even know what is working and what is not during the SC absence, since the alarm/notification function is gone. No OpenSAF director services are executing.
> >>>>>>>>
> >>>>>>>> It is truly a headless system, i.e. a zombie system, and thus not working at full monitoring and availability functionality. It begs the question of what OpenSAF and SAF are there for in the first place.
> >>>>>>>>
> >>>>>>>> The SCs don't have to run any special software and don't have to have any special hardware. They do need file system access, at least for a cluster restart, but not necessarily to handle a single SC failure. The headless variant, when headless, is also in that not-able-to-cluster-restart state, but with even less functionality.
> >>>>>>>>
> >>>>>>>> An SC can of course run other (non-OpenSAF-specific) software. And the two SCs don't necessarily have to be symmetric in terms of software.
> >>>>>>>>
> >>>>>>>> Providing file system access via NFS is typically a non-issue. They have three nodes. Ergo they should be able to assign two of them the role of SC in the OpenSAF domain.
> >>>>>>>>
> >>>>>>>> /AndersBj
> >>>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Anders Widell [mailto:[email protected]]
> >>>>>>>> Sent: 12 October 2015 16:08
> >>>>>>>> To: Tony Hart; [email protected]
> >>>>>>>> Subject: Re: [users] Avoid rebooting payload modules after losing system controller
> >>>>>>>>
> >>>>>>>> We have actually implemented something very similar to what you are talking about. With this feature, the payloads can survive without a cluster restart even if both system controllers restart (or the single system controller, in your case). If you want to try it out, you can clone this Mercurial repository:
> >>>>>>>>
> >>>>>>>> https://sourceforge.net/u/anders-w/opensaf-headless/
> >>>>>>>>
> >>>>>>>> To enable the feature, set the variable IMMSV_SC_ABSENCE_ALLOWED in immd.conf to the number of seconds you wish the payloads to wait for the system controllers to come back. Note: we have only implemented this feature for the "core" OpenSAF services (plus CKPT), so you need to disable the optional services.
> >>>>>>>>
> >>>>>>>> / Anders Widell
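For concreteness, trying the prototype out could look like the sketch below. The clone command reuses the repository URL from the mail (the exact Mercurial clone URL may differ), 900 seconds is only an illustrative timeout, and the export syntax is assumed to match the other variables in immd.conf.

    # Fetch the prototype branch (URL as given above; SourceForge may expose a
    # slightly different Mercurial clone URL).
    hg clone https://sourceforge.net/u/anders-w/opensaf-headless/ opensaf-headless

    # In immd.conf: allow the payloads to wait up to 15 minutes for the
    # system controllers to return before giving up (illustrative value).
    export IMMSV_SC_ABSENCE_ALLOWED=900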
> >>>>>>>> On 10/11/2015 02:30 PM, Tony Hart wrote:
> >>>>>>>>> We have been using opensaf in our product for a couple of years now. One of the issues we have is the fact that payload cards reboot when the system controllers are lost. Although our payload card hardware will continue to perform its functions whilst the software is down (which is desirable), the functions that the software performs are obviously not performed (which is not desirable).
> >>>>>>>>>
> >>>>>>>>> Why would we lose both controllers, surely that is a rare circumstance? Not if you only have one controller to begin with. Removing the second controller is a significant cost saving for us, so we want to support a product that only has one controller. The most significant impediment to that is the loss of payload software functions when the system controller fails.
> >>>>>>>>>
> >>>>>>>>> I'm looking for suggestions from this email list as to what could be done for this issue.
> >>>>>>>>>
> >>>>>>>>> One suggestion, that would work for us, is if we could convince the payload card to only reboot when the controller reappears after a loss, rather than when the loss initially occurs. Is that possible?
> >>>>>>>>>
> >>>>>>>>> Another possibility is if we could support more than 2 controllers; for example, if we could support 4 (one active and 3 standbys) that would also provide a solution for us (our current payloads would instead become controllers). I know that this is not currently possible with opensaf.
> >>>>>>>>>
> >>>>>>>>> thanks for any suggestions,
> >>>>>>>>> —
> >>>>>>>>> tony

------------------------------------------------------------------------------
_______________________________________________
Opensaf-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-users
