Tony,

We (in the TLC) have discussed this before. At this point in time,
the solution is indeed the one that AndersW has pointed to.

There are also other discussions going on w.r.t. multiple standbys etc.
for the next major release. I will tag you in the appropriate ticket
(one such ticket is http://sourceforge.net/p/opensaf/tickets/1170/)
once those discussions conclude.

Cheers,
Mathi.

----- [email protected] wrote:

> Thanks all, this is a good discussion.
> 
> Let me again state my use case, which as I said I don’t think is
> unique, because selfishly I want to make sure that this is covered
> :-)
> 
> I want to keep the system running in the case where both controllers
> fail, or more specifically in my case, when the only server in the
> system fails (cost-reduced system).
> 
> The "don’t reboot until a server re-appears" approach would work for
> me.  I think it’s a poorer solution but it would be workable (Mathi’s
> suggestion), especially if it’s available sooner.
> 
> A better option is to allow one of the linecards to take over the
> controller role when the preferred one dies.  Since linecards are
> assumed to be transitory (they may or may not be there, or could be
> removed), I don’t want to have to pick a linecard for this role at
> boot time.  Much better would be to allow some number (more than 2) of
> nodes to be controllers (e.g. Active, Standby, Spare, Spare …).  Then
> OSAF takes responsibility for electing the next Active/Standby when
> necessary.  There would have to be a concept of preference such that
> the server node is chosen, given a choice.
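The election-with-preference idea could be sketched roughly as follows. This is purely illustrative: the `elect` helper and the candidate-list format are invented here for the sake of the example and are not OpenSAF code.

```shell
# Illustrative sketch of preference-based controller election.
# stdin: one "preference node-name" pair per line; a lower number means
# more preferred (e.g. the dedicated server node gets preference 1).
# The two most-preferred reachable candidates become active and standby.
elect() {
    sort -n | awk 'NR == 1 { print "active " $2 }
                   NR == 2 { print "standby " $2 }'
}

# Example: the server node is preferred over the linecards.
printf '2 linecard-1\n1 server\n3 linecard-2\n' | elect
# -> active server
#    standby linecard-1
```

The remaining candidates would stay as spares, ready to be promoted by a re-run of the election when the active or standby disappears.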
> 
> For the solutions discussed so far…
> 
> Making the OSAF processes restartable and able to be active on either
> controller is good, and will increase the MTBF of an OSAF system.
> However it won’t fix my problem if there are still only two
> controllers allowed.
> 
> I’m not familiar with the “cattle” concept.
> 
> I wasn’t advocating the “headless” solution except to the extent that
> it solved my immediate problem.  Even if we did have a “headless” mode
> it would be a temporary situation until the failed controller could be
> replaced.
> 
> thanks
> —
> tony
> 
> 
> > On Oct 14, 2015, at 6:44 AM, Mathivanan Naickan Palanivelu
> > <[email protected]> wrote:
> > 
> > In a clustered environment, all nodes need to be in the same
> > consistent view from a membership perspective.
> > So, loss of a leader will indeed result in state changes to the
> > nodes following the leader. Therefore there cannot be independent
> > lifecycles for some nodes and other nodes.
> > B.T.W., the restart mentioned below meant 'transparent restart
> > of OpenSAF' without affecting applications, and it is the
> > application's CLC-CLI scripts that need to handle this.
> > 
> > While we could be inclined to support a (unique!) use-case such
> > as this, it is the solution (headless fork) that would be under
> > the lens! ;-)
> > Let's discuss this!
> > 
> > Cheers,
> > Mathi.
> > 
> > 
> > ----- [email protected] wrote:
> > 
> >> Yes, this is yet another approach. But it is also another use-case
> >> for the headless feature. When we have moved the system controllers
> >> out of the cluster (into the cloud infrastructure), I would expect
> >> controllers and payloads to have independent life cycles. You have
> >> servers (i.e. system controllers) and clients (payloads). They can
> >> be installed and upgraded separately from each other, and I wouldn't
> >> expect a restart of the servers to cause all the clients to restart
> >> as well, in the same way as I don't expect my web browser to restart
> >> just because the web server has crashed.
> >> 
> >> / Anders Widell
> >> 
> >> On 10/13/2015 03:54 PM, Mathivanan Naickan Palanivelu wrote:
> >>> I don't think this is a case of cattle! Even in those scenarios,
> >>> in the cloud management stacks, the "controller" software itself
> >>> is 'placed' on physical nodes in appropriate redundancy models,
> >>> and not inside those cattle VMs!
> >>>
> >>> I think the case here is about avoiding a reboot of the node!
> >>> This can be achieved by setting the NOACTIVE timer to a longer
> >>> value, until OpenSAF on the controller comes back up.
> >>> Upon detecting that the controllers are up, some entity on the
> >>> local node restarts OpenSAF (/etc/init.d/opensafd restart),
> >>> and ensures that the CLC-CLI scripts of the applications
> >>> differentiate a usual restart from this spoof-restart!
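A rough sketch of that last point, purely illustrative: the marker file and the `start_kind` helper are invented here and are not part of OpenSAF. The assumption is that the entity performing the restart touches the marker just before running `/etc/init.d/opensafd restart`.

```shell
# Illustrative sketch: how an application's CLC-CLI start script might
# tell the "spoof" restart apart from an ordinary restart. The marker
# file path and the start_kind helper are hypothetical, not OpenSAF code.
MARKER=${MARKER:-/var/run/opensaf_spoof_restart}

start_kind() {
    if [ -f "$MARKER" ]; then
        rm -f "$MARKER"        # consume the marker left before the restart
        echo spoof-restart     # keep application state; skip full re-init
    else
        echo normal-start
    fi
}
```

The restarting entity would `touch "$MARKER"` immediately before invoking `/etc/init.d/opensafd restart`, so the next start of the application sees `spoof-restart` exactly once.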
> >>> 
> >>> Mathi.
> >>> 
> >>>> -----Original Message-----
> >>>> From: Anders Widell [mailto:[email protected]]
> >>>> Sent: Tuesday, October 13, 2015 5:36 PM
> >>>> To: Anders Björnerstedt; Tony Hart
> >>>> Cc: [email protected]
> >>>> Subject: Re: [users] Avoid rebooting payload modules after
> >>>> losing system controller
> >>>> 
> >>>> Yes, I agree that the best fit for this feature is an application
> >>>> using either the NWay-Active or the No-Redundancy models, and
> >>>> where you view the system more as a collection of nodes rather
> >>>> than as a cluster. This kind of architecture is quite common when
> >>>> you write applications for the cloud. The redundancy models are
> >>>> suitable for scaling, and the architecture fits into the "cattle"
> >>>> philosophy which is common in the cloud.
> >>>> Such an application can tolerate any number of node failures, and
> >>>> the remaining nodes would still be able to continue functioning
> >>>> and provide their service. However, if we put the OpenSAF
> >>>> middleware on the nodes it becomes the weakest link, since OpenSAF
> >>>> will reboot all the nodes just because the two controller nodes
> >>>> fail. What a pity on a system with one hundred nodes!
> >>>> 
> >>>> / Anders Widell
> >>>> 
> >>>> On 10/13/2015 01:19 PM, Anders Björnerstedt wrote:
> >>>>> 
> >>>>> On 10/13/2015 12:27 PM, Anders Widell wrote:
> >>>>>> The possibility to have more than two system controllers (one
> >>>>>> active + several standby and/or spare controller nodes) is also
> >>>>>> something that has been investigated. For scalability reasons,
> >>>>>> we probably can't turn all nodes into standby controllers in a
> >>>>>> large cluster - but it may be feasible to have a system with one
> >>>>>> or several standby controllers while the rest of the nodes are
> >>>>>> spares that are ready to take an active or standby assignment
> >>>>>> when needed.
> >>>>>>
> >>>>>> However, the "headless" feature will still be needed in some
> >>>>>> systems where you need dedicated controller node(s).
> >>>>> That sounds as if some deployments have a special requirement
> >>>>> that can only be supported by the headless feature.
> >>>>> But you also have to say that the headless feature places
> >>>>> anti-requirements on the deployments/applications that are to
> >>>>> use it.
> >>>>>
> >>>>> For example, not needing cluster coherence among the payloads.
> >>>>>
> >>>>> If the payloads only run independent application instances, where
> >>>>> each instance is implemented at one processor or at least does
> >>>>> not communicate in any state-sensitive way with peer processes at
> >>>>> other payloads, and no such instance is unique, or if it is
> >>>>> unique it is still expendable (non-critical to the service), then
> >>>>> it could work.
> >>>>>
> >>>>> It is important that the deployments that end up thinking they
> >>>>> need the headless feature also understand what they lose with the
> >>>>> headless feature, and that this loss is acceptable for that
> >>>>> deployment.
> >>>>>
> >>>>> So headless is not a fancy feature needed by some exclusive and
> >>>>> picky subset of applications.
> >>>>> It is a relaxation that drops all requirements on distributed
> >>>>> consistency, and may be acceptable to some applications with
> >>>>> weaker demands, so that they can accept the anti-requirements.
> >>>>> 
> >>>>> Besides requiring "dedicated" controller nodes, the deployment
> >>>>> must of course NOT require any *availability* of those dedicated
> >>>>> controller nodes, i.e. not have any requirements on service
> >>>>> availability in general.
> >>>>>
> >>>>> It may work for some "dumb" applications that are stateless, or
> >>>>> state stable (frozen in state), or have no requirements on
> >>>>> availability of state. In other words, some applications that
> >>>>> really don't need SAF.
> >>>>>
> >>>>> They may still want to use SAF as a way of managing and
> >>>>> monitoring the system when it happens to be healthy, but can live
> >>>>> with long periods of not being able to manage or monitor that
> >>>>> system, which can then be degrading in any way that is possible.
> >>>>> 
> >>>>> 
> >>>>> /AndersBJ
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>>> / Anders Widell
> >>>>>> 
> >>>>>> On 10/13/2015 12:07 PM, Tony Hart wrote:
> >>>>>>> Understood.  The assumption is that this is temporary but we
> >>>>>>> allow the payloads to continue to run (with reduced osaf
> >>>>>>> functionality) until a replacement controller is found.  At
> >>>>>>> that point they can reboot to get the system back into sync.
> >>>>>>>
> >>>>>>> Or allow more than 2 controllers in the system, so we can have
> >>>>>>> one or more usually-payload cards be controllers, to reduce the
> >>>>>>> probability of no-controllers to an acceptable level.
> >>>>>>> 
> >>>>>>> 
> >>>>>>>> On Oct 12, 2015, at 11:05 AM, Anders Björnerstedt
> >>>>>>>> <[email protected]> wrote:
> >>>>>>>> 
> >>>>>>>> The headless state is also vulnerable to split-brain
> >>>>>>>> scenarios.
> >>>>>>>> That is, network partitions and joins can occur and will not
> >>>>>>>> be detected as such, and thus not handled properly (isolated)
> >>>>>>>> when they occur.
> >>>>>>>> Basically you cannot be sure you have a continuously coherent
> >>>>>>>> cluster while in the headless state.
> >>>>>>>>
> >>>>>>>> On paper you may get a very resilient system in the sense that
> >>>>>>>> it "stays up" and replies to ping etc.
> >>>>>>>> But typically a customer wants not just availability but
> >>>>>>>> reliable behavior also.
> >>>>>>>> 
> >>>>>>>> /AndersBj
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Anders Björnerstedt [mailto:[email protected]]
> >>>>>>>> Sent: den 12 oktober 2015 16:42
> >>>>>>>> To: Anders Widell; Tony Hart; [email protected]
> >>>>>>>> Subject: Re: [users] Avoid rebooting payload modules after
> >>>>>>>> losing system controller
> >>>>>>>> 
> >>>>>>>> Note that this headless variant is a very questionable
> >>>>>>>> feature.
> >>>>>>>> This is for the reasons explained earlier, i.e. you *will* get
> >>>>>>>> a reduction in service availability.
> >>>>>>>> It was never accepted into OpenSAF for that reason.
> >>>>>>>>
> >>>>>>>> On top of that, the unreliability will typically not be
> >>>>>>>> explicit/handled. That is, the operator will probably not even
> >>>>>>>> know what is working and what is not during the SC absence,
> >>>>>>>> since the alarm/notification function is gone. No OpenSAF
> >>>>>>>> director services are executing.
> >>>>>>>>
> >>>>>>>> It is truly a headless system, i.e. a zombie system, and thus
> >>>>>>>> not working at full monitoring and availability functionality.
> >>>>>>>> It begs the question of what OpenSAF and SAF are there for in
> >>>>>>>> the first place.
> >>>>>>>>
> >>>>>>>> The SCs don’t have to run any special software and don’t have
> >>>>>>>> to have any special hardware.
> >>>>>>>> They do need file system access, at least for a cluster
> >>>>>>>> restart, but not necessarily to handle single SC failure.
> >>>>>>>> The headless variant, when headless, is also in that
> >>>>>>>> not-able-to-cluster-restart state, but with even less
> >>>>>>>> functionality.
> >>>>>>>>
> >>>>>>>> An SC can of course run other (non OpenSAF specific) software.
> >>>>>>>> And the two SCs don’t necessarily have to be symmetric in
> >>>>>>>> terms of software.
> >>>>>>>>
> >>>>>>>> Providing file system access via NFS is typically a non-issue.
> >>>>>>>> They have three nodes. Ergo, they should be able to assign two
> >>>>>>>> of them the role of SC in the OpenSAF domain.
> >>>>>>>> 
> >>>>>>>> /AndersBj
> >>>>>>>> 
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Anders Widell [mailto:[email protected]]
> >>>>>>>> Sent: den 12 oktober 2015 16:08
> >>>>>>>> To: Tony Hart; [email protected]
> >>>>>>>> Subject: Re: [users] Avoid rebooting payload modules after
> >>>>>>>> losing system controller
> >>>>>>>> 
> >>>>>>>> We have actually implemented something very similar to what
> >>>>>>>> you are talking about. With this feature, the payloads can
> >>>>>>>> survive without a cluster restart even if both system
> >>>>>>>> controllers restart (or the single system controller, in your
> >>>>>>>> case). If you want to try it out, you can clone this Mercurial
> >>>>>>>> repository:
> >>>>>>>>
> >>>>>>>> https://sourceforge.net/u/anders-w/opensaf-headless/
> >>>>>>>>
> >>>>>>>> To enable the feature, set the variable
> >>>>>>>> IMMSV_SC_ABSENCE_ALLOWED in immd.conf to the number of seconds
> >>>>>>>> you wish the payloads to wait for the system controllers to
> >>>>>>>> come back. Note: we have only implemented this feature for the
> >>>>>>>> "core" OpenSAF services (plus CKPT), so you need to disable
> >>>>>>>> the optional services.
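For reference, enabling it would then amount to a one-line addition on each payload node. This is a sketch: only the variable name IMMSV_SC_ABSENCE_ALLOWED comes from this mail; 900 seconds is an arbitrary example value.

```shell
# Fragment for immd.conf on each payload node: allow the payloads to
# wait up to 900 seconds (arbitrary example value) for the system
# controllers to come back before giving up.
export IMMSV_SC_ABSENCE_ALLOWED=900
```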
> >>>>>>>> 
> >>>>>>>> / Anders Widell
> >>>>>>>> 
> >>>>>>>> On 10/11/2015 02:30 PM, Tony Hart wrote:
> >>>>>>>>> We have been using opensaf in our product for a couple of
> >>>>>>>>> years now.  One of the issues we have is the fact that
> >>>>>>>>> payload cards reboot when the system controllers are lost.
> >>>>>>>>> Although our payload card hardware will continue to perform
> >>>>>>>>> its functions whilst the software is down (which is
> >>>>>>>>> desirable), the functions that the software performs are
> >>>>>>>>> obviously not performed (which is not desirable).
> >>>>>>>>>
> >>>>>>>>> Why would we lose both controllers; surely that is a rare
> >>>>>>>>> circumstance?  Not if you only have one controller to begin
> >>>>>>>>> with.  Removing the second controller is a significant cost
> >>>>>>>>> saving for us, so we want to support a product that only has
> >>>>>>>>> one controller.  The most significant impediment to that is
> >>>>>>>>> the loss of payload software functions when the system
> >>>>>>>>> controller fails.
> >>>>>>>>>
> >>>>>>>>> I’m looking for suggestions from this email list as to what
> >>>>>>>>> could be done about this issue.
> >>>>>>>>>
> >>>>>>>>> One suggestion, that would work for us, is if we could
> >>>>>>>>> convince the payload card to only reboot when the controller
> >>>>>>>>> reappears after a loss, rather than when the loss initially
> >>>>>>>>> occurs.  Is that possible?
> >>>>>>>>>
> >>>>>>>>> Another possibility is if we could support more than 2
> >>>>>>>>> controllers; for example, if we could support 4 (one active
> >>>>>>>>> and 3 standbys) that would also provide a solution for us
> >>>>>>>>> (our current payloads would instead become controllers). I
> >>>>>>>>> know that this is not currently possible with opensaf.
> >>>>>>>>> 
> >>>>>>>>> thanks for any suggestions,
> >>>>>>>>> —
> >>>>>>>>> tony
> >>>>>>>>> 
> >>>>>>>>> ------------------------------------------------------------------------------
> >>>>>>>>> _______________________________________________
> >>>>>>>>> Opensaf-users mailing list
> >>>>>>>>> [email protected]
> >>>>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-users
> >>>>>>>> 
> >>>>>>>> 
> >>>>>> 
> >>>>> 
> >>>>> 
> >>>> 
> >>>> 
> >>>> 
