Re: [users] Avoid rebooting payload modules after losing system controller

Mathivanan Naickan Palanivelu Wed, 14 Oct 2015 03:46:11 -0700

In a clustered environment, all nodes need to share the same consistent
view of membership.
So loss of a leader will indeed result in state changes on the nodes
following the leader. Therefore some nodes cannot have lifecycles that
are independent of the other nodes.
BTW, the restart mentioned below meant a 'transparent restart
of OpenSAF' that does not affect applications; for this, the applications'
CLC-CLI scripts need to handle that case.


While we could be inclined to support a (unique!) use case such as this,
it is the solution (the headless fork) that would be under the lens! ;-)
Let's discuss this!

Cheers,
Mathi.
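
[Illustrative sketch, not part of the original mail.] The approach described
in this message and in the quoted text below (wait for the controllers to
return, restart OpenSAF locally, and let the CLC-CLI scripts tell a normal
restart from the transparent one) might look roughly like the following. The
marker file path, the `controllers_up` probe and both function names are
assumptions made for illustration; only the restart command comes from the
thread. It also assumes the NOACTIVE timer has been set long enough that the
node is not rebooted while waiting.

```python
import subprocess
import time
from pathlib import Path
from typing import Callable, Sequence

# Hypothetical marker file: while it exists, CLC-CLI scripts can treat the
# ongoing OpenSAF restart as the transparent ("spoof") kind.
MARKER = Path("/var/run/opensaf_transparent_restart")

# The restart command quoted in the thread.
RESTART_CMD = ("/etc/init.d/opensafd", "restart")


def is_transparent_restart(marker: Path = MARKER) -> bool:
    """Return True while a transparent restart is in progress.

    A CLC-CLI script could call this and leave the application running
    when it returns True.
    """
    return marker.exists()


def restart_when_controllers_return(
    controllers_up: Callable[[], bool],
    run: Callable[[Sequence[str]], object] = lambda cmd: subprocess.run(cmd, check=True),
    marker: Path = MARKER,
    poll_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll until the controllers are reachable, then restart OpenSAF once.

    `controllers_up` is a caller-supplied probe; how to detect the
    controllers is not specified in the thread.
    """
    while not controllers_up():
        sleep(poll_seconds)
    marker.touch()  # flag this restart as transparent
    try:
        run(RESTART_CMD)
    finally:
        marker.unlink(missing_ok=True)  # later restarts are normal again
```

Passing the probe and the command runner in as parameters keeps the sketch
independent of any particular way of detecting the controllers.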


----- [email protected] wrote:

> Yes, this is yet another approach. But it is also another use-case for
> the headless feature. When we have moved the system controllers out of
> the cluster (into the cloud infrastructure), I would expect controllers
> and payloads to have independent life cycles. You have servers (i.e.
> system controllers), and clients (payloads). They can be installed and
> upgraded separately from each other, and I wouldn't expect a restart of
> the servers to cause all the clients to restart as well, in the same way
> as I don't expect my web browser to restart just because the web
> server has crashed.
> 
> / Anders Widell
> 
> On 10/13/2015 03:54 PM, Mathivanan Naickan Palanivelu wrote:
> > I don't think this is a case of cattle! Even in those scenarios, in
> > the cloud management stacks, the "**controller" software itself is
> > 'placed' on physical nodes in appropriate redundancy models and not
> > inside those cattle VMs!
> >
> > I think the case here is about avoiding a reboot of the node!
> > This can be achieved by setting the NOACTIVE timer to a value long
> > enough to last until OpenSAF on the controller comes back up.
> > Upon detecting that the controllers are up, some entity on the local
> > node restarts OpenSAF (/etc/init.d/opensafd restart).
> > And ensure the CLC-CLI scripts of the applications differentiate a
> > usual restart from this spoof-restart!
> >
> > Mathi.
> >
> >> -----Original Message-----
> >> From: Anders Widell [mailto:[email protected]]
> >> Sent: Tuesday, October 13, 2015 5:36 PM
> >> To: Anders Björnerstedt; Tony Hart
> >> Cc: [email protected]
> >> Subject: Re: [users] Avoid rebooting payload modules after losing
> >> system controller
> >>
> >> Yes, I agree that the best fit for this feature is an application using
> >> either the NWay-Active or the No-Redundancy models, and where you view
> >> the system more as a collection of nodes rather than as a cluster. This
> >> kind of architecture is quite common when you write applications for
> >> cloud. The redundancy models are suitable for scaling, and the
> >> architecture fits into the "cattle" philosophy which is common in cloud.
> >> Such an application can tolerate any number of node failures, and the
> >> remaining nodes would still be able to continue functioning and provide
> >> their service. However, if we put the OpenSAF middleware on the nodes it
> >> becomes the weakest link, since OpenSAF will reboot all the nodes just
> >> because the two controller nodes fail. What a pity on a system with one
> >> hundred nodes!
> >>
> >> / Anders Widell
> >>
> >> On 10/13/2015 01:19 PM, Anders Björnerstedt wrote:
> >>>
> >>> On 10/13/2015 12:27 PM, Anders Widell wrote:
> >>>> The possibility to have more than two system controllers (one active
> >>>> + several standby and/or spare controller nodes) is also something
> >>>> that has been investigated. For scalability reasons, we probably
> >>>> can't turn all nodes into standby controllers in a large cluster -
> >>>> but it may be feasible to have a system with one or several standby
> >>>> controllers and the rest of the nodes are spares that are ready to
> >>>> take an active or standby assignment when needed.
> >>>>
> >>>> However, the "headless" feature will still be needed in some systems
> >>>> where you need dedicated controller node(s).
> >>> That sounds as if some deployments have a special requirement that can
> >>> only be supported by the headless feature.
> >>> But you also have to say that the headless feature places
> >>> anti-requirements on the deployments/applications that are to use it.
> >>>
> >>> For example, not needing cluster coherence among the payloads.
> >>>
> >>> If the payloads only run independent application instances where each
> >>> instance is implemented at one processor or at least does not
> >>> communicate in any state-sensitive way with peer processes at other
> >>> payloads; and no such instance is unique or if it is unique it is
> >>> still expendable (non critical to the service), then it could work.
> >>>
> >>> It is important that the deployments that end up thinking they need the
> >>> headless feature also understand what they lose with the headless
> >>> feature and that this loss is acceptable for that deployment.
> >>>
> >>> So headless is not a fancy feature needed by some exclusive and picky
> >>> subset of applications.
> >>> It is a relaxation that drops all requirements on distributed
> >>> consistency and may be acceptable to some applications with weaker
> >>> demands so they can accept the anti-requirements.
> >>>
> >>> Besides requiring "dedicated" controller nodes, the deployment must of
> >>> course NOT require any *availability* of those dedicated controller
> >>> nodes, i.e. not have any requirements on service availability in
> >>> general.
> >>>
> >>> It may work for some "dumb" applications that are stateless, or state
> >>> stable (frozen in state), or have no requirements on availability of
> >>> state. In other words, some applications that really don't need SAF.
> >>>
> >>> They may still want to use SAF as a way of managing and monitoring the
> >>> system when it happens to be healthy, but can live with long periods
> >>> of not being able to manage or monitor that system, which can then be
> >>> degrading in any way that is possible.
> >>>
> >>>
> >>> /AndersBJ
> >>>
> >>>
> >>>
> >>>> / Anders Widell
> >>>>
> >>>> On 10/13/2015 12:07 PM, Tony Hart wrote:
> >>>>> Understood.  The assumption is that this is temporary but we allow
> >>>>> the payloads to continue to run (with reduced osaf functionality)
> >>>>> until a replacement controller is found.  At that point they can
> >>>>> reboot to get the system back into sync.
> >>>>>
> >>>>> Or allow more than 2 controllers in the system so we can have one or
> >>>>> more usually-payload cards be controllers to reduce the probability
> >>>>> of no-controllers to an acceptable level.
> >>>>>
> >>>>>
> >>>>>> On Oct 12, 2015, at 11:05 AM, Anders Björnerstedt
> >>>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>> The headless state is also vulnerable to split-brain scenarios.
> >>>>>> That is, network partitions and joins can occur and will not be
> >>>>>> detected as such and thus not handled properly (isolated) when they
> >>>>>> occur.
> >>>>>> Basically you cannot be sure you have a continuously coherent
> >>>>>> cluster while in the headless state.
> >>>>>>
> >>>>>> On paper you may get a very resilient system in the sense that it
> >>>>>> "stays up" and replies to ping etc.
> >>>>>> But typically a customer wants not just availability but reliable
> >>>>>> behavior also.
> >>>>>>
> >>>>>> /AndersBj
> >>>>>>
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Anders Björnerstedt [mailto:[email protected]]
> >>>>>> Sent: 12 October 2015 16:42
> >>>>>> To: Anders Widell; Tony Hart; [email protected]
> >>>>>> Subject: Re: [users] Avoid rebooting payload modules after losing
> >>>>>> system controller
> >>>>>>
> >>>>>> Note that this headless variant is a very questionable feature.
> >>>>>> This for the reasons explained earlier, i.e. you *will* get a
> >>>>>> reduction in service availability.
> >>>>>> It was never accepted into OpenSAF for that reason.
> >>>>>>
> >>>>>> On top of that, the unreliability will typically not be
> >>>>>> explicit/handled. That is, the operator will probably not even know
> >>>>>> what is working and what is not during the SC absence since the
> >>>>>> alarm/notification function is gone. No OpenSAF director services
> >>>>>> are executing.
> >>>>>>
> >>>>>> It is truly a headless system, i.e. a zombie system and thus not
> >>>>>> working at full monitoring and availability functionality.
> >>>>>> It begs the question of what OpenSAF and SAF is there for in the
> >>>>>> first place.
> >>>>>>
> >>>>>> The SCs don’t have to run any special software and don’t have to
> >>>>>> have any special hardware.
> >>>>>> They do need file system access, at least for a cluster restart,
> >>>>>> but not necessarily to handle single SC failure.
> >>>>>> The headless variant, when headless, is also in that
> >>>>>> not-able-to-cluster-restart state, but with even less functionality.
> >>>>>>
> >>>>>> An SC can of course run other (non OpenSAF specific) software. And
> >>>>>> the two SCs don’t necessarily have to be symmetric in terms of
> >>>>>> software.
> >>>>>>
> >>>>>> Providing file system access via NFS is typically a non-issue. They
> >>>>>> have three nodes. Ergo they should be able to assign two of them
> >>>>>> the role of SC in the OpenSAF domain.
> >>>>>>
> >>>>>> /AndersBj
> >>>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Anders Widell [mailto:[email protected]]
> >>>>>> Sent: 12 October 2015 16:08
> >>>>>> To: Tony Hart; [email protected]
> >>>>>> Subject: Re: [users] Avoid rebooting payload modules after losing
> >>>>>> system controller
> >>>>>>
> >>>>>> We have actually implemented something very similar to what you are
> >>>>>> talking about. With this feature, the payloads can survive without
> >>>>>> a cluster restart even if both system controllers restart (or the
> >>>>>> single system controller, in your case). If you want to try it out,
> >>>>>> you can clone this Mercurial repository:
> >>>>>>
> >>>>>> https://sourceforge.net/u/anders-w/opensaf-headless/
> >>>>>>
> >>>>>> To enable the feature, set the variable IMMSV_SC_ABSENCE_ALLOWED in
> >>>>>> immd.conf to the number of seconds you wish the payloads to wait
> >>>>>> for the system controllers to come back. Note: we have only
> >>>>>> implemented this feature for the "core" OpenSAF services (plus
> >>>>>> CKPT), so you need to disable the optional services.
> >>>>>>
> >>>>>> / Anders Widell
> >>>>>>
> >>>>>> On 10/11/2015 02:30 PM, Tony Hart wrote:
> >>>>>>> We have been using opensaf in our product for a couple of years
> >>>>>>> now.  One of the issues we have is the fact that payload cards
> >>>>>>> reboot when the system controllers are lost.  Although our payload
> >>>>>>> card hardware will continue to perform its functions whilst the
> >>>>>>> software is down (which is desirable) the functions that the
> >>>>>>> software performs are obviously not performed (which is not
> >>>>>>> desirable).
> >>>>>>>
> >>>>>>> Why would we lose both controllers, surely that is a rare
> >>>>>>> circumstance?  Not if you only have one controller to begin with.
> >>>>>>> Removing the second controller is a significant cost saving for us
> >>>>>>> so we want to support a product that only has one controller.  The
> >>>>>>> most significant impediment to that is the loss of payload
> >>>>>>> software functions when the system controller fails.
> >>>>>>>
> >>>>>>> I’m looking for suggestions from this email list as to what could
> >>>>>>> be done for this issue.
> >>>>>>>
> >>>>>>> One suggestion, that would work for us, is if we could convince
> >>>>>>> the payload card to only reboot when the controller reappears
> >>>>>>> after a loss rather than when the loss initially occurs.  Is that
> >>>>>>> possible?
> >>>>>>>
> >>>>>>> Another possibility is if we could support more than 2
> >>>>>>> controllers, for example if we could support 4 (one active and 3
> >>>>>>> standbys) that would also provide a solution for us (our current
> >>>>>>> payloads would instead become controllers). I know that this is
> >>>>>>> not currently possible with opensaf.
> >>>>>>>
> >>>>>>> thanks for any suggestions,
> >>>>>>> —
> >>>>>>> tony
> >>>>>>>
> >>>>>>> ----------------------------------------------------------------------
> >>>>>>> _______________________________________________
> >>>>>>> Opensaf-users mailing list
> >>>>>>> [email protected]
> >>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-users
