Hi Mathi,

Any updates on this topic?

thanks
— tony
> On Oct 14, 2015, at 7:50 AM, Mathivanan Naickan Palanivelu
> <[email protected]> wrote:
>
> Tony,
>
> We (in the TLC) have discussed this before. At this point in time
> you indeed have the solution that AndersW has pointed to.
>
> There are also other discussions going on w.r.t. multiple standbys etc.
> for the next major release. Will tag you in the appropriate ticket
> (one such ticket is http://sourceforge.net/p/opensaf/tickets/1170/)
> once those discussions conclude.
>
> Cheers,
> Mathi.
>
> ----- [email protected] wrote:
>
>> Thanks all, this is a good discussion.
>>
>> Let me again state my use case, which as I said I don't think is
>> unique, because selfishly I want to make sure that this is covered :-)
>>
>> I want to keep the system running in the case where both controllers
>> fail, or more specifically in my case, when the only server in the
>> system fails (cost-reduced system).
>>
>> The "don't reboot until a server re-appears" approach would work for
>> me. I think it's a poorer solution, but it would be workable (Mathi's
>> suggestion). Especially if it's available sooner.
>>
>> A better option is to allow one of the linecards to take over the
>> controller role when the preferred one dies. Since linecards are
>> assumed to be transitory (they may or may not be there, or could be
>> removed), I don't want to have to pick a linecard for this role at
>> boot time. Much better would be to allow some number (more than 2) of
>> nodes to be controllers (e.g. Active, Standby, Spare, Spare …). Then
>> OSAF takes responsibility for electing the next Active/Standby when
>> necessary. There would have to be a concept of preference such that
>> the Server node is chosen given a choice.
>>
>> For the solutions discussed so far…
>>
>> Making the OSAF processes restartable and able to be active on either
>> controller is good and will increase the MTBF of an OSAF system.
>> However, it won't fix my problem if there are still only two
>> controllers allowed.
>>
>> I'm not familiar with the "cattle" concept.
>>
>> I was only advocating the "headless" solution to the extent that it
>> solved my immediate problem. Even if we did have a "headless" mode, it
>> would be a temporary situation until the failed controller could be
>> replaced.
>>
>> thanks
>> —
>> tony
>>
>>
>>> On Oct 14, 2015, at 6:44 AM, Mathivanan Naickan Palanivelu
>>> <[email protected]> wrote:
>>>
>>> In a clustered environment, all nodes need to be in the same
>>> consistent view from a membership perspective.
>>> So, loss of a leader will indeed result in state changes to the
>>> nodes following the leader. Therefore there cannot be independent
>>> lifecycles for some nodes and other nodes.
>>> B.T.W. the restart mentioned below meant a 'transparent restart
>>> of OpenSAF' without affecting applications, and this is something
>>> the application's CLC-CLI scripts need to handle.
>>>
>>> While we could be inclined to support a (unique!) use-case such as
>>> this, it is the solution (headless fork) that would be under the
>>> lens! ;-)
>>>
>>> Let's discuss this!
>>>
>>> Cheers,
>>> Mathi.
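To make Mathi's remark that "the application's CLC-CLI scripts need to handle" a transparent restart more concrete: one plausible arrangement (a minimal sketch, not from the thread; the pidfile path and application command are hypothetical) is an INSTANTIATE script that adopts an already-running application process instead of starting a second one, so that a restart of OpenSAF alone leaves the application untouched.

    #!/bin/sh
    # Hypothetical CLC-CLI INSTANTIATE script sketch. The pidfile path
    # and the application command are illustrative assumptions.
    PIDFILE=/var/run/myapp.pid

    # If the application survived a middleware-only restart, adopt the
    # running process instead of starting a second instance.
    if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        exit 0
    fi

    # Normal case: start the application and record its PID.
    /usr/local/bin/myapp &
    echo $! > "$PIDFILE"
    exit 0

Whether silently adopting the old process is safe is of course application-specific; the point is only that the CLC-CLI layer is where that decision would live.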
>>> ----- [email protected] wrote:
>>>
>>>> Yes, this is yet another approach. But it is also another use-case
>>>> for the headless feature. When we have moved the system controllers
>>>> out of the cluster (into the cloud infrastructure), I would expect
>>>> controllers and payloads to have independent life cycles. You have
>>>> servers (i.e. system controllers) and clients (payloads). They can
>>>> be installed and upgraded separately from each other, and I wouldn't
>>>> expect a restart of the servers to cause all the clients to restart
>>>> as well, in the same way as I don't expect my web browser to restart
>>>> just because the web server has crashed.
>>>>
>>>> / Anders Widell
>>>>
>>>> On 10/13/2015 03:54 PM, Mathivanan Naickan Palanivelu wrote:
>>>>> I don't think this is a case of cattle! Even in those scenarios,
>>>>> the cloud management stacks' "controller" software themselves are
>>>>> 'placed' on physical nodes in appropriate redundancy models, and
>>>>> not inside those cattle VMs!
>>>>>
>>>>> I think the case here is about avoiding a reboot of the node!
>>>>> This can be achieved by setting the NOACTIVE timer to a longer
>>>>> value, until OpenSAF on the controller comes back up.
>>>>> Upon detecting that the controllers are up, some entity on the
>>>>> local node restarts OpenSAF (/etc/init.d/opensafd restart).
>>>>> And ensure the CLC-CLI scripts of the applications differentiate a
>>>>> usual restart from this spoof-restart!
>>>>>
>>>>> Mathi.
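As a concrete reading of Mathi's workaround above, here is a minimal sketch of the "some entity on the local node": a payload-side watchdog that waits for a system controller to answer again and then performs the spoof-restart. The controller address, the flag file, and the use of ping as the liveness probe are assumptions for illustration; only the /etc/init.d/opensafd restart step comes from the thread.

    #!/bin/sh
    # Hypothetical payload-side watchdog, started once SC loss has been
    # detected. SC_ADDR and the flag file are illustrative assumptions.
    SC_ADDR=10.0.0.1                       # assumed controller address
    FLAG=/var/run/opensaf-spoof-restart    # assumed marker file

    # Wait until a system controller answers again.
    while ! ping -c 1 -W 2 "$SC_ADDR" >/dev/null 2>&1; do
        sleep 5
    done

    # Mark this as a spoof-restart so the applications' CLC-CLI scripts
    # can tell it apart from an ordinary restart, then restart the local
    # OpenSAF instance as Mathi suggests.
    touch "$FLAG"
    /etc/init.d/opensafd restart

A CLC-CLI script could then test for (and remove) the flag file to decide whether to leave its running application process alone.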
>>>>>> -----Original Message-----
>>>>>> From: Anders Widell [mailto:[email protected]]
>>>>>> Sent: Tuesday, October 13, 2015 5:36 PM
>>>>>> To: Anders Björnerstedt; Tony Hart
>>>>>> Cc: [email protected]
>>>>>> Subject: Re: [users] Avoid rebooting payload modules after losing
>>>>>> system controller
>>>>>>
>>>>>> Yes, I agree that the best fit for this feature is an application
>>>>>> using either the NWay-Active or the No-Redundancy models, and where
>>>>>> you view the system more as a collection of nodes rather than as a
>>>>>> cluster. This kind of architecture is quite common when you write
>>>>>> applications for the cloud. The redundancy models are suitable for
>>>>>> scaling, and the architecture fits into the "cattle" philosophy
>>>>>> which is common in the cloud.
>>>>>> Such an application can tolerate any number of node failures, and
>>>>>> the remaining nodes would still be able to continue functioning and
>>>>>> provide their service. However, if we put the OpenSAF middleware on
>>>>>> the nodes, it becomes the weakest link, since OpenSAF will reboot
>>>>>> all the nodes just because the two controller nodes fail. What a
>>>>>> pity on a system with one hundred nodes!
>>>>>>
>>>>>> / Anders Widell
>>>>>>
>>>>>> On 10/13/2015 01:19 PM, Anders Björnerstedt wrote:
>>>>>>>
>>>>>>> On 10/13/2015 12:27 PM, Anders Widell wrote:
>>>>>>>> The possibility to have more than two system controllers (one
>>>>>>>> active + several standby and/or spare controller nodes) is also
>>>>>>>> something that has been investigated. For scalability reasons, we
>>>>>>>> probably can't turn all nodes into standby controllers in a large
>>>>>>>> cluster - but it may be feasible to have a system with one or
>>>>>>>> several standby controllers, where the rest of the nodes are
>>>>>>>> spares that are ready to take an active or standby assignment
>>>>>>>> when needed.
>>>>>>>>
>>>>>>>> However, the "headless" feature will still be needed in some
>>>>>>>> systems where you need dedicated controller node(s).
>>>>>>> That sounds as if some deployments have a special requirement that
>>>>>>> can only be supported by the headless feature.
>>>>>>> But you also have to say that the headless feature places
>>>>>>> anti-requirements on the deployments/applications that are to use
>>>>>>> it.
>>>>>>>
>>>>>>> For example, not needing cluster coherence among the payloads.
>>>>>>>
>>>>>>> If the payloads only run independent application instances where
>>>>>>> each instance is implemented at one processor, or at least does
>>>>>>> not communicate in any state-sensitive way with peer processes at
>>>>>>> other payloads; and no such instance is unique, or if it is unique
>>>>>>> it is still expendable (non-critical to the service), then it
>>>>>>> could work.
>>>>>>>
>>>>>>> It is important that the deployments that end up thinking they
>>>>>>> need the headless feature also understand what they lose with the
>>>>>>> headless feature, and that this loss is acceptable for that
>>>>>>> deployment.
>>>>>>>
>>>>>>> So headless is not a fancy feature needed by some exclusive and
>>>>>>> picky subset of applications.
>>>>>>> It is a relaxation that drops all requirements on distributed
>>>>>>> consistency, and may be acceptable to some applications with
>>>>>>> weaker demands, so that they can accept the anti-requirements.
>>>>>>>
>>>>>>> Besides requiring "dedicated" controller nodes, the deployment
>>>>>>> must of course NOT require any *availability* of those dedicated
>>>>>>> controller nodes, i.e. not have any requirements on service
>>>>>>> availability in general.
>>>>>>>
>>>>>>> It may work for some "dumb" applications that are stateless, or
>>>>>>> state stable (frozen in state), or have no requirements on
>>>>>>> availability of state. In other words, some applications that
>>>>>>> really don't need SAF.
>>>>>>>
>>>>>>> They may still want to use SAF as a way of managing and monitoring
>>>>>>> the system when it happens to be healthy, but can live with long
>>>>>>> periods of not being able to manage or monitor that system, which
>>>>>>> can then be degrading in any way that is possible.
>>>>>>>
>>>>>>> /AndersBJ
>>>>>>>
>>>>>>>> / Anders Widell
>>>>>>>>
>>>>>>>> On 10/13/2015 12:07 PM, Tony Hart wrote:
>>>>>>>>> Understood. The assumption is that this is temporary, but we
>>>>>>>>> allow the payloads to continue to run (with reduced osaf
>>>>>>>>> functionality) until a replacement controller is found. At that
>>>>>>>>> point they can reboot to get the system back into sync.
>>>>>>>>>
>>>>>>>>> Or allow more than 2 controllers in the system, so we can have
>>>>>>>>> one or more usually-payload cards be controllers to reduce the
>>>>>>>>> probability of no-controllers to an acceptable level.
>>>>>>>>>
>>>>>>>>>> On Oct 12, 2015, at 11:05 AM, Anders Björnerstedt
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> The headless state is also vulnerable to split-brain scenarios.
>>>>>>>>>> That is, network partitions and joins can occur and will not be
>>>>>>>>>> detected as such, and thus not handled properly (isolated) when
>>>>>>>>>> they occur.
>>>>>>>>>> Basically you cannot be sure you have a continuously coherent
>>>>>>>>>> cluster while in the headless state.
>>>>>>>>>>
>>>>>>>>>> On paper you may get a very resilient system, in the sense that
>>>>>>>>>> it "stays up" and replies to ping etc.
>>>>>>>>>> But typically a customer wants not just availability but
>>>>>>>>>> reliable behavior also.
>>>>>>>>>>
>>>>>>>>>> /AndersBj
>>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Anders Björnerstedt
>>>>>>>>>> [mailto:[email protected]]
>>>>>>>>>> Sent: den 12 oktober 2015 16:42
>>>>>>>>>> To: Anders Widell; Tony Hart; [email protected]
>>>>>>>>>> Subject: Re: [users] Avoid rebooting payload modules after
>>>>>>>>>> losing system controller
>>>>>>>>>>
>>>>>>>>>> Note that this headless variant is a very questionable feature.
>>>>>>>>>> This for the reasons explained earlier, i.e. you *will* get a
>>>>>>>>>> reduction in service availability.
>>>>>>>>>> It was never accepted into OpenSAF for that reason.
>>>>>>>>>>
>>>>>>>>>> On top of that, the unreliability will typically not be
>>>>>>>>>> explicit/handled. That is, the operator will probably not even
>>>>>>>>>> know what is working and what is not during the SC absence,
>>>>>>>>>> since the alarm/notification function is gone. No OpenSAF
>>>>>>>>>> director services are executing.
>>>>>>>>>>
>>>>>>>>>> It is truly a headless system, i.e. a zombie system, and thus
>>>>>>>>>> not working at full monitoring and availability functionality.
>>>>>>>>>> It begs the question of what OpenSAF and SAF are there for in
>>>>>>>>>> the first place.
>>>>>>>>>>
>>>>>>>>>> The SCs don't have to run any special software and don't have
>>>>>>>>>> to have any special hardware.
>>>>>>>>>> They do need file system access, at least for a cluster
>>>>>>>>>> restart, but not necessarily to handle single SC failure.
>>>>>>>>>> The headless variant, while headless, is in that
>>>>>>>>>> not-able-to-cluster-restart state as well, but with even less
>>>>>>>>>> functionality.
>>>>>>>>>>
>>>>>>>>>> An SC can of course run other (non-OpenSAF-specific) software.
>>>>>>>>>> And the two SCs don't necessarily have to be symmetric in terms
>>>>>>>>>> of software.
>>>>>>>>>>
>>>>>>>>>> Providing file system access via NFS is typically a non-issue.
>>>>>>>>>> They have three nodes. Ergo they should be able to assign two
>>>>>>>>>> of them the role of SC in the OpenSAF domain.
>>>>>>>>>>
>>>>>>>>>> /AndersBj
>>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Anders Widell [mailto:[email protected]]
>>>>>>>>>> Sent: den 12 oktober 2015 16:08
>>>>>>>>>> To: Tony Hart; [email protected]
>>>>>>>>>> Subject: Re: [users] Avoid rebooting payload modules after
>>>>>>>>>> losing system controller
>>>>>>>>>>
>>>>>>>>>> We have actually implemented something very similar to what you
>>>>>>>>>> are talking about. With this feature, the payloads can survive
>>>>>>>>>> without a cluster restart even if both system controllers
>>>>>>>>>> restart (or the single system controller, in your case). If you
>>>>>>>>>> want to try it out, you can clone this Mercurial repository:
>>>>>>>>>>
>>>>>>>>>> https://sourceforge.net/u/anders-w/opensaf-headless/
>>>>>>>>>>
>>>>>>>>>> To enable the feature, set the variable IMMSV_SC_ABSENCE_ALLOWED
>>>>>>>>>> in immd.conf to the number of seconds you wish the payloads to
>>>>>>>>>> wait for the system controllers to come back. Note: we have
>>>>>>>>>> only implemented this feature for the "core" OpenSAF services
>>>>>>>>>> (plus CKPT), so you need to disable the optional services.
>>>>>>>>>>
>>>>>>>>>> / Anders Widell
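For concreteness, enabling the feature described above might look like the following immd.conf fragment. OpenSAF .conf files are sourced as shell scripts; the value 300 is an arbitrary example, and treating unset/0 as "disabled" is an assumption, so check the headless branch for the exact semantics.

    # immd.conf (fragment)
    # Allow the payloads to survive up to 300 seconds without any system
    # controller before the usual reboot behaviour kicks in.
    export IMMSV_SC_ABSENCE_ALLOWED=300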
>>>>>>>>>>
>>>>>>>>>> On 10/11/2015 02:30 PM, Tony Hart wrote:
>>>>>>>>>>> We have been using opensaf in our product for a couple of
>>>>>>>>>>> years now. One of the issues we have is the fact that payload
>>>>>>>>>>> cards reboot when the system controllers are lost. Although
>>>>>>>>>>> our payload card hardware will continue to perform its
>>>>>>>>>>> functions whilst the software is down (which is desirable),
>>>>>>>>>>> the functions that the software performs are obviously not
>>>>>>>>>>> performed (which is not desirable).
>>>>>>>>>>>
>>>>>>>>>>> Why would we lose both controllers; surely that is a rare
>>>>>>>>>>> circumstance? Not if you only have one controller to begin
>>>>>>>>>>> with. Removing the second controller is a significant cost
>>>>>>>>>>> saving for us, so we want to support a product that only has
>>>>>>>>>>> one controller. The most significant impediment to that is the
>>>>>>>>>>> loss of payload software functions when the system controller
>>>>>>>>>>> fails.
>>>>>>>>>>>
>>>>>>>>>>> I'm looking for suggestions from this email list as to what
>>>>>>>>>>> could be done for this issue.
>>>>>>>>>>>
>>>>>>>>>>> One suggestion, that would work for us, is if we could
>>>>>>>>>>> convince the payload card to only reboot when the controller
>>>>>>>>>>> reappears after a loss, rather than when the loss initially
>>>>>>>>>>> occurs. Is that possible?
>>>>>>>>>>>
>>>>>>>>>>> Another possibility is if we could support more than 2
>>>>>>>>>>> controllers; for example, if we could support 4 (one active
>>>>>>>>>>> and 3 standbys), that would also provide a solution for us
>>>>>>>>>>> (our current payloads would instead become controllers). I
>>>>>>>>>>> know that this is not currently possible with opensaf.
>>>>>>>>>>>
>>>>>>>>>>> thanks for any suggestions,
>>>>>>>>>>> —
>>>>>>>>>>> tony

------------------------------------------------------------------------------
_______________________________________________
Opensaf-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-users
