Re: [users] Avoid rebooting payload modules after losing system controller

Anders Widell Wed, 14 Oct 2015 05:52:50 -0700

In this kind of solution the system controller(s) would not be part of 
the cluster (not the same cluster anyway) - the system controller 
functionality would be a service provided externally (e.g. by the 
cloud). So loss of connectivity with the system controller service would 
not in itself result in a change of the cluster membership.


/ Anders Widell

On 10/14/2015 12:44 PM, Mathivanan Naickan Palanivelu wrote:
> In a clustered environment, all nodes need to share the same consistent
> view from a membership perspective.
> So, loss of a leader will indeed result in state changes to the nodes
> following the leader. Therefore there cannot be independent lifecycles
> for some nodes and other nodes.
> B.T.W., the restart mentioned below meant a 'transparent restart
> of OpenSAF' that does not affect applications, and the applications'
> CLC-CLI scripts need to handle this.
>
> While we could be inclined to support a (unique!) use-case such as this,
> it is the solution (headless fork) that would be under the lens! ;-)
> Let's discuss this!
>
> Cheers,
> Mathi.
>
>
> ----- [email protected] wrote:
>
>> Yes, this is yet another approach. But it is also another use-case for
>> the headless feature. When we have moved the system controllers out of
>> the cluster (into the cloud infrastructure), I would expect controllers
>> and payloads to have independent life cycles. You have servers (i.e.
>> system controllers), and clients (payloads). They can be installed and
>> upgraded separately from each other, and I wouldn't expect a restart of
>> the servers to cause all the clients to restart as well, in the same way
>> as I don't expect my web browser to restart just because the web
>> server has crashed.
>>
>> / Anders Widell
>>
>> On 10/13/2015 03:54 PM, Mathivanan Naickan Palanivelu wrote:
>>> I don't think this is a case of cattle! Even in those scenarios, the
>>> cloud management stacks (the "**controller" software) are themselves
>>> 'placed' on physical nodes in appropriate redundancy models, and not
>>> inside those cattle VMs!
>>>
>>> I think the case here is about avoiding a reboot of the node!
>>> This can be achieved by setting the NOACTIVE timer to a longer value,
>>> until OpenSAF on the controller comes back up.
>>> Upon detecting that the controllers are up, some entity on the local
>>> node restarts OpenSAF (/etc/init.d/opensafd restart)
>>> and ensures that the CLC-CLI scripts of the applications differentiate
>>> a usual restart from this spoof-restart!
>>> Mathi.
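A rough sketch of the "some entity on the local node" idea from the message above, not OpenSAF code: poll until the controllers are reachable again, then issue the restart command quoted there. How a payload would actually detect that the controllers are back is not stated in this thread, so `controllers_up` is a hypothetical callable supplied by the caller. The function name, the polling interval, and the poll limit are likewise assumptions.

```python
import subprocess
import time
from typing import Callable

# Restart command as quoted in the message above.
RESTART_CMD = ["/etc/init.d/opensafd", "restart"]


def default_restart() -> None:
    """Run the OpenSAF init script with 'restart'."""
    subprocess.run(RESTART_CMD, check=True)


def restart_when_controllers_return(
    controllers_up: Callable[[], bool],          # hypothetical detection hook
    restart: Callable[[], None] = default_restart,
    poll_interval: float = 5.0,                  # illustrative value
    max_polls: int = 720,                        # illustrative value
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the controllers are back, then restart OpenSAF once.

    Returns True if a restart was issued, False if the poll limit was
    reached while the controllers were still absent.
    """
    for _ in range(max_polls):
        if controllers_up():
            restart()
            return True
        sleep(poll_interval)
    return False
```

The restart and sleep functions are injectable so the logic can be exercised without touching a real node. Making the applications' CLC-CLI scripts tell this restart apart from a normal one is a separate problem that the sketch does not address.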
>>>
>>>> -----Original Message-----
>>>> From: Anders Widell [mailto:[email protected]]
>>>> Sent: Tuesday, October 13, 2015 5:36 PM
>>>> To: Anders Björnerstedt; Tony Hart
>>>> Cc: [email protected]
>>>> Subject: Re: [users] Avoid rebooting payload modules after losing
>>>> system controller
>>>>
>>>> Yes, I agree that the best fit for this feature is an application using
>>>> either the NWay-Active or the No-Redundancy models, and where you view
>>>> the system more as a collection of nodes rather than as a cluster. This
>>>> kind of architecture is quite common when you write applications for
>>>> cloud. The redundancy models are suitable for scaling, and the
>>>> architecture fits into the "cattle" philosophy which is common in cloud.
>>>> Such an application can tolerate any number of node failures, and the
>>>> remaining nodes would still be able to continue functioning and provide
>>>> their service. However, if we put the OpenSAF middleware on the nodes it
>>>> becomes the weakest link, since OpenSAF will reboot all the nodes just
>>>> because the two controller nodes fail. What a pity on a system with one
>>>> hundred nodes!
>>>>
>>>> / Anders Widell
>>>>
>>>> On 10/13/2015 01:19 PM, Anders Björnerstedt wrote:
>>>>> On 10/13/2015 12:27 PM, Anders Widell wrote:
>>>>>> The possibility to have more than two system controllers (one active
>>>>>> + several standby and/or spare controller nodes) is also something
>>>>>> that has been investigated. For scalability reasons, we probably
>>>>>> can't turn all nodes into standby controllers in a large cluster -
>>>>>> but it may be feasible to have a system with one or several standby
>>>>>> controllers and the rest of the nodes are spares that are ready to
>>>>>> take an active or standby assignment when needed.
>>>>>>
>>>>>> However, the "headless" feature will still be needed in some systems
>>>>>> where you need dedicated controller node(s).
>>>>> That sounds as if some deployments have a special requirement that can
>>>>> only be supported by the headless feature.
>>>>> But you also have to say that the headless feature places
>>>>> anti-requirements on the deployments/applications that are to use it.
>>>>> For example not needing cluster coherence among the payloads.
>>>>>
>>>>> If the payloads only run independent application instances where each
>>>>> instance is implemented at one processor or at least does not
>>>>> communicate in any state-sensitive way with peer processes at other
>>>>> payloads; and no such instance is unique or if it is unique it is
>>>>> still expendable (non critical to the service), then it could work.
>>>>> It is important that the deployments that end up thinking they need the
>>>>> headless feature also understand what they lose with the headless
>>>>> feature and that this loss is acceptable for that deployment.
>>>>>
>>>>> So headless is not a fancy feature needed by some exclusive and picky
>>>>> subset of applications.
>>>>> It is a relaxation that drops all requirements on distributed
>>>>> consistency and may be acceptable to some applications with weaker
>>>>> demands so they can accept the anti-requirements.
>>>>>
>>>>> Besides requiring "dedicated" controller nodes, the deployment must of
>>>>> course NOT require any *availability* of those dedicated controller
>>>>> nodes, i.e. not have any requirements on service availability in
>>>>> general.
>>>>>
>>>>> It may work for some "dumb" applications that are stateless, or state
>>>>> stable (frozen in state), or have no requirements on availability of
>>>>> state. In other words some applications that really don't need SAF.
>>>>> They may still want to use SAF as a way of managing and monitoring the
>>>>> system when it happens to be healthy, but can live with long periods
>>>>> of not being able to manage or monitor that system, which can then be
>>>>> degrading in any way that is possible.
>>>>>
>>>>>
>>>>> /AndersBJ
>>>>>
>>>>>
>>>>>
>>>>>> / Anders Widell
>>>>>>
>>>>>> On 10/13/2015 12:07 PM, Tony Hart wrote:
>>>>>>> Understood.  The assumption is that this is temporary but we allow
>>>>>>> the payloads to continue to run (with reduced osaf functionality)
>>>>>>> until a replacement controller is found.  At that point they can
>>>>>>> reboot to get the system back into sync.
>>>>>>>
>>>>>>> Or allow more than 2 controllers in the system so we can have one or
>>>>>>> more usually-payload cards be controllers to reduce the probability
>>>>>>> of no-controllers to an acceptable level.
>>>>>>>
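A back-of-the-envelope sketch of the "reduce the probability of no-controllers" argument in the message above. It assumes that every controller is unavailable with the same probability and that failures are independent; neither assumption is claimed in this thread, and correlated failures (shared power, shared network) would make the real figure worse. The function name and the 1% figure are purely illustrative.

```python
def prob_no_controller(p_down: float, controllers: int) -> float:
    """Probability that all controllers are down at the same time.

    Assumes independent failures with identical per-controller
    unavailability p_down (an assumption of this sketch).
    """
    if not 0.0 <= p_down <= 1.0:
        raise ValueError("p_down must be a probability between 0 and 1")
    if controllers < 1:
        raise ValueError("at least one controller is required")
    return p_down ** controllers


if __name__ == "__main__":
    # Illustrative unavailability of 1% per controller.
    for n in (1, 2, 4):
        print(n, "controller(s):", prob_no_controller(0.01, n))
```

Under these assumptions each extra controller multiplies the chance of a controller-less system by `p_down`, which is why going from two to four controllers can matter even when a single controller is fairly reliable.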
>>>>>>>
>>>>>>>> On Oct 12, 2015, at 11:05 AM, Anders Björnerstedt
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> The headless state is also vulnerable to split-brain scenarios.
>>>>>>>> That is, network partitions and joins can occur and will not be
>>>>>>>> detected as such, and thus not handled properly (isolated) when they
>>>>>>>> occur.
>>>>>>>> Basically you cannot be sure you have a continuously coherent
>>>>>>>> cluster while in the headless state.
>>>>>>>>
>>>>>>>> On paper you may get a very resilient system in the sense that it
>>>>>>>> "stays up" and replies to ping etc.
>>>>>>>> But typically a customer wants not just availability but reliable
>>>>>>>> behavior also.
>>>>>>>>
>>>>>>>> /AndersBj
>>>>>>>>
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Anders Björnerstedt [mailto:[email protected]]
>>>>>>>> Sent: 12 October 2015 16:42
>>>>>>>> To: Anders Widell; Tony Hart; [email protected]
>>>>>>>> Subject: Re: [users] Avoid rebooting payload modules after losing
>>>>>>>> system controller
>>>>>>>>
>>>>>>>> Note that this headless variant is a very questionable feature,
>>>>>>>> for the reasons explained earlier, i.e. you *will* get a
>>>>>>>> reduction in service availability.
>>>>>>>> It was never accepted into OpenSAF for that reason.
>>>>>>>>
>>>>>>>> On top of that, the unreliability will typically not be
>>>>>>>> explicit/handled. That is, the operator will probably not even know
>>>>>>>> what is working and what is not during the SC absence, since the
>>>>>>>> alarm/notification function is gone. No OpenSAF director services
>>>>>>>> are executing.
>>>>>>>>
>>>>>>>> It is truly a headless system, i.e. a zombie system, and thus not
>>>>>>>> working at full monitoring and availability functionality.
>>>>>>>> It begs the question of what OpenSAF and SAF are there for in the
>>>>>>>> first place.
>>>>>>>>
>>>>>>>> The SCs don’t have to run any special software and don’t have to
>>>>>>>> have any special hardware.
>>>>>>>> They do need file system access, at least for a cluster restart,
>>>>>>>> but not necessarily to handle single SC failure.
>>>>>>>> The headless variant, when headless, is also in that
>>>>>>>> not-able-to-cluster-restart state, but with even less functionality.
>>>>>>>> An SC can of course run other (non OpenSAF specific) software. And
>>>>>>>> the two SCs don’t necessarily have to be symmetric in terms of
>>>>>>>> software.
>>>>>>>>
>>>>>>>> Providing file system access via NFS is typically a non-issue. They
>>>>>>>> have three nodes. Ergo they should be able to assign two of them
>>>>>>>> the role of SC in the OpenSAF domain.
>>>>>>>>
>>>>>>>> /AndersBj
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Anders Widell [mailto:[email protected]]
>>>>>>>> Sent: 12 October 2015 16:08
>>>>>>>> To: Tony Hart; [email protected]
>>>>>>>> Subject: Re: [users] Avoid rebooting payload modules after losing
>>>>>>>> system controller
>>>>>>>>
>>>>>>>> We have actually implemented something very similar to what you are
>>>>>>>> talking about. With this feature, the payloads can survive without
>>>>>>>> a cluster restart even if both system controllers restart (or the
>>>>>>>> single system controller, in your case). If you want to try it out,
>>>>>>>> you can clone this Mercurial repository:
>>>>>>>>
>>>>>>>> https://sourceforge.net/u/anders-w/opensaf-headless/
>>>>>>>>
>>>>>>>> To enable the feature, set the variable IMMSV_SC_ABSENCE_ALLOWED in
>>>>>>>> immd.conf to the number of seconds you wish the payloads to wait
>>>>>>>> for the system controllers to come back. Note: we have only
>>>>>>>> implemented this feature for the "core" OpenSAF services (plus
>>>>>>>> CKPT), so you need to disable the optional services.
>>>>>>>>
>>>>>>>> / Anders Widell
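A minimal sketch of the behaviour described for IMMSV_SC_ABSENCE_ALLOWED in the message above, and not OpenSAF code: a payload tolerates the absence of all system controllers for a configured number of seconds and only has to reboot once that window has passed. The class `PayloadNode`, its method names, and the choice to let 0 stand for "feature disabled, reboot at once" are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PayloadNode:
    """Toy model of a payload reacting to the loss of all controllers.

    sc_absence_allowed plays the role the message above gives to
    IMMSV_SC_ABSENCE_ALLOWED: seconds to wait for the controllers.
    Treating 0 as "reboot immediately" is an assumption of this sketch.
    """
    sc_absence_allowed: int
    controllers_lost_at: Optional[float] = None

    def on_controllers_lost(self, now: float) -> None:
        # Remember only the first moment at which no controller was left.
        if self.controllers_lost_at is None:
            self.controllers_lost_at = now

    def on_controller_back(self) -> None:
        # A controller has returned, so the absence window is cancelled.
        self.controllers_lost_at = None

    def must_reboot(self, now: float) -> bool:
        # Reboot only if the controllers have been gone for the full window.
        if self.controllers_lost_at is None:
            return False
        return now - self.controllers_lost_at >= self.sc_absence_allowed
```

With `PayloadNode(sc_absence_allowed=900)`, losing the controllers at t=100 leaves the payload running at t=500, and it would have to reboot from t=1000 onward unless `on_controller_back()` is called first.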
>>>>>>>>
>>>>>>>> On 10/11/2015 02:30 PM, Tony Hart wrote:
>>>>>>>>> We have been using opensaf in our product for a couple of years
>>>>>>>>> now.  One of the issues we have is the fact that payload cards
>>>>>>>>> reboot when the system controllers are lost.  Although our payload
>>>>>>>>> card hardware will continue to perform its functions whilst the
>>>>>>>>> software is down (which is desirable) the functions that the
>>>>>>>>> software performs are obviously not performed (which is not
>>>>>>>>> desirable).
>>>>>>>>>
>>>>>>>>> Why would we lose both controllers, surely that is a rare
>>>>>>>>> circumstance?  Not if you only have one controller to begin with.
>>>>>>>>> Removing the second controller is a significant cost saving for us
>>>>>>>>> so we want to support a product that only has one controller. The
>>>>>>>>> most significant impediment to that is the loss of payload
>>>>>>>>> software functions when the system controller fails.
>>>>>>>>>
>>>>>>>>> I’m looking for suggestions from this email list as to what could
>>>>>>>>> be done for this issue.
>>>>>>>>>
>>>>>>>>> One suggestion, that would work for us, is if we could convince
>>>>>>>>> the payload card to only reboot when the controller reappears
>>>>>>>>> after a loss rather than when the loss initially occurs.  Is that
>>>>>>>>> possible?
>>>>>>>>>
>>>>>>>>> Another possibility is if we could support more than 2
>>>>>>>>> controllers, for example if we could support 4 (one active and 3
>>>>>>>>> standbys) that would also provide a solution for us (our current
>>>>>>>>> payloads would instead become controllers). I know that this is
>>>>>>>>> not currently possible with opensaf.
>>>>>>>>>
>>>>>>>>> thanks for any suggestions,
>>>>>>>>> —
>>>>>>>>> tony
>>>>>>>>>
>>>>>>>>> ------------------------------------------------------------------------------
>>>>>>>>> _______________________________________________
>>>>>>>>> Opensaf-users mailing list
>>>>>>>>> [email protected]
>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-users
>>>>>>>>
>>>>>
>>>>
>>>>



