In this kind of solution the system controller(s) would not be part of the cluster (not the same cluster anyway) - the system controller functionality would be a service provided externally (e.g. by the cloud). So loss of connectivity with the system controller service would not in itself result in a change of the cluster membership.
/ Anders Widell

On 10/14/2015 12:44 PM, Mathivanan Naickan Palanivelu wrote:
> In a clustered environment, all nodes need to have the same consistent view
> from a membership perspective.
> So, loss of a leader will indeed result in state changes to the nodes
> following the leader. Therefore there cannot be independent lifecycles
> for some nodes and other nodes.
> B.T.W. the restart mentioned below meant a 'transparent restart
> of OpenSAF' without affecting applications, and the application's
> CLC-CLI scripts need to handle this.
>
> While we could be inclined to support a (unique!) use-case such as this,
> it is the solution (headless fork) that would be under the lens! ;-)
> Let's discuss this!
>
> Cheers,
> Mathi.
>
> ----- [email protected] wrote:
>
>> Yes, this is yet another approach. But it is also another use-case for
>> the headless feature. When we have moved the system controllers out of
>> the cluster (into the cloud infrastructure), I would expect controllers
>> and payloads to have independent life cycles. You have servers (i.e.
>> system controllers), and clients (payloads). They can be installed and
>> upgraded separately from each other, and I wouldn't expect a restart of
>> the servers to cause all the clients to restart as well, in the same way
>> as I don't expect my web browser to restart just because the web
>> server has crashed.
>>
>> / Anders Widell
>>
>> On 10/13/2015 03:54 PM, Mathivanan Naickan Palanivelu wrote:
>>> I don't think this is a case of cattle! Even in those scenarios,
>>> in the cloud management stacks, the "**controller" software itself is
>>> 'placed' on physical nodes in appropriate redundancy models and not
>>> inside those cattle VMs!
>>>
>>> I think the case here is about avoiding rebooting of the node!
>>> This can be achieved by setting the NOACTIVE timer to a longer value
>>> till OpenSAF on the controller comes back up.
>>> Upon detecting that the controllers are up, some entity on the local
>>> node restarts OpenSAF (/etc/init.d/opensafd restart).
>>> And ensure the CLC-CLI scripts of the applications differentiate a
>>> usual restart from this spoof-restart!
>>> Mathi.
>>>
>>>> -----Original Message-----
>>>> From: Anders Widell [mailto:[email protected]]
>>>> Sent: Tuesday, October 13, 2015 5:36 PM
>>>> To: Anders Björnerstedt; Tony Hart
>>>> Cc: [email protected]
>>>> Subject: Re: [users] Avoid rebooting payload modules after losing system
>>>> controller
>>>>
>>>> Yes, I agree that the best fit for this feature is an application using
>>>> either the NWay-Active or the No-Redundancy models, and where you view
>>>> the system more as a collection of nodes rather than as a cluster. This
>>>> kind of architecture is quite common when you write applications for
>>>> cloud. The redundancy models are suitable for scaling, and the
>>>> architecture fits into the "cattle" philosophy which is common in cloud.
>>>> Such an application can tolerate any number of node failures, and the
>>>> remaining nodes would still be able to continue functioning and provide
>>>> their service. However, if we put the OpenSAF middleware on the nodes it
>>>> becomes the weakest link, since OpenSAF will reboot all the nodes just
>>>> because the two controller nodes fail. What a pity on a system with one
>>>> hundred nodes!
>>>>
>>>> / Anders Widell
>>>>
>>>> On 10/13/2015 01:19 PM, Anders Björnerstedt wrote:
>>>>> On 10/13/2015 12:27 PM, Anders Widell wrote:
>>>>>> The possibility to have more than two system controllers (one active
>>>>>> + several standby and/or spare controller nodes) is also something
>>>>>> that has been investigated.
>>>>>> For scalability reasons, we probably
>>>>>> can't turn all nodes into standby controllers in a large cluster -
>>>>>> but it may be feasible to have a system with one or several standby
>>>>>> controllers and the rest of the nodes are spares that are ready to
>>>>>> take an active or standby assignment when needed.
>>>>>>
>>>>>> However, the "headless" feature will still be needed in some systems
>>>>>> where you need dedicated controller node(s).
>>>>> That sounds as if some deployments have a special requirement that can
>>>>> only be supported by the headless feature.
>>>>> But you also have to say that the headless feature places
>>>>> anti-requirements on the deployments/applications that are to use it.
>>>>> For example not needing cluster coherence among the payloads.
>>>>>
>>>>> If the payloads only run independent application instances where each
>>>>> instance is implemented at one processor or at least does not
>>>>> communicate in any state-sensitive way with peer processes at other
>>>>> payloads; and no such instance is unique or if it is unique it is
>>>>> still expendable (non critical to the service), then it could work.
>>>>> It is important that the deployments that end up thinking they need the
>>>>> headless feature also understand what they lose with the headless
>>>>> feature and that this loss is acceptable for that deployment.
>>>>>
>>>>> So headless is not a fancy feature needed by some exclusive and picky
>>>>> subset of applications.
>>>>> It is a relaxation that drops all requirements on distributed
>>>>> consistency and may be acceptable to some applications with weaker
>>>>> demands so they can accept the anti-requirements.
>>>>>
>>>>> Besides requiring "dedicated" controller nodes, the deployment must of
>>>>> course NOT require any *availability* of those dedicated controller
>>>>> nodes, i.e. not have any requirements on service availability in
>>>>> general.
>>>>>
>>>>> It may work for some "dumb" applications that are stateless, or state
>>>>> stable (frozen in state), or have no requirements on availability of
>>>>> state. In other words some applications that really don't need SAF.
>>>>> They may still want to use SAF as a way of managing and monitoring the
>>>>> system when it happens to be healthy, but can live with long periods
>>>>> of not being able to manage or monitor that system, which can then be
>>>>> degrading in any way that is possible.
>>>>>
>>>>> /AndersBJ
>>>>>
>>>>>> / Anders Widell
>>>>>>
>>>>>> On 10/13/2015 12:07 PM, Tony Hart wrote:
>>>>>>> Understood. The assumption is that this is temporary but we allow
>>>>>>> the payloads to continue to run (with reduced osaf functionality)
>>>>>>> until a replacement controller is found. At that point they can
>>>>>>> reboot to get the system back into sync.
>>>>>>>
>>>>>>> Or allow more than 2 controllers in the system so we can have one or
>>>>>>> more usually-payload cards be controllers to reduce the probability
>>>>>>> of no-controllers to an acceptable level.
>>>>>>>
>>>>>>>> On Oct 12, 2015, at 11:05 AM, Anders Björnerstedt
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> The headless state is also vulnerable to split-brain scenarios.
>>>>>>>> That is, network partitions and joins can occur and will not be
>>>>>>>> detected as such and thus not handled properly (isolated) when they
>>>>>>>> occur.
>>>>>>>> Basically you can not be sure you have a continuously coherent
>>>>>>>> cluster while in the headless state.
>>>>>>>>
>>>>>>>> On paper you may get a very resilient system in the sense that it
>>>>>>>> "stays up" and replies on ping etc.
>>>>>>>> But typically a customer wants not just availability but reliable
>>>>>>>> behavior also.
>>>>>>>>
>>>>>>>> /AndersBj
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Anders Björnerstedt [mailto:[email protected]]
>>>>>>>> Sent: 12 October 2015 16:42
>>>>>>>> To: Anders Widell; Tony Hart; [email protected]
>>>>>>>> Subject: Re: [users] Avoid rebooting payload modules after losing
>>>>>>>> system controller
>>>>>>>>
>>>>>>>> Note that this headless variant is a very questionable feature.
>>>>>>>> This is for the reasons explained earlier, i.e. you *will* get a
>>>>>>>> reduction in service availability.
>>>>>>>> It was never accepted into OpenSAF for that reason.
>>>>>>>>
>>>>>>>> On top of that the unreliability will typically not be
>>>>>>>> explicit/handled. That is, the operator will probably not even know
>>>>>>>> what is working and what is not during the SC absence since the
>>>>>>>> alarm/notification function is gone. No OpenSAF director services
>>>>>>>> are executing.
>>>>>>>>
>>>>>>>> It is truly a headless system, i.e. a zombie system and thus not
>>>>>>>> working at full monitoring and availability functionality.
>>>>>>>> It begs the question of what OpenSAF and SAF are there for in the
>>>>>>>> first place.
>>>>>>>>
>>>>>>>> The SCs don’t have to run any special software and don’t have to
>>>>>>>> have any special hardware.
>>>>>>>> They do need file system access, at least for a cluster restart,
>>>>>>>> but not necessarily to handle single SC failure.
>>>>>>>> The headless variant, when headless, is also in that
>>>>>>>> not-able-to-cluster-restart state, but with even less functionality.
>>>>>>>> An SC can of course run other (non OpenSAF specific) software. And
>>>>>>>> the two SCs don’t necessarily have to be symmetric in terms of
>>>>>>>> software.
>>>>>>>>
>>>>>>>> Providing file system access via NFS is typically a non-issue. They
>>>>>>>> have three nodes. Ergo they should be able to assign two of them
>>>>>>>> the role of SC in the OpenSAF domain.
>>>>>>>>
>>>>>>>> /AndersBj
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Anders Widell [mailto:[email protected]]
>>>>>>>> Sent: 12 October 2015 16:08
>>>>>>>> To: Tony Hart; [email protected]
>>>>>>>> Subject: Re: [users] Avoid rebooting payload modules after losing
>>>>>>>> system controller
>>>>>>>>
>>>>>>>> We have actually implemented something very similar to what you are
>>>>>>>> talking about. With this feature, the payloads can survive without
>>>>>>>> a cluster restart even if both system controllers restart (or the
>>>>>>>> single system controller, in your case). If you want to try it out,
>>>>>>>> you can clone this Mercurial repository:
>>>>>>>>
>>>>>>>> https://sourceforge.net/u/anders-w/opensaf-headless/
>>>>>>>>
>>>>>>>> To enable the feature, set the variable IMMSV_SC_ABSENCE_ALLOWED in
>>>>>>>> immd.conf to the number of seconds you wish the payloads to wait
>>>>>>>> for the system controllers to come back. Note: we have only
>>>>>>>> implemented this feature for the "core" OpenSAF services (plus
>>>>>>>> CKPT), so you need to disable the optional services.
>>>>>>>>
>>>>>>>> / Anders Widell
>>>>>>>>
>>>>>>>> On 10/11/2015 02:30 PM, Tony Hart wrote:
>>>>>>>>> We have been using opensaf in our product for a couple of years
>>>>>>>>> now. One of the issues we have is the fact that payload cards
>>>>>>>>> reboot when the system controllers are lost. Although our payload
>>>>>>>>> card hardware will continue to perform its functions whilst the
>>>>>>>>> software is down (which is desirable) the functions that the
>>>>>>>>> software performs are obviously not performed (which is not
>>>>>>>>> desirable).
>>>>>>>>>
>>>>>>>>> Why would we lose both controllers, surely that is a rare
>>>>>>>>> circumstance? Not if you only have one controller to begin with.
>>>>>>>>> Removing the second controller is a significant cost saving for us
>>>>>>>>> so we want to support a product that only has one controller. The
>>>>>>>>> most significant impediment to that is the loss of payload
>>>>>>>>> software functions when the system controller fails.
>>>>>>>>>
>>>>>>>>> I’m looking for suggestions from this email list as to what could
>>>>>>>>> be done for this issue.
>>>>>>>>>
>>>>>>>>> One suggestion, that would work for us, is if we could convince
>>>>>>>>> the payload card to only reboot when the controller reappears
>>>>>>>>> after a loss rather than when the loss initially occurs. Is that
>>>>>>>>> possible?
>>>>>>>>>
>>>>>>>>> Another possibility is if we could support more than 2
>>>>>>>>> controllers, for example if we could support 4 (one active and 3
>>>>>>>>> standbys) that would also provide a solution for us (our current
>>>>>>>>> payloads would instead become controllers). I know that this is
>>>>>>>>> not currently possible with opensaf.
>>>>>>>>>
>>>>>>>>> thanks for any suggestions,
>>>>>>>>> --
>>>>>>>>> tony
>>>>>>>>>
>>>>>>>>> ----------------------------------------------------------------------
>>>>>>>>> _______________________________________________
>>>>>>>>> Opensaf-users mailing list
>>>>>>>>> [email protected]
>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-users
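
The waiting behaviour that IMMSV_SC_ABSENCE_ALLOWED is described as enabling earlier in the thread can be pictured as a small decision function. What follows is only an illustrative Python sketch, not OpenSAF code: the class PayloadState, the function on_event, the event names and the returned action strings are all hypothetical. Rebooting once the waiting period expires is an assumption, since the thread only says how long the payloads wait.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PayloadState:
    """Hypothetical model of one payload node's view of the controllers."""
    # Plays the role of IMMSV_SC_ABSENCE_ALLOWED (seconds); 0 means disabled.
    absence_allowed_s: float
    # Time at which the last system controller was lost, if currently headless.
    headless_since: Optional[float] = None


def on_event(state: PayloadState, event: str, now: float) -> str:
    """Return the action a payload takes for `event` at time `now` (seconds)."""
    if event == "controllers_lost":
        if state.absence_allowed_s <= 0:
            # Feature disabled: the payload reboots as soon as the
            # controllers are lost (the behaviour Tony describes).
            return "reboot"
        state.headless_since = now
        return "continue_headless"
    if event == "controllers_back":
        # Controllers returned: leave the headless state without a reboot.
        state.headless_since = None
        return "resync"
    if event == "tick" and state.headless_since is not None:
        if now - state.headless_since >= state.absence_allowed_s:
            # Assumption: once the allowed absence has elapsed, give up.
            return "reboot"
        return "continue_headless"
    return "none"
```

With absence_allowed_s=900, a payload that loses its controllers at t=0 keeps running until t=900 and only then gives up. If the controllers come back before that, it resynchronises without rebooting. The sketch says nothing about the split-brain and consistency concerns raised by Anders Björnerstedt, which apply for the whole headless period.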
