Hi Tony,

Yes, it has been agreed to support more than two controllers in a phased manner.
Ticket #79 (http://sourceforge.net/p/opensaf/tickets/79/) will be available
as part of the next major release, 5.0, in April 2016. It will provide
support for configuring spare system controllers, i.e. every time a
controller dies, a spare will be promoted to ACTIVE. In your case, one of
the line cards can be configured as a spare! Please see the attachment in
the ticket for an overview of the feature.

At the same time, it has been agreed to also merge the solution in
https://sourceforge.net/u/anders-w/opensaf-headless/ (currently maintained
as a fork) into the main branch in 5.0. This feature will be configurable,
so it can be turned on or off.

Subsequently, 5.1 will support a mix of multiple standbys + spares.
The details have been updated in the following two tickets:
http://sourceforge.net/p/opensaf/tickets/439/ and
http://sourceforge.net/p/opensaf/tickets/1170/

Cheers,
Mathi.

----- [email protected] wrote:

> Hi Mathi,
> Any updates on this topic?
> thanks
> --
> tony
>
> On Oct 14, 2015, at 7:50 AM, Mathivanan Naickan Palanivelu
> <[email protected]> wrote:
> >
> > Tony,
> >
> > We (in the TLC) have discussed this before. At this point in time
> > you indeed have the solution that AndersW has pointed to.
> >
> > There are also other discussions going on w.r.t. multiple standbys
> > etc. for the next major release. Will tag you in the appropriate
> > ticket (one such ticket is
> > http://sourceforge.net/p/opensaf/tickets/1170/)
> > once those discussions conclude.
> >
> > Cheers,
> > Mathi.
> >
> > ----- [email protected] wrote:
> >
> >> Thanks all, this is a good discussion.
> >>
> >> Let me again state my use case, which as I said I don’t think is
> >> unique, because selfishly I want to make sure that this is covered
> >> :-)
> >>
> >> I want to keep the system running in the case where both
> >> controllers fail, or more specifically in my case, when the only
> >> server in the system fails (cost reduced system).
> >>
> >> The "don’t reboot until a server re-appears" would work for me. I
> >> think it’s a poorer solution but it would be workable (Mathi’s
> >> suggestion). Especially if it’s available sooner.
> >>
> >> A better option is to allow one of the linecards to take over the
> >> controller role when the preferred one dies. Since linecards are
> >> assumed to be transitory (they may or may not be there or could be
> >> removed), I don’t want to have to pick a linecard for this role at
> >> boot time. Much better would be to allow some number (more than 2)
> >> of nodes to be controllers (e.g. Active, Standby, Spare, Spare …).
> >> Then OSAF takes responsibility for electing the next Active/Standby
> >> when necessary. There would have to be a concept of preference such
> >> that the Server node is chosen given a choice.
> >>
> >> For the solutions discussed so far…
> >>
> >> Making the OSAF processes restartable and able to be active on
> >> either controller is good and will increase the MTBF of an OSAF
> >> system. However, it won’t fix my problem if there are still only
> >> two controllers allowed.
> >>
> >> I’m not familiar with the “cattle” concept.
> >>
> >> I wasn’t advocating the “headless” solution, except to the extent
> >> that it solved my immediate problem. Even if we did have a
> >> “headless” mode it would be a temporary situation until the failed
> >> controller could be replaced.
> >>
> >> thanks
> >> --
> >> tony
> >>
> >>> On Oct 14, 2015, at 6:44 AM, Mathivanan Naickan Palanivelu
> >>> <[email protected]> wrote:
> >>>
> >>> In a clustered environment, all nodes need to share the same
> >>> consistent view from a membership perspective.
> >>> So, loss of a leader will indeed result in state changes to the
> >>> nodes following the leader. Therefore there cannot be independent
> >>> lifecycles for some nodes and other nodes.
> >>> B.T.W., the restart mentioned below meant a 'transparent restart
> >>> of OpenSAF' without affecting applications, and the applications'
> >>> CLC-CLI scripts need to handle this.
> >>>
> >>> While we could be inclined to support a (unique!) use-case such as
> >>> this, it is the solution (headless fork) that would be under the
> >>> lens! ;-)
> >>>
> >>> Let's discuss this!
> >>>
> >>> Cheers,
> >>> Mathi.
> >>>
> >>> ----- [email protected] wrote:
> >>>
> >>>> Yes, this is yet another approach. But it is also another use-case
> >>>> for the headless feature. When we have moved the system
> >>>> controllers out of the cluster (into the cloud infrastructure), I
> >>>> would expect controllers and payloads to have independent life
> >>>> cycles. You have servers (i.e. system controllers), and clients
> >>>> (payloads). They can be installed and upgraded separately from
> >>>> each other, and I wouldn't expect a restart of the servers to
> >>>> cause all the clients to restart as well, in the same way as I
> >>>> don't expect my web browser to restart just because the web
> >>>> server has crashed.
> >>>>
> >>>> / Anders Widell
> >>>>
> >>>> On 10/13/2015 03:54 PM, Mathivanan Naickan Palanivelu wrote:
> >>>>> I don't think this is a case of cattle! Even in those scenarios,
> >>>>> the cloud management stacks' "**controller" software is itself
> >>>>> 'placed' on physical nodes in appropriate redundancy models and
> >>>>> not inside those cattle VMs!
> >>>>>
> >>>>> I think the case here is about avoiding a reboot of the node!
> >>>>> This can be achieved by setting the NOACTIVE timer to a longer
> >>>>> value till OpenSAF on the controller comes back up.
> >>>>> Upon detecting that the controllers are up, some entity on the
> >>>>> local node restarts OpenSAF (/etc/init.d/opensafd restart).
> >>>>> And ensure the CLC-CLI scripts of the applications differentiate
> >>>>> a usual restart from this spoof-restart!
> >>>>>
> >>>>> Mathi.
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Anders Widell [mailto:[email protected]]
> >>>>>> Sent: Tuesday, October 13, 2015 5:36 PM
> >>>>>> To: Anders Björnerstedt; Tony Hart
> >>>>>> Cc: [email protected]
> >>>>>> Subject: Re: [users] Avoid rebooting payload modules after
> >>>>>> losing system controller
> >>>>>>
> >>>>>> Yes, I agree that the best fit for this feature is an
> >>>>>> application using either the NWay-Active or the No-Redundancy
> >>>>>> models, and where you view the system more as a collection of
> >>>>>> nodes rather than as a cluster. This kind of architecture is
> >>>>>> quite common when you write applications for cloud. The
> >>>>>> redundancy models are suitable for scaling, and the
> >>>>>> architecture fits into the "cattle" philosophy which is common
> >>>>>> in cloud.
> >>>>>> Such an application can tolerate any number of node failures,
> >>>>>> and the remaining nodes would still be able to continue
> >>>>>> functioning and provide their service. However, if we put the
> >>>>>> OpenSAF middleware on the nodes it becomes the weakest link,
> >>>>>> since OpenSAF will reboot all the nodes just because the two
> >>>>>> controller nodes fail. What a pity on a system with one
> >>>>>> hundred nodes!
> >>>>>>
> >>>>>> / Anders Widell
> >>>>>>
> >>>>>> On 10/13/2015 01:19 PM, Anders Björnerstedt wrote:
> >>>>>>>
> >>>>>>> On 10/13/2015 12:27 PM, Anders Widell wrote:
> >>>>>>>> The possibility to have more than two system controllers (one
> >>>>>>>> active + several standby and/or spare controller nodes) is
> >>>>>>>> also something that has been investigated. For scalability
> >>>>>>>> reasons, we probably can't turn all nodes into standby
> >>>>>>>> controllers in a large cluster - but it may be feasible to
> >>>>>>>> have a system with one or several standby controllers and the
> >>>>>>>> rest of the nodes are spares that are ready to take an active
> >>>>>>>> or standby assignment when needed.
> >>>>>>>>
> >>>>>>>> However, the "headless" feature will still be needed in some
> >>>>>>>> systems where you need dedicated controller node(s).
> >>>>>>> That sounds as if some deployments have a special requirement
> >>>>>>> that can only be supported by the headless feature.
> >>>>>>> But you also have to say that the headless feature places
> >>>>>>> anti-requirements on the deployments/applications that are to
> >>>>>>> use it.
> >>>>>>>
> >>>>>>> For example, not needing cluster coherence among the payloads.
> >>>>>>>
> >>>>>>> If the payloads only run independent application instances
> >>>>>>> where each instance is implemented at one processor or at
> >>>>>>> least does not communicate in any state-sensitive way with
> >>>>>>> peer processes at other payloads; and no such instance is
> >>>>>>> unique or if it is unique it is still expendable (non-critical
> >>>>>>> to the service), then it could work.
> >>>>>>>
> >>>>>>> It is important that the deployments that end up thinking they
> >>>>>>> need the headless feature also understand what they lose with
> >>>>>>> the headless feature and that this loss is acceptable for that
> >>>>>>> deployment.
> >>>>>>>
> >>>>>>> So headless is not a fancy feature needed by some exclusive
> >>>>>>> and picky subset of applications.
> >>>>>>> It is a relaxation that drops all requirements on distributed
> >>>>>>> consistency and may be acceptable to some applications with
> >>>>>>> weaker demands, so they can accept the anti-requirements.
> >>>>>>>
> >>>>>>> Besides requiring "dedicated" controller nodes, the deployment
> >>>>>>> must of course NOT require any *availability* of those
> >>>>>>> dedicated controller nodes, i.e. not have any requirements on
> >>>>>>> service availability in general.
> >>>>>>>
> >>>>>>> It may work for some "dumb" applications that are stateless,
> >>>>>>> or state stable (frozen in state), or have no requirements on
> >>>>>>> availability of state. In other words, some applications that
> >>>>>>> really don't need SAF.
> >>>>>>>
> >>>>>>> They may still want to use SAF as a way of managing and
> >>>>>>> monitoring the system when it happens to be healthy, but can
> >>>>>>> live with long periods of not being able to manage or monitor
> >>>>>>> that system, which can then be degrading in any way that is
> >>>>>>> possible.
> >>>>>>>
> >>>>>>> /AndersBJ
> >>>>>>>
> >>>>>>>> / Anders Widell
> >>>>>>>>
> >>>>>>>> On 10/13/2015 12:07 PM, Tony Hart wrote:
> >>>>>>>>> Understood. The assumption is that this is temporary but we
> >>>>>>>>> allow the payloads to continue to run (with reduced osaf
> >>>>>>>>> functionality) until a replacement controller is found. At
> >>>>>>>>> that point they can reboot to get the system back into sync.
> >>>>>>>>>
> >>>>>>>>> Or allow more than 2 controllers in the system so we can
> >>>>>>>>> have one or more usually-payload cards be controllers to
> >>>>>>>>> reduce the probability of no-controllers to an acceptable
> >>>>>>>>> level.
> >>>>>>>>>
> >>>>>>>>>> On Oct 12, 2015, at 11:05 AM, Anders Björnerstedt
> >>>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>> The headless state is also vulnerable to split-brain
> >>>>>>>>>> scenarios. That is, network partitions and joins can occur
> >>>>>>>>>> and will not be detected as such, and thus will not be
> >>>>>>>>>> handled properly (isolated) when they occur.
> >>>>>>>>>> Basically you cannot be sure you have a continuously
> >>>>>>>>>> coherent cluster while in the headless state.
> >>>>>>>>>>
> >>>>>>>>>> On paper you may get a very resilient system in the sense
> >>>>>>>>>> that it "stays up" and replies to ping etc.
> >>>>>>>>>> But typically a customer wants not just availability but
> >>>>>>>>>> reliable behavior also.
> >>>>>>>>>>
> >>>>>>>>>> /AndersBj
> >>>>>>>>>>
> >>>>>>>>>> -----Original Message-----
> >>>>>>>>>> From: Anders Björnerstedt [mailto:[email protected]]
> >>>>>>>>>> Sent: 12 October 2015 16:42
> >>>>>>>>>> To: Anders Widell; Tony Hart; [email protected]
> >>>>>>>>>> Subject: Re: [users] Avoid rebooting payload modules after
> >>>>>>>>>> losing system controller
> >>>>>>>>>>
> >>>>>>>>>> Note that this headless variant is a very questionable
> >>>>>>>>>> feature. This is for the reasons explained earlier, i.e.
> >>>>>>>>>> you *will* get a reduction in service availability.
> >>>>>>>>>> It was never accepted into OpenSAF for that reason.
> >>>>>>>>>>
> >>>>>>>>>> On top of that, the unreliability will typically not be
> >>>>>>>>>> explicit/handled. That is, the operator will probably not
> >>>>>>>>>> even know what is working and what is not during the SC
> >>>>>>>>>> absence, since the alarm/notification function is gone. No
> >>>>>>>>>> OpenSAF director services are executing.
> >>>>>>>>>>
> >>>>>>>>>> It is truly a headless system, i.e.
> >>>>>>>>>> a zombie system, and thus not working at full monitoring
> >>>>>>>>>> and availability functionality.
> >>>>>>>>>> It begs the question of what OpenSAF and SAF are there for
> >>>>>>>>>> in the first place.
> >>>>>>>>>>
> >>>>>>>>>> The SCs don’t have to run any special software and don’t
> >>>>>>>>>> have to have any special hardware.
> >>>>>>>>>> They do need file system access, at least for a cluster
> >>>>>>>>>> restart, but not necessarily to handle single SC failure.
> >>>>>>>>>> The headless variant, when headless, is also in that
> >>>>>>>>>> not-able-to-cluster-restart state, but with even less
> >>>>>>>>>> functionality.
> >>>>>>>>>>
> >>>>>>>>>> An SC can of course run other (non OpenSAF specific)
> >>>>>>>>>> software. And the two SCs don’t necessarily have to be
> >>>>>>>>>> symmetric in terms of software.
> >>>>>>>>>>
> >>>>>>>>>> Providing file system access via NFS is typically a
> >>>>>>>>>> non-issue. They have three nodes. Ergo they should be able
> >>>>>>>>>> to assign two of them the role of SC in the OpenSAF domain.
> >>>>>>>>>>
> >>>>>>>>>> /AndersBj
> >>>>>>>>>>
> >>>>>>>>>> -----Original Message-----
> >>>>>>>>>> From: Anders Widell [mailto:[email protected]]
> >>>>>>>>>> Sent: 12 October 2015 16:08
> >>>>>>>>>> To: Tony Hart; [email protected]
> >>>>>>>>>> Subject: Re: [users] Avoid rebooting payload modules after
> >>>>>>>>>> losing system controller
> >>>>>>>>>>
> >>>>>>>>>> We have actually implemented something very similar to what
> >>>>>>>>>> you are talking about. With this feature, the payloads can
> >>>>>>>>>> survive without a cluster restart even if both system
> >>>>>>>>>> controllers restart (or the single system controller, in
> >>>>>>>>>> your case).
> >>>>>>>>>> If you want to try it out, you can clone this Mercurial
> >>>>>>>>>> repository:
> >>>>>>>>>>
> >>>>>>>>>> https://sourceforge.net/u/anders-w/opensaf-headless/
> >>>>>>>>>>
> >>>>>>>>>> To enable the feature, set the variable
> >>>>>>>>>> IMMSV_SC_ABSENCE_ALLOWED in immd.conf to the number of
> >>>>>>>>>> seconds you wish the payloads to wait for the system
> >>>>>>>>>> controllers to come back. Note: we have only implemented
> >>>>>>>>>> this feature for the "core" OpenSAF services (plus CKPT),
> >>>>>>>>>> so you need to disable the optional services.
> >>>>>>>>>>
> >>>>>>>>>> / Anders Widell
> >>>>>>>>>>
> >>>>>>>>>> On 10/11/2015 02:30 PM, Tony Hart wrote:
> >>>>>>>>>>> We have been using opensaf in our product for a couple of
> >>>>>>>>>>> years now. One of the issues we have is the fact that
> >>>>>>>>>>> payload cards reboot when the system controllers are lost.
> >>>>>>>>>>> Although our payload card hardware will continue to
> >>>>>>>>>>> perform its functions whilst the software is down (which
> >>>>>>>>>>> is desirable), the functions that the software performs
> >>>>>>>>>>> are obviously not performed (which is not desirable).
> >>>>>>>>>>>
> >>>>>>>>>>> Why would we lose both controllers, surely that is a rare
> >>>>>>>>>>> circumstance? Not if you only have one controller to
> >>>>>>>>>>> begin with. Removing the second controller is a
> >>>>>>>>>>> significant cost saving for us, so we want to support a
> >>>>>>>>>>> product that only has one controller. The most
> >>>>>>>>>>> significant impediment to that is the loss of payload
> >>>>>>>>>>> software functions when the system controller fails.
> >>>>>>>>>>>
> >>>>>>>>>>> I’m looking for suggestions from this email list as to
> >>>>>>>>>>> what could be done for this issue.
> >>>>>>>>>>>
> >>>>>>>>>>> One suggestion that would work for us is if we could
> >>>>>>>>>>> convince the payload card to only reboot when the
> >>>>>>>>>>> controller reappears after a loss rather than when the
> >>>>>>>>>>> loss initially occurs. Is that possible?
> >>>>>>>>>>>
> >>>>>>>>>>> Another possibility is to support more than 2
> >>>>>>>>>>> controllers. For example, if we could support 4 (one
> >>>>>>>>>>> active and 3 standbys), that would also provide a
> >>>>>>>>>>> solution for us (our current payloads would instead
> >>>>>>>>>>> become controllers). I know that this is not currently
> >>>>>>>>>>> possible with opensaf.
> >>>>>>>>>>>
> >>>>>>>>>>> thanks for any suggestions,
> >>>>>>>>>>> --
> >>>>>>>>>>> tony
> >>>>>>>>>>>
> >>>>>>>>>>> ----------------------------------------------------------------------
> >>>>>>>>>>> _______________________________________________
> >>>>>>>>>>> Opensaf-users mailing list
> >>>>>>>>>>> [email protected]
> >>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-users
