That’s great - thanks Mathi.

> On Dec 22, 2015, at 6:09 AM, Mathivanan Naickan Palanivelu <[email protected]> wrote:
>
> please see a correction inline:
>
> ----- [email protected] wrote:
>
>> Hi Tony,
>>
>> Yes, it has been agreed to support more than two controllers in a
>> phased manner.
>>
>> The ticket http://sourceforge.net/p/opensaf/tickets/79/ will be available
>> as a part of the next major release, 5.0, in April 2016.
>> Ticket #79 will provide support for configuring spare system controllers,
>> i.e. every time a controller dies, a spare will be promoted as an ACTIVE.
> A spare node will be promoted as an SC.
>
> Mathi.
>
>> In your case, one of the line cards can be configured as a spare!
>> Please see the attachment in the ticket for an overview of the feature.
>>
>> At the same time, it has been agreed to also include (in 5.0) the solution in
>> https://sourceforge.net/u/anders-w/opensaf-headless/ (currently
>> maintained as a fork) in the main branch. This feature
>> will be configurable so that it can be turned on or off.
>>
>> Subsequently, 5.1 will have support for a mix of multiple standbys + spares.
>>
>> The details of this have been updated in the following two tickets:
>> http://sourceforge.net/p/opensaf/tickets/439/
>> and
>> http://sourceforge.net/p/opensaf/tickets/1170/
>>
>> Cheers,
>> Mathi.
>>
>> ----- [email protected] wrote:
>>
>>> Hi Mathi,
>>> Any updates on this topic?
>>> thanks
>>> --
>>> tony
>>>
>>>> On Oct 14, 2015, at 7:50 AM, Mathivanan Naickan Palanivelu <[email protected]> wrote:
>>>>
>>>> Tony,
>>>>
>>>> We (in the TLC) have discussed this before. At this point in time
>>>> you do have the solution that AndersW has pointed to.
>>>>
>>>> There are also other discussions going on w.r.t. multiple standbys etc.
>>>> for the next major release.
>>>> Will tag you in the appropriate ticket
>>>> (one such ticket is http://sourceforge.net/p/opensaf/tickets/1170/)
>>>> once those discussions conclude.
>>>>
>>>> Cheers,
>>>> Mathi.
>>>>
>>>> ----- [email protected] wrote:
>>>>
>>>>> Thanks all, this is a good discussion.
>>>>>
>>>>> Let me again state my use case, which as I said I don’t think is
>>>>> unique, because selfishly I want to make sure that this is covered :-)
>>>>>
>>>>> I want to keep the system running in the case where both controllers
>>>>> fail, or more specifically in my case, when the only server in the
>>>>> system fails (cost reduced system).
>>>>>
>>>>> The "don’t reboot until a server re-appears" would work for me. I
>>>>> think it’s a poorer solution but it would be workable (Mathi’s
>>>>> suggestion). Especially if it’s available sooner.
>>>>>
>>>>> A better option is to allow one of the linecards to take over the
>>>>> controller role when the preferred one dies. Linecards are
>>>>> assumed to be transitory (they may or may not be there or could be
>>>>> removed), so I don’t want to have to pick a linecard for this role at
>>>>> boot time. Much better would be to allow some number (more than 2) of
>>>>> nodes to be controllers (e.g. Active, Standby, Spare, Spare …). Then
>>>>> OSAF takes responsibility for electing the next Active/Standby when
>>>>> necessary. There would have to be a concept of preference such that
>>>>> the Server node is chosen given a choice.
>>>>>
>>>>> For the solutions discussed so far…
>>>>>
>>>>> Making the OSAF processes restartable and able to be active on either
>>>>> controller is good and will increase the MTBF of an OSAF system.
>>>>> However it won’t fix my problem if there are still only two
>>>>> controllers allowed.
>>>>>
>>>>> I’m not familiar with the “cattle” concept.
>>>>>
>>>>> I wasn’t advocating the “headless” solution only to the extent that it
>>>>> solved my immediate problem.
>>>>> Even if we did have a “headless” mode it
>>>>> would be a temporary situation until the failed controller could be
>>>>> replaced.
>>>>>
>>>>> thanks
>>>>> --
>>>>> tony
>>>>>
>>>>>> On Oct 14, 2015, at 6:44 AM, Mathivanan Naickan Palanivelu <[email protected]> wrote:
>>>>>>
>>>>>> In a clustered environment, all nodes need to have the same
>>>>>> consistent view from a membership perspective.
>>>>>> So, loss of a leader will indeed result in state changes to the nodes
>>>>>> following the leader. Therefore there cannot be independent lifecycles
>>>>>> for some nodes and other nodes.
>>>>>> B.T.W. the restart mentioned below meant a 'transparent restart
>>>>>> of OpenSAF' without affecting applications, and the applications'
>>>>>> CLC-CLI scripts need to handle this.
>>>>>>
>>>>>> While we could be inclined to support a (unique!) use case such as this,
>>>>>> it is the solution (headless fork) that would be under the lens! ;-)
>>>>>>
>>>>>> Let's discuss this!
>>>>>>
>>>>>> Cheers,
>>>>>> Mathi.
>>>>>>
>>>>>> ----- [email protected] wrote:
>>>>>>
>>>>>>> Yes, this is yet another approach. But it is also another use-case for
>>>>>>> the headless feature. When we have moved the system controllers out of
>>>>>>> the cluster (into the cloud infrastructure), I would expect controllers
>>>>>>> and payloads to have independent life cycles. You have servers (i.e.
>>>>>>> system controllers), and clients (payloads). They can be installed and
>>>>>>> upgraded separately from each other, and I wouldn't expect a restart of
>>>>>>> the servers to cause all the clients to restart as well, in the same way
>>>>>>> as I don't expect my web browser to restart just because the web
>>>>>>> server has crashed.
>>>>>>>
>>>>>>> / Anders Widell
>>>>>>>
>>>>>>> On 10/13/2015 03:54 PM, Mathivanan Naickan Palanivelu wrote:
>>>>>>>> I don't think this is a case of cattle! Even in those scenarios,
>>>>>>>> the cloud management stacks, the "**controller" software themselves,
>>>>>>>> are 'placed' on physical nodes
>>>>>>>> in appropriate redundancy models and not inside those cattle VMs!
>>>>>>>>
>>>>>>>> I think the case here is about avoiding rebooting of the node!
>>>>>>>> This can be achieved by setting the NOACTIVE timer to a longer value
>>>>>>>> until OpenSAF on the controller comes back up.
>>>>>>>> Upon detecting that the controllers are up, some entity on the local
>>>>>>>> node restarts OpenSAF (/etc/init.d/opensafd restart).
>>>>>>>> And ensure that the CLC-CLI scripts of the applications differentiate
>>>>>>>> a usual restart from this spoof-restart!
>>>>>>>>
>>>>>>>> Mathi.
>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Anders Widell [mailto:[email protected]]
>>>>>>>>> Sent: Tuesday, October 13, 2015 5:36 PM
>>>>>>>>> To: Anders Björnerstedt; Tony Hart
>>>>>>>>> Cc: [email protected]
>>>>>>>>> Subject: Re: [users] Avoid rebooting payload modules after losing system
>>>>>>>>> controller
>>>>>>>>>
>>>>>>>>> Yes, I agree that the best fit for this feature is an application using either the
>>>>>>>>> NWay-Active or the No-Redundancy models, and where you view the
>>>>>>>>> system more as a collection of nodes rather than as a cluster. This kind of
>>>>>>>>> architecture is quite common when you write applications for cloud. The
>>>>>>>>> redundancy models are suitable for scaling, and the architecture fits into the
>>>>>>>>> "cattle" philosophy which is common in cloud.
>>>>>>>>> Such an application can tolerate any number of node failures, and the
>>>>>>>>> remaining nodes would still be able to continue functioning and provide their
>>>>>>>>> service. However, if we put the OpenSAF middleware on the nodes it
>>>>>>>>> becomes the weakest link, since OpenSAF will reboot all the nodes just
>>>>>>>>> because the two controller nodes fail. What a pity on a system with one
>>>>>>>>> hundred nodes!
>>>>>>>>>
>>>>>>>>> / Anders Widell
>>>>>>>>>
>>>>>>>>> On 10/13/2015 01:19 PM, Anders Björnerstedt wrote:
>>>>>>>>>>
>>>>>>>>>> On 10/13/2015 12:27 PM, Anders Widell wrote:
>>>>>>>>>>> The possibility to have more than two system controllers (one active
>>>>>>>>>>> + several standby and/or spare controller nodes) is also something
>>>>>>>>>>> that has been investigated. For scalability reasons, we probably
>>>>>>>>>>> can't turn all nodes into standby controllers in a large cluster -
>>>>>>>>>>> but it may be feasible to have a system with one or several standby
>>>>>>>>>>> controllers and the rest of the nodes are spares that are ready to
>>>>>>>>>>> take an active or standby assignment when needed.
>>>>>>>>>>>
>>>>>>>>>>> However, the "headless" feature will still be needed in some systems
>>>>>>>>>>> where you need dedicated controller node(s).
>>>>>>>>>> That sounds as if some deployments have a special requirement that can
>>>>>>>>>> only be supported by the headless feature.
>>>>>>>>>> But you also have to say that the headless feature places
>>>>>>>>>> anti-requirements on the deployments/applications that are to use it.
>>>>>>>>>>
>>>>>>>>>> For example not needing cluster coherence among the payloads.
>>>>>>>>>>
>>>>>>>>>> If the payloads only run independent application instances where each
>>>>>>>>>> instance is implemented at one processor or at least does not
>>>>>>>>>> communicate in any state-sensitive way with peer processes at other
>>>>>>>>>> payloads; and no such instance is unique or if it is unique it is
>>>>>>>>>> still expendable (non critical to the service), then it could work.
>>>>>>>>>>
>>>>>>>>>> It is important that the deployments that end up thinking they need the
>>>>>>>>>> headless feature also understand what they lose with the headless
>>>>>>>>>> feature and that this loss is acceptable for that deployment.
>>>>>>>>>>
>>>>>>>>>> So headless is not a fancy feature needed by some exclusive and picky
>>>>>>>>>> subset of applications.
>>>>>>>>>> It is a relaxation that drops all requirements on distributed
>>>>>>>>>> consistency and may be acceptable to some applications with weaker
>>>>>>>>>> demands so they can accept the anti-requirements.
>>>>>>>>>>
>>>>>>>>>> Besides requiring "dedicated" controller nodes, the deployment must of
>>>>>>>>>> course NOT require any *availability* of those dedicated controller
>>>>>>>>>> nodes, i.e. not have any requirements on service availability in
>>>>>>>>>> general.
>>>>>>>>>>
>>>>>>>>>> It may work for some "dumb" applications that are stateless, or state
>>>>>>>>>> stable (frozen in state), or have no requirements on availability of
>>>>>>>>>> state. In other words, some applications that really don't need SAF.
>>>>>>>>>>
>>>>>>>>>> They may still want to use SAF as a way of managing and monitoring the
>>>>>>>>>> system when it happens to be healthy, but can live with long periods
>>>>>>>>>> of not being able to manage or monitor that system, which can then be
>>>>>>>>>> degrading in any way that is possible.
>>>>>>>>>>
>>>>>>>>>> /AndersBJ
>>>>>>>>>>
>>>>>>>>>>> / Anders Widell
>>>>>>>>>>>
>>>>>>>>>>> On 10/13/2015 12:07 PM, Tony Hart wrote:
>>>>>>>>>>>> Understood. The assumption is that this is temporary but we allow
>>>>>>>>>>>> the payloads to continue to run (with reduced osaf functionality)
>>>>>>>>>>>> until a replacement controller is found. At that point they can
>>>>>>>>>>>> reboot to get the system back into sync.
>>>>>>>>>>>>
>>>>>>>>>>>> Or allow more than 2 controllers in the system so we can have one or
>>>>>>>>>>>> more usually-payload cards be controllers to reduce the probability
>>>>>>>>>>>> of no-controllers to an acceptable level.
>>>>>>>>>>>>
>>>>>>>>>>>>> On Oct 12, 2015, at 11:05 AM, Anders Björnerstedt
>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> The headless state is also vulnerable to split-brain scenarios.
>>>>>>>>>>>>> That is, network partitions and joins can occur and will not be
>>>>>>>>>>>>> detected as such and thus not handled properly (isolated) when they
>>>>>>>>>>>>> occur.
>>>>>>>>>>>>> Basically you cannot be sure you have a continuously coherent
>>>>>>>>>>>>> cluster while in the headless state.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On paper you may get a very resilient system in the sense that it
>>>>>>>>>>>>> "stays up" and replies to ping etc.
>>>>>>>>>>>>> But typically a customer wants not just availability but reliable
>>>>>>>>>>>>> behavior also.
>>>>>>>>>>>>>
>>>>>>>>>>>>> /AndersBj
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Anders Björnerstedt [mailto:[email protected]]
>>>>>>>>>>>>> Sent: 12 October 2015 16:42
>>>>>>>>>>>>> To: Anders Widell; Tony Hart; [email protected]
>>>>>>>>>>>>> Subject: Re: [users] Avoid rebooting payload modules after losing
>>>>>>>>>>>>> system controller
>>>>>>>>>>>>>
>>>>>>>>>>>>> Note that this headless variant is a very questionable feature.
>>>>>>>>>>>>> This is for the reasons explained earlier, i.e. you *will* get a
>>>>>>>>>>>>> reduction in service availability.
>>>>>>>>>>>>> It was never accepted into OpenSAF for that reason.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On top of that the unreliability will typically not be
>>>>>>>>>>>>> explicit/handled. That is, the operator will probably not even know
>>>>>>>>>>>>> what is working and what is not during the SC absence, since the
>>>>>>>>>>>>> alarm/notification function is gone. No OpenSAF director services
>>>>>>>>>>>>> are executing.
>>>>>>>>>>>>>
>>>>>>>>>>>>> It is truly a headless system, i.e. a zombie system and thus not
>>>>>>>>>>>>> working at full monitoring and availability functionality.
>>>>>>>>>>>>> It begs the question of what OpenSAF and SAF are there for in the
>>>>>>>>>>>>> first place.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The SCs don’t have to run any special software and don’t have to
>>>>>>>>>>>>> have any special hardware.
>>>>>>>>>>>>> They do need file system access, at least for a cluster restart,
>>>>>>>>>>>>> but not necessarily to handle single SC failure.
>>>>>>>>>>>>> The headless variant, when headless, is also in that
>>>>>>>>>>>>> not-able-to-cluster-restart state, but with even less functionality.
>>>>>>>>>>>>>
>>>>>>>>>>>>> An SC can of course run other (non OpenSAF specific) software.
>>>>>>>>>>>>> And the two SCs don’t necessarily have to be symmetric in terms of
>>>>>>>>>>>>> software.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Providing file system access via NFS is typically a non-issue. They
>>>>>>>>>>>>> have three nodes. Ergo they should be able to assign two of them
>>>>>>>>>>>>> the role of SC in the OpenSAF domain.
>>>>>>>>>>>>>
>>>>>>>>>>>>> /AndersBj
>>>>>>>>>>>>>
>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>> From: Anders Widell [mailto:[email protected]]
>>>>>>>>>>>>> Sent: 12 October 2015 16:08
>>>>>>>>>>>>> To: Tony Hart; [email protected]
>>>>>>>>>>>>> Subject: Re: [users] Avoid rebooting payload modules after losing
>>>>>>>>>>>>> system controller
>>>>>>>>>>>>>
>>>>>>>>>>>>> We have actually implemented something very similar to what you are
>>>>>>>>>>>>> talking about. With this feature, the payloads can survive without
>>>>>>>>>>>>> a cluster restart even if both system controllers restart (or the
>>>>>>>>>>>>> single system controller, in your case). If you want to try it out,
>>>>>>>>>>>>> you can clone this Mercurial repository:
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://sourceforge.net/u/anders-w/opensaf-headless/
>>>>>>>>>>>>>
>>>>>>>>>>>>> To enable the feature, set the variable IMMSV_SC_ABSENCE_ALLOWED in
>>>>>>>>>>>>> immd.conf to the number of seconds you wish the payloads to wait
>>>>>>>>>>>>> for the system controllers to come back. Note: we have only
>>>>>>>>>>>>> implemented this feature for the "core" OpenSAF services (plus
>>>>>>>>>>>>> CKPT), so you need to disable the optional services.
>>>>>>>>>>>>>
>>>>>>>>>>>>> / Anders Widell
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 10/11/2015 02:30 PM, Tony Hart wrote:
>>>>>>>>>>>>>> We have been using opensaf in our product for a couple of years
>>>>>>>>>>>>>> now.
>>>>>>>>>>>>>> One of the issues we have is the fact that payload cards
>>>>>>>>>>>>>> reboot when the system controllers are lost. Although our payload
>>>>>>>>>>>>>> card hardware will continue to perform its functions whilst the
>>>>>>>>>>>>>> software is down (which is desirable) the functions that the
>>>>>>>>>>>>>> software performs are obviously not performed (which is not
>>>>>>>>>>>>>> desirable).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Why would we lose both controllers, surely that is a rare
>>>>>>>>>>>>>> circumstance? Not if you only have one controller to begin with.
>>>>>>>>>>>>>> Removing the second controller is a significant cost saving for us
>>>>>>>>>>>>>> so we want to support a product that only has one controller. The
>>>>>>>>>>>>>> most significant impediment to that is the loss of payload
>>>>>>>>>>>>>> software functions when the system controller fails.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I’m looking for suggestions from this email list as to what could
>>>>>>>>>>>>>> be done for this issue.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> One suggestion, that would work for us, is if we could convince
>>>>>>>>>>>>>> the payload card to only reboot when the controller reappears
>>>>>>>>>>>>>> after a loss rather than when the loss initially occurs. Is that
>>>>>>>>>>>>>> possible?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Another possibility is if we could support more than 2
>>>>>>>>>>>>>> controllers, for example if we could support 4 (one active and 3
>>>>>>>>>>>>>> standbys) that would also provide a solution for us (our current
>>>>>>>>>>>>>> payloads would instead become controllers). I know that this is
>>>>>>>>>>>>>> not currently possible with opensaf.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> thanks for any suggestions,
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> tony
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> Opensaf-users mailing list
>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>> https://lists.sourceforge.net/lists/listinfo/opensaf-users
