Hi Mathi,

Any updates on this topic?

thanks
— tony
> On Oct 14, 2015, at 7:50 AM, Mathivanan Naickan Palanivelu
> <[email protected]> wrote:
>
> Tony,
>
> We (in the TLC) have discussed this before. At this point in time
> you indeed have the solution that AndersW has pointed to.
>
> There are also other discussions going on w.r.t. multiple standbys etc.
> for the next major release. Will tag you in the appropriate ticket
> (one such ticket is http://sourceforge.net/p/opensaf/tickets/1170/)
> once those discussions conclude.
>
> Cheers,
> Mathi.
>
> ----- [email protected] wrote:
>
>> Thanks all, this is a good discussion.
>>
>> Let me again state my use case, which as I said I don't think is
>> unique, because selfishly I want to make sure that this is covered :-)
>>
>> I want to keep the system running in the case where both controllers
>> fail, or more specifically in my case, when the only server in the
>> system fails (cost-reduced system).
>>
>> The "don't reboot until a server re-appears" approach would work for
>> me. I think it's a poorer solution, but it would be workable (Mathi's
>> suggestion). Especially if it's available sooner.
>>
>> A better option is to allow one of the linecards to take over the
>> controller role when the preferred one dies. Since linecards are
>> assumed to be transitory (they may or may not be there, or could be
>> removed), I don't want to have to pick a linecard for this role at
>> boot time. Much better would be to allow some number (more than 2) of
>> nodes to be controllers (e.g. Active, Standby, Spare, Spare …). Then
>> OSAF takes responsibility for electing the next Active/Standby when
>> necessary. There would have to be a concept of preference such that
>> the Server node is chosen given a choice.
>>
>> For the solutions discussed so far…
>>
>> Making the OSAF processes restartable and able to be active on either
>> controller is good and will increase the MTBF of an OSAF system.
>> However, it won't fix my problem if there are still only two
>> controllers allowed.
>>
>> I'm not familiar with the "cattle" concept.
>>
>> I was only advocating the "headless" solution to the extent that it
>> solved my immediate problem. Even if we did have a "headless" mode, it
>> would be a temporary situation until the failed controller could be
>> replaced.
>>
>> thanks
>> —
>> tony
>>
>>
>>> On Oct 14, 2015, at 6:44 AM, Mathivanan Naickan Palanivelu
>>> <[email protected]> wrote:
>>>
>>> In a clustered environment, all nodes need to be in the same
>>> consistent view from a membership perspective.
>>> So, loss of a leader will indeed result in state changes to the
>>> nodes following the leader. Therefore there cannot be independent
>>> lifecycles for some nodes and other nodes.
>>> B.T.W. the restart mentioned below meant a 'transparent restart
>>> of OpenSAF' without affecting applications, and this is something
>>> the application's CLC-CLI scripts need to handle.
>>>
>>> While we could be inclined to support a (unique!) use-case such as
>>> this, it is the solution (headless fork) that would be under the
>>> lens! ;-)
>>>
>>> Let's discuss this!
>>>
>>> Cheers,
>>> Mathi.
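To make Mathi's remark that "the application's CLC-CLI scripts need to handle" a transparent restart more concrete: one plausible arrangement (a minimal sketch, not from the thread; the pidfile path and application command are hypothetical) is an INSTANTIATE script that adopts an already-running application process instead of starting a second one, so that a restart of OpenSAF alone leaves the application untouched.

    #!/bin/sh
    # Hypothetical CLC-CLI INSTANTIATE script sketch. The pidfile path
    # and the application command are illustrative assumptions.
    PIDFILE=/var/run/myapp.pid

    # If the application survived a middleware-only restart, adopt the
    # running process instead of starting a second instance.
    if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        exit 0
    fi

    # Normal case: start the application and record its PID.
    /usr/local/bin/myapp &
    echo $! > "$PIDFILE"
    exit 0

Whether silently adopting the old process is safe is of course application-specific; the point is only that the CLC-CLI layer is where that decision would live.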
>>> ----- [email protected] wrote:
>>>
>>>> Yes, this is yet another approach. But it is also another use-case
>>>> for the headless feature. When we have moved the system controllers
>>>> out of the cluster (into the cloud infrastructure), I would expect
>>>> controllers and payloads to have independent life cycles. You have
>>>> servers (i.e. system controllers) and clients (payloads). They can
>>>> be installed and upgraded separately from each other, and I wouldn't
>>>> expect a restart of the servers to cause all the clients to restart
>>>> as well, in the same way as I don't expect my web browser to restart
>>>> just because the web server has crashed.
>>>>
>>>> / Anders Widell
>>>>
>>>> On 10/13/2015 03:54 PM, Mathivanan Naickan Palanivelu wrote:
>>>>> I don't think this is a case of cattle! Even in those scenarios,
>>>>> the cloud management stacks' "controller" software themselves are
>>>>> 'placed' on physical nodes in appropriate redundancy models, and
>>>>> not inside those cattle VMs!
>>>>>
>>>>> I think the case here is about avoiding a reboot of the node!
>>>>> This can be achieved by setting the NOACTIVE timer to a longer
>>>>> value, until OpenSAF on the controller comes back up.
>>>>> Upon detecting that the controllers are up, some entity on the
>>>>> local node restarts OpenSAF (/etc/init.d/opensafd restart).
>>>>> And ensure the CLC-CLI scripts of the applications differentiate a
>>>>> usual restart from this spoof-restart!
>>>>>
>>>>> Mathi.
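As a concrete reading of Mathi's workaround above, here is a minimal sketch of the "some entity on the local node": a payload-side watchdog that waits for a system controller to answer again and then performs the spoof-restart. The controller address, the flag file, and the use of ping as the liveness probe are assumptions for illustration; only the /etc/init.d/opensafd restart step comes from the thread.

    #!/bin/sh
    # Hypothetical payload-side watchdog, started once SC loss has been
    # detected. SC_ADDR and the flag file are illustrative assumptions.
    SC_ADDR=10.0.0.1                       # assumed controller address
    FLAG=/var/run/opensaf-spoof-restart    # assumed marker file

    # Wait until a system controller answers again.
    while ! ping -c 1 -W 2 "$SC_ADDR" >/dev/null 2>&1; do
        sleep 5
    done

    # Mark this as a spoof-restart so the applications' CLC-CLI scripts
    # can tell it apart from an ordinary restart, then restart the local
    # OpenSAF instance as Mathi suggests.
    touch "$FLAG"
    /etc/init.d/opensafd restart

A CLC-CLI script could then test for (and remove) the flag file to decide whether to leave its running application process alone.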
>>>>>> -----Original Message-----
>>>>>> From: Anders Widell [mailto:[email protected]]
>>>>>> Sent: Tuesday, October 13, 2015 5:36 PM
>>>>>> To: Anders Björnerstedt; Tony Hart
>>>>>> Cc: [email protected]
>>>>>> Subject: Re: [users] Avoid rebooting payload modules after losing
>>>>>> system controller
>>>>>>
>>>>>> Yes, I agree that the best fit for this feature is an application
>>>>>> using either the NWay-Active or the No-Redundancy models, and where
>>>>>> you view the system more as a collection of nodes rather than as a
>>>>>> cluster. This kind of architecture is quite common when you write
>>>>>> applications for the cloud. The redundancy models are suitable for
>>>>>> scaling, and the architecture fits into the "cattle" philosophy
>>>>>> which is common in the cloud.
>>>>>> Such an application can tolerate any number of node failures, and
>>>>>> the remaining nodes would still be able to continue functioning and
>>>>>> provide their service. However, if we put the OpenSAF middleware on
>>>>>> the nodes, it becomes the weakest link, since OpenSAF will reboot
>>>>>> all the nodes just because the two controller nodes fail. What a
>>>>>> pity on a system with one hundred nodes!
>>>>>>
>>>>>> / Anders Widell
>>>>>>
>>>>>> On 10/13/2015 01:19 PM, Anders Björnerstedt wrote:
>>>>>>>
>>>>>>> On 10/13/2015 12:27 PM, Anders Widell wrote:
>>>>>>>> The possibility to have more than two system controllers (one
>>>>>>>> active + several standby and/or spare controller nodes) is also
>>>>>>>> something that has been investigated. For scalability reasons, we
>>>>>>>> probably can't turn all nodes into standby controllers in a large
>>>>>>>> cluster - but it may be feasible to have a system with one or
>>>>>>>> several standby controllers, where the rest of the nodes are
>>>>>>>> spares that are ready to take an active or standby assignment
>>>>>>>> when needed.
>>>>>>>>
>>>>>>>> However, the "headless" feature will still be needed in some
>>>>>>>> systems where you need dedicated controller node(s).
>>>>>>> That sounds as if some deployments have a special requirement that
>>>>>>> can only be supported by the headless feature.
>>>>>>> But you also have to say that the headless feature places
>>>>>>> anti-requirements on the deployments/applications that are to use
>>>>>>> it.
>>>>>>>
>>>>>>> For example, not needing cluster coherence among the payloads.
>>>>>>>
>>>>>>> If the payloads only run independent application instances where
>>>>>>> each instance is implemented at one processor, or at least does
>>>>>>> not communicate in any state-sensitive way with peer processes at
>>>>>>> other payloads; and no such instance is unique, or if it is unique
>>>>>>> it is still expendable (non-critical to the service), then it
>>>>>>> could work.
>>>>>>>
>>>>>>> It is important that the deployments that end up thinking they
>>>>>>> need the headless feature also understand what they lose with the
>>>>>>> headless feature, and that this loss is acceptable for that
>>>>>>> deployment.
>>>>>>>
>>>>>>> So headless is not a fancy feature needed by some exclusive and
>>>>>>> picky subset of applications.
>>>>>>> It is a relaxation that drops all requirements on distributed
>>>>>>> consistency, and may be acceptable to some applications with
>>>>>>> weaker demands, so that they can accept the anti-requirements.
>>>>>>>
>>>>>>> Besides requiring "dedicated" controller nodes, the deployment
>>>>>>> must of course NOT require any *availability* of those dedicated
>>>>>>> controller nodes, i.e. not have any requirements on service
>>>>>>> availability in general.
>>>>>>>
>>>>>>> It may work for some "dumb" applications that are stateless, or
>>>>>>> state stable (frozen in state), or have no requirements on
>>>>>>> availability of state. In other words, some applications that
>>>>>>> really don't need SAF.
>>>>>>>
>>>>>>> They may still want to use SAF as a way of managing and monitoring
>>>>>>> the system when it happens to be healthy, but can live with long
>>>>>>> periods of not being able to manage or monitor that system, which
>>>>>>> can then be degrading in any way that is possible.
>>>>>>>
>>>>>>> /AndersBJ
>>>>>>>
>>>>>>>> / Anders Widell
>>>>>>>>
>>>>>>>> On 10/13/2015 12:07 PM, Tony Hart wrote:
>>>>>>>>> Understood. The assumption is that this is temporary, but we
>>>>>>>>> allow the payloads to continue to run (with reduced osaf
>>>>>>>>> functionality) until a replacement controller is found. At that
>>>>>>>>> point they can reboot to get the system back into sync.
>>>>>>>>>
>>>>>>>>> Or allow more than 2 controllers in the system, so we can have
>>>>>>>>> one or more usually-payload cards be controllers to reduce the
>>>>>>>>> probability of no-controllers to an acceptable level.
>>>>>>>>>
>>>>>>>>>> On Oct 12, 2015, at 11:05 AM, Anders Björnerstedt
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> The headless state is also vulnerable to split-brain scenarios.
>>>>>>>>>> That is, network partitions and joins can occur and will not be
>>>>>>>>>> detected as such, and thus not handled properly (isolated) when
>>>>>>>>>> they occur.
>>>>>>>>>> Basically you cannot be sure you have a continuously coherent
>>>>>>>>>> cluster while in the headless state.
>>>>>>>>>>
>>>>>>>>>> On paper you may get a very resilient system, in the sense that
>>>>>>>>>> it "stays up" and replies to ping etc.
>>>>>>>>>> But typically a customer wants not just availability but
>>>>>>>>>> reliable behavior also.
>>>>>>>>>>
>>>>>>>>>> /AndersBj
>>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Anders Björnerstedt
>>>>>>>>>> [mailto:[email protected]]
>>>>>>>>>> Sent: den 12 oktober 2015 16:42
>>>>>>>>>> To: Anders Widell; Tony Hart; [email protected]
>>>>>>>>>> Subject: Re: [users] Avoid rebooting payload modules after
>>>>>>>>>> losing system controller
>>>>>>>>>>
>>>>>>>>>> Note that this headless variant is a very questionable feature.
>>>>>>>>>> This for the reasons explained earlier, i.e. you *will* get a
>>>>>>>>>> reduction in service availability.
>>>>>>>>>> It was never accepted into OpenSAF for that reason.
>>>>>>>>>>
>>>>>>>>>> On top of that, the unreliability will typically not be
>>>>>>>>>> explicit/handled. That is, the operator will probably not even
>>>>>>>>>> know what is working and what is not during the SC absence,
>>>>>>>>>> since the alarm/notification function is gone. No OpenSAF
>>>>>>>>>> director services are executing.
>>>>>>>>>>
>>>>>>>>>> It is truly a headless system, i.e. a zombie system, and thus
>>>>>>>>>> not working at full monitoring and availability functionality.
>>>>>>>>>> It begs the question of what OpenSAF and SAF are there for in
>>>>>>>>>> the first place.
>>>>>>>>>>
>>>>>>>>>> The SCs don't have to run any special software and don't have
>>>>>>>>>> to have any special hardware.
>>>>>>>>>> They do need file system access, at least for a cluster
>>>>>>>>>> restart, but not necessarily to handle single SC failure.
>>>>>>>>>> The headless variant, while headless, is in that
>>>>>>>>>> not-able-to-cluster-restart state as well, but with even less
>>>>>>>>>> functionality.
>>>>>>>>>>
>>>>>>>>>> An SC can of course run other (non-OpenSAF-specific) software.
>>>>>>>>>> And the two SCs don't necessarily have to be symmetric in terms
>>>>>>>>>> of software.
>>>>>>>>>>
>>>>>>>>>> Providing file system access via NFS is typically a non-issue.
>>>>>>>>>> They have three nodes. Ergo they should be able to assign two
>>>>>>>>>> of them the role of SC in the OpenSAF domain.
>>>>>>>>>>
>>>>>>>>>> /AndersBj
>>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Anders Widell [mailto:[email protected]]
>>>>>>>>>> Sent: den 12 oktober 2015 16:08
>>>>>>>>>> To: Tony Hart; [email protected]
>>>>>>>>>> Subject: Re: [users] Avoid rebooting payload modules after
>>>>>>>>>> losing system controller
>>>>>>>>>>
>>>>>>>>>> We have actually implemented something very similar to what you
>>>>>>>>>> are talking about. With this feature, the payloads can survive
>>>>>>>>>> without a cluster restart even if both system controllers
>>>>>>>>>> restart (or the single system controller, in your case). If you
>>>>>>>>>> want to try it out, you can clone this Mercurial repository:
>>>>>>>>>>
>>>>>>>>>> https://sourceforge.net/u/anders-w/opensaf-headless/
>>>>>>>>>>
>>>>>>>>>> To enable the feature, set the variable IMMSV_SC_ABSENCE_ALLOWED
>>>>>>>>>> in immd.conf to the number of seconds you wish the payloads to
>>>>>>>>>> wait for the system controllers to come back. Note: we have
>>>>>>>>>> only implemented this feature for the "core" OpenSAF services
>>>>>>>>>> (plus CKPT), so you need to disable the optional services.
>>>>>>>>>>
>>>>>>>>>> / Anders Widell
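For concreteness, enabling the feature described above might look like the following immd.conf fragment. OpenSAF .conf files are sourced as shell scripts; the value 300 is an arbitrary example, and treating unset/0 as "disabled" is an assumption, so check the headless branch for the exact semantics.

    # immd.conf (fragment)
    # Allow the payloads to survive up to 300 seconds without any system
    # controller before the usual reboot behaviour kicks in.
    export IMMSV_SC_ABSENCE_ALLOWED=300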
>>>>>>>>>>
>>>>>>>>>> On 10/11/2015 02:30 PM, Tony Hart wrote:
>>>>>>>>>>> We have been using opensaf in our product for a couple of
>>>>>>>>>>> years now. One of the issues we have is the fact that payload
>>>>>>>>>>> cards reboot when the system controllers are lost. Although
>>>>>>>>>>> our payload card hardware will continue to perform its
>>>>>>>>>>> functions whilst the software is down (which is desirable),
>>>>>>>>>>> the functions that the software performs are obviously not
>>>>>>>>>>> performed (which is not desirable).
>>>>>>>>>>>
>>>>>>>>>>> Why would we lose both controllers; surely that is a rare
>>>>>>>>>>> circumstance? Not if you only have one controller to begin
>>>>>>>>>>> with. Removing the second controller is a significant cost
>>>>>>>>>>> saving for us, so we want to support a product that only has
>>>>>>>>>>> one controller. The most significant impediment to that is the
>>>>>>>>>>> loss of payload software functions when the system controller
>>>>>>>>>>> fails.
>>>>>>>>>>>
>>>>>>>>>>> I'm looking for suggestions from this email list as to what
>>>>>>>>>>> could be done for this issue.
>>>>>>>>>>>
>>>>>>>>>>> One suggestion, that would work for us, is if we could
>>>>>>>>>>> convince the payload card to only reboot when the controller
>>>>>>>>>>> reappears after a loss, rather than when the loss initially
>>>>>>>>>>> occurs. Is that possible?
>>>>>>>>>>>
>>>>>>>>>>> Another possibility is if we could support more than 2
>>>>>>>>>>> controllers; for example, if we could support 4 (one active
>>>>>>>>>>> and 3 standbys), that would also provide a solution for us
>>>>>>>>>>> (our current payloads would instead become controllers). I
>>>>>>>>>>> know that this is not currently possible with opensaf.
>>>>>>>>>>>
>>>>>>>>>>> thanks for any suggestions,
>>>>>>>>>>> —
>>>>>>>>>>> tony

------------------------------------------------------------------------------
_______________________________________________
Opensaf-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-users
