Tony,

We (in the TLC) have discussed this before. At this point in time you indeed have the solution that AndersW has pointed to.
There are also other discussions going on w.r.t. multiple standbys etc. for the next major release. I will tag you in the appropriate ticket (one such ticket is http://sourceforge.net/p/opensaf/tickets/1170/) once those discussions conclude.

Cheers,
Mathi.

----- [email protected] wrote:
> Thanks all, this is a good discussion.
>
> Let me again state my use case, which as I said I don't think is unique, because selfishly I want to make sure that this is covered :-)
>
> I want to keep the system running in the case where both controllers fail, or more specifically in my case, when the only server in the system fails (cost-reduced system).
>
> The "don't reboot until a server re-appears" approach would work for me. I think it's a poorer solution but it would be workable (Mathi's suggestion), especially if it's available sooner.
>
> A better option is to allow one of the linecards to take over the controller role when the preferred one dies. Since linecards are assumed to be transitory (they may or may not be there, or could be removed), I don't want to have to pick a linecard for this role at boot time. Much better would be to allow some number (more than 2) of nodes to be controllers (e.g. Active, Standby, Spare, Spare …). Then OSAF takes responsibility for electing the next Active/Standby when necessary. There would have to be a concept of preference such that the Server node is chosen given a choice.
>
> For the solutions discussed so far…
>
> Making the OSAF processes restartable and able to be active on either controller is good and will increase the MTBF of an OSAF system. However it won't fix my problem if there are still only two controllers allowed.
>
> I'm not familiar with the "cattle" concept.
>
> I wasn't advocating the "headless" solution, only to the extent that it solved my immediate problem. Even if we did have a "headless" mode it would be a temporary situation until the failed controller could be replaced.
>
> thanks
> —
> tony
>
> On Oct 14, 2015, at 6:44 AM, Mathivanan Naickan Palanivelu <[email protected]> wrote:
> >
> > In a clustered environment, all nodes need to be in the same consistent view from a membership perspective. So, loss of a leader will indeed result in state changes to the nodes following the leader. Therefore there cannot be independent lifecycles for some nodes and other nodes. B.T.W. the restart mentioned below meant a 'transparent restart of OpenSAF' without affecting applications, and for this the application's CLC-CLI scripts need to handle it.
> >
> > While we could be inclined to support a (unique!) use-case such as this, it is the solution (headless fork) that would be under the lens! ;-) Let's discuss this!
> >
> > Cheers,
> > Mathi.
> >
> > ----- [email protected] wrote:
> >
> >> Yes, this is yet another approach. But it is also another use-case for the headless feature. When we have moved the system controllers out of the cluster (into the cloud infrastructure), I would expect controllers and payloads to have independent life cycles. You have servers (i.e. system controllers) and clients (payloads). They can be installed and upgraded separately from each other, and I wouldn't expect a restart of the servers to cause all the clients to restart as well, in the same way as I don't expect my web browser to restart just because the web server has crashed.
> >>
> >> / Anders Widell
> >>
> >> On 10/13/2015 03:54 PM, Mathivanan Naickan Palanivelu wrote:
> >>> I don't think this is a case of cattle! Even in those scenarios, in the cloud management stacks the "controller" software itself is 'placed' on physical nodes in appropriate redundancy models, and not inside those cattle VMs!
> >>>
> >>> I think the case here is about avoiding rebooting of the node! This can be achieved by setting the NOACTIVE timer to a longer value, until OpenSAF on the controller comes back up. Upon detecting that the controllers are up, some entity on the local node restarts OpenSAF (/etc/init.d/opensafd restart). And ensure the CLC-CLI scripts of the applications differentiate a usual restart versus this spoof-restart!
> >>>
> >>> Mathi.
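To make that suggestion concrete: a CLC-CLI INSTANTIATE script could branch on a flag left behind by whatever entity performs the opensafd restart once the controllers are back (that entity would touch the flag just before restarting OpenSAF). This is only a sketch; the marker file, its path, and the application commands are hypothetical and not part of OpenSAF.

    #!/bin/sh
    # Sketch of a CLC-CLI INSTANTIATE script that tells a normal AMF
    # instantiation apart from the "spoof restart" performed after the
    # controllers come back. The marker file and the application commands
    # are assumptions made for this example; they are not part of OpenSAF.
    SPOOF_MARKER=/var/run/myapp_spoof_restart

    if [ -f "$SPOOF_MARKER" ]; then
        # OpenSAF was restarted underneath a still-running application:
        # re-attach to the existing instance instead of starting a new one.
        rm -f "$SPOOF_MARKER"
        /opt/myapp/bin/myapp --reattach &
    else
        # Ordinary instantiation ordered by AMF: start a fresh instance.
        /opt/myapp/bin/myapp --start &
    fi
    exit 0

The only point of the sketch is that the script must be able to tell an AMF-ordered instantiation apart from the spoof-restart, so the still-running application is re-attached rather than started a second time.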
> >>>
> >>>> -----Original Message-----
> >>>> From: Anders Widell [mailto:[email protected]]
> >>>> Sent: Tuesday, October 13, 2015 5:36 PM
> >>>> To: Anders Björnerstedt; Tony Hart
> >>>> Cc: [email protected]
> >>>> Subject: Re: [users] Avoid rebooting payload modules after losing system controller
> >>>>
> >>>> Yes, I agree that the best fit for this feature is an application using either the NWay-Active or the No-Redundancy model, and where you view the system more as a collection of nodes rather than as a cluster. This kind of architecture is quite common when you write applications for the cloud. The redundancy models are suitable for scaling, and the architecture fits into the "cattle" philosophy which is common in the cloud. Such an application can tolerate any number of node failures, and the remaining nodes would still be able to continue functioning and provide their service. However, if we put the OpenSAF middleware on the nodes it becomes the weakest link, since OpenSAF will reboot all the nodes just because the two controller nodes fail. What a pity on a system with one hundred nodes!
> >>>>
> >>>> / Anders Widell
> >>>>
> >>>> On 10/13/2015 01:19 PM, Anders Björnerstedt wrote:
> >>>>> On 10/13/2015 12:27 PM, Anders Widell wrote:
> >>>>>> The possibility to have more than two system controllers (one active + several standby and/or spare controller nodes) is also something that has been investigated. For scalability reasons we probably can't turn all nodes into standby controllers in a large cluster, but it may be feasible to have a system with one or several standby controllers where the rest of the nodes are spares that are ready to take an active or standby assignment when needed.
> >>>>>>
> >>>>>> However, the "headless" feature will still be needed in some systems where you need dedicated controller node(s).
> >>>>>
> >>>>> That sounds as if some deployments have a special requirement that can only be supported by the headless feature. But you also have to say that the headless feature places anti-requirements on the deployments/applications that are to use it.
> >>>>>
> >>>>> For example, not needing cluster coherence among the payloads.
> >>>>>
> >>>>> If the payloads only run independent application instances, where each instance is implemented on one processor or at least does not communicate in any state-sensitive way with peer processes on other payloads, and no such instance is unique, or if it is unique it is still expendable (non-critical to the service), then it could work.
> >>>>>
> >>>>> It is important that the deployments that end up thinking they need the headless feature also understand what they lose with the headless feature, and that this loss is acceptable for that deployment.
> >>>>>
> >>>>> So headless is not a fancy feature needed by some exclusive and picky subset of applications. It is a relaxation that drops all requirements on distributed consistency and may be acceptable to some applications with weaker demands, so that they can accept the anti-requirements.
> >>>>>
> >>>>> Besides requiring "dedicated" controller nodes, the deployment must of course NOT require any *availability* of those dedicated controller nodes, i.e. not have any requirements on service availability in general.
> >>>>>
> >>>>> It may work for some "dumb" applications that are stateless, or state-stable (frozen in state), or have no requirements on availability of state. In other words, some applications that really don't need SAF.
> >>>>>
> >>>>> They may still want to use SAF as a way of managing and monitoring the system when it happens to be healthy, but can live with long periods of not being able to manage or monitor that system, which can then be degrading in any way that is possible.
> >>>>>
> >>>>> /AndersBJ
> >>>>>
> >>>>>> / Anders Widell
> >>>>>>
> >>>>>> On 10/13/2015 12:07 PM, Tony Hart wrote:
> >>>>>>> Understood. The assumption is that this is temporary, but we allow the payloads to continue to run (with reduced OSAF functionality) until a replacement controller is found. At that point they can reboot to get the system back into sync.
> >>>>>>>
> >>>>>>> Or allow more than 2 controllers in the system, so we can have one or more usually-payload cards be controllers to reduce the probability of no controllers to an acceptable level.
> >>>>>>>
> >>>>>>>> On Oct 12, 2015, at 11:05 AM, Anders Björnerstedt <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> The headless state is also vulnerable to split-brain scenarios. That is, network partitions and joins can occur and will not be detected as such, and thus not handled properly (isolated) when they occur. Basically you cannot be sure you have a continuously coherent cluster while in the headless state.
> >>>>>>>>
> >>>>>>>> On paper you may get a very resilient system in the sense that it "stays up" and replies on ping etc. But typically a customer wants not just availability but reliable behavior also.
> >>>>>>>>
> >>>>>>>> /AndersBj
> >>>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Anders Björnerstedt [mailto:[email protected]]
> >>>>>>>> Sent: 12 October 2015 16:42
> >>>>>>>> To: Anders Widell; Tony Hart; [email protected]
> >>>>>>>> Subject: Re: [users] Avoid rebooting payload modules after losing system controller
> >>>>>>>>
> >>>>>>>> Note that this headless variant is a very questionable feature, for the reasons explained earlier, i.e. you *will* get a reduction in service availability. It was never accepted into OpenSAF for that reason.
> >>>>>>>>
> >>>>>>>> On top of that, the unreliability will typically not be explicit/handled. That is, the operator will probably not even know what is working and what is not during the SC absence, since the alarm/notification function is gone. No OpenSAF director services are executing.
> >>>>>>>>
> >>>>>>>> It is truly a headless system, i.e. a zombie system, and thus not working at full monitoring and availability functionality. It begs the question of what OpenSAF and SAF are there for in the first place.
> >>>>>>>>
> >>>>>>>> The SCs don't have to run any special software and don't have to have any special hardware. They do need file system access, at least for a cluster restart, but not necessarily to handle a single SC failure. The headless variant, when headless, is also in that not-able-to-cluster-restart state, but with even less functionality.
> >>>>>>>>
> >>>>>>>> An SC can of course run other (non-OpenSAF-specific) software. And the two SCs don't necessarily have to be symmetric in terms of software.
> >>>>>>>>
> >>>>>>>> Providing file system access via NFS is typically a non-issue. They have three nodes. Ergo they should be able to assign two of them the role of SC in the OpenSAF domain.
> >>>>>>>>
> >>>>>>>> /AndersBj
> >>>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Anders Widell [mailto:[email protected]]
> >>>>>>>> Sent: 12 October 2015 16:08
> >>>>>>>> To: Tony Hart; [email protected]
> >>>>>>>> Subject: Re: [users] Avoid rebooting payload modules after losing system controller
> >>>>>>>>
> >>>>>>>> We have actually implemented something very similar to what you are talking about. With this feature, the payloads can survive without a cluster restart even if both system controllers restart (or the single system controller, in your case). If you want to try it out, you can clone this Mercurial repository:
> >>>>>>>>
> >>>>>>>> https://sourceforge.net/u/anders-w/opensaf-headless/
> >>>>>>>>
> >>>>>>>> To enable the feature, set the variable IMMSV_SC_ABSENCE_ALLOWED in immd.conf to the number of seconds you wish the payloads to wait for the system controllers to come back. Note: we have only implemented this feature for the "core" OpenSAF services (plus CKPT), so you need to disable the optional services.
> >>>>>>>>
> >>>>>>>> / Anders Widell
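For concreteness, trying the prototype out could look like the sketch below. The clone command reuses the repository URL from the mail (the exact Mercurial clone URL may differ), 900 seconds is only an illustrative timeout, and the export syntax is assumed to match the other variables in immd.conf.

    # Fetch the prototype branch (URL as given above; SourceForge may expose a
    # slightly different Mercurial clone URL).
    hg clone https://sourceforge.net/u/anders-w/opensaf-headless/ opensaf-headless

    # In immd.conf: allow the payloads to wait up to 15 minutes for the
    # system controllers to return before giving up (illustrative value).
    export IMMSV_SC_ABSENCE_ALLOWED=900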
> >>>>>>>> On 10/11/2015 02:30 PM, Tony Hart wrote:
> >>>>>>>>> We have been using opensaf in our product for a couple of years now. One of the issues we have is the fact that payload cards reboot when the system controllers are lost. Although our payload card hardware will continue to perform its functions whilst the software is down (which is desirable), the functions that the software performs are obviously not performed (which is not desirable).
> >>>>>>>>>
> >>>>>>>>> Why would we lose both controllers, surely that is a rare circumstance? Not if you only have one controller to begin with. Removing the second controller is a significant cost saving for us, so we want to support a product that only has one controller. The most significant impediment to that is the loss of payload software functions when the system controller fails.
> >>>>>>>>>
> >>>>>>>>> I'm looking for suggestions from this email list as to what could be done for this issue.
> >>>>>>>>>
> >>>>>>>>> One suggestion, that would work for us, is if we could convince the payload card to only reboot when the controller reappears after a loss, rather than when the loss initially occurs. Is that possible?
> >>>>>>>>>
> >>>>>>>>> Another possibility is if we could support more than 2 controllers; for example, if we could support 4 (one active and 3 standbys) that would also provide a solution for us (our current payloads would instead become controllers). I know that this is not currently possible with opensaf.
> >>>>>>>>>
> >>>>>>>>> thanks for any suggestions,
> >>>>>>>>> —
> >>>>>>>>> tony

------------------------------------------------------------------------------
_______________________________________________
Opensaf-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-users
