> -----Original Message-----
> From: Hans Feldt [mailto:hans.fe...@ericsson.com]
> Sent: Monday, November 18, 2013 9:47 PM
> To: Mathivanan Naickan Palanivelu; Nagendra Kumar
> Cc: opensaf-devel@lists.sourceforge.net
> Subject: RE: [devel] [PATCH 1 of 1] amfnd: Reboot payload when link
> between Controller and Payload flickers [#600]
>
> >
> > -----Original Message-----
> > From: Mathivanan Naickan Palanivelu [mailto:mathi.naic...@oracle.com]
> > Sent: 18 November 2013 14:05
> > To: Hans Feldt; Nagendra Kumar
> > Cc: opensaf-devel@lists.sourceforge.net
> > Subject: RE: [devel] [PATCH 1 of 1] amfnd: Reboot payload when link
> > between Controller and Payload flickers [#600]
> >
> > Comments inline:
> >
> > > -----Original Message-----
> > > From: Hans Feldt [mailto:hans.fe...@ericsson.com]
> > > Sent: Monday, November 18, 2013 5:53 PM
> > > To: Mathivanan Naickan Palanivelu; Nagendra Kumar
> > > Cc: opensaf-devel@lists.sourceforge.net
> > > Subject: RE: [devel] [PATCH 1 of 1] amfnd: Reboot payload when link
> > > between Controller and Payload flickers [#600]
> > >
> > >
> > > > -----Original Message-----
> > > > From: Mathivanan Naickan Palanivelu
> > > > [mailto:mathi.naic...@oracle.com]
> > > > Sent: 18 November 2013 11:07
> > > > To: Hans Feldt; Nagendra Kumar
> > > > Cc: opensaf-devel@lists.sourceforge.net
> > > > Subject: RE: [devel] [PATCH 1 of 1] amfnd: Reboot payload when
> > > > link between Controller and Payload flickers [#600]
> > > >
> > > > From an OpenSAF architecture perspective, FM is designated to
> > > > orchestrate controller failover.
> > >
> > > Yes, _only_ controller failover, and it will try to do fencing by
> > > calling the opensaf_reboot script as seen in syslog.
> > >
> > > > Comments inline:
> > > >
> > > > > -----Original Message-----
> > > > > From: Hans Feldt [mailto:hans.fe...@ericsson.com]
> > > > > Sent: Monday, November 18, 2013 2:42 PM
> > > > > To: Nagendra Kumar; Mathivanan Naickan Palanivelu
> > > > > Cc: opensaf-devel@lists.sourceforge.net
> > > > > Subject: RE: [devel] [PATCH 1 of 1] amfnd: Reboot payload when
> > > > > link between Controller and Payload flickers [#600]
> > > > >
> > > > > Just to sort out node fencing and recap the architecture we have
> > > > > I checked old and new code.
> > > > >
> > > > > In 3.x, avd upon detecting heartbeat loss, would inform avm
> > > > > about this. avm would inform fm. fm would I guess issue HPI
> > > > > reboot of the node. When the reboot response is received, only
> > > > > then would avd failover services. In 3.x mds down for avnd does
> > > > > not result in any action.
> > > > >
> > > > [Mathi]
> > > > In 3.x and 4.x, FM is the one that orchestrates failover.
> > > > In 3.x, 4.x AMF indeed reboots the payload node upon losing
> > > > connection with the controller.
> > > > If that's not happening in 4.x, then that means this code path got
> > > > broken and that has to be fixed.
> > >
> > > I think it has. I am sure it worked in 3.x. Now there seems to be no
> > > node fencing done in the case of MDS NODE DOWN for a payload node.
> > >
> > > CLM is not doing it (as far as I can tell) and AMF is not doing it.
> > > According to the spec CLM is supposed to do it:
> > >
> > > " The Cluster Membership Service will initiate a cluster node reboot
> > > as a fencing and repair action if the value of the
> > > saClmNodeDisableReboot attribute of the node (defined in the
> > > SaClmNode configuration object class, see Section 4.4) is SA_FALSE. "
> > >
> > > To me it sounds like CLM should do fencing and AMF should use CLM.
> > > Controller node handling is a special case though.
> > >
> > > Mathi what do you think, how should we change the architecture?
> > >
> > [Mathi]
> > Well, there are problems if we rely on CLM alone to do the payload
> > fencing. For example, the user may want saClmNodeDisableReboot to be ON
> > because the node is running a database and/or has special constraints,
> > so the user does not want the node to be rebooted but only wants to
> > "fence only the shared resources" (no poweroff/reboot) and then mark
> > the node as a non-member!
>
> I am not sure I understand...
>
> Are you talking about a CLM node that is not an AMF node?
>
[Mathi] Both scenarios.

> What would saClmNodeDisableReboot=TRUE mean from an AMF perspective?
>
[Mathi] Suppose AMF does not fence, fencing is left to CLM, and this variable is TRUE. That scenario can potentially disrupt (because the nodes continue to be UP) a distributed application that runs on two payloads that just got cut off from the controller, can still reach each other, and are accessing a shared resource!
Imagine that a fencing action (*fencing != reboot always*) is controlled by fault policies. In terms of policy hierarchy, a fencing action done by CLM could be seen as a 'superset' policy and a fencing action done by AMF could be seen as a 'subset' policy. There could also be a 'node-global' or a 'global' policy that overrides a CLM/AMF policy (like avoiding reboot in some scenarios, or rebooting forcefully by overriding some flags, etc.).

> > In an implementation of the 'membership' service, this line from the
> > spec has to be taken with a pinch of salt (read: in conjunction with a
> > cluster 'manager').
> > i.e. a generic cluster management solution enables a deployment to have
> > a "fault handling" that allows the user to create custom policies that
> > define "various criteria" to be considered when a node is fenced.
> >
> > In the absence of a good cluster manager that can feed evidence to
> > CLM (evidence that can also override saClmNodeDisableReboot when
> > the membership algorithm is run), I think we could continue to handle
> > it in AMF (revert to the old behavior), unless it is a complex fix in
> > AMF!
>
> Continue? Quite a few years ago it was a part of avsv which was a
> combined CLM/AMF service...
>
[Mathi] If 'continue' is not comfortable, then 'reverting back' should at least be comfortable! :-)
Anyway, my point is that we should aim to take a formal approach to changing the architecture in 4.5. Until then, I think we should continue to do the reboot from within AMF in the scenario where the payload loses its connection with the controller.

Cheers,
Mathi.

> Thanks,
> Hans
>
> >
> > In the long run (4.5), we should move this and also the FM
> > functionality into CLM.
> >
> > > In 3.x avd also directly fenced nodes for various other reasons
> > > such as no success returns in responses. But not for heartbeat loss,
> > > that seemed to take a different route. This is also removed.
> > >
> > [Mathi]
> > Yeah, I think this thing has popped up multiple times now, i.e.
> > "no status returns"!! Hmm...
> >
> > Thanks,
> > Mathi.
> >
> > > Thanks,
> > > Hans
> > >
> > > >
> > > > > In 4.x, amf does not have an internal heartbeat. Amf relies on
> > > > > MDS down events (NOT CLM). Amf will not do any node level
> > > > > fencing. If CLM does fencing it would be UNCOORDINATED with AMF.
> > > > > Ticket #73 suggests AMF should use CLM events to trigger
> > > > > failover.
> > > > >
> > > > > I am starting to think #73 needs to be fixed asap. At least for
> > > > > payload nodes.
> > > > > Not sure AMF can use CLM to failover the controller role.
> > > > >
> > > > > Mathi: I don't see CLM doing node fencing in case of NODE DOWN?
> > > > >
> > > > [Mathi]
> > > > Well, in the flows of controller-failover and
> > > > payload-disconnecting-from-controller, multiple modules (AMF, FM)
> > > > are taking fencing decisions. *That is the current OpenSAF
> > > > architecture*.
> > > > And that is precisely why the saClmNodeDisableReboot flag is not
> > > > supported in these flows that are handled by AMF and its helper
> > > > services.
> > > > In the context of CLM there really cannot be a black-or-white
> > > > approach to fencing, i.e. an uncoordinated reboot would have its
> > > > own problems when applications (when controlled by AMF) have
> > > > dependencies and/or preferences for how they (critical
> > > > applications) want the node to be evicted (gracefully and/or by
> > > > immediate reboot).
> > > >
> > > > And yes, OpenSAF today does rely on MDS/TIPC for providing
> > > > cluster/connection management evidence.
> > > >
> > > >
> > > > Thanks,
> > > > Mathi.
> > > >
> > > > > /Hans
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Nagendra Kumar [mailto:nagendr...@oracle.com]
> > > > > > Sent: 18 November 2013 09:34
> > > > > > To: Hans Feldt; Suryanarayana Garlapati; Hans Nordebäck;
> > > > > > Praveen Malviya; Mathivanan Naickan Palanivelu
> > > > > > Cc: opensaf-devel@lists.sourceforge.net
> > > > > > Subject: RE: [devel] [PATCH 1 of 1] amfnd: Reboot payload when
> > > > > > link between Controller and Payload flickers [#600]
> > > > > >
> > > > > > >> Does amfd try to fence the payload when this happens?
> > > > > > Amfd resets the Amfnd information, but Amfnd lives on like an
> > > > > > orphan, as Amfd will not entertain any requests from Amfnd.
> > > > > >
> > > > > > Thanks
> > > > > > -Nagu
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Hans Feldt [mailto:hans.fe...@ericsson.com]
> > > > > > Sent: 18 November 2013 12:56
> > > > > > To: Nagendra Kumar; Suryanarayana Garlapati; Hans Nordebäck;
> > > > > > Praveen Malviya; Mathivanan Naickan Palanivelu
> > > > > > Cc: opensaf-devel@lists.sourceforge.net
> > > > > > Subject: RE: [devel] [PATCH 1 of 1] amfnd: Reboot payload when
> > > > > > link between Controller and Payload flickers [#600]
> > > > > >
> > > > > > As I said, this problem originates from a low memory condition
> > > > > > on the payload that results in TIPC on that payload resetting
> > > > > > the link(s).
> > > > > > Once TIPC links are re-established, the AMF cluster cannot be
> > > > > > re-established. I think TIPC has now been changed to handle
> > > > > > this better, so the likelihood of this happening will be
> > > > > > dramatically reduced.
> > > > > >
> > > > > > Other triggers for this could be long network latencies, as can
> > > > > > happen when running virtualized without quality of service
> > > > > > guarantees.
> > > > > >
> > > > > > This problem is of course part of the bigger problem of a too
> > > > > > simplistic cluster management in OpenSAF.
> > > > > >
> > > > > > Does amfd try to fence the payload when this happens?
> > > > > >
> > > > > > /Hans
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Nagendra Kumar [mailto:nagendr...@oracle.com]
> > > > > > > Sent: 18 November 2013 07:57
> > > > > > > To: Suryanarayana Garlapati; Hans Feldt; Hans Nordebäck;
> > > > > > > Praveen Malviya; Mathivanan Naickan Palanivelu
> > > > > > > Cc: opensaf-devel@lists.sourceforge.net
> > > > > > > Subject: RE: [devel] [PATCH 1 of 1] amfnd: Reboot payload
> > > > > > > when link between Controller and Payload flickers [#600]
> > > > > > >
> > > > > > > Hi Hans,
> > > > > > > Any response?
> > > > > > >
> > > > > > > -Nagu
> > > > > > > -----Original Message-----
> > > > > > > From: Nagendra Kumar
> > > > > > > Sent: 15 November 2013 14:39
> > > > > > > To: Suryanarayana Garlapati; hans.fe...@ericsson.com;
> > > > > > > hans.nordeb...@ericsson.com; Praveen Malviya; Mathivanan
> > > > > > > Naickan Palanivelu
> > > > > > > Cc: opensaf-devel@lists.sourceforge.net
> > > > > > > Subject: Re: [devel] [PATCH 1 of 1] amfnd: Reboot payload
> > > > > > > when link between Controller and Payload flickers [#600]
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I checked, and it looks like there is no easy way to detect
> > > > > > > link loss at the controller AvD, as it gets AvND down and
> > > > > > > AvND up in the TIPC flicker.
> > > > > > > This is the same even when the payload goes down and rejoins.
> > > > > > > Only the time difference differentiates these two scenarios.
> > > > > > >
> > > > > > > Hans: Can you please check any other possibility at Amfd to
> > > > > > > detect TIPC flicker?
> > > > > > >
> > > > > > > Thanks
> > > > > > > -Nagu
> > > > > > > -----Original Message-----
> > > > > > > From: Suryanarayana Garlapati
> > > > > > > Sent: 22 October 2013 14:41
> > > > > > > To: Nagendra Kumar; hans.fe...@ericsson.com;
> > > > > > > hans.nordeb...@ericsson.com; Praveen Malviya; Mathivanan
> > > > > > > Naickan Palanivelu
> > > > > > > Cc: opensaf-devel@lists.sourceforge.net
> > > > > > > Subject: Re: [PATCH 1 of 1] amfnd: Reboot payload when link
> > > > > > > between Controller and Payload flickers [#600]
> > > > > > >
> > > > > > > Hi Nagu,
> > > > > > > What is the discrimination point that the link flap has
> > > > > > > occurred with only one PLD? The chances of getting a link
> > > > > > > flap with one payload are lower than the chances of a link
> > > > > > > flap with all the payloads. With my suggestion, there will be
> > > > > > > a failover. But with the present patch, if a link flap
> > > > > > > happens with the payload nodes, all the payload nodes will go
> > > > > > > for reboot. So considering everything, I guess we should
> > > > > > > reboot the Active controller only.
> > > > > > >
> > > > > > > Regards
> > > > > > > Surya
> > > > > > >
> > > > > > >
> > > > > > > On Monday 21 October 2013 06:40 PM, Nagendra Kumar wrote:
> > > > > > > > Hi Surya,
> > > > > > > >
> > > > > > > > The problem I see with the approach is: because of a
> > > > > > > > problem in one payload, other payloads are impacted by the
> > > > > > > > Act controller failover.
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > -Nagu
> > > > > > > > -----Original Message-----
> > > > > > > > From: Suryanarayana Garlapati
> > > > > > > > Sent: 21 October 2013 18:22
> > > > > > > > To: Nagendra Kumar; hans.fe...@ericsson.com;
> > > > > > > > hans.nordeb...@ericsson.com; Praveen Malviya; Mathivanan
> > > > > > > > Naickan Palanivelu
> > > > > > > > Cc: opensaf-devel@lists.sourceforge.net
> > > > > > > > Subject: Re: [PATCH 1 of 1] amfnd: Reboot payload when
> > > > > > > > link between Controller and Payload flickers [#600]
> > > > > > > >
> > > > > > > > Hi Nagu,
> > > > > > > > I am not comfortable with this approach.
> > > > > > > > I think it's better to reboot the active controller if the
> > > > > > > > link flaps and not the payload node. If the link flaps
> > > > > > > > between the active controller and payload nodes, then there
> > > > > > > > will be a total payload cluster reset which we can avoid by
> > > > > > > > just rebooting the active controller.
> > > > > > > >
> > > > > > > > Thoughts?
> > > > > > > >
> > > > > > > > Regards
> > > > > > > > Surya
> > > > > > > >
> > > > > > > > On Monday 21 October 2013 05:03 PM, nagendr...@oracle.com wrote:
> > > > > > > >>  osaf/services/saf/amf/amfnd/di.cc             |  13 +++++++++----
> > > > > > > >>  osaf/services/saf/amf/amfnd/include/avnd_cb.h |   1 +
> > > > > > > >>  osaf/services/saf/amf/amfnd/mds.cc            |  11 +++++++++++
> > > > > > > >>  3 files changed, 21 insertions(+), 4 deletions(-)
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> diff --git a/osaf/services/saf/amf/amfnd/di.cc b/osaf/services/saf/amf/amfnd/di.cc
> > > > > > > >> --- a/osaf/services/saf/amf/amfnd/di.cc
> > > > > > > >> +++ b/osaf/services/saf/amf/amfnd/di.cc
> > > > > > > >> @@ -437,13 +437,18 @@ uint32_t avnd_evt_mds_avd_dn_evh(AVND_CB
> > > > > > > >>
> > > > > > > >>      TRACE_ENTER();
> > > > > > > >>
> > > > > > > >> -    LOG_ER("AMF director unexpectedly crashed");
> > > > > > > >> -
> > > > > > > >>      /* Don't issue reboot if it has been already issued.*/
> > > > > > > >>      if (false == cb->reboot_in_progress) {
> > > > > > > >>          cb->reboot_in_progress = true;
> > > > > > > >> -        opensaf_reboot(avnd_cb->node_info.nodeId, (char *)avnd_cb->node_info.executionEnvironment.value,
> > > > > > > >> -                "local AVD down(Adest) or both AVD down(Vdest) received");
> > > > > > > >> +        if(cb->cont_reboot_in_progress == false) {
> > > > > > > >> +            LOG_ER("AMF director unexpectedly crashed");
> > > > > > > >> +            opensaf_reboot(avnd_cb->node_info.nodeId, (char *)avnd_cb->node_info.executionEnvironment.value,
> > > > > > > >> +                    "local AVD down(Adest) or both AVD down(Vdest) received");
> > > > > > > >> +        } else {
> > > > > > > >> +            opensaf_reboot(avnd_cb->node_info.nodeId, (char *)avnd_cb->node_info.executionEnvironment.value,
> > > > > > > >> +                    "Link reset with Act controller");
> > > > > > > >> +        }
> > > > > > > >> +
> > > > > > > >>      }
> > > > > > > >>
> > > > > > > >>      TRACE_LEAVE();
> > > > > > > >> diff --git a/osaf/services/saf/amf/amfnd/include/avnd_cb.h b/osaf/services/saf/amf/amfnd/include/avnd_cb.h
> > > > > > > >> --- a/osaf/services/saf/amf/amfnd/include/avnd_cb.h
> > > > > > > >> +++ b/osaf/services/saf/amf/amfnd/include/avnd_cb.h
> > > > > > > >> @@ -130,6 +130,7 @@ typedef struct avnd_cb_tag {
> > > > > > > >>      SaBoolT first_time_up;
> > > > > > > >>      bool reboot_in_progress;
> > > > > > > >>      AVND_SU *failed_su;
> > > > > > > >> +    bool cont_reboot_in_progress;
> > > > > > > >>  } AVND_CB;
> > > > > > > >>
> > > > > > > >>  #define AVND_CB_NULL ((AVND_CB *)0)
> > > > > > > >> diff --git a/osaf/services/saf/amf/amfnd/mds.cc b/osaf/services/saf/amf/amfnd/mds.cc
> > > > > > > >> --- a/osaf/services/saf/amf/amfnd/mds.cc
> > > > > > > >> +++ b/osaf/services/saf/amf/amfnd/mds.cc
> > > > > > > >> @@ -386,6 +386,7 @@ uint32_t avnd_mds_rcv(AVND_CB *cb, MDS_C
> > > > > > > >>          if ((AVSV_D2N_NODE_UP_MSG == ((AVSV_DND_MSG *)(rcv_info->i_msg))->msg_type) ||
> > > > > > > >>              (AVSV_D2N_DATA_VERIFY_MSG == ((AVSV_DND_MSG *)(rcv_info->i_msg))->msg_type)) {
> > > > > > > >>              cb->active_avd_adest = rcv_info->i_fr_dest;
> > > > > > > >> +            avnd_cb->cont_reboot_in_progress = false;
> > > > > > > >>              TRACE_1("Active AVD Adest = %" PRIu64 ,cb->active_avd_adest);
> > > > > > > >>          }
> > > > > > > >>
> > > > > > > >> @@ -560,6 +561,14 @@ uint32_t avnd_mds_svc_evt(AVND_CB *cb, M
> > > > > > > >>      case NCSMDS_UP:
> > > > > > > >>          switch (evt_info->i_svc_id) {
> > > > > > > >>          case NCSMDS_SVC_ID_AVD:
> > > > > > > >> +
> > > > > > > >> +            if ((m_MDS_DEST_IS_AN_ADEST(evt_info->i_dest) && avnd_cb->cont_reboot_in_progress) &&
> > > > > > > >> +                    (m_NCS_NODE_ID_FROM_MDS_DEST(evt_info->i_dest) == cb->active_avd_adest)) {
> > > > > > > >> +                memset(&cb->avd_dest, 0, sizeof(MDS_DEST));
> > > > > > > >> +                evt = avnd_evt_create(cb, AVND_EVT_MDS_AVD_DN, 0, &evt_info->i_dest, 0, 0, 0);
> > > > > > > >> +                break;
> > > > > > > >> +            }
> > > > > > > >> +
> > > > > > > >>              /* create the mds event */
> > > > > > > >>              evt = avnd_evt_create(cb, AVND_EVT_MDS_AVD_UP, 0, &evt_info->i_dest, 0, 0, 0);
> > > > > > > >>              break;
> > > > > > > >> @@ -606,6 +615,8 @@ uint32_t avnd_mds_svc_evt(AVND_CB *cb, M
> > > > > > > >>                  /* Supervise our node local director */
> > > > > > > >>                  if (evt_info->i_node_id != ncs_get_node_id()) {
> > > > > > > >>                      /* Ignore the other AVD Adest Down.*/
> > > > > > > >> +                    if(m_NCS_NODE_ID_FROM_MDS_DEST(evt_info->i_dest) == cb->active_avd_adest)
> > > > > > > >> +                        avnd_cb->cont_reboot_in_progress = true;
> > > > > > > >>                      return rc;
> > > > > > > >>                  }
> > > > > > > >>              }
_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
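
For readers following the patch quoted at the bottom of this thread, the flag handling it adds can be modelled as a small state machine. The following is only a sketch under stated assumptions, not OpenSAF code: NodeDirectorModel, its members and run_flicker_scenario are hypothetical names, directors are identified by plain integers instead of MDS destinations, and a reboot is recorded as a reason string instead of calling opensaf_reboot. It shows the three steps the patch appears to rely on: remember that the active director on another node went down, treat a following UP from the same director as a link flicker, and clear the flag once a NODE_UP or DATA_VERIFY message arrives.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified model of the flag handling in the quoted patch.
// None of these names exist in OpenSAF; the comments name the field each
// member stands in for.
struct NodeDirectorModel {
  std::uint64_t active_director = 0;   // stands in for cb->active_avd_adest
  bool reboot_in_progress = false;     // stands in for cb->reboot_in_progress
  bool link_lost_to_active = false;    // stands in for cb->cont_reboot_in_progress
  std::vector<std::string> reboot_reasons;  // recorded instead of rebooting

  // NODE_UP or DATA_VERIFY message received from the active director.
  void on_director_message(std::uint64_t from) {
    active_director = from;
    link_lost_to_active = false;
  }

  // MDS DOWN for a director that runs on another node: the event itself is
  // ignored, but the loss of the active director is remembered.
  void on_remote_director_down(std::uint64_t dest) {
    if (dest == active_director) link_lost_to_active = true;
  }

  // MDS UP for a director. Returns true if the event was turned into a
  // "director down" event because the link to the active director flickered.
  bool on_director_up(std::uint64_t dest) {
    if (link_lost_to_active && dest == active_director) {
      on_director_down_event();
      return true;
    }
    return false;
  }

  // Handler for the "director down" event: request at most one reboot and
  // pick the reason from the flag.
  void on_director_down_event() {
    if (reboot_in_progress) return;
    reboot_in_progress = true;
    reboot_reasons.push_back(
        link_lost_to_active
            ? "Link reset with Act controller"
            : "local AVD down(Adest) or both AVD down(Vdest) received");
  }
};

// Runs the flicker scenario: message from director 42, link down, link up.
inline std::vector<std::string> run_flicker_scenario() {
  NodeDirectorModel node;
  node.on_director_message(42);
  node.on_remote_director_down(42);
  assert(node.link_lost_to_active);  // the DOWN was remembered, not acted on
  node.on_director_up(42);
  return node.reboot_reasons;
}
```

Calling run_flicker_scenario() records a single reboot request with the "Link reset with Act controller" reason. Note that the quoted patch compares m_NCS_NODE_ID_FROM_MDS_DEST(evt_info->i_dest) with cb->active_avd_adest, while this sketch compares plain integer ids, so it says nothing about whether that comparison is right in the real code.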