Re: [devel] [PATCH 1 of 1] amfnd: Reboot payload when link between Controller and Payload flickers [#600]

Mathivanan Naickan Palanivelu Mon, 18 Nov 2013 22:47:36 -0800

> -----Original Message-----
> From: Hans Feldt [mailto:hans.fe...@ericsson.com]
> Sent: Monday, November 18, 2013 9:47 PM
> To: Mathivanan Naickan Palanivelu; Nagendra Kumar
> Cc: opensaf-devel@lists.sourceforge.net
> Subject: RE: [devel] [PATCH 1 of 1] amfnd: Reboot payload when link
> between Controller and Payload flickers [#600]
> 
> 
> 
> > -----Original Message-----
> > From: Mathivanan Naickan Palanivelu [mailto:mathi.naic...@oracle.com]
> > Sent: den 18 november 2013 14:05
> > To: Hans Feldt; Nagendra Kumar
> > Cc: opensaf-devel@lists.sourceforge.net
> > Subject: RE: [devel] [PATCH 1 of 1] amfnd: Reboot payload when link
> > between Controller and Payload flickers [#600]
> >
> > Comments inline:
> >
> > > -----Original Message-----
> > > From: Hans Feldt [mailto:hans.fe...@ericsson.com]
> > > Sent: Monday, November 18, 2013 5:53 PM
> > > To: Mathivanan Naickan Palanivelu; Nagendra Kumar
> > > Cc: opensaf-devel@lists.sourceforge.net
> > > Subject: RE: [devel] [PATCH 1 of 1] amfnd: Reboot payload when link
> > > between Controller and Payload flickers [#600]
> > >
> > >
> > > > -----Original Message-----
> > > > From: Mathivanan Naickan Palanivelu
> > > > [mailto:mathi.naic...@oracle.com]
> > > > Sent: den 18 november 2013 11:07
> > > > To: Hans Feldt; Nagendra Kumar
> > > > Cc: opensaf-devel@lists.sourceforge.net
> > > > Subject: RE: [devel] [PATCH 1 of 1] amfnd: Reboot payload when
> > > > link between Controller and Payload flickers [#600]
> > > >
> > > > > From an OpenSAF architecture perspective, FM is designated to
> > > > > orchestrate controller failover.
> > >
> > > > Yes, _only_ controller failover, and it will try to do fencing by
> > > > calling the opensaf_reboot script as seen in syslog.
> > >
> > > > Comments inline:
> > > >
> > > > > -----Original Message-----
> > > > > From: Hans Feldt [mailto:hans.fe...@ericsson.com]
> > > > > Sent: Monday, November 18, 2013 2:42 PM
> > > > > To: Nagendra Kumar; Mathivanan Naickan Palanivelu
> > > > > Cc: opensaf-devel@lists.sourceforge.net
> > > > > Subject: RE: [devel] [PATCH 1 of 1] amfnd: Reboot payload when
> > > > > link between Controller and Payload flickers [#600]
> > > > >
> > > > > Just to sort out node fencing and recap the architecture we have
> > > > > I checked old and new code.
> > > > >
> > > > > In 3.x, avd upon detecting heartbeat loss, would inform avm
> > > > > about this. avm would inform fm. fm would I guess issue HPI
> > > > > reboot of the node. When the reboot response is received, only
> > > > > > then would avd failover services. In 3.x mds down for avnd does not
> > > > > > result in any action.
> > > > >
> > > > [Mathi]
> > > > In 3.x and 4.x, FM is the one that orchestrates failover.
> > > > > In 3.x, 4.x AMF indeed reboots the payload node upon losing
> > > > > connection with the controller.
> > > > > If that's not happening in 4.x, then that means this code path got
> > > > > broken and that has to be fixed.
> > >
> > > I think it has. I am sure it worked in 3.x. Now there seems to be no
> > > node fencing done in the case of MDS NODE DOWN for a payload node.
> > >
> > > CLM is not doing it (as far as I can tell) and AMF is not doing it.
> > > According to the spec CLM is supposed to do it:
> > >
> > > " The Cluster Membership Service will initiate a cluster node reboot
> > > as a fencing and repair action if the value of the
> > > saClmNodeDisableReboot attribute of the node (defined in the
> > > SaClmNode configuration object class, see Section 4.4) is SA_FALSE. "
> > >
> > > To me it sounds like CLM should do fencing and AMF should use CLM.
> > > Controller node handling is a special case though.
> > >
> > > Mathi what do you think, how should we change the architecture?
> > >
> > [Mathi]
> > Well, there are problems if we rely on CLM alone to do the payload fencing.
> > Suppose the user wants saClmNodeDisableReboot to be ON because the node is
> > running a database and/or has special constraints, so the user does not
> > want the node to be rebooted but rather only wants to "fence only the
> > shared resources" (no poweroff/reboot) and then mark the node as a
> > non-member!
> 
> I am not sure I understand...
> 
> Are you talking about a CLM node that is not an AMF node?
>
[Mathi] 
Both scenarios.
 
> What would saClmNodeDisableReboot=TRUE mean from an AMF
> perspective?
> 
[Mathi] 
Suppose AMF does not fence and fencing is left to be handled by CLM, and
this attribute is TRUE. The node then continues to be UP, which can
potentially disrupt a distributed application that runs on two payloads
that just got cut off from the controller but can still reach each other and
are accessing a shared resource!


Imagine that a fencing action (*fencing != reboot always*) is controlled by
fault policies.
In terms of policy hierarchy, a fencing action done by CLM could be seen as a
'superset' policy and a fencing action done by AMF could be seen as a 'subset'
policy. There could also be a 'node-global' or a 'global' policy that would
override a CLM/AMF policy (like avoiding reboot in some scenarios, or
rebooting forcefully, overriding some flags, etc.).
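
To make the policy hierarchy concrete, here is a minimal sketch. Everything
in it is hypothetical and not existing OpenSAF code: the FenceAction and
FencePolicies types, the clm_default() and resolve() helpers, and in
particular the resolution order (global override first, then AMF, then CLM),
which is only one possible reading of 'superset' and 'subset'. It also
assumes that saClmNodeDisableReboot=TRUE means CLM leaves the node up.

```cpp
#include <cassert>

// Hypothetical fencing actions; note that fencing is not always a reboot.
enum class FenceAction { NoOpinion, StayUp, FenceSharedResources, Reboot };

// Hypothetical policy layers; each layer may leave the decision open.
struct FencePolicies {
    FenceAction global_override = FenceAction::NoOpinion;  // 'global' / 'node-global'
    FenceAction amf = FenceAction::NoOpinion;              // 'subset' policy
    FenceAction clm = FenceAction::NoOpinion;              // 'superset' policy
};

// CLM default derived from saClmNodeDisableReboot (assumption: TRUE means
// "do not reboot, leave the node up"; FALSE means "reboot", as in the spec
// text quoted in this thread).
inline FenceAction clm_default(bool sa_clm_node_disable_reboot) {
    return sa_clm_node_disable_reboot ? FenceAction::StayUp : FenceAction::Reboot;
}

// Assumed resolution order: global override, then AMF, then CLM.
inline FenceAction resolve(const FencePolicies& p) {
    if (p.global_override != FenceAction::NoOpinion) return p.global_override;
    if (p.amf != FenceAction::NoOpinion) return p.amf;
    assert(p.clm != FenceAction::NoOpinion && "CLM must always supply a default");
    return p.clm;
}
```

With only clm_default(true) set, resolve() yields StayUp, which is the
disruptive scenario described above. Adding an AMF policy of
FenceSharedResources fences the node without rebooting it, and a global
override can still force a reboot.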

> > In an implementation of the 'membership' service, this line from the
> > spec has to be taken with a pinch of salt (read: in conjunction with a
> > cluster 'manager'),
> > i.e. a generic cluster management solution enables a deployment to have
> > "fault handling" that allows the user to create custom policies that
> > define "various criteria" to be considered when a node is fenced.
> >
> > In the absence of a good cluster manager that can feed evidence to
> > CLM (evidence that can also override saClmNodeDisableReboot when
> > the membership algorithm is run), I think we could continue to handle it in
> > AMF (revert to the old behavior), unless it is a complex fix in AMF!
> 
> Continue? Quite a few years ago it was a part of avsv which was a combined
> CLM/AMF service...
> 
[Mathi] 
If 'continue' is not comfortable, then 'reverting back' should at least be
comfortable!  :-)
Anyway, my point is that we should aim to take a formal approach to changing
the architecture in 4.5; till then I think we should continue to do the
reboot from within AMF in the scenario where the payload loses its
connection with the controller.
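
For reference, the reboot-from-AMF behaviour in the patch under discussion
boils down to a small flag state machine. The sketch below is a simplified
stand-in and not the amfnd code: NodeDirectorState, Outcome and the on_*
functions are hypothetical names, only the three fields touched by the patch
are modelled, and destinations are compared directly as 64-bit values (the
patch itself goes through m_NCS_NODE_ID_FROM_MDS_DEST).

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for AVND_CB; only the fields the patch touches.
struct NodeDirectorState {
    uint64_t active_avd_adest = 0;         // 0 means "not known yet" (assumption)
    bool reboot_in_progress = false;
    bool cont_reboot_in_progress = false;  // flag added by the patch
};

enum class Outcome { Ignore, CreateAvdUp, RebootAvdCrashed, RebootLinkReset };

// NODE_UP / DATA_VERIFY message from the active director: remember its
// address and clear the link-flicker flag.
inline void on_node_up_msg(NodeDirectorState& s, uint64_t from) {
    assert(from != 0);
    s.active_avd_adest = from;
    s.cont_reboot_in_progress = false;
}

// MDS DOWN for a director on another node: no reboot yet, just remember
// that the link to the active controller went away.
inline void on_remote_avd_down(NodeDirectorState& s, uint64_t dest) {
    if (dest == s.active_avd_adest) s.cont_reboot_in_progress = true;
}

// Handler for the "AVD down" event: reboot at most once, with a reason
// that depends on whether a link flicker was seen.
inline Outcome on_avd_down_event(NodeDirectorState& s) {
    if (s.reboot_in_progress) return Outcome::Ignore;
    s.reboot_in_progress = true;
    return s.cont_reboot_in_progress ? Outcome::RebootLinkReset
                                     : Outcome::RebootAvdCrashed;
}

// MDS UP for a director: if the same active director comes back after a
// DOWN, treat it as a link flicker and route it to the down handler.
inline Outcome on_avd_up(NodeDirectorState& s, uint64_t dest) {
    if (s.cont_reboot_in_progress && dest == s.active_avd_adest)
        return on_avd_down_event(s);
    return Outcome::CreateAvdUp;
}
```

In this model a DOWN followed by an UP from the same active director ends in
RebootLinkReset (the "Link reset with Act controller" case), while an UP
without a preceding DOWN is handled as a normal AVD UP.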

Cheers,
Mathi.


> Thanks,
> Hans
> 
> >
> > In the long run (4.5), we should move this and also the FM functionality
> > into CLM.
> >
> > > In 3.x avd also directly fenced nodes for various other reasons
> > > such as no success returns in responses. But not for heartbeat loss,
> > > that seemed to take a different route. This is also removed.
> > >
> > [Mathi]
> > Yeah, I think this thing has popped up multiple times now, i.e. "no status
> > returns"!! Hmm...
> >
> > Thanks,
> > Mathi.
> >
> > > Thanks,
> > > Hans
> > >
> > > >
> > > > > > In 4.x, amf does not have internal heartbeat. Amf relies on MDS
> > > > > > down events (NOT CLM). Amf will not do any node level fencing.
> > > > > > If CLM does fencing it would be UNCOORDINATED with AMF. Ticket
> > > > > > #73 suggests AMF should use CLM events to trigger failover.
> > > > > >
> > > > > > I am starting to think #73 needs to be fixed asap. At least for
> > > > > > payload nodes.
> > > > > > Not sure AMF can use CLM to failover the controller role.
> > > > >
> > > > > Mathi: I don't see CLM doing node fencing in case of NODE DOWN?
> > > > >
> > > > [Mathi]
> > > > > Well, in the flows of controller-failover and
> > > > > payload-disconnecting-from-controller,
> > > > > multiple (AMF, FM) modules are taking fencing decisions. *That is
> > > > > the current OpenSAF architecture*.
> > > > > And that is precisely why the saclmnodedisablereboot flag is not
> > > > > supported in these flows that are handled by the AMF and helper
> > > > > services.
> > > > > In the context of CLM it really cannot be a black-or-white approach
> > > > > to fencing, i.e. an uncoordinated reboot would have its own problems
> > > > > when applications (when controlled by AMF) have dependencies and/or
> > > > > preferences for how they (critical applications) want the node to be
> > > > > evicted (gracefully and/or immediate reboot).
> > > > >
> > > > > And yes, OpenSAF today does rely on MDS/TIPC for providing
> > > > > cluster/connection management evidence.
> > > >
> > > >
> > > > Thanks,
> > > > Mathi.
> > > >
> > > > > /Hans
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Nagendra Kumar [mailto:nagendr...@oracle.com]
> > > > > > Sent: den 18 november 2013 09:34
> > > > > > To: Hans Feldt; Suryanarayana Garlapati; Hans Nordebäck;
> > > > > > Praveen Malviya; Mathivanan Naickan Palanivelu
> > > > > > Cc: opensaf-devel@lists.sourceforge.net
> > > > > > Subject: RE: [devel] [PATCH 1 of 1] amfnd: Reboot payload when
> > > > > > link between Controller and Payload flickers [#600]
> > > > > >
> > > > > > >> Does amfd try to fence the payload when this happens?
> > > > > > > Amfd resets the Amfnd information, but Amfnd lives on like an
> > > > > > > orphan as Amfd will not entertain any requests from Amfnd.
> > > > > >
> > > > > > Thanks
> > > > > > -Nagu
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: Hans Feldt [mailto:hans.fe...@ericsson.com]
> > > > > > Sent: 18 November 2013 12:56
> > > > > > To: Nagendra Kumar; Suryanarayana Garlapati; Hans Nordebäck;
> > > > > > Praveen Malviya; Mathivanan Naickan Palanivelu
> > > > > > Cc: opensaf-devel@lists.sourceforge.net
> > > > > > Subject: RE: [devel] [PATCH 1 of 1] amfnd: Reboot payload when
> > > > > > link between Controller and Payload flickers [#600]
> > > > > >
> > > > > > > As I said, this problem originates from a low memory condition
> > > > > > > on the payload that results in TIPC on that payload resetting
> > > > > > > the link(s).
> > > > > > > Once TIPC links are re-established, the AMF cluster cannot be
> > > > > > > re-established. I think TIPC has now been changed to handle
> > > > > > > this better so the likelihood of this happening will be
> > > > > > > dramatically reduced.
> > > > > > >
> > > > > > > Other triggers for this could be long network latencies as can
> > > > > > > happen when running virtualized without quality of service
> > > > > > > guarantees.
> > > > > > >
> > > > > > > This problem is of course part of the bigger problem of a too
> > > > > > > simplistic cluster management in OpenSAF.
> > > > > >
> > > > > > Does amfd try to fence the payload when this happens?
> > > > > >
> > > > > > /Hans
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Nagendra Kumar [mailto:nagendr...@oracle.com]
> > > > > > > Sent: den 18 november 2013 07:57
> > > > > > > To: Suryanarayana Garlapati; Hans Feldt; Hans Nordebäck;
> > > > > > > Praveen Malviya; Mathivanan Naickan Palanivelu
> > > > > > > Cc: opensaf-devel@lists.sourceforge.net
> > > > > > > Subject: RE: [devel] [PATCH 1 of 1] amfnd: Reboot payload
> > > > > > > when link between Controller and Payload flickers [#600]
> > > > > > >
> > > > > > > Hi Hans,
> > > > > > >   Any response ?
> > > > > > >
> > > > > > > -Nagu
> > > > > > > -----Original Message-----
> > > > > > > From: Nagendra Kumar
> > > > > > > Sent: 15 November 2013 14:39
> > > > > > > To: Suryanarayana Garlapati; hans.fe...@ericsson.com;
> > > > > > > > hans.nordeb...@ericsson.com; Praveen Malviya; Mathivanan
> > > > > > > > Naickan Palanivelu
> > > > > > > Cc: opensaf-devel@lists.sourceforge.net
> > > > > > > Subject: Re: [devel] [PATCH 1 of 1] amfnd: Reboot payload
> > > > > > > when link between Controller and Payload flickers [#600]
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > > I checked and it looks like there is no easy way to detect
> > > > > > > > link loss at the controller AvD, as it gets AvND down and
> > > > > > > > AvND up during the tipc flicker.
> > > > > > > > This is the same even when the payload goes down and rejoins.
> > > > > > > > Only the time difference is the differentiator between these
> > > > > > > > two scenarios.
> > > > > > > >
> > > > > > > > Hans: Can you please check any other possibility at Amfd to
> > > > > > > > detect tipc flicker?
> > > > > > >
> > > > > > > Thanks
> > > > > > > -Nagu
> > > > > > > -----Original Message-----
> > > > > > > From: Suryanarayana Garlapati
> > > > > > > Sent: 22 October 2013 14:41
> > > > > > > To: Nagendra Kumar; hans.fe...@ericsson.com;
> > > > > > > > hans.nordeb...@ericsson.com; Praveen Malviya; Mathivanan
> > > > > > > > Naickan Palanivelu
> > > > > > > Cc: opensaf-devel@lists.sourceforge.net
> > > > > > > Subject: Re: [PATCH 1 of 1] amfnd: Reboot payload when link
> > > > > > > between Controller and Payload flickers [#600]
> > > > > > >
> > > > > > > Hi Nagu,
> > > > > > > > What is the discrimination point that the link flap has
> > > > > > > > occurred with only one PLD? The chances of getting a link flap
> > > > > > > > with one payload are low compared with a link flap with
> > > > > > > > all the payloads. With my suggestion, there will be a
> > > > > > > > failover, but with the present patch, if a link flap happens
> > > > > > > > with the payload nodes, all the payload nodes
> > > > > > > > will go for reboot. So considering in total, I guess we should
> > > > > > > > reboot the Active controller only.
> > > > > > >
> > > > > > > Regards
> > > > > > > Surya
> > > > > > >
> > > > > > >
> > > > > > > On Monday 21 October 2013 06:40 PM, Nagendra Kumar wrote:
> > > > > > > > Hi Surya,
> > > > > > > >
> > > > > > > > > The problem I see with the approach is: because of a
> > > > > > > > > problem in one payload, the other payloads are impacted by
> > > > > > > > > the Act controller failover.
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > -Nagu
> > > > > > > > -----Original Message-----
> > > > > > > > From: Suryanarayana Garlapati
> > > > > > > > Sent: 21 October 2013 18:22
> > > > > > > > To: Nagendra Kumar; hans.fe...@ericsson.com;
> > > > > > > > hans.nordeb...@ericsson.com; Praveen Malviya; Mathivanan
> > > > > > > > Naickan Palanivelu
> > > > > > > > Cc: opensaf-devel@lists.sourceforge.net
> > > > > > > > Subject: Re: [PATCH 1 of 1] amfnd: Reboot payload when
> > > > > > > > link between Controller and Payload flickers [#600]
> > > > > > > >
> > > > > > > > Hi Nagu,
> > > > > > > > I am not comfortable with this approach.
> > > > > > > > > I think it's better to reboot the active controller if the
> > > > > > > > > link flaps and not the payload node. If the link flaps between
> > > > > > > > > the active controller and payload nodes, then there will be a
> > > > > > > > > total payload cluster reset, which we can avoid by just
> > > > > > > > > rebooting the active controller.
> > > > > > > >
> > > > > > > > Thoughts?
> > > > > > > >
> > > > > > > > Regards
> > > > > > > > Surya
> > > > > > > >
> > > > > > > > > On Monday 21 October 2013 05:03 PM, nagendr...@oracle.com wrote:
> > > > > > > > >>    osaf/services/saf/amf/amfnd/di.cc             |  13 +++++++++----
> > > > > > > > >>    osaf/services/saf/amf/amfnd/include/avnd_cb.h |   1 +
> > > > > > > > >>    osaf/services/saf/amf/amfnd/mds.cc            |  11 +++++++++++
> > > > > > > > >>    3 files changed, 21 insertions(+), 4 deletions(-)
> > > > > > > >>
> > > > > > > >>
> > > > > > > > >> diff --git a/osaf/services/saf/amf/amfnd/di.cc b/osaf/services/saf/amf/amfnd/di.cc
> > > > > > > > >> --- a/osaf/services/saf/amf/amfnd/di.cc
> > > > > > > > >> +++ b/osaf/services/saf/amf/amfnd/di.cc
> > > > > > > > >> @@ -437,13 +437,18 @@ uint32_t avnd_evt_mds_avd_dn_evh(AVND_CB
> > > > > > > > >>
> > > > > > > > >>        TRACE_ENTER();
> > > > > > > > >>
> > > > > > > > >> -      LOG_ER("AMF director unexpectedly crashed");
> > > > > > > > >> -
> > > > > > > > >>        /* Don't issue reboot if it has been already issued.*/
> > > > > > > > >>        if (false == cb->reboot_in_progress) {
> > > > > > > > >>                cb->reboot_in_progress = true;
> > > > > > > > >> -              opensaf_reboot(avnd_cb->node_info.nodeId, (char *)avnd_cb->node_info.executionEnvironment.value,
> > > > > > > > >> -                              "local AVD down(Adest) or both AVD down(Vdest) received");
> > > > > > > > >> +              if(cb->cont_reboot_in_progress == false) {
> > > > > > > > >> +                      LOG_ER("AMF director unexpectedly crashed");
> > > > > > > > >> +                      opensaf_reboot(avnd_cb->node_info.nodeId, (char *)avnd_cb->node_info.executionEnvironment.value,
> > > > > > > > >> +                                      "local AVD down(Adest) or both AVD down(Vdest) received");
> > > > > > > > >> +              } else {
> > > > > > > > >> +                      opensaf_reboot(avnd_cb->node_info.nodeId, (char *)avnd_cb->node_info.executionEnvironment.value,
> > > > > > > > >> +                                      "Link reset with Act controller");
> > > > > > > > >> +              }
> > > > > > > > >> +
> > > > > > > > >>        }
> > > > > > > > >>
> > > > > > > > >>        TRACE_LEAVE();
> > > > > > > > >> diff --git a/osaf/services/saf/amf/amfnd/include/avnd_cb.h b/osaf/services/saf/amf/amfnd/include/avnd_cb.h
> > > > > > > > >> --- a/osaf/services/saf/amf/amfnd/include/avnd_cb.h
> > > > > > > > >> +++ b/osaf/services/saf/amf/amfnd/include/avnd_cb.h
> > > > > > > > >> @@ -130,6 +130,7 @@ typedef struct avnd_cb_tag {
> > > > > > > > >>        SaBoolT first_time_up;
> > > > > > > > >>        bool reboot_in_progress;
> > > > > > > > >>        AVND_SU *failed_su;
> > > > > > > > >> +      bool cont_reboot_in_progress;
> > > > > > > > >>    } AVND_CB;
> > > > > > > >>
> > > > > > > > >>    #define AVND_CB_NULL ((AVND_CB *)0)
> > > > > > > > >> diff --git a/osaf/services/saf/amf/amfnd/mds.cc b/osaf/services/saf/amf/amfnd/mds.cc
> > > > > > > > >> --- a/osaf/services/saf/amf/amfnd/mds.cc
> > > > > > > > >> +++ b/osaf/services/saf/amf/amfnd/mds.cc
> > > > > > > > >> @@ -386,6 +386,7 @@ uint32_t avnd_mds_rcv(AVND_CB *cb, MDS_C
> > > > > > > > >>                if ((AVSV_D2N_NODE_UP_MSG == ((AVSV_DND_MSG *)(rcv_info->i_msg))->msg_type) ||
> > > > > > > > >>                    (AVSV_D2N_DATA_VERIFY_MSG == ((AVSV_DND_MSG *)(rcv_info->i_msg))->msg_type)) {
> > > > > > > > >>                        cb->active_avd_adest = rcv_info->i_fr_dest;
> > > > > > > > >> +                      avnd_cb->cont_reboot_in_progress = false;
> > > > > > > > >>                        TRACE_1("Active AVD Adest = %" PRIu64 ,cb->active_avd_adest);
> > > > > > > > >>                }
> > > > > > > > >>
> > > > > > > >>
> > > > > > > >> @@ -560,6 +561,14 @@ uint32_t
> avnd_mds_svc_evt(AVND_CB
> > > *cb,
> > > > > M
> > > > > > > >>        case NCSMDS_UP:
> > > > > > > >>                switch (evt_info->i_svc_id) {
> > > > > > > >>                case NCSMDS_SVC_ID_AVD:
> > > > > > > >> +
> > > > > > > >> +                      if ((m_MDS_DEST_IS_AN_ADEST(evt_info-
> > > > > >i_dest) && avnd_cb->cont_reboot_in_progress) &&
> > > > > > > >> +
> > > > >       (m_NCS_NODE_ID_FROM_MDS_DEST(evt_info->i_dest) == cb-
> > > > > >active_avd_adest)) {
> > > > > > > >> +                              memset(&cb->avd_dest, 0,
> > > > > sizeof(MDS_DEST));
> > > > > > > >> +                              evt = avnd_evt_create(cb,
> > > > > AVND_EVT_MDS_AVD_DN, 0, &evt_info->i_dest, 0, 0, 0);
> > > > > > > >> +                              break;
> > > > > > > >> +                      }
> > > > > > > >> +
> > > > > > > >>                        /* create the mds event */
> > > > > > > >>                        evt = avnd_evt_create(cb,
> > > > > AVND_EVT_MDS_AVD_UP, 0, &evt_info->i_dest, 0, 0, 0);
> > > > > > > >>                        break;
> > > > > > > >> @@ -606,6 +615,8 @@ uint32_t avnd_mds_svc_evt(AVND_CB
> > > *cb, M
> > > > > > > >>                                /* Supervise our node local 
> > > > > > > >> director
> > > > > */
> > > > > > > >>                                if (evt_info->i_node_id !=
> > > > > ncs_get_node_id()) {
> > > > > > > >>                                        /* Ignore the other AVD
> > > > > Adest Down.*/
> > > > > > > >> +
> > > > >       if(m_NCS_NODE_ID_FROM_MDS_DEST(evt_info->i_dest) == cb-
> > > > > >active_avd_adest)
> > > > > > > >> +                                              avnd_cb-
> > > > > >cont_reboot_in_progress = true;
> > > > > > > >>                                        return rc;
> > > > > > > >>                                }
> > > > > > > >>                        }
> > > > > > >
> > > > > > >
> > > > > > > ------------------------------------------------------------
> > > > > > > ----
> > > > > > > ----
> > > > > > > --
> > > > > > > -------- DreamFactory - Open Source REST & JSON Services for
> > > > > > > HTML5 & Native Apps OAuth, Users, Roles, SQL, NoSQL, BLOB
> > > > > > > Storage and External API Access Free app hosting. Or install
> > > > > > > the open source package
> > > > > on any LAMP server.
> > > > > > > Sign up and see examples for AngularJS, jQuery, Sencha Touch
> > > > > > > and
> > > > > Native!
> > > > > > >
> > > > >
> > >
> http://pubads.g.doubleclick.net/gampad/clk?id=63469471&iu=/4140/ostg
> > > > > > > .c lktrk
> _______________________________________________
> > > > > > > Opensaf-devel mailing list
> > > > > > > Opensaf-devel@lists.sourceforge.net
> > > > > > > https://lists.sourceforge.net/lists/listinfo/opensaf-devel

_______________________________________________
Opensaf-devel mailing list
Opensaf-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
