I think the send failure is seen because the destination exited.
Yes, CLM API calls on unconfigured, non-member nodes will
always return SA_AIS_ERR_UNAVAILABLE.

Yes, we need to create a generic solution around OpenSAF startup
for when NID is removed. Retry (at some level) and reboot (if desired)
would be one aspect of that generic solution.

Today,
(a) This patch 3 of 3 only reverts things back to the original behaviour.

(b) Apart from this, if we want to introduce a reboot on OpenSAF startup
failure, then we could make AMFND return an error to the opensafd script,
and the opensafd script could trigger a reboot. In that case, we would
not need this 3 of 3 at all. If not now, we could aim for (b) some time.

- Mathi.



----- [email protected] wrote:

> Fine, let's see that as a possible future change. Remember that
> removing nid has been discussed; then we will have no retry at all
> without my proposed change.
> 
> About the log "NO proc_initialize_msg: send failed", do you think it
> comes from the fact that the payload start is not correct? clmna
> seemed to return to nid even though the node is not a member, and then
> amfnd tried to use CLM? Should it be OK to use the CLM API on a
> non-member node?
> 
> Thanks,
> Hans
> 
> On 3 April 2014 14:12, Mathivanan Naickan Palanivelu
> <[email protected]> wrote:
> > There is no functional need for a retry when it is not a
> > communication problem there.
> > The default number of re-spawns has been 3 all along. Yes, we can
> > increase it to a bigger value.
> > It is up to the system integrator to decide what to do when OpenSAF
> > startup fails.
> >
> > Thanks,
> > Mathi.
> >
> > ----- [email protected] wrote:
> >
> >> On the controller I get an extra "send failed" log message that is
> >> ugly:
> >>
> >> Apr  3 13:40:01 SC-1 local0.notice osafclmd[417]: NO CLM NodeName: 'PL-6' is not a configured cluster node.
> >> Apr  3 13:40:01 SC-1 local0.notice osafclmd[417]: NO /etc/opensaf/node_name should contain the rdn value of configured CLM node object name
> >> Apr  3 13:40:02 SC-1 local0.notice osafclmd[417]: NO proc_initialize_msg: send failed. dest:2060f19cd6007
> >> Apr  3 13:40:17 SC-1 local0.notice osafclmd[417]: NO CLM NodeName: 'PL-6' is not a configured cluster node.
> >> Apr  3 13:40:17 SC-1 local0.notice osafclmd[417]: NO /etc/opensaf/node_name should contain the rdn value of configured CLM node object name
> >> Apr  3 13:40:17 SC-1 local0.notice osafclmd[417]: NO proc_initialize_msg: send failed. dest:2060f19cda006
> >> Apr  3 13:40:32 SC-1 local0.notice osafclmd[417]: NO CLM NodeName: 'PL-6' is not a configured cluster node.
> >> Apr  3 13:40:32 SC-1 local0.notice osafclmd[417]: NO /etc/opensaf/node_name should contain the rdn value of configured CLM node object name
> >> Apr  3 13:40:33 SC-1 local0.notice osafclmd[417]: NO proc_initialize_msg: send failed. dest:2060f19cda009
> >> Apr  3 13:40:34 SC-1 local0.notice osafimmnd[388]: NO Global discard node received for nodeId:2060f pid:359
> >> Apr  3 13:40:39 SC-1 user.warn kernel: tipc: Resetting link <1.1.1:eth0-1.1.6:eth0>, peer not responding
> >>
> >> I would like to have a retry loop inside clmna instead. Then we
> >> could have nid supervise and every now and then do a node reboot.
> >> There is not much point in trying three times and giving up. I think
> >> clmna should retry forever and let nodeinit handle the bigger loop.
> >> The retry interval needs to be configurable but should have a good
> >> default, such as every 30 sec.
> >>
> >> Thanks,
> >> Hans
> >>
> >> On 03/27/2014 02:13 AM, [email protected] wrote:
> >> >   osaf/services/saf/clmsv/nodeagent/main.c |  8 ++++++++
> >> >   1 files changed, 8 insertions(+), 0 deletions(-)
> >> >
> >> >
> >> > When a node join request for an unconfigured/misconfigured node,
> >> > or a node join request with a duplicate node_name, is attempted,
> >> > then clmna should report those errors to NID such that NID
> >> > attempts to respawn clmna.
> >> > With the introduction of this change, the following happens (as
> >> > can be seen in the syslog) in the case of an
> >> > unconfigured/misconfigured node join request:
> >> > At the ACTIVE controller syslog:
> >> > Mar 26 20:23:32 SC-1 local0.notice osafclmd[420]: NO CLM NodeName: 'PL-8' is not a configured cluster node.
> >> > Mar 26 20:23:32 SC-1 local0.notice osafclmd[420]: NO /etc/opensaf/node_name should contain the rdn value of a configured CLM node object name
> >> >
> >> > At the unconfigured/misconfigured node, the syslog will look as below:
> >> > Mar 26 19:03:54 PL-3 local0.notice osafclmna[871]: Started
> >> > Mar 26 19:03:54 PL-3 local0.err osafclmna[871]: ER PL-8 is not a configured node
> >> > Mar 26 19:03:54 PL-3 local0.err opensafd[837]: ER Failed DESC:CLMNA
> >> > Mar 26 19:03:54 PL-3 local0.err opensafd[837]: ER Going for recovery
> >> > Mar 26 19:03:54 PL-3 local0.err opensafd[837]: ER Trying To RESPAWN /usr/local/lib/opensaf/clc-cli/osaf-clmna attempt #1
> >> > Mar 26 19:03:54 PL-3 local0.err opensafd[837]: ER Sending SIGKILL to CLMNA, pid=868
> >> > Mar 26 19:03:54 PL-3 local0.err osafclmna[871]: ER Exiting
> >> > Mar 26 19:03:54 PL-3 local0.notice osafclmna[871]: exiting for shutdown
> >> > Mar 26 19:04:09 PL-3 local0.notice osafclmna[895]: Started
> >> > Mar 26 19:04:09 PL-3 local0.err osafclmna[895]: ER PL-8 is not a configured node
> >> > Mar 26 19:04:09 PL-3 local0.err osafclmna[895]: ER Exiting
> >> > Mar 26 19:04:09 PL-3 local0.err opensafd[837]: ER Could Not RESPAWN CLMNA
> >> > Mar 26 19:04:09 PL-3 local0.err opensafd[837]: ER Failed DESC:CLMNA
> >> > Mar 26 19:04:09 PL-3 local0.err opensafd[837]: ER Trying To RESPAWN /usr/local/lib/opensaf/clc-cli/osaf-clmna attempt #2
> >> > Mar 26 19:04:09 PL-3 local0.err opensafd[837]: ER Sending SIGKILL to CLMNA, pid=892
> >> > Mar 26 19:04:09 PL-3 local0.notice osafclmna[895]: exiting for shutdown
> >> > Mar 26 19:04:24 PL-3 local0.notice osafclmna[919]: Started
> >> > Mar 26 19:04:25 PL-3 local0.err osafclmna[919]: ER PL-8 is not a configured node
> >> > Mar 26 19:04:25 PL-3 local0.err osafclmna[919]: ER Exiting
> >> > Mar 26 19:04:25 PL-3 local0.err opensafd[837]: ER Could Not RESPAWN CLMNA
> >> > Mar 26 19:04:25 PL-3 local0.err opensafd[837]: ER Failed DESC:CLMNA
> >> > Mar 26 19:04:25 PL-3 local0.err opensafd[837]: ER FAILED TO RESPAWN
> >> > Mar 26 19:04:27 PL-3 local0.notice osafclmna[919]: exiting for shutdown
> >> > Mar 26 19:04:28 PL-3 local0.notice osafimmnd[864]: exiting for shutdown
> >> >
> >> >
> >> > For cases when a duplicate node join request comes, the following
> >> > syslog message will be seen at the ACTIVE controller:
> >> > Mar 26 19:07:43 SC-1 local0.err osafclmd[418]: ER Duplicate node join request for CLM node: 'SC-2'. Specify a unique node name in/etc/opensaf/node_name
> >> > Mar 26 19:07:59 SC-1 local0.err osafclmd[418]: ER Duplicate node join request for CLM node: 'SC-2'. Specify a unique node name in/etc/opensaf/node_name
> >> >
> >> > And the following will be seen at the node on which the duplicate
> >> > request is attempted:
> >> > Mar 26 19:07:43 PL-3 local0.err osafclmna[1459]: ER SC-2 is already up. Specify a unique name in/etc/opensaf/node_name
> >> > Mar 26 19:07:43 PL-3 local0.err opensafd[1425]: ER Failed DESC:CLMNA
> >> > Mar 26 19:07:43 PL-3 local0.err opensafd[1425]: ER Going for recovery
> >> > Mar 26 19:07:43 PL-3 local0.err opensafd[1425]: ER Trying To RESPAWN /usr/local/lib/opensaf/clc-cli/osaf-clmna attempt #1
> >> > Mar 26 19:07:43 PL-3 local0.err opensafd[1425]: ER Sending SIGKILL to CLMNA, pid=1456
> >> > Mar 26 19:07:43 PL-3 local0.err osafclmna[1459]: ER Exiting
> >> > Mar 26 19:07:44 PL-3 local0.notice osafclmna[1459]: exiting for shutdown
> >> > Mar 26 19:07:59 PL-3 local0.notice osafclmna[1483]: Started
> >> > Mar 26 19:07:59 PL-3 local0.err osafclmna[1483]: ER SC-2 is already up. Specify a unique name in/etc/opensaf/node_name
> >> > Mar 26 19:07:59 PL-3 local0.err osafclmna[1483]: ER Exiting
> >> > Mar 26 19:07:59 PL-3 local0.err opensafd[1425]: ER Could Not RESPAWN CLMNA
> >> > Mar 26 19:07:59 PL-3 local0.err opensafd[1425]: ER Failed DESC:CLMNA
> >> > Mar 26 19:07:59 PL-3 local0.err opensafd[1425]: ER Trying To RESPAWN /usr/local/lib/opensaf/clc-cli/osaf-clmna attempt #2
> >> > Mar 26 19:07:59 PL-3 local0.err opensafd[1425]: ER Sending SIGKILL to CLMNA, pid=1480
> >> > Mar 26 19:07:59 PL-3 local0.notice osafclmna[1483]: exiting for shutdown
> >> >
> >> > diff --git a/osaf/services/saf/clmsv/nodeagent/main.c b/osaf/services/saf/clmsv/nodeagent/main.c
> >> > --- a/osaf/services/saf/clmsv/nodeagent/main.c
> >> > +++ b/osaf/services/saf/clmsv/nodeagent/main.c
> >> > @@ -458,11 +458,19 @@ SaAisErrorT clmna_process_dummyup_msg(vo
> >> >                     LOG_ER("%s is not a configured node",
> >> >                            o_msg->info.api_resp_info.param.node_name.value);
> >> >                     free(o_msg);
> >> > +                   rc = error; /* For now, just pass on the error to nid.
> >> > +                                * This is not needed in future when node local
> >> > +                                * cluster management policy based decisions can be made.
> >> > +                                */
> >> >                     goto done;
> >> >             } else if (error == SA_AIS_ERR_EXIST) {
> >> >                     LOG_ER("%s is already up. Specify a unique name in" PKGSYSCONFDIR "/node_name",
> >> >                            o_msg->info.api_resp_info.param.node_name.value);
> >> >                     free(o_msg);
> >> > +                   rc = error; /* This is not needed in future when node local
> >> > +                                * cluster management policy based decisions can be made.
> >> > +                                * For now, just pass on the error to nid.
> >> > +                                */
> >> >                     goto done;
> >> >             }
> >> >
> >> >
> >> >

------------------------------------------------------------------------------
_______________________________________________
Opensaf-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-devel
