I think the send failure is seen because the destination exited. Yeah, calls to the CLM API on unconfigured, non-member nodes will always return ERR_UNAVAILABLE.
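
To illustrate why a retry does not help in this case, here is a rough sketch in C. It is not OpenSAF code: the `sketch_*` names and enum values are hypothetical stand-ins for the `SaAisErrorT` codes mentioned in this thread, and treating TRY_AGAIN and TIMEOUT as the transient (communication-type) errors is an assumption. The idea is that only transient errors are worth retrying, while an error such as ERR_UNAVAILABLE on a non-member node would come back the same on every attempt and is better reported to the caller.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the SaAisErrorT codes discussed in this thread.
 * The values are arbitrary; the real definitions live in the SAF AIS headers. */
typedef enum {
    SKETCH_OK,
    SKETCH_ERR_TIMEOUT,
    SKETCH_ERR_TRY_AGAIN,
    SKETCH_ERR_EXIST,
    SKETCH_ERR_UNAVAILABLE
} sketch_err_t;

/* What a startup step could do with the result of a CLM call. */
typedef enum {
    SKETCH_CONTINUE,     /* call succeeded, carry on with startup */
    SKETCH_RETRY,        /* transient problem, try the call again */
    SKETCH_FAIL_STARTUP  /* report the failure to whoever supervises startup */
} sketch_action_t;

/* Assumption: only timeouts and TRY_AGAIN are transient. */
static bool sketch_is_retryable(sketch_err_t err)
{
    return err == SKETCH_ERR_TIMEOUT || err == SKETCH_ERR_TRY_AGAIN;
}

/* Decide the next step from the error and the remaining retry budget. */
static sketch_action_t sketch_decide(sketch_err_t err, unsigned attempts_left)
{
    if (err == SKETCH_OK)
        return SKETCH_CONTINUE;
    if (sketch_is_retryable(err) && attempts_left > 0)
        return SKETCH_RETRY;
    /* Configuration errors (unconfigured node, duplicate name) and an
     * exhausted retry budget both end up here. */
    return SKETCH_FAIL_STARTUP;
}
```

With this shape, `sketch_decide(SKETCH_ERR_UNAVAILABLE, 3)` yields `SKETCH_FAIL_STARTUP` regardless of the remaining budget, leaving it to the supervising script or the system integrator to decide what happens next.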

Yes, we need a generic solution for OpenSAF startup once NID is removed.
Retry (at some level) and, if desired, reboot would be one aspect of that
generic solution.

Today:
(a) This patch (3 of 3) only reverts things back to the original behaviour.
(b) Apart from this, if we want to introduce a reboot on OpenSAF startup
    failure, we could make AMFND return an error to the opensafd script,
    and the opensafd script could trigger a reboot. In that case, we don't
    need patch 3 of 3 at all.

If not now, we could aim for (b) some time.

- Mathi.

----- [email protected] wrote:

> Fine, let's see that as a possible future change. Remember, removing nid
> has been discussed; then we will have no retry at all without my
> proposed change.
>
> About the log "NO proc_initialize_msg: send failed", do you think it
> comes from the fact that the payload start is not correct? clmna
> seemed to return to nid even when not a member, and then amfnd tried to
> use CLM? Should it be OK to use the CLM API on a non member node?
>
> Thanks,
> Hans
>
> On 3 April 2014 14:12, Mathivanan Naickan Palanivelu
> <[email protected]> wrote:
> > There is no functional need for a retry when it is not a
> > communication problem there.
> > The default number of re-spawns has been 3 all along. Yes, we can
> > increase it to a bigger value.
> > It is up to the system integrator to decide what to do when OpenSAF
> > startup fails.
> >
> > Thanks,
> > Mathi.
> >
> > ----- [email protected] wrote:
> >
> >> On the controller I get an extra "send failed" log message that is
> >> ugly:
> >>
> >> Apr 3 13:40:01 SC-1 local0.notice osafclmd[417]: NO CLM NodeName: 'PL-6' is not a configured cluster node.
> >> Apr 3 13:40:01 SC-1 local0.notice osafclmd[417]: NO /etc/opensaf/node_name should contain the rdn value of configured CLM node object name
> >> Apr 3 13:40:02 SC-1 local0.notice osafclmd[417]: NO proc_initialize_msg: send failed. dest:2060f19cd6007
> >> Apr 3 13:40:17 SC-1 local0.notice osafclmd[417]: NO CLM NodeName: 'PL-6' is not a configured cluster node.
> >> Apr 3 13:40:17 SC-1 local0.notice osafclmd[417]: NO /etc/opensaf/node_name should contain the rdn value of configured CLM node object name
> >> Apr 3 13:40:17 SC-1 local0.notice osafclmd[417]: NO proc_initialize_msg: send failed. dest:2060f19cda006
> >> Apr 3 13:40:32 SC-1 local0.notice osafclmd[417]: NO CLM NodeName: 'PL-6' is not a configured cluster node.
> >> Apr 3 13:40:32 SC-1 local0.notice osafclmd[417]: NO /etc/opensaf/node_name should contain the rdn value of configured CLM node object name
> >> Apr 3 13:40:33 SC-1 local0.notice osafclmd[417]: NO proc_initialize_msg: send failed. dest:2060f19cda009
> >> Apr 3 13:40:34 SC-1 local0.notice osafimmnd[388]: NO Global discard node received for nodeId:2060f pid:359
> >> Apr 3 13:40:39 SC-1 user.warn kernel: tipc: Resetting link <1.1.1:eth0-1.1.6:eth0>, peer not responding
> >>
> >> I would like to have a try-again loop inside clmna instead. Then we
> >> could have nid supervise and every now and then do a node reboot.
> >> There is not much point in trying three times and giving up. I think
> >> clmna should retry forever and let nodeinit handle the bigger loop.
> >> The retry interval needs to be configurable but should have a good
> >> default, such as every 30 sec.
> >>
> >> Thanks,
> >> Hans
> >>
> >> On 03/27/2014 02:13 AM, [email protected] wrote:
> >> >  osaf/services/saf/clmsv/nodeagent/main.c |  8 ++++++++
> >> >  1 files changed, 8 insertions(+), 0 deletions(-)
> >> >
> >> >
> >> > When a node join request is attempted for an unconfigured/misconfigured
> >> > node, or with a duplicate node_name, clmna should report those errors
> >> > to NID so that NID attempts to respawn clmna.
> >> > With the introduction of this change, the following happens (can be
> >> > seen in the syslog) in the case of an unconfigured/misconfigured node
> >> > join request:
> >> > At the ACTIVE controller syslog:
> >> > Mar 26 20:23:32 SC-1 local0.notice osafclmd[420]: NO CLM NodeName: 'PL-8' is not a configured cluster node.
> >> > Mar 26 20:23:32 SC-1 local0.notice osafclmd[420]: NO /etc/opensaf/node_name should contain the rdn value of a configured CLM node object name
> >> >
> >> > At the unconfigured/misconfigured node, the syslog will look like this:
> >> > Mar 26 19:03:54 PL-3 local0.notice osafclmna[871]: Started
> >> > Mar 26 19:03:54 PL-3 local0.err osafclmna[871]: ER PL-8 is not a configured node
> >> > Mar 26 19:03:54 PL-3 local0.err opensafd[837]: ER Failed DESC:CLMNA
> >> > Mar 26 19:03:54 PL-3 local0.err opensafd[837]: ER Going for recovery
> >> > Mar 26 19:03:54 PL-3 local0.err opensafd[837]: ER Trying To RESPAWN /usr/local/lib/opensaf/clc-cli/osaf-clmna attempt #1
> >> > Mar 26 19:03:54 PL-3 local0.err opensafd[837]: ER Sending SIGKILL to CLMNA, pid=868
> >> > Mar 26 19:03:54 PL-3 local0.err osafclmna[871]: ER Exiting
> >> > Mar 26 19:03:54 PL-3 local0.notice osafclmna[871]: exiting for shutdown
> >> > Mar 26 19:04:09 PL-3 local0.notice osafclmna[895]: Started
> >> > Mar 26 19:04:09 PL-3 local0.err osafclmna[895]: ER PL-8 is not a configured node
> >> > Mar 26 19:04:09 PL-3 local0.err osafclmna[895]: ER Exiting
> >> > Mar 26 19:04:09 PL-3 local0.err opensafd[837]: ER Could Not RESPAWN CLMNA
> >> > Mar 26 19:04:09 PL-3 local0.err opensafd[837]: ER Failed DESC:CLMNA
> >> > Mar 26 19:04:09 PL-3 local0.err opensafd[837]: ER Trying To RESPAWN /usr/local/lib/opensaf/clc-cli/osaf-clmna attempt #2
> >> > Mar 26 19:04:09 PL-3 local0.err opensafd[837]: ER Sending SIGKILL to CLMNA, pid=892
> >> > Mar 26 19:04:09 PL-3 local0.notice osafclmna[895]: exiting for shutdown
> >> > Mar 26 19:04:24 PL-3 local0.notice osafclmna[919]: Started
> >> > Mar 26 19:04:25 PL-3 local0.err osafclmna[919]: ER PL-8 is not a configured node
> >> > Mar 26 19:04:25 PL-3 local0.err osafclmna[919]: ER Exiting
> >> > Mar 26 19:04:25 PL-3 local0.err opensafd[837]: ER Could Not RESPAWN CLMNA
> >> > Mar 26 19:04:25 PL-3 local0.err opensafd[837]: ER Failed DESC:CLMNA
> >> > Mar 26 19:04:25 PL-3 local0.err opensafd[837]: ER FAILED TO RESPAWN
> >> > Mar 26 19:04:27 PL-3 local0.notice osafclmna[919]: exiting for shutdown
> >> > Mar 26 19:04:28 PL-3 local0.notice osafimmnd[864]: exiting for shutdown
> >> >
> >> >
> >> > When a duplicate node join request comes in, the following syslog
> >> > messages will be seen at the ACTIVE controller:
> >> > Mar 26 19:07:43 SC-1 local0.err osafclmd[418]: ER Duplicate node join request for CLM node: 'SC-2'. Specify a unique node name in/etc/opensaf/node_name
> >> > Mar 26 19:07:59 SC-1 local0.err osafclmd[418]: ER Duplicate node join request for CLM node: 'SC-2'. Specify a unique node name in/etc/opensaf/node_name
> >> >
> >> > And the following will be seen at the node on which the duplicate
> >> > request is attempted:
> >> > Mar 26 19:07:43 PL-3 local0.err osafclmna[1459]: ER SC-2 is already up. Specify a unique name in/etc/opensaf/node_name
> >> > Mar 26 19:07:43 PL-3 local0.err opensafd[1425]: ER Failed DESC:CLMNA
> >> > Mar 26 19:07:43 PL-3 local0.err opensafd[1425]: ER Going for recovery
> >> > Mar 26 19:07:43 PL-3 local0.err opensafd[1425]: ER Trying To RESPAWN /usr/local/lib/opensaf/clc-cli/osaf-clmna attempt #1
> >> > Mar 26 19:07:43 PL-3 local0.err opensafd[1425]: ER Sending SIGKILL to CLMNA, pid=1456
> >> > Mar 26 19:07:43 PL-3 local0.err osafclmna[1459]: ER Exiting
> >> > Mar 26 19:07:44 PL-3 local0.notice osafclmna[1459]: exiting for shutdown
> >> > Mar 26 19:07:59 PL-3 local0.notice osafclmna[1483]: Started
> >> > Mar 26 19:07:59 PL-3 local0.err osafclmna[1483]: ER SC-2 is already up. Specify a unique name in/etc/opensaf/node_name
> >> > Mar 26 19:07:59 PL-3 local0.err osafclmna[1483]: ER Exiting
> >> > Mar 26 19:07:59 PL-3 local0.err opensafd[1425]: ER Could Not RESPAWN CLMNA
> >> > Mar 26 19:07:59 PL-3 local0.err opensafd[1425]: ER Failed DESC:CLMNA
> >> > Mar 26 19:07:59 PL-3 local0.err opensafd[1425]: ER Trying To RESPAWN /usr/local/lib/opensaf/clc-cli/osaf-clmna attempt #2
> >> > Mar 26 19:07:59 PL-3 local0.err opensafd[1425]: ER Sending SIGKILL to CLMNA, pid=1480
> >> > Mar 26 19:07:59 PL-3 local0.notice osafclmna[1483]: exiting for shutdown
> >> >
> >> > diff --git a/osaf/services/saf/clmsv/nodeagent/main.c b/osaf/services/saf/clmsv/nodeagent/main.c
> >> > --- a/osaf/services/saf/clmsv/nodeagent/main.c
> >> > +++ b/osaf/services/saf/clmsv/nodeagent/main.c
> >> > @@ -458,11 +458,19 @@ SaAisErrorT clmna_process_dummyup_msg(vo
> >> >                      LOG_ER("%s is not a configured node",
> >> >                             o_msg->info.api_resp_info.param.node_name.value);
> >> >                      free(o_msg);
> >> > +                    rc = error; /* For now, just pass on the error to nid.
> >> > +                                 * This is not needed in future when node local
> >> > +                                 * cluster management policy based decisions can be made.
> >> > +                                 */
> >> >                      goto done;
> >> >              } else if (error == SA_AIS_ERR_EXIST) {
> >> >                      LOG_ER("%s is already up. Specify a unique name in" PKGSYSCONFDIR "/node_name",
> >> >                             o_msg->info.api_resp_info.param.node_name.value);
> >> >                      free(o_msg);
> >> > +                    rc = error; /* This is not needed in future when node local
> >> > +                                 * cluster management policy based decisions can be made.
> >> > +                                 * For now, just pass on the error to nid.
> >> > +                                 */
> >> >                      goto done;
> >> >              }
> >> >
> >> >
> >> >
> >
> > ------------------------------------------------------------------------------
> > _______________________________________________
> > Opensaf-devel mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/opensaf-devel
