Fine, let's see that as a possible future change. Remember that removing nid has been discussed; without my proposed change we would then have no retry at all.
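A rough sketch of what such a retry loop inside clmna could look like. Nothing here is taken from the clmna code: join_with_retry(), the two callback types, the demo_* stand-ins and the CLMNA_JOIN_RETRY_INTERVAL_SEC environment variable are all made-up names, and the 30 second default is only the value suggested in the quoted mail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical names throughout; none of these exist in OpenSAF. */
#define JOIN_RETRY_ENV "CLMNA_JOIN_RETRY_INTERVAL_SEC" /* assumed variable */
#define JOIN_RETRY_DEFAULT_SEC 30U

/* One join attempt; returns true once the node has been accepted. */
typedef bool (*join_attempt_fn)(void *ctx);
/* Wait between attempts; a real caller would pass a wrapper around sleep(). */
typedef void (*wait_fn)(unsigned int seconds);

/* Read the interval from the environment. A missing, malformed,
 * non-positive or absurdly large value falls back to the default. */
static unsigned int join_retry_interval(void)
{
	const char *s = getenv(JOIN_RETRY_ENV);
	char *end = NULL;
	long v;

	if (s == NULL || *s == '\0')
		return JOIN_RETRY_DEFAULT_SEC;
	v = strtol(s, &end, 10);
	if (*end != '\0' || v <= 0 || v > 86400)
		return JOIN_RETRY_DEFAULT_SEC;
	return (unsigned int)v;
}

/* Retry until the join succeeds, waiting a fixed interval after each
 * failure. Returns the number of attempts made. There is no upper bound
 * here on purpose: the supervisor (nid) owns the bigger loop. */
static unsigned long join_with_retry(join_attempt_fn attempt, void *ctx,
				     wait_fn wait)
{
	const unsigned int interval = join_retry_interval();
	unsigned long attempts = 0;

	assert(attempt != NULL && wait != NULL);
	for (;;) {
		attempts++;
		if (attempt(ctx))
			return attempts;
		wait(interval);
	}
}

/* Demo stand-in for the real join: fails until the counter in ctx is 0. */
static bool demo_attempt(void *ctx)
{
	int *failures_left = ctx;

	if (*failures_left > 0) {
		(*failures_left)--;
		return false;
	}
	return true;
}

/* Demo stand-in for sleep(): records the requested time instead of waiting. */
static unsigned int demo_total_wait;

static void demo_wait(unsigned int seconds)
{
	demo_total_wait += seconds;
}
```

Passing the attempt and the wait as callbacks keeps the loop independent of MDS and of real time, so it can be exercised with the demo stand-ins, for example join_with_retry(demo_attempt, &failures, demo_wait).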
About the log "NO proc_initialize_msg: send failed": do you think it comes from the fact that the payload start is not correct? clmna seems to have returned to nid even though the node is not a member, and then amfnd tried to use CLM? Should it be OK to use the CLM API on a non-member node?

Thanks,
Hans

On 3 April 2014 14:12, Mathivanan Naickan Palanivelu <[email protected]> wrote:
> There is no functional need for a retry when it is not a communication
> problem there.
> The default number of re-spawns has been 3 all along. Yes, we can increase it
> to a bigger value.
> It is up to the system integrator to decide what to do when OpenSAF startup
> fails.
>
> Thanks,
> Mathi.
>
> ----- [email protected] wrote:
>
>> On the controller I get an extra "send failed" log message that is
>> ugly:
>>
>> Apr 3 13:40:01 SC-1 local0.notice osafclmd[417]: NO CLM NodeName:
>> 'PL-6' is not a configured cluster node.
>> Apr 3 13:40:01 SC-1 local0.notice osafclmd[417]: NO
>> /etc/opensaf/node_name should contain the rdn value of configured
>> CLM node object name
>> Apr 3 13:40:02 SC-1 local0.notice osafclmd[417]: NO
>> proc_initialize_msg: send failed. dest:2060f19cd6007
>> Apr 3 13:40:17 SC-1 local0.notice osafclmd[417]: NO CLM NodeName:
>> 'PL-6' is not a configured cluster node.
>> Apr 3 13:40:17 SC-1 local0.notice osafclmd[417]: NO
>> /etc/opensaf/node_name should contain the rdn value of configured
>> CLM node object name
>> Apr 3 13:40:17 SC-1 local0.notice osafclmd[417]: NO
>> proc_initialize_msg: send failed. dest:2060f19cda006
>> Apr 3 13:40:32 SC-1 local0.notice osafclmd[417]: NO CLM NodeName:
>> 'PL-6' is not a configured cluster node.
>> Apr 3 13:40:32 SC-1 local0.notice osafclmd[417]: NO
>> /etc/opensaf/node_name should contain the rdn value of configured
>> CLM node object name
>> Apr 3 13:40:33 SC-1 local0.notice osafclmd[417]: NO
>> proc_initialize_msg: send failed. dest:2060f19cda009
>> Apr 3 13:40:34 SC-1 local0.notice osafimmnd[388]: NO Global discard
>> node received for nodeId:2060f pid:359
>> Apr 3 13:40:39 SC-1 user.warn kernel: tipc: Resetting link
>> <1.1.1:eth0-1.1.6:eth0>, peer not responding
>>
>> I would like to have a try again loop inside clmna instead. Then we
>> could have the nid to supervise and every now and
>> then do a node reboot. There is not much point in trying three times
>> and give up. I think clmna should retry forever and
>> let nodeinit handle the bigger loop. The try interval needs to be
>> configured but should have good default such as every
>> 30 sec.
>>
>> Thanks,
>> Hans
>>
>> On 03/27/2014 02:13 AM, [email protected] wrote:
>> > osaf/services/saf/clmsv/nodeagent/main.c | 8 ++++++++
>> > 1 files changed, 8 insertions(+), 0 deletions(-)
>> >
>> >
>> > When a node join request for an unconfigured/misconfigured node or
>> > when a node join request with a duplicate node_name is attempted, then
>> > clmna should report those errors to NID such that NID attempts to
>> > respawan clmna.
>> > With the introduction of this change, the following happens(can be seen in the syslog) in
>> > the case of a unconfigured/misconfigured node join request:
>> > At the ACTIVE controller syslog:
>> > Mar 26 20:23:32 SC-1 local0.notice osafclmd[420]: NO CLM NodeName: 'PL-8' is not a configured cluster node.
>> > Mar 26 20:23:32 SC-1 local0.notice osafclmd[420]: NO /etc/opensaf/node_name should contain the rdn value of a configured CLM node object name
>> >
>> > At the unconfigured/misconfigured node, the syslog will be like as below:
>> > Mar 26 19:03:54 PL-3 local0.notice osafclmna[871]: Started
>> > Mar 26 19:03:54 PL-3 local0.err osafclmna[871]: ER PL-8 is not a configured node
>> > Mar 26 19:03:54 PL-3 local0.err opensafd[837]: ER Failed DESC:CLMNA
>> > Mar 26 19:03:54 PL-3 local0.err opensafd[837]: ER Going for recovery
>> > Mar 26 19:03:54 PL-3 local0.err opensafd[837]: ER Trying To RESPAWN /usr/local/lib/opensaf/clc-cli/osaf-clmna attempt #1
>> > Mar 26 19:03:54 PL-3 local0.err opensafd[837]: ER Sending SIGKILL to CLMNA, pid=868
>> > Mar 26 19:03:54 PL-3 local0.err osafclmna[871]: ER Exiting
>> > Mar 26 19:03:54 PL-3 local0.notice osafclmna[871]: exiting for shutdown
>> > Mar 26 19:04:09 PL-3 local0.notice osafclmna[895]: Started
>> > Mar 26 19:04:09 PL-3 local0.err osafclmna[895]: ER PL-8 is not a configured node
>> > Mar 26 19:04:09 PL-3 local0.err osafclmna[895]: ER Exiting
>> > Mar 26 19:04:09 PL-3 local0.err opensafd[837]: ER Could Not RESPAWN CLMNA
>> > Mar 26 19:04:09 PL-3 local0.err opensafd[837]: ER Failed DESC:CLMNA
>> > Mar 26 19:04:09 PL-3 local0.err opensafd[837]: ER Trying To RESPAWN /usr/local/lib/opensaf/clc-cli/osaf-clmna attempt #2
>> > Mar 26 19:04:09 PL-3 local0.err opensafd[837]: ER Sending SIGKILL to CLMNA, pid=892
>> > Mar 26 19:04:09 PL-3 local0.notice osafclmna[895]: exiting for shutdown
>> > Mar 26 19:04:24 PL-3 local0.notice osafclmna[919]: Started
>> > Mar 26 19:04:25 PL-3 local0.err osafclmna[919]: ER PL-8 is not a configured node
>> > Mar 26 19:04:25 PL-3 local0.err osafclmna[919]: ER Exiting
>> > Mar 26 19:04:25 PL-3 local0.err opensafd[837]: ER Could Not RESPAWN CLMNA
>> > Mar 26 19:04:25 PL-3 local0.err opensafd[837]: ER Failed DESC:CLMNA
>> > Mar 26 19:04:25 PL-3 local0.err opensafd[837]: ER FAILED TO RESPAWN
>> > Mar 26 19:04:27 PL-3 local0.notice osafclmna[919]: exiting for shutdown
>> > Mar 26 19:04:28 PL-3 local0.notice osafimmnd[864]: exiting for shutdown
>> >
>> >
>> > For cases when a duplicate node join request comes, the following syslog
>> > message will be seen at the ACTIVE controller:
>> > Mar 26 19:07:43 SC-1 local0.err osafclmd[418]: ER Duplicate node join request for CLM node: 'SC-2'. Specify a unique node name in/etc/opensaf/node_name
>> > Mar 26 19:07:59 SC-1 local0.err osafclmd[418]: ER Duplicate node join request for CLM node: 'SC-2'. Specify a unique node name in/etc/opensaf/node_name
>> >
>> > And the following will be seen at the node on which the duplicate request is attempted:
>> > Mar 26 19:07:43 PL-3 local0.err osafclmna[1459]: ER SC-2 is already up. Specify a unique name in/etc/opensaf/node_name
>> > Mar 26 19:07:43 PL-3 local0.err opensafd[1425]: ER Failed DESC:CLMNA
>> > Mar 26 19:07:43 PL-3 local0.err opensafd[1425]: ER Going for recovery
>> > Mar 26 19:07:43 PL-3 local0.err opensafd[1425]: ER Trying To RESPAWN /usr/local/lib/opensaf/clc-cli/osaf-clmna attempt #1
>> > Mar 26 19:07:43 PL-3 local0.err opensafd[1425]: ER Sending SIGKILL to CLMNA, pid=1456
>> > Mar 26 19:07:43 PL-3 local0.err osafclmna[1459]: ER Exiting
>> > Mar 26 19:07:44 PL-3 local0.notice osafclmna[1459]: exiting for shutdown
>> > Mar 26 19:07:59 PL-3 local0.notice osafclmna[1483]: Started
>> > Mar 26 19:07:59 PL-3 local0.err osafclmna[1483]: ER SC-2 is already up. Specify a unique name in/etc/opensaf/node_name
>> > Mar 26 19:07:59 PL-3 local0.err osafclmna[1483]: ER Exiting
>> > Mar 26 19:07:59 PL-3 local0.err opensafd[1425]: ER Could Not RESPAWN CLMNA
>> > Mar 26 19:07:59 PL-3 local0.err opensafd[1425]: ER Failed DESC:CLMNA
>> > Mar 26 19:07:59 PL-3 local0.err opensafd[1425]: ER Trying To RESPAWN /usr/local/lib/opensaf/clc-cli/osaf-clmna attempt #2
>> > Mar 26 19:07:59 PL-3 local0.err opensafd[1425]: ER Sending SIGKILL to CLMNA, pid=1480
>> > Mar 26 19:07:59 PL-3 local0.notice osafclmna[1483]: exiting for shutdown
>> >
>> > diff --git a/osaf/services/saf/clmsv/nodeagent/main.c b/osaf/services/saf/clmsv/nodeagent/main.c
>> > --- a/osaf/services/saf/clmsv/nodeagent/main.c
>> > +++ b/osaf/services/saf/clmsv/nodeagent/main.c
>> > @@ -458,11 +458,19 @@ SaAisErrorT clmna_process_dummyup_msg(vo
>> >              LOG_ER("%s is not a configured node",
>> >                     o_msg->info.api_resp_info.param.node_name.value);
>> >              free(o_msg);
>> > +            rc = error; /* For now, just pass on the error to nid.
>> > +                         * This is not needed in future when node local
>> > +                         * cluster management policy based decisions can be made.
>> > +                         */
>> >              goto done;
>> >          } else if (error == SA_AIS_ERR_EXIST) {
>> >              LOG_ER("%s is already up. Specify a unique name in" PKGSYSCONFDIR "/node_name",
>> >                     o_msg->info.api_resp_info.param.node_name.value);
>> >              free(o_msg);
>> > +            rc = error; /* This is not needed in future when node local
>> > +                         * cluster management policy based decisions can be made.
>> > +                         * For now, just pass on the error to nid.
>> > +                         */
>> >              goto done;
>> >          }
>> >
>> >
>> >
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Opensaf-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/opensaf-devel
